
Louis Kessler’s Behold Blog

Advanced Genetic Genealogy - Sat, 13 Apr 2019

Living in Canada, I had to wring my hands waiting an extra two weeks over my US neighbors for my copy of Advanced Genetic Genealogy: Techniques and Case Studies to arrive.

Packaged for me nicely and safely in bubble wrap, the book itself is physically impressive, larger than your average book: full letter size 8.5” x 11” (22 x 28 cm), 1 full inch (2.5 cm) thick, and despite being soft cover, weighing in at a hefty 3 pounds (1.4 kg). Its 382 pages exclude a 4-page table of contents, a 6-page list of its beautiful full-color figures and tables, a 5-page preface and 2-page acknowledgement by its editor Debbie Parker Wayne, and 7 pages of author biographies.

image

The list of chapter writers is a who’s who of genetic genealogy: Bartlett, Bettinger, Hobbs, Johnson, Johnston, Jones, Kennett, Lacopo, Owston, Powell, Russell, Stanbary, Turner and Wayne. If you know who these people are, then you are likely knowledgeable enough in this field to take in their wisdom. It is advanced. This is no beginner’s course. You’ll need experience and knowledge of working with your DNA to fully grasp what is said.

Let’s see what can be learned.



1. Jim Bartlett talks about Segment Triangulation.

Now you have a choice. You can either spend hundreds of hours like I did delving to understand every detail in his four years of blog posts on his segmentology.org blog, or you can read this chapter. He tells you how he uses Segment Triangulation to create Triangulation Groups to allow him to do Chromosome Mapping.

My favorite line from Jim’s chapter: “You can be confident that virtually all of the segments in a Triangulation Group are IBD. This statement has been contested because it has not been proved or published. However after five years of Triangulating, I have not found any evidence to the contrary.”

p.s. I have been working the past few months to implement chromosome mapping techniques similar to what Jim describes in his chapter in the upcoming version 3.0 of Double Match Triangulator. He gives me some new ideas to wake up and think about at 3 a.m.


2. Blaine Bettinger covers Visual Phasing.

Visual Phasing is a technique to map the segments shared by three or more siblings to determine the grandparents that supplied them. This is generally done manually from GEDmatch one-to-one comparisons of the three siblings. I have not personally used Visual Phasing for myself because I’m not fortunate enough to have any sets of three siblings who have DNA tested.

This is one of the advanced techniques that has some tools available to help you, but none that yet do it for you. I’m sure the tools to do VP for you will be one of those innovations that appears in the next few years. I’m not going to be the one to build that tool (because I don’t personally need it), but I am implementing some of the ideas of Visual Phasing into DMT.


3. Kathryn Johnston talks about the X Chromosome.

You just can’t help enjoying any writing that brings up the Fibonacci sequence. Kathryn’s most interesting comment to me and something I never knew is that “Visual phasing began with X comparison and the X is still recommended as a starting point.”

I haven’t spent a lot of time on the X chromosome for my own DNA. It really is a bit of a different beast, and I love the one main property being that the ancestral line an X segment comes from cannot go through a father and his father. That can immediately eliminate false MRCAs.
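The Fibonacci connection follows directly from that inheritance rule: a male gets his X only from his mother, while a female gets one X from each parent. Here is a minimal sketch that counts the possible X-contributing ancestors at each generation back. The function name is made up for this illustration and is not taken from the book.

```python
def x_ancestor_counts(sex, generations):
    """Count the ancestors who could have contributed X-DNA at each
    generation back, for a person of the given sex ("M" or "F")."""
    males, females = (1, 0) if sex == "M" else (0, 1)
    counts = []
    for _ in range(generations):
        # Everyone on an X line has a mother on it, but only the females
        # have a father on it, so a father-to-father step never occurs.
        males, females = females, males + females
        counts.append(males + females)
    return counts


print(x_ancestor_counts("M", 6))  # [1, 2, 3, 5, 8, 13]
print(x_ancestor_counts("F", 6))  # [2, 3, 5, 8, 13, 21]
```

Compare that with the 2, 4, 8, 16, 32, 64 ancestors per generation on the autosomal side, and you can see why an X match narrows the candidate lines so quickly.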


4. Jim Owston on Y DNA.

Well, I’ve done the Y-111 and Big Y-500 at Family Tree DNA to help with the Jewish Levite DNA studies. I’d feel better about and work harder with Y-DNA if my closest match was within my 5-generation genealogical time horizon. Alas, it is not, and I can’t even use the common surname idea because my ancestors in Romania and Ukraine only adopted their surnames 5 generations ago. So until something breaks through here, I’ll have to remain an autosomal guy. I envy Jim and anyone who can include their 8-generation lineage charts that run from 1520 to 1831. Sick!

Jim has a good writeup on the benefits of going from Big Y-500 to Big Y-700. I see no personal benefit for my own genealogy to upgrade, but if I’m approached because it will help the Levite study, then I’ll likely do it for them. Technically, the study is finding people related to me, albeit along the lines of Jim’s people who are 10 to 20 generations back, but in my case, unlikely to ever be genealogically connected to me. 


5. Melissa Johnson on Unknown Parentage.

Many people do not know who their birth parent or parents are. Melissa describes the various ways to analyze your DNA matches to determine who they might be. She includes Blaine Bettinger’s Shared cM Project tables, X-DNA, Y-DNA, haplogroups, lists various background check websites, and then issues involved in targeted testing when dealing with a birth family.


6. Kimberly Powell on Endogamy.

Ah, endogamy, I know thee well. Kimberly describes all the complications that endogamy brings to the table to make DNA analysis much more challenging. She talks about matches being predicted closer than they are, how “in common with” (ICW) matches can be deceiving, and how clustering systems like the Leeds method do not give clear cut answers.

Kimberly says to check for runs of homozygosity using GEDmatch’s “Are my parents related?” tool. Interestingly for me, with my great amount of endogamy, you’d think my parents would turn out to be related at least at the 3rd or 4th cousin level. But they don’t.

image

One segment of 8.8 cM and 9.8 generations apart for an endogamous population is not much at all. Despite the endogamy of the general population of both my parents, somehow my paternal and maternal families must have remained mostly separate. My paternal side is from towns now in Romania that are a few hundred kilometers from my maternal side’s towns that are now in Ukraine.

When I check my uncle (my father’s brother), he gets no indication that his parents (my paternal grandparents) are related:

image

My paternal grandparents are from two towns now in Romania that are about 300 kilometers (200 miles) apart.

Kimberly also brings up the calculation of the coefficient of relationship, and describes how to use triangulated groups, trees, chromosome mapping and cluster analysis to help identify relationships.
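For anyone who has not met the coefficient of relationship before, here is a minimal sketch of the standard path-counting form of it: each connecting path of n parent-child steps through a common ancestor contributes (1/2)^n, and the contributions add up. It assumes the common ancestors are not themselves inbred. The post does not reproduce how Kimberly presents the calculation, so take this as the textbook formula rather than her method.

```python
from fractions import Fraction


def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**n over all connecting paths, where n is the number of
    parent-child steps on a path through one common ancestor.
    Assumes the common ancestors are not themselves inbred."""
    return sum((Fraction(1, 2) ** n for n in path_lengths), Fraction(0))


# First cousins: two shared grandparents, each path is 4 steps long.
print(coefficient_of_relationship([4, 4]))        # 1/8
# Double first cousins: four shared grandparents.
print(coefficient_of_relationship([4, 4, 4, 4]))  # 1/4
```

This is the arithmetic behind the endogamy problem: every extra path through another shared ancestor adds to the total, so distant cousins can look closer than their nearest relationship suggests.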


7. Debbie Parker Wayne on Combining atDNA and Y-DNA.

Debbie brings up a very detailed case study from her own research to illustrate some of her methodology. Debbie’s two full pages of citations are impressive in themselves, and they show the professionalism in her amazing research and analysis.

Debbie includes a bit of almost every technique, and her article is the only one in the book to include Ancestry’s DNA Circles.


8. Ann Turner on Raw Data.

Ann’s article is about the Raw Data you download from the testing company. She describes the different file structures of each company, explains RSID and SNP selection, why there are no-calls and miscalls, what phasing is and what statistical phasing is. She goes into child phasing, segments, boundaries (“The actual boundaries may be fuzzier”), builds and genetic distance. I’ve always loved the relationship versus cM versus number of segments chart (Figure 8.8) produced originally by 23andMe that Ann describes.

Then Ann goes into SNPs, overlap between the SNPs tested at the various companies, and why this is important at GEDmatch Genesis. She then talks about other tools for raw data, and finishes by mentioning whole genome sequencing (WGS).


9. Karen Stanbary on DNA and the Genealogical Proof Standard (GPS).

You’ll want to read this chapter if you are a professional genealogist who wants to incorporate DNA into the work you do for your clients. The GPS is expected in any professional work done. Karen describes the testing plan, documentation, focus study groups, correlation, the formulation of a hypothesis, testing the hypothesis, and writing the conclusion.


10. Patricia Hobbs, a Case Study.

Patricia basically follows the principles that Karen described in her chapter, and describes the path taken to use documents and DNA evidence to identify an unknown ancestor. This is another one of those papers that you usually see in an advanced genealogical journal, and it definitely shows you what you must attempt to achieve if you want your work to be published. Very impressive, and way beyond what I ever hope to achieve.


11. Thomas Jones on Publishing Your Results.

Dr. Jones is the author of the classic “Mastering Genealogical Proof” and he applies all his knowledge and techniques in this chapter. His conclusion: “When genealogists, geneticists, and genetic genealogists use DNA test results to help establish genealogical conclusions, they are genealogical researchers. When they write about that research, they become scholarly writers. When their written work helps present-day and future researchers and members of the families that they have studied, they have met their research and writing goals.”


12. Judy Russell on Ethics in Genetic Genealogy.

I love Judy. I read every one of her Legal Genealogist blog posts. But arguing about ethics, like politics, is something I prefer to leave to others. Judy is an expert in these matters. If you are worried about any ethical matter with respect to a DNA test, get this book and read this chapter.


13. Michael Lacopo on Uncovering Family Secrets.

I was dreadfully afraid to take a DNA test several years ago, simply because I didn’t want to find out that my father wasn’t my father. If you had looked at pictures of my father and his siblings and me, you would have said I had nothing to worry about. I ended up getting my uncle to take the test and I took it myself and I’m happy to report that he is indeed my full uncle.

Do you have that worry? Well, you should. No matter who you are, you are sure to find a few skeletons in the closet. They may not be immediate family, but they will occur among your DNA relatives. The reasons are varied: sometimes covered-up deeds, sometimes mistakes (a switch at birth), or the result of violent crime (e.g. rape). Michael’s chapter is a wonderful treatise on the psychology behind all this. He talks about identity and self, privacy and outcomes. His chapter and Judy’s chapter work hand in hand.


14. Debbie Kennett on the Future!

They couldn’t have picked a better person to write this chapter. This chapter alone is worth the price of the book. Debbie talks about the promise and limitations of: Y-DNA testing, mtDNA testing, autosomal DNA testing, Whole Genome Sequencing, ancestral reconstruction, DNA from our ancestors, and the power of big data.

Debbie was nice enough in her section on Whole Genome Sequencing to make mention of one of my posts about the VCF file. As a result, I’m proud to have my name listed in the index of the book on page 354 right after Debbie’s.

Debbie sees the time many years in the future when we will take a DNA test, put our name into a database, and produce an instant, fully sourced family tree, complete with family photographs and composite facial reconstructions. I guess something like this:



Conclusion

If you feel you are ready to plunge into some advanced material to take you to the next level, don’t wait. Get the book now. You don’t have much time to learn this because the field is growing and advancing as we speak. Within a few years, we’ll have a whole new advanced set of tools and ideas that will be developed to help us with our genealogical DNA endeavors. Debbie Parker Wayne’s AGG will be the prerequisite of required knowledge to get to that next level.

Now don’t think for a second that I’ve been able to read and digest everything in this book over the past two days. No, I’ve skimmed and read some parts just to get a feel of it and to write this blog post. It’s going to take me a few months to read it all in detail and take in everything.

Final review score:  A++

WGS Result Files - Sat, 13 Apr 2019

I received the rest of my raw data files for my WGS (Whole Genome Sequencing) test today. It was shipped from their lab in Italy and came on a 1 TB hard drive.

image

Previously I was able to download my VCF files from Dante’s site. I reported on the files a couple of months ago in my post: My Whole Genome Sequencing. The VCF File. Those files were compressed and totaled 224 MB and expanded to 869 MB. The VCF (Variant Call Format) files only contain the variants, i.e. the readings where I vary from the human genome reference.
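Since a VCF file is plain text, counting the variants in it takes only a few lines. Here is a minimal sketch that relies only on the general layout of the format (header lines start with "#", data lines are tab-separated with the chromosome in the first column). The sample records are invented for illustration and are not from my files.

```python
import io


def count_variants(lines):
    """Count VCF data records per chromosome, skipping '#' header lines."""
    counts = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts


# Invented sample data, only to show the shape of the file.
sample_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\trs0000001\tA\tG\t50\tPASS\t.\n"
    "1\t67890\t.\tC\tT\t60\tPASS\t.\n"
    "X\t11111\trs0000002\tG\tA\t40\tPASS\t.\n"
)
print(count_variants(io.StringIO(sample_vcf)))  # {'1': 2, 'X': 1}
```

For a real compressed file, passing `gzip.open(path, "rt")` from Python's standard library in place of the `StringIO` object reads it line by line without expanding it on disk first.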

The files supplied this time include all my data, not just the variants. So they are much larger. As a result they were sent to me on the large hard drive. They include the BAM and FASTQ files.

The files are provided in three folders:

  • clean_data
  • result_alignment
  • result_variation



The FASTQ files

The clean_data directory contains 16 files named something like:
     aaaaaaaaaaa_L01_5mm_n.fq.gz

where aaa is some identifier, mm runs from 78 to 85, and n is 1 or 2.

Each file is about 8 GB in size and is gzip compressed at about 34%. I have a fast Intel i7 computer and it takes 30 minutes to uncompress one of these files to its full size of about 22 GB.

When unzipped, the .gz drops off the file name and the .fq suffix represents a FASTQ file.

Here’s what the beginning of one of the unzipped FASTQ files looks like:

image

Shown above are the first 4 groups of readings in one of the files, where 4 lines make up a reading. The first line of each group of 4 is an identifier, the 2nd is exactly 100 base pair values, the 3rd is just a plus (at least in the records I glanced at) and the 4th holds codes that represent the quality of each of the 100 reads.
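The 4-line structure makes these files easy to walk through in code. Here is a minimal sketch of a reader for it. The quality decoding assumes the common Phred+33 encoding (quality = character code minus 33), which I have not confirmed for these particular files, and the sample reads are invented and much shorter than the real 100-base reads.

```python
import io
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, bases, quality) for each 4-line FASTQ record."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        ident, bases, plus, quality = record
        if not ident.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a FASTQ record: " + ident)
        yield ident[1:], bases, quality


def phred_scores(quality, offset=33):
    """Decode a quality string, ASSUMING Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


# Invented sample reads for illustration.
sample = "@read1\nACGT\n+\nFFF#\n@read2\nTTGA\n+\nFF:F\n"
for ident, bases, quality in read_fastq(io.StringIO(sample)):
    print(ident, bases, phred_scores(quality))
```

Feeding it `gzip.open(path, "rt")` instead of the `StringIO` object would read a .fq.gz file directly, which avoids needing 22 GB of disk space per file just to have a look.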

Also in the directory is a small Excel file that contains some summary statistics from my WGS test. It contains:

image

Sample is my kit number. There were 1.5 billion reads. In order to get an average of 30x coverage on 3 billion base pairs, I figure the average read length would have to be 60 base pairs.
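The arithmetic behind that figure is simply coverage = reads x read length / genome size, rearranged to solve for read length. Here is a minimal sketch using the rounded numbers above; the function names are made up for this example.

```python
def average_coverage(read_count, read_length, genome_size):
    """Average depth: total bases read divided by the genome size."""
    return read_count * read_length / genome_size


def read_length_for_coverage(coverage, read_count, genome_size):
    """Average read length needed to reach a target average depth."""
    return coverage * genome_size / read_count


reads = 1_500_000_000    # 1.5 billion reads
genome = 3_000_000_000   # 3 billion base pairs

print(read_length_for_coverage(30, reads, genome))  # 60.0
print(average_coverage(reads, 60, genome))          # 30.0
```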

Each of the eight 5mm files has two pictures associated with it that show some results, e.g.

aaaaaaaaaaa_L01_578.base.png contains:

image

and aaaaaaaaaaa_L01_578.qual.png contains:

image

Most of this is all new to me too, so I can’t explain what all this means yet.



The BAM file

The result_alignment directory contains the aaaaaaaaaaaaaa.bam file. It is 115 GB in size compressed at 27% which expands to 425 GB. Decompression time for this file on my computer is over 17 hours. For most purposes, you don’t want to decompress this file since most genome analysis programs work with the BAM file itself and with a small bam.bai (BAM index) file of 8 MB that is also in the directory. The BAM index file allows the programs to go directly to the section of the genome that the analysis program needs.

This directory also contains the same summary xls file that was with the FASTQ files (see above) and three png images:

aaaaaaaaaaa.Depth.png that shows that I mostly achieved at least 30x coverage

image

aaaaaaaaaaa.Cumulative.png that shows the cumulative distribution of the depth

image

aaaaaaaaaaa.Insert.png

image

Insert size has something to do with the analysis process. I used to know what paired reads are, but I forgot. I’ll have to look that up again if I ever have to use them.



Variation Files

There are four subdirectories named sv, snp, indel and cnv:

image

They contain a number of files, some gzipped (which I decompressed in the above listing). Basically, these are various files indicating all my gene, exome and genome differences from the human reference for both my SNPs and my INDELs (insertions/deletions). These files are in a different format than the two VCF files I downloaded for SNPs and INDELs earlier.



What’s Ahead

I purchased the Long Read WGS test from Dante a few days ago. I think I’m going to wait until I get the results from my Long Read test. This will likely take a few months. Once I get the long read results, I’ll look at the BAM and FASTQ files from both tests, compare them to each other, and see what I can learn from them.

With just one test, you can’t tell how good it is. But with two, you can compare their results to each other. It should be interesting.

Combine Kits into One Superkit on GEDmatch Genesis - Sat, 6 Apr 2019

Today GEDmatch Genesis added a new Tier 1 application. They state:

image

I did that myself manually with 5 kits about 6 months ago, uploaded my combined raw data to GEDmatch Genesis, and reported the results in my post: The Benefits of Combining Your DNA Raw Data.

I thought I’d try the new GEDmatch Genesis application to see if it produces essentially the same result.

I selected the Tier 1 “Combine multiple kits into 1 superkit” application and it gave me the option to select up to 4 kits that are already uploaded. I had all 5 of my kits uploaded and I selected FTDNA, 23andMe, Ancestry and LivingDNA. I left out MyHeritage, which includes almost the same SNPs as my FTDNA file does.

image

I pressed the “Generate” button and within a second, I got my combined kit:

image

Comparing my kits using the GEDmatch Diagnostic utility gives:

image

When I manually combined the kits, I got 1,389,750 SNPs, but GEDmatch only combines the 1,123,247 SNPs that it knows it is going to use. Slimmed SNPs are what GEDmatch actually uses for comparisons with other kits. I’m surprised that GEDmatch’s 834,457 slimmed SNPs are over 20,000 more than those in my manually combined kit. I have no explanation for that.
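To give an idea of what combining kits involves, here is a minimal sketch of one way to merge raw data keyed by RSID. It is not GEDmatch's algorithm, which they don't publish, and it is simpler than a real merge. The assumptions are mine: "--" marks a no-call, a real call replaces a no-call, and when two kits disagree the earlier kit in the list wins. The RSIDs and genotypes are invented.

```python
NO_CALL = "--"  # assumed no-call marker; companies differ on this


def combine_kits(kits):
    """Merge {rsid: genotype} dicts in priority order (earlier kits win),
    except that a real call always replaces a no-call."""
    combined = {}
    for kit in kits:
        for rsid, genotype in kit.items():
            current = combined.get(rsid)
            if current is None or (current == NO_CALL and genotype != NO_CALL):
                combined[rsid] = genotype
    return combined


# Invented example data.
kit_a = {"rs0000001": "AG", "rs0000002": "--"}
kit_b = {"rs0000002": "CC", "rs0000003": "TT", "rs0000001": "AA"}

superkit = combine_kits([kit_a, kit_b])
print(len(superkit), superkit)
```

The SNP count of the result is the size of the union of the RSIDs, which is why a combined kit ends up with more SNPs than any single company's file.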

I’ve included my Whole Genome kit from Dante, for which GEDmatch only loads the SNPs in the VCF file. Those SNPs are the ones where I am different from the human reference genome. The SNPs where I am the same as the human reference genome are not included. The GEDmatch people still have to fix the upload of VCF files so that the human reference values are added when the SNP is not included in the file.

The one to one comparison was possible immediately, so I compared the GEDmatch combined kit to each of my individual kits, and to my manually created All-5 kit.

image

All of the comparisons indicate that I match myself at least 99.210%. It’s not important that there are some small breaks in the matching segments, which result in more than 22 shared segments. I expect that when the one-to-many comparisons become available, the overlaps will improve just as they did with my manually combined file.



The Bottom Line

If you’ve tested with multiple companies and you subscribe to Tier 1, you should combine your kits to get better comparisons at GEDmatch Genesis. Make sure you make this combined kit the kit for yourself that you use for matching, and change all the others to Research so that you show up only once in other people’s match list.

The only unfortunate thing is that you don’t have access to your raw data at GEDmatch. So you won’t know exactly what they did and you won’t have the raw data for yourself to look at or use for other purposes.