
Louis Kessler’s Behold Blog

WGS Long Reads Might Not Be Long Enough - Wed, 17 Apr 2019

Today my Dante Labs kit for my Whole Genome Sequencing (WGS) Long Reads arrived. Dante became the first company to make WGS Long Reads available to the general public. The price they are charging is $999 USD, but past customers of Dante Labs are eligible for a $200 USD discount putting it down to $799. In 2016 the cost of long read sequencing was around $17K, and they hoped to get the price down to $3K by 2018. Here it is, 2019, and it’s available to the general public at $1K.

image

I had purchased a Dante Labs WGS, the standard short reads test, last August (2018) when they had it on sale for $399 USD. That was a great price as they had only a few months earlier lowered it from $999 USD, and a year earlier you’d have had to pay several thousand dollars for any whole genome test from anyone. Dante currently offers their standard short read WGS for $599, but if you want it, you can wait for DNA Day or other sales, and I’m sure it will come down.

In October, when Dante had my sample, I had started reading about long read WGS technology, so I asked Dante if they had that technology available. They said they did. I asked how much that would be. They said $1,750 USD. I asked if they could do a long reads test from my sample and they checked and said, no, the sample had started sequencing already.

So I wasn’t able to do the long read test back in October. But it worked out anyway. Now, I will have both the short read test and the long read test for $550 less than the cost would have been for the long read test alone just 6 months ago. This is actually excellent because I will be able to analyze the short read test, analyze the long read test, and then compare the two. When you have just one test you can make no estimate as to the error rate, but when you have two tests to compare, then the differences represent an error in one of the tests and an average error rate can be calculated.
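A sketch of how that two-test comparison could work. The simple {(chromosome, position): genotype} structure here is my own simplification for illustration, not any company's actual file layout:

```python
# Estimate a discordance rate by comparing genotype calls from two tests
# at the positions both tests called. A mismatch means at least one of the
# two tests made an error at that position.

def discordance_rate(calls_a, calls_b):
    """Fraction of commonly-called positions where the two tests disagree."""
    common = calls_a.keys() & calls_b.keys()
    if not common:
        return 0.0
    mismatches = sum(1 for key in common if calls_a[key] != calls_b[key])
    return mismatches / len(common)
```

With only one test there is no baseline, but with two, each disagreement is known to be an error in one test or the other, so this rate bounds the average per-test error rate from above.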



What Good is WGS for Genealogical Cousin Matching?

WGS testing, whether long reads or short reads, provides no help for relative matching. Matching is based on the 700,000 or so SNPs that a company tests. Those SNPs are spread out over the 3 billion base pairs of your genome. The standard DNA tests you take do a good job of identifying those SNPs for matching purposes.

WGS testing is for determining all your 3 billion base pairs and finding all the SNPs where you vary from the human reference. From my short read WGS test, my VCF file had 3,442,712 entries, which are the SNPs where I differ from the human reference. The SNPs other than the 700,000 the company tests are not used for matching, so getting their values does not help matching. Those extra SNPs are very important for medical studies, but not matching. The 700,000 vary enough already that DNA companies would get very little benefit by adding to that number.

The reason to combine raw data from multiple companies, as you can now do at GEDmatch, is that GEDmatch compares tested SNPs between different companies. Some pairs of companies have very little overlap between them, i.e. fewer than 100,000 SNPs may be in common and available to be compared, which is too small for reliable matching. Combining the multiple kits will increase that overlap number for you.
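A toy illustration of that overlap idea, using made-up SNP identifiers rather than real rsids:

```python
# Each company's chip tests a different set of SNPs. Matching can only
# compare SNPs that both kits actually tested, so combining your kits
# enlarges the set available for comparison against any one match.

company_a = {"rs1", "rs2", "rs3", "rs4"}   # SNPs on company A's chip
company_b = {"rs3", "rs4", "rs5", "rs6"}   # SNPs on company B's chip
combined  = company_a | company_b          # a combined kit covers both

match_kit = {"rs2", "rs3", "rs5", "rs7"}   # SNPs a match's kit tested

print(len(company_a & match_kit))  # comparable SNPs with one kit alone
print(len(combined & match_kit))   # comparable SNPs with the combined kit
```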

So for genealogical purposes, you’re likely better off spending your money taking a few standard DNA tests from companies who give you your matches. Then you can create a combined kit at GEDmatch Genesis. A WGS test would not help you with this.


So Why Did I Take a WGS Test?

Other than insatiable curiosity and the need to know, I was hoping to see what, if anything, WGS tests can do that could help a genetic genealogist. My current conclusion (as I just wrote) is: not that much.

For analysis of your DNA for health purposes, you will want a WGS test. Most regular DNA companies do not test many SNPs that have known health effects. Even 23andMe only tests a subset of medically-related SNPs. Dante Labs specializes in reports for medical purposes. When you take a test with them, you can request detailed custom reports on specific ailments you may have, like this sample report on epilepsy.

But for me, I’m not really interested in the medical information.


So Why Did I Want To Take a Long Read WGS Test?

A Nanopore Technologies white paper about The Advantages of Long Reads for Genome Assembly gave me the idea that maybe the long reads would overlap enough that they could be used to phase my raw data. Phasing is separating out the pair of allele values of each SNP into their paternal and maternal values. I would thus find the 22 autosomal chromosomes of DNA that I got from my father and the 22 that I got from my mother. If you phase your DNA and create a raw data file from it, you can use it to find the people who match just one parent.

Typically, when you are like me and your parents have passed away and they had never DNA tested, phasing would need to be done with the raw data of close relatives such as siblings, children, aunts, uncles or cousins, nieces or nephews who did test. You can use tools like Kevin Borland’s DNA Reconstruction Toolkit. But I only have an uncle who has tested. Just an uncle isn’t quite enough. Maybe, I thought, long reads would overlap enough to span the entire chromosome and voila, you’ve phased it.

Dante’s long reads use Oxford Nanopore PromethION technology. The specs are 30x with N50>20,000bp. That means that 50% of the reads will be longer than 20,000 contiguous base pairs, and enough reads are made to give an average coverage of 30 reads for every base pair in the genome. By comparison, short reads average only 150 contiguous base pairs.

Let’s see: 30 x 3 billion base pairs / 20,000 = 4.5 million long reads are made.
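That back-of-envelope calculation in code form, using the N50 as a rough stand-in for the average read length (which is only an approximation, since N50 and mean read length are not the same thing):

```python
# Rough estimate of how many long reads a 30x run produces.
genome_size = 3_000_000_000   # ~3 billion base pairs
coverage = 30                 # 30x average depth
n50 = 20_000                  # half of the reads are longer than this

total_bases = coverage * genome_size      # bases that must be sequenced
approx_reads = total_bases / n50          # treating N50 as avg read length
print(approx_reads)  # 4500000.0 -> about 4.5 million long reads
```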


Unfortunately, Long Reads Might Not Be Long Enough

Despite my original thought that 4.5 million overlapping reads of 20,000 contiguous base pairs should cover the whole genome, apparently that isn’t the case. The long reads can reconstruct good-sized pieces of a chromosome, which are called contigs. But when you have long stretches where there are few SNPs, and those that are there have both allele values the same, then the long reads will not be able to cross the gap. How often does that happen?

Well, as I mentioned above, my VCF file indicates I have 3,442,712 SNPs that are different than the human reference genome. Of those, 2,000,090 SNPs have different allele values, meaning we can use one value to represent one chromosome of the pair and the other value to represent the other. One long read starts a contig. An overlapping long read must contain one of the SNPs with different allele values in the contig in order to extend it.

It sort of works like this:

image

Read 1 includes two SNPs. We know the T and C go together on one chromosome, and the C and G go together on the other. So Read 1 is a contig.

Since Read 2 overlaps with Read 1, we can extend the Read 1 contig.

But the next read, Read 3, does not reach back far enough to include the SNP with the CG values. So we cannot tell whether the C or the G connects to the A or the G in Read 3. So our first contig ends with the AA at the end of Read 2, and the second contig starts at the AA at the beginning of Read 3.
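The contig-breaking rule just illustrated can be sketched as a tiny function. This is my own simplification, assuming a single maximum distance a read can span, whereas real assembly is far more involved:

```python
# A contig can only be extended while consecutive heterozygous SNPs are
# close enough that a single read can cover both. Whenever the gap between
# two adjacent heterozygous SNPs exceeds what a read can span, the contig
# must end and a new one begins.

def count_contigs(het_positions, max_read_span):
    """Count phased contigs given sorted heterozygous SNP positions (bp)."""
    if not het_positions:
        return 0
    contigs = 1
    for prev, curr in zip(het_positions, het_positions[1:]):
        if curr - prev > max_read_span:
            contigs += 1  # no read bridges this gap: start a new contig
    return contigs
```

For example, with heterozygous SNPs at 100, 5,000, 90,000 and 95,000 bp and reads that span at most 20,000 bp, the 85,000 bp gap forces a break, giving two contigs.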

How many contigs will we have? Quite a few are possible. Here are some rough calculations just to get an idea of what the number might be.

I took all my 2 million SNPs with different values and ordered them within chromosome by base pair address. I then found the difference between the next base pair address and the current. This gives the number of base pairs in a row with no differences.
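The gap calculation described above can be sketched like this (the positions in the test are invented; the real ones come from the VCF file):

```python
# Sort heterozygous SNP addresses within each chromosome, take successive
# differences to get the gap sizes, then look at the tail of the
# distribution to see how many gaps are too long for a read to span.

def gap_sizes(positions_by_chrom):
    """Successive base-pair gaps between heterozygous SNPs, per chromosome."""
    gaps = []
    for positions in positions_by_chrom.values():
        ordered = sorted(positions)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sorted(gaps)

def fraction_at_least(gaps, threshold):
    """What fraction of the gaps are at least `threshold` base pairs long?"""
    return sum(1 for g in gaps if g >= threshold) / len(gaps)
```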

I then sorted those and plotted them. Here’s the graph:

image

This says that 2% of my SNPs with different allele values are 15,000 or more base pairs away from the next SNP with different allele values. Out of my 2 million SNPs with different allele values, 2% means 40,000.

0.2% are 70,000 or more base pairs away. Out of my 2 million SNPs, that’s 4,000.

Since my long read test is N50>20,000bp, only half my long reads will be longer than 20,000 base pairs. I do get 30x coverage, or an average of about 30 reads on any base pair position, so let’s say the average longest of the 30 reads is 70,000 base pairs. Then there would be about 4,000 regions that can’t be spanned. Some may be adjacent to each other, so I may get something like 3,000 contigs.

This would give me about 3,000 pieces of my genome. Some will be bigger and some will be smaller, but they should average about 1 million base pairs (which is about 1 cM).

There are methods called scaffolding to try to assemble these pieces correctly to the same chromosome. This is all state of the art stuff to handle long read WGS, so I’ve got some reading to do to understand it all.


Forward Thinking

I look forward to getting my long read WGS results and then comparing them to my short read WGS and my combined raw data file from my 5 standard DNA tests. I know I will learn something from that.

I intend to see how many contigs I get out of the long reads. Maybe my estimates above are wrong and I only get 300 contigs instead of 3,000. I might be able to do something with that and figure out how to scaffold to separate out my allele values into each of the pairs of each chromosome.

And maybe I’ll discover something I hadn’t even thought of. In a few months when I get my long read results, we’ll see.

Advanced Genetic Genealogy - Sat, 13 Apr 2019

Living in Canada, I had to wring my hands waiting an extra two weeks over my US neighbors for my copy of Advanced Genetic Genealogy: Techniques and Case Studies to arrive.

Packaged for me nicely and safely in bubble wrap, the book itself is physically impressive, larger than your average book: full letter size 8.5” x 11” (22 x 28 cm), a full inch (2.5 cm) thick, and despite being soft cover, weighing in at a hefty 3 pounds (1.4 kg). Its 382 pages exclude a 4-page table of contents, a 6-page list of its beautiful full-color figures and tables, a 5-page preface and 2-page acknowledgement by its editor Debbie Parker Wayne, and 7 pages of author biographies.

image

The names of the chapter writers are a who’s who of genetic genealogy: Bartlett, Bettinger, Hobbs, Johnson, Johnston, Jones, Kennett, Lacopo, Owston, Powell, Russell, Stanbary, Turner and Wayne. If you know who these people are, then you are likely knowledgeable enough in this field to take in their wisdom. It is advanced. This is no beginners’ course. You’ll need experience and knowledge of working with your DNA to fully grasp what is said.

Let’s see what can be learned.



1. Jim Bartlett talks about Segment Triangulation.

Now you have a choice. You can either spend hundreds of hours like I did delving to understand every detail in his four years of blog posts on his segmentology.org blog, or you can read this chapter. He tells you how he uses Segment Triangulation to create Triangulation Groups to allow him to do Chromosome Mapping.

My favorite line from Jim’s chapter: “You can be confident that virtually all of the segments in a Triangulation Group are IBD. This statement has been contested because it has not been proved or published. However after five years of Triangulating, I have not found any evidence to the contrary.”

p.s. I have been working the past few months to implement chromosome mapping techniques similar to what Jim describes in his chapter into the next version 3.0 of Double Match Triangulator. He gives me some new ideas to wake up to think about at 3 a.m.


2. Blaine Bettinger covers Visual Phasing.

Visual Phasing is a technique to map the segments shared by three or more siblings to determine the grandparents who supplied them. This is generally done manually from GEDmatch one-to-one comparisons of the three siblings. I have not personally used Visual Phasing for myself because I’m not fortunate enough to have any sets of three siblings who have DNA tested.

This is one of the advanced techniques that has some tools available to help you, but none that yet do it for you. I’m sure the tools to do VP for you will be one of those innovations that appears in the next few years. I’m not going to be the one to build that tool (because I don’t personally need it), but I am implementing some of the ideas of Visual Phasing into DMT.


3. Kathryn Johnston talks about the X Chromosome.

You just can’t help enjoying any writing that brings up the Fibonacci sequence. Kathryn’s most interesting comment to me and something I never knew is that “Visual phasing began with X comparison and the X is still recommended as a starting point.”

I haven’t spent a lot of time on the X chromosome for my own DNA. It really is a bit of a different beast, and I love its one main property: the ancestral line an X segment comes from cannot go through a father and his father. That can immediately eliminate false MRCAs.


4. Jim Owston on Y DNA.

Well, I’ve done the Y-111 and Big Y500 at Family Tree DNA to help with the Jewish Levite DNA studies. I’d feel better about and work harder with Y-DNA if my closest match was within my 5 generation genealogical time horizon. Alas, it is not and I can’t even use the common surname idea because my ancestors in Romania and Ukraine only adopted their surnames 5 generations ago. So until something breaks through here, I’ll have to remain an autosomal guy. I envy Jim and anyone who can include their 8 generation lineage charts that run from 1520 to 1831. Sick!

Jim has a good writeup on the benefits of going from Big Y-500 to Big Y-700. I see no personal benefit for my own genealogy to upgrade, but if I’m approached because it will help the Levite study, then I’ll likely do it for them. Technically, the study is finding people related to me, albeit along the lines of Jim’s people who are 10 to 20 generations back, but in my case, unlikely to ever be genealogically connected to me. 


5. Melissa Johnson on Unknown Parentage.

Many people do not know who their birth parent or parents are. Melissa describes the various ways to analyze your DNA matches to determine who they might be. She includes Blaine Bettinger’s Shared cM Project tables, X-DNA, Y-DNA, haplogroups, lists various background check websites, and then issues involved in targeted testing when dealing with a birth family.


6. Kimberly Powell on Endogamy.

Ah, endogamy, I know thee well. Kimberly describes all the complications that endogamy brings to the table to make DNA analysis much more challenging. She talks about matches being predicted closer than they are, how “in common with” (ICW) matches can be deceiving, and how clustering systems like the Leeds method do not give clear cut answers.

Kimberly says to check for runs of homozygosity using GEDmatch’s “Are my parents related?” tool. Interestingly for me, with my great amount of endogamy, you’d think my parents would turn out to be related at least at the 3rd or 4th cousin level. But they don’t.

image

One segment of 8.8 cM and 9.8 generations apart for an endogamous population is not much at all. Despite the endogamy of the general population of both my parents, somehow my paternal and maternal families must have remained mostly separate. My paternal side is from towns now in Romania that are a few hundred kilometers from my maternal side’s towns that are now in Ukraine.

When I check my uncle (my father’s brother), he gets no indication that his parents (my paternal grandparents) are related:

image

My paternal grandparents are from two towns now in Romania that are about 300 kilometers (200 miles) apart.

Kimberly also brings up the calculation of the coefficient of relationship, and describes how to use triangulated groups, trees, chromosome mapping and cluster analysis to help identify relationships.


7. Debbie Parker Wayne on Combining atDNA and Y-DNA.

Debbie brings up a very detailed case study from her own research to illustrate some of her methodology. Debbie’s two full pages of citations are impressive unto themselves, and they show the professionalism of her amazing research and analysis.

Debbie includes a bit of almost every technique, and her article is the only one in the book to include Ancestry’s DNA Circles.


8. Ann Turner on Raw Data.

Ann’s article is about the Raw Data you download from the testing company. She describes the different file structures of each company, explains RSID and SNP selection, why there are no-calls and miscalls, what phasing is and what statistical phasing is. She goes into child phasing, segments, boundaries (“The actual boundaries may be fuzzier”), builds and genetic distance. I’ve always loved the relationship versus cM versus number of segments chart (Figure 8.8) produced originally by 23andMe that Ann describes.

Then Ann goes into SNPs, overlap between the SNPs tested at the various companies, and why this is important at GEDmatch Genesis. She then talks about other tools for raw data, and finishes by mentioning whole genome sequencing (WGS).


9. Karen Stanbury on DNA and the Genealogical Proof Standard (GPS).

You’ll want to read this chapter if you are a professional genealogist who wants to incorporate DNA into the work you do for your clients. The GPS is expected in any professional work done. Karen describes the testing plan, documentation, focus study groups, correlation, the formulation of a hypothesis, testing the hypothesis, and writing the conclusion.


10. Patricia Hobbs, a Case Study.

Patricia basically follows the principles that Karen described in her chapter, and describes the path taken to use documents and DNA evidence to identify an unknown ancestor. This is another one of those papers that you usually see in an advanced genealogical journal, and it definitely shows you what you must attempt to achieve if you want your work to be published. Very impressive, and way beyond what I ever hope to achieve.


11. Thomas Jones on Publishing Your Results.

Dr. Jones is the author of the classic “Mastering Genealogical Proof” and he applies all his knowledge and techniques in this chapter. His conclusion: “When genealogists, geneticists, and genetic genealogists use DNA test results to help establish genealogical conclusions, they are genealogical researchers. When they write about that research, they become scholarly writers. When their written work helps present-day and future researchers and members of the families that they have studied, they have met their research and writing goals.”


12. Judy Russell on Ethics in Genetic Genealogy.

I love Judy. I read every one of her Legal Genealogist blog posts. But arguing about ethics, like politics, is something I prefer to leave to others. Judy is an expert in these matters. If you’re worried about any ethical matter with respect to a DNA test, get this book and read this chapter.


13. Michael Lacopo on Uncovering Family Secrets.

I was dreadfully afraid to take a DNA test several years ago, simply because I didn’t want to find out that my father wasn’t my father. If you would have looked at pictures of my father and his siblings and me, you would have said I had nothing to worry about. I ended up getting my uncle to take the test and I took it myself and I’m happy to report that he is indeed my full uncle.

Do you have that worry? Well you should. No matter who you are, you are sure to find a few skeletons in the closet. They may not be immediate family, but they will occur among your DNA relatives. The reasons are varied: sometimes covered-up deeds, sometimes mistakes (a switch at birth), or the result of violent crime (e.g. rape). Michael’s chapter is a wonderful treatise on the psychology behind all this. He talks about identity and self, privacy and outcomes. His chapter and Judy’s chapter work hand in hand.


14. Debbie Kennett on the Future!

They couldn’t have picked a better person to write this chapter. This chapter alone is worth the price of the book. Debbie talks about the promise and limitations of: Y-DNA testing, mtDNA testing, autosomal DNA testing, Whole Genome Sequencing, ancestral reconstruction, DNA from our ancestors, and the power of big data.

Debbie was nice enough in her chapter on Whole Genome Sequencing to make mention of one of my posts about the VCF file. As a result, I’m proud to have my name listed in the index of the book on page 354, right after Debbie’s.

Debbie sees the time many years in the future when we will take a DNA test, put our name into a database, and produce an instant, fully sourced family tree, complete with family photographs and composite facial reconstructions. I guess something like this:



Conclusion

If you feel you are ready to plunge into some advanced material to take you to the next level, don’t wait. Get the book now. You don’t have much time to learn this because the field is growing and advancing as we speak. Within a few years, a whole new set of advanced tools and ideas will have been developed to help us with our genealogical DNA endeavors. Debbie Parker Wayne’s AGG will be the prerequisite knowledge to get to that next level.

Now don’t think for a second that I’ve been able to read and digest everything in this book over the past two days. No, I’ve skimmed and read some parts just to get a feel of it and to write this blog post. It’s going to take me a few months to read it all in detail and take in everything.

Final review score:  A++

WGS Result Files - Sat, 13 Apr 2019

I received the rest of my raw data files for my WGS (Whole Genome Sequencing) test today. It was shipped from their lab in Italy and came on a 1 TB hard drive.

image

Previously I was able to download my VCF files from Dante’s site. I reported on the files a couple of months ago in my post: My Whole Genome Sequencing. The VCF File. Those files were compressed and totaled 224 MB and expanded to 869 MB. The VCF (Variant Call Format) files only contain the variants, i.e. the readings where I vary from the human genome reference.

The files supplied this time include all my data, not just the variants. So they are much larger. As a result they were sent to me on the large hard drive. They include the BAM and FASTQ files.

The files are provided in three folders:

  • clean_data
  • result_alignment
  • result_variation



The FASTQ files

The clean_data directory contains 16 files named something like:
     aaaaaaaaaaa_L01_5mm_n.fq.gz

where aaaaaaaaaaa is some identifier, mm runs from 78 to 85, and n is 1 or 2.

Each file is about 8 GB in size and is gzip compressed at about 34%. I have a fast Intel i7 computer and it takes 30 minutes to uncompress one of these files to its full size of about 22 GB.

When unzipped, the .gz drops off the file name and the .fq suffix represents a FASTQ file.
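For anyone doing this in Python rather than with a separate unzip utility, a streaming decompression avoids loading the whole 22 GB file into memory (the file names below are placeholders):

```python
# Stream-decompress a .fq.gz file to its .fq form, one chunk at a time.
import gzip
import shutil

def gunzip(src, dest, chunk=1024 * 1024):
    """Decompress the gzip file at src into dest without reading it all at once."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk)
```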

Here’s what the beginning of one of the unzipped FASTQ files looks like:

image

Shown above are the first 4 groups of readings in one of the files, where 4 lines make up a reading. The first line of each group of 4 is an identifier, the 2nd is exactly 100 base pair values, the 3rd is just a plus sign (at least in the records I glanced at), and the 4th is codes that represent the quality of each of the 100 reads.
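A minimal sketch of reading those 4-line groups. Real tools such as Biopython’s SeqIO handle malformed records and other edge cases; this assumes well-formed records like the ones shown:

```python
# Walk a FASTQ file four lines at a time, yielding one record per group:
# identifier, sequence of base values, and the per-base quality string.

def read_fastq(path):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file."""
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:
                break                     # end of file
            sequence = f.readline().rstrip()
            f.readline()                  # the '+' separator line
            quality = f.readline().rstrip()
            yield header, sequence, quality
```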

Also in the directory is a small Excel file that contains some summary statistics from my WGS test. It contains:

image

Sample is my kit number. There were 1.5 billion reads. In order to get an average of 30x coverage on 3 billion base pairs, I figure the average read length would have to be 60 base pairs.

Each of the eight 5mm files has two pictures associated with it that show some results, e.g.

aaaaaaaaaaa_L01_578.base.png contains:

image

and aaaaaaaaaaa_L01_578.qual.png contains:

image

Most of this is new to me too, so I can’t explain what it all means yet.



The BAM file

The result_alignment directory contains the aaaaaaaaaaaaaa.bam file. It is 115 GB in size, compressed at 27%, which expands to 425 GB. Decompression time for this file on my computer is over 17 hours. For most purposes, you don’t want to decompress this file, since most genome analysis programs work with the BAM file itself and with a small bam.bai (BAM index) file of 8 MB that is also in the directory. The BAM index file allows the programs to go directly to the section of the genome that the analysis program needs.
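As an aside, the reason the .bam file compresses so well is that BAM is stored in BGZF, a blocked variant of gzip, and every BGZF block begins with the standard two gzip magic bytes. A quick sanity check that a file really is gzip-style compressed:

```python
# Check whether a file begins with the gzip magic bytes 0x1f 0x8b,
# which both plain .gz files and BGZF-compressed BAM files start with.

def looks_gzip_compressed(path):
    """True if the file's first two bytes are the gzip magic numbers."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```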

This directory also contains the same summary xls file that was with the FASTQ files (see above) and three png images:

aaaaaaaaaaa.Depth.png that shows that I mostly achieved at least 30x coverage

image

aaaaaaaaaaa.Cumulative.png that shows the cumulative distribution of the depth

image

aaaaaaaaaaa.Insert.png

image

Insert size has something to do with the analysis process. I used to know what paired reads are, but I forgot. I’ll have to look that up again if I ever have to use them.



Variation Files

There are four subdirectories named sv, snp, indel and cnv:

image

They contain a number of files, some gzipped (which I decompressed in the above listing). Basically, these are various files indicating all my gene, exome and genome differences from the human reference for both my SNPs and my INDELs (insertions/deletions). These files are in a different format than the two VCF files I downloaded for SNPs and INDELs earlier.



What’s Ahead

I purchased the Long Read WGS test from Dante a few days ago. I think I’m going to wait until I get the results from my Long Read test. This will likely take a few months. Once I get the long read results, I’ll look at the BAM and FASTQ files from both tests, compare them to each other, and see what I can learn from them.

With just one test, you can’t tell how good it is. But with two, you can compare their results to each other. It should be interesting.