WGS Long Reads Might Not Be Long Enough - Wed, 17 Apr 2019
Today my Dante Labs kit for my Whole Genome Sequencing (WGS) Long Reads arrived. Dante became the first company to make WGS Long Reads available to the general public. The price they are charging is $999 USD, but past customers of Dante Labs are eligible for a $200 USD discount, which brings it down to $799. In 2016 the cost of long read sequencing was around $17K, and they hoped to get the price down to $3K by 2018. Here it is, 2019, and it’s available to the general public at $1K.
I had purchased a Dante Labs WGS, the standard short reads test, last August (2018) when they had it on sale for $399 USD. That was a great price as they had only a few months earlier lowered it from $999 USD, and a year earlier you’d have had to pay several thousand dollars for any whole genome test from anyone. Dante currently offers their standard short read WGS for $599, but if you want it, you can wait for DNA Day or other sales, and I’m sure it will come down.
In October, when Dante had my sample, I had started reading about long read WGS technology, so I asked Dante if they had that technology available. They said they did. I asked how much that would be. They said $1,750 USD. I asked if they could do a long reads test from my sample and they checked and said, no, the sample had started sequencing already.
So I wasn’t able to do the long read test back in October. But it worked out anyway. Now, I will have both the short read test and the long read test for $550 less than the cost would have been for the long read test alone just 6 months ago. This is actually excellent because I will be able to analyze the short read test, analyze the long read test, and then compare the two. When you have just one test you can make no estimate as to the error rate, but when you have two tests to compare, each difference represents an error in at least one of the tests, and an average error rate can be calculated.
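Here is a minimal sketch of that comparison. It assumes each test has already been reduced to a mapping from (chromosome, position) to a two-letter genotype; the function name and the toy data are made up for illustration. A disagreement only says that at least one of the two tests is wrong at that position, not which one.

```python
def discordance_rate(calls_a, calls_b):
    """Fraction of positions called in both tests where the genotypes disagree."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0.0
    # Sort the alleles so that "AG" and "GA" count as the same genotype.
    mismatches = sum(
        1 for site in shared if sorted(calls_a[site]) != sorted(calls_b[site])
    )
    return mismatches / len(shared)


# Toy data: three positions are present in both tests, one of them disagrees.
short_read = {("1", 1000): "AG", ("1", 2000): "CC", ("2", 500): "TT", ("2", 900): "AC"}
long_read = {("1", 1000): "GA", ("1", 2000): "CT", ("2", 500): "TT", ("3", 100): "GG"}

rate = discordance_rate(short_read, long_read)
print(f"{rate:.3f}")
```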
What Good is WGS for Genealogical Cousin Matching?
WGS testing, whether long reads or short reads, provides no help for relative matching. Matching is based on the 700,000 or so SNPs that a company tests. Those SNPs are spread out over the 3 billion base pairs of your genome. The standard DNA tests you take do a good job of identifying those SNPs for matching purposes.
WGS testing is for determining all your 3 billion base pairs and finding all the SNPs where you vary from the human reference. From my short read WGS test, my VCF file had 3,442,712 entries, which are the SNPs where I differ from the human reference. The SNPs other than the 700,000 the company tests are not used for matching, so getting their values does not help matching. Those extra SNPs are very important for medical studies, but not matching. The 700,000 vary enough already that DNA companies would get very little benefit by adding to that number.
The reason to combine raw data from multiple companies, as you can now do at GEDmatch, is that GEDmatch compares tested SNPs between different companies. Some companies have very little overlap between them, i.e. fewer than 100,000 SNPs may be in common and available to be compared, which is too few for reliable matching. Combining the multiple kits will increase that overlap number for you.
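To illustrate why combining helps, here is a toy sketch using Python sets. The company names and SNP ids are invented, and this is not how GEDmatch is implemented; it only shows that the union of two kits can share more SNPs with a third company’s kit than either kit does alone.

```python
def overlap(kit_a, kit_b):
    """Number of SNPs tested in both kits, i.e. available for comparison."""
    return len(kit_a & kit_b)


# Hypothetical SNP sets tested by three hypothetical companies.
company_x = {"rs1", "rs2", "rs3", "rs4"}
company_y = {"rs3", "rs4", "rs5", "rs6"}
company_z = {"rs1", "rs5", "rs7"}

# My kit from company X alone, compared with a match who tested at company Y.
single = overlap(company_x, company_y)

# My combined kit (X plus Z), compared with the same match.
combined = overlap(company_x | company_z, company_y)

print(single, combined)
```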
So for genealogical purposes, you’re likely better off spending your money taking a few standard DNA tests from companies who give you your matches. Then you can create a combined kit at GEDmatch Genesis. A WGS test would not help you with this.
So Why Did I Take a WGS Test?
Other than insatiable curiosity and the need to know, I was hoping to see what, if anything, WGS tests will do that could help a genetic genealogist. My current conclusion (as I just wrote) is not that much.
For analysis of your DNA for health purposes, you will want a WGS test. Most regular DNA companies do not test many SNPs that have known health effects. Even 23andMe only tests a subset of medically-related SNPs. Dante Labs specializes in reports for medical purposes. When you take a test with them, you can request detailed custom reports on specific ailments you may have, like this sample report on epilepsy.
But for me, I’m not really interested in the medical information.
So Why Did I Want To Take a Long Read WGS Test?
A Nanopore Technologies white paper about The Advantages of Long Reads for Genome Assembly gave me the idea that maybe the long reads would overlap enough that they could be used to phase my raw data. Phasing is separating out the pair of allele values of each SNP into their paternal and maternal values. I would thus find the 22 autosomal chromosomes of DNA that I got from my father and the 22 autosomal chromosomes I got from my mother. If you phase your DNA and create a raw data file from it, you can use it to find the people who match just one parent.
Typically, when you are like me and your parents have passed away and they had never DNA tested, phasing would need to be done with the raw data of close relatives such as siblings, children, aunts, uncles or cousins, nieces or nephews who did test. You can use tools like Kevin Borland’s DNA Reconstruction Toolkit. But I only have an uncle who has tested. Just an uncle isn’t quite enough. Maybe, I thought, long reads would overlap enough to span the entire chromosome and voila, you’ve phased it.
Dante’s long read test uses Oxford Nanopore PromethION technology. The specs are 30x with N50>20,000bp. That means that half of the sequenced bases will be in reads longer than 20,000 contiguous base pairs, and enough reads are made to give an average coverage of 30 reads for every base pair in the genome. By comparison, short reads average only 150 contiguous base pairs.
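Strictly speaking, N50 is computed over bases rather than over reads: it is the read length at which reads of that length or longer hold at least half of all the sequenced bases. Here is a small sketch of that calculation on made-up read lengths. The function name is my own, not something from a sequencing toolkit.

```python
def n50(read_lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy read lengths: 30 bases in total, so the halfway mark is 15 bases.
toy_reads = [2, 3, 4, 5, 6, 10]
print(n50(toy_reads))
```

In this toy set the two longest reads (10 and 6) already hold more than half the bases, so the N50 is 6 even though half of the reads are shorter than 5. Because it is weighted by bases, N50 is usually larger than the median read length.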
Let’s see: 30 x 3 billion base pairs / 20,000 = 4.5 million long reads are made.
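The same back-of-envelope arithmetic as a short sketch. Treating 20,000 as a typical read length is an assumption made only for this estimate, since N50 is not the same thing as the average read length. The short read line uses the 150 base pair figure mentioned above.

```python
GENOME_SIZE = 3_000_000_000   # base pairs, rounded
COVERAGE = 30                 # average reads covering each base pair
LONG_READ_LENGTH = 20_000     # assumption: N50 used as a stand-in for typical length
SHORT_READ_LENGTH = 150

total_bases = COVERAGE * GENOME_SIZE

# Number of reads needed to deliver that many bases at each read length.
long_reads = total_bases // LONG_READ_LENGTH
short_reads = total_bases // SHORT_READ_LENGTH

print(f"{long_reads:,} long reads")
print(f"{short_reads:,} short reads")
```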
Unfortunately, Long Reads Might Not Be Long Enough
Despite my original thought that 4.5 million overlapping reads of 20,000 contiguous base pairs should cover the whole genome, apparently that isn’t the case. The long reads can reconstruct good sized pieces of a chromosome, which are called Contigs. But when there is a long stretch with few SNPs, and the SNPs that are there have two identical allele values, the long reads will not be able to cross the gap. How often does that happen?
Well, as I mentioned above, my VCF file indicates I have 3,442,712 SNPs that are different than the human reference genome. Of those, 2,000,090 SNPs have different allele values, meaning we can use one value to represent one chromosome and the other value to represent the other chromosome of the pair. One long read starts a contig. An overlapping long read must contain one of the SNPs with different allele values in the contig in order to extend it.
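For anyone wanting to get the same two counts from their own file, here is a rough sketch. It assumes a plain-text, single-sample VCF where the FORMAT column includes a GT field, and it counts every record without separating SNPs from indels. The example lines are invented, not taken from my file.

```python
import re


def count_variants(vcf_lines):
    """Return (records with a called genotype, records whose two alleles differ)."""
    total = het = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        genotype = sample_values[format_keys.index("GT")]
        alleles = re.split(r"[/|]", genotype)  # "/" is unphased, "|" is phased
        if "." in alleles:
            continue  # no call at this position
        total += 1
        if len(set(alleles)) > 1:
            het += 1
    return total, het


# Invented example records.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:28",
    "2\t500\t.\tG\tA,C\t50\tPASS\t.\tGT:DP\t1/2:31",
    "2\t900\t.\tT\tC\t50\tPASS\t.\tGT:DP\t./.:0",
]

print(count_variants(example))
```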
It sort of works like this:
Read 1 includes two SNPs. We know the T and C go together on one chromosome, and the C and G go together on the other. So Read 1 is a contig.
Since Read 2 overlaps with Read 1, we can extend the Read 1 contig.
But the next read, Read 3 does not reach back far enough to include the SNP with the CG values. So we cannot tell whether the C or the G connects to the A or the G in Read 3. So our first Contig ends with the AA at the end of Read 2, and the second Contig starts at the AA at the beginning of Read 3.
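The same idea as a simplified sketch. It assumes one chromosome, error-free reads given as (start, end) positions, and a list of positions of SNPs with different allele values. Two neighbouring SNPs end up in the same contig only if at least one read covers both of them. The positions are made up, and real phasing tools do much more than this.

```python
def phase_blocks(het_positions, reads):
    """Group SNP positions into contigs linked by reads that span neighbouring SNPs."""
    blocks = []
    current = []
    for pos in sorted(het_positions):
        if current:
            prev = current[-1]
            # A read links two neighbouring SNPs only if it covers both of them.
            linked = any(start <= prev and pos <= end for start, end in reads)
            if not linked:
                blocks.append(current)
                current = []
        current.append(pos)
    if current:
        blocks.append(current)
    return blocks


# Toy layout loosely following the three reads described above.
het_snps = [100, 300, 600, 900]
reads = [
    (50, 350),    # Read 1: covers the SNPs at 100 and 300
    (250, 700),   # Read 2: overlaps Read 1 at 300 and reaches the SNP at 600
    (650, 1000),  # Read 3: covers 900 but does not reach back to 600
]

print(phase_blocks(het_snps, reads))
```

Checking only neighbouring SNPs is enough here because a read is contiguous: any read that covers two SNPs also covers every SNP between them.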
How many contigs will we have? Quite a few are possible. Here are some rough calculations just to get an idea of what the number might be.
I took all my 2 million SNPs with different values and ordered them within chromosome by base pair address. I then found the difference between the next base pair address and the current. This gives the number of base pairs in a row with no differences.
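A minimal sketch of that step, assuming the SNPs with different allele values are available as (chromosome, position) pairs. The data is invented. Gaps are only measured within a chromosome, never across two chromosomes.

```python
from collections import defaultdict


def gaps_to_next_het(sites):
    """Distance from each SNP to the next one on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)
    gaps = []
    for positions in by_chrom.values():
        positions.sort()
        # Pair each position with the one that follows it.
        gaps.extend(b - a for a, b in zip(positions, positions[1:]))
    return gaps


# Invented positions on two chromosomes.
sites = [("1", 400), ("1", 100), ("1", 20400), ("2", 1500), ("2", 1000)]

print(sorted(gaps_to_next_het(sites)))
```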
I then sorted those and plotted them. Here’s the graph:
This says that 2% of my SNPs with different allele values are at 15,000 or more base pairs away from the next SNP with different allele values. Out of my 2 million SNPs with different allele values, 2% means 40,000.
0.2% are 70,000 or more base pairs away. Out of my 2 million SNPs, that’s 4,000.
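Reading those percentages off a list of gaps, and turning them into counts, can be sketched like this. The list of gaps is a toy example; the 2% and 0.2% figures are the ones from my data above.

```python
def share_at_least(gaps, threshold):
    """Fraction of gaps that are at least `threshold` base pairs long."""
    return sum(1 for gap in gaps if gap >= threshold) / len(gaps)


# Toy gaps, just to show the function at work.
toy_gaps = [100, 500, 16000, 200, 80000, 300, 50, 700, 900, 1200]
print(share_at_least(toy_gaps, 15_000), share_at_least(toy_gaps, 70_000))

# Scaling the percentages from my data to my 2 million SNPs.
HET_SNPS = 2_000_000
gaps_15k_or_more = round(0.02 * HET_SNPS)
gaps_70k_or_more = round(0.002 * HET_SNPS)
print(gaps_15k_or_more, gaps_70k_or_more)
```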
Since my long read test is N50>20,000bp, only about half of my sequenced bases will be in reads longer than 20,000. I do get 30x coverage, or an average of about 30 reads on any base pair position, so let’s say the longest of the 30 reads averages 70,000 base pairs. Then there would be about 4,000 regions that can’t be spanned. Some may be adjacent to each other, so I may get something like 3,000 contigs.
This would give me about 3,000 pieces of my genome. Some will be bigger and some will be smaller, but they should average about 1 million base pairs (which is about 1 cM).
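The average size is just the genome length divided by the number of pieces. A quick sketch, using the rounded 3 billion base pair genome and my rough guesses for the number of pieces:

```python
GENOME_SIZE = 3_000_000_000  # base pairs, rounded


def average_piece_size(pieces):
    """Average length in base pairs if the genome splits into `pieces` parts."""
    return GENOME_SIZE // pieces


# 3,000 pieces is the rough estimate; 4,000 is the count before merging adjacent gaps.
for pieces in (3_000, 4_000):
    print(f"{pieces:,} pieces -> {average_piece_size(pieces):,} bp each on average")
```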
There are methods called scaffolding to try to assemble these pieces correctly to the same chromosome. This is all state of the art stuff to handle long read WGS, so I’ve got some reading to do to understand it all.
Forward Thinking
I look forward to getting my long read WGS results and then comparing them to my short read WGS and my combined raw data file from my 5 standard DNA tests. I know I will learn something from that.
I intend to see how many contigs I get out of the long reads. Maybe my estimates above are wrong and I only get 300 contigs instead of 3,000. I might be able to do something with that and figure out how to scaffold to separate out my allele values into each of the pairs of each chromosome.
And maybe I’ll discover something I hadn’t even thought of. In a few months when I get my long read results, we’ll see.