
Louis Kessler’s Behold Blog

Whole Genome: The VCF File, Part 2 - Mon, 22 Apr 2019

A couple of months ago, I compared my VCF file to my DNA test results.

The Variant Call Format (VCF) file is given to you when you do a Whole Genome Sequence (WGS) test. That test finds your DNA values for your whole genome, all 3 billion positions, not just the 700,000 or so positions that a standard DNA test gives you.

But most of those 3 billion positions are the same for most humans. The ones that differ are called Single-Nucleotide Polymorphisms (SNPs) because they “morph” and can have differing values among humans. The standard DNA companies test a selection of the SNPs that differ the most, and they can use the 700,000 they selected for matching people without having to test all 3 billion positions. It works very well. WGS tests are not needed for finding relatives.


Converting VCF to a Raw Data File

But near the end of my last post, I was trying to see if the VCF file could be converted into a raw data file that could be uploaded to GEDmatch or a DNA company that allows raw data uploads.

My VCF file contains 3,442,712 SNPs whose values differ from the standard human reference genome. Of those, I found 471,923 SNPs that were the same SNPs (by chromosome and position) as those in the raw data file I created by combining the raw data from 5 companies (FTDNA, MyHeritage, Ancestry, 23andMe and LivingDNA). In my first analysis, I compared them and found that 2,798 of them differed, which is only 0.6%.

At the time, I didn’t think that was too bad an error rate. So I thought a good way to make a raw data file from a VCF file would be:

  1. Take a raw data file you already have to use as a template.
  2. Blank out all the values.
  3. Add values for the positions that are in the VCF file.
  4. Fill in the others with the human reference genome value.

The basis of that idea is that if it’s not a variant in the variant file, then it must be the reference value.
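
In code, the recipe would look something like this minimal sketch. The data structures are my own simplification, and reference_value() is a hypothetical lookup into the human reference genome that you would have to supply:

```python
def build_raw_data(template, vcf_genotypes, reference_value):
    """template: dict of (chrom, pos) -> rsid from an existing raw data file.
    vcf_genotypes: dict of (chrom, pos) -> genotype parsed from the VCF file.
    reference_value: hypothetical lookup into the human reference genome."""
    raw_data = {}
    for (chrom, pos), rsid in template.items():
        if (chrom, pos) in vcf_genotypes:
            # Step 3: the position is in the VCF, so use its variant value
            raw_data[(chrom, pos)] = (rsid, vcf_genotypes[(chrom, pos)])
        else:
            # Step 4: not in the VCF, so assume homozygous human reference
            ref = reference_value(chrom, pos)
            raw_data[(chrom, pos)] = (rsid, ref + ref)
    return raw_data
```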

Today on Facebook, Ann Turner told me that that’s not necessarily the case. The reason, she believes, is that the VCF file does not contain all the variant SNPs. And the discrepancies were enough to break her comparison of “herself” with “herself” into 161 segments.


So What’s Really Different Between VCF and Raw Data?

In my first analysis, I only compared whether the values were the same or not, giving that 0.6% difference. I did not look at the specific values. Let’s do that now:

[Image: table comparing my All-5 raw data genotype values against my VCF file values]

For this analysis, let’s not worry about the rows: DD (Deletions), DI (Deletion/Insertions), II (Insertions) or – (no-calls), since they are only in the raw data and not in the VCF file.

The green values down the diagonal are the agreement between the All-5 raw data file and the VCF file. Any numbers above and below that diagonal are disagreements between the two. Those are the 0.6% where one is wrong for sure, but we don’t know which.

But let me now point you to the yellow numbers in the “Not in VCF” column. Those are all heterozygous values, with two different letters: AC, AG, AT, CG, CT or GT. If they have two different letters, then they cannot be human reference values. One of the two letters is a variant, and those entries should have been in the VCF file. But they were not.

This creates an even bigger concern than our earlier 0.6% mismatch. Totalling those yellow counts gives 10,705 SNPs, or 1.2% of the 881,119 SNPs not in the VCF file, that should have been in it.
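
Finding these entries is a simple check. Here’s a minimal sketch, assuming the raw data genotypes and the VCF positions have already been parsed into Python structures:

```python
BASES = set("ACGT")

def missing_heterozygous(raw_genotypes, vcf_positions):
    """raw_genotypes: dict of (chrom, pos) -> genotype string like "AG".
    vcf_positions: set of (chrom, pos) tuples parsed from the VCF file."""
    missing = []
    for (chrom, pos), gt in raw_genotypes.items():
        # Skip no-calls and insertion/deletion codes (--, DD, DI, II)
        if len(gt) != 2 or not set(gt) <= BASES:
            continue
        # A heterozygous call has two different letters, so at least one
        # of them is a variant and the position belongs in the VCF file
        if gt[0] != gt[1] and (chrom, pos) not in vcf_positions:
            missing.append((chrom, pos, gt))
    return missing
```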

Again, we don’t know which is wrong, the raw data or the VCF file. But from Ann’s observations, we’d have to say at least some of those heterozygous values must have been left out of the VCF file, and when reference values were filled in instead in Ann’s file, they broke her matches into the 161 segments she observed.


Which is Correct: VCF or Raw Data?

When you are comparing two things, and you know one is wrong, you don’t know which of the two is the wrong one. You need others to compare with. I am awaiting the results of my long read WGS test, and when that comes I’ll have a third to compare.

But until then, can I get an idea of which of the two files might be more often correct? There’s one thing I can do.

I can provide the same table as I did above, but for the X and Y chromosomes. Since I’m a male, I have only one X and one Y chromosome. Each value could be shown as a single letter, but it is still read as a pair of values and is therefore shown as two letters. The caveat is that the two letters must be the same, or the read is definitely incorrect. (Note that this table excludes 688 SNPs in the pseudoautosomal regions of the X and Y, which can recombine and have two different allele values.)
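
Counting those definitely-incorrect reads is straightforward. A minimal sketch, assuming parsed genotypes and a set of the pseudoautosomal positions to exclude:

```python
def count_bad_xy_reads(genotypes, pseudoautosomal):
    """genotypes: dict of (chrom, pos) -> genotype string like "AG".
    pseudoautosomal: set of (chrom, pos) tuples to exclude."""
    bad = 0
    for (chrom, pos), gt in genotypes.items():
        if chrom not in ("X", "Y") or (chrom, pos) in pseudoautosomal:
            continue
        # A male has one X and one Y, so a genotype with two different
        # letters there is a definite read error
        if len(gt) == 2 and gt[0] != gt[1]:
            bad += 1
    return bad
```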

So let’s see the table:

[Image: the same comparison table, for the X and Y chromosomes only]

The top left section contains the agreed upon values (in green) between the All 5 raw data file and the VCF file. The counts in that section above and below the green values are disagreements between All 5 and VCF and we don’t know which is correct and which is wrong.

The numbers in red are incorrect reads. Those on the left side are incorrect reads in the All 5 raw data file: 219 incorrect reads versus 40,848 correct reads, a ratio of 1 in every 186.

The right side shows incorrect reads in the VCF file: 13 incorrect reads versus 9,605 correct reads, a ratio of 1 in every 739.

Now, verging on the realm of hyperbole, the difference in ratios could indicate that an error in a standard DNA test is 4 times (739 / 186) more common than an error in a VCF file.

And applying that ratio to the 10,705 heterozygous values that should have been in the VCF file, we would say that 8,564 would be because the raw data file is wrong, and 2,141 because the VCF file should have included them but did not.

And if 2,141 values out of your DNA file created from the VCF file are incorrect, couldn’t that quite easily have caused the 161 segments that Ann observed?

Yes, this is all conjecture. But the point is that maybe the VCF file is leaving out a significant number of variants. If that is the case, then we can’t just put in a reference value when there is no value in a VCF file. And that would mean a raw data file created from a VCF file and filled in by human reference values may not have enough accuracy to be usable for matching purposes.

Compare Your Number of DNA Matches Among Companies - Sun, 21 Apr 2019

I saw a post on Facebook trying to compare the number of relative matches a person had at different DNA testing companies.

Here are my results, with my endogamy:

[Image: table of my number of DNA matches at each company]

Note that there are a few things to consider when you or anyone else does a comparison of your number of DNA matches.

A few companies only give you a specific number of matches. GEDmatch gives 2,000, GEDmatch Genesis gives 3,000, and 23andMe limits you to 2,000 but only lets you see those who have opted in to sharing.

Some companies give you all your matches down to some minimum threshold. The minimum cM is not necessarily the only criterion; largest segment length and number of segments may also be considered. Ultimately, that works out to an effective minimum Total cM for me of 37 cM at Living DNA, 17 cM at Family Tree DNA, 12 cM at MyHeritage and 6 cM at AncestryDNA. Since AncestryDNA goes right down to matches who have a single matching segment of just 6 cM, I have, as expected, a very large total number of matches with them. Even without endogamy, you will likely have your largest number of matches at AncestryDNA because of this low matching limit.

If you look at only larger matches, you get a completely different story. I counted the number of matches I had that were a total of 50 cM or more. You’ll see I have very few, just 56 matches at AncestryDNA. That’s because Ancestry uses their Timber algorithm to eliminate many segments they consider to be false. Family Tree DNA, on the other hand, has a lot of matches of 50 cM or more simply because they include segments right down to 1 cM in their total, and therefore will show a larger Total cM than the same match would have at another company.

I’ve added a Database size column. These are numbers I have visually read off of Leah Larkin’s www.theDNAgeek.com/dna-tests chart of Autosomal DNA Database Growth as of April 2019.

When you divide the matches by the database size, in my case the largest proportion of the database I match to is 1.7% at Family Tree DNA, followed by 1.1% at AncestryDNA.

All those statistics are just statistics. What’s much more important, and what we as genealogists want, are people whose exact connection to us we can determine, and for whom we can identify a Most Recent Common Ancestor. These are the people who, through DNA and genealogy, will help us to expand our family tree.

My endogamy does give me lots of matches, but I have few connections I can determine, because Romanian and Ukrainian records rarely go back further than the mid 1800s, limiting me to about 5 generations genealogically. My best success so far in finding DNA testers whose relationship I’ve been able to determine (well, at least one relationship; there may be others) is 11 relatives at AncestryDNA and 10 at 23andMe. At Ancestry, that’s 11 out of the 56 people sharing 50 cM or more.

Ah the thrill of another cousin testing! Two days ago, a 1st cousin on my Mom’s side showed up at AncestryDNA. And just this morning, a 1C1R on my Dad’s side showed up at 23andMe. Go Testers Go!!

If you’ve already tested everywhere, GEDmatch and DNA.Land won’t help you on this front, because they only accept uploads from other companies. So anyone who uploads there will already match you at the company they originally tested with. But these upload sites will help you with the additional tools they provide.

My upload to DNA.Land does not show any matches for me. I think that’s strange, considering that they show 50 matches (their limit) for my uncle’s upload (which I’ve included in the above chart). The 50 matches go down to 48 cM (of what they call Total Recent Shared Length). If you haven’t seen their match report, it is interesting and looks like this:

[Image: my uncle’s DNA.Land match report]

In conclusion, each company uses different algorithms to determine what they consider a match. So it is not really possible to fairly compare matches between companies.

The main thing you want to find from your matches are identifiable relatives. So the best bet is still to fish in as many ponds as you can.

WGS Long Reads Might Not Be Long Enough - Wed, 17 Apr 2019

Today my Dante Labs kit for my Whole Genome Sequencing (WGS) Long Reads arrived. Dante became the first company to make WGS Long Reads available to the general public. The price they are charging is $999 USD, but past customers of Dante Labs are eligible for a $200 USD discount, putting it down to $799. In 2016, the cost of long read sequencing was around $17K, and the hope then was to get the price down to $3K by 2018. Here it is, 2019, and it’s available to the general public at $1K.

[Image: Dante Labs WGS Long Reads kit]

I had purchased a Dante Labs WGS, the standard short reads test, last August (2018) when they had it on sale for $399 USD. That was a great price as they had only a few months earlier lowered it from $999 USD, and a year earlier you’d have had to pay several thousand dollars for any whole genome test from anyone. Dante currently offers their standard short read WGS for $599, but if you want it, you can wait for DNA Day or other sales, and I’m sure it will come down.

In October, when Dante had my sample, I had started reading about long read WGS technology, so I asked Dante if they had that technology available. They said they did. I asked how much that would be. They said $1,750 USD. I asked if they could do a long reads test from my sample and they checked and said, no, the sample had started sequencing already.

So I wasn’t able to do the long read test back in October. But it worked out anyway. Now, I will have both the short read test and the long read test for $550 less than the cost would have been for the long read test alone just 6 months ago. This is actually excellent because I will be able to analyze the short read test, analyze the long read test, and then compare the two. When you have just one test you can make no estimate as to the error rate, but when you have two tests to compare, then the differences represent an error in one of the tests and an average error rate can be calculated.
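
As a rough sketch of what that calculation looks like, assuming the errors in the two tests are independent and about equally likely (an assumption, not a given):

```python
def per_test_error_rate(discordant, compared):
    # If two independent tests disagree at a fraction d of compared positions,
    # and each test is equally error-prone, each test's error rate is ~ d / 2.
    d = discordant / compared
    return d / 2

# Hypothetical numbers: 3,000 disagreements over 500,000 compared positions
print(per_test_error_rate(3000, 500000))  # 0.003, i.e. ~0.3% per test
```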



What Good is WGS for Genealogical Cousin Matching?

WGS testing, whether long reads or short reads, provides no help for relative matching. Matching is based on the 700,000 or so SNPs that a company tests. Those SNPs are spread out over the 3 billion base pairs of your genome. The standard DNA tests you take do a good job of identifying those SNPs for matching purposes.

WGS testing is for determining all your 3 billion base pairs and finding all the SNPs where you vary from the human reference. From my short read WGS test, my VCF file had 3,442,712 entries, which are the SNPs where I differ from the human reference. The SNPs other than the 700,000 the company tests are not used for matching, so getting their values does not help matching. Those extra SNPs are very important for medical studies, but not matching. The 700,000 vary enough already that DNA companies would get very little benefit by adding to that number.

The reason to combine raw data from multiple companies, as you can now do at GEDmatch, is that GEDmatch compares tested SNPs between different companies. Some companies have very little overlap between them, i.e. fewer than 100,000 SNPs may be in common and available to be compared, which is too few for reliable matching. Combining multiple kits will increase that overlap number for you.
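
The effect of combining kits is easy to see in code. A minimal sketch, where each kit is assumed to be parsed into the set of (chromosome, position) pairs it tests:

```python
def overlap_count(kit_a, kit_b):
    # Positions two kits have in common and can actually compare
    return len(kit_a & kit_b)

def combined_kit(*kits):
    # A combined kit tests the union of all positions, so its overlap
    # with any other company's kit can only be equal or larger
    combined = set()
    for kit in kits:
        combined |= kit
    return combined
```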

So for genealogical purposes, you’re likely better off spending your money taking a few standard DNA tests from companies who give you your matches. Then you can create a combined kit at GEDmatch Genesis. A WGS test would not help you with this.


So Why Did I Take a WGS Test?

Other than insatiable curiosity and the need to know, I was hoping to see what, if anything, WGS tests can do that could help a genetic genealogist. My current conclusion (as I just wrote) is: not that much.

For analysis of your DNA for health purposes, you will want a WGS test. Most regular DNA companies do not test many SNPs that have known health effects. Even 23andMe only tests a subset of medically-related SNPs. Dante Labs specializes in reports for medical purposes. When you take a test with them, you can request detailed custom reports on specific ailments you may have, like this sample report on epilepsy.

But for me, I’m not really interested in the medical information.


So Why Did I Want To Take a Long Read WGS Test?

A Nanopore Technologies white paper about The Advantages of Long Reads for Genome Assembly gave me the idea that maybe the long reads would overlap enough that they could be used to phase my raw data. Phasing is separating the pair of allele values at each SNP into their paternal and maternal values. I would thus find the 22 autosomes I got from my father and the 22 autosomes plus the X chromosome I got from my mother. If you phase your DNA and create a raw data file from it, you can use it to find the people who match just one parent.

Typically, when you are like me, with parents who have passed away and never DNA tested, phasing needs to be done with the raw data of close relatives who did test: siblings, children, aunts, uncles, cousins, nieces or nephews. You can use tools like Kevin Borland’s DNA Reconstruction Toolkit. But I only have an uncle who has tested, and just an uncle isn’t quite enough. Maybe, I thought, long reads would overlap enough to span the entire chromosome and voila, you’ve phased it.

Dante’s long read test uses Oxford Nanopore PromethION technology. The specs are 30x coverage with N50 > 20,000 bp. That means half of all the sequenced bases are in reads longer than 20,000 contiguous base pairs, and enough reads are made to give an average coverage of 30 reads for every base pair in the genome. By comparison, short reads average only 150 contiguous base pairs.

Let’s see: 30 x 3 billion base pairs / 20,000 = 4.5 million long reads are made.


Unfortunately, Long Reads Might Not Be Long Enough

Despite my original thought that 4.5 million overlapping reads of 20,000 contiguous base pairs should cover the whole genome, apparently that isn’t the case. The long reads can reconstruct good-sized pieces of a chromosome, which are called contigs. But when you have a long stretch where there are few SNPs, and the SNPs that are there each have two identical allele values, then the long reads will not be able to cross the gap. How often does that happen?

Well, as I mentioned above, my VCF file indicates I have 3,442,712 SNPs that are different from the human reference genome. Of those, 2,000,090 SNPs have different allele values, meaning we can use one value to represent one chromosome and the other value to represent the other chromosome of the pair. One long read starts a contig. An overlapping long read must contain one of the contig’s SNPs with different allele values in order to extend it.

It sort of works like this:

[Image: diagram of overlapping Reads 1, 2 and 3 and the SNPs each contains]

Read 1 includes two SNPs. We know the T and C go together on one chromosome, and the C and G go together on the other. So Read 1 is a contig.

Since Read 2 overlaps with Read 1, we can extend the Read 1 contig.

But the next read, Read 3, does not reach back far enough to include the SNP with the CG values. So we cannot tell whether the C or the G connects to the A or the G in Read 3. So our first contig ends with the AA at the end of Read 2, and the second contig starts at the AA at the beginning of Read 3.
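
A rough way to model where the contigs must break, as a minimal sketch: if the gap between two consecutive heterozygous SNPs is longer than the longest read covering that region, no read can link them, and a new contig has to start.

```python
def count_contigs(het_positions, max_read_span):
    """het_positions: sorted base-pair positions of heterozygous SNPs on
    one chromosome. max_read_span: the longest read we can count on to
    cover any given gap (an assumed simplification)."""
    if not het_positions:
        return 0
    contigs = 1
    for prev, curr in zip(het_positions, het_positions[1:]):
        if curr - prev > max_read_span:
            contigs += 1  # no read can bridge this gap; a new contig starts
    return contigs
```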

How many contigs will we have? Quite a few are possible. Here are some rough calculations just to get an idea of what the number might be.

I took all my 2 million SNPs with different allele values and ordered them within each chromosome by base pair address. I then found the difference between each base pair address and the next. This gives the number of base pairs in a row with no differences.
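
That computation, as a minimal sketch (het_snps standing in for my parsed list of SNPs with different allele values):

```python
from collections import defaultdict

def gap_lengths(het_snps):
    """het_snps: list of (chrom, pos) tuples for SNPs with different alleles."""
    by_chrom = defaultdict(list)
    for chrom, pos in het_snps:
        by_chrom[chrom].append(pos)
    gaps = []
    for positions in by_chrom.values():
        positions.sort()
        # Difference between consecutive addresses = run of base pairs with
        # no heterozygous SNP that a read could use to extend a contig
        gaps.extend(b - a for a, b in zip(positions, positions[1:]))
    return gaps

def fraction_at_least(gaps, threshold):
    return sum(1 for g in gaps if g >= threshold) / len(gaps)
```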

I then sorted those and plotted them. Here’s the graph:

[Image: graph of sorted gap lengths between consecutive SNPs with different allele values]

This says that 2% of my SNPs with different allele values are 15,000 or more base pairs away from the next SNP with different allele values. Out of my 2 million SNPs with different allele values, 2% means 40,000.

0.2% are 70,000 or more base pairs away. Out of my 2 million SNPs, that’s 4,000.

Since my long read test has an N50 > 20,000 bp, only half my sequenced bases will be in reads longer than 20,000 bp. But I do get 30x coverage, an average of about 30 reads over any base pair position, so let’s say the longest of the 30 reads spanning a typical position averages 70,000 base pairs. Then there would be about 4,000 gaps that can’t be spanned. Some may be adjacent to each other, so I may get something like 3,000 contigs.

This would give me about 3,000 pieces of my genome. Some will be bigger and some will be smaller, but they should average about 1 million base pairs (which is about 1 cM).

There are methods called scaffolding to try to assemble these pieces correctly to the same chromosome. This is all state of the art stuff to handle long read WGS, so I’ve got some reading to do to understand it all.


Forward Thinking

I look forward to getting my long read WGS results and then comparing them to my short read WGS and my combined raw data file from my 5 standard DNA tests. I know I will learn something from that.

I intend to see how many contigs I get out of the long reads. Maybe my estimates above are wrong and I’ll get only 300 contigs instead of 3,000. I might be able to do something with that, and figure out how to scaffold the contigs to separate my allele values into each chromosome of the pair.

And maybe I’ll discover something I hadn’t even thought of. In a few months when I get my long read results, we’ll see.