Louis Kessler’s Behold Blog

WGS – The Raw VCF file and the gVCF file - Tue, 7 May 2019

As I noted in my last post, Whole Genome: The VCF File, Part 2, the SNP VCF (Variant Call Format) file that Dante Labs gives testers of WGS (Whole Genome Sequencing) does not quite have everything needed to generate a raw data file that can be uploaded to various DNA sites for people matching.

The VCF file contains the SNPs that vary from the standard human reference genome. The thinking, then, is that any SNPs needed for the raw data file (to match the SNPs tested by the chips used by Ancestry, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA) that are not in the VCF file can simply be given the value from the standard human reference.

But they can’t.

The problem is that the VCF contains only those variants that pass a set of filters meeting quality controls to ensure that the read was good enough. SNPs that fail the filters are not included in the VCF file, so some actual variants that should be in the VCF file don't make it. These are known as false negatives. Substituting the standard reference genome values for those will give you incorrect values.
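
To picture what that filtering does, here is a minimal sketch (my own illustration, not Dante's actual pipeline; the function name and plain-text parsing are assumptions, and real VCFs are usually gzipped):

```python
# A minimal sketch of what quality filtering effectively does.
# Assumes a plain-text, tab-delimited VCF where column 7 is FILTER
# ("PASS" or "." means the variant passed the filters).
def passing_variants(vcf_path):
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue                      # skip header lines
            fields = line.rstrip("\n").split("\t")
            if fields[6] in ("PASS", "."):
                yield fields                  # kept in the filtered VCF
            # else: the variant is dropped even if it is real --
            # those dropped real variants are the false negatives
```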

How many might there be? Well, take a look again at the table from my last post.

[Table from my last post: All 5 raw data values versus VCF file values]

When I created my “All 5” raw data file, it ended up containing 1,343,424 SNPs.

My VCF file contains 3,442,712 variant SNPs. Of those, 462,305 (13.4%) were at positions in my All 5 file. 

The green highlighted values are those that matched the consensus of my All 5 file. If you exclude the no-calls and deletions in my All 5 file, then that's 453,534 out of (462,305 – 2 – 4 – 2 – 6,156 =) 456,141, or 99.4% agreement on the values. Only 0.6% disagree, which isn't too bad.
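
For anyone checking the arithmetic, it works out like this:

```python
# The arithmetic behind the 99.4% figure above.
in_all5 = 462305                           # VCF SNPs at All 5 positions
comparable = in_all5 - 2 - 4 - 2 - 6156    # excluding no-calls and deletions
agree = 453534                             # green (agreeing) values
print(comparable, round(100 * agree / comparable, 1))   # 456141 99.4
```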

But the real problem is those yellowed values. Those are SNPs with two different allele values (heterozygous), and by definition they must be variants. Yet the VCF file does not include them. These amount to another 10,705 SNPs that either were read wrong by the chips, or should have been in the VCF file but were not. This means that as many as 2.3% of the values that belong in the VCF may not be included in it.

There could be others as well. The AA, CC, GG and TT values just above the yellow cells that are not in the VCF may also have been variants that were left out of the VCF file. e.g. If the reference value was G and the SNP value was AA, then this homozygous read is a variant and should be among the green values in the AA row and AA column. But the VCF may not contain it. We can't tell how many of these there might be until/unless we get a third test done to compare.

It’s for these false-negative values that substituting the reference genome value would be an incorrect thing to do, and doing so builds you a substandard raw data file with likely thousands of incorrect values in it.


The Raw VCF File

Dante gives you access to your filtered VCF file. But there is another file you can get. If you copy the link to your Dante Labs VCF file and change the end of the filename from “snp.vcf.gz” to “raw.snp.vcf.gz”, you can download what I’m calling the Raw VCF file.
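
If you'd rather script it than edit the link by hand, the substitution is a one-liner. The URL below is a made-up placeholder, not a real Dante Labs link:

```python
# Hypothetical example: derive the Raw VCF link from the filtered VCF
# link by the filename substitution described above. The URL is made up.
filtered_url = "https://example.com/results/sample.snp.vcf.gz"
raw_url = filtered_url.replace("snp.vcf.gz", "raw.snp.vcf.gz")
print(raw_url)   # https://example.com/results/sample.raw.snp.vcf.gz
```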

This file contains the unfiltered variants found by your whole genome test. None of the quality measures were applied to this file. The quality filtering that Dante does is designed to remove most of the false positives, i.e. to prevent SNPs that are not variants from being reported as variants.

My Raw VCF contains 3,973,659 SNPs. It includes every one of the 3,442,712 SNPs in my filtered SNP VCF file, plus the 712,064 SNPs (17.9%) that were found but filtered out because the quality wasn’t high enough to be sure they were true variants.

So you can use this Raw VCF file to get more of your false negatives back, but by doing so, you will also add false positives to it.

It’s a tricky situation. It’s a trade-off like a teeter-totter, with false positives on one side and false negatives on the other. Dante picked a set of filters that presumably take a good compromise position that does its best to minimize the two types of mistakes.

The one nice thing about this Raw VCF file is that it includes my 46 mtDNA variants. The VCF filters somehow removed these and they are not in the filtered VCF file. Once I got these, I was able to put them into 23andMe raw data format and upload it to James Lick’s mtDNA Haplogroup utility. The format I used is:

i703360 MT 114 T

Since the raw file doesn’t give the RSID, I just used i703360 for everything, hoping that James Lick’s utility didn’t use it, and it appears it doesn’t. I used MT as the 2nd field because that’s what 23andMe uses. 114 is the position number. T is the value. Those are tabs (hex 09) between the fields, not spaces.
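
Here's a sketch of how that conversion could be scripted. It assumes the Raw VCF has been unzipped to plain text and that the ALT alleles are single letters; multi-allelic lines would need extra handling, and the function name is my own:

```python
# Convert the mtDNA lines of an unzipped Raw VCF into 23andMe-style rows
# for James Lick's utility. The first five VCF columns are CHROM, POS,
# ID, REF and ALT.
def mtdna_to_23andme(vcf_path, out_path):
    with open(vcf_path) as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                continue                          # skip VCF headers
            chrom, pos, _id, _ref, alt = line.split("\t")[:5]
            if chrom in ("MT", "chrM", "chrMT"):  # contig naming varies
                # tab-separated fields: RSID, chromosome, position, value
                out.write(f"i703360\tMT\t{pos}\t{alt}\n")
```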

I know these were my mtDNA variants because Lick’s routine gave me my correct mt haplogroup: K1a1b1a.

However, for most purposes, this Raw VCF file is likely worse to use than the filtered VCF file, since it will include more false positives (variants that are not really variants).

And the Raw VCF file also won’t help with our goal of producing a raw data file that can be uploaded to sites for people matching. Reducing false negatives is good, but increasing false positives is bad.


The Genome VCF (gVCF) File

So we still want/need/hope for a way to produce that raw data file from our WGS results. Too many errors are introduced by adding human reference values at positions that don’t have values in the VCF file. Using the Raw VCF file will recover some of the false negatives, but it also increases the false positives, which is not a good trade-off.

There is an intermediate file. It is called the gVCF or Genome VCF file. It contains all the reads of the Raw VCF file and then is said to fill in the gaps with the human reference genome.

Well, that really doesn’t help. It is still basically the Raw VCF file. All it does is supposedly make the lookup of the human reference genome a little simpler.

I requested my gVCF from Dante and they made it available to me. It was 3.6 GB in size and took 30 minutes for me to download. It is a compressed file and took 5 minutes to unzip. The full file is 24.6 GB.

Here’s a snippet from it:

[Image: a snippet of my gVCF file]

The long lines are the variants. The short lines are the positions where I don’t have a variant.

Position 14464 on Chr 1 is the first variant in my filtered VCF file. This file contains all the variants in my filtered VCF file.

Positions 14464 and 14653 are my 6th and 7th variants in my Raw VCF file. This file contains all the variants in my Raw VCF file.

But positions 14523, 14542, 14574, 14673, 14907 and 14930 are variants that are in this file, but not in either my Raw VCF file or my filtered VCF file. So there must be even more filtering done before the variants make it to the Raw VCF file. Maybe some of the values on the line (like the BaseQRankSum, for example) indicate uncertain reads and have something to do with them being left out. Nonetheless, I wouldn’t get excited about adding these additional variants, because many of them are likely false positives.

At least the hope is that you’ll get every read in every position from this file.

But no. It doesn’t even do that.

The lines given that are not variants often contain an END value. e.g., the first 3 lines I displayed above contain:

14373 . T END=14379
14380 . C END=14383
14384 . T END=14396

Does the value T represent all the values from positions 14373 to 14379, or does it just represent the first position? My perusal of the VCF specs finds:

The value in the POS field refers to the position of the first base in the String.

To verify, I took some homozygous SNPs from my DNA tests that agree between different companies:

Chr 1: position 1040026 was read as TT by all 5 companies. My gVCF has:

[Image: the gVCF line covering position 1040026]

So it says C at position 1039779, ending at position 1040048. That’s not T.

Try another: Chr 1: position 1110019 was read as AA by all 5. My gVCF has:

[Image: the gVCF line covering position 1110019]

Here it says C at position 1109397, ending at position 1110155. That’s not A.

So the value shown refers only to the first position. You do not know what the other positions, up to the END position, hold.
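
That lookup can be scripted. Here's a sketch (the function name is mine, and chromosome naming, “1” versus “chr1”, varies by pipeline) that scans an unzipped gVCF for the line whose block covers a given position:

```python
# Find the gVCF line whose block covers a target position. A reference
# block (a line with END= in its INFO field) only tells you the base at
# its first position, not at the rest of the block.
def block_covering(gvcf_path, chrom, target):
    with open(gvcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if fields[0] != chrom:
                continue
            pos = int(fields[1])
            end = pos                            # default: single position
            for item in fields[7].split(";"):    # INFO is column 8
                if item.startswith("END="):
                    end = int(item[4:])
            if pos <= target <= end:
                return fields    # REF (fields[3]) is known only at pos
    return None
```

Calling block_covering on chromosome 1 at position 1040026 should return the 1039779–1040048 block shown above, whose C says nothing about position 1040026 itself.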

And in fact, my gVCF file contains 308,122,966 data lines. That is about 10% of the 3 billion base pairs we have. So only about 1 out of every 10 positions is reported in the gVCF file with either your variant value or the human reference genome value.

Nonetheless, it doesn’t matter whether the gVCF file contains all the human reference genome values or not. The variants in it are even more liberal than those in the Raw VCF file and would introduce even more false positives if it were used to generate a raw data file.


A Possible Solution

Between the filtered VCF, the Raw VCF and the gVCF files, none of them alone has the data needed to generate a raw data file that can be uploaded to GEDmatch and other DNA testing sites. And they have a sliding range of variants, from optimistic (gVCF) to liberal (Raw VCF) to conservative (filtered VCF).

The problem is that DNA results include both false positives and false negatives. DNA testing companies get around the uncertain values by reporting them as no-calls. That works because no-calls are always considered a match, and therefore they won’t break a matching segment. As long as there are not too many of them (i.e. 5% or less), the segment matching should work well.
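
To see why no-calls are harmless for matching, here's a toy illustration, mine and not any company's actual algorithm, of a half-identical SNP comparison:

```python
# Toy half-identical comparison: two genotypes match if they share an
# allele, and a no-call ("--") always counts as a match, so it can
# never break a run of matching SNPs.
def snps_match(g1, g2):
    if g1 == "--" or g2 == "--":
        return True                      # no-call: always a match
    return bool(set(g1) & set(g2))       # share at least one allele

print(snps_match("AG", "GG"))   # True: they share a G
print(snps_match("AA", "GG"))   # False: this would break a segment
print(snps_match("--", "GG"))   # True: a no-call can't break it
```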

So I believe we can generate a raw data file by doing this (sketched in code after the list):

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files gives all of these to you, so you need to set this up yourself. Wilhelm HO has a good set of them included with his DNA Kit Studio.
  3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
  4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.
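
Here is a rough sketch of those four steps. The inputs are simplified and of my own devising: a dict of wanted SNPs with their reference genotypes, a set of variant positions from the gVCF, and a dict of genotypes from the filtered VCF:

```python
# The four steps above, sketched. Keys are (chromosome, position) tuples.
def build_raw_data(wanted_snps, gvcf_variant_positions, vcf_variants):
    # Steps 1 and 2: start every wanted SNP at the human reference value.
    raw = dict(wanted_snps)
    # Step 3: positions the gVCF calls variant become no-calls, so an
    # uncertain (possibly false) variant can't break a matching segment.
    for key in gvcf_variant_positions:
        if key in raw:
            raw[key] = "--"
    # Step 4: reliable variants from the filtered VCF overwrite most of
    # those no-calls with their filtered genotypes.
    for key, genotype in vcf_variants.items():
        if key in raw:
            raw[key] = genotype
    return raw
```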

I likely will try this myself. When I do, it will be worthy of its own blog post.

Whole Genome: The VCF File, Part 2 - Mon, 22 Apr 2019

A couple of months ago, I compared my VCF file to my DNA test results.

The Variant Call Format (VCF) file is given to you when you do a Whole Genome Sequence (WGS) test. That test finds your DNA values for your whole genome, all 3 billion positions, not just the 700,000 or so positions that a standard DNA test gives you.

But most of those 3 billion positions are the same for most humans. The ones that differ are called Single-Nucleotide Polymorphisms (SNPs) because they “morph” and can have differing values among humans. The standard DNA companies test a selection of the SNPs that differ the most, and they can use the 700,000 they selected for matching people without having to test all 3 billion positions. It works very well. WGS tests are not needed for finding relatives.


Converting VCF to a Raw Data File

But near the end of my last post, I was trying to see if the VCF file could be converted into a raw data file that could be uploaded to GEDmatch or a DNA company that allows raw data uploads.

My VCF file contains 3,442,712 SNPs whose values differ from the standard human reference genome. Of those, I found 471,923 SNPs were the same SNPs (by chromosome and position) as those in the raw data file that I created by combining the raw data from 5 companies (FTDNA, MyHeritage, Ancestry, 23andMe and LivingDNA). I compared them in my first analysis and found that 2,798 of them differed, which is only 0.6%.

At the time, I didn’t think that was too bad an error rate. So I thought a good way to make a raw data file from a VCF file would be (sketched in code after the list):

  1. Take a raw data file you already have to use as a template.
  2. Blank out all the values.
  3. Add values for the positions that are in the VCF file.
  4. Fill in the others with the human reference genome value.
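
In code, that naive procedure would look something like this sketch, which, as the next paragraphs show, goes wrong at step 4:

```python
# The naive conversion: a template position gets the VCF value if there
# is one, otherwise the human reference value is assumed.
def naive_raw_data(template_positions, vcf_variants, reference):
    raw = {}
    for key in template_positions:        # steps 1-2: template, blanked
        if key in vcf_variants:
            raw[key] = vcf_variants[key]  # step 3: value from the VCF
        else:
            raw[key] = reference[key]     # step 4: assume the reference,
                                          # wrong whenever the VCF silently
                                          # left out a real variant
    return raw
```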

The basis of that idea is that if it’s not a variant in the variant file, then it must be the reference value.

Today on Facebook, Ann Turner told me that that’s not necessarily the case. The reason, she believes, is that the VCF file does not contain all the variant SNPs. And the discrepancies were enough to break her comparison of “herself” with “herself” into 161 segments.


So What’s Really Different Between VCF and Raw Data?

In my first analysis, I only compared whether the values were the same or not, giving that 0.6% difference. I did not look at the specific values. Let’s do that now:

[Table: All 5 raw data values versus VCF file values]

For this analysis, let’s not worry about the rows: DD (Deletions), DI (Deletion/Insertions), II (Insertions) or – (no-calls), since they are only in the raw data and not in the VCF file.

The green values down the diagonal are the agreement between the All-5 raw data file and the VCF file. Any numbers above and below that diagonal are disagreements between the two. Those are the 0.6% where one is wrong for sure, but we don’t know which.

But let me now point you to those yellowed numbers in the “Not in VCF” column. Those are all heterozygous values, with two different letters: AC, AG, AT, CG, CT or GT. If they have two different letters, then they cannot be human reference values. One of the two letters is a variant, and those entries should have been in the VCF file. But they were not.

This creates an even bigger concern than our earlier 0.6% mismatch. If we total these yellow counts, we find there are 10,705 SNPs, or 1.2% of the 881,119 SNPs not in the VCF file, that should have been in the VCF file.

Again, we don’t know which is wrong, the raw data or the VCF file. But from Ann’s observations, we’d have to say at least some of those heterozygous values must have been left out, and when reference values were added to Ann’s file instead, they caused the match breaking that resulted in 161 segments.


Which is Correct: VCF or Raw Data?

When you are comparing two things, and you know one is wrong, you don’t know which of the two is the wrong one. You need others to compare with. I am awaiting the results of my long read WGS test, and when that comes I’ll have a third to compare.

But until then, can I get an idea of which of the two files might be more often correct? There’s one thing I can do.

I can provide the same table as I did above, but for the X and Y chromosomes. Since I’m a male, I have only one X and one Y chromosome. Each value could be shown as a single letter, but it is still read as a double value and therefore shown as one. The caveat is that the two letters must be the same, or the read is definitely incorrect. (Note that this table excludes 688 SNPs that are in the pseudoautosomal regions of the X and Y, which can recombine and have two different allele values.)

So let’s see the table:

[Table: All 5 raw data values versus VCF file values, X and Y chromosomes]

The top left section contains the agreed upon values (in green) between the All 5 raw data file and the VCF file. The counts in that section above and below the green values are disagreements between All 5 and VCF and we don’t know which is correct and which is wrong.

The numbers in red are incorrect reads. Those on the left side are incorrect reads in the All 5 raw data file: 219 incorrect reads versus 40,848 correct reads, a ratio of 1 in every 186.

The right side shows incorrect reads in the VCF file: 13 incorrect reads versus 9,605 correct reads. That’s a ratio of 1 in every 739 reads.

Now, verging on the realm of hyperbole, the difference in ratios could indicate that an error in a standard DNA test is 4 times (739 / 186) more common than an error in a VCF file.

And applying that ratio to the 10,705 heterozygous values that should have been in the VCF file, we would say that 8,564 would be because the raw data file is wrong, and 2,141 because the VCF file should have included them but did not.
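
The arithmetic behind that, for anyone following along:

```python
# The error ratios from the X/Y table, and the resulting 4:1 split
# of the 10,705 heterozygous values missing from the VCF.
all5_ratio = 40848 / 219    # ~186: one bad All 5 read per ~186 good ones
vcf_ratio = 9605 / 13       # ~739: one bad VCF read per ~739 good ones
factor = round(vcf_ratio / all5_ratio)           # ~4x more chip errors
raw_data_wrong = 10705 * factor // (factor + 1)  # 8564: chip likely wrong
vcf_missing = 10705 - raw_data_wrong             # 2141: VCF left them out
```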

And if 2,141 values out of your DNA file created from the VCF file are incorrect, couldn’t that quite easily have caused the 161 segments that Ann observed?

Yes, this is all conjecture. But the point is that maybe the VCF file is leaving out a significant number of variants. If that is the case, then we can’t just put in a reference value when there is no value in a VCF file. And that would mean a raw data file created from a VCF file and filled in by human reference values may not have enough accuracy to be usable for matching purposes.

Compare Your Number of DNA Matches Among Companies - Sun, 21 Apr 2019

I saw a post on Facebook trying to compare the number of relative matches a person had at different DNA testing companies.

Here’s my results with my endogamy:

[Table: my number of DNA matches at each company]

Note that there are a few things to consider when you or anyone else does a comparison of your number of DNA matches.

A few companies only give you a specific number of matches. GEDmatch gives 2,000, GEDmatch Genesis gives 3,000, and 23andMe limits you to 2,000 but only lets you see those who have opted in to sharing.

Some companies give you all your matches down to some minimum threshold. The minimum cM is not necessarily the only criterion. Largest segment length and number of segments may also be considered. Ultimately, that works out to an effective minimum total cM for me of 37 cM at Living DNA, 17 cM at Family Tree DNA, 12 cM at MyHeritage and 6 cM at AncestryDNA. Since AncestryDNA goes right down to matches who have a single matching segment of just 6 cM, I expectedly have a very large total number of matches with them. Even without endogamy, you will likely have your largest number of matches at AncestryDNA as well because of this low matching limit.

If you look at only larger matches, you get a completely different story. I counted the number of matches I had that were a total of 50 cM or more. You’ll see I have very few, just 56 matches at AncestryDNA. That’s because Ancestry uses their Timber algorithm to eliminate many segments they consider to be false. Family Tree DNA, on the other hand, has a lot of matches of 50 cM or more simply because they include segments right down to 1 cM in their totals, and therefore will show a larger total cM than the same match would at another company.

I’ve added a Database size column. These are numbers I have visually read off of Leah Larkin’s www.theDNAgeek.com/dna-tests chart of Autosomal DNA Database Growth as of April 2019.

When you divide the matches by the database size, in my case, my largest proportion of the database I match to is 1.7% at Family Tree DNA, and then 1.1% at AncestryDNA.

All those statistics are just statistics. What’s much more important, and what we as genealogists want, are people whose exact connection to us we can determine, and for whom we can find a Most Recent Common Ancestor. These are the people who, through DNA and genealogy, will help us expand our family tree.

My endogamy does give me lots of matches, but I have few connections I can determine, because Romanian and Ukrainian records rarely go back further than the mid-1800s, limiting me to about 5 generations genealogically. My best success so far in finding DNA testers whose relationship I’ve been able to determine (well, at least one relationship; there may be others) is 11 relatives at AncestryDNA and 10 at 23andMe. At Ancestry, that’s 11 out of the 56 people sharing 50 cM or more.

Ah the thrill of another cousin testing! Two days ago, a 1st cousin on my Mom’s side showed up at AncestryDNA. And just this morning, a 1C1R on my Dad’s side showed up at 23andMe. Go Testers Go!!

If you’ve already tested everywhere, GEDmatch and DNA.Land won’t help you on this front, because they only accept uploads from other companies. So you’ll already match with them at the company they originally tested at. But these upload sites will help you with the additional tools they provide.

My upload to DNA Land does not show any matches for me. I think that’s strange considering that they show 50 matches (their limit) for my uncle’s upload (which I’ve included in the above chart). The 50 matches go down to 48 cM (of what they call Total Recent Shared Length). If you haven’t seen their match report, it is interesting and looks like this:

[Image: DNA.Land match report for my uncle]

In conclusion, each company uses different algorithms to determine what they consider a match. So it is not really possible to fairly compare matches between companies.

The main thing you want to find from your matches are identifiable relatives. So the best bet is still to fish in as many ponds as you can.