Login to participate
  
Register   Lost ID/password?

Louis Kessler’s Behold Blog

Determining VCF Accuracy - Mon, 13 May 2019

In my last post, I was able to create a raw data file from the Whole Genome Sequencing (WGS) BAM file using the WGS Extract program. It seemed to work quite well.

But my previous post to that: WGS – The Raw VCF file and the gVCF file, I was trying to see if I could create a raw data file from the Variant Call Format (VCF) file. I ended that post with a procedure that I thought could generate a raw data file, which was:

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need this initially set this up. Wilhelm HO has good set of them included with his DNA Kit Studio.
  3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
  4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.

When I wrote that, I had thought that the gVCF file contained more variants in it than the Raw VCF file had. During my analysis since then, I found out that is not true. The Raw VCF contains all the unfiltered variants. Everything that might be considered to be a variant is in the Raw VCF file. The gVCF includes the same SNP variants that are in the Raw VCF file, but also includes all the insertion/deletions as well as about 10% of the non-variant positions. It’s the  non-variant positions that makes the gVCF such a large file.

So right away, in Step 3 of the above proposed procedure, the Raw VCF file can be suggested instead of the gVCF file and will give the same results. That is a good thing since the Raw VCF file is much smaller than the gVCF file so it will be faster to process. Also the Raw VCF file and the filtered VCF file include the same fields. My gVCF included different fields and would need to be processed differently than the other two.

(An aside:  I also found out that my gVCF supplied to me by Dante did not have enough information in it to determine what the variant is. It gives the REF and ALT field values, but does not include the AC field. The AC field gives the frequency of the ALT value, either 1 or 2.

  • If REF=A, ALT=C, AC=1, then the variant value is AC.
  • If REF=A, ALT=C, AC=2, then the variant value is CC
  • If REF=A, ALT=C, AC is not given, then the variant value can be AC or CC.

For me to make any use of my gVCF file, for not just this purpose but any purpose, I would have to go back and ask Dante to recreate it for me and include the AC field in the variant records.  End aside.)


Estimating Type 1 and Type 2 Errors

We now need to see if the above procedure using the Raw VCF file in step 3 and the filtered VCF file in step 4 will be accurate enough to use.

We are dealing with two types of errors.

Type 1: False Positive: The SNP is not a variant, but the VCF file specifies that it is a variant.

Type 2: True Negative:  The SNP is a variant, but the VCF file specifies that it is not a variant.

Both are errors that we want to minimize, since either error will give us an incorrect value.

To determine the Type 1 and Type 2 error rate, I used the 959,368 SNPs that the WGS Extract program produced for me from my BAM file. That program uses well a developed and respected genomic library of analysis functions called samtools, so the values it extracted from my WGS via my BAM file are as good as they can get. It is essential that I have as correct values as possible for this analysis, so I removed 2,305 values that might be wrong because some of my chip test results disagreed with. I also removed 477 values that WGS Extract included but were at insertion or deletion positions.

From the remaining values, I could only use positions where I could determine the reference value. This included 458,894 variant positions, which always state the reference value, as well as the 10% or so of non-variant reference values that I could determine from my gVCF file. That amounted to 42,552 non-variants.

Assuming these variant and non-variant positions all have accurate values from the WGS extract, we can now compute the two types of errors for my filtered VCF file and for my Raw VCF file.

image

In creating a VCF, the filtering is designed to eliminate as many Type 1 errors as possible, so that the variants you are given are almost surely true variants. The Raw VCF only had 0.13% Type 1 errors, and the filtering reduced this to a very small 0.08%.

Type 1 and Type 2 errors work against each other. Doing anything to decrease the number of Type 1 errors will increase the number of Type 2 errors and vise versa.

The Raw Data file turns out to only have 0.06% Type 2 errors, quite an acceptable percentage. But this gets increased by the filtering to a whopping 0.76%.

This value of 0.76% represents the number of true variants that are left out of the filtered VCF file. This is what is causing the problem with using the filtered VCF file to produce a raw data file. When the SNPs that are not in the filtered VCF file are replaced by reference values, they will be wrong. These extra errors are enough to cause some matching segments to no longer match. And a comparison of a person’s raw dna with his raw dna generated from a filtered VCF file will not match well enough.

If instead, the Raw VCF file is used, the Type 2 errors are considerably reduced. The Type 1 errors are only slightly increased, well under worrisome levels.

Since there are approximately the same number of variants as non-variants among our SNPs, the two error rates can be averaged to give you an idea of the percentage of SNPs expected to have an erroneous value.  Using the Raw VCF instead of the filtered VCF will reduce the overall error rate down from 0.42% to 0.09%, a 79% reduction in errors.

This could be reduced a tiny bit more. If the Raw VCF non-variants are all marked as no-calls, and then the Filtered VCF non-variants are replaced by the reference values, then 20 of the 55 Type 1 Errors in my example above, instead of being wrong, will be marked as no-calls. No-calls are not really correct, but they aren’t wrong either. For the sake of reducing the average error rate from 0.09% to 0.07%, it’s likely not worth the extra effort of processing both VCF files.


Conclusion

Taking all the above account, my final suggested procedure to create a raw data file from a VCF file is to use only the Raw VCF file and not the filtered VCF file, as follows:

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need this initially set this up. Wilhelm HO has good set of them included with his DNA Kit Studio.
  3. Mark the positions of the variants in your Raw VCF with the value of that variant. These will overwrite the reference values assigned in step 2.

Voila!  So from a Raw VCF file, use this procedure. Do not use a filtered VCF file.

If you have a BAM file, use WGS Extract from yesterday’s post.




Update: May 14: Ann Turner pointed out to me (in relation to my “Aside” above, that in addition to the AC (allele count) field, the GT (genotype) field could supply the information to correctly identify what the variant is. Unfortunately, the gVCF file Dante supplied me with has missing values for that field’s values.

I’ve looked at all the other fields in my gVCF file and entries that leave out the BaseQRandSum and ClippingRankSum fields as they often indicate a homozygous variant, but I’ve found several thousand SNPs among the variants that constitute too many exceptions to use this as a "rule".

Wilhelm HO is working on implementing the sort of procedure I suggest into his DNA Kit Studio. It likely will be included when he releases Version 2.4, and his tool will then be able to produce a raw data file from a VCF file and will also extract a mtDNA file for you that you can upload to James Lick’s site for mtDNA Haplogroup analysis.

Creating a Raw Data File from a WGS BAM file - Sun, 12 May 2019

I was wondering in my last post if I could create a raw data file that could be uploaded to to GEDmatch or DNA testing company from my Whole Genome Sequencing (WGS) results. I was trying to use one of the Variant Call Format (VCF) files. Those only include where you vary from the human reference. So logically you would think that all the locations not listed must be human reference values. But that was giving less than adequate results.

Right while I was exploring that, there was a beta announced for a WGS Extract program. It works in Windows and you can get it here

image

This is not a program for the fainthearted. The download is over 2 GB because it includes the reference genome in hg19 (Build 37) and hg38 (Build 38) formats. It also includes a windows version of samtools which it runs in the background as well as the full python language.

I was so overwhelmed by what it brought that I had to ask the author how to run the program. I was embarrassed to find out that all I had to do was run the “start.bat” file that was in the main directory of the download, which opens up a command window that automatically starts the program for you, bringing up the screen I show above.

WGS Extract has a few interesting functions, but let me talk here about that one labeled “Autosomes and X chromosome” with the button: “Generate file in 23andmeV3 format”.  I selected my BAM (Binary Sequence Alignment Map) file, a 110 GB file I received by mail on a 500 GB hard drive (with some other files) from Dante. I pressed the Generate file button, and presto, 1 hour and 4 minutes later, a raw data file in 23andMe v3 format was generated as well as a zipped (compressed) version of the same file.

This was perfect for me. I had already tested at 5 companies, and had downloads of FTDNA, MyHeritage, Ancestry, Living DNA and 23andMe v5 raw data files. I had previously combined these 5 files into what I call my All 5 file.

The file WGS Extract produced had 959,368 SNPs in it. That’s a higher number of SNPs than most chips produce, and since it was based on the 23andMe v3 chip, I knew there should be quite a few SNPs in it that hadn’t been tested by my other 5 companies.

You know me. I did some analysis:

image

The overlap (i.e. SNPs in common) varied from a high of 693,729 with my MyHeritage test, to a low of 183,165 with Living DNA. These are excellent overlap numbers – a bit of everything.

Each test had a number of no-calls, so I compared all the other values with what WGS Extract gave me, and there was a 98.1% agreement. That’s a 2% error that is either in the chip test, or in the WGS test, but from this, I cannot tell whether its the chips or the WGS that are the incorrect values. But in each case, one of them is.

When I compare this file to my All 5 file, which has 1,389,750 SNPs in it, I see that there are an extra 211,747 SNPs in my WGS file. That means I’ll be able to create a new file, an All 6 file, that will have 1,601,497 SNPs in it.

More SNPs don’t mean more matches. In fact they usually mean fewer matches, but better matches. The matches that are more likely to be false are the ones that get excluded.

In addition to including the new matches, I also wanted to update the 747,621 SNPs in the file to the same SNPs in my All 5 file. As noted in the above table, I had 2,305 SNPs whose values disagreed, so I changed them to no calls. No calls are the same as an unknown value, and for matching purpose, are always considered to be a match. Having more no calls will make you more “matchy” and like having less overlap, you’ll have more false matches. The new SNPs added included another 905 no calls. But then, of the 20,329 no calls I had in my All 5 file, the WGS test had values for 9,993 of them.

So my number of no calls went from:

20,329 + 2,305 + 905 - 9,993 =  13,546, a reduction of 6,783.

I started with 20,329 no calls in 1,389,750 SNPs (1.5%),
and reduced that to 13,546 no calls in 1,601,497 SNPs (0.8%)

A few days ago, I was wondering how much work it take to get raw data for the SNPs needed for genealogical purposes out of my WGS test. A few days later, with this great program, it turns out to be no work at all. (It probably was a lot of work for the author, though.)

I have uploaded both the 23andMe v3 file, as well as my new All 6 file to GEDmatch to see how both do at matching. I’ve marked both research. But I expect once the matching process is completed, I’ll make my All 6 file my main file and relegate my All 5 file back to research mode.

Here are the stats at GEDmatch for those who know what these are:

WGS Extract SNPs:  original 959,368; usable 888234; slimmed 617,355
All 5 SNPs: original 1,389,750; usable 1,128,146; slimmed 813,196
All 6 SNPs: original 1,601,497; usable 1,332,260; slimmed 951,871

WGS – The Raw VCF file and the gVCF file - Tue, 7 May 2019

As I noted in my last post, Whole Genome: The VCF File, Part 2, the SNP VCF (Variant Call Format) file that Dante Labs gives tester of WGS (Whole Genome Sequencing), does not quite have everything that is needed to generate a raw data file that can be uploaded to various DNA sites for people matching.

The VCF file contains SNPs that vary from the standard human reference genome. The thinking is then that any SNPs needed for the raw data file (to match with the SNPs that are tested by the chips used by Ancestry, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA) that are not in the file can simply be replaced by the value from the standard human reference.

But they can’t.

The problem is that the VCF contains only those variants that passes a set of filters meeting some quality controls to ensure that the read was good enough. Those SNPs that fail the filters are not included in the VCF file. So some actual variants that should be in the VCF file don’t make it. These are known as true negatives. Substituting the standard reference genome values for those will give you incorrect values.

How many might there be? Well take a look again at the table from my last post.

image

When I created my “All 5” raw data file, it ended up containing 1,343,424 SNPs.

My VCF file contains 3,442,712 variant SNPs. Of those, 462,305 (13.4%) were at positions in my All 5 file. 

The green highlighted values are those that matched the consensus of my All 5 file. If you exclude the no-calls and deletions in my All 5 file, then that’s 453,534 out of (462,305 – 2 – 4 – 2 – 6,156 =) 456,141 or 99.4% agreement on the values. Only 0.6% disagree which isn’t too bad.

But the real problem are those yellowed values. Those are SNPs with two different allele values (heterozygous) and they by definition must be variants. Yet, the VCF file does not include them. These amount to another 10,705 SNPs that either were wrong in the chip reads, or should have been in the VCF file but were not. This means that as many as 2.3% of the VCF values may not be included in the VCF.

There could be others as well. The AA, CC, GG and TT values just above the yellow cells that are not in the VCF may have been variants but not included in the VCF file. e.g. If the reference value was G and the SNP value was AA, then this homozygous read is a variant and should be among the green values in the AA row and AA column. But the VCF may not contain it. We can’t tell how many of these there might be until/unless we get a third test done to compare.

It’s for these true-negative values, that substituting the reference genome value would be an incorrect thing to do and build you a substandard raw data file with likely thousands of incorrect values in it.


The Raw VCF File

Dante gives you access to your filtered VCF file. There is another file available that you can get. If you copy the link to your Dante Labs VCF file and change the end of the filename from “snp.vcf.gz” to “raw.snp.vcf.gz”, you can download what I’m calling the Raw VCF file.

This file contains the unfiltered variants found by your whole genome test. None of the quality measures were applied to this file. The quality filtering that Dante does is designed to get remove most of the false positives, i.e. to prevent SNPs that are not variants from being reported to be variants.

My Raw VCF contains 3,973,659 SNPs. It includes every one of the 3,442,712 SNPs in my filtered SNP VCF file, plus the 712,064 SNPs (17.9%) that it found that it filtered out because the quality didn’t wasn’t enough to be sure they were true variants.

So you can use this Raw VCF file to get more of your true negatives back, but by doing so, you will also add false positives to it.

It’s a tricky situation. It’s a trade-off like a teeter totter, with false positives on one side and true negatives on the other. Dante picked a set a filters that presumably take a good compromise position that does its best to minimize the two types of mistakes.

The one nice thing about this Raw VCF file is that it includes my 46 mtDNA variants. The VCF filters somehow removed these and they are not in the filtered VCF file. Once I got these, I was able to put them into 23andMe raw data format and upload it to James Lick’s mtDNA Haplogroup utility. The format I used is:

i703360 MT 114 T

Since the raw file doesn’t give the RSID, I just used i703360 for everything, hoping that James Lick didn’t use it, and it appears he doesn’t. I used MT as the 2nd field because that’s what 23andMe has. 114 is the position number. T is the value. Those are Tabs (Hex 09) between the fields, not spaces.

I know these were my mtDNA variants because Lick’s routine gave me my correct mt haplogroup: K1a1b1a.

However, for most purposes, this Raw VCF File is likely worse to use than the filtered VCF file since it will include more false positives (variants that are not really variants) than will the filtered VCF file.

And the Raw VCF File also won’t help in our goal to produce a raw data file that can be uploaded to sites for people matching. Reducing true negatives is good, but increasing false positives is bad.


The Genome VCF (gVCF) File

So we still want/need/hope for a way to produce that raw data file from our WGS results. Too many errors are introduced by adding human reference values where they don’t have values in the VCF file. Using the Raw VCF file will alleviate some of the true negatives but it also will increase the false positives, which is not a good trade-off.

There is an intermediate file. It is called the gVCF or Genome VCF file. It contains all the reads of the Raw VCF file and then is said to fill in the gaps with the human reference genome.

Well, that really doesn’t help. It is still basically the Raw VCF file. All it does is supposedly make the lookup of the human reference genome a little simpler.

I requested my gVCF from Dante and they made it available to me. It was 3.6 GB in size and took 30 minutes for me to download. It is a compressed file and took 5 minutes to unzip. The full file is 24.6 GB

Here’s a snippet from it:

image

The long lines are the variants. The short lines are the positions where I don’t have a variant.

Position 14464 on Chr 1 is the first variant in my filtered VCF file. This file contains all the variants in my filtered VCF file.

Positions 14464 and 14653 are my 6th and 7th variants in my Raw VCF file. This file contains all the variants in my Raw VCF file.

But positions 14523, 14542, 14574, 14673, 14907 and 14930 are variants that are in this file, but not in either my Raw VCF file or my filtered VCF file. So there must even be more filtering done even before the variants even make it to the Raw VCF file. Maybe some of the values on the line (like the BaseQRankSum for example) indicate uncertain reads and have something to do with them being left out. None-the-less, I wouldn’t get excited about adding these additional variants because many likely will be false positives if you do.

At least the hope is that you’ll get every read in every position from this file.

But no. It doesn’t even do that.

The lines given that are not variants often contain an END value. e.g., the first 3 lines I displayed above contain:

14373 . T END=14379
14380 . C END=14383
14384 . T END=14396

Does the value T represent all the values from positions 14373 to 14379, or does it just represent the first position? My perusal of the VCF specs finds:

The value in the POS field refers to the position of the first base in the String.

To verify, I took some homozygous SNPs from my DNA tests that agree between different companies:

Chr 1: position 1040026 was read as TT by all 5 companies. My gVCF has:

image

So it says C at position 1039779, ending at position 1040048. That’s not T.

Try another: Chr 1: position 1110019 was read as AA by all 5. My gVCF has:

image

Here it says C at position 1109397, ending at position 1110155. That’s not A.

So the value shown refers to the first position. You do not know what the other positions up to the end positions hold.

And in fact, my gVCF file contains 308,122,966 data lines. That is about 10% of the 3 billion base pairs we have. So only 1 out of 10 positions are reported in the gVCF file with either your variant value or the human genome value.

None-the-less, it doesn’t matter whether the gVCF file contains all the human reference genome values or not. The variants in it are even more liberal than the Raw VCF file and would introduce even more false positives if it is used the generate a raw data file.


A Possible Solution

Between the filtered VCF, the Raw VCF and the gVCF files, none of them alone have the data in it to generate a raw data file that can be uploaded to GEDmatch and other DNA testing sites. And they have a sliding range of variants from optimistic (gVCF) to liberal (Raw VCF) to conservative (filtered VCF).

The problem is that DNA results include both false positives and true negatives. DNA testing companies get around the uncertain values by indicating them to be no-calls. That works because no-calls are always considered a match, and therefore they won’t break a matching segment. As long as there are not too many of them (i.e. 5% or less), the segment matching should work well.

So I believe we can generate a raw data file by doing this:

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need this initially set this up. Wilhelm HO have a good set of them included with his DNA Kit Studio.
  3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
  4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.

I likely will try this myself. When I do, it will be worthy of its own blog post.