Louis Kessler’s Behold Blog

Creating a Raw Data File from a WGS BAM file - Sun, 12 May 2019

I was wondering in my last post if I could create a raw data file that could be uploaded to to GEDmatch or DNA testing company from my Whole Genome Sequencing (WGS) results. I was trying to use one of the Variant Call Format (VCF) files. Those only include where you vary from the human reference. So logically you would think that all the locations not listed must be human reference values. But that was giving less than adequate results.

Right while I was exploring that, there was a beta announced for a WGS Extract program. It works in Windows and you can get it here:

This is not a program for the fainthearted. The download is over 2 GB because it includes the reference genome in hg19 (Build 37) and hg38 (Build 38) formats. It also includes a windows version of samtools which it runs in the background as well as the full python language.

I was so overwhelmed by what it brought that I had to ask the author how to run the program. I was embarrassed to find out that all I had to do was run the “start.bat” file that was in the main directory of the download, which opens up a command window that automatically starts the program for you, bringing up the screen I show above.

WGS Extract has a few interesting functions, but let me talk here about that one labeled “Autosomes and X chromosome” with the button: “Generate file in 23andmeV3 format”. I selected my BAM (Binary Sequence Alignment Map) file, a 110 GB file I received by mail on a 500 GB hard drive (with some other files) from Dante. I pressed the Generate file button, and presto, 1 hour and 4 minutes later, a raw data file in 23andMe v3 format was generated as well as a zipped (compressed) version of the same file.

This was perfect for me. I had already tested at 5 companies, and had downloads of FTDNA, MyHeritage, Ancestry, Living DNA and 23andMe v5 raw data files. I had previously combined these 5 files into what I call my All 5 file.

The file WGS Extract produced had 959,368 SNPs in it. That’s a higher number of SNPs than most chips produce, and since it was based on the 23andMe v3 chip, I knew there should be quite a few SNPs in it that hadn’t been tested by my other 5 companies.

You know me. I did some analysis:

The overlap (i.e. SNPs in common) varied from a high of 693,729 with my MyHeritage test, to a low of 183,165 with Living DNA. These are excellent overlap numbers – a bit of everything.

Each test had a number of no-calls, so I compared all the other values with what WGS Extract gave me, and there was a 98.1% agreement. That’s a 2% error that is either in the chip test, or in the WGS test, but from this, I cannot tell whether its the chips or the WGS that are the incorrect values. But in each case, one of them is.

When I compare this file to my All 5 file, which has 1,389,750 SNPs in it, I see that there are an extra 211,747 SNPs in my WGS file. That means I’ll be able to create a new file, an All 6 file, that will have 1,601,497 SNPs in it.

More SNPs don’t mean more matches. In fact they usually mean fewer matches, but better matches. The matches that are more likely to be false are the ones that get excluded.

In addition to including the new matches, I also wanted to update the 747,621 SNPs in the file to the same SNPs in my All 5 file. As noted in the above table, I had 2,305 SNPs whose values disagreed, so I changed them to no calls. No calls are the same as an unknown value, and for matching purpose, are always considered to be a match. Having more no calls will make you more “matchy” and like having less overlap, you’ll have more false matches. The new SNPs added included another 905 no calls. But then, of the 20,329 no calls I had in my All 5 file, the WGS test had values for 9,993 of them.

So my number of no calls went from:

20,329 + 2,305 + 905 - 9,993 = 13,546, a reduction of 6,783.

I started with 20,329 no calls in 1,389,750 SNPs (1.5%),
and reduced that to 13,546 no calls in 1,601,497 SNPs (0.8%)

A few days ago, I was wondering how much work it take to get raw data for the SNPs needed for genealogical purposes out of my WGS test. A few days later, with this great program, it turns out to be no work at all. (It probably was a lot of work for the author, though.)

I have uploaded both the 23andMe v3 file, as well as my new All 6 file to GEDmatch to see how both do at matching. I’ve marked both research. But I expect once the matching process is completed, I’ll make my All 6 file my main file and relegate my All 5 file back to research mode.

Here are the stats at GEDmatch for those who know what these are:

WGS Extract SNPs: original 959,368; usable 888234; slimmed 617,355
All 5 SNPs: original 1,389,750; usable 1,128,146; slimmed 813,196
All 6 SNPs: original 1,601,497; usable 1,332,260; slimmed 951,871

WGS – The Raw VCF file and the gVCF file - Tue, 7 May 2019

As I noted in my last post, Whole Genome: The VCF File, Part 2, the SNP VCF (Variant Call Format) file that Dante Labs gives tester of WGS (Whole Genome Sequencing), does not quite have everything that is needed to generate a raw data file that can be uploaded to various DNA sites for people matching.

The VCF file contains SNPs that vary from the standard human reference genome. The thinking is then that any SNPs needed for the raw data file (to match with the SNPs that are tested by the chips used by Ancestry, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA) that are not in the file can simply be replaced by the value from the standard human reference.

But they can’t.

The problem is that the VCF contains only those variants that passes a set of filters meeting some quality controls to ensure that the read was good enough. Those SNPs that fail the filters are not included in the VCF file. So some actual variants that should be in the VCF file don’t make it. These are known as true negatives. Substituting the standard reference genome values for those will give you incorrect values.

How many might there be? Well take a look again at the table from my last post.

When I created my “All 5” raw data file, it ended up containing 1,343,424 SNPs.

My VCF file contains 3,442,712 variant SNPs. Of those, 462,305 (13.4%) were at positions in my All 5 file.

The green highlighted values are those that matched the consensus of my All 5 file. If you exclude the no-calls and deletions in my All 5 file, then that’s 453,534 out of (462,305 – 2 – 4 – 2 – 6,156 =) 456,141 or 99.4% agreement on the values. Only 0.6% disagree which isn’t too bad.

But the real problem are those yellowed values. Those are SNPs with two different allele values (heterozygous) and they by definition must be variants. Yet, the VCF file does not include them. These amount to another 10,705 SNPs that either were wrong in the chip reads, or should have been in the VCF file but were not. This means that as many as 2.3% of the VCF values may not be included in the VCF.

There could be others as well. The AA, CC, GG and TT values just above the yellow cells that are not in the VCF may have been variants but not included in the VCF file. e.g. If the reference value was G and the SNP value was AA, then this homozygous read is a variant and should be among the green values in the AA row and AA column. But the VCF may not contain it. We can’t tell how many of these there might be until/unless we get a third test done to compare.

It’s for these true-negative values, that substituting the reference genome value would be an incorrect thing to do and build you a substandard raw data file with likely thousands of incorrect values in it.

The Raw VCF File

Dante gives you access to your filtered VCF file. There is another file available that you can get. If you copy the link to your Dante Labs VCF file and change the end of the filename from “snp.vcf.gz” to “raw.snp.vcf.gz”, you can download what I’m calling the Raw VCF file.

This file contains the unfiltered variants found by your whole genome test. None of the quality measures were applied to this file. The quality filtering that Dante does is designed to get remove most of the false positives, i.e. to prevent SNPs that are not variants from being reported to be variants.

My Raw VCF contains 3,973,659 SNPs. It includes every one of the 3,442,712 SNPs in my filtered SNP VCF file, plus the 712,064 SNPs (17.9%) that it found that it filtered out because the quality didn’t wasn’t enough to be sure they were true variants.

So you can use this Raw VCF file to get more of your true negatives back, but by doing so, you will also add false positives to it.

It’s a tricky situation. It’s a trade-off like a teeter totter, with false positives on one side and true negatives on the other. Dante picked a set a filters that presumably take a good compromise position that does its best to minimize the two types of mistakes.

The one nice thing about this Raw VCF file is that it includes my 46 mtDNA variants. The VCF filters somehow removed these and they are not in the filtered VCF file. Once I got these, I was able to put them into 23andMe raw data format and upload it to James Lick’s mtDNA Haplogroup utility. The format I used is:

i703360 MT 114 T

Since the raw file doesn’t give the RSID, I just used i703360 for everything, hoping that James Lick didn’t use it, and it appears he doesn’t. I used MT as the 2nd field because that’s what 23andMe has. 114 is the position number. T is the value. Those are Tabs (Hex 09) between the fields, not spaces.

I know these were my mtDNA variants because Lick’s routine gave me my correct mt haplogroup: K1a1b1a.

However, for most purposes, this Raw VCF File is likely worse to use than the filtered VCF file since it will include more false positives (variants that are not really variants) than will the filtered VCF file.

And the Raw VCF File also won’t help in our goal to produce a raw data file that can be uploaded to sites for people matching. Reducing true negatives is good, but increasing false positives is bad.

The Genome VCF (gVCF) File

So we still want/need/hope for a way to produce that raw data file from our WGS results. Too many errors are introduced by adding human reference values where they don’t have values in the VCF file. Using the Raw VCF file will alleviate some of the true negatives but it also will increase the false positives, which is not a good trade-off.

There is an intermediate file. It is called the gVCF or Genome VCF file. It contains all the reads of the Raw VCF file and then is said to fill in the gaps with the human reference genome.

Well, that really doesn’t help. It is still basically the Raw VCF file. All it does is supposedly make the lookup of the human reference genome a little simpler.

I requested my gVCF from Dante and they made it available to me. It was 3.6 GB in size and took 30 minutes for me to download. It is a compressed file and took 5 minutes to unzip. The full file is 24.6 GB

Here’s a snippet from it:

The long lines are the variants. The short lines are the positions where I don’t have a variant.

Position 14464 on Chr 1 is the first variant in my filtered VCF file. This file contains all the variants in my filtered VCF file.

Positions 14464 and 14653 are my 6th and 7th variants in my Raw VCF file. This file contains all the variants in my Raw VCF file.

But positions 14523, 14542, 14574, 14673, 14907 and 14930 are variants that are in this file, but not in either my Raw VCF file or my filtered VCF file. So there must even be more filtering done even before the variants even make it to the Raw VCF file. Maybe some of the values on the line (like the BaseQRankSum for example) indicate uncertain reads and have something to do with them being left out. None-the-less, I wouldn’t get excited about adding these additional variants because many likely will be false positives if you do.

At least the hope is that you’ll get every read in every position from this file.

But no. It doesn’t even do that.

The lines given that are not variants often contain an END value. e.g., the first 3 lines I displayed above contain:

14373 . T END=14379
14380 . C END=14383
14384 . T END=14396

Does the value T represent all the values from positions 14373 to 14379, or does it just represent the first position? My perusal of the VCF specs finds:

The value in the POS ﬁeld refers to the position of the ﬁrst base in the String.

To verify, I took some homozygous SNPs from my DNA tests that agree between different companies:

Chr 1: position 1040026 was read as TT by all 5 companies. My gVCF has:

So it says C at position 1039779, ending at position 1040048. That’s not T.

Try another: Chr 1: position 1110019 was read as AA by all 5. My gVCF has:

Here it says C at position 1109397, ending at position 1110155. That’s not A.

So the value shown refers to the first position. You do not know what the other positions up to the end positions hold.

And in fact, my gVCF file contains 308,122,966 data lines. That is about 10% of the 3 billion base pairs we have. So only 1 out of 10 positions are reported in the gVCF file with either your variant value or the human genome value.

None-the-less, it doesn’t matter whether the gVCF file contains all the human reference genome values or not. The variants in it are even more liberal than the Raw VCF file and would introduce even more false positives if it is used the generate a raw data file.

A Possible Solution

Between the filtered VCF, the Raw VCF and the gVCF files, none of them alone have the data in it to generate a raw data file that can be uploaded to GEDmatch and other DNA testing sites. And they have a sliding range of variants from optimistic (gVCF) to liberal (Raw VCF) to conservative (filtered VCF).

The problem is that DNA results include both false positives and true negatives. DNA testing companies get around the uncertain values by indicating them to be no-calls. That works because no-calls are always considered a match, and therefore they won’t break a matching segment. As long as there are not too many of them (i.e. 5% or less), the segment matching should work well.

So I believe we can generate a raw data file by doing this:

Make a list of all the SNPs you want raw data for.
Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need this initially set this up. Wilhelm HO have a good set of them included with his DNA Kit Studio.
The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the optimistic no-calls marked in step 3 with filtered reliable values.

I likely will try this myself. When I do, it will be worthy of its own blog post.

4 Comments

Whole Genome: The VCF File, Part 2 - Mon, 22 Apr 2019

A couple of months ago, I compared my VCF file to my DNA test results.

The Variant Call Format (VCF) file is given to you when you do a Whole Genome Sequence (WGS) test. That test finds your DNA values for your whole genome, all 3 billion positions, not just the 700,000 or so positions that a standard DNA test gives you.

But most of those 3 billion positions are the same for most humans. The ones that differ are called Single-Nucleotide Polymorphisms (SNPs) because they “morph” and can have differing values among humans. The standard DNA companies test a selection of the SNPs that differ the most, and they can use the 700,000 they selected for matching people without having to test all 3 billion positions. It works very well. WGS tests are not needed for finding relatives.

Converting VCF to a Raw Data File

But near the end of my last post, I was trying to see if the VCF file could be converted into a raw data file that could be uploaded to GEDmatch or a DNA company that allows raw data uploads.

My VCF file contains 3,442,712 SNPs whose value differ for me from the standard human reference genome. Of those, I found 471,923 SNPs were the same SNPs (by chromosome and position) as those in my raw data file that I created by combining the raw data from 5 companies (FTDNA, MyHeritage, Ancestry, 23andMe and LivingDNA). I compared them in my first analysis and found that 2,798 of them differed, which is only 0.6%.

At the time, I didn’t think that was too bad an error rate. So I thought a good way to make a raw data file from a VCF file would be:

Take a raw data file you already have to use as a template.
Blank out all the values
Add values for the positions that are in the VCF file
Fill in the others with the human reference genome value.

The basis of that idea is that if it’s not a variant in the variant file, then it must be the reference value.

Today on Facebook, Ann Turner told me that that’s not necessarily the case. The reason she believes, is that the VCF file does not contain all the variant SNPs. And the discrepancies were enough to break her comparison of “herself” with “herself” into 161 segments.

So What’s Really Different Between VCF and Raw Data?

In my first analysis, I only compared whether the values were the same or not, giving that 0.6% difference. I did not look at the specific values. Let’s do that now:

For this analysis, let’s not worry about the rows: DD (Deletions), DI (Deletion/Insertions), II (Insertions) or – (no-calls), since they are only in the raw data and not in the VCF file.

The green values down the diagonal are the agreement between the All-5 raw data file and the VCF file. Any numbers above and below that diagonal are disagreements between the two. Those are the 0.6% where one is wrong for sure, but we don’t know which.

But let me now point you to those yellowed numbers in the “Not in VCF” column. Those are all heterozygous values, with two different letters. AC, AG, AT, CG, CT or GT. If they have two different letters, then they cannot be human reference values. One of the two letters is a variant and those entries should have been in the VCF file. But they were not.

This creates even a bigger concern than our earlier 0.6% mismatch. If we total these yellow counts, we find there’s 10,705 or 1.2% of the 881,119 SNPs that are not in the VCF file that should have been in the VCF file.

Again, we don’t know who is wrong, the raw data, or the VCF file. But from Ann’s observations, we’d have to say at least some of those heterozygous values must have been left out and when reference values were added instead to Ann’s file, they caused the match breaking that resulted in 161 segments.

Which is Correct: VCF or Raw Data

When you are comparing two things, and you know one is wrong, you don’t know which of the two is the wrong one. You need others to compare with. I am awaiting the results of my long read WGS test, and when that comes I’ll have a third to compare.

But until then, can I get an idea of which of the two files might be more often correct? There’s one thing I can do.

I can provide the same table as I did above, but for the X and Y chromosomes. Since I’m a male, I only have one X and one Y chromosome. The value could be shown as a single value, but it is still read as a double value and therefore shown as a double value. The caveat is that the two letters must be the same or the read is definitely incorrect. (Note that this table excludes 688 SNPs that are in the pseudoautosomal region of the X or Y which can recombine and have two different allele values).

So let’s see the table:

The top left section contains the agreed upon values (in green) between the All 5 raw data file and the VCF file. The counts in that section above and below the green values are disagreements between All 5 and VCF and we don’t know which is correct and which is wrong.

The numbers in red are incorrect reads. Those on the left side are incorrect reads for the All 5 raw data file. It has 219 incorrect reads versus 40,848 correct reads, a ratio of 1 every 186.

The right side are incorrect reads for the VCF file. It has 13 incorrect reads versus 9,605 correct reads. That’s a ratio of 1 every 739 reads.

Now, verging on the realm of hyperbole, the difference in ratios could indicate that an error in a standard DNA test is 4 times (739 / 186) more common than an error in a VCF file.

And applying that ratio to the 10,705 heterozygous values that should have been in the VCF file, we would say that 8,564 would be because the raw data file is wrong, and 2,141 because the VCF file should have included them but did not.

And if 2,141 values out of your DNA file created from the VCF file are incorrect, couldn’t that quite easily have caused the 161 segments that Ann observed?

Yes, this is all conjecture. But the point is that maybe the VCF file is leaving out a significant number of variants. If that is the case, then we can’t just put in a reference value when there is no value in a VCF file. And that would mean a raw data file created from a VCF file and filled in by human reference values may not have enough accuracy to be usable for matching purposes.

Louis Kessler’s Behold Blog

Creating a Raw Data File from a WGS BAM file - Sun, 12 May 2019

WGS – The Raw VCF file and the gVCF file - Tue, 7 May 2019

The Raw VCF File

The Genome VCF (gVCF) File

A Possible Solution

Whole Genome: The VCF File, Part 2 - Mon, 22 Apr 2019

Converting VCF to a Raw Data File

So What’s Really Different Between VCF and Raw Data?

Which is Correct: VCF or Raw Data

Older Entries Newer entries

Search the Blog & Forum

Latest Posts RSS (?)

Latest Comments RSS

Archives

Awards and Reviews

I Geneablog

RootsTech 2017

UTP 10 - 2016
UTP 3 - 2013

Gaenovium 2014

RootsTech 2014

Professional Genealogist

About Me

Also Find Me At...

Content Protected By:

Louis Kessler’s Behold Blog

Creating a Raw Data File from a WGS BAM file - Sun, 12 May 2019

WGS – The Raw VCF file and the gVCF file - Tue, 7 May 2019

The Raw VCF File

The Genome VCF (gVCF) File

A Possible Solution

Whole Genome: The VCF File, Part 2 - Mon, 22 Apr 2019

Converting VCF to a Raw Data File

So What’s Really Different Between VCF and Raw Data?

Which is Correct: VCF or Raw Data

Older Entries Newer entries

Search the Blog & Forum

Latest Posts RSS (?)

Latest Comments RSS

Archives

Awards and Reviews

I Geneablog

RootsTech 2017

UTP 10 - 2016UTP 3 - 2013

Gaenovium 2014

RootsTech 2014

Professional Genealogist

About Me

Also Find Me At...

Content Protected By:

UTP 10 - 2016
UTP 3 - 2013