My Whole Genome Sequencing. The VCF File - Wed, 6 Feb 2019
I received my results from my Dante Labs Whole Genome test last week. I purchased the test last August when I was able to get it for $399 USD. There were two health reports that I requested that are written in ancient Latin as far as my understanding of them goes. Then there were the VCF files which I was more interested in. The FASTQ and BAM files will be sent to me on a hard drive in a few weeks.
A Variant Call Format (VCF) file basically contains the differences between me and the “standard reference human”. There were two VCF files included in my results. One with my individual SNPs which was 143 MB and the other with Insertions and Deletions which was 43 MB. The individual SNP file is of most interest, because it is that file that contains the autosomal SNP data that DNA testing companies use for genealogical matching.
These files are in gz compressed format. When expanded (to 869 MB and 224 MB) they are standard text files and a bit of the individual SNP file looks like this:
My VCF file has a header section of 141 lines. The first line of the file (not shown above) indicates that this file’s format is Version 4.2 of VCF. Another important line in the header is line 139 above, which specifies the reference genome to be ucsc.hg19.fasta. The ucsc is for the University of California Santa Cruz Genomics Institute, which maintains and makes available genome information at genome.ucsc.edu. The hg19 is the hg19 assembly of the human genome, which is also called Build 37, and is the version of the genome currently used by most of the DNA testing companies. And fasta is a format that lists all the reference values of the genome.
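If you want to check these header values yourself without expanding the gz file, a few lines of Python will do it. This is only a sketch: the function names and the file name in the usage note are my own placeholders, and it assumes the header uses the usual "##key=value" lines, which is how the fileformat and reference lines are written in VCF.

```python
import gzip

def read_vcf_header(path):
    """Collect the header lines (those starting with '#') of a gzipped VCF."""
    header = []
    with gzip.open(path, "rt") as handle:  # "rt" decompresses and decodes as text
        for line in handle:
            if not line.startswith("#"):
                break  # first data record reached, so the header is over
            header.append(line.rstrip("\n"))
    return header

def header_value(header, key):
    """Return the value of the first '##key=value' header line, or None."""
    prefix = "##" + key + "="
    for line in header:
        if line.startswith(prefix):
            return line[len(prefix):]
    return None
```

With a hypothetical file name, `header = read_vcf_header("my_snps.vcf.gz")` followed by `header_value(header, "fileformat")` and `header_value(header, "reference")` would show the VCF version and the reference genome line, and `len(header)` gives the size of the header section.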
The header in my VCF file is followed by 3,442,712 lines that represent each SNP where I am different from the reference value. “SNP” is an abbreviation for Single-nucleotide polymorphism. The “polymorphism” refers to something that can have more than one form, so when you hear SNP, think of a position on the genome where humans can differ from each other.
Each line contains:
- #CHROM, the chromosome number of the SNP. My file includes data for Chromosomes 1 to 22, X and Y.
- POS, the position of the SNP on the chromosome
- ID, the RSID of the SNP, i.e. a name it is given to reference it. In my VCF file from Dante, no RSIDs are given and the ID is shown as a period on every line. That’s not a problem, since most DNA matching is done by position, not by RSIDs, which can change positions between Builds.
- REF, the value of that position on that chromosome in the reference genome, which is one of A, C, G and T. This is usually the SNP value that most people have, e.g. if REF = A, then the pair AA will be the reference value for that SNP, i.e. A from their father and A from their mother.
- ALT, the alternative values that I have. Usually it is one value, one of A, C, G and T and is different from the REF value. Occasionally it is two values, both different from the REF value, e.g. REF = A, ALT = C,T
- QUAL, is a number estimating the quality of the read that was done in my test for that SNP. A higher number is better quality.
- FILTER, is an evaluation as to whether that SNP’s value is reliable. My file only included SNPs with a filter value of PASS.
- INFO and FORMAT, contain detailed information about the read at that SNP. The most important field is the AC field. If AC=2, then the ALT value will be both values of the pair. Otherwise the REF value will be the leftover value, unless there are two ALT values, in which case the pair is the two ALT values. e.g.:
- REF=A, ALT=C, AC=2, then SNP=CC
- REF=A, ALT=C, AC=1, then SNP=AC
- REF=A, ALT=C,T, AC=1, then SNP=CT
So from this file, using the REF, ALT and AC values on each line, I can compute the SNP value for the position given on the chromosome.
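Here is a minimal sketch of that computation in Python. The function name is my own, it assumes the INFO column carries an "AC=" entry as my file does, and it sorts the two letters alphabetically so that a pair always reads the same way. A more general reader would probably use the GT value in the sample column rather than AC, so treat this as an illustration of the rule above and not as a full VCF parser.

```python
import re

def snp_value(ref, alt, info):
    """Compute the two-letter SNP value from the REF, ALT and INFO columns."""
    alts = alt.split(",")
    if len(alts) == 2:
        # Two ALT values: the pair is the two alternatives.
        return "".join(sorted(alts))
    # Find AC at the start of INFO or right after a ';' (so MLEAC= is not matched).
    match = re.search(r"(?:^|;)AC=(\d+)", info)
    if match is None:
        raise ValueError("no AC field in INFO: " + info)
    if int(match.group(1)) == 2:
        # Both values of the pair are the ALT value.
        return alts[0] * 2
    # AC=1: one ALT value, and REF is the leftover value.
    return "".join(sorted(ref + alts[0]))

print(snp_value("A", "C", "AC=2;AF=1.00;AN=2"))    # CC
print(snp_value("A", "C", "AC=1;AF=0.500;AN=2"))   # AC
print(snp_value("A", "C,T", "AC=1,1;AF=0.5,0.5"))  # CT
```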
These are the counts of each computed SNP value for my file:
Remember that the above counts of homozygous readings (where both alleles are the same: AA, CC, GG or TT) do not include any SNPs which have the same reference value. If they are the same as the reference value, then they are not included in the VCF file.
Also note that since I’m a male, only one allele should be shown for the X and Y chromosomes, so I should not have any heterozygous (alleles are different) readings there. The ones I do have might either be errors in the reads, or maybe they are reading the pseudo-autosomal regions on the X and Y where crossover might occur. I’m not sure why the number of my homozygous variants for Y is so low. But for genealogical matching purposes, I’m more interested in 1 to 22 and X.
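A tally like this takes only a few lines once each line of the file has been turned into a chromosome and a two-letter SNP value. The sketch below assumes that step has already been done. The function name and the sample input are made up for illustration and are not my real data.

```python
from collections import Counter

def tally(records):
    """Count SNP values overall, and heterozygous readings per chromosome.

    records: iterable of (chromosome, two-letter SNP value) pairs.
    """
    by_value = Counter()
    het_by_chrom = Counter()
    for chrom, value in records:
        by_value[value] += 1
        if value[0] != value[1]:  # the two alleles differ: heterozygous
            het_by_chrom[chrom] += 1
    return by_value, het_by_chrom

# Made-up sample input, just to show the shape of the data.
sample = [("chr1", "AC"), ("chr1", "CC"), ("chrX", "AG"), ("chrY", "TT")]
by_value, het_by_chrom = tally(sample)
print(by_value["CC"], het_by_chrom["chrX"], het_by_chrom["chrY"])  # 1 1 0
```

For a male, any non-zero heterozygous count on X or Y is worth a second look, for the reasons given above.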
The 1000 Genomes Project Consortium in 2015 found over 84.7 million SNPs among 2,504 individuals from 26 populations. They also found that “a typical genome differs from the reference human genome at 4.1 million to 5.0 million sites” out of the 3.3 billion base pairs, so that’s only 0.14%. That means that 99.86% of our genomes are identical. These “sites” will include my 3,442,712 SNPs in the table above, as well as the 867,091 inserts and deletions from my other VCF file. So my total is 4,309,803 sites, which is in the correct range.
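The arithmetic is easy to check. This small sketch uses only the numbers quoted above, and the variable names are mine. Note that my own total works out to about 0.13% of the 3.3 billion base pairs, a little under the 0.14% figure for the middle of the typical range.

```python
snps = 3_442_712            # lines in my individual SNP VCF file
indels = 867_091            # inserts and deletions in my other VCF file
genome_bp = 3_300_000_000   # base pairs, the figure used above

total = snps + indels
share = total / genome_bp * 100          # percent of the genome that differs
in_range = 4_100_000 <= total <= 5_000_000

print(total)            # 4309803
print(round(share, 2))  # 0.13
print(in_range)         # True
```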
Comparing VCF values to my Raw Data
I’ve tested my DNA with 5 companies that have provided me with raw DNA results. The companies tested and gave me results for between 618,640 SNPs (Living DNA) and 720,816 SNPs (MyHeritage DNA). There was overlap in what SNPs the companies tested. When I took the results of all 5 tests and combined them into one raw data file, I ended up with 1,389,750 unique SNPs.
A whole genome test is a test of all your DNA. My Dante WGS results provide me with values for all positions on all my chromosomes. These will come in 2 huge files I will receive soon on a hard drive.
The VCF files that I’m talking about in this article tell me what differs from the reference, so it is logical to assume that all values that are not in the VCF file are the same as the reference. Through deduction, you would think that I could state with certainty that the positions not specified in the VCF file would have the reference value. But that won’t always be true because the VCF contains only the SNPs that have “PASS” as the Filter value. We don’t know what the values are for those that are not marked as PASS from just the VCF. In fact, I don’t even know how many are not marked PASS, whether it is a lot or a few. Since this is a 30x (30 times coverage) WGS test, I would assume that the vast majority of the positions have been read correctly. Once I get the FASTQ and BAM files, I’ll see if I can look at this in more detail.
My VCF file contains 471,923 SNPs that are in my combined raw data. So 34.0% of my combined raw data are specified in the VCF file. The other 3,837,880 of the 4,309,803 sites in my two VCF files are ones that none of the 5 DNA testing companies had tested. We’ll ignore those for now.
Here’s a summary of the 471,923 SNPs in common between my VCF file and my combined raw data file:
Of these, 98.0% were the same as they were in my combined raw data file.
The “New” column represents 6,321 SNPs that were no-calls in my combined raw data file, so my VCF allows me to define those.
The “Verify” column represents 228 SNPs that had disagreements between two or more of the raw data files, so I had set them to a no-call. The VCF could prove to be a tie-breaker in this case, but I’ll just continue to call these no-calls just to be safe.
The “Diff” column represents 2,798 SNPs that had a value in my combined raw data file, but the VCF value disagrees with it.
I could use this information to improve my raw data. I could assign values to the 6,321 no-calls, but I should then also turn 2,798 assigned values into no-calls. That would still reduce my overall number of no-calls by 3,523, from 20,688 (1.5%) to 17,165 (1.2%).
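For anyone who wants to do the same comparison, here is a rough sketch of how the SNPs in common could be sorted into those groups. Everything about the data layout is an assumption for illustration: dictionaries keyed by chromosome and position, "--" for a no-call, a separate set of the positions where my raw data files disagreed, and "Same" as my label for the SNPs that agree. It also ignores complications such as strand differences between companies.

```python
NO_CALL = "--"  # assumed marker for a no-call in the combined raw data

def classify(vcf_values, combined, conflicted):
    """Count the SNPs in common as Same, New, Verify or Diff.

    vcf_values: dict of (chromosome, position) -> SNP value from the VCF
    combined:   dict of (chromosome, position) -> value in the combined raw data
    conflicted: set of positions where the raw data files disagreed
    """
    counts = {"Same": 0, "New": 0, "Verify": 0, "Diff": 0}
    for key, vcf_value in vcf_values.items():
        if key not in combined:
            continue  # not tested by any of the companies
        raw_value = combined[key]
        if key in conflicted:
            counts["Verify"] += 1   # raw files disagreed, kept as a no-call
        elif raw_value == NO_CALL:
            counts["New"] += 1      # the VCF supplies a value for a no-call
        elif sorted(raw_value) == sorted(vcf_value):
            counts["Same"] += 1     # same pair, letter order ignored
        else:
            counts["Diff"] += 1     # the VCF disagrees with the raw data
    return counts

# The no-call arithmetic from the paragraph above.
print(20_688 - 6_321 + 2_798)  # 17165
```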
How Can Genetic Genealogists Use a VCF file
Two ways:
1. Upload the VCF file to a DNA matching service that accepts it.
2. Use it to create a raw data file which you can then upload to a DNA matching service that accepts it.
Uploading a VCF file to GEDmatch Genesis
One would hope that if they did a whole genome test, they would be able to upload their whole genome data to one of the companies that do DNA matching.
The only company that currently takes VCF uploads is GEDmatch Genesis. I was patient and waited the 5 minutes until the browser responded after I hit the Upload button. Then it didn’t take very long for GEDmatch to load the file and it provided this processing:
I made that kit “Research” and waited a day until GEDmatch completed the matching for the kit. Once the results came back, I found a problem.
The GEDmatch File Diagnostic Utility run on my combined raw data which I had previously uploaded gives this:
When I run the diagnostics on my VCF file from Dante, I get this:
As correctly reported by the diagnostics, the All 5 file has 1,389,750 SNPs in it, and the WGS file has 3,442,712 SNPs in it.
The diagnostic then reports that my All 5 file has 1,128,146 usable SNPs which are then slimmed to 813,196 SNPs. The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most “bang for the buck”.
But my VCF file only had 590,334 usable SNPs which get slimmed to only 231,588 SNPs. That is way less than my All 5 file has. A WGS tests the whole genome, so it should give more SNPs than any other test or even combined tests give. So something was wrong.
Also, when I did a One to Many of my WGS kit, it matched most closely to my All 5 kit, which it should. But then it was closely followed by a whole bunch of kits of other people who are matching me close to identically. All those kits appear to be other whole genome tests.
It then became obvious to me that GEDmatch Genesis is only using the variant SNPs from the VCF file. The reason why I get complete matches with other WGS kits is that if two people both have a variant at a position, then there is an extremely high probability that their variants are the same. And all GEDmatch is comparing between WGS files are variants.
The procedure that GEDmatch, or anyone else who wants to load a VCF file, needs to follow is this:
1. If a line in the VCF file has one REF value and one ALT value, then:
   a. If the INFO field contains “AC=1”, then you take the two of them. e.g. REF=T, ALT=C, then value is TC (or CT if you sort alphabetically)
   b. If the INFO field contains “AC=2”, then you use the ALT value twice. e.g. REF=T, ALT=C, then value is CC.
2. If a line in the VCF file has one REF and two ALT values, then you take both the ALT values. e.g. REF=T, ALT=C,G, then value is CG. There are only a few hundred of these in my VCF file.
3. If a SNP that they use is not in the VCF file, then use the reference. e.g. REF=C, to give the value CC. They’ll need to have a reference table with the Build 37 genome reference values for all the SNPs that they use. This table would be the same for everyone.
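The whole procedure, including the fill-in from the reference, can be sketched in a few lines of Python. To be clear, this is my own illustration and not how GEDmatch or any other tool actually does it. The function name and data layout are assumptions: the VCF lines are already parsed into a dictionary keyed by chromosome and position, and the reference table is a dictionary giving the Build 37 reference letter for every SNP the service uses.

```python
def build_values(vcf_lines, reference_table):
    """Give a two-letter value to every SNP in the reference table.

    vcf_lines:       dict of (chromosome, position) -> (ref, list of alts, ac)
    reference_table: dict of (chromosome, position) -> reference letter
    """
    values = {}
    for key, ref_letter in reference_table.items():
        if key not in vcf_lines:
            # Not in the VCF: assume the reference value from both parents.
            values[key] = ref_letter * 2
            continue
        ref, alts, ac = vcf_lines[key]
        if len(alts) == 2:
            values[key] = "".join(sorted(alts))            # two ALT values
        elif ac == 2:
            values[key] = alts[0] * 2                      # one ALT, AC=2
        else:
            values[key] = "".join(sorted(ref + alts[0]))   # one ALT, AC=1
    return values

reference_table = {("1", 100): "T", ("1", 200): "T", ("1", 300): "T", ("1", 400): "C"}
vcf_lines = {
    ("1", 100): ("T", ["C"], 1),
    ("1", 200): ("T", ["C"], 2),
    ("1", 300): ("T", ["C", "G"], 1),
}
print(build_values(vcf_lines, reference_table))
```

The example prints CT, CC and CG for the three positions found in the VCF, and CC for position 400, which comes purely from the reference table. Keep in mind the caveat from earlier: a position can also be missing from the VCF because its read did not get a PASS, and this fill-in would wrongly give it the reference value.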
I reported this to GEDmatch and John Olson replied and confirmed that they are not adding the reference values. He said the VCF upload will have to wait until they get caught up on their Genesis conversion issues.
Using DNA Kit Studio to Create a Raw Data File from a VCF
Wilhelm H. created a wonderful little program called DNA Kit Studio that includes a VCF to RAW converter in it.
It originally did not accept my VCF from Dante. I contacted Wilhelm and the reason was that Dante did not include RSID values. Wilhelm made the change and sent me a beta of the program for me to try. It now created the raw data file, and correctly did steps 1a, 1b, and 2, above. But he, like GEDmatch, also was not including the reference genome value for the other positions.
I gave Wilhelm links to a couple of open source sites that have most of the reference values for the 23andMe and Ancestry SNPs that the companies test for. And likely when I get the rest of my whole genome data (the FASTQ and BAM files), I’ll figure out how to determine all the reference values myself.
If you can’t wait for Wilhelm to finish his update to his VCF to RAW converter, or if you don’t want to do the task yourself, you could use Wilhelm’s service and he’ll convert it for you for a small fee.
Conclusion: Is a WGS Test Useful for Matching?
For the purposes of matching, it really only takes a raw data file from any of the major DNA testing companies to get you going. GEDmatch and some of the testing companies will accept uploads and you can get into most databases with just the one test.
You will get slightly more accurate matches at GEDmatch Genesis if you take a test from two companies, one using the old chip (AncestryDNA, Family Tree DNA or MyHeritage DNA) and one using the new chip (23andMe or Living DNA) and then use a tool like DNA Kit Studio to combine them before uploading.
But currently, I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes.