
Louis Kessler’s Behold Blog

MyHeritage New AutoClustering Feature is now Live - Fri, 1 Mar 2019

Another new feature announced during #RootsTech is the MyHeritage DNA integration of Evert-Jan Blom’s AutoClustering method. MyHeritage has become the first major DNA service to offer clustering.

MyHeritage’s blog post Introducing AutoClusters for DNA Matches was posted yesterday and describes their new service.

MyHeritage in their post says:

“This new tool was developed in collaboration with Evert-Jan Blom of GeneticAffairs.com, based on technology that he created, further enhanced by the MyHeritage team. Our enhancements include better clustering of endogamous populations (people who lived in isolated communities with a high rate of intermarriages, such as Ashkenazi Jews and Acadians), and automatic threshold selection for optimal clustering so that users need not experiment with any parameters.”

I looked at several autoclustering methods a month ago in my Comparing Genetic Clusters post, and I included Evert-Jan’s Genetic Affairs program. Those methods all used Ancestry matches at the time. Now I’m interested in seeing what AutoClustering does with my MyHeritage matches, especially in light of my endogamy.

So let’s try it.

On my DNA Tools page, I selected AutoClusters.

image

The illustration shows a clustering example diagram.

The “Generate clusters for” and “Kit:” dropdowns allow me to select from my 3 possible kits:

  1. The MyHeritage DNA test that I took.
  2. The FTDNA test that I took that I uploaded to MyHeritage
  3. The FTDNA test that my uncle took that I uploaded to MyHeritage.

I pressed the Generate button for each of the 3 kits. After pressing the button, up pops the following box:

image

I did this yesterday around noon, 5 hours or so after this service went live. I saw on Facebook that some people were receiving their results in an hour or two. But as more people found out about this and submitted their requests, the queue started to grow. I did not get my results until the next morning, when I found the three results in my inbox. The emails were from 1:30 a.m., so they took over 13 hours to get generated and sent to me. I expect that the waiting time will come down considerably once the initial excitement period subsides.

What You Get

You get a zip file (mine are about 80 KB each) which expands into three files:

  1. An HTML (browser) file that displays your cluster chart with the amazing bit of animation that Evert-Jan developed to organize the clusters in front of your very eyes. Just hit refresh (F5) to display this hypnotizing effect over and over.
  2. A CSV (comma delimited file) that contains all the data in columns that can be loaded into Excel for analysis.
  3. A ReadMe.pdf file that gives you information about the analysis done for you.

The HTML and CSV files are given the name:

Louis Kessler Auto Clusters – kk-kkkkkk – March 01 2019.sss

where kk-kkkkkk is the kit number and sss is .html or .csv.

The ReadMe.pdf file always has that same name, so if you don’t rename it, one will overwrite the other. The ReadMe files are identical except for the information about your particular clustering run, so you should rename each one to associate it with the other two files.  My info from the three ReadMe files, along with my match statistics, tells me the following:

image

The clustering algorithm in all cases excluded my match with my uncle.

My test gave me 9,315 matches, of which 119 are between 80 cM and 350 cM. The clustering algorithm excluded 19 singletons and grouped the other 100 into 26 clusters.

My transfer from FTDNA was similar. My uncle’s transfer from FTDNA had more close matches than I had. That’s the advantage of testing someone a generation back. So the clustering algorithm used a narrower range, 85 cM to 350 cM, to include only 100 people.

My Test versus My Transfer

These are my clusters from my MyHeritage DNA test:

image

These are my clusters from my transfer from FTDNA:

image

They look almost identical and that is good. There are two fewer clusters from the transfer file, but you can barely tell.  And despite my endogamy, there are not a lot of grey squares representing matches outside of clusters.

When I compare individual people and the groups they are in, I note that the groups are numbered differently in the two reports, but I can align the groups and most of the people match. There were 100 people in the MyHeritage clusters, and 94 in the FTDNA transfer clusters. MyHeritage has 9 people that FTDNA doesn’t have, and FTDNA has 3 people that MyHeritage doesn’t have. Of the remaining 91 matching people, 11 of them disagree as to which group they are in.  So that leaves 80 people who are put in the same group from both clusterings. Pretty good.
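The bookkeeping behind that comparison can be sketched in a few lines of Python. This is only an illustration: the names and cluster numbers below are made up, and aligning each group to the group it overlaps most is an assumed approach, not something either report does for you.

```python
from collections import Counter


def compare_clusterings(a, b):
    """Compare two {person: cluster_id} mappings from separate clustering runs.

    Cluster ids are arbitrary in each run, so every cluster in `a` is aligned
    to the cluster in `b` that contains the most of its shared members.
    Returns (only in a, only in b, same group, different group) as sets.
    """
    only_a = set(a) - set(b)
    only_b = set(b) - set(a)
    shared = set(a) & set(b)

    # Count how many shared people fall into each (cluster in a, cluster in b) pair.
    pair_counts = Counter((a[p], b[p]) for p in shared)

    # Walk the pairs from largest overlap to smallest; the first pairing seen
    # for a cluster in `a` is its best-overlapping cluster in `b`.
    best = {}
    for (ca, cb), _count in sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        best.setdefault(ca, cb)

    agree = {p for p in shared if best[a[p]] == b[p]}
    return only_a, only_b, agree, shared - agree


# Hypothetical example data, not real matches.
test_run = {"Ann": 1, "Bob": 1, "Cy": 2, "Dee": 2, "Gus": 2, "Eve": 3}
transfer_run = {"Ann": 7, "Bob": 7, "Cy": 8, "Dee": 8, "Gus": 7, "Fay": 9}

only_test, only_transfer, same_group, moved = compare_clusterings(test_run, transfer_run)
print(len(only_test), len(only_transfer), len(same_group), len(moved))
```

With the real reports the same arithmetic gives 100 - 9 = 94 - 3 = 91 people in both runs, and 91 - 11 = 80 placed in the same group.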

Determining Common Ancestors

Unlike my Ancestry DNA matches, where I know my relationship to about 10 of my matches, at MyHeritage other than my uncle, I don’t know how I’m related to any of my 9,314 matches or to my uncle’s 10,834 matches. So at MyHeritage, I cannot use known tested relatives to determine common ancestors for some of the clusters.



Comparing My Clusters with my Uncle’s Clusters

At MyHeritage, I can look at the Shared DNA Matches between myself and my uncle. My uncle is my father’s full brother, and I share 1,994 cM on 52 segments with him. So our shared matches should mostly be on my paternal side. The matches I have that I don’t share with my uncle should mostly be on my maternal side.  This is one comparison that I cannot do at Ancestry, since I only tested my uncle at FTDNA and Ancestry does not accept uploads of raw data from other companies.

The Shared DNA Match list with my uncle shows only 3,114 Matches. To my surprise, that’s only 33% of the 9,314 matches I have.  Since my uncle represents my full paternal side, you’d expect that it would be 50%. I’m guessing that either more people on my maternal side tested at MyHeritage than my paternal side, or maybe endogamy is allowing me to match some people by combining my maternal and paternal totals – and my uncle simply doesn’t meet the criteria to match them. By comparison, at FTDNA, my uncle and I match 10,182 people (57%) in common out of my 17,881 matches and my uncle’s 18,680 matches.

When I go through my Shared DNA Matches that I have with my uncle, I find just 6 matches among the 100 people in my clusters, and they are in 5 different clusters. Not only that, two of those are in my uncle’s excluded singletons, so that leaves just 4 people in common between my clusters and my uncle’s clusters.

The low number of people in common prevents me from combining my uncle’s clusters with mine to try to identify whether my clusters are on my paternal or maternal side. I’m very surprised that this happens, but it is likely because my uncle’s top 100 matches have little overlap with my top 100 matches.

So I won’t be able to directly compare my uncle’s clusters to mine by person as I had hoped. Nonetheless, let’s go forward anyway and look at my uncle’s clusters:

image

This also does not look much different than my clustering, but the people that make them up are different. We only have 4 of the people of the 100 shown here in common between us.

Clustering is potentially very useful if you know your relationship to some of the matches. Unfortunately for me at MyHeritage, I’ll have to wait until I determine my relationships with some of my DNA matches before I’ll be able to make full use of MyHeritage’s new clustering information.

DNA meets Trees at AncestryDNA and MyHeritage DNA - Wed, 27 Feb 2019

Today’s the first day of #RootsTech. This is the day that many of the genealogy companies announce new features on their site.

So it is not ironic at all that today, both AncestryDNA and MyHeritage DNA announced a new feature that matches up your tree with the trees of your DNA matches and shows you the results.

DNA matching is a tool to assist your genealogy research. But up to now, you’ve had to do most of the tree inspection on your own. Finally, we’ve got not one, but two new automated systems to save lots of time.

Ancestry DNA

Ancestry’s official announcement may come tomorrow (Feb 28) at RootsTech in Crista Cowan’s talk at 1:30 MST titled “What You Don’t Know about Ancestry”. This will be live streamed tomorrow, so if you read this post in time, you can listen to Crista live.

To get the new feature, you currently have to go into your Ancestry account and from the menu, select “Extras” and under that “Ancestry Lab”. Then on the Ancestry Lab page, you should enable their Beta features.

image

After you opt in, you then can go to your DNA Matches page, and in the “Filtered by” drop down, you’ll see “Common ancestors”.  Select that, and hopefully you’ll get a few matches. I got the 4 you see below.

image

If you click on one of the people’s names, you get the comparison page. There is now a Common Ancestors box. For people who are not in the above list, you’ll get this box:

image

But for people who are in this list, the box will contain something quite exciting:

image

The people shown are the common ancestors of myself and my match. Even more exciting is if you now click on the “View relationship” link for either ancestor, you get:

image

And if you expand those dropdowns that are hiding two generations, it gives:

image

This match, as well as my other three, are all correct matches. I previously knew my connection to two of these people. The other two were people correctly placed into my tree that I did not know had tested.

This is really great! Finally, the companies are doing something intelligent to match you with your DNA match via your combined trees. Bravo!

I posted a survey on the Facebook group: Genetic Genealogy Tips, to find out how many matches people were getting. Although I only got 4 matches, almost half the people reported between 100 and 999 matches! That’s a lot of connections many people will now be able to make that previously required laborious manual tree inspection.

MyHeritage DNA

MyHeritage at almost exactly the same time implemented almost exactly the same feature. They have already announced their matching system which they call: The Theory of Family Relativity.

To access your matches, go to your MyHeritage DNA Matches page, select “Filters” and then select: “Has Theory of Family Relativity”.

image

Unfortunately for me, what I get is no results, so I won’t be able to give you a personal illustration of what it looks like.

image

But I would expect there would likely be a match tree similar to what AncestryDNA gives. This is the illustration from their announcement:

image

I posted the same survey on Genetic Genealogy Tips asking how many MyHeritage DNA matches everyone had. People generally had fewer matches than at Ancestry DNA, but about 65% did have matches and 10% marked that they had between 20 and 99 with a few having more than 100.




Update: Mar 2, 2019:  It seems I took a shortcut to get to the Ancestry Connected Trees. The “Common Ancestors” feature apparently existed before. But I didn’t have any so I didn’t know about it.

The new Ancestry feature is actually called ThruLines and you can get to it from Your DNA Results Summary page:

 image

Clicking through on the “Explore ThruLines” button takes you to a page showing all your direct ancestors:

image

Any ancestors through which the ThruLines algorithm finds a potential DNA relative will be marked with a “Potential Ancestor” indicator.  Three of my ancestors have this:

image

Clicking on their tiles will then take you to the same 4 connections that I found by clicking on “Common Ancestors” as I first described in the post.  Herz Tzvi and Dwora both take me to the same two DNA relatives (since they were husband and wife) and Manascu takes me to the other two DNA relatives. It is titled “ThruLines” rather than Common Ancestors and can show the connection to more than one DNA relative at a time. But essentially it is the same information, just accessed by ancestor rather than by DNA match.

image

My Whole Genome Sequencing. The VCF File - Wed, 6 Feb 2019

I received my results from my Dante Labs Whole Genome test last week. I purchased the test last August when I was able to get it for $399 USD. There were two health reports that I requested that are written in ancient Latin as far as my understanding of them goes. Then there were the VCF files which I was more interested in. The FASTQ and BAM files will be sent to me on a hard drive in a few weeks.

A Variant Call Format (VCF) file basically contains the differences between me and the “standard reference human”. There were two VCF files included in my results. One, with my individual SNPs, was 143 MB. The other, with insertions and deletions, was 43 MB. The individual SNP file is of most interest, because it is that file that contains the autosomal SNP data that DNA testing companies use for genealogical matching.

These files are in gz compressed format. When expanded (to 869 MB and 224 MB) they are standard text files and a bit of the individual SNP file looks like this:

image

My VCF file has a header section of 141 lines. The first line of the file (not shown above) indicates that this file’s format is Version 4.2 of VCF. Another important line in the header is line 139 above, which specifies the reference genome to be ucsc.hg19.fasta.  The ucsc is for the University of California Santa Cruz Genomics Institute, who maintain and make available genome information at genome.ucsc.edu. The hg19 refers to the hg19 assembly of the human genome, which is also called Build 37, and is the version of the genome currently used by most of the DNA testing companies. And fasta is a format that lists all the reference values of the genome.

The header in my VCF file is followed by 3,442,712 lines, each representing a SNP where I am different from the reference value. “SNP” is an abbreviation for single-nucleotide polymorphism. The “polymorphism” refers to something that can have more than one form, so when you hear SNP, think of a position on the genome where humans can differ from each other.
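Those header facts can be pulled out of the file without expanding it first. The following is a minimal sketch using only the Python standard library; the function name `summarize_vcf` is made up for illustration, and it assumes the header carries the standard `##fileformat=` and `##reference=` lines and that every line starting with `#` (including the `#CHROM` column line) counts as header.

```python
import gzip


def summarize_vcf(path):
    """Return (file format, reference, header line count, variant line count).

    Reads plain or .gz-compressed VCF text. Every line starting with '#'
    is counted as a header line, including the '#CHROM' column line.
    """
    fileformat = None
    reference = None
    header_lines = 0
    variant_lines = 0

    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                header_lines += 1
                if line.startswith("##fileformat="):
                    fileformat = line.strip().split("=", 1)[1]
                elif line.startswith("##reference="):
                    reference = line.strip().split("=", 1)[1]
            elif line.strip():
                # Any non-blank, non-header line is one variant record.
                variant_lines += 1

    return fileformat, reference, header_lines, variant_lines
```

Run on a file like mine, the last two numbers should come back as the header length and the count of SNP lines.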

Each line contains:

  • #CHROM, the chromosome number of the SNP.  My file includes data for Chromosomes 1 to 22, X and Y.
  • POS, the position of the SNP on the chromosome
  • ID, the RSID of the SNP, i.e. the name it is given to reference it.  In my VCF file from Dante, no RSIDs are given and the ID is shown as a period on every line. That’s not a problem, since most DNA matching is done by position, not by RSID, and an RSID’s position can change between Builds.
  • REF, the value of that position on that chromosome in the reference genome, which is one of A, C, G and T. This is usually the SNP value that most people have, e.g. if REF = A, then the pair AA will be the reference value for that SNP, i.e. A from their father and A from their mother.
  • ALT, the alternative values that I have. Usually it is one value, one of A, C, G and T, and is different from the REF value. Occasionally it is two values, both different from the REF value, e.g. REF = A, ALT = C,T
  • QUAL, a number estimating the quality of the read that was done in my test for that SNP. A higher number is better quality.
  • FILTER, an evaluation as to whether that SNP’s value is reliable. My file only included SNPs with a filter value of PASS.
  • INFO and FORMAT, which contain detailed information about the read at that SNP. The most important field is the AC field. If AC=2, then the ALT value will be both values of the pair. Otherwise the REF value will be the leftover value, e.g.:
    • REF=A, ALT=C, AC=2, then SNP=CC
    • REF=A, ALT=C, AC=1, then SNP=AC
    • REF=A, ALT=C,T, AC=1, then SNP=CT

So from this file, using the REF, ALT and AC values on each line, I can compute the SNP value for the position given on the chromosome.
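As a rough illustration, the REF/ALT/AC rules above can be written as a short Python function. This is a sketch under the same assumptions as the examples in the list: a single-sample file of single-letter SNPs, where a line with two ALT values means one copy of each. The function names are made up, and a general-purpose VCF reader would normally look at the GT field rather than AC.

```python
def parse_ac(info):
    """Pull the allele count(s) out of an INFO string such as 'AC=1;AF=0.5'."""
    for field in info.split(";"):
        if field.startswith("AC="):
            return [int(n) for n in field[3:].split(",")]
    raise ValueError("INFO has no AC field")


def genotype(ref, alt, ac):
    """Compute the two-letter SNP value from REF, ALT and AC."""
    alts = alt.split(",")
    if len(alts) == 2:
        # Two ALT values: assumed to be one copy of each, REF is not present.
        return "".join(sorted(alts))
    if ac == 2:
        # Both copies carry the ALT value.
        return alts[0] * 2
    # One copy is ALT, the leftover copy is REF (sorted alphabetically).
    return "".join(sorted(ref + alts[0]))


def genotype_from_line(line):
    """Turn one tab-separated VCF data line into (chromosome, position, value)."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt, info = cols[0], int(cols[1]), cols[3], cols[4], cols[7]
    return chrom, pos, genotype(ref, alt, parse_ac(info)[0])


print(genotype("A", "C", 2))    # CC
print(genotype("A", "C", 1))    # AC
print(genotype("A", "C,T", 1))  # CT
```

Sorting the letters just keeps heterozygous values in a consistent order, so AC and CA are always reported the same way.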

These are the counts of each computed SNP value for my file:

image

Remember that the above counts of homozygous readings (where both alleles are the same: AA, CC, GG or TT) do not include any SNPs where I match the reference value. If they are the same as the reference value, then they are not included in the VCF file.

Also note that since I’m a male, one allele should be shown for the X and Y chromosomes. I should not have any heterozygous (alleles are different) readings there. These might either be errors in the reads, or maybe they are reading the pseudo-autosomal regions on the X and Y where crossover might occur. I’m not sure why the number of my homozygous variants for Y are so low. But for genealogical matching purposes, I’m more interested in 1 to 22 and X.

The 1000 Genomes Project Consortium in 2015 found over 84.7 million SNPs among 2,504 individuals from 26 populations. They also found that “a typical genome differs from the reference human genome at 4.1 million to 5.0 million sites”. Out of the 3.3 billion base pairs, that’s only about 0.14%. That means that about 99.86% of our genomes are identical. These “sites” will include my 3,442,712 SNPs in the table above, as well as the 867,091 inserts and deletions from my other VCF file. So my total is 4,309,803 sites, which is in the correct range.

   

Comparing VCF values to my Raw Data

I’ve tested my DNA with 5 companies that have provided me with raw DNA results. The companies gave me results for anywhere from 618,640 SNPs (Living DNA) to 720,816 SNPs (MyHeritage DNA). There was overlap in what SNPs the companies tested. When I took the results of all 5 tests and combined them into one raw data file, I ended up with 1,389,750 unique SNPs.

A whole genome test is a test of all your DNA. My Dante WGS results provide me with values for all positions on all my chromosomes. These will come in 2 huge files I will receive soon on a hard drive.

The VCF files that I’m talking about in this article tell me what differs from the reference, so it is logical to assume that all values that are not in the VCF file are the same as the reference. Through deduction, you would think that I could state with certainty that the positions not specified in the VCF file would have the reference value. But that won’t always be true because the VCF contains only the SNPs that have “PASS” as the Filter value. We don’t know what the values are for those that are not marked as PASS from just the VCF. In fact, I don’t even know how many are not marked PASS, whether it is a lot or a few. Since this is a 30x (30 times coverage) WGS test, I would assume that the vast majority of the positions have been read correctly. Once I get the FASTQ and BAM files, I’ll see if I can look at this in more detail.

My VCF file contains 471,923 SNPs that are in my combined raw data. So 34.0% of my combined raw data are specified in the VCF file. The other 3,837,880 of the 4,309,803 sites in my two VCF files (SNPs plus inserts and deletions) are ones that none of the 5 DNA testing companies had tested. We’ll ignore those for now.

Here’s a summary of the 471,923 SNPs in common between my VCF file and my combined raw data file:

image

Of these, 98.0% were the same as they were in my combined raw data file.

The “New” column represents 6,321 SNPs that were no-calls in my combined raw data file, so my VCF allows me to define those.

The “Verify” column represents 228 SNPs that had disagreements between two or more of the raw data files, so I had set them to a no-call. The VCF could prove to be a tie-breaker in this case, but I’ll just continue to call these no-calls just to be safe.

The “Diff” column represents 2,798 SNPs that had a value in my combined raw data file, but the VCF value disagrees with it.

I could use this information to improve my raw data. I could assign values to the 6,321 no-calls, but I should then also turn 2,798 assigned values into no-calls. That would still reduce my overall number of no-calls by 3,523, from 20,688 (1.5%) to 17,165 (1.2%).
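The arithmetic in this section is easy to re-check. Here is a small Python sketch using only the figures quoted above; it assumes the count of matching SNPs is whatever remains of the 471,923 after removing the New, Verify and Diff columns.

```python
combined_snps = 1_389_750    # unique SNPs in my combined raw data file
in_common = 471_923          # SNPs found in both the VCF and the combined file
new, verify, diff = 6_321, 228, 2_798

# Assumption: everything not in New, Verify or Diff agreed between the two files.
same = in_common - new - verify - diff

no_calls_before = 20_688
# Fill in the New no-calls, then turn the Diff values into no-calls.
no_calls_after = no_calls_before - new + diff


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(in_common, combined_snps))        # share of the combined file in the VCF
print(pct(same, in_common))                 # share of common SNPs that agree
print(no_calls_before - no_calls_after)     # net reduction in no-calls
print(pct(no_calls_before, combined_snps), pct(no_calls_after, combined_snps))
```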


How Can Genetic Genealogists Use a VCF file

Two ways:

1. Upload the VCF file to a DNA matching service that accepts it.

2. Use it to create a raw data file which you can then upload to a DNA matching service that accepts it.


Uploading a VCF file to GEDmatch Genesis

One would hope that if they did a whole genome test, they would be able to upload their whole genome data to one of the companies that do DNA matching.

The only company that currently takes VCF uploads is GEDmatch Genesis. I was patient and waited the 5 minutes until the browser responded after I hit the Upload button. After that, it didn’t take very long for GEDmatch to load the file, and it reported this processing summary:

image

I made that kit “Research” and waited a day until GEDmatch completed the matching for the kit. Once the results came back, I found a problem.

The GEDmatch File Diagnostic Utility run on my combined raw data which I had previously uploaded gives this:

  image

When I run the diagnostics on my VCF file from Dante, I get this:

image

As correctly reported by the diagnostics, the All 5 file has 1,389,750 SNPs in it, and the WGS file has 3,442,712 SNPs in it.

The diagnostic then reports that my All 5 file has 1,128,146 usable SNPs which are then slimmed to 813,196 SNPs. The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most “bang for the buck”.

But my VCF file only had 590,334 usable SNPs which get slimmed to only 231,588 SNPs. That is far fewer than my All 5 file has. A WGS tests the whole genome, so it should give more SNPs than any other test or even combined tests give. So something was wrong.

Also, when I did a One to Many of my WGS kit, it matched most closely to my All 5 kit, which it should. But then it was closely followed by a whole bunch of kits of other people who are matching me close to identically. All those kits appear to be other whole genome tests.

It then became obvious to me that GEDmatch Genesis is only using the variant SNPs from the VCF file.  The reason why I get complete matches with other WGS kits is that if two people both have a variant at a position, then there is an extremely high probability that their variants are the same. And all GEDmatch is comparing between WGS files are variants.

The procedure that GEDmatch or anyone else who wants to load a VCF file needs to do is this:

  1. If a line in the VCF file has one REF value and one ALT value, then
    • If the INFO field contains:  “AC=1”, then you take the two of them.  e.g.  REF=T, ALT=C, then value is TC (or CT if you sort alphabetically)
    • If the INFO field contains:  “AC=2”, then you use the ALT value twice.  e.g.  REF=T, ALT=C, then value is CC.
  2. If a line in the VCF file has one REF and two ALT values, then you take both the ALT values.  e.g.  REF=T, ALT=C,G, then value is CG.  There are only a few hundred of these in my VCF file.
  3. If a SNP that they use is not in the VCF file, then use the reference. e.g. REF=C, to give the value CC.  They’ll need to have a reference table with the Build 37 genome reference values for all the SNPs that they use. This table would be the same for everyone.
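Step 3 is the piece that was missing. A minimal Python sketch of it follows, assuming the variant values have already been worked out with steps 1 and 2, and that `reference` is a hypothetical lookup table holding the Build 37 reference value for every SNP the matching service uses. The function name and the sample positions are made up for illustration.

```python
def fill_reference(variants, reference):
    """Build raw data for every SNP the service uses.

    variants:  {(chromosome, position): two-letter value} taken from the VCF
    reference: {(chromosome, position): reference letter} for the service's SNPs

    A SNP found in the VCF keeps its variant value. A SNP missing from the VCF
    is assumed to equal the reference on both copies (step 3). VCF variants at
    positions the service does not use are simply ignored.
    """
    raw = {}
    for key, ref_base in reference.items():
        raw[key] = variants.get(key, ref_base * 2)
    return raw


# Hypothetical positions and values, for illustration only.
variants = {("1", 100): "CT", ("1", 200): "GG", ("1", 999): "AA"}
reference = {("1", 100): "T", ("1", 150): "C", ("1", 200): "A"}

raw = fill_reference(variants, reference)
print(raw)
```

One caveat carries over from earlier: a position can also be missing from the VCF because its read did not get a PASS filter value, so a filled-in reference value is a very likely guess rather than a certainty.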

I reported this to GEDmatch and John Olson replied back and confirmed that they are not adding the reference values. He said the VCF upload will have to wait until they get caught up on their Genesis conversion issues.


Using DNA Kit Studio to Create a Raw Data File from a VCF

Wilhelm H. created a wonderful little program called DNA Kit Studio that includes a VCF to RAW converter in it.

image

It originally did not accept my VCF from Dante. I contacted Wilhelm and the reason was that Dante did not include RSID values. Wilhelm made the change and sent me a beta of the program for me to try. It now created the raw data file, and correctly did steps 1a, 1b, and 2, above.  But he, like GEDmatch, also was not including the reference genome value for the other positions.

I gave Wilhelm links to a couple of open source sites that have most of the reference values for the 23andMe and Ancestry SNPs that the companies test for. And likely when I get the rest of my whole genome data (the FASTQ and BAM files), I’ll figure out how to determine all the reference values myself.

If you can’t wait for Wilhelm to finish his update to his VCF to RAW converter, or if you don’t want to do the task yourself, you could use Wilhelm’s service and he’ll convert it for you for a small fee.


Conclusion:  Is a WGS Test Useful for Matching?

For the purposes of matching, it really only takes a raw data file from any of the major DNA testing companies to get you going. GEDmatch and some of the testing companies will accept uploads and you can get into most databases with just the one test.

You will get slightly more accurate matches at GEDmatch Genesis if you take a test from two companies, one using the old chip (AncestryDNA, Family Tree DNA or MyHeritage DNA) and one using the new chip (23andMe or Living DNA) and then use a tool like DNA Kit Studio to combine them before uploading.

But currently, I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes.
