Louis Kessler’s Behold Blog

The Benefits of Combining Your DNA Raw Data - Mon, 17 Sep 2018

In my last post, I compared the raw data from 5 DNA testing companies. I ended that article by describing how I then took my 5 sets of raw data and combined them into a more complete set. I ended up with a file that contained 1,389,750 different SNPs of which just 20,688 (1.5%) were no-calls. I then described how I uploaded it to GEDmatch Genesis.

GEDmatch Genesis has since fully processed my combined raw data file. Let me from here on call it my All 5 kit.

Earlier this year (in January), I had uploaded my 23andMe raw data to GEDmatch Genesis. That file had 613,899 SNPs, of which 16,906 (2.8%) were no-calls.

So the question is whether or not there are any tangible benefits from combining your raw data together and using that instead of using the raw data from one company. Let’s take a little adventure and see.


We Want to Ensure We Have Adequate Overlap

We want accurate matches. One of the challenges the people at GEDmatch are taking on is to overcome the differences in the coverage of SNPs provided by the DNA testing companies. The situation became worse when the new v5 chip was released and 23andMe and Living DNA both started using it. The new chip tests many SNPs that the old chips do not, and only about 14% of its SNPs are in common with the old chips. This reduces the accuracy of matches when tests from 23andMe or Living DNA are compared to tests from Family Tree DNA, AncestryDNA or MyHeritage DNA.

That 14% I allude to is known as overlap. GEDmatch defines overlap as:

‘Overlap’ is the number of positions that exist in common between both kits, without regard to whether they match or not. The amount of overlap, along with the largest cM amount, is usually a good indication of the relative quality of the match.
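
As a rough illustration of that definition, overlap can be thought of as the size of the intersection of the SNP positions found in two kits. The sketch below is not GEDmatch's code. The kit representation (a dict keyed by chromosome and position) and the function name `overlap` are assumptions made for this example.

```python
def overlap(kit_a, kit_b):
    """Count positions present in both kits, regardless of genotype."""
    return len(kit_a.keys() & kit_b.keys())

# Hypothetical toy kits: (chromosome, position) -> genotype
kit_a = {("1", 1000): "AA", ("1", 2000): "AG", ("2", 500): "CC"}
kit_b = {("1", 2000): "GG", ("2", 500): "CC", ("3", 700): "TT"}

# Position ("1", 2000) counts even though the genotypes differ.
print(overlap(kit_a, kit_b))  # 2
```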

Overlap as a measurement was included in GEDmatch Genesis’ new One-to-Many report.

Important note: GEDmatch Genesis has two One-To-Many reports. The second has (Proto) written in red, obviously meaning it is a prototype or experimental. (Genesis actually has a third One-To-Many report in its Tier 1 Utilities, but that one does not give overlaps.)

The first report does not give correct overlap values. The second report with Proto does.

image

On the GEDmatch Facebook group, Aaron Wells, one of the developers at GEDmatch, said this about the Proto version of the One-To-Many report:

The "Proto" version is one that will be the final version of One-To-Many once migration is complete. There is now a brand new database, which will be the final database of matches. It was first populated with all native genesis kit matches to other native genesis kits. Next, the original gedmatch matches are being moved over as we speak, populating this new database. The "Proto" one-to-many is pulling matches from the new and final database.

The One-to-Many Proto report shows the overlap in its own column:

image

The above table shows the beginning of my GEDmatch Genesis One-to-Many(Proto) report for my 23andMe kit. The first line I match to is my All 5 kit, and that matches at 3572.4 cM which is 100%, so thankfully I match myself. The other lines show matches with other people. So far my closest match at GEDmatch Genesis is only 88.9 cM and I don’t know how I am related to that person or any of my other matches there.

Now let’s take a look at the overlap column of this report. You’ll see a number of 23andMe kits that I match to with an overlap of about 317,000. You’ll see 3 FTDNA kits and one Ancestry kit that I match to with overlaps of about 70,000, shaded light red. And you’ll see a 23andMe kit and an unspecified kit with overlaps of about 50,000, shown in a darker red. I did not add the red shading. It is on the report from GEDmatch Genesis. They write:

Matches with low overlap have that field highlighted with a pink or red background, depending on the overlap value.
Matches with very low overlap are not shown.

So obviously, the quality of the matches in red is suspect. We want to see if a combined raw data file can improve on that.

The overlap of my 23andMe kit with most of the other people’s 23andMe kits is good, as would be expected. They both use the same chip and effectively test the same SNPs. Theoretically there should be 100% overlap, but I have found when comparing different people’s Family Tree DNA raw data that they average 702,000 SNPs, with one being as high as 708,092 and one as low as 680,544 SNPs. So there are variations of a percent or two in coverage that may occur even from the same chip.

My 23andMe v5 test results included 613,899 SNPs and other people testing at 23andMe on the v5 chip should get about the same. So the overlap should be something over 600,000, but GEDmatch Genesis is only showing about 317,000. Ann Turner told me on the Facebook GEDmatch User Group that GEDmatch Genesis discards SNPs with a low minor allele frequency, i.e. only the SNPs whose values vary the most often will be used. So GEDmatch Genesis is using the roughly 50% of SNPs that give them the best bang for the buck. This means that the overlap values that are reported by them are about half of the overlap numbers you’ll find in the Autosomal SNP comparison chart at the ISOGG Wiki.
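
To make the idea of discarding low minor allele frequency SNPs concrete, here is a minimal sketch. It is only an illustration: the allele counts, the SNP names, the function names and the 0.05 threshold are all hypothetical, and GEDmatch's actual cutoff and method are not stated in this post.

```python
def minor_allele_frequency(allele_counts):
    """Share of the second most common allele at one SNP (0.0 if none)."""
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    counts = sorted(allele_counts.values(), reverse=True)
    return counts[1] / total

def informative_snps(snp_allele_counts, min_maf):
    """Keep only SNPs whose minor allele frequency reaches min_maf."""
    return {snp for snp, counts in snp_allele_counts.items()
            if minor_allele_frequency(counts) >= min_maf}

# Hypothetical allele counts observed across a reference population
population = {
    "rs_example_1": {"A": 50, "G": 50},   # MAF 0.50, varies a lot
    "rs_example_2": {"C": 98, "T": 2},    # MAF 0.02, almost never varies
    "rs_example_3": {"G": 70, "T": 30},   # MAF 0.30
}

kept = informative_snps(population, min_maf=0.05)  # threshold is assumed
print(sorted(kept))  # ['rs_example_1', 'rs_example_3']
```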



How Much Overlap is Not Enough?

The overlap of the v5 chip kits against non-v5 chip kits is not good. When I analyze my One-to-Many report in Excel, I can group the amounts of overlap by GEDmatch’s color coding. They use dark red for my overlaps of less than about 55,000, a lighter red for overlaps between 55,000 and 72,000, and an even lighter red up to 100,000. Anything above 100,000 is not shaded and is deemed okay.
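
The grouping I did in Excel can be sketched as a small function. The cutoffs below are the approximate ones I read off my own report, not values published by GEDmatch, and the handling of the exact boundary values is an assumption.

```python
def overlap_shade(overlap):
    """Map an overlap count to the approximate shading seen on the report."""
    if overlap < 55_000:
        return "dark red"
    if overlap < 72_000:
        return "lighter red"
    if overlap <= 100_000:
        return "lightest red"
    return "unshaded"

for value in (50_000, 71_660, 90_000, 317_000):
    print(value, overlap_shade(value))
```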

People who upload to GEDmatch Genesis are asked to supply the name of the company they tested with. It’s free form text, but most of the company names listed are identifiable.

The One-To-Many(Proto) report gives the 3,000 closest matches. So here are the overlaps for the 3,000 matches of my 23andMe test against other people’s kits from various companies.

image

Over half of my matches also tested with 23andMe. But almost 34% of them have low overlap because those were from tests done before 23andMe started using the v5 chip a couple of years ago.

Almost all of the Ancestry DNA, Family Tree DNA and MyHeritage DNA tests have low overlap with my 23andMe kit.

Almost all the Living DNA kits had good overlap with my 23andMe kit. That’s because Living DNA has only used the v5 chip and tests almost the same SNPs as 23andMe.

The bottom line is that over half (54.5%) of my matches are deemed to not have enough overlap.


The Improvement You Get Using Combined Raw Data

My All 5 kit includes all the SNPs tested, not only from 23andMe, but also from the other 4 major companies: Ancestry DNA, Family Tree DNA, Living DNA and MyHeritage DNA.
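
For readers who want a feel for what combining involves, here is a minimal sketch. It is not the exact procedure I used. It assumes each kit is already loaded as a dict of SNP ID to genotype, that `--` marks a no-call, and that SNPs where two companies disagree are turned into no-calls. All names and data are hypothetical.

```python
from collections import defaultdict

NO_CALL = "--"

def normalize(genotype):
    """Treat 'GA' and 'AG' as the same call by sorting the alleles."""
    return "".join(sorted(genotype))

def combine_kits(kits):
    """Merge kits: keep agreed calls, no-call anything missing or in conflict."""
    calls = defaultdict(set)
    for kit in kits:
        for snp, genotype in kit.items():
            seen = calls[snp]              # records the SNP even for a no-call
            if genotype != NO_CALL:
                seen.add(normalize(genotype))
    return {snp: next(iter(seen)) if len(seen) == 1 else NO_CALL
            for snp, seen in calls.items()}

# Hypothetical toy kits from three companies
kit_one = {"rs1": "AG", "rs2": "--", "rs3": "CC"}
kit_two = {"rs1": "GA", "rs2": "TT", "rs4": "--"}
kit_three = {"rs3": "CT", "rs5": "GG"}

combined = combine_kits([kit_one, kit_two, kit_three])
no_calls = sum(1 for g in combined.values() if g == NO_CALL)
print(len(combined), no_calls)  # 5 2
```

Here rs2 is filled in by the second kit, rs3 becomes a no-call because two kits disagree, and rs4 stays a no-call because no kit has a call for it.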

Analyzing my 3,000 matches of my All 5 kit in the same manner gives these results:

image

Well that is a lot better. Almost all the low overlap matches are gone.


How has the Overlap Changed?

Of the 3,000 matches to my 23andMe kit, 611 of them were not also matches to my All 5 kit. They got dropped off the list and replaced by new ones. That means over 20% of the matches were refuted or at least found to be not as good as they were originally stated to be after I improved my overlap count with the combined raw data.
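
The comparison of the two match lists is a simple set operation. The sketch below uses invented kit IDs and a hypothetical function name, and then repeats the arithmetic with the counts from my own reports.

```python
def compare_match_lists(before, after):
    """Split the 'before' matches into those dropped and those kept."""
    dropped = before - after
    common = before & after
    dropped_share = len(dropped) / len(before) if before else 0.0
    return dropped, common, dropped_share

# Hypothetical kit IDs, invented for this example
matches_23andme = {"K1", "K2", "K3", "K4", "K5"}
matches_all5 = {"K2", "K3", "K4", "K5", "K9"}

dropped, common, share = compare_match_lists(matches_23andme, matches_all5)
print(sorted(dropped), len(common), f"{share:.0%}")  # ['K1'] 4 20%

# The same arithmetic with the counts reported in this post
print(3000 - 611, f"{611 / 3000:.1%}")  # 2389 20.4%
```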

If I compare the 2,389 matches that were in common to both kits, I get this very interesting graph:

image

The bottom axis is the overlap that the match has with my 23andMe kit and the left axis is the overlap the same match has with my All 5 kit.

The diagonal line is where the overlap is the same. In all cases the All 5 kit had at least as much overlap as the 23andMe kit. That is expected because it includes all my 23andMe SNPs.

Around 300,000 and 300,000 you’ll see a blob of yellow (Living DNA) and blue (23andMe). These are the v5 tests. My 23andMe kit already had all the SNPs from that test, so my combined kit made very little improvement there.

But that line of dots on the left is what got improved. Those are all the matches that had under 100,000 overlap with my 23andMe test. They all got improved significantly when matched with my All 5 kit.

In case you’re wondering, those 3 highest green dots were matches to kits from DNA Land (at the top), an FTDNA kit merged with a 23andMe kit (next highest), and a Genes for Good kit (third highest).


But Have My Matches Themselves Changed?

Let’s now compare those 2,389 individual matches. GEDmatch Genesis’ One-to-Many report gives Total cM, Largest segment cM and a Gen value, which is an estimate of the number of generations back to your common ancestor.

By using a combined raw data file, this is how many of my match numbers changed:

image

More than half the matches had exactly the same values with my combined raw data as they had previously.  Those were mostly matches with v5 tests where my 23andMe data already had good overlap and gave good results.

For the matches that changed, the average Total cM went down 7.0 cM, the average Largest segment cM went down 1.5 cM, and the average Gen value increased by 0.3 generations.
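
Averages like these come from looking only at the matches whose values differ between the two reports. Here is a minimal sketch of that calculation for Total cM. The three pairs of numbers are invented for the example and are not taken from my reports.

```python
from statistics import mean

def summarize_changes(pairs):
    """Return (share unchanged, mean change among the changed matches)."""
    deltas = [after - before for before, after in pairs]
    changed = [d for d in deltas if d != 0]
    unchanged_share = 1 - len(changed) / len(pairs) if pairs else 0.0
    mean_change = mean(changed) if changed else 0.0
    return unchanged_share, mean_change

# Hypothetical (Total cM with 23andMe kit, Total cM with combined kit)
total_cm_pairs = [(100.0, 90.0), (60.0, 60.0), (50.0, 46.0)]

unchanged_share, mean_change = summarize_changes(total_cm_pairs)
print(f"{unchanged_share:.0%} unchanged, mean change {mean_change:+.1f} cM")
# 33% unchanged, mean change -7.0 cM
```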

This means the addition of extra SNPs made the relationships with the matches a little bit weaker. It also means that a match without good overlap slightly overestimates the strength of the match.

The elimination of some matches in combination with the reduction of the Total cM of some matches resulted in reducing the Total cM of the 3,000th match from 54.4 cM with the 23andMe data, to 35.8 cM with the combined data.



One Specific Match

Let’s see what effect a combined raw data file has on one specific match. Here’s my top Ancestry DNA match, who is displayed on the 2nd line of the first table at the beginning of this post.

When compared with my 23andMe kit, this Ancestry DNA kit is an 88.9 cM match, with largest segment 19.0 cM and 3.7 generations. It has an overlap of 71,660 which is said not to be good.

Here is that Ancestry kit’s One-to-One comparison with my 23andMe kit for chromosomes 1 to 6:

image

When compared with my All 5 kit, the same Ancestry DNA kit is a total match of 79.1 cM, with largest segment 19.1 cM and 3.8 generations. It now has an overlap of 308,924.

The main difference is the dropping of the 8.8 cM match on chromosome 6 from the match list. Here’s the One-to-One comparison with my All 5 kit:

image

You will notice that there appear to be more red lines with the All 5 kit. The red lines are Base Pairs with No Matches. This indicates again that adding more SNPs to your raw data and getting more overlap will tend to disprove matches that might otherwise have been thought to be valid.

Also notice the information in the diagrams about the two segment matches on chromosome 3. Those segments only included 578 and 437 SNPs with the 23andMe kit, but that went up to 2,050 and 1,604 SNPs with the All 5 kit. That’s an increase in SNP density by a factor of about 3.6.
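
The increase can be checked directly from the four SNP counts quoted above. This is only a quick arithmetic check, and the variable names are my own.

```python
# SNP counts for the two chromosome 3 segments, as quoted above
snps_23andme = [578, 437]
snps_all5 = [2050, 1604]

ratios = [round(after / before, 2)
          for before, after in zip(snps_23andme, snps_all5)]
print(ratios)  # [3.55, 3.67]
```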


Conclusion and Recommendation

Combining raw data from different companies does seem to provide some degree of increased accuracy in your match information and eliminates a few incorrect segments. Using it will drop off some of your more distant matches from your match list and replace them with others.

The real issue is overlap. The v5 chip from 23andMe and Living DNA works well against other v5 chips but not as well against the older chips. The same goes for the non-v5 chips of AncestryDNA, Family Tree DNA and MyHeritage DNA. They work well for comparisons with each other, but not as well against the v5 chip.

You might want to consider testing once at 23andMe or Living DNA with the v5 chip, and once with one of the other companies using the older chips. Then you can combine your two sets of raw data with a nice free tool by Wilhelm Halys called DNA Kit Studio.

Then whenever you want to transfer or upload your raw data to a DNA testing or analysis site, you can use this combined raw data for slightly more reliable results.

Will you really notice the difference? To be honest, I doubt it.
