
Louis Kessler’s Behold Blog

Combine Kits into One Superkit on GEDmatch Genesis - Sat, 6 Apr 2019

Today GEDmatch Genesis added a new Tier 1 application. They state:

image

I did that myself manually with 5 kits about 6 months ago, uploaded my combined raw data to GEDmatch Genesis, and reported the results in my post: The Benefits of Combining Your DNA Raw Data.

I thought I’d try the new GEDmatch Genesis application to see if it produces essentially the same result.

I selected the Tier 1 “Combine multiple kits into 1 superkit” application and it gave me the option to select up to 4 kits that are already uploaded. I had all my 5 kits uploaded and I selected FTDNA, 23andMe, Ancestry and LivingDNA. I left out MyHeritage, which includes almost the same SNPs as my FTDNA file does.

image

I pressed the “Generate” button and within a second, I got my combined kit:

image

Comparing my kits using the GEDmatch Diagnostic utility gives:

image

When I manually combined the kits, I got 1,389,750 SNPs, but GEDmatch combines only 1,123,247 SNPs, keeping just the ones it knows it is going to use. Slimmed SNPs are what GEDmatch actually uses for comparisons with other kits. I’m surprised that GEDmatch’s 834,457 slimmed SNPs are over 20,000 more than in my manually combined kit. I have no explanation for that.
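As a rough illustration of what combining kits involves, here is a minimal Python sketch. It is only an assumption-laden example: the function name `combine_kits`, the dictionary layout (rsid to genotype) and the rule that the first kit with a real call wins are all hypothetical, and GEDmatch does not say how its own tool resolves these cases.

```python
NO_CALL = "--"  # assumed marker for an unreadable SNP


def combine_kits(kits):
    """Merge several kits (dicts of rsid -> genotype) into one.

    Hypothetical rule: take the union of all SNPs and, for each SNP,
    keep the first genotype that is not a no-call, in the order given.
    """
    combined = {}
    for kit in kits:
        for rsid, genotype in kit.items():
            # Fill the SNP if it is new or so far only a no-call.
            if combined.get(rsid, NO_CALL) == NO_CALL:
                combined[rsid] = genotype
    return combined


# Made-up toy data, not real raw data.
ftdna = {"rs1": "AG", "rs2": "--"}
ancestry = {"rs2": "CC", "rs3": "TT"}
superkit = combine_kits([ftdna, ancestry])
print(len(superkit), superkit)
```

The combined kit has more SNPs than either input, and a no-call in one kit is filled by a real call from another, which is the benefit of combining.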

I’ve included my Whole Genome kit from Dante, for which GEDmatch loads only the SNPs in the VCF file. Those SNPs are the ones where I am different from the human reference genome. The SNPs where I am the same as the human reference genome are not included. The GEDmatch people still have to fix the upload of VCF files so that the human reference genome values are added when a SNP is not included in the file.
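The kind of fix needed for VCF uploads can be sketched as follows. This is a hedged illustration, not GEDmatch’s code: `fill_reference_calls` and both input dictionaries are hypothetical, and the sketch assumes that a SNP missing from the VCF is homozygous for the reference allele, which is not true wherever the sequencing simply had no coverage.

```python
def fill_reference_calls(vcf_calls, reference_alleles):
    """Return a genotype for every SNP listed in reference_alleles.

    vcf_calls: rsid -> genotype, only for positions present in the VCF.
    reference_alleles: rsid -> reference base for the SNPs we want.
    Assumption: absent from the VCF means homozygous reference.
    """
    filled = {}
    for rsid, ref in reference_alleles.items():
        # Use the VCF call if there is one, else two copies of the reference base.
        filled[rsid] = vcf_calls.get(rsid, ref + ref)
    return filled


# Made-up toy data.
vcf_calls = {"rs2": "CT"}
reference_alleles = {"rs1": "A", "rs2": "C", "rs3": "G"}
filled = fill_reference_calls(vcf_calls, reference_alleles)
print(filled)
```

Without this step, only rs2 would be loaded from the toy VCF above; with it, all three SNPs are available for comparison.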

The one to one comparison was possible immediately, so I compared the GEDmatch combined kit to each of my individual kits, and to my manually created All-5 kit.

image

All of the comparisons indicate that I match myself at least 99.210%. It’s not important that there are some small breaks in the matching segments, which result in more than 22 shared segments. I expect that when the one-to-many comparisons become available, the overlaps will improve just as they did with my manually combined file.



The Bottom Line

If you’ve tested with multiple companies and you subscribe to Tier 1, you should combine your kits to get better comparisons at GEDmatch Genesis. Make sure you make this combined kit the kit for yourself that you use for matching, and change all the others to Research so that you show up only once in other people’s match list.

The only unfortunate thing is that you don’t have access to your raw data at GEDmatch. So you won’t know exactly what they did and you won’t have the raw data for yourself to look at or use for other purposes.

Genetic Affairs Clustering at 23andMe - Wed, 20 Mar 2019

Today Evert-Jan Blom, author of Genetic Affairs and the new clustering algorithm implemented by MyHeritage DNA, posted on the Genetic Genealogy Tips & Techniques group on Facebook. He announced some improvements to his AutoCluster analysis on Genetic Affairs for 23andMe matches.

He posted:

A well known feature for the DNA relatives list on 23andme are the Relatives in Common. What is interesting is that 23andme, just like MyHeritage, supplies the shared cM values between shared matches. On MyHeritage, we use this data to improve the analysis of people from endogamous populations.
In addition, there is a Shared DNA column in the Relatives in Common list. The Matches marked with a “Yes” have overlapping segments – and, according to the research of Jim Bartlett (https://segmentology.org/2017/…/20/triangulation-at-23andme/) over 99% of the time these matches form a Triangulated Group (TG).

The shared cM values between matches as well as TG information is now employed for 23andme AutoCluster analyses. …

But what about these TG data? … The rectangles that contain a DNA helix symbol have overlapping segments and probably form a TG. I’ve already discovered some clusters that could be extended by taking into account some grey cells that in fact were TGs. …

So we supplement the ICW based 23andme AutoCluster analyses with TG data which already improves the analysis. And, although we know that not all members of a (large) cluster will form TGs, wouldn’t it be interesting to only take into account TG data? We thought so as well, and therefore this feature is now also available for 23andme AutoCluster.

I wrote about MyHeritage’s New AutoClustering feature 3 weeks ago, showing my results. Unfortunately, I don’t have any DNA matches at MyHeritage whose relationship I truly know, so I couldn’t identify the ancestral source of any of the clusters.

But at 23andMe, I have a number of known relatives who tested.  So I should be able to identify some of the clusters. Let’s see how it goes.

I went to my account at Genetic Affairs and added my 23andMe website to it. Then I requested an autocluster using the default parameters:

image

I performed an analysis first “Based on shared matches” and then did it again after selecting “Based on Triangulated Groups”.

For the shared matches, I got all one big cluster of 54 members:

image

So that’s my endogamy and 1 cluster is not of much help. However, notice that some of the cells have a little DNA symbol in them, like this:

image

These are people who are not only ICW (In Common With) myself and the row person and column person, but also are shown in the 23andMe Relatives in Common list as a “Y” in the Shared DNA column. That means that all three of us share at least one common segment of DNA with each other, i.e. we triangulate somewhere.

So Evert-Jan had the innovative idea to allow just the use of these triangulating people to be used for clustering. When my second run based on Triangulated Groups came back, it looked like this:

image

This initially got me really excited. There were just four clusters and I was hoping that this clustering had done the trick and divided my DNA relatives into my four grandparent groups. Did it?



The Trouble with Using Triangulations for Clustering

Unfortunately, I noticed something very important. The first person in the red group is Bruce, a 3rd cousin of mine. The first person in the purple group is Rick, his brother, also my 3rd cousin. If you go down the column of the first red box to the row of the first purple box, you can see the two of them have a grey square with a DNA symbol in it, meaning the three of us triangulate. We have two full brothers who triangulate with me and who absolutely must be in the same cluster no matter which way you look at it. So why aren’t they?

While I was looking at all of this, Evert-Jan himself Facebook messaged me, and we started discussing this problem. I then noticed that the 3rd purple box in the last row and last column with Rick in the purple group was Rick’s daughter Arianna. If you then look down from Bruce’s red box to Arianna’s row, you’ll see there is no triangulation!

image

Evert-Jan and I discussed this for a while. Why were Bruce and his niece Arianna not triangulating with me? I then went to 23andMe and compared our shared DNA in their chromosome browser:

image

Sure enough. The 5 segments that I share with Arianna, Bruce does not share with me. Arianna, Bruce and I do not triangulate.

But it’s even worse. Bruce and his brother Rick only share the same segment with me in two places, the large segment on Chromosome 2, and the very small one on Chromosome 9. It could have been that Bruce and Rick might not even have shared those same two segments with me. In that case, Bruce, Rick and myself may not have triangulated. Then we’d have a case where two brothers would never be put in the same cluster using triangulation groups.

Evert-Jan summed it up. He said to me: “TG clustering breaks up family bonds … and if you were using ICW they would most probably be placed in the same clusters.”



Some Thoughts

I think the conclusion is that triangulated segments and triangulated groups give good information to help you try to determine who might be your common ancestors. But they are not all-inclusive and close relatives need not have been passed down the same segment of DNA that the rest of the group received. Therefore using triangulated segments may separate close relatives when clustering.

ICW (In Common With) information, by contrast, will never separate close relatives and is therefore likely better for cluster analysis than triangulated groups.

Very interesting. Not every analytical technique works out exactly as expected.

Evert-Jan was initially wondering if anyone wrote about this before. I told him that other than Jim Bartlett’s article, I doubt it, because Evert-Jan’s clustering program is the first tool ever invented that has let us look at clustering of triangulation groups.

So now it will take some thinking. Despite the possible breakup of close relations, can the ideas of triangulation group clustering still be used? Is there maybe some way of merging the ICW cluster information together with the triangulation information? We don’t know yet, but great analytical minds like Evert-Jan will be thinking hard, and that will ultimately result in new ideas along with new and better analysis software for you to analyze your DNA matches.




Update: March 21: 

I noted that the TG clustering did split up a family into 2 clusters which is a problem. But I failed to mention that the 4 clusters are still good clusters, where the people in each cluster do appear to be from the same family.

Since shared match clustering gave me one big cluster because of endogamy, I didn’t get anything at all from that. But Evert-Jan’s TG clustering gave me 4 clusters, providing me information where shared match clustering alone did not.

Maybe Evert-Jan can figure a way to parlay the information that the TG clusters have to make the shared match clustering even better.

Small Segment Matches - Tue, 19 Mar 2019

Blaine Bettinger posted a poll on his Genetic Genealogy Tips and Techniques Facebook group about 7 hours ago. It is a closed group, but if you are a member you can see the poll here.  In four hours, the poll got almost 800 responses and over 350 comments.

Blaine asked people to go to GEDmatch Genesis and do a one-to-one comparison between their kit and his kit, but reduce the minimum segment threshold down from the default of 7 cM and do the comparison using a minimum of 3 cM instead.

This little poll/experiment is well designed to help people realize that most small single matching segments are false, and to realize how many of them they might have with someone to whom they are likely not related at all. This happens because we have a pair of each chromosome, and there is a high probability that an allele on either of our chromosomes will match one of another person’s two alleles at the same position. What ends up happening is that we can get random matches on segments as large as 15 cM. As a segment gets larger, the laws of probability say that a run of random allele matches becomes less likely, the same way you can’t keep flipping heads on a coin forever. Above 15 cM you can be fairly certain that almost all segment matches are real, i.e. likely Identical by Descent (IBD) and likely passed down from a common ancestor who was anywhere from 1 to 20 generations back.

So here’s the result of the poll when I snapshot it:

image

649 out of the 738 people (88%) who responded, including me, had multiple segment matches with Blaine that were 3 cM or more but not larger than 7 cM. Only 42 people (6%) did not share any segments 3 cM or more with Blaine.

These, for example, are the segments I share with Blaine:

image

There’s 8 matching segments totaling just 30.7 cM. The largest is just 4.9 cM. The most SNPs shared is 733. That sort of means we have 733 SNPs in a row where one of my alleles matches one of Blaine’s alleles, but due to misreads, GEDmatch and other DNA companies usually allow for a mismatch every now and then, maybe one or two every 100 or so. Also there are usually a few percent no-calls (unreadable SNPs) that are always treated as a match. Maybe I have 15 no-calls and Blaine has 15 over those 733 SNPs, so that’s 30 positions that may not be matches but are treated so.
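To make the “one of my alleles matches one of Blaine’s alleles” idea concrete, here is a small Python sketch of that kind of test. It is a simplification under assumptions: the names `half_identical` and `is_matching_run`, the `--` no-call marker and the `max_mismatches` parameter are hypothetical, and real matching algorithms such as GEDmatch’s are more elaborate than this.

```python
NO_CALL = "--"  # assumed marker for an unreadable SNP


def half_identical(g1, g2):
    """True if two unphased genotypes share at least one allele.

    No-calls are treated as a match, as described above.
    """
    if g1 == NO_CALL or g2 == NO_CALL:
        return True
    return bool(set(g1) & set(g2))


def is_matching_run(kit_a, kit_b, max_mismatches=1):
    """True if a run of SNPs matches, tolerating a few mismatches."""
    mismatches = sum(
        1 for a, b in zip(kit_a, kit_b) if not half_identical(a, b)
    )
    return mismatches <= max_mismatches


# Made-up genotypes at five consecutive SNPs for two unrelated people.
person_a = ["AG", "CC", "--", "TT", "AG"]
person_b = ["GG", "CT", "AA", "TT", "AA"]
print(is_matching_run(person_a, person_b, max_mismatches=0))  # True
```

Every position in the toy example passes even though only one of the five genotypes is identical. That is why short runs of matching SNPs are so easy to get by chance.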

Blaine is rightfully trying to get the attention of genealogists to inform them to be wary of these small segment matches. They are single matches. They are dangerous, because most are false. Confirmation bias, where you think someone is related and then believe that some small segments are the connection, must be avoided. I likely share zero DNA with Blaine, yet I’ve got 7 segments showing here. Don’t believe it. You need more than this.

So I thought I’d look into these small segment matches people have with Blaine and see if I could learn more. Among the 350 comments to the poll in the first 4 hours, there were 25 people who posted their matches to Blaine like I did above. I entered their matches as well as mine into a spreadsheet so I could do some analysis.

The 26 of us have 247 segment matches with Blaine. That’s on average 9.5 matches. The fewest is 2. The most is 17. On average, each of us matches Blaine on 35 cM, with a minimum of 8 cM and a maximum of 64 cM.

If all those small segments are real, then potentially 64 cM could indicate a 3rd cousin with Blaine, but that’s the best that could happen. It is much more likely none of the 26 of us and Blaine are truly DNA related because almost all those small segment matches are false.


Does Triangulation Help?

I don’t know of any studies of triangulation done with small segments. The best I can quote is Jim Bartlett’s observations that just about all his triangulations are true down to 7 cM and most are true down to 5 cM.

I have done a related study. It is not triangulation per se, but it is what I would call Parental Filtering. It finds segments that a child matches but neither parent matches, thus indicating that the segment match of the child is false. They cannot be matching through one chromosome (or they would match one of their parents), so they must be matching through both their chromosomes randomly. Parental filtering effectively forces the match through one chromosome just as triangulation does, so it is a good first cut estimate of how triangulation might work on small segments. The final result of my study was this graph:
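The idea of parental filtering can be sketched in a few lines of Python. This is an illustration under assumptions rather than the procedure used in the study: segments are represented as hypothetical (chromosome, start, end) tuples, the function names are made up, and any overlap with a parent’s segment is taken as enough to keep the child’s segment.

```python
def overlaps(seg1, seg2):
    """Segments are (chromosome, start, end); True if they share any position."""
    return seg1[0] == seg2[0] and seg1[1] <= seg2[2] and seg2[1] <= seg1[2]


def parental_filter(child_segments, father_segments, mother_segments):
    """Split the child's segments with one match into kept and rejected.

    A child's segment is kept only if at least one parent matches the
    same person on an overlapping segment.
    """
    parent_segments = father_segments + mother_segments
    kept, rejected = [], []
    for seg in child_segments:
        if any(overlaps(seg, p) for p in parent_segments):
            kept.append(seg)
        else:
            rejected.append(seg)
    return kept, rejected


# Made-up positions for one match person.
child = [(2, 10_000_000, 14_000_000), (9, 5_000_000, 8_000_000)]
father = [(2, 9_500_000, 14_200_000)]
mother = []
kept, rejected = parental_filter(child, father, mother)
print(kept, rejected)
```

In the toy data the chromosome 9 segment is rejected because neither parent matches there, so the child’s match on it must be false.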

image

This tells us that parental filtered segments almost always match when they are at least 7 cM. But between 3 cM and 7 cM, you still can have a lot of false matches even if both the child and a parent match.

So my question is, do any of the 247 segment matches of the 26 people with Blaine triangulate?  If they do, is it a true match that is a small segment that is IBD, or is the triangulation a false match?

I sorted the 247 segment matches by chromosome, starting position and ending position. And I looked for overlapping segments. Guess how many there were? Would you believe 147 (60%) overlapped with one or more other segment matches. This happens because once you start to get as many as 247 segment matches, it becomes like the birthday problem (How many people in the room before 2 have the same birthday). A 4 cM match has a 1 / 500 chance of overlapping with another 4 cM match. Once you have 247 segment matches, you have 247 x 246 / 2 = 30,381 possible match combinations, so there is an expected value of 30,381 / 500 = about 61 overlapping pairs if they happened at random. But as the segment matches start to fill up the chromosome, the chance of overlapping starts to increase. So my observed number of 147 overlapping segments is quite plausible. If I had entered 1000 segment matches with Blaine, almost all of them likely would overlap with at least one other segment match.
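The sort-and-scan step and the birthday-problem arithmetic can be sketched in Python as follows. The (chromosome, start, end) tuple layout and the function name `count_overlapping` are assumptions made for illustration, and the toy segments are made up. Only the 247 and the 1 / 500 figure come from the text above.

```python
from math import comb


def count_overlapping(segments):
    """Count segments that overlap at least one other segment.

    segments: list of (chromosome, start, end) tuples.
    """
    ordered = sorted(segments)  # by chromosome, then start, then end
    flagged = [False] * len(ordered)
    for i, (chrom, start, end) in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            c2, s2, e2 = ordered[j]
            # Later segments start even further right, so stop scanning.
            if c2 != chrom or s2 > end:
                break
            flagged[i] = flagged[j] = True
    return sum(flagged)


toy = [(1, 100, 200), (1, 150, 300), (1, 400, 500), (2, 100, 200)]
print(count_overlapping(toy))  # 2

pairs = comb(247, 2)            # 247 x 246 / 2
expected_pairs = pairs / 500    # assumed 1 / 500 chance per pair
print(pairs, round(expected_pairs, 1))  # 30381 60.8
```

Note that the expected value counts overlapping pairs while the 147 counts segments involved in at least one overlap, so the two numbers are related but not directly comparable.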

This is what I might call the “chromosome browser phenomenon”. People see their segment matches lining up in the chromosome browser and assume they must be valid matches because they are all lining up. False, false, false!

The chromosome browser shows you double matches. A double match is where Person A matches Person B and Person A also matches Person C on the same segment. Double matches are simply alignment of single matches. There is nothing there that tells you that any of them are valid if they are small segments.

The important step to validate a double match is to see if that match triangulates. What you need to do is check that Person B also matches Person C on the same segment. What that will usually do is, like parental filtering, force the match to be on just one chromosome between the three sets of people: A and B, A and C, B and C. (There are special case exceptions, but I won’t get into that here).
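The difference between a double match and a triangulation can be expressed as a short Python check. This is a sketch under assumptions: the (chromosome, start, end) tuples and the function names are hypothetical, the positions are made up, and the special-case exceptions mentioned above are ignored.

```python
def overlap(seg1, seg2):
    """Return the shared (chromosome, start, end) of two segments, or None."""
    if seg1[0] != seg2[0]:
        return None
    start, end = max(seg1[1], seg2[1]), min(seg1[2], seg2[2])
    return (seg1[0], start, end) if start < end else None


def triangulates(ab_segments, ac_segments, bc_segments):
    """True if some region is shared by the A-B, A-C and B-C matches."""
    for ab in ab_segments:
        for ac in ac_segments:
            double = overlap(ab, ac)  # the double match person A sees
            if double is None:
                continue
            # The extra step: B and C must also match each other there.
            if any(overlap(double, bc) for bc in bc_segments):
                return True
    return False


ab = [(2, 100, 500)]
ac = [(2, 300, 700)]
print(triangulates(ab, ac, [(2, 250, 600)]))  # True
print(triangulates(ab, ac, []))               # False
```

In the second call, A still sees B and C lining up in the chromosome browser, but because B and C do not match each other it is only a double match.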

GEDmatch allows you to check triangulations. You can do a one-to-one comparison of any two people if you know their Kit number. In Blaine’s comments, about 10 of the 26 people who gave their matches inadvertently included their Kit number in their screen shot. Now I’m not going to hack their account, but I am going to do some one-to-one comparisons between them to check to see if any of these false matches are triangulations.

There only happened to be 8 overlapping matches for the people who have given their Kit numbers. That should do for a very rough estimate as to how many of these small segments triangulate. These are the overlaps I found and checked:

image

Interestingly I am in 5 of these 8 matches. I suspect the reason why I am in so many is because my kit I uploaded is a combination of the raw DNA from 5 companies. Therefore, my kit has SNPs to match all the companies, and I’m guessing that must allow me to match more segments than a single company kit would.

Checking the 8 sets of matches, I find only one of them triangulates. I’m showing the Sonia versus Louis segment match on the last line of the above table in green.



What does this all mean?

Well 7 of these 8 small segment matches do not triangulate. If a segment does not triangulate, it cannot be IBD and therefore is almost assuredly false.

This does say that triangulation is a good way of further filtering out small false segments. If triangulation can eliminate 7 out of 8 segments for you, then it will have saved you a lot of unnecessary analysis.

And that 8th segment:  Is it real? If the segment is above 7 cM, then Jim Bartlett would tell you that it is very likely real. But if it’s under 7 cM, then we can’t say for sure. Triangulation alone cannot prove that a small segment is real. It can only disprove a segment. You would need to analyze other matches on the same segment and check that they all match each other. Then you have a triangulation group which is genetic genealogy’s version of “a preponderance of evidence” which starts to tell you something. But there are still a lot of caveats. With small segments, each of the 3 matches may be matching randomly, or two may be matching and the third is a random match. And you may have a valid triangulation group for some people in the group, with the others in the group matching randomly to the group’s common segment.

Also note that even if you find a true triangulation group for a small segment, you must realize that a segment that small could very easily have come from an ancestor 10, 20 or even 30 generations back, so you may never find the common true ancestor for most of your small segment matches.