
Louis Kessler’s Behold Blog

Genetic Affairs Clustering at 23andMe - Wed, 20 Mar 2019

Today Evert-Jan Blom, author of Genetic Affairs and the new clustering algorithm implemented by MyHeritage DNA, posted on the Genetic Genealogy Tips & Techniques group on Facebook. He announced some improvements to his AutoCluster analysis on Genetic Affairs for 23andMe matches.

He posted:

A well known feature for the DNA relatives list on 23andme are the Relatives in Common. What is interesting is that 23andme, just like MyHeritage, supplies the shared cM values between shared matches. On MyHeritage, we use this data to improve the analysis of people from endogamous populations.
In addition, there is a Shared DNA column in the Relatives in Common list. The Matches marked with a “Yes” have overlapping segments – and, according to the research of Jim Bartlett (https://segmentology.org/2017/…/20/triangulation-at-23andme/) over 99% of the time these matches form a Triangulated Group (TG).

The shared cM values between matches as well as TG information is now employed for 23andme AutoCluster analyses. …

But what about these TG data? … The rectangles that contain a DNA helix symbol have overlapping segments and probably form a TG. I’ve already discovered some clusters that could be extended by taking into account some grey cells that in fact were TGs. …

So we supplement the ICW based 23andme AutoCluster analyses with TG data which already improves the analysis. And, although we know that not all members of a (large) cluster will form TGs, wouldn’t it be interesting to only take into account TG data? We thought so as well, and therefore this feature is now also available for 23andme AutoCluster.

I wrote about MyHeritage’s New AutoClustering feature 3 weeks ago, showing my results. Unfortunately, I don’t have any DNA matches at MyHeritage whose relationship I truly know, so I couldn’t identify the ancestral source of any of the clusters.

But at 23andMe, I have a number of known relatives who tested.  So I should be able to identify some of the clusters. Let’s see how it goes.

I went to my account at Genetic Affairs and added my 23andMe website to it. Then I requested an autocluster using the default parameters:

image

I performed an analysis first “Based on shared matches” and then did it again after selecting “Based on Triangulated Groups”

For the shared matches, I got just one big cluster of 54 members:

image

So that’s my endogamy and 1 cluster is not of much help. However, notice that some of the cells have a little DNA symbol in them, like this:

image

These are people who are not only ICW (In Common With) myself and the row person and column person, but also are shown in the 23andMe Relatives in Common list as a “Y” in the Shared DNA column. That means that all three of us share at least one common segment of DNA with each other, i.e. we triangulate somewhere.

So Evert-Jan had the innovative idea to allow clustering based only on these triangulating people. When my second run based on Triangulated Groups came back, it looked like this:

image

This initially got me really excited. There were just four clusters and I was hoping that this clustering had done the trick and divided my DNA relatives into my four grandparent groups. Did it?



The Trouble with Using Triangulations for Clustering

Unfortunately, I noticed something very important. The first person in the red group is Bruce, a 3rd cousin of mine. The first person in the purple group is Rick, his brother, also my 3rd cousin. If you go down the column of the first red box to the row of the first purple box, you can see the two of them have a grey square with a DNA symbol in it, meaning the three of us triangulate. We have two full brothers who triangulate and who absolutely must be in the same cluster no matter which way you look at it. So why aren’t they?

While I was looking at all of this, Evert-Jan himself Facebook messaged me, and we started discussing this problem. I then noticed that the 3rd purple box in the last row and last column with Rick in the purple group was Rick’s daughter Arianna. If you then look down from Bruce’s red box to Arianna’s row, you’ll see there is no triangulation!

image

Evert-Jan and I discussed this for a while. Why were Bruce and his niece Arianna not triangulating with me? I then went to 23andMe and compared our shared DNA in their chromosome browser:

image

Sure enough. The 5 segments that I share with Arianna, Bruce does not share with me. Arianna, Bruce and I do not triangulate.

But it’s even worse. Bruce and his brother Rick only share the same segment with me in two places, the large segment on Chromosome 2, and the very small one on Chromosome 9. It could have been that Bruce and Rick might not even have shared those same two segments with me. In that case, Bruce, Rick and myself may not have triangulated. Then we’d have a case where two brothers would never be put in the same cluster using triangulation groups.
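Here is a minimal Python sketch of that situation. The segment coordinates are invented for illustration (they are not my real 23andMe data), and the test is simplified: it only checks whether two relatives share an overlapping segment with me, and assumes they also match each other there.

```python
# Hypothetical segments each relative shares with me: (chromosome, start_Mbp, end_Mbp).
# These numbers are made up for illustration only.
shared_with_me = {
    "Bruce":   [(2, 10, 60), (9, 5, 8)],
    "Rick":    [(2, 20, 70), (9, 5, 8), (5, 100, 130)],
    "Arianna": [(5, 100, 125)],
}

def overlap(a, b):
    """True if two segments are on the same chromosome and overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def may_triangulate(x, y):
    """Simplified test: x and y each share an overlapping segment with me.

    A real triangulation also needs x and y to match each other on that
    segment, which this sketch simply assumes for close relatives.
    """
    return any(overlap(a, b)
               for a in shared_with_me[x]
               for b in shared_with_me[y])

for x, y in [("Bruce", "Rick"), ("Rick", "Arianna"), ("Bruce", "Arianna")]:
    print(x, y, may_triangulate(x, y))
```

With these made-up numbers, Bruce and Rick overlap, Rick and Arianna overlap, but Bruce and Arianna do not. Take away the chromosome 2 and 9 segments from either brother and the two brothers would have nothing linking them in a triangulation-only analysis.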

Evert-Jan summed it up. He said to me: “TG clustering breaks up family bonds … and if you were using ICW they would most probably be placed in the same clusters.”



Some Thoughts

I think the conclusion is that triangulated segments and triangulated groups give good information to help you try to determine who might be your common ancestors. But they are not all-inclusive and close relatives need not have been passed down the same segment of DNA that the rest of the group received. Therefore using triangulated segments may separate close relatives when clustering.

ICW (In Common With) information, by contrast, will never separate close relatives and is therefore likely better for cluster analysis than triangulated groups.

Very interesting. Not every analytical technique works out exactly as expected.

Evert-Jan was initially wondering if anyone had written about this before. I told him that, other than Jim Bartlett’s article, I doubt it, because Evert-Jan’s clustering program is the first tool ever invented that has let us look at clustering of triangulation groups.

So now it will take some thinking. Despite the possible breakup of close relations, can the ideas of triangulation group clustering still be used? Is there maybe some way of merging the ICW cluster information together with the triangulation information? We don’t know yet, but great analytical minds like Evert-Jan will be thinking hard, and that will ultimately result in new ideas along with new and better analysis software for you to analyze your DNA matches.




Update: March 21: 

I noted that the TG clustering did split up a family into 2 clusters which is a problem. But I failed to mention that the 4 clusters are still good clusters, where the people in each cluster do appear to be from the same family.

Since shared match clustering gave me one big cluster because of endogamy, I didn’t get anything at all from that. But Evert-Jan’s TG clustering gave me 4 clusters, providing me with information where shared match clustering alone did not.

Maybe Evert-Jan can figure a way to parlay the information that the TG clusters have to make the shared match clustering even better.

Small Segment Matches - Tue, 19 Mar 2019

Blaine Bettinger posted a poll on his Genetic Genealogy Tips and Techniques Facebook group about 7 hours ago. It is a closed group, but if you are a member you can see the poll here.  In four hours, the poll got almost 800 responses and over 350 comments.

Blaine asked people to go to GEDmatch Genesis and do a one-to-one comparison between their kit and his kit, but reduce the minimum segment threshold down from the default of 7 cM and do the comparison using a minimum of 3 cM instead.

This little poll/experiment is well designed to help people realize that most small single matching segments are false, and to see how many of them they might have with someone to whom they are likely not related at all. False matches happen because we have a pair of each chromosome, and there is a high probability that an allele on either of our chromosomes will match one of another person’s two alleles at the same position. What ends up happening is that we can get random matches on segments as large as 15 cM. As a segment gets larger, a run of random allele matches becomes less likely, in the same way that you can’t keep tossing heads on a coin forever. Above 15 cM you can be fairly certain that almost all segment matches are real, i.e. likely Identical by Descent (IBD) and likely passed down from a common ancestor who lived anywhere from 1 to 20 generations back.

So here’s the result of the poll when I snapshot it:

image

649 out of the 738 people (88%) who responded, including me, had multiple segment matches with Blaine that were 3 cM or more but not larger than 7 cM. Only 42 people (6%) did not share any segments 3 cM or more with Blaine.

These, for example, are the segments I share with Blaine:

image

There are 8 matching segments totaling just 30.7 cM. The largest is just 4.9 cM. The most SNPs shared is 733. That sort of means we have 733 SNPs in a row where one of my alleles matches one of Blaine’s alleles, but due to misreads, GEDmatch and other DNA companies usually allow for a mismatch every now and then, maybe one or two every 100 or so. Also there are usually a few percent no-calls (unreadable SNPs) that are always treated as a match. Maybe I have 15 no-calls and Blaine has 15 over those 733 SNPs, so that’s 30 positions that may not be matches but are treated as matches.
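To make the idea of “one of my alleles matches one of his alleles” concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the genotypes are invented, a no-call is written as "--", and the mismatch allowance that the companies use is left out to keep it short. It is not how GEDmatch actually implements its comparison.

```python
def half_identical(g1, g2):
    """True if two genotypes share at least one allele.

    A no-call (written here as '--') is treated as a match,
    as described in the text above.
    """
    if "-" in g1 or "-" in g2:
        return True
    return bool(set(g1) & set(g2))

def longest_run(kit_a, kit_b):
    """Length of the longest run of consecutive half-identical SNPs."""
    best = run = 0
    for g1, g2 in zip(kit_a, kit_b):
        run = run + 1 if half_identical(g1, g2) else 0
        best = max(best, run)
    return best

# Invented genotypes at six consecutive SNPs for two unrelated people.
kit_a = ["AG", "CC", "--", "TT", "AA", "GG"]
kit_b = ["GG", "CT", "AA", "CT", "GG", "AG"]
print(longest_run(kit_a, kit_b))  # 4
```

Notice that the run is only broken where the two people are opposite homozygotes (AA versus GG). Everything else counts as a match, which is why long runs can build up by chance.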

Blaine is rightfully trying to get the attention of genealogists to inform them to be wary of these small segment matches. They are single matches. They are dangerous, because most are false. Confirmation bias, where you think someone is related and then believe that some small segments are the connection, must be avoided. I likely share zero DNA with Blaine, yet I’ve got 7 segments showing here. Don’t believe it. You need more than this.

So I thought I’d look into these small segment matches people have with Blaine and see if I could learn more. Among the 350 comments to the poll in the first 4 hours, there were 25 people who posted their matches to Blaine like I did above. I entered their matches as well as mine into a spreadsheet so I could do some analysis.

The 26 of us have 247 segment matches with Blaine. That’s on average 9.5 matches each. The fewest is 2. The most is 17. On average each of us matches Blaine on 35 cM in total, with a minimum of 8 cM and a maximum of 64 cM.

If all those small segments are real, then potentially 64 cM could indicate a 3rd cousin with Blaine, but that’s the best that could happen. It is much more likely none of the 26 of us and Blaine are truly DNA related because almost all those small segment matches are false.


Does Triangulation Help?

I don’t know of any studies of triangulation done with small segments. The best I can quote is Jim Bartlett’s observations that just about all his triangulations are true down to 7 cM and most are true down to 5 cM.

I have done a related study. It is not triangulation per se, but it is what I would call Parental Filtering. It finds segments that a child matches but neither parent matches, thus indicating that the segment match of the child is false. They cannot be matching through one chromosome (or they would match one of their parents), so they must be matching through both their chromosomes randomly. Parental filtering effectively forces the match through one chromosome just as triangulation does, so it is a good first cut estimate of how triangulation might work on small segments. The final result of my study was this graph:

image

This tells us that parental filtered segments almost always match when they are at least 7 cM. But between 3 cM and 7 cM, you still can have a lot of false matches even if both the child and a parent match.
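The filtering step itself can be sketched in a few lines of Python. This is a simplified illustration, not the code from my study: the segments are invented, and a child’s segment is kept if it overlaps at all with a segment that either parent has with the same match.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) segments overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def parental_filter(child, father, mother):
    """Split the child's segments with one match into kept and rejected.

    A segment is rejected when neither parent matches the same person
    on an overlapping segment, which indicates the child's match is false.
    """
    kept, rejected = [], []
    for seg in child:
        if any(overlaps(seg, p) for p in father + mother):
            kept.append(seg)
        else:
            rejected.append(seg)
    return kept, rejected

# Invented example: segments that a child, father and mother each share
# with the same match.
child = [(1, 10.0, 14.5), (7, 30.0, 33.2), (12, 5.0, 9.1)]
father = [(1, 9.5, 15.0)]
mother = [(12, 5.0, 8.0)]

kept, rejected = parental_filter(child, father, mother)
print(rejected)  # [(7, 30.0, 33.2)]
```

A kept segment is not proven real by this. As the graph shows, small segments can survive the filter and still be false.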

So my question is, do any of the 247 segment matches of the 26 people with Blaine triangulate?  If they do, is it a true match that is a small segment that is IBD, or is the triangulation a false match?

I sorted the 247 segment matches by chromosome, starting position and ending position. And I looked for overlapping segments. Guess how many there were? Would you believe 147 (60%) overlapped with one or more other segment matches. This happens because once you start to get as many as 247 segment matches, it becomes like the birthday problem (how many people must be in a room before 2 have the same birthday?). A 4 cM match has a 1 / 500 chance of overlapping with another 4 cM match. Once you have 247 segment matches, you have 247 x 246 / 2 = 30,381 possible match combinations, so there is an expected value of 30,381 / 500 = about 61 overlapping pairs if they happened at random, and 61 pairs can involve up to 122 different segments. But as the segment matches start to fill up the chromosomes, the chance of overlapping starts to increase. So my observed number of 147 overlapping segments is quite plausible. If I had entered 1000 segment matches with Blaine, almost all of them likely would overlap with at least one other segment match.

This is what I might call the “chromosome browser phenomenon”. People see their segment matches lining up in the chromosome browser and assume they must be valid matches because they are all lining up. False, false, false!
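For anyone who wants to repeat this with their own spreadsheet, here is a rough Python sketch of both the arithmetic and the sort-then-scan overlap search. The four example segments and the person labels are made up, and the 1-in-500 figure is the rough estimate used above.

```python
from math import comb

n_segments = 247
pairs = comb(n_segments, 2)        # 247 x 246 / 2
expected_overlaps = pairs / 500    # rough 1-in-500 chance per pair
print(pairs, round(expected_overlaps, 1))  # 30381 60.8

def find_overlaps(segments):
    """Return pairs of people whose segments overlap.

    Each segment is (person, chromosome, start, end).
    """
    ordered = sorted(segments, key=lambda s: (s[1], s[2], s[3]))
    found = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[1] != a[1] or b[2] >= a[3]:
                break  # sorted order: nothing further along can overlap a
            found.append((a[0], b[0]))
    return found

# Made-up example segments.
segments = [
    ("P1", 1, 10, 15),
    ("P2", 1, 14, 20),
    ("P3", 1, 21, 25),
    ("P4", 2, 10, 15),
]
print(find_overlaps(segments))  # [('P1', 'P2')]
```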

The chromosome browser shows you double matches. A double match is where Person A matches Person B and Person A also matches Person C on the same segment. Double matches are simply alignment of single matches. There is nothing there that tells you that any of them are valid if they are small segments.

The important step to validate a double match is to see if that match triangulates. What you need to do is check that Person B also matches Person C on the same segment. What that will usually do is, like parental filtering, force the match to be on just one chromosome between the three sets of people: A and B, A and C, B and C. (There are special case exceptions, but I won’t get into that here).
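As a sketch, the check looks like this in Python. The coordinates are invented, the function names are my own, and no minimum overlap length is enforced, so treat it as an outline of the logic rather than a finished tool.

```python
def common_region(*segments):
    """Region shared by all the (chromosome, start, end) segments, or None."""
    chromosomes = {s[0] for s in segments}
    start = max(s[1] for s in segments)
    end = min(s[2] for s in segments)
    if len(chromosomes) != 1 or start >= end:
        return None
    return (segments[0][0], start, end)

def triangulates(a_b, a_c, b_c):
    """True if A-B, A-C and B-C all match on a common region.

    Pass b_c=None when B and C have no match on that chromosome.
    """
    if b_c is None:
        return False
    return common_region(a_b, a_c, b_c) is not None

# A double match: A matches B and A matches C on overlapping segments.
a_b = (4, 100.0, 104.0)
a_c = (4, 101.0, 106.0)

print(triangulates(a_b, a_c, None))               # False: B and C do not match
print(triangulates(a_b, a_c, (4, 100.5, 105.0)))  # True: all three overlap
```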

GEDmatch allows you to check triangulations. You can do a one-to-one comparison of any two people if you know their Kit number. In Blaine’s comments, about 10 of the 26 people who gave their matches inadvertently included their Kit number in their screen shot. Now I’m not going to hack their account, but I am going to do some one-to-one comparisons between them to check to see if any of these false matches are triangulations.

There only happened to be 8 overlapping matches for the people who have given their Kit numbers. That should do for a very rough estimate as to how many of these small segments triangulate. These are the overlaps I found and checked:

image

Interestingly I am in 5 of these 8 matches. I suspect the reason why I am in so many is because my kit I uploaded is a combination of the raw DNA from 5 companies. Therefore, my kit has SNPs to match all the companies, and I’m guessing that must allow me to match more segments than a single company kit would.

Checking the 8 sets of matches, I find only one of them triangulates. I’m showing the Sonia versus Louis segment match on the last line of the above table in green.



What does this all mean?

Well 7 of these 8 small segment matches do not triangulate. If a segment does not triangulate, it cannot be IBD and therefore is almost assuredly false.

This does say that triangulation is a good way of further filtering out small false segments. If triangulation can eliminate 7 out of 8 segments for you, then it will have saved you a lot of unnecessary analysis.

And that 8th segment:  Is it real? If the segment is above 7 cM, then Jim Bartlett would tell you that it is very likely real. But if it’s under 7 cM, then we can’t say for sure. Triangulation alone cannot prove that a small segment is real. It can only disprove a segment. You would need to analyze other matches on the same segment and check that they all match each other. Then you have a triangulation group which is genetic genealogy’s version of “a preponderance of evidence” which starts to tell you something. But there are still a lot of caveats. With small segments, each of the 3 matches may be matching randomly, or two may be matching and the third is a random match. And you may have a valid triangulation group for some people in the group, with the others in the group matching randomly to the group’s common segment.

Also note that even if you find a true triangulation group for a small segment, you must realize that a segment that small could very easily have come from an ancestor 10, 20 or even 30 generations back, so you may never find the common true ancestor for most of your small segment matches.

Inferred Segment Matches - Thu, 7 Mar 2019

When we match our DNA to other people to find common ancestors, we are comparing segments of DNA that match the other people. That’s only logical, isn’t it?

Well, interestingly enough, there’s a technique that will help you determine which ancestors your DNA comes from by using non-matches. Actually, you are using matches of people you are closely related to, and finding common relatives who they match to, but you don’t.

Jonny Perl, the author of DNA Painter, recently wrote an article about this technique titled Painting your DNA with inferred matches. I believe he is the person who named it “Inferred Matching”. (Please correct me Jonny if this is not the case.)

Jonny gave examples showing how he used:

  1. His dad’s matches with a 2nd cousin once removed that he did not match
  2. His dad’s half-brother’s matches that his father matches but he does not match
  3. His mother’s paternal cousin, and his second cousin.
  4. Siblings

The basic idea behind Inferred Matching is that it works because you know you got each segment of your DNA either from your father or from your mother. And each (small enough) segment you got from each parent was either from grandfather or grandmother. What you do is find another close relative, who I’ll call Person B, who matches a third person who I’ll call Person C. If Person B matches Person C on a segment, but you (Person A) do not match Person C on that segment, then you couldn’t have got your segment from the same line. If you did, it would have matched.

So Inferred Matching basically tells you the ancestral line your segment did not come from.

image

Looking at the diagram above, I show an example where I’m assuming your grandmother’s father (GM’s father) is the ancestral source of a segment. He passes it down through your grandmother, through your uncle/aunt to your 1st cousin (Person B). He also passes it down to your more distant cousin (Person C).  If he passed the same segment down to you as well, then you and your two cousins would all have the same segment, your segments would all match each other and you therefore triangulate. The triangulation is a clue that all three of you may have been passed down that segment from a common ancestor.

But what if your two cousins match each other, but you don’t match? You know you couldn’t have got the segment from your GM’s father. So who could have given you the segment? Answer: the segment you got from your parent could have instead been passed down from your GF’s father, your  GF’s mother or your GM’s mother.

So you usually can’t directly tell which line you came from with Inferred Matching. In the above example, you still don’t even know if the line is from your grandfather or grandmother’s side. But it does tell you the one line that you don’t come from.

Alone, you can’t do too much with it. But combined with other information, you can. If you find another cousin, who matches someone else on your grandmother’s side that you also match, but not on that segment, then you have a second refutation. If that refutation is, say, on your grandmother’s mother’s side, then all of a sudden you have refuted both your grandmother’s parents, and your segment should be on your grandfather’s side. Then if through yet another pair of cousins, you infer that the segment cannot be on your grandfather’s father’s side, all that remains is your grandfather’s mother’s side, and that could very likely be the ancestral path for your segment of interest.
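The elimination described above is just set subtraction. Here is a tiny Python sketch that walks through the same three refutations; the line labels are my own shorthand for the four great-grandparents on one parent’s side.

```python
# The four possible sources of a segment on one parent's side.
LINES = {"GF father", "GF mother", "GM father", "GM mother"}

def remaining_lines(refuted):
    """Lines still possible after removing every refuted line."""
    return LINES - set(refuted)

# One inferred match refutes GM's father: three lines are left.
print(sorted(remaining_lines(["GM father"])))

# A second refutes GM's mother: only the grandfather's side is left.
print(sorted(remaining_lines(["GM father", "GM mother"])))

# A third refutes GF's father: one line remains.
print(sorted(remaining_lines(["GM father", "GM mother", "GF father"])))
# ['GF mother']
```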



Who Can Be Used for Inferred Matching?

Persons B and C can be anyone related to you who shares a Most Recent Common Ancestor (MRCA) with you. You must match Person B and Person C somewhere, but it’s the segments that you don’t match one or both of them that can be used for inferred matching. The ancestral path through the MRCA that is closest to you is the one that you can refute, because you cannot continue to follow up that path to the further MRCA. If you did, then you would be matching on that segment.

Using a parent or a parent’s descendant as a Person B is wonderful. With a parent, sibling, nephew or niece, you are now dealing with only two possible segments that you can receive rather than four. Because of that, Inferred Matching using segments that your parent or half-relative matches on but you do not will always tell you that if your match is not through your parent’s father, then it must be through your parent’s mother (and vice versa). You will need to know the MRCA of Person C so you can determine which grandparent the non-match will be on. Jonny’s article gives excellent examples of this.



Caveat

Of course nothing’s ever perfect. If your Person B or Person C is related to you more than one way, e.g. through both of your grandparents, then you could get incorrect results. But this should be a somewhat rarer case. Normally, Inferred Matching works and works pretty well.



Visual Phasing

Inferred Matching has been used before Jonny’s paper. The technique of Visual Phasing takes the matches of 3 or more siblings and compares them. In doing so, the segments of each sibling’s DNA that came from each grandparent can be determined. Visual Phasing has been around for a few years. Part of the technique involves refuting a grandparent on a segment, which is effectively Inferred Matching, but I’ve never seen any posts about Visual Phasing referring to the term “Inferred Matching”.



Inferred Matching and Double Match Triangulation

Doing Inferred Matching manually is laborious. For any segment, you need to find all the segment matches that your known relatives have with each other that you don’t match to. Then you must logically work out what ancestral paths back to the MRCAs are possible and see if you can eliminate some paths from possibility and thus infer the paths that are possible.

Inferred Matching works well with the ideas behind double match triangulation.

Double matching involves finding all the segment matches of Person A with Person C and comparing them to all the segment matches of Person B with Person C. Those that overlap (along with A’s segment matching B’s) are triangulations.

Inferred Matching uses the complementary information available in the data used for double matching. Inferred Matching uses the segment matches of Person B with Person C where Person A is not matching Person B and/or Person C on that segment.

I’ve been working on implementing Chromosome Mapping into what will be Version 3.0 of Double Match Triangulator. I’m also incorporating Inferred Matching into that. In Double Match Triangulator, an inferred match will be telling you what ancestral paths cannot occur, and will look like this:

image

The green sections are triangulations that Person A and Person B have with several C Persons. In the example triangulation group, the MRCAs of the C People who triangulate are not known. The ancestral path (MM = mother’s mother) is only known from Person B’s MRCA.

An inferred match is shown on the first line and states that Person A doesn’t have the B-C match and the ancestral path cannot be MMFF. So only MMFM, MMMF and MMMM are possible. If additional Inferred Matches are found for that segment that rule out more of the possible paths, then Double Match Triangulator may be able to extend the ancestral path of the triangulations to a longer path when it becomes the only possibility. This can provide extra information that wouldn’t have been available without the Inferred Matching.
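The path logic can be sketched in Python like this. To be clear, this is not the code in Double Match Triangulator; it is a small illustration with function names I made up, using F and M path letters as in the example above.

```python
from itertools import product
from os.path import commonprefix

def candidate_paths(known, depth):
    """All ancestral paths of the given depth that start with the known path."""
    extra = depth - len(known)
    return {known + "".join(p) for p in product("FM", repeat=extra)}

def extend_path(known, depth, ruled_out):
    """Longest path shared by every candidate that has not been ruled out.

    Returns None if every candidate has been refuted.
    """
    left = sorted(candidate_paths(known, depth) - set(ruled_out))
    if not left:
        return None
    return commonprefix(left)

print(sorted(candidate_paths("MM", 4)))
# ['MMFF', 'MMFM', 'MMMF', 'MMMM']

print(extend_path("MM", 4, ["MMFF"]))                  # MM
print(extend_path("MM", 4, ["MMFF", "MMFM"]))          # MMM
print(extend_path("MM", 4, ["MMFF", "MMFM", "MMMF"]))  # MMMM
```

One refutation leaves the path at MM. A second refutation on the same side extends it to MMM, and a third pins it down completely.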



Bonus: Inferred Matching on Triangulating Segments

Look at the 3rd line in the above diagram. This is a triangulation, but to the right there are 5 grey B’s. That is a section of the double match that Person A no longer matches. Person A stops matching at the last green T. But Person B continues matching Person C for 5 more Mbp (mega base pairs).

Inferred Matching can be applied to those 5 B’s. Person C has an ancestral path of “MM”, meaning that this segment can no longer be from the MM ancestral path. What we have found is a crossover at the end of that triangulation group belonging to Person A. These additional Inferred Matches are also being identified and will be displayed and used for ancestral path determination in the upcoming version 3.0 of Double Match Triangulator.

Of course we have to be careful not to use too small segments. There can always be some random matching at the beginning and end of any match, so we must make sure that the B-C matching preceding or following a triangulation is significant.
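Here is a minimal Python sketch of picking out those leftover pieces. The positions are invented and the 3 Mbp cutoff is purely a hypothetical value for illustration; it is not a threshold that Double Match Triangulator uses.

```python
def tail_inferences(triangulated, b_c, min_length=3.0):
    """Parts of the B-C match that lie outside the triangulated region.

    Both arguments are (start, end) positions in Mbp on the same chromosome.
    Tails shorter than min_length (a hypothetical cutoff) are ignored,
    since short stretches at either end may just be random matching.
    """
    t_start, t_end = triangulated
    b_start, b_end = b_c
    tails = []
    if t_start - b_start >= min_length:
        tails.append((b_start, t_start))
    if b_end - t_end >= min_length:
        tails.append((t_end, b_end))
    return tails

# Person A triangulates from 40 to 52 Mbp; B keeps matching C from 39 to 57.
print(tail_inferences((40, 52), (39, 57)))  # [(52, 57)]
```

The 1 Mbp in front of the triangulation is dropped as too short, while the 5 Mbp after it is kept as a region where Person A’s segment cannot be on Person C’s ancestral path.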



Double Match Triangulator 3.0

I’ve been making good progress and I will release DMT 3.0 as soon as it is ready. There have been so many great advances in DNA analysis over the past six months with clustering and new tools and especially new features at Ancestry DNA and MyHeritage DNA announced at RootsTech that I’ve been following. All of these have redirected my thinking as to what’s needed. I’ve established that the tool that is now needed is one that will help people do Chromosome Mapping by applying and automating the rules for them so they don’t have to do it themselves. The results will then be made available to you so that you can input them into DNA Painter and other tools.

I’m very excited as to what I have programmed so far. Most of what I’ve talked about above is completed in my development version. This post was mainly to document some of my thoughts about Inferred Matching, but is also meant to be a teaser as to what’s coming in DMT 3.0.

Stay tuned.



A Second Type of Inferral

It’s amazing as you work through the details of something and try to implement it programmatically that you suddenly realize something. I shake my head sometimes as to how the mind works, but it somehow connects all the dots together all by itself and suddenly this idea pops into your head.

The type of inferral that Jonny Perl wrote about and that I was writing about up to now is an inferral you can make because a close relative matches to someone on a segment, but you don’t.

What about the other way around? It works too. You can infer in a similar manner from a segment match that you have, but a close relative doesn’t.

The simple case of this is when you match someone on a segment, but one of your parents doesn’t. I like to call this “Parental Filtering”. Almost all the time, that will mean that either you match through your other parent, or the segment is false.

There is the borderline case where your parent falls under the match limit but you don’t. But in that case, you’ll still want to eliminate that segment from your analysis because you can’t say for sure that it is a segment going through that parent.

People do this parental filtering all the time, especially when they only have one parent tested. But you can also use siblings (as in Visual Phasing) to infer grandparent lines that you can’t have. And similarly you can use other segments that you have that some close relatives on those lines don’t have to infer more lines that you can’t have. And once you have all lines covered (e.g. both parents or all four grandparents), then you can start to classify segments as likely to be false.

I am now working to incorporate this second type of inferred matching into DMT. We’ll soon see how well these two methods of eliminating possible lines work to help identify the ancestral path that the segments of your DNA came from.




Followup (3 hours following my post): Blaine Bettinger wrote on the Facebook Genetic Genealogy Tips & Techniques group that there are several names for this process. Blaine says he uses “Indirect Mapping”.




Revision: Mar 10:  Nearly complete rewrite. On Facebook, Jonny Perl and Stevlana Hensman pointed out a major oversight I originally made in my article. I had thought that the Inferred match always resulted in knowing the ancestral line that your segment came from. That is only the case for parents and descendants of your parents (siblings, nephews/nieces, etc.). For anyone else, all it does is tell you the one ancestral line that your segment did not come from. That is still very useful information, however, and needs to be automated in DMT so that people can make use of it.

These concepts are brand new and are still being discovered by the genetic genealogy community. They are not simple. I am still learning myself and my head still spins every time I try to map how DNA is shared. I appreciate all feedback as peer review is the best way to confirm, correct and improve methodologies.




Followup: Mar 14: I’ve confirmed that you can infer the grandparent when the inferred match is made through your parent or a descendant of your parent (i.e. siblings, nieces/nephews, etc.). The reason is that your parent gets one chromosome of each pair from each grandparent that only comes from two of your great-grandparents on that parent’s side. If you do not match to one, you must match to the other.

This does not work for uncles/aunts, 1st cousins, or other relations, because they need not have got the same grandparent segment that your father did. So for them there are four possible great-grandparent segments to choose from. You can eliminate one, but without further eliminations, that still leaves two on one grandparent’s side and one on the other.




Followup:  Mar 15: I added the section at the end: “A Second Type of Inferral”



Followup: Oct 6, 2020: Blaine Bettinger gave a webinar on FamilyTreeWebinars about this technique. He now prefers using the term: “Deductive Mapping”.