Louis Kessler’s Behold Blog

Non-Matches by cM - Wed, 1 Feb 2017

Roberta Estes did the first analysis of this and asked someone to do the same thing for an endogamous population. I had that data and I felt I wanted to know as well.

I did a blog post about it a few days ago that I titled Double Match Phasing for an Endogamous Population, but using the term “Double Match Phasing” was not quite accurate, since Phasing is done at the allele level, so it contradicts itself because “matching” works at the segment match level. I’m going to go back and rename this “Double Match Filtering” since what it really does is filter out everyone who doesn’t match to both people. (As it turns out, this is one huge benefit of Double Matching, in that you can choose the two people to filter with – but that will be the topic of a future post).

Regarding the issue of Non-Matches by cM: First let me state that we are not in any way trying to claim that a segment is Identical by Descent (IBD). We are actually doing the opposite and are showing that the match is false and cannot be IBD. This can be shown when there’s a segment match of a child with Person c that does not also match with at least one parent of the child. If neither parent matches, then the child could not have had the segment passed down and it must be a by-chance match with Person c.

The first step of this analysis was to verify that my analysis gave the same results as Roberta. She was kind enough to send me the FamilyTreeDNA Chromosome Browser Results files she used so that I could check and compare my results with hers.

 

A Free Excel Spreadsheet Template for You

In Roberta’s analysis, she combined the child, father and mother results together and manually inspected them to find child matches that did not match either parent. This was going to be a lot of work on her part, so to reduce the number of matches she’d have to inspect, she first eliminated all matches under 3 cM from the 3 files.

For my analysis, I developed Excel equations that would automate the detection of overlapping segments for me. This would ensure I would not make any manual mistakes and allowed me to use all the data right down to the 1 cM limit that FamilyTreeDNA provides in its CBR files.
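For readers who prefer code to spreadsheet formulas, here is a minimal Python sketch of the same overlap test. It is not the Excel logic from my template, just an illustration of the idea. The `Segment` fields and the `classify` helper are hypothetical names, and I am assuming that any overlap at all on the same chromosome with the same Person c counts as a parent match.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One row of a match list (hypothetical field names)."""
    match_name: str   # Person c
    chromosome: str
    start: int        # start position
    end: int          # end position
    cm: float         # segment length in cM


def overlaps(a: Segment, b: Segment) -> bool:
    """True if both segments are with the same Person c, on the same
    chromosome, and their position ranges intersect."""
    return (
        a.match_name == b.match_name
        and a.chromosome == b.chromosome
        and a.start <= b.end
        and b.start <= a.end
    )


def classify(child_seg, father_segs, mother_segs) -> str:
    """Label a child segment as matching 'father', 'mother', 'both' or 'neither'."""
    with_father = any(overlaps(child_seg, s) for s in father_segs)
    with_mother = any(overlaps(child_seg, s) for s in mother_segs)
    if with_father and with_mother:
        return "both"
    if with_father:
        return "father"
    if with_mother:
        return "mother"
    return "neither"  # no parent matches here: a false match for the child
```

A child segment that comes back as "neither" is the kind of match this article counts as a non-match.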

I have made available for free a template for this Excel spreadsheet that you can use and try for yourself. It includes a few terse instructions on how to add your child/father/mother CBR files and even includes a graph you can use to compare your results. But it’s caveat emptor. You’ll need to have decent skills with Excel to use it. To understand what its results mean, read the rest of this article.

The template is here:  DM Filtering Parents Child Template.xlsx

Roberta did use Double Matching in her analysis but did not recognize it as such. She found all the matches of the child, father and mother, and she looked for double matches between the child as Person a to any Person c, and either the father as Person b or the mother as Person b to the same Person c. If neither the father nor the mother matched Person c, then she marked that Person c as a false match of the child.

I did basically the same, except that I kept the father’s Double Matches with the child separate from the mother’s Double Matches with the child. This allows for a bit more analysis since you can now also determine the number of matches with both parents which is useful for endogamy. My number matching neither parent is no different than Roberta’s.

Comparing my results with Roberta’s gave this:

image

In each of the charts on this page, the cM value represents the lower bound of the cM group. So “1” is 1 to 1.99 cM. “2” is 2 to 2.99 cM. The “15” is 15 to 19.99 cM and “20” is 20 cM or more.
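If you want to reproduce the grouping in code, a sketch might look like this. The post only spells out the groups "1", "2", "15" and "20", so the lower bounds from 3 to 10 in `LOWER_BOUNDS` are my assumption for illustration, and `cm_group` is a hypothetical helper name.

```python
from bisect import bisect_right

# Lower bound of each cM group. 1, 2, 15 and 20 come from the text above;
# the whole-number groups 3 through 10 are assumed for illustration.
LOWER_BOUNDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]


def cm_group(cm: float) -> int:
    """Return the lower bound of the group that a segment of `cm` falls in."""
    index = bisect_right(LOWER_BOUNDS, cm)
    if index == 0:
        raise ValueError("segments under 1 cM are not in the CBR files")
    return LOWER_BOUNDS[index - 1]
```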

In the chart above, you’ll see that at 5 cM and above, Roberta’s line and my line for her data are almost exactly the same. When you go down to 4 cM and 3 cM they start to diverge. The reason for this was Roberta’s 3 cM cutoff. There are instances where the child has a little bit of extra random matching that puts their segment above the 3 cM threshold, but the matching parent’s segment was just under 3 cM and so had been deleted from the file by Roberta’s cutoff.

The same phenomenon may be happening to my data at the 1 cM cutoff applied by FamilyTreeDNA, but that’s the smallest segment that they provide. So the numbers in the Check data may be a couple of percent higher at the 1, 2 and 3 cM level than they should be.

But this is very interesting. It says that for segments 8 cM and smaller, the percentage of child matches that don’t have a corresponding parent match in Roberta’s data grows from 20% up to a maximum of about 80% at 3 cM or less. That is saying that a full 20% of very small segments do have a parent matching on the same segment, which is probably more than most people thought.

 

Must Match at least Once with Parents

Okay. I’ve verified that Roberta’s results match with mine. During that process, I found something similar to that 3 cM cutoff effect that needed to be handled. This was a situation where the child matches to Person c on one or more segments, but neither parent matches to Person c at all on any segment. In this case, Roberta is including all the child’s segments as non-matches.

About 15% of the people matching matched only the child but neither parent. It is not as if there are only one or two matching segments with these people. The minimum match requirements of FamilyTreeDNA are enough that each of these people matches the child on between 6 and 29 segments, averaging 12 segments. The total cM matching to each person starts at 20 cM and averages 33 cM, with only a few totalling more than 40 cM. The average likelihood of one of these segments not matching (according to our results shown in the graph above) is about 70%. The chance of every one of 12 segments on a match all being non-matches is 0.7**12 = 0.014, or about 1%. So in almost all cases, some of these segments must match to some segments of at least one of the parents. Why don’t they? Because the parents must have just slipped under FamilyTreeDNA’s match criteria and thus were not included in the match file.
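As a quick check of that arithmetic, here is a small sketch. It assumes each segment is a non-match independently of the others with the same 70% probability, which is a simplification for the estimate rather than something I measured. The function name is just for illustration.

```python
def chance_all_non_matches(p_single: float, segments: int) -> float:
    """Chance that every one of `segments` segments is a non-match,
    assuming each is independently a non-match with probability `p_single`."""
    return p_single ** segments


# 70% per segment, 12 segments (the average for these people)
p_all = chance_all_non_matches(0.70, 12)
print(f"{p_all:.1%}")
```

Even at the minimum of 6 segments the same calculation gives under 12%, so the conclusion holds for most of these people, not just the average case.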

So if we include these people who don’t have at least one segment match with a parent, then we are counting all of the child’s segments with them as non-matches, which is almost assuredly not true, and we are overestimating the number of non-matches by 15%. As a result, I’ve included the option in my spreadsheet template of “Must Match at least Once with Parents” which I recommend be left at TRUE. You can change it to FALSE to compare to what most studies (which do not realize this is a problem) would come out with.
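In code, the effect of that option could be sketched as below. This is an illustration only, not the spreadsheet's formulas. Segments are plain tuples of (match name, chromosome, start, end), and `child_segments_to_count` is a hypothetical helper.

```python
def child_segments_to_count(child_segs, father_segs, mother_segs,
                            must_match_once=True):
    """Return the child's segments that should be included in the tally.

    Each segment is a tuple: (match_name, chromosome, start, end).
    With must_match_once=True, segments with a Person c who matches
    neither parent on any segment at all are left out of the analysis.
    """
    if not must_match_once:
        return list(child_segs)
    # Everyone who appears at least once in either parent's match list
    parent_people = {s[0] for s in father_segs} | {s[0] for s in mother_segs}
    return [s for s in child_segs if s[0] in parent_people]
```

Setting `must_match_once=False` keeps every child segment, which corresponds to the FALSE setting described above.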

Here is what Roberta’s results look like when corrected for this. Compare the blue line to the orange line:

image

The very interesting effect of this is not to lower chances of non-matches for smaller or larger segments, but to lower them for the mid-range segments. This is likely because these matches barely met the criteria required for the child to match and most had a reasonably large segment in the 5 cM or 6 cM range which were called false because the parents just missed the criteria.

 

Removal of X Chromosome matches

One other thing I found while I did this work was that the X chromosome was different. It had its own pattern of false matches. It should be studied separately (and I will do so at the end of this post). The X chromosome should not be combined with the autosomal chromosomes 1 to 22.

When we take the X out of Roberta’s data, we get this. Compare the yellow line to the blue line:

image

Removal of the X chromosome gets rid of the strange looking drop we had at 6 cM and gives us a nice smooth line with non-matches starting to be significant when segments are 6 cM or less and ultimately reaching 77% proven false matches when segments are 1 or 2 cM.

And the resulting yellow line is just about the same as Roberta’s initial results, except shifted left 2 cM.

Once again, remember, this is the percentage of child matches that can be shown to be non-IBD simply because neither parent matches on the same segment. This is not the full non-IBD percentage, which will be higher. This number is a lower bound because there are other reasons why a segment might not be IBD.

From this point on, I’ll refer to the yellow line as Estes*, as it includes the refinements I applied above, which were: (1) going down to 1 cM, (2) excluding matches to no parents, and (3) excluding the X chromosome. This Estes* yellow line will be the base against which I’ll compare other results for the rest of this article.

 

The Other Person

We, of course, are only checking that our child and one of his parents both match Person c.  But what about Person c?  What if Person c, who connects to us, does not match either of their parents on that segment? Assuming that it’s as likely to have matches proved false for Person c as what we found for Estes*, and assuming independence between the child’s false matches and Person c’s false matches, it is easy to calculate the combined percentage of false matches as:  1 - (1 - %child-false)**2 and it theoretically will result in this:

image

Now that blue line takes us up to non-confidence levels many people believe are true with respect to non-IBD numbers, at least for very small segments. I’m still not saying that these represent IBD likelihood, because these don’t. This is just the percentage of matches that can be proven false if one or the other of a match does not have a parent that also matches.

The assumption of independence means that a non-match with a parent on the cousin’s side is not more or less likely when the child is a match or non-match with a parent. If there is a dependency, then non-matches of the child will more often happen at the same time the cousin doesn’t have a parent match. This will reduce the % of non-matches and the combined line will fall somewhere between the yellow and blue lines.
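The two-sided calculation described above is simple enough to sketch in a few lines. It assumes, as stated, that Person c has the same false-match rate as the child and that the two are independent. The function name is hypothetical.

```python
def two_sided_false_rate(child_false: float) -> float:
    """Chance that a match can be proven false on at least one side,
    given the one-sided rate `child_false` (as a fraction, e.g. 0.77)
    and assuming independence between the two sides."""
    return 1 - (1 - child_false) ** 2


# Using the 77% Estes* figure for 1 to 2 cM segments
print(f"{two_sided_false_rate(0.77):.1%}")
```

If the two sides are not independent, the real figure will sit somewhere between the one-sided rate and this value.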

For now, we’ll ignore these double-sided numbers because I don’t have parents for the cousins to enable analysis of where the blue line is. I can only determine the yellow line. So for the rest of the analysis on this page, we’ll go back to the Estes* yellow line to compare with. We will also always exclude X and exclude the child’s matches with people who don’t match a parent on at least one segment.

 

The Child’s Sibling

Roberta gave me the data for a second child which she didn’t analyse for her article. It’s easy for me to analyze it with my spreadsheet, so I thought I’d do the calculation for her.

image

The good news is that the two give very similar results.

 

Comparing to a Different Child-Mother-Father trio

The question is whether or not Roberta’s example is representative for everyone or if there’s a big variance between the non-match rate for different people.

Kathy Benzi responded to my request for additional sets of Chromosome Browser Results files for child/mother/father trios. When I ran her results, it gave me these results:

image

Interesting! Kathy’s results give slightly lower non-match percentages than Roberta’s do. Not sure why, but they are still reasonably close.

 

Bonus! Child-Father-Father’s Mother

Kathy sent me a “bonus”. She also included the CBR file for the father’s mother. At first, I didn’t think I’d be able to use the grandparent, but I put it into the DM Filtering spreadsheet and realized it gives different, but also very useful information.

If a person is related to the child on the father’s mother’s side, then the child’s match must also be a match with the father and the grandmother (which the spreadsheet defines as “both”). We can ignore the matches that are only on the father’s side, because the valid ones would be on his father (the grandfather’s) side. And we can ignore the matches on the “neither” side, because the valid ones would be on the child’s mother’s side.

What is important are the matches of the child that are the same as the grandmother, but are not matches with the father. Those are then false matches that somehow did not go through the father. These aren’t one-off cases of single segments. These are multiple segments that match between the child and Person c and between the grandmother and Person c but don’t match between the father and Person c. Once again, I have to make sure that the father matches with the Person c on at least one segment, or his reason for not matching is that he just missed the match criteria as in the “Must Match at least Once with Parents” as discussed earlier.
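Here is a rough sketch of that grandmother test in code. It is an illustration under my own assumptions, not the spreadsheet itself: segments are tuples of (match name, chromosome, start, end), any overlap counts as a match, and `skipped_father` is a hypothetical name.

```python
def overlap(a, b) -> bool:
    """Same Person c, same chromosome, intersecting position ranges.
    Each segment is a tuple: (match_name, chromosome, start, end)."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def skipped_father(child_segs, father_segs, grandmother_segs):
    """Child segments that match the grandmother's segment with Person c
    but not the father's, counting only people the father matches at
    least once somewhere."""
    father_people = {s[0] for s in father_segs}
    result = []
    for seg in child_segs:
        if seg[0] not in father_people:
            continue  # father may have just missed the match criteria
        with_grandmother = any(overlap(seg, g) for g in grandmother_segs)
        with_father = any(overlap(seg, f) for f in father_segs)
        if with_grandmother and not with_father:
            result.append(seg)
    return result
```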

With that adjustment, I can determine the number of segments that somehow only the child and grandmother match to, but not the parent in-between, and here’s the results (the grey line):

image

That is a significantly lower percentage of missing father matches. And that is good, because you’ve got two people, the child and the grandmother both matching Person c.  This Double Match should lower false matches, and it does.

 

Comparing to an Endogamous Population

I have two sets of Chromosome Browser Results files from endogamous populations that I can use. One has both parents with a son and daughter, and the other which is a completely separate family has both parents with a son.

Comparing their results with the Estes* results gives:

image

This is very interesting. There are a lot fewer non-matches than in a non-endogamous population. I have no idea why that might be. Also, the 3 cases I have give almost identical results.

I also think the DM Filtering spreadsheet I’ve made can give you a decent estimate of how endogamous a family is. If you take a look at the number of child-father matches, child-mother matches, and child-bothparents matches, endogamous groups will have many more matches than non-endogamous, and child-bothparents matches will be a much higher percentage of the total. Compare the following total results from the tests that I had:

image

You’ll notice that the 3 non-endogamous children have about 20,000 matches in total.  The 3 endogamous children have over 6 times as many matches.

The 3 non-endogamous children have a much lower percentage of their matches in common with both parents than do the 3 endogamous children.

And finally, the 3 non-endogamous children have a significantly higher percentage of non-matches that can be disproved because neither of their parents shares that match.
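The three comparisons above all come from the same four tallies, so they are easy to compute together. This is a hedged sketch with a hypothetical `match_shares` helper. The numbers in the example call are made up purely to show the shape of the output and are not from any of the families in this post.

```python
def match_shares(father_only: int, mother_only: int, both: int, neither: int):
    """Summarize a child's segment tallies as a total plus the share that
    match both parents and the share that match neither."""
    total = father_only + mother_only + both + neither
    if total == 0:
        raise ValueError("no segments to summarize")
    return {
        "total": total,
        "both_share": both / total,        # higher suggests more endogamy
        "neither_share": neither / total,  # provable non-matches
    }


# Made-up tallies, for illustration only
print(match_shares(father_only=40, mother_only=40, both=10, neither=10))
```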

 

Non-Matches in the X Chromosome by cM

There are many fewer matches in the X chromosome to use than in the autosomes used above. My inspection of the X results indicates more of a difference between males and females than between endogamous and non-endogamous. So I’m going to put the 4 females together and the 2 males together.

Females get two X chromosomes, one from their father and one from their mother. The combined match totals of the 4 female children I have are:

image

Males get one X chromosome, just from their mother. When tallying the percent non-matches for males, I also include the Father column, since the male cannot get his X from his father.

image
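To make the male/female difference in the tally explicit, here is a small sketch. The four column names are my shorthand for the tallies in the tables above, the function is hypothetical, and I am assuming a segment matching both parents still counts as valid for a male because the mother matches.

```python
def x_non_match_rate(father_only: int, mother_only: int, both: int,
                     neither: int, male: bool) -> float:
    """Fraction of a child's X segments that can be shown to be false.

    For a male, father-only matches are also counted as non-matches,
    since he cannot have received any X from his father.
    """
    total = father_only + mother_only + both + neither
    if total == 0:
        raise ValueError("no X segments to tally")
    false_matches = neither + (father_only if male else 0)
    return false_matches / total
```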

Graphing these against our autosomal Estes* for comparison gives:

image

Interestingly, the X non-match percentage is significant even for large segments.

Also interestingly, the male X-segments don’t get worse than 36% non-matches even for very small segments.

I’m not sure why. For the X, I’m just the messenger, presenting the results.

 

Conclusion

So that’s my analysis of Non-Matches by cM using parent filtering.

But it really isn’t what I ultimately need to know. What I am looking for is to find how much these non-matches can be improved by using Triangulation and also what the improvement is for Double Matches that are missing the a-b match and therefore don’t Triangulate.

My theory is that there should be a significant reduction in the % Non-Matches for all Double Matches, whether they Triangulate or not. I’m wondering what the threshold should be, i.e. what cM level, where you need to start worrying that Triangulated segments can be disproved from being IBD simply because the child does not match one of its parents.

But until I can get some data and time to do an analysis, go back to the Bonus! Child-Father-Father’s Mother section above. That was Triangulation in action.

My #RootsTech 2017 Schedule - Sun, 29 Jan 2017

I finalized my schedule of what I plan to do at RootsTech about a week ago (have you finalized yours yet?). I’ve put it on my online RootsTech App as well as on OneNote on my phone in case the wifi at the conference is spotty.

If you want to track me down, this is where I’ll be:


Monday, Feb 6

My Flight Arrives
7:45PM to 7:55PM
Airport

Commonwealth Dinner Meetup
8:30PM to 9:00PM
Blue Lemon


Tuesday, Feb 7

BYU Family History Technology Workshop
7:30AM to 5:00PM
Brigham Young University

Media Dinner
6:30PM to 8:30PM
Room 355

Innovator Showdown Semi-Finalist Technical Setup
8:35PM to 8:45PM
Ballroom B


Wednesday, Feb 8

Innovator Showdown Semi-Finalist Rehearsal
7:00AM to 8:00AM
Ballroom B

IS7000 Innovator Summit General Session
9:00AM to 10:00AM
Ballroom B

IS7100 Industry Trends and Outlook
10:15AM to 11:15AM
Ballroom B

IS7303 Innovation: Best Practices and Applications
11:45AM to 12:15PM
Ballroom J

IS7200 Innovator Showdown Semi-Final
12:30PM to 1:30PM
Ballroom B

IS7702 How to Pitch an Investor
2:00PM to 2:50PM
Ballroom J

RT8642 How will DNA continue to disrupt our industry
3:00PM to 4:00PM
155A - Getting Started

IS1743 FamilySearch API: What’s New and What’s Coming?
4:30PM to 5:30PM
Ballroom G

RT1006 Welcome Party: We Don’t Need Roads
6:00PM to 7:30PM
Marriott Downtown


Thursday, Feb 9

RT5100 General Session - Thursday
8:30AM to 10:00AM
Hall D

LUN9001 MyHeritage Sponsored Lunch
12:00PM to 1:30PM
355B-Lunch

RT4117 It’s a Collaborative Work: Blending FamilySearch and Partner Applications
3:00PM to 4:00PM
255D

LAB2064 Your Health. Your Legacy. Their Future.
4:30PM to 5:30PM
251B - LAB

RT1054 RootsTech Opening Event: Music It Runs in the Family
8:00PM to 9:30PM
Conference Center


Friday, Feb 10

RT1200 General Session - Friday
8:30AM to 10:00AM
Hall D

IS2543 Innovator Showdown Final
10:30AM to 11:30AM
Hall D

LAB1616 Introduction to Chromosome Mapping
3:00PM to 4:00PM
251E - LAB

RT1876 Culture Celebration: Celebrate Your Heritage
5:30PM to 7:30PM
Expo Hall

MyHeritage RootsTech After-Party
8:00PM to 11:00PM
Marriott City Creek


Saturday, Feb 11

RT1300 General Session - Saturday
8:30AM to 10:00AM
Hall D

Flight Home
3:20PM to 8:20PM
Airport


During my unscheduled time, I’ll likely be roaming around the Expo Hall. I’ll also make sure I get a chance to go to the Family History Library and see their new Discovery Center.

It does look like time to eat and sleep will be in short supply.

Track me down if you can, and I’ll give you a couple of ribbons:

image

Triangulation, Single Matching and Double Matching - Fri, 27 Jan 2017

It seems like my last post was a bit confusing to many people. I expect that the way I drew the boxes (to be segments) and the way I connected them with lines (indicating matching) was not intuitive, and it did not allow people to see that Double Matching with two people actually triangulates.

I’m going to start from scratch here. I’m going to use an illustration that hopefully most people will understand. This will be a representation of FamilyTreeDNA’s  Chromosome Browser which most people reading this should be familiar with.

 

Triangulation

Below is a representation of the Chromosome Browser as seen by three different people when they log in to their account at FamilyTreeDNA. Person a will see the top diagram, Person b will see the middle diagram, and Person c will see the bottom diagram. One person cannot see the diagrams of the others.

image

In the top diagram, Person a’s Chromosome Browser shows a match with Person b and Person c on the same segment.

If you log into Person b’s results, their Chromosome Browser will show a match with Person a and Person c over the same segment.

And Person c will see that their Chromosome Browser shows a match with Person a and Person b on the same segment.

This is called Triangulation, where three people all match each other on the same segment. Person a matches Person c, Person b matches Person c, and Person a matches Person b.

The purpose of Triangulation is to help you identify segments that may be Identical by Descent (IBD) because those that are IBD come from a common ancestor of the people who share the same segment. Then through genealogical research, you trace back each of the people’s trees to see where they connect.

For a segment to be IBD, it must Triangulate.

However, a segment that Triangulates is not necessarily IBD. There are a couple of reasons for this:

  1. Two segments may match by chance. This starts happening when segments are shorter than 15 cM and happens more often as the segments get smaller.
  2. Two segments may be on opposite chromosomes. This situation was identified by Blaine Bettinger a couple of days ago on the International Society of Genetic Genealogy (ISOGG) Facebook page. In other words, one segment is the maternal segment and the other is the paternal segment of a chromosome pair.

 

Single Match Triangulation

This is the method most people use for Triangulation. It uses one person’s matches:

image

What you have shown so far by this is that Person a matches Person b, and Person a matches Person c on the same segment. You have not yet Triangulated because you must also show that Person b matches Person c on the same segment. The above Chromosome Browser image does not tell you that. And Person a does not have access to that Person b match with Person c in their own match information at FamilyTreeDNA or at 23andMe.

What Person a can find out from their own account at FamilyTreeDNA or 23andMe is if Person b is “In Common With (ICW)” Person c. That means Person b shares enough DNA with Person c to be considered a match. If that is the case, then it increases the likelihood that the segments Triangulate, but it does not guarantee it because those matches between Person b and Person c may not be on the same segment. There are several tools that make use of ICW data to help you locate Triangulated segments, such as Don Worth’s Autosomal DNA Segment Analyzer (ADSA) at DNAGedcom.

However, to truly Triangulate, you need to verify that the Person b and Person c segments match each other. The one obvious way to do this is to contact either Person b or Person c and ask them to look in their Chromosome Browser to see if they match the other person over this specific segment. If they do, you have verified that this segment Triangulates between Persons a, b and c, and the segment therefore might be IBD.

If Person b or Person c tells you that they don’t match the other person over this specific segment, then they have verified that Persons a, b and c do not Triangulate over this segment and have shown that the segment cannot be IBD for the three of them.

This is however a lot of work, to verify every segment with every person on a one by one basis if you do it manually.

There is just one tool out there that will check the third match for you. It is the GEDmatch Tier 1 Triangulation Tool. It actually looks at the segments of Person b and Person c to ensure that the same segment matches with Person a. GEDmatch finds all the pairs of people who match Person a, so it will display all the Triangulations it finds, whether paternal or maternal, and does not differentiate between them.

 

Double Match Triangulation

This method makes use of two people’s match information. When they each log in to FamilyTreeDNA and look at their Chromosome Browser, Person a will see the top diagram, and Person b will see the bottom diagram.

image

Using just two people’s information, you can truly Triangulate. This is why:

Person a knows of their segment match with Person b, and knows of their segment match with Person c, but does not know if Person b matches Person c on the same segment.

Person b knows of their segment match with Person a, and knows of their segment match with Person c, but does not know if Person a matches Person c on the same segment.

If you put that data together, then you know from Person a that Person a matches Person c on the segment, you know from Person b that Person b matches Person c on the segment, and you know from both of them that Person a matches Person b on the segment. You have the three matches on the same segment that you need for true Triangulation.

So only data from two people is required to Triangulate. You do not need the data from the third person.

When Person a downloads a Chromosome Browser Results (CBR) file from FamilyTreeDNA, it contains all of Person a’s segment matches with everyone else. When Person b downloads their CBR file, it contains all Person b’s segment matches with everyone else. Using these two files, you can therefore find in one fell swoop every segment that is a true Triangulation that involves Person a and Person b and someone else.
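To show the idea of combining the two files, here is a minimal sketch. It is not how Double Match Triangulator is actually implemented, just an illustration of the three-way check under my own assumptions: segments are tuples of (match name, chromosome, start, end) already read from each person's file, any overlap counts, and no minimum overlap length is applied. The function name is hypothetical.

```python
def double_match_triangulations(a_segs, b_segs, b_name):
    """Find regions where a matches c, b matches c, and a matches b.

    a_segs: Person a's segments, tuples of (match_name, chromosome, start, end)
    b_segs: Person b's segments, same shape
    b_name: Person b's name as it appears in Person a's list
    Returns tuples of (person_c, chromosome, start, end) for the shared region.
    """
    ab_segs = [s for s in a_segs if s[0] == b_name]  # a-b matches
    found = []
    for ac in a_segs:
        if ac[0] == b_name:
            continue
        for bc in b_segs:
            # Must be the same Person c on the same chromosome
            if bc[0] != ac[0] or bc[1] != ac[1]:
                continue
            start, end = max(ac[2], bc[2]), min(ac[3], bc[3])
            if start > end:
                continue  # a-c and b-c do not overlap
            for ab in ab_segs:
                if ab[1] != ac[1]:
                    continue
                lo, hi = max(start, ab[2]), min(end, ab[3])
                if lo <= hi:  # all three pairs share this region
                    found.append((ac[0], ac[1], lo, hi))
    return found
```

Only Person a's and Person b's lists go in, and Person c's own file is never needed.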

This is the method that Double Match Triangulator uses to Triangulate. The method of Double Matching ensures that Person a matches Person c, Person b matches Person c, and Person a matches Person b all on the same segment, which is exactly what the GEDmatch Tier 1 Triangulation Tool does as far as Triangulation goes.

But DMT takes this one step further because of its Double Matching. Only segments that Double Match both Person a and Person b will be included in the Triangulations for those two people. So the triangulations are effectively filtered by the relationship of Person a with Person b. For example, if Person b is a 2nd cousin of Person a, then DMT will produce Triangulations only with people who are not only related to both Person a and the 2nd cousin, but who also have segment matches that overlap a match between Person a and the 2nd cousin, yielding true Triangulations. By comparison, GEDmatch does not differentiate its Triangulations and thus does not give you the ability to filter them.

But always keep in mind that even though DMT and the GEDmatch Triangulation tool both produce true Triangulations, a true Triangulation does not guarantee that the segment is IBD (see above for the two reasons). Determining IBD is a separate issue that neither DMT nor GEDmatch can yet address.