Non-Matches by cM - Wed, 1 Feb 2017
Roberta Estes did the first analysis of this and asked someone to do the same thing for an endogamous population. I had that data and I felt I wanted to know as well.
I did a blog post about it a few days ago that I titled Double Match Phasing for an Endogamous Population, but the term “Double Match Phasing” was not quite accurate: phasing is done at the allele level, whereas matching works at the segment level, so the name contradicts itself. I’m going to go back and rename this “Double Match Filtering”, since what it really does is filter out everyone who doesn’t match both people. (As it turns out, this is one huge benefit of Double Matching: you can choose the two people to filter with. But that will be the topic of a future post.)
Regarding the issue of Non-Matches by cM: First let me state that we are not in any way trying to claim that a segment is Identical by Descent (IBD). We are actually doing the opposite and are showing that the match is false and cannot be IBD. This can be shown when there’s a segment match of a child with Person c that does not also match with at least one parent of the child. If neither parent matches, then the child could not have had the segment passed down and it must be a by-chance match with Person c.
The first step of this analysis was to verify that my analysis gave the same results as Roberta. She was kind enough to send me the FamilyTreeDNA Chromosome Browser Results files she used so that I could check and compare my results with hers.
A Free Excel Spreadsheet Template for You
In Roberta’s analysis, she combined the child, father and mother results together and manually inspected them to find child matches that did not match either parent. This was going to be a lot of work on her part, so to reduce the number of matches she’d have to inspect, she first eliminated all matches under 3 cM from the 3 files.
For my analysis, I developed Excel equations that would automate the detection of overlapping segments for me. This would ensure I would not make any manual mistakes and allowed me to use all the data right down to the 1 cM limit that FamilyTreeDNA provides in its CBR files.
I have made available for free a template for this Excel spreadsheet that you can use and try for yourself. It includes a few terse instructions on how to add your child/father/mother CBR files and even includes a graph you can use to compare your results. But it’s caveat emptor. You’ll need to have decent skills with Excel to use it. To understand what its results mean, read the rest of this article.
The template is here: DM Filtering Parents Child Template.xlsx
Roberta did use Double Matching in her analysis but did not recognize it as such. She found all the matches of the child, father and mother, and she looked for double matches between the child as Person a to any Person c, and either the father as Person b or the mother as Person b to the same Person c. If neither the father nor the mother matched Person c, then she marked that Person c as a false match of the child.
I did basically the same, except that I kept the father’s Double Matches with the child separate from the mother’s Double Matches with the child. This allows for a bit more analysis since you can now also determine the number of matches with both parents which is useful for endogamy. My number matching neither parent is no different than Roberta’s.
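The sorting of each child segment into father, mother, both or neither can be sketched in Python. This is only a minimal sketch and not the Excel equations from the template: the `Segment` fields and the rule that any positional overlap on the same chromosome with the same Person c counts as a parent match are assumptions made for illustration.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    person: str      # name of Person c (hypothetical field names)
    chromosome: str  # "1" to "22", or "X"
    start: int       # start position of the matching segment
    end: int         # end position of the matching segment
    cm: float        # segment size in cM

def overlaps(a, b):
    """Assumed rule: same Person c, same chromosome, positions overlap."""
    return (a.person == b.person and a.chromosome == b.chromosome
            and a.start <= b.end and b.start <= a.end)

def classify(child_seg, father_segs, mother_segs):
    """Label one child segment as 'father', 'mother', 'both' or 'neither'."""
    with_father = any(overlaps(child_seg, s) for s in father_segs)
    with_mother = any(overlaps(child_seg, s) for s in mother_segs)
    if with_father and with_mother:
        return "both"
    if with_father:
        return "father"
    if with_mother:
        return "mother"
    return "neither"  # a proven non-match under the assumed rule
```

Keeping the father and mother labels separate, rather than a single matched/unmatched flag, is what makes the "both" count available later on.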
Comparing my results with Roberta’s gave this:
In each of the charts on this page, the cM value represents the lower bound of the cM group. So “1” is 1 to 1.99 cM. “2” is 2 to 2.99 cM. The “15” is 15 to 19.99 cM and “20” is 20 cM or more.
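For anyone reproducing the grouping outside of Excel, here is a minimal Python sketch of assigning a segment to its cM group. The text above only names the 1, 2, 15 and 20 groups, so the full list of lower bounds in `GROUP_BOUNDS` is an assumption (whole numbers from 1 to 10, then 15 and 20) and should be adjusted to whatever groups you actually chart.

```python
import bisect

# Assumed lower bounds of the cM groups; only 1, 2, 15 and 20 are stated in the text.
GROUP_BOUNDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]

def cm_group(cm, bounds=GROUP_BOUNDS):
    """Return the lower bound of the group a segment falls in, or None if below the first group."""
    i = bisect.bisect_right(bounds, cm) - 1
    if i < 0:
        return None  # smaller than the smallest group (under 1 cM by default)
    return bounds[i]
```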
In the chart above, you’ll see that at 5 cM and above, Roberta’s line and my line for her data are almost exactly the same. When you go down to 4 cM and 3 cM they start to diverge. The reason for this was Roberta’s 3 cM cutoff. There are instances where the child has a little bit of extra random match that puts them above the 3 cM threshold, but the parent’s matching segment was just under 3 cM and so was among the matches Roberta had deleted from the file.
The same phenomenon may be happening to my data at the 1 cM cutoff applied by FamilyTreeDNA, but that’s the smallest segment that they provide. So the numbers in the Check data may be a couple of percent higher at the 1, 2 and 3 cM level than they should be.
But this is very interesting. It says that for segments 8 cM and smaller, the percentage of child matches that don’t have a corresponding parent match in Roberta’s data grows from 20% up to a maximum of about 80% at 3 cM or less. In other words, a full 20% of very small segments do have a parent matching on the same segment, which is probably more than most people thought.
Must Match at least Once with Parents
Okay. I’ve verified that Roberta’s results match with mine. During that process, I found something similar to that 3 cM cutoff effect that needed to be handled. This was a situation where the child matches to Person c on one or more segments, but neither parent matches to Person c at all on any segment. In this case, Roberta is including all the child’s segments as non-matches.
About 15% of the people matching matched only the child and neither parent. It is not as if there are only one or two matching segments with these people. The minimum match requirements of FamilyTreeDNA are such that each of these people matches the child on between 6 and 29 segments, averaging 12 segments. The total cM matching to each person starts at 20 cM and averages 33 cM, with only a few totalling more than 40 cM. The average likelihood of one of these segments not matching (according to our results shown in the graph above) is about 70%. The chance of all 12 segments of a match being non-matches is 0.7**12 = 0.014, or about 1%. So in almost all cases, some of these segments must match some segments of at least one of the parents. Why don’t they? Because the parent must have just slipped under FamilyTreeDNA’s criteria for a match and thus was not included in the match file.
So if we include these people who match the child but don’t have at least one segment match with a parent, then we are counting all their segments as non-matches, which is almost assuredly not true, and we are overestimating the amount of non-matches by 15%. As a result, I’ve included the option in my spreadsheet template of “Must Match at least Once with Parents”, which I recommend be left at TRUE. You can change it to FALSE to compare to what most studies (which do not realize this is a problem) would come out with.
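The arithmetic behind that figure is easy to check. The short Python sketch below assumes, as the estimate above implicitly does, that each segment is a non-match independently of the others; the function name is hypothetical.

```python
def chance_all_false(n_segments, p_single=0.70):
    """Chance that every one of n segments is a non-match, assuming independence."""
    return p_single ** n_segments

p_average = chance_all_false(12)  # the average case of 12 segments
p_minimum = chance_all_false(6)   # the fewest segments seen, 6
print(f"12 segments: {p_average:.4f}")  # 0.0138, i.e. about 1%
print(f" 6 segments: {p_minimum:.4f}")  # 0.1176
```

Even at the low end of 6 segments, fewer than 12% of such people would be expected to have every segment be a non-match.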
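As a rough illustration of what that option does, here is a Python sketch. It is not the spreadsheet's formula: segments are represented as hypothetical `(person, chromosome, start, end)` tuples, and the parent inputs are simply the sets of names appearing anywhere in each parent's match file.

```python
def filter_child_segments(child_segs, father_people, mother_people,
                          must_match_parent=True):
    """Drop child segments whose Person c matches neither parent on any segment.

    child_segs: iterable of (person, chromosome, start, end) tuples (assumed layout).
    father_people, mother_people: names found in each parent's match file.
    """
    if not must_match_parent:
        return list(child_segs)  # the FALSE setting: keep everything
    parent_people = set(father_people) | set(mother_people)
    return [seg for seg in child_segs if seg[0] in parent_people]
```

The filter is applied per person, not per segment: a Person c who matches a parent anywhere keeps all of their child segments, and those are then tested for overlap one by one.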
Here is what Roberta’s results look like when corrected for this. Compare the blue line to the orange line:
The very interesting effect of this is not to lower the chances of non-matches for smaller or larger segments, but to lower them for the mid-range segments. This is likely because these matches barely met the criteria required for the child to match, and most had a reasonably large segment in the 5 cM or 6 cM range which was called false because the parents just missed the criteria.
Removal of X Chromosome matches
One other thing I found while I did this work was that the X chromosome was different. It had its own pattern of false matches. It should be studied separately (and I will do so at the end of this post). The X chromosome should not be combined with the autosomal chromosomes 1 to 22.
When we take the X out of Roberta’s data, we get this. Compare the yellow line to the blue line:
Removal of the X chromosome gets rid of the strange-looking drop we had at 6 cM and gives us a nice smooth line, with non-matches starting to be significant when segments are 6 cM or less and ultimately reaching 77% proven false matches when segments are 1 or 2 cM.
And the resulting yellow line is just about the same as Roberta’s initial results, except shifted left 2 cM.
Once again, remember, this is the percentage of child matches that can be shown to be non-IBD simply because neither parent matches on the same segment. This is not the non-IBD percentage, which will be higher. This number is a lower bound because there are other reasons why a segment might not be IBD.
From this point on, I’ll refer to the yellow line as Estes*, as it includes the refinements I applied above, which were: (1) going down to 1 cM, (2) excluding matches to no parents, and (3) excluding the X chromosome. This Estes* yellow line will be the baseline against which I’ll compare other results for the rest of this article.
The Other Person
We, of course, are only checking that our child and one of his parents both match Person c. But what about Person c? What if Person c, who connects to us, does not match either of their parents on that segment? Assuming that it’s as likely to have matches proved false for Person c as what we found for Estes*, and assuming independence between the child’s false matches and Person c’s false matches, it is easy to calculate the additional percentage of false matches as: 1 – (1 - %child-false)**2 and it theoretically will result in this:
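That formula can be written as a small Python function. It is a sketch under the same two assumptions stated above (the same false rate on both sides, and independence between them); the function name is hypothetical.

```python
def two_sided_false_rate(child_false):
    """Chance that at least one side lacks a matching parent.

    child_false: one-sided proven-false rate as a fraction (e.g. 0.77 for 77%).
    Assumes Person c has the same rate and that the two sides are independent.
    """
    return 1 - (1 - child_false) ** 2

# Using the 77% Estes* figure for 1 and 2 cM segments:
print(f"{two_sided_false_rate(0.77):.4f}")  # 0.9471
```

So a one-sided rate of 77% becomes roughly 95% when both sides are checked, which is why the combined line sits so much higher for very small segments.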
Now that blue line takes us up to non-confidence levels many people believe are true with respect to non-IBD numbers, at least for very small segments. I’m still not saying that these represent IBD likelihood, because these don’t. This is just the percentage of matches that can be proven false if one or the other of a match does not have a parent that also matches.
The assumption of independence means that a non-match with a parent on the cousin’s side is not more or less likely when the child is a match or non-match with a parent. If there is a dependency, then non-matches of the child will more often happen at the same time the cousin doesn’t have a parent match. This will reduce the % of non-matches and the combined line will fall somewhere between the yellow and blue lines.
For now, we’ll ignore these double-sided numbers because I don’t have parents for the cousins to enable analysis of where the blue line is. I can only determine the yellow line. So for the rest of the analysis on this page, we’ll go back to the Estes* yellow line to compare with. We will also always exclude X and exclude the child’s matches with people who don’t match a parent on at least one segment.
The Child’s Sibling
Roberta gave me the data for a second child which she didn’t analyse for her article. It’s easy for me to analyze it with my spreadsheet, so I thought I’d do the calculation for her.
The good news is that the two give very similar results.
Comparing to a Different Child-Mother-Father trio
The question is whether Roberta’s example is representative of everyone or if there’s a big variance between the non-match rates for different people.
Kathy Benzi responded to my request for additional sets of Chromosome Browser Results files for child/mother/father trios. When I ran her results, it gave me these results:
Interesting! Kathy’s results give slightly lower non-match percentages than Roberta’s do. Not sure why, but they are still reasonably close.
Bonus! Child-Father-Father’s Mother
Kathy sent me a “bonus”. She also included the CBR file for the father’s mother. At first, I didn’t think I’d be able to use the grandparent, but I put it into the DM Filtering spreadsheet and realized it gives different, but also very useful information.
If a person is related to the child on the father’s mother’s side, then the child’s match must also be a match with the father and the grandmother (which the spreadsheet defines as “both”). We can ignore the matches that are only on the father’s side, because the valid ones would be on his father (the grandfather’s) side. And we can ignore the matches on the “neither” side, because the valid ones would be on the child’s mother’s side.
What is important are the matches of the child that are the same as the grandmother’s, but are not matches with the father. Those are then false matches that somehow did not go through the father. These aren’t one-off cases of single segments. These are multiple segments that match between the child and Person c and between the grandmother and Person c but don’t match between the father and Person c. Once again, I have to make sure that the father matches Person c on at least one segment. Otherwise his reason for not matching may simply be that he just missed the match criteria, as in the “Must Match at least Once with Parents” adjustment discussed earlier.
With that adjustment, I can determine the number of segments that somehow only the child and grandmother match on, but not the parent in between, and here are the results (the grey line):
That is a significantly lower percentage of missing father matches. And that is good, because you’ve got two people, the child and the grandmother both matching Person c. This Double Match should lower false matches, and it does.
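The child-father-grandmother check described in this section can be sketched in Python as follows. As before, this is an illustration and not the spreadsheet itself: segments are hypothetical `(person, chromosome, start, end)` tuples and any positional overlap is assumed to count as matching on the same segment.

```python
def overlaps(a, b):
    """Assumed rule: same Person c, same chromosome, positions overlap."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]

def segments_skipping_father(child_segs, father_segs, grandmother_segs):
    """Child segments shared with the grandmother but missing from the father.

    Only people the father matches on at least one segment are considered,
    so that a father who just missed the match criteria is not counted.
    """
    father_people = {seg[0] for seg in father_segs}
    skipped = []
    for c in child_segs:
        if c[0] not in father_people:
            continue  # father has no match at all with this Person c
        with_grandmother = any(overlaps(c, g) for g in grandmother_segs)
        with_father = any(overlaps(c, f) for f in father_segs)
        if with_grandmother and not with_father:
            skipped.append(c)
    return skipped
```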
Comparing to an Endogamous Population
I have two sets of Chromosome Browser Results files from endogamous populations that I can use. One has both parents with a son and daughter, and the other, which is a completely separate family, has both parents with a son.
Comparing their results with the Estes* results gives:
This is very interesting. There are a lot fewer non-matches than in a non-endogamous population. I have no idea why that might be. Also, the 3 cases I have give almost identical results.
I also think the DM Filtering spreadsheet I’ve made can give you a decent estimate of how endogamous a family is. If you take a look at the number of child-father matches, child-mother matches, and child-bothparents matches, endogamous groups will have many more matches than non-endogamous, and child-bothparents matches will be a much higher percentage of the total. Compare the following total results from the tests that I had:
You’ll notice that the 3 non-endogamous children have about 20,000 matches in total. The 3 endogamous children have over 6 times as many matches.
The 3 non-endogamous children have a much lower percentage of their matches in common with both parents than do the 3 endogamous children.
And finally, the 3 non-endogamous children have a significantly higher percentage of non-matches that can be disproved because neither of their parents shares that match.
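If you have already labelled each child segment as father, mother, both or neither, the percentages compared above can be tallied with a few lines of Python. This is a minimal sketch with hypothetical names, and it sets no threshold for what counts as endogamous, since the article does not give one.

```python
from collections import Counter

CATEGORIES = ("father", "mother", "both", "neither")

def match_summary(labels):
    """Return the total number of child matches and the percentage in each category."""
    counts = Counter(labels)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        raise ValueError("no labelled matches to summarize")
    percents = {c: 100.0 * counts[c] / total for c in CATEGORIES}
    return total, percents
```

A high total together with a high "both" percentage is the pattern described above for the endogamous children, while a high "neither" percentage is the pattern for the non-endogamous ones.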
Non-Matches in the X Chromosome by cM
There are many fewer matches in the X chromosome to use than in the autosomes used above. My inspection of the X results indicates more of a difference between males and females than between endogamous and non-endogamous. So I’m going to put the 4 females together and the 2 males together.
Females get two X chromosomes, one from their father and one from their mother. The combined match totals of the 4 female children I have are:
Males get one X chromosome, just from their mother. When tallying the percent non-matches for males, I also include the Father column, since the male cannot get his X from his father.
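The difference in how the X non-match percentage is tallied for sons and daughters can be sketched in Python. The `counts` layout and the function name are assumptions for illustration; the one rule taken from the text is that a son's father-only X matches are counted as non-matches.

```python
def x_non_match_percent(counts, sex):
    """Percent of a child's X matches that are proven non-matches.

    counts: dict with the number of X segments labelled 'father', 'mother',
            'both' and 'neither' (assumed layout).
    sex: 'male' or 'female'.
    """
    total = sum(counts.get(k, 0) for k in ("father", "mother", "both", "neither"))
    if total == 0:
        raise ValueError("no X matches to tally")
    non_matches = counts.get("neither", 0)
    if sex == "male":
        # A son cannot inherit X from his father, so father-only matches are false.
        non_matches += counts.get("father", 0)
    return 100.0 * non_matches / total
```

Matches labelled "both" stay valid for a son because the mother also matches on that segment.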
Graphing these against our autosomal Estes* for comparison gives:
Interestingly, the X non-match percentage is significant even for large segments.
Also interestingly, the male X-segments don’t get worse than 36% non-matches even for very small segments.
I’m not sure why. For the X, I’m just the messenger, presenting the results.
Conclusion
So that’s my analysis of Non-Matches by cM using parent filtering.
But it really isn’t what I ultimately need to know. What I am looking for is to find how much these non-matches can be improved by using Triangulation and also what the improvement is for Double Matches that are missing the a-b match and therefore don’t Triangulate.
My theory is that there should be a significant reduction in the % Non-Matches for all Double Matches, whether they Triangulate or not. I’m wondering what the threshold should be, i.e. what cM level, where you need to start worrying that Triangulated segments can be disproved from being IBD simply because the child does not match one of its parents.
But until I can get some data and time to do an analysis, go back to the Bonus! Child-Father-Father’s Mother section above. That was Triangulation in action.