Louis Kessler’s Behold Blog

Double Match Filtering for an Endogamous Population - Sun, 22 Jan 2017

A few days ago, Roberta Estes posted: Concepts – Segment Size, Legitimate and False Matches where she compared a child’s matches against those of her parents. She downloaded the Chromosome Browser Results (CBR) file from FamilyTreeDNA for a set of parents and a child, and then explained how she did the matching in a spreadsheet.

Roberta’s key result was a Parent Child Phased Segment Match Chart, which shows that she passed the 50% mark for false matches at 7 to 7.99 cM segments, rising to 87% false matches once segments are as small as 3 to 3.99 cM.

Roberta refers to this technique as “double parent phasing” (no caps) whereas I’d like to call it “Double Match Filtering” (with caps). I’m naming it this because it is exactly the same technique I use for what I call Double Match Analysis.

What is being done is this: we take a child as Person a and one parent as Person b, and we find all the Person c people who match both. Then we do it a second time, with the same child again as Person a and the other parent as Person b, and again find the Person c people who match both. Using these two sets of Double Matches, we go back to all the child’s Single Matches and see which do not double match with either parent. Those non-matches cannot be Identical by Descent (IBD), since one parent would have had to match in order to pass the segment down from the ancestor, through them, to the child.
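The filtering described above can be sketched in a few lines of Python. This is only a minimal illustration, not Roberta’s spreadsheet method or any program’s actual code. It assumes that a child’s segment “double matches” a parent when the parent has a segment with the same Person c on the same chromosome that overlaps it; the `Segment` fields and function names are made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    match_name: str   # Person c
    chromosome: str
    start: int        # start position on the chromosome
    end: int          # end position on the chromosome
    cm: float         # segment length in centimorgans


def index_segments(segments):
    """Group a parent's segments by (Person c, chromosome) for fast lookup."""
    index = defaultdict(list)
    for seg in segments:
        index[(seg.match_name, seg.chromosome)].append(seg)
    return index


def double_matches(child_seg, parent_index):
    """True if the parent matches the same Person c on an overlapping stretch."""
    key = (child_seg.match_name, child_seg.chromosome)
    candidates = parent_index.get(key, [])
    return any(p.start <= child_seg.end and child_seg.start <= p.end
               for p in candidates)


def filter_child_segments(child, father, mother):
    """Label each child segment: 'father', 'mother', 'both' or 'neither'."""
    f_index, m_index = index_segments(father), index_segments(mother)
    labels = {}
    for seg in child:
        on_f = double_matches(seg, f_index)
        on_m = double_matches(seg, m_index)
        labels[seg] = ("both" if on_f and on_m else
                       "father" if on_f else
                       "mother" if on_m else "neither")
    return labels
```

Segments labelled 'neither' are the ones that cannot be IBD under this reasoning. A real analysis might also require a minimum amount of overlap, which this sketch does not.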

The high percentage of false matches for small segments under 8 cM in Roberta’s results is what scares genealogists away from using small segments. And this is the downfall of Single Match Triangulation. A large number of small single matches are likely false and are not IBD.

Towards the end of her article, Roberta said:

“I hope that other people in non-endogamous populations will do the same type of double parent phasing and report on their results in the same type of format. This experiment took about 2 days.

Furthermore, I would love to see this same type of experiment for endogamous families as well.”

An Endogamous Family

I’ve had plans to do this anyway. I need to analyze how the matches pass down as part of my investigation into methods to use Double Match Triangulation to map segments onto ancestors.

So I’m taking a number of Chromosome Browser Results files that were sent to me by Arnold, a DNA-cousin of mine, to help me develop my Double Match Triangulator program and see if I can use it to figure out how we’re related.

(By the way, I define a “DNA-cousin” or “DNA-relative” as someone who is a DNA match, but neither of us has the foggiest idea of how we’re actually related.)

Arnold has been doing DNA analysis with FamilyTreeDNA for a long time, and he had about 20 CBR files that he let me use. He, like me, comes from an endogamous Ashkenazi population.

His files include a father, mother, son and daughter, as well as other relatives of those four. An endogamous population gives those involved many more matches than you’d expect. That’s because everybody is related to everybody else, often in multiple ways. Here are the statistics for the four people I’ll use:

The father has 163,249 single match segments with 7,654 people.
The mother has 149,083 single match segments with 7,139 people.
The daughter has 146,767 single match segments with 7,271 people.
The son has 142,066 single match segments with 7,014 people.

To add an interesting complication, the father and mother are related. They share 25 matching segments totalling 98.0 cM, with the longest being 18.9 cM. This would normally make them something like 3rd cousins. But because of endogamy, they are more likely 5th and 6th cousins in several different ways.

The Spreadsheet Analysis

I basically did what Roberta said to do. I did it twice, once for the son with his parents, and once for the daughter and parents. Each file has about 450,000 lines in it. These are big Excel files that ended up (with analysis equations) being about 80 MB in size each.

I didn’t delete the segments under 3 cM like Roberta did. She was visually inspecting each match herself, so wanted a manageable number of matches to work with. Her non-endogamous CBR files had about 25,000 segment matches in each one, and removing the under 3 cM ones left her with about 6,000 matches in each, for a total of 18,000 lines to work with, and that was plenty to provide reasonable results.

I was able to develop Excel formulas to do the match comparisons that Roberta did by hand. Since I was letting the computer do the work, I didn’t need to cut down the size of the analysis and I could work with the whole dataset.

Roberta didn’t mention it, but you do have to remove the father, mother and child wherever they appear as the “MATCHNAME”. They all match each other on many segments, including the father and mother as I mentioned above. You don’t want to count those in these statistics.

Also, it’s really important to check the date of your downloads of the two parents’ files and the child’s file. If they were not downloaded at the same time, a later downloaded file will contain matches to people that an earlier download did not. This will make it look like one person matches and the other does not, when what is really true is that you just don’t have the matches for the other person.

These one-sided matches had to be eliminated. I found the best way was to see if the child had matches to a Person c that neither their mother nor father had. For this Person c to show up in the child’s match list, they had to have at least a half dozen matches totalling at minimum around 20 cM. For that to happen and for none of those segments to match either parent is practically impossible, meaning the matches for the parents are missing. So I deleted these from the analysis. They amounted to about 5% of the matches and did not really change the results other than reducing the number of large segments that did not match.
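The two clean-up steps described above (dropping the father, mother and child as match names, and dropping any Person c who appears in the child’s file but in neither parent’s file) might look something like this in Python. It is a rough sketch under assumed data structures, not the Excel formulas used for the actual analysis; the `Seg` type and function name are hypothetical.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    match_name: str   # Person c
    chromosome: str
    start: int
    end: int
    cm: float


def clean_child_segments(child, father, mother, trio_names):
    """Return (kept_segments, dropped_people) for the child's match list.

    trio_names: the names of the father, mother and child themselves,
    which should not be counted as Person c.
    """
    # Everyone who shows up at all in either parent's file.
    known_to_parents = ({s.match_name for s in father} |
                        {s.match_name for s in mother})
    kept, dropped_people = [], set()
    for seg in child:
        if seg.match_name in trio_names:
            continue  # parent/child/sibling-of-self rows are not Person c
        if seg.match_name not in known_to_parents:
            # Probably a download-date gap, not a real non-match.
            dropped_people.add(seg.match_name)
            continue
        kept.append(seg)
    return kept, dropped_people
```

Returning the dropped names separately makes it easy to check what share of the matches was removed.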

And because the parents were related, I knew there would be some matches that would be on both parents’ sides, so I made sure I was able to count those so I’d have them for future analysis.

The Double Match Phasing Results

These results include only matches on the 22 autosomal chromosome pairs. The X chromosome is a bit different, so I removed the X matches and will analyze them separately in a later post.

Here are the results of the daughter versus her father and mother:

And the results of her brother (the son) versus the same father and mother were very similar:

The results showed that there was much less chance of a non-match in small segments for these endogamous people than what Roberta’s results showed. Highlighting the 50% point in yellow, it comes in at the 2 – 3 cM range, compared to Roberta’s 50% point, which comes in at the 7 – 8 cM range. This surprised me so much that I went back and double and triple checked my equations to make sure they were identifying segments correctly and totalling everything correctly. They were.
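For anyone wanting to reproduce this kind of table outside a spreadsheet, the percentage of non-matches per segment-size range can be computed roughly as follows. This is a sketch only: it assumes 1 cM wide ranges (3 to 3.99, 4 to 4.99, and so on) and that each child segment has already been flagged as matching a parent or not. The function name is made up.

```python
import math
from collections import defaultdict


def non_match_rate_by_bin(segments):
    """segments: iterable of (cm, matched_a_parent) pairs.

    Returns {bin_start_cm: percent of segments in that bin matching no parent},
    where bin 3 covers 3.00 to 3.99 cM, bin 4 covers 4.00 to 4.99 cM, etc.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for cm, matched in segments:
        bin_start = math.floor(cm)
        totals[bin_start] += 1
        if not matched:
            misses[bin_start] += 1
    return {b: 100.0 * misses[b] / totals[b] for b in sorted(totals)}
```

The 50% point is then the largest range whose percentage is at or above 50.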

Here is a plot of % Non-matches by segment size from several different analyses. In addition to my results and Roberta’s results, I’m including John Walden’s False Positive both sides phased results that are on the ISOGG Wiki, which Blaine Bettinger talks about in his “Small Matching Segments – Friend or Foe” article of 2014. I’m also including Ann Raymont’s findings in her “When is a match a false positive?” post from 2016.

It seems that every other study, all of non-endogamous populations, gives similar results, but mine is different. I currently do not know why this is. I can’t think of a reason why endogamy might give fewer non-matches for a given segment size. My number of observations is certainly large enough, so unless my analysis is being done differently (or incorrectly), and I don’t believe it is, I think I may be showing something quite significant and relevant.

Among the 68 Chromosome Browser Results files that I have, including those my DNA-relatives have given me, this father/mother/son/daughter set was the only one with both parents and a child. I would like to test some more, both endogamous and not.

I made my analysis spreadsheet quite general so that I could easily do this analysis for any father/mother/child triplet. If you’re interested in seeing what your non-match percentage looks like and would like to help me with this research that I’ll use to give my Double Match Triangulator program some smarts, please send me your set of CBR files. In return, I’ll be happy to send you the spreadsheet with your data in it and the results.

So if you have any set of CBR files from FamilyTreeDNA that include both parents and 1 or more children, would you be willing to send them to me so that I can analyze them the same way? Thanks.

Double Match Triangulator - Version 1.4 - Fri, 20 Jan 2017

DMT is a semi-finalist in the #InnovatorShowdown at #RootsTech 2017. This is a new version of the program with several improvements.

You can get the new version on my DMT page. It is freeware to help you do autosomal DNA segment analysis.

Now Works with Older CBR files

My own FamilyTreeDNA results came in 11 days ago. When I downloaded my Chromosome Browser Results (CBR) file and ran it through DMT, it didn’t find any triangulations with anybody. That’s because my results were brand new. The other CBR files I had did not know about my results because when they were downloaded, my results weren’t in the system yet.

DMT used to check that Person a’s file had matches with Person b and that Person b’s file had the same matches with Person a. If not, DMT wouldn’t use the a-b matches. So there would be people who Double Matched, but nobody would Triangulate.

To handle this situation, Version 1.4 now only needs the a-b matches in either Person a or Person b’s file. Now you won’t need to update all your older CBR files whenever you get a new tester in your family. Of course, you’ll only Double Match with Person c people who got their results after the older of your Person a and Person b files. Eventually you may want to update your older CBR files with newer ones, especially if there’s a particular Person c missing from the analysis. But updating your files is no longer necessary.
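The relaxed rule can be pictured like this. The Python below is only an illustration of the idea of taking the a-b matches from whichever file has them; it is not DMT’s code, and the row layout and function name are assumptions.

```python
def ab_segments(file_a, file_b, name_a, name_b):
    """Collect a-b segments from either person's file.

    file_a, file_b: lists of (match_name, chromosome, start, end) rows.
    Returns a set of (chromosome, start, end) tuples.
    """
    # Rows in Person a's file where the match is Person b ...
    from_a = {row[1:] for row in file_a if row[0] == name_b}
    # ... plus rows in Person b's file where the match is Person a.
    from_b = {row[1:] for row in file_b if row[0] == name_a}
    return from_a | from_b
```

If Person a’s file is older and has no rows for Person b at all, the segments still come through from Person b’s file.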

Prevents the Same Person from being used Twice

This was annoying. If you had several CBR files for a person, downloaded on different dates, and you ran By Chromosome to combine everything, then the person would be included as Person b multiple times.

Now DMT checks the names of the Person b people. If the same name shows up, it will only use the last file when ordered by filename alphabetically, which should be the one with the latest date.

This way, you can download new CBR files and leave them with their older ones for comparison, and DMT will only use the newest in its By Chromosome runs.
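As a rough illustration of that rule (not DMT’s actual code), picking the last file alphabetically for each name could be written as below. The file names in the usage example are invented and simply assume the download date is part of the name.

```python
def newest_file_per_person(files):
    """files: iterable of (filename, person_name) pairs.

    Returns {person_name: filename}, keeping for each person the file
    that sorts last alphabetically by filename.
    """
    chosen = {}
    for filename, person in sorted(files, key=lambda pair: pair[0]):
        chosen[person] = filename  # later filenames overwrite earlier ones
    return chosen


# Hypothetical file names with the date embedded:
picked = newest_file_per_person([
    ("PersonB_20170110.csv", "Person B"),
    ("PersonB_20160801.csv", "Person B"),
    ("PersonD_20161201.csv", "Person D"),
])
```

Note that this only picks the newest file if the names sort by date, for example with dates written as year, month, day.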

Excludes non-matches from the By Chromosome Analysis

Originally, I thought it was okay to include all the Chromosome Browser Results files in the By Chromosome analysis. I thought that even if Person b does not match Person a, the Double Matches should still be meaningful.

Yes that is true, but …

This will lead to a false interpretation if Person b actually does match Person a on some segments, but they are below FamilyTreeDNA’s threshold to consider them a match. The segments that were a-b matches would then incorrectly show up in DMT as Missing a-b Segments rather than as Triangulations. This is very bad because Double Match Theorem 1 would get you to conclude that this segment is on the other half of the Chromosome pair than it really is. That would make you conclude that this is a paternal match when it is really maternal, and vice versa.

So that had to change. Non-matches are excluded in the By Chromosome Analysis.

Better Handling of Duplicate Segments in CBR files

FamilyTreeDNA unfortunately downloads matches in its CBR files by match name rather than kit number. If two people have the same name, e.g. John Smith, or if one person tested twice under the same name, all those matches will be in the CBR file mixed together, looking like one person. DMT puts a “##” before the name of people with this problem, so that you will be aware when you use those segment matches.

Duplicate segments occur when a person has tested twice. In most cases, all the segments are duplicated (or even triplicated if someone did 3 tests). This case is easy to detect, and all the extra entries can be removed. Then this Person c can be used without worry. DMT now fixes this for you and there is no “##” before these people’s names.
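One way to picture the easy case is the Python sketch below. It is an assumption about how such a check could be done, not a description of DMT’s internals: if every one of a Person c’s segments appears the same number of times (two or more), one copy of each is kept; otherwise the list is left alone for the “##” treatment.

```python
from collections import Counter


def collapse_exact_duplicates(segments):
    """segments: list of hashable segment rows for ONE match name,
    e.g. (chromosome, start, end, cm) tuples.

    If every segment is repeated the same number of times (>1), return
    one copy of each in first-seen order. Otherwise return the list unchanged.
    """
    counts = Counter(segments)
    copies = set(counts.values())
    if len(copies) == 1 and copies != {1}:
        return list(counts)   # Counter keeps first-seen order of its keys
    return list(segments)
```

A mixed case, where only some segments repeat, is deliberately not touched here since it may be two different people sharing a name.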

For the overlapping people, if you really need to fix one or two because they are critical in your analysis, you can go to FamilyTreeDNA’s Chromosome Browser and look that name up. You’ll see more than one person. You can download their individual matches and manually doctor up your CBR files, but you’ll have to make up a different name for the other, e.g. John Smith and John Smith2. This is messy because your CBR file for Person b will also have its John Smiths together, and your John Smith2 won’t match anyone in Person b’s file unless you fix that file as well. Ugh! Better to wait for FamilyTreeDNA to fix this problem, if anyone knows how to let them know about it.

Improvements to the People Page

This is likely the most visible improvement. It is on the People page for individual Double Match runs, and for the By Chromosome run. The two have been made more consistent.

And now all segment matches use consistent notation for the largest Single Matches between Person a and Person c on each Chromosome, 1 to 22 and X (sometimes referred to as 23).

If a-c Triangulate on that Chromosome, then the largest length in cM of any a-c segment that Triangulates is prefixed by the letter "T" and is shown in green so it can be easily picked out.

If a-c does not Triangulate on that Chromosome, but does Double Match, then the largest length in cM of any a-c segment that Double Matches is prefixed by the letter "D".

X matches are shown in column ACX in red text, with an "X" following the "T" or "D" prefix.

Also all Triangulating people are shown first, ordered highest to lowest in their total a-c cM, so the closer relatives will be listed earlier on.

————————–

I found that I needed the above changes once I downloaded my own data. I’m sure they’ll be useful to you as well if you use DMT.

It took me 6 days to make these changes. I know I worked hard to get this working over that time. So I was curious and I counted up the number of DMT runs that I had to do to implement, test and debug all this. I was able to total up the number of DMT log files that were created each day. The counts were:

Sunday, Jan 15 - 48
Monday, Jan 16 - 33
Tuesday, Jan 17 - 30
Wednesday, Jan 18 - 80
Thursday, Jan 19 - 73
Friday, Jan 20 - 33

Wow! I thought I worked hard on this, but I never expected that it would have taken me 297 runs of Double Match Triangulator to get the changes in this version working.

In total I’ve got log files for 1,668 Double Match Triangulator runs dating back to my first prototype run on June 26, 2016 when I first added the log file.


GEDCOM 1 Lives! - Sat, 14 Jan 2017

I found out about this on December 29, when Martin Geldmacher of Germany requested a trial key for Behold on the Behold Download page. He wrote the following in the “Please let me know how you found out about Behold” box:

I am trying to find a way to read/convert an old Gedcom 1.0 file (created by FHS). A few hours of googling brought me to your blog posts about "prehistoric" gedcom files. While no support for 1.0 is promised, I still want to try it out if it can help me.

Well, that was definitely interesting to me. I’ve done my part in the past to resurrect ancient GEDCOMs. In August 2014, I found what I thought was The World’s Oldest GEDCOM File? Tamura Jones confirmed for me that this was just a GEDCOM 2.0 file, and that there was still GEDCOM 1 before it. Tamura wrote an article about GEDCOM 1.0 and told me that he had a GEDCOM 1 file in his collection. It was a sample file that Phillip Brown created. Phillip Brown is the author of Family History System. He is the only programmer to have implemented the very first GEDCOM specs, and an earlier version of his program, Family History System, is the only program known to have exported GEDCOM 1. Later versions of FHS exported GEDCOM 2.0 and later.

I was a user of Family History System many years ago, first purchasing it in 1993. Like most genealogists, I never throw anything out, and I still had a hardcopy of the FHS user manual which had the GEDCOM 1 specs in it. I then wrote my article: From Ancient GEDCOM to Prehistoric GEDCOM, where I said:

Will I support GEDCOM 1.0 in Behold? Well I could. But I doubt if anyone has any files of that format lying around that they really need to extract the data from. Let me know if you do.

So I was very surprised by Martin’s claim of having a GEDCOM 1 file. I emailed Martin back and I said to him:

If you really have some GEDCOM 1.0 files, I’d love to see them. They are a rarity.

And if Behold doesn’t work right for them, then I can get it to.

Martin wrote back and told me the unbelievable. He said the file was created by his father Joachim in the 1990’s. It contains about 10,000 people that included not only his family, but the whole small German town where they were from. That data was eventually compiled into a book that contains town history and genealogy information, and Martin’s ancestors go back to the year 1658. The GEDCOM file was dated 2015, which Martin believes was when it was copied from his father’s old DOS computer. Martin attached a copy of the file for me.

It took me about 10 days to implement GEDCOM 1 reading in Behold. On January 10, I quietly released version 1.2.2 of Behold. I went back to the stable version of 1.2.1 (rather than using the 1.3 development version I’m nearing completion on) and added GEDCOM 1 support to it.

GEDCOM 1 uses 2 letter tags that are not separated from their level number, e.g. “0HH” is the header record, whereas “0 HEAD” is what that was changed to in later GEDCOM. Handling this was relatively simple. I originally mapped the two letter tags to their 3 or 4 letter equivalents, but there were too many that didn’t match, so I changed that so Behold would recognize the tags directly, as I do for the GEDCOM 2.0 tags.
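To make the line format concrete, here is a small Python sketch of splitting such a line into its level, tag and value. It is not Behold’s parser. It assumes the level is one or more digits, followed immediately by exactly two capital letters, with anything after that treated as the value; the GEDCOM 1 specs may allow things this simple pattern does not.

```python
import re

# Level digits, then a two-letter tag with no space in between, then the rest.
LINE_RE = re.compile(r"^(\d+)([A-Z]{2})\s*(.*)$")


def parse_gedcom1_line(line):
    """Split a GEDCOM 1 style line into (level, tag, value)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a recognised GEDCOM 1 line: {line!r}")
    level, tag, value = m.groups()
    return int(level), tag, value
```

For the header record, `parse_gedcom1_line("0HH")` gives `(0, "HH", "")`.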

I was hoping the family structure would be similar to GEDCOM 2.0 which connected siblings youngest to oldest together rather than listing children of families as later GEDCOM does. Yes it did connect the siblings, but of course it had to be oldest to youngest. And it did so via each parent, so for any person, you have the father’s next child and the mother’s next child. You also have the children pointing to their father and mother. So I had to custom build the conversion of this to the CHIL/FAMC connections in use today. That will allow this information to be exported to GEDCOM 5.5.1 once I add GEDCOM export to Behold (coming next, after version 1.3 is released).

What actually caused me the most trouble was that there were no BIRT, MARR, DIV or DEAT level 1 events. Instead there were BD (Birthdate), BP (Birthplace), MD, MP, etc. tags at level 1. They needed to be mapped to level 2 tags under a level 1 tag that I had to create. This was tricky as you have to wait to encounter the following tags before you create the earlier one. It is tough to do that efficiently in what is a sequential parser. But I found a solution that worked well enough.
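The regrouping can be sketched in Python as below. This is a simplified illustration and not the solution used in Behold: it collects a person’s level 1 tags first and then emits the event lines, rather than working strictly sequentially. The mapping of MD and MP to a MARR event with DATE and PLAC is an assumption based on the tag names, and tags not in the table are simply skipped here.

```python
# Assumed mapping from GEDCOM 1 level 1 tags to (event tag, level 2 tag).
EVENT_TAGS = {
    "BD": ("BIRT", "DATE"),
    "BP": ("BIRT", "PLAC"),
    "MD": ("MARR", "DATE"),
    "MP": ("MARR", "PLAC"),
}


def regroup_events(level1_items):
    """level1_items: list of (tag, value) pairs for one person.

    Returns later-GEDCOM style lines with each event created once at
    level 1 and its date/place placed beneath it at level 2.
    """
    events = {}  # event tag -> list of level 2 lines, in first-seen order
    for tag, value in level1_items:
        if tag not in EVENT_TAGS:
            continue  # other tags would be handled elsewhere
        event, subtag = EVENT_TAGS[tag]
        events.setdefault(event, []).append(f"2 {subtag} {value}")
    lines = []
    for event, sublines in events.items():
        lines.append(f"1 {event}")
        lines.extend(sublines)
    return lines
```

With the made-up input `[("BD", "1658"), ("BP", "Sometown")]` this gives the lines `1 BIRT`, `2 DATE 1658` and `2 PLAC Sometown`.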

So I was able to read this GEDCOM 1 input:

And display it in Behold like this:

The only new thing in Behold Version 1.2.2 is the ability to read GEDCOM 1 files. Unless your last name is Geldmacher, I doubt you’ll need to upgrade to this version.

I am amazed that the first non-example GEDCOM 1 file produced from a real genealogical research study is one so detailed and comprehensive. My congrats go to Martin’s father for such an effort. And I thank Martin for searching me out and allowing me to use his father’s file.

GEDCOM 1 Lives!

—

Note: Martin has given me permission to include his and his father’s name in this article. However, his GEDCOM 1 file has information about living people in it and I’ve promised him I wouldn’t share it.

