Louis Kessler’s Behold Blog

Double Match Triangulator 1.2 - Mon, 21 Nov 2016

I’ve gone through the 1.1.99 version of DMT I created about a month ago, have made some improvements and have now produced the 1.2 release. You can find and download the new version on the DMT page.

image

If you don’t have more than one Chromosome Browser Results (CBR) file from FamilyTreeDNA of your own to use, I’ve made up a set of sample files you can use to try DMT. It contains 5 Chromosome Browser Results (CBR) files you can download from the DMT page. These files are the same ones I used in the examples in DMT’s help file.

I’ve learned what my goal with DMT needs to be. It is to identify which ancestor provided Person a with each DNA segment on each chromosome half. It is possible to do this with match data, as Jim Bartlett has done this manually. I just have to figure out how to get DMT to do this automatically for you and save you the 6 years it took Jim to do it.

I wanted to get DMT version 1.2 out so that it could be the entry I submit to the RootsTech Innovator Showdown. Now that it’s ready, I just have to update my submission and then make a 1-minute YouTube video demonstrating the program. I have to complete that video before the December 1st entry deadline or DMT won’t qualify. I’ve never made a YouTube video before, but I’ve got a good idea of how to do it and I expect making it will be a lot of fun.

Please try DMT Version 1.2 and let me know what you think. And if you plan to go to RootsTech this year, please look for me, and put in a good word for DMT.

Getting DMT to work with GEDmatch segment matches - Thu, 3 Nov 2016

Over at the ISOGG Facebook group, Rich Capen asked if it would be possible for Double Match Triangulator to compare GEDmatch kits. Well, that spurred my interest to see if I could.

My uncle’s Chromosome Browser Results (CBR) file at FamilyTreeDNA is very big because he is Ashkenazi. I just downloaded a new one so I can compare with GEDmatch. The first 3 attempts failed partway through and didn’t download the whole file. The 4th finally worked. After selecting the Download All Matches link at FamilyTreeDNA, I had to wait 30 seconds or so before anything would happen. Then the browser’s Save File box would pop up. Patience is a necessity here. My uncle currently matches with 9,288 people at FamilyTreeDNA. (Still, only one of them is a confirmed relative.) The Chromosome Browser Results file that was downloaded contains 203,035 individual segment matches for these people, an average of 22 segment matches per person.

The CBR file looks like this:

image

I then went over and gave a $10 donation to GEDmatch (which is less than what they should charge) so that I could get access to their Tier 1 utilities to get the segment matches that they found:

image

I entered my uncle’s GEDmatch kit number. FamilyTreeDNA downloads all segment matches down to 1 cM. GEDmatch has a default minimum of 7 cM. I entered the minimum GEDmatch would allow (5 cM and 500 SNPs) and I pressed the submit button.

GEDmatch showed a progress screen and after a couple of minutes finished processing and displayed the results in the browser window. It said that matching segments will be identified with the closest 5,385 matching kits.

image

Right off the bat, I do like the Sex column, because if it is specified, then I don’t have to use the segments to guess at the sex (I’m not sure if that’s even possible yet, but it’s something I will eventually want to do).

At the bottom of the file, it said: “Total 10000 segments”. That got me worried since it’s such a round number. Does GEDmatch have a limit of 10,000 segments that it will give you?

image

Well, the listing does go down to Chromosome 22, location 49,528,625. The FamilyTreeDNA CBR file goes to Chromosome 22, location 45,772,802. I would guess that GEDmatch is listing everything up to Chromosome 22 and maybe the result of getting exactly 10,000 was just a fluke.

However, the Chromosome 23 (X matches) are not included in the GEDmatch listing. It ends at 22. That’s not good. FamilyTreeDNA’s CBR file includes Chromosome X.

Okay then. Let’s put the GEDmatch data into a spreadsheet. GEDmatch does not download a file. The only way to do it is to select all the data on the browser page, copy it to the clipboard, and paste it into Excel. But 10,000 rows in my browser was too many to select. It wouldn’t copy to the clipboard. I found if I selected a few chromosomes at a time, it would copy and paste, so I did it in about 8 steps, and eventually loaded the spreadsheet with the 10,000 match lines and the one header line.

Next step: Let’s see if we can compare the data in the two files. Is it compatible? To do so, I need to only include GEDmatch data that’s from FamilyTreeDNA.

The 10,000 matches are for 1,853 kits. That means there’s only an average of 5 segment matches per kit (person) and that’s because of the 5 cM minimum match length versus FamilyTreeDNA’s 1 cM minimum which gives 22 segments per match.

Of the kits, 562 are prefixed by A, meaning they’re from AncestryDNA so I’ll get rid of those 3,193 matches. 559 are prefixed by M, meaning they’re from 23andMe, so I’ll get rid of those 2,466 matches. There are 3 kits prefixed with W having 3 matches and 32 kits prefixed by Z with 174 matches. I’m not exactly sure what those are, but they’re not FamilyTreeDNA, so I’ll get rid of them too.

That leaves me with 698 matching kits prefixed by T which are the FamilyTreeDNA kits. They total 4,164 matches and I’ll keep them.
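The prefix tally above could also be done with a few lines of code instead of spreadsheet filters. This is only a minimal sketch: it assumes you have the kit number from each pasted segment row as a string, and the function names are made up for illustration. The prefix meanings are the ones described in this post (W and Z remain unidentified).

```python
from collections import defaultdict

# Prefix meanings as described in the post; W and Z are not identified there.
PREFIX_SOURCE = {"A": "AncestryDNA", "M": "23andMe", "T": "FamilyTreeDNA"}


def tally_by_prefix(kit_numbers):
    """Count distinct kits and segment rows for each kit-number prefix.

    kit_numbers: one kit number per segment row (repeats are expected,
    since a kit usually has several matching segments).
    Returns {prefix: (distinct_kits, segment_rows)}.
    """
    kits = defaultdict(set)
    rows = defaultdict(int)
    for kit in kit_numbers:
        prefix = kit[:1].upper()
        kits[prefix].add(kit)
        rows[prefix] += 1
    return {prefix: (len(kits[prefix]), rows[prefix]) for prefix in kits}


def keep_ftdna(kit_numbers):
    """Keep only the rows whose kit number starts with T (FamilyTreeDNA)."""
    return [kit for kit in kit_numbers if kit[:1].upper() == "T"]
```

Running `tally_by_prefix` over the kit column gives the kit and segment counts per testing company in one pass, and `keep_ftdna` drops everything that is not a T kit.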

Now for the next problem. The person’s name listed in the GEDmatch data is quite often not the same as the person listed in the FamilyTreeDNA CBR file. The CBR file does not have an email address in it, so that can’t be used for verification. But there are some that match. I can look at those.

665 names in GEDmatch were not matches in my FamilyTreeDNA file. That meant that only 34 names were the same. They have 208 matching segments in GEDmatch (down to 5 cM) and 753 matching segments in FamilyTreeDNA (down to 1 cM).

Now it becomes clear. The bad news. GEDmatch and FamilyTreeDNA do not give the same Start and End locations for the matching segments, nor are the cM or SNPs the same. See a comparison of the first person below. I’ve highlighted the matching segments.

image

I am very disappointed. What this means is that GEDmatch results cannot be combined with FamilyTreeDNA matches. Double Match Triangulator will not be able to use them both together.

But would it be possible for me to modify Double Match Triangulator to work with GEDmatch matches? Well, technically yes, I could. But to me it’s not worth the effort. There are too many things to check to make sure the file is correct.

 

So Can DMT be Used with GEDmatch data?

Yes. If you want to use DMT with your GEDmatch Tier 1 segment matches, then this is what you can do. It should only take you 5 minutes or so if you know Excel well enough:

  1. Login to GEDmatch. You will need to be subscribed to the Tier 1 utilities. If you are not, you can subscribe for a month for $10.
  2. In the Tier 1 Utilities, click on “Matching Segment Search”. Select “No” for “Show graphic bar for Chromosome?”, and either submit that with the default settings, or lower SNP to 500 and cM to 5 (their minimums).
  3. Copy the displayed GEDmatch match table with the table headings, and paste them into a spreadsheet like columns A to I in the diagram above. Note: Internet Explorer and Microsoft Edge may be unable to copy the table if it is very large. Google Chrome, Firefox and Safari seem to work for any size.
  4. In the spreadsheet you created, in columns K to Q, row 1, put the column headings that a CBR file has, which are:
    NAME
    MATCHNAME
    CHROMOSOME
    START LOCATION
    END LOCATION
    CENTIMORGANS
    MATCHING SNPS
  5. In column K, put the name of the Kit owner
  6. Copy the Name Column G to the Matchname Column L
  7. Copy Chromosome, Start and End Locations, cM and SNPs from Column B to F and use “Paste Values” to put them into columns M to Q. That will get rid of the comma format. Alternatively, you can format those columns as “General” which will get rid of the commas.
  8. Delete columns A to J.
  9. Save the file as a CSV (Comma Delimited) file and give it a file name typical for a CBR file starting with the Kit number, e.g.:  A123456_Chromosome_Browser_Results_20161103.csv

DMT should be able to input that file.
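If you would rather not do the conversion in Excel, the same steps could be scripted. The following is a rough sketch, not something DMT provides. It assumes the table copied from the browser is tab-separated text with one heading row, and that the columns are in the positions described in the steps above (B to F for chromosome, start, end, cM and SNPs, and G for the match name). The function name `gedmatch_to_cbr` is hypothetical, and whether DMT accepts exactly this CSV quoting is untested.

```python
import csv
import io

# Column headings a CBR file has, as listed in step 4.
CBR_HEADER = ["NAME", "MATCHNAME", "CHROMOSOME", "START LOCATION",
              "END LOCATION", "CENTIMORGANS", "MATCHING SNPS"]


def gedmatch_to_cbr(pasted_text, owner_name):
    """Convert a pasted GEDmatch segment table to CBR-style CSV text.

    pasted_text: tab-separated table including its heading row (assumption).
    owner_name:  name of the kit owner, written into the NAME column.
    """
    reader = csv.reader(io.StringIO(pasted_text), delimiter="\t")
    next(reader, None)  # skip the GEDmatch heading row
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(CBR_HEADER)
    for row in reader:
        if len(row) < 7:
            continue  # ignore blank or short lines such as a totals footer
        # Columns B to F: strip the thousands separators (step 7).
        values = [cell.replace(",", "").strip() for cell in row[1:6]]
        # Column G is the match name (step 6).
        writer.writerow([owner_name, row[6].strip()] + values)
    return out.getvalue()
```

You would then write the returned text to a file named like a CBR file, for example `A123456_Chromosome_Browser_Results_20161103.csv`.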

Of course, what you now need is a second one of those files from another kit at GEDmatch. GEDmatch lets you work with any kit that is visible to you, so all you need is the kit number.

Note that GEDmatch’s Tier 1 segment matches exclude people who match more than 2100 cM. They say they are doing this so as not to obscure the matches that you’re really looking for. This means matches with parents, children, and siblings are left out of the Tier 1 match results. If you want them included, you’ll have to run GEDmatch’s one-to-one match and manually add their matches to the file.

 

One Little Trick

If you make a copy of the file, you can run DMT using the file for Person a and the copy as Person b. Every segment match will double match, but you’ll end up with a nice Map of the matches and a People page listing all your matches.

 

You will still be stuck with the GEDmatch limitations

  • Minimum 5 cM, 500 SNP matches
  • Possibly 10,000 match limit.
  • Possibly no X matches.
  • Cannot mix and match GEDmatch and FamilyTreeDNA CBR files.

But it should still give you lots of new double matches to keep you very busy.

 


Update: Dave Sherry on the ISOGG Facebook page said that FamilyTreeDNA uses Build 37 and GEDmatch uses Build 36, hence the differences. GEDmatch will, at some time in the future, have to convert to Build 37 for the locations to be the same. Or maybe there is a utility out there to convert base addresses from Build 36 to Build 37.


Update Apr 28, 2017: I improved the steps (now 9 of them) to convert the GEDmatch data to work with DMT. I’m going to try to add direct support for the GEDmatch format in the next version of DMT so that this manual conversion will no longer be necessary. I also want to do the same for 23andMe match files.

Double Match Triangulator 1.1.99 - Tue, 18 Oct 2016

I’ve released a beta of what will be version 1.2 of DMT.

It has all the new functionality. I just haven’t updated the help file yet. Once the help file is updated, I’ll re-release it as 1.2.

With the deadline for the RootsTech Innovator Showdown coming up, I wanted to get some major enhancements in. There are a number of exciting changes.

 

1. Double Match and Triangulation Groups

DMT now will group the matching segments together based on the crossovers between them. Each group will be delineated by a thick box around them.

image

The idea came from the two new posts by Jim Bartlett on his Segmentology blog: Understanding and Using TGs, followed the next day by The Attributes of a TG. In the latter article, in section A6, Jim explains how he identifies crossover points and the 6 steps to find the end location of his Triangulation Groups.

I thought about this, and worked to implement something similar. Believe me when I say I must have changed the procedure a dozen times from my original idea before I got it working. I had to translate what Jim was saying to the data structures of my program.

Along the way, I learned a lot, including the following:

  1. Jim works with single matches that he is manually triangulating. That’s an admirable mass of work and he’s successfully mapped the majority of his segments. He talks about the ends being fuzzy. That is true for single matches because there are often some random matches at the ends. Although I have no proof, my observations indicate that the double match ends seem to be mostly precise. The random ends just don’t seem to be there. Crossover base addresses look very precise. All the single matches with fuzzy ends are excluded.
  2. I realized from Jim’s work that the goal is to make a map of all the ancestors of Person a. You need the b people to do the double matching, but the goal is to map out a single person. Therefore, it is only the green X’s (double matches) and red a’s that should be used to define the Triangulation Groups.
  3. I needed an algorithmic way to determine the start and end of each Triangulation Group. My DMT program couldn’t use “judgement” because there’s “no hard rule”. So I developed a rule, the rule being: If the next segment starts at or after the end of the previous segment, then end the last Triangulation Group and start the next Triangulation Group.
  4. To make that rule work, I had to sort the segments first by the lowest starting base address, and then by the highest ending base address. You can see in the diagram above, that the triangulation groups get smaller as you go down the page, and then when the start point changes, they start again. Larger TGs are closer ancestors that are made up of smaller TG’s that are segments from ancestors of those ancestors.
  5. I have learned that Double Match segments that don’t Triangulate are as valuable as Double Match segments that do Triangulate. I call them Missing a-b Matches and I wrote about them in an earlier blog post. Triangulated segments belong to a common ancestor. Missing a-b matches may match from both halves of an ancestor’s chromosome, or may match from the same segment address of two ancestral parents (a father and mother). The Double match means that Person a matches Person c and Person b matches Person c. The likelihood of two matches to the same person, both by chance is severely reduced, especially since these are people in common match lists. I believe Jim Bartlett’s suggestion that Triangulated segments are mostly IBD down to 5 cM likely also applies to Double Match groups without the a-b match.
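The sort order and grouping rule from points 3 and 4 can be sketched in a few lines. This is a literal reading of the rule as worded above, not DMT’s actual code: each segment is compared with the end of the segment just before it in sorted order. The function name `group_segments` and the (start, end) tuple layout are assumptions made for the example.

```python
def group_segments(segments):
    """Split one chromosome's segments into groups using the rule above.

    segments: list of (start, end) base addresses on a single chromosome.
    Returns a list of groups, each a list of (start, end) tuples.
    """
    # Sort by lowest start first, then by highest end (point 4).
    ordered = sorted(segments, key=lambda seg: (seg[0], -seg[1]))

    groups = []
    prev_end = None
    for start, end in ordered:
        # Point 3: if this segment starts at or after the end of the
        # previous segment, close the last group and start a new one.
        if prev_end is None or start >= prev_end:
            groups.append([])
        groups[-1].append((start, end))
        prev_end = end
    return groups
```

For example, `group_segments([(100, 500), (100, 300), (200, 400), (500, 900), (600, 800)])` puts the first three segments in one group and the last two in a second group, because 500 is at or after the end (400) of the segment before it.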

Once I got this going, I saw that I could improve the information shown in the Map file. In the above diagram, each row is one double match. The columns are:

  • Person a – the main person you are matching to
  • Person b – the person you are double matching with
  • Person c – the person who double matches both a and b on the segment
  • Chr – the chromosome number. 23 = the X chromosome.
  • Start-AC – the base address where the a-c match starts
  • End-AC – the base address where the a-c match ends
  • CM-AC – the distance the a-c match covers in cM
  • Start-BC – the base address where the b-c match starts
  • End-BC – the base address where the b-c match ends
  • CM-BC – the distance the b-c match covers in cM
  • DMG-START – the base address where the Double Match Group starts
  • DMG-END – the base address where the Double Match Group ends
  • DMGROUP – Jim Bartlett’s name for the Double Match Group
  • STATUS – tells you if that segment triangulates

Once again, Double Match Groups (DMG) are segments where a number of people have double matches and indicate common ancestors. If a Double Match Group also has the a-b match, then it is also a Triangulation Group (TG) and indicates a common ancestral segment on the same half of the chromosome.
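To make the distinction concrete, here is a small sketch of how a single double match might be checked for triangulation. It is my simplified interpretation for illustration only, not how DMT is implemented: it treats a double match as the overlap of the a-c and b-c segments on the same chromosome, and calls it triangulated when some a-b segment also overlaps that shared region. The function names and status strings are hypothetical.

```python
def overlap(seg1, seg2):
    """Return the (start, end) region shared by two segments, or None."""
    start = max(seg1[0], seg2[0])
    end = min(seg1[1], seg2[1])
    return (start, end) if start < end else None


def classify(ac, bc, ab_segments):
    """Classify one a-c / b-c pair on a single chromosome.

    ac, bc:       (start, end) of the a-c and b-c matches.
    ab_segments:  list of (start, end) a-b matches on the same chromosome.
    Returns None if there is no double match, otherwise a status string.
    """
    shared = overlap(ac, bc)
    if shared is None:
        return None  # a and b do not both match c on a common region
    for ab in ab_segments:
        if overlap(shared, ab) is not None:
            return "triangulates"
    return "missing a-b"
```

So `classify((100, 500), (300, 700), [(250, 450)])` reports a triangulated segment, while the same pair with no a-b segments is a double match with a missing a-b match.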

 

2. Analysis By Chromosome

So that was the major improvement for version 1.2 of DMT. But that’s not all. I’ve added another checkbox to the main window:

image

If you check the Analysis By Chromosome box, then DMT will combine every individual run between Person a and all the b people that are in Folder b. It will then place the results into 23 files, one for each Chromosome.

The main reason why the chromosomes are split up is because the files can become very large. In my case, I have the Chromosome Browser Download files from 61 related people that I combine with my uncle’s. It takes DMT about 5 minutes to process all the files and produce the 23 chromosome map files. It’s an amazing amount of information that I believe may unlock all the secrets we’re looking for … if we can figure out how to do so. I’ll be trying. And if you figure something out, let me know, and maybe I can program it in for everyone to use.

For example, I’m sure there is enough information in the crossover base addresses for DMT to split the Double Match Groups into the two sides: maternal and paternal. That will be something I’ll be looking at to figure out how to do for version 1.3 and I’ll likely call it Double Match Filtering.

 

3. Improved People Page/File

Along with the 23 map files the Analysis By Chromosome produces, a combined People file is also produced. It looks like this:

image

This file will help you locate the relevant segments for the people you are interested in, and determine which chromosome file they’ll be in. All of Person a’s Person c matches are listed by highest total cM. The maximum segment length of each a-c match for each Chromosome is shown. That is followed by one column for every Person b so you can determine all the people Person c double matches to.

 

Like I say, I’m still trying to figure all this out myself. But feel free to try it out.

If you notice any problems or have any ideas, please let me know.

—-

Note: Version 1.1.99.1 corrects a problem where not all Triangulated segments were identified. If you downloaded 1.1.99 between October 18 and October 20, please download 1.1.99.1 from the link at the top of this post.