Louis Kessler’s Behold Blog

The Life and Death of a DNA Segment - Mon, 19 Aug 2019

There’s a bad rumor going around that segment matches, especially for small segments, can be very old. I’ve heard expectations that the segment might come from a common ancestor 20 generations back or even 30, 40 or more. And that’s said to happen even if you have a fairly large 15 cM segment.

Part of this is due to the incorrect thinking that a segment of your DNA has been around forever and has been passed down from some ancient ancestor to you and to just about everyone else. Since there is only a 1/2 chance that each generation gets the segment from the right parent, the argument is that this is offset by the more than 2 children per generation who keep the segment alive all the way down to two 30th or 40th generation descendants who then happen to share the segment. That also assumes there is no intervening ancestor along some other path who is more recent than that 30th generation one. For endogamy, the argument is that the segment has proliferated through the population and most of its members happen to have it. Although in that case, I find it hard to believe that there is not a line to a different common ancestor who is fewer than 30 generations back.

The fallacy here is that all our DNA segments are ancient. They are not. In fact, many of them are quite recent, only a few generations old.

Let’s take a look at, say, a 15 cM segment that you got from your father. You could have:

1. Got the whole segment from your father’s father’s chromosome,

2. Got the whole segment from your father’s mother’s chromosome, or

3. There could have been a recombination that occurred somewhere along the 15 cM segment and you got part of it from your father’s father and part from your father’s mother.

It is case number 3 that is interesting. In this case, that 15 cM segment is no longer the same as your father’s father’s segment, nor is it the same as your father’s mother’s segment. It is a new segment that has been born in you and you are the first ancestor to have that segment and maybe you’ll pass it down to many of your descendants. And no one else will have that segment that you have, unless some random miracle as rare as a lottery winning happens.

Also, neither your father’s father’s segment at this location nor your father’s mother’s segment is passed down to you intact. Maybe they’ll be passed to a sibling of yours or maybe they won’t. But both of your grandparents’ segments have died along your line.

So what actually happens is that any segment of your DNA has its birth in one of your ancestors. That ancestor may pass it down to zero or more descendants, and if it is passed down, each descendant may or may not continue to pass it down. The segment eventually dies. A recombination on the segment can’t be avoided forever.

Now what is the probability of a new 15 cM segment being “born” in you? Well, that’s what cM represents and there will be about a 15% chance that any particular 15 cM segment of your DNA was formed from a recombination in your parent, and that you have a brand new segment. For most purposes, using the cM as a percentage is close enough. But for more accuracy, I’ll use the actual probability from the equation P(recomb) = 1 – exp(-cM/100) which gives 13.9%. (See my Update Jan 26, 2020 about this equation)

Well guess what? The probability that any particular 15 cM segment is born in any of your ancestors is also 13.9%. The chance that the segment was not born, but was passed down is therefore 86%. We can use that fact to now calculate the probability that this segment was passed down any number of generations to some descendant:

image

What this says is that if you have a 15 cM segment, then there is about a 50% chance that it was created in one of the last 5 generations, a 75% chance that it was created in one of the last 9 generations, and 95% chance that it was created in one of the last 20 generations. The average age of segments that size is 7.2 generations (1 / 13.9%). This is very simple mathematics/statistics.
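The numbers above can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic described in the text, not the spreadsheet behind the figure; the function names (`p_recomb`, `p_born_within`, `mean_age`) are made up for illustration.

```python
import math

def p_recomb(cm):
    """Chance that a segment of `cm` centimorgans is hit by at least one
    crossover in a single generation: 1 - exp(-cM/100)."""
    return 1.0 - math.exp(-cm / 100.0)

def p_born_within(cm, generations):
    """Chance that the segment was created in one of the last
    `generations` generations, treating each generation independently."""
    return 1.0 - (1.0 - p_recomb(cm)) ** generations

def mean_age(cm):
    """Average age in generations, taken as 1 / P(recomb) as in the text."""
    return 1.0 / p_recomb(cm)

print(f"P(recomb) for 15 cM: {p_recomb(15):.1%}")
for g in (5, 9, 20):
    print(f"born within last {g:>2} generations: {p_born_within(15, g):.1%}")
print(f"average age: {mean_age(15):.1f} generations")
```

For a 15 cM segment this prints 13.9% for P(recomb), then 52.8%, 74.1% and 95.0% for 5, 9 and 20 generations, and an average age of 7.2 generations.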

If you match with another person on the same segment, then they have the same probabilities. The chance both of you got this segment from more than 20 generations back would be only 5% x 5% = 0.25%.


Revisiting Speed and Balding Once Again

I’m still frustrated that Speed and Balding’s simulation results are being used without question to estimate segment age for human DNA segment matches.

About two years ago, I used two different sets of calculations, one my own in Revisiting Speed and Balding, and one based on work by Bob Jenkins in Another Estimate of Speed and Balding Figure 2B. In both cases, I found segment age estimates that were somewhat less than Speed and Balding.

Let’s see how my Segment Life estimates compare. Picking a few different segment sizes and calculating their values gives:

image

And then let’s plot these in a stacked chart:

image

Look at the gray area at the top left. That’s the probability of segments of the given segment size being 20 or more generations old. The green bar is the divider at 10 generations. You likely have a good chance to identify how you’re related to segment matches that are under the green bar, indicating that most segments over 15 cM should be identifiable and that even very small segments might be identifiable.
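To get a feel for how the chart is built, here is a rough Python sketch that splits each segment size into three age bands. The segment sizes in the loop are my own arbitrary picks, not the ones in the table above, and the band boundaries (created within the last 10 generations, within generations 11 to 20, or more than 20 generations ago) are an assumption about how the chart divides things. The helper names are hypothetical.

```python
import math

def survival(cm, generations):
    """Chance that a segment of `cm` centimorgans passes through
    `generations` generations without a crossover: exp(-cM * g / 100)."""
    return math.exp(-cm * generations / 100.0)

def age_bands(cm):
    """Return (within 10, 11 to 20, over 20) generation probabilities."""
    within_10 = 1.0 - survival(cm, 10)
    over_20 = survival(cm, 20)
    between = survival(cm, 10) - survival(cm, 20)
    return within_10, between, over_20

print(" size   <=10   11-20    >20")
for cm in (5, 7, 10, 15, 20, 30):   # illustrative sizes only
    a, b, c = age_bands(cm)
    print(f"{cm:>3} cM {a:6.1%} {b:7.1%} {c:6.1%}")
```

The three values in each row always add up to 100%, which is why they stack neatly in a chart like the one above.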

Compare this to Speed and Balding:

Speed and Balding give much larger chance of older segments than does my segment life methodology, or than do either of the two analyses in my earlier blog posts.


Conclusion

Segments aren’t passed down from ancient times. They are created and die all the time due to recombination events, and they may not be as old as you are led to believe. Some of your smaller matching segments, e.g., between 5 and 15 cM, have (by my segment life and other earlier calculations) a 40% to 70% chance of originating less than 10 generations ago. This means you might be able to determine how you’re related to your match.

By using triangulation techniques (such as Double Match Triangulator), you can determine triangulations of segments in the 5 to 15 cM range which will eliminate most by-chance matches. You can then put your segment matches into Triangulation Groups, to help find the common ancestor of the group and connect your DNA matches to your tree.




Update Jan 26, 2020:  After discussion with Celia Baitinger on the Facebook Genetic Genealogy Tips and Techniques group, we realized that the Wikipedia equation P(recomb) = (1 – exp(-2 * cM / 100)) / 2 may only be for recombinations that involve an odd number of crossovers. For genetic genealogy, we are interested in all crossover events. As a result, the correct analysis should be this:

Assuming a Poisson distribution for crossovers (which is what is usually assumed), then the P(zero recombs) when the mean is cM/100 is: exp(-cM/100), and therefore:

P(recomb) = 1 - exp(-cM/100)

I have updated the figures in the above article to reflect this correction. No changes were significant enough to affect any of my observations or conclusion.
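To see how much the two equations in this update differ, here is a small Python comparison. The first function is the Wikipedia form (known as Haldane's map function, which counts only an odd number of crossovers) and the second is the Poisson "at least one crossover" form used in the article. The function names are my own.

```python
import math

def p_odd_crossovers(cm):
    """Wikipedia / Haldane form: chance of an odd number of crossovers."""
    return (1.0 - math.exp(-2.0 * cm / 100.0)) / 2.0

def p_any_crossover(cm):
    """Poisson form: chance of one or more crossovers, 1 - P(zero)."""
    return 1.0 - math.exp(-cm / 100.0)

for cm in (5, 15, 30):
    print(f"{cm:>2} cM  odd only: {p_odd_crossovers(cm):.1%}"
          f"  any: {p_any_crossover(cm):.1%}")
```

At 15 cM the odd-only form gives about 13.0% versus 13.9% for any crossover, so the gap is small for segments of this size and widens as segments get longer.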

50 Years, Travelling Salesman, Python, 6 Hours - Wed, 7 Aug 2019

This is my first blog post in over 2 months. The reason is that I have been working very hard trying to finish Version 3 of Double Match Triangulator. Everything I’ve been doing with it is experimental, and there’s no model to follow. So it’s tough to get it just right. I had already started the documentation of the new version when I diverted to get some sample data from some people who had done Visual Phasing (VP) with 3 or more siblings, because I was thinking that this version of DMT should be able to use segment matches to get most of the same grandparent assignments that VP does. I’ve made progress but am still not finished with that.

But this morning, I was sparked programmatically by an annual event that happens where I live in Winnipeg. Folklorama is a two-week festival that celebrates the multiculturalism in our city.

image

“Pavilions” are set up in various venues (arenas, churches, community centres) to showcase a particular country/culture. Each pavilion has a stage performance, cultural displays, and serves authentic ethnic food and drink.

This is the 50th year of Folklorama, so I remember it from when I was a kid. The 40 pavilions were something that I always wanted to do a bike tour of, as they were spread all over our city. Being interested in mathematics, I was curious about how to optimize my route and use the shortest possible route to bike to all of the pavilions.

But 50 years ago was well before we had personal computers or the internet. And route traversal problems, especially this one which was known as the Travelling Salesman problem, were computationally difficult to solve back then, even on the mainframe computers at the time.

This year’s version of Folklorama got me thinking: Maybe the problem is easily solvable today. I took a look online and was very surprised by what I found. There is a Google Developers site that I didn’t know about.

image

And at that site, they had all sorts of OR-Tools.  OR stands for Operations Research which is the name of the field that deals with analytical methods to make better decisions. The Traveling Salesman problem is in that field and has its own page at Google Developers:

image

Not only that, but they explain the algorithms and present the programs in four different programming languages:  Python, C++, Java and C#.

Now, I’m a Delphi developer, and I use Delphi for development of Behold and Double Match Triangulator. I’ve never used any of the four programming languages given. But I’ve been looking for a quick, easy-to-program language to use for smaller tasks such as analysis of raw data files from DNA tests, or even analysis of the huge 100 GB BAM files from my Whole Genome Sequencing test.

Over the last year or so, I had been looking with interest at the language Python (which is not named after the snake but is named after Monty Python’s Flying Circus). Python has been moving up in popularity because it is a new, fast, interpreted, concise, powerful, extensible and free language that can do just about anything and even do a Hello World in just one line. It sort of reminds me of APL (but without the Greek letters), which was my favorite programming language when I was in University.

Well what better time to try Python than now to see if I can run that Travelling Salesman problem.

So this morning I installed the Windows version of Python on my computer. It normally runs from a command prompt, but it comes with a development environment called IDLE that makes it easier to use.

It didn’t take me too long to go through the first few topics of the Tutorial and learn the basics of the language.  I threw in the Traveling Salesman code and sample data from the Google Developers site, and I got an error. The Python ortools package was missing. It took me about an hour to figure out how to use the Python PIP (package manager) to add ortools. Once I did, the code ran like a charm.

Fantastic. Now, can I use it for my own purpose? First, I had the map of all the Pavilion locations:

image

There were 22 pavilions in week 1, of which 4 were at our Convention Centre downtown, so in effect there were 19 locations, plus my home where I would start and end, so 20 in total.

Now how to find the distances between each pavilion?  Well, that’s a fairly simple and fun thing to do. You can do it on Google Maps by selecting the start and end address. Choosing the bicycle icon, it would show me possible routes and the amount of time it would take to bike them.

For instance, to go from the Celtic Ireland Pavilion to the Egyptian Pavilion, Google Maps suggested 3 possible bike routes taking 44 minutes, 53 minutes or 47 minutes. I would choose the quickest one, so I’d take the 44 minute route.

image

Now it was just a matter of using Google Maps to find the time between each of the 20 locations. That’s 20 x 19 / 2 = 190 combinations!  Google Maps does have a Google Distance Matrix API to do it programmatically, but I figured doing this manually once would take less time than figuring out the API. And besides, I liked seeing the routes that Google Maps was picking for me. Google Maps did remember my last entries, so I only had to enter the street number to change the starting or ending location. It wouldn’t take that long.

At 1 p.m. was the Legacy Family Tree webinar that I was registered for: “Case Studies in Gray: Identifying Shared Ancestries Through DNA and Genealogy” by Nicka Smith.
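The 190 figure, and the way those hand-entered times fill out a full table, can be sketched in Python like this. It assumes the bike time from A to B is the same as from B to A (which is what dividing by 2 implies), and `build_matrix` is a hypothetical helper, not something from the Google sample code.

```python
from itertools import combinations

def build_matrix(n, minutes):
    """Expand {(i, j): minutes} entries with i < j into a full symmetric
    n x n table, assuming the time is the same in both directions."""
    matrix = [[0] * n for _ in range(n)]
    for (i, j), t in minutes.items():
        matrix[i][j] = t
        matrix[j][i] = t
    return matrix

# Every unordered pair of the 20 locations needs one lookup.
pairs = list(combinations(range(20), 2))
print(len(pairs))   # 190

# Tiny made-up example with 3 locations and 3 lookups.
print(build_matrix(3, {(0, 1): 5, (0, 2): 7, (1, 2): 3}))
```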

image

It was a fantastic webinar. Nicka is a great speaker.

And while I had the webinar on my right monitor, I was Google mapping my 190 combinations on my left monitor and entering them into my Python data set:

image

I finished my data entry just about when the webinar ended at 2:30 pm CST.

Next, I ran the program with my own data, and literally in the blink of an eye, the program spewed out the optimal bike route:

image
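For anyone curious what solving a problem like this looks like without any extra packages, below is a small pure-Python sketch. To be clear, this is not the Google OR-Tools code I ran. It is the classic Held-Karp dynamic programming method, which finds the exact shortest round trip but only scales to a modest number of stops. The 4-location time table is made up for illustration and `shortest_tour` is a hypothetical name.

```python
from itertools import combinations

def shortest_tour(dist):
    """Exact shortest round trip from location 0 through every other
    location and back, using Held-Karp dynamic programming."""
    n = len(dist)
    if n == 1:
        return 0, [0, 0]
    # best[(mask, j)] = (cost, previous stop) of the cheapest path that
    # leaves 0, visits exactly the stops in bitmask `mask`, and ends at j.
    best = {(1 << j, j): (dist[0][j], 0) for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            mask = sum(1 << j for j in subset)
            for j in subset:
                prev_mask = mask ^ (1 << j)
                best[(mask, j)] = min(
                    (best[(prev_mask, k)][0] + dist[k][j], k)
                    for k in subset if k != j
                )
    full = (1 << n) - 2            # every stop except home (0)
    cost, last = min((best[(full, j)][0] + dist[j][0], j)
                     for j in range(1, n))
    # Walk the stored "previous stop" links back to home.
    route, mask = [0], full
    while last != 0:
        route.append(last)
        prev = best[(mask, last)][1]
        mask ^= 1 << last
        last = prev
    route.append(0)
    route.reverse()
    return cost, route

# Made-up symmetric bike times in minutes; location 0 is home.
times = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
total, route = shortest_tour(times)
print(total, route)   # total is 80 minutes for this table
```

The work grows exponentially with the number of stops, so for larger problems a purpose-built solver is the more practical choice.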

After 50 years of wanting to one day do this, it took only 6 hours to install and use a new language for the first time, enter 190 routes onto Google Maps, load the data, find my answer, and enjoy a wonderful webinar.

So tomorrow, it will be back to working on version 3 of DMT in the morning, followed by what should be a very pleasant 4 hour (247 minute) afternoon bike ride to all 22 week 1 Folklorama pavilions along the optimal route.

image

And maybe next week, I’ll do the same for the week 2 pavilions.

Finally, Interesting Possibilities to Sync Your Data - Fri, 17 May 2019

Although I don’t use Family Tree Maker (FTM), per se, I am very interested in its capabilities and syncing abilities. FTM along with RootsMagic are the only two programs that Ancestry have allowed to use the API that gives them access to the Ancestry.com online family trees. Therefore they are the only two programs that can directly download data from, upload data to, and sync between your family tree files on your computer and up at Ancestry.


RootsMagic

RootsMagic currently has its TreeShare function to share the data between what you have in RootsMagic on your computer, and what you have on Ancestry. It will compare for you and show you what’s different. But it will not sync them for you. You’ll have to do that manually in RootsMagic, one person at a time using the differences.

image

That is likely because RootsMagic doesn’t know which data is the data you’ve most recently updated and wants you to verify any changes either way. That is a good idea, but if you are only making changes on RootsMagic, you’ll want everything uploaded and synced to Ancestry. If you are only making changes on Ancestry, you’ll want everything downloaded and synced to RootsMagic.

With regard to FamilySearch, RootsMagic does a very similar thing. So basically, you can match your RootsMagic records to FamilySearch and sync them one at a time, and then do the same with Ancestry. But you can’t do them all at once or sync Ancestry and FamilySearch with each other.

With regards to MyHeritage, RootsMagic only incorporates their hints, and not their actual tree data.


Family Tree Maker

Family Tree Maker takes the sync with Ancestry a bit further than RootsMagic, offering full sync capabilities up and down.

image

For FamilySearch, FTM up to now only incorporates their hints and allows merging of FamilySearch data into your FTM data, again one person at a time. But Family Tree Maker has just announced their latest upgrade, and they include some new FamilySearch functionality.

What looks very interesting among their upcoming features that I’ll want to try is their “download a branch from the FamilySearch Family Tree”. This seems to be an ability to bring in new people, many at a time, from FamilySearch into your tree.


Family Tree Builder

MyHeritage’s free Family Tree Builder download already has full syncing with MyHeritage’s online family trees.

image

They do not have any integration with their own Geni one-world tree, which is too bad.

But in March, MyHeritage announced a new FamilySearch Tree Sync (beta) which allows FamilySearch users to synchronize their family trees with MyHeritage. Unfortunately, I was not allowed to join the beta and test it out, as currently only members of the Church of Jesus Christ of Latter-day Saints are allowed. Hopefully they’ll remove that restriction in the future, or at least when the beta is completed.


Slowly … Too Slowly

So you can see that progress is being made. We have three different software programs and three different online sites that are slowly adding some syncing capabilities. Unfortunately, they are not doing it the same way, and working with your data on the 6 offline and online platforms is different under each system.

The very promising Ancestor Sync program was one of the entrants in the RootsTech 2012 Developer Challenge along with Behold. I thought Ancestor Sync should have won the competition. Dovy Paukstys, the mastermind behind the program, had great ideas for it. It was going to be the program that would sync all your data with whatever desktop program you used and all your online data at Ancestry, FamilySearch, MyHeritage, Geni and wherever else. And it would do it with very simple functionality. Wow.

This was the AncestorSync website front page in 2013 retrieved from archive.org.
image

They had made quite a bit of progress. Here is what they were supporting by 2013 (checkmarks) and what they were planning to implement (triangles):

image

Be sure to read Tamura Jones’ article from 2012 about AncestorSync Connect which detailed a lot of the things that Ancestor Sync was trying to do.

Then read Tamura’s 2017 article that tells what happened to AncestorSync and describes the short-lived attempt of Heirlooms Origins to create what they called the Universal Genealogy Transfer Tool.


So What’s Needed?

I know what I want to see. I want my genealogy software on my computer to be able to download the information from the online sites or other programs into it, show the information side by side, and allow me to select what I want in my data and what information from the other trees I want to ignore. Then it should be able to upload my data the way I want it back to the online sites, overwriting the data there with my (understood to be) correct data. Then I can periodically re-download the online data to get new information that was added online, remembering the data from online that I wanted to ignore, and I can do this “select what I want” again.

I would think it might look something like this:

image

where the items from each source (Ancestry, MyHeritage, FamilySearch and other trees or GEDCOMs that you load in) would be a different color until you accept them into your tree or mark them to ignore in the future.

By having all your data from all the various trees together, you’ll easily be able to see what is the same, what conflicts, and what new sources are brought in to look at, and you can make decisions based on all the sources you have as to what is correct and what is not.
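As a very rough illustration of the "same, conflicts, new" idea, here is a Python sketch. Everything in it is hypothetical: the data layout, the `compare_sources` name, and especially the assumption that people have already been matched to the same ID across all the trees, which in practice is the hard part.

```python
from collections import defaultdict

def compare_sources(mine, others, ignored=frozenset()):
    """Classify every fact found in other trees against my own tree.

    mine:    {person_id: {field: value}}
    others:  {source_name: {person_id: {field: value}}}
    ignored: set of (source_name, person_id, field, value) tuples that
             were rejected in an earlier session and should stay hidden.
    """
    report = defaultdict(dict)
    for source, tree in others.items():
        for person, fields in tree.items():
            for field, value in fields.items():
                if (source, person, field, value) in ignored:
                    continue
                current = mine.get(person, {}).get(field)
                if current is None:
                    status = "new"
                elif current == value:
                    status = "same"
                else:
                    status = "conflict"
                report[(person, field)][source] = (status, value)
    return dict(report)

# Made-up example data.
mine = {"I1": {"birth": "1850"}}
others = {
    "Ancestry": {"I1": {"birth": "1850", "death": "1910"}},
    "FamilySearch": {"I1": {"birth": "1851"}},
}
for key, found in compare_sources(mine, others).items():
    print(key, found)
```

Remembering the `ignored` set between downloads is what would let a periodic re-download show only what is genuinely new.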

Hmm. That above example looks remarkably similar to Behold’s report.

I think we’ll get there. Not right away, but eventually the genealogical world will realize how fragmented our data has become, and will ultimately decide that they need to see all their data from all sites together.