
Louis Kessler’s Behold Blog

My First MyHeritage Theory of Family Relativity - Wed, 29 Jun 2022

I opened my MyHeritage account in 2014 and I’ve been a subscriber to their Complete Plan since February 2018. It was then that I started using MyHeritage as my primary site for storing my family tree information.

I took a MyHeritage DNA test in 2007 and uploaded my uncle’s test from FTDNA in 2018. I linked both my and my uncle’s tests to my tree.

By the end of 2019, I likely had 1000 genetic relatives in my tree. Today that number is probably close to 1500.

But I did not have any Theories of Family Relativity.  Why not?

Over at Ancestry, I have only a small version of my tree, maybe 200 people. I DNA tested there as well. Ancestry has its ThruLines, which are similar to MyHeritage’s Theories. I have about 20 ThruLines over at Ancestry and about 5 of them helped me connect to new relatives.

So again, why do I not have any Theories at MyHeritage?

In February 2021, I submitted a support request to MyHeritage asking that question. A member of their DNA Support Team replied that it didn’t make sense to him either. The final answer was that the Research and Development team did not have a solution yet for me. They said they’ll be giving priority to resolving this and that they are injecting new Theories of Family Relativity soon, so hopefully I will have some Theories.

The next Theory release in the summer of 2021 did not have any theories for me.

This morning, I saw this June 29, 2022 blog post from MyHeritage: New Update to Theory of Family Relativity. In it they state 25 million new Theories were added.  328 thousand kits that didn’t have any Theories now have at least one. And 233 thousand users will have at least one theory following this update.

I didn’t get my hopes up, but when I logged into MyHeritage, I saw this:

image

So I am one of the new Theorists!

I wrote the above before I looked to see what the Theory/ies might be. How many do I have? Are they accurate? Might they connect me to a new relative I have not yet found?


My Theories

Now for the reveal:  It turns out that I have my first 2 ever Theories of Family Relativity:

image


Theory 1

The first Theory for L.R. is very interesting to me. (Click on graphic for a larger image)

image

I do know from records that the father of my great-grandfather Haim Herzanu was Leib, so the connection with the first web site from Israel evaluated at 75% makes sense.

The same website also makes the 3rd connection between Moscu Hertzman and Leonard Hertzman evaluated at 100%. This Moscu in the tree from Israel is said to have been born in Dorohoi or Hertza in Romania, which is where my Herzanu ancestors are from.

It’s that middle private family site that seems to have something wrong. It has Leiba Hertzeanu as a brother of Leib Hertzeanu, which wouldn’t happen. The former was born in 1796 and the latter in 1848, and it’s unlikely brothers would be 52 years apart. It is possible that the latter Leib was a grandson of the former.

But I’m not complaining. This information from this Israeli site may have a set of relatives that I do not know about, and I may be able to ultimately connect to the L.R. who is my DNA match. So there’s some enjoyable research work that will come out of this. I’m definitely going to have to contact the owner of that site from Israel.

My Uncle is on this side of my family, and he has this same Theory.


Theory 2 with 4 Paths

The 2nd Theory for D. Z. is equally interesting: It actually has 4 different paths.

Path 1, 71% confidence:

image

Path 2, 75% confidence:

image

Path 3, 67% confidence:

image

Path 4, 20% confidence:

image

These are great. Two different Israeli web sites, the Geni site, and a private site are involved in those 4 paths.

My Uncle’s DNA does not have this Theory. I’m not sure why not. I would think it should since it is along the same line as the 1st Theory. This could be something I can get MyHeritage to check into.


Conclusion

None of these theories are anything close to proof, but they certainly are good suppositions that will allow me to explore and contact the other site owners to share information and any documents we have. If these ancestors are truly from Dorohoi/Hertza, I know where records are obtainable from there that may be able to validly connect my family with those in these other trees.

I can finally see what the excitement is with this Theory of Family Relativity technology, which matches trees to records to DNA. It provides plenty of avenues for you to explore.

Building a Base Pair to Centimorgan Map - Thu, 16 Jun 2022

My last post defined base pairs and centimorgans, explained their relationship with each other, checked the accuracy of one genetic map, and described 3 converters that will calculate cM from base pairs.

Before leaving this topic, I wanted to document what I tried in an attempt to create an accurate bp to cM map using segment match files.


Segment Match Files

Segment Match Files contain all the matches for a person. You can download them from Family Tree DNA, 23andMe, MyHeritage DNA or GEDmatch-Tier1.

For each segment match, they provide at least the name of the person you match with as well as the chromosome, starting and ending base pair, cM, and number of SNPs. Here is an example of the beginning of a segment match file from Family Tree DNA:

image

These use Family Tree DNA’s bp to cM map, with the Centimorgan value shown to lots of decimal places. It says, for example, that the segment on chromosome 1 from bp 203,910,220 to bp 209,092,631 is 7.594626 cM.

There are also a lot of segments given in Family Tree DNA’s segment match files. My file lists 188,438 segments for the 32,449 people that I match to.

23andMe’s segment match file looks like this:

image

and it has more information to the right about the person matched to. It also gives an accurate cM value (e.g. 19.8441906), which it calls the “Genetic Distance”.

However, I only have 10,828 segments in my match file because 23andMe limits matches to 1,500 people, which can be increased to 5,000 with a subscription to their Plus service.

MyHeritage DNA’s segment match file looks like this:

image

They do not have an accurate cM value. It is fine for most purposes but is rounded off to a tenth of a cM, e.g. 86.4.

My MyHeritage file has 75,028 segments for 19,162 people.

Finally, GEDmatch’s segment match file looks like this:

image

Like MyHeritage, GEDmatch also rounds their cM values to tenths of a cM.

My GEDmatch file only has 10,000 segments which is the limit GEDmatch allows. Those are for 1,955 people.


What is the Goal?

I want to come up with a map that, for a particular company, will map a bp position to a cM genomic position on each chromosome. Then if you have the bp at the start of a segment and the bp at the end of the segment, you can determine the genomic positions at the start and the end of the segment. The cM of the segment then can be determined by subtracting the starting genomic position from the ending genomic position.

So we want a table that looks like:

image

This table is from the one that Amy Williams and Jonny Perl use.

So if we have a segment from 564,598 bp to 1,100,217 bp, then that segment would be 2.743511 – 1.478148 = 1.265363 cM.

If we had a start or end position in between two of those values, then we could interpolate.  e.g.: at bp = 850,000, the cM would be:

cM = 2.028035
               + (2.595322 – 2.028035) * (850,000 – 785,050) / (957,898 – 785,050)

which equals 2.241201 cM.
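
For anyone who wants to try the table lookup in code, here is a minimal Python sketch. It is only an illustration, not the code Amy Williams or Jonny Perl use. The four rows are just the values quoted in the text (the full table in the image has more rows, possibly some between these), and the names GENETIC_MAP, cm_at and segment_cm are made up for the example.

```python
from bisect import bisect_right

# (bp, cM) rows quoted in the text; the real table has many more rows.
GENETIC_MAP = [
    (564_598, 1.478148),
    (785_050, 2.028035),
    (957_898, 2.595322),
    (1_100_217, 2.743511),
]

def cm_at(bp, table=GENETIC_MAP):
    """Genomic position in cM at bp, by linear interpolation between rows."""
    positions = [row[0] for row in table]
    if not positions[0] <= bp <= positions[-1]:
        raise ValueError("bp is outside the range covered by the table")
    i = bisect_right(positions, bp) - 1      # last row with position <= bp
    if i == len(table) - 1:                  # bp is exactly the last row
        return table[i][1]
    (bp0, cm0), (bp1, cm1) = table[i], table[i + 1]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)

def segment_cm(start_bp, end_bp, table=GENETIC_MAP):
    """Segment length in cM: end genomic position minus start genomic position."""
    return cm_at(end_bp, table) - cm_at(start_bp, table)

print(round(cm_at(850_000), 6))                   # interpolated position
print(round(segment_cm(564_598, 1_100_217), 6))   # segment from the example
```

The binary search (bisect) plays the role of the fast database lookup for the entries on either side of the requested position.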

This system works well when the programmer is using a database which has a fast lookup for entries on either side of the lookup value 850,000.

Alternatively, this can be approximated and simplified by interpolating values every 100,000 bp and setting them up in a simple array:

image

These are now interpolated and no longer exact. Here’s a comparison of the Table values versus the Array values:

image

You can barely see the differences between the two. So the array values should be good enough to get segment cM within 0.1 cM.

The advantage of storing this in an array is that it simplifies programming, uses less memory and is faster to look up and calculate. With bp = 850,000, we know without lookup to use the [8] and [9] entries, and the interpolation becomes:

cM = 2.077101 + (2.405301 – 2.077101) * (850,000 – 800,000) / 100,000

equalling 2.241201 cM

which in this case happens to be exactly what the result was for the table method. That’s only because there are no table points between 800,000 and 900,000. If there were, the results would slightly differ.
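
The array version can be sketched in Python as follows. This is a rough illustration under stated assumptions: only the two table rows quoted above (785,050 and 957,898) are used, so only entries [8] and [9] are filled in and the rest are left empty. The names STEP, cm_array and array_cm are invented for the example.

```python
STEP = 100_000

def interpolate(bp, left, right):
    """Linear interpolation of cM at bp between two (bp, cM) table rows."""
    (bp0, cm0), (bp1, cm1) = left, right
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)

# The two table rows quoted in the post that bracket 800,000 and 900,000.
left_row, right_row = (785_050, 2.028035), (957_898, 2.595322)

# Array indexed by bp // STEP. Entries other than [8] and [9] would need table
# rows that are not quoted in the post, so they are left as None in this sketch.
cm_array = [None] * 10
for i in (8, 9):
    cm_array[i] = interpolate(i * STEP, left_row, right_row)

def array_cm(bp, arr=cm_array, step=STEP):
    """Genomic position at bp using the array: no search, just an index."""
    i = bp // step
    if i + 1 >= len(arr) or arr[i] is None or arr[i + 1] is None:
        raise ValueError("no array entries for this position in the sketch")
    return arr[i] + (arr[i + 1] - arr[i]) * (bp - i * step) / step

print(round(cm_array[8], 6), round(cm_array[9], 6))
print(round(array_cm(850_000), 6))
```

The index comes straight from integer division of the bp value, which is why no lookup is needed.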

Okay. That’s what we need. How do we get the values?


First Attempt:  Optimization

The idea here is to do this:

For each chromosome, create an array with bp values from 0 to the length of the chromosome in steps of 100,000.  Assign each entry an initial cM value that increases by 0.1 cM for each 100,000 bp.

image

Now we take each of the matches in our segment match file and compare the actual with the cM value calculated in this table and we square the difference.

image

We sum the Diff Sq column. And our optimization goal is to minimize the total sum of squares by changing the cM values assigned to the 100,000 bp values.

In Excel, I used their Solver tool, setting the objective as the Min of the total sum of squares cell, by allowing the algorithm to change any of the cM cells except the 0 cell. What I got was this:

image

Excel only allows 200 variable cells.

If you try 200 at once, it takes forever. If you try about 20 at once, it can solve the problem in a few minutes but gives some cM values lower than the previous one which is not possible. So then you have to add constraints to prevent this from happening.

This isn’t the best sort of problem for Excel to solve. Better would be to use a statistical package like R or to custom program the optimization.
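
As one idea of what a custom-programmed optimization might look like, here is a small Python sketch. It is not what Excel’s Solver does. It swaps in plain projected gradient descent, and it fits the cM increment of each 100,000 bp bin rather than the cumulative values, so that keeping every increment at zero or above guarantees the cM values never decrease. The segments are synthetic toy data, and the function names, step size (rate) and iteration count are assumptions tuned only for this toy example.

```python
STEP = 100_000

def bin_weights(start, end, n_bins, step=STEP):
    """Fraction of each bin that the segment from start to end covers."""
    weights = []
    for i in range(n_bins):
        overlap = min(end, (i + 1) * step) - max(start, i * step)
        weights.append(max(0, overlap) / step)
    return weights

def fit_increments(segments, n_bins, rate=0.05, iterations=2000):
    """Least-squares fit of per-bin cM increments, kept non-negative."""
    rows = [(bin_weights(s, e, n_bins), cm) for s, e, cm in segments]
    inc = [0.1] * n_bins                      # starting guess: 0.1 cM per bin
    for _ in range(iterations):
        grad = [0.0] * n_bins
        for w, cm in rows:
            err = sum(wi * di for wi, di in zip(w, inc)) - cm
            for i, wi in enumerate(w):
                grad[i] += 2 * err * wi       # derivative of the squared error
        # Gradient step, then clamp at zero so the map stays monotonic.
        inc = [max(0.0, d - rate * g) for d, g in zip(inc, grad)]
    return inc

def cumulative(inc):
    """Turn per-bin increments into cM positions at 0, STEP, 2*STEP, ..."""
    positions, total = [0.0], 0.0
    for d in inc:
        total += d
        positions.append(total)
    return positions

# Synthetic segments (start bp, end bp, cM) generated from known increments.
true_inc = [0.1, 0.5, 0.2]
segments = [
    (0, 100_000, 0.1),
    (100_000, 200_000, 0.5),
    (200_000, 300_000, 0.2),
    (0, 300_000, 0.8),
    (50_000, 250_000, 0.65),
]
fitted = fit_increments(segments, n_bins=3)
print([round(d, 4) for d in fitted])
print([round(p, 4) for p in cumulative(fitted)])
```

With real match files there would be thousands of bins and segments, so the step size would need retuning, or a proper constrained least-squares routine from a statistical package would be the better choice.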


Second Attempt – Following the Segment Trail

So then I thought I’d try a different tack.

How about starting with the first match on the chromosome? For me at Family Tree DNA, on chromosome 1 that is a match from the base pair starting location 72,526. I have segments that match with 16 different people starting at that location, and they end at various locations from 3,493,819 to 4,932,655. Those segments range from 6.210586 cM to 10.2785 cM.

image

Base pair 3,493,819 therefore is at a genomic position 6.210586 cM higher than base pair 72,526.

If for each of those end locations, I find other segments that start at that location, then I can add those segment lengths to 6.210586 to get the genomic position of the ending base pair location.

And also, if for any of those end locations I find other segments not starting at 72,526 that end there, then I can subtract those segment lengths from 6.210586 to get the genomic position of the starting base pair location.

I can continue this with each base pair that is assigned a genomic position until it runs out.
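
The chaining idea above can be sketched in Python as a breadth-first walk over the segments. This is a minimal illustration, not the procedure exactly as I ran it. The first two segments use the start and end values quoted above, the other three are hypothetical and added only to show the forward, backward and unreachable cases, and the function name chain_positions is made up.

```python
from collections import defaultdict, deque

def chain_positions(segments, origin):
    """Assign a genomic position (cM) to every bp reachable from origin.

    segments is a list of (start_bp, end_bp, cm) tuples. The origin gets
    position 0; positions are propagated along segments that share an endpoint.
    """
    by_start = defaultdict(list)
    by_end = defaultdict(list)
    for start, end, cm in segments:
        by_start[start].append((end, cm))
        by_end[end].append((start, cm))

    position = {origin: 0.0}
    queue = deque([origin])
    while queue:
        bp = queue.popleft()
        here = position[bp]
        for end, cm in by_start[bp]:          # segments starting here: add
            if end not in position:
                position[end] = here + cm
                queue.append(end)
        for start, cm in by_end[bp]:          # segments ending here: subtract
            if start not in position:
                position[start] = here - cm
                queue.append(start)
    return position

segments = [
    (72_526, 3_493_819, 6.210586),    # values quoted in the post
    (72_526, 4_932_655, 10.2785),     # values quoted in the post
    (3_493_819, 6_000_000, 3.0),      # hypothetical: chains forward
    (2_000_000, 4_932_655, 5.0),      # hypothetical: chains backward
    (9_000_000, 9_500_000, 1.0),      # hypothetical: never reached
]
positions = chain_positions(segments, origin=72_526)
for bp in sorted(positions):
    print(bp, round(positions[bp], 6))
```

Any base pair that never shows up in the result is one that could not be chained to the starting location.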

I tried this for Family Tree DNA data. I took 10 segment match files and combined them together. I extracted Chromosome 1 and sorted by start location and end location. I eliminated duplicates because for the same start and end base pairs, the cM was always the same. That gave me 65,627 segments that covered positions 72,526 to 249,222,527.

Those 65,627 segments had 131,254 start and end positions. There were 38,225 unique positions, so each unique position was used on average in 3.4 segments.

I assigned base pair location 72,526 the genomic position 0.  With the 10 files I had 21 unique segments starting at that position, compared to the 16 just for me that I show above which had 14 unique.  I assigned the 21 genomic positions to the end points.

From those 21 end points, the file had 22 segments that started at one of them and 12 segments that ended at one of them that I hadn’t encountered already.

I assigned the new genomic positions to the other ends of those segments, and now I had 92 new starting segments and 177 new ending segments to process.

It took 20 iterations of this procedure until I ran out of segments to process.  By then, I had assigned genome positions to 35,082 or 92% of the unique positions. Here is what the first few final assignments looked like:

image

The –999 values are those that were not assigned. If we remove those, we then have a very accurate table that can be used for determining cM length from a start base pair and end base pair for Family Tree DNA data.

Compare this to the first table in the “What is the Goal” section above. That table was not accurate enough for Family Tree DNA and you can see that the genomic position at base pair 957,898 was 2.6 cM when for Family Tree DNA, it should have been closer to 0.8 cM.

Unfortunately, I couldn’t get this method to work at 23andMe because I had a lot fewer segments to work with due to their 1,500 person limit, and I only had 5 files from other people to combine mine with. For chromosome 1, I only had 1801 segments to work with and could only chain 76 of them together. More data would be needed for this technique to work at 23andMe.

At MyHeritage and GEDmatch, the problem is that the cM values for the segments are only given to 1 decimal place. That means each value only has an accuracy of +/- 0.05 cM.  And the successive adding and subtracting of these for 20 iterations will compound the error.


Conclusion

Well that was fun, but solving this problem is not my main goal in life. I think for now I’ll just leave it here so that someone else who gets the urge, will have some ideas to try.

Converting Base Pairs to Centimorgans - Sun, 12 Jun 2022

DNA Testing companies provide you with your matches and quantify how closely you match each person by giving you a total value in centimorgans (cM).

In addition, companies other than Ancestry DNA also provide you with all your individual segment matches and tell you the centimorgans of every segment. For each segment they also tell you the segment’s starting and ending base pair location.

For example, at Family Tree DNA, I share 2009 cM with my uncle. These are what Family Tree DNA provides for the matches I have with my uncle on chromosome 1:

image

The Start Location and End Location are the position in base pairs (bp) along the chromosome where each match starts and ends as determined by Family Tree DNA.


Centimorgans

The Centimorgans value is a measure of how likely it is that the segment will recombine in one generation. A segment of 1 centimorgan has a 1% chance of recombining. The general equation is:

Chance of recombination in one generation = 1 – (e ** (–cM of segment / 100))

where e is the constant known as Euler’s number = 2.718281828…

So: 

  • a 1 cM segment has a 1 – (e ** –0.01) = 1.0% chance of recombining
  • a 10 cM segment has a 1 – (e ** –0.1) = 9.5% chance of recombining
  • a 30 cM segment has a 1 – (e ** –0.3) = 25.9% chance of recombining
  • a 75 cM segment has a 1 – (e ** –0.75) = 52.8% chance of recombining
  • a 200 cM segment has a 1 – (e ** –2.0) = 86.5% chance of recombining
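
The values in the list above can be reproduced with a few lines of Python. This is just a check of the arithmetic, with the exponent negative as in the bullet points; the function name recombination_chance is made up for the example.

```python
import math

def recombination_chance(cm):
    """Chance that a segment of the given cM recombines in one generation."""
    return 1 - math.exp(-cm / 100)

for cm in (1, 10, 30, 75, 200):
    print(f"{cm} cM: {recombination_chance(cm):.1%}")
```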

On the ISOGG Wiki’s Centimorgan page, there is a nice graph of the probability of crossover by segment length.

Also on the ISOGG page you can see from their cM values per chromosome table that the centimorgans for chromosome 1 as of 2015 were:

  • 267.21 at Family Tree DNA
  • 281.5 at GEDmatch
  • 284 at 23andMe

So each company provides slightly different estimates of centimorgans. For the autosomes (chromosomes 1 to 22) plus X, the totals range between 3580 and 3783 cM.

Recombinations are important because they represent a crossover of the two parental chromosomes, and that results in an endpoint for matches.

Males and females significantly differ in centimorgan values. Most companies don’t deal with male and female values, but use a combined average. That’s because once you’re talking about cousins and more distant relatives, the values tend to average out.


Base Pairs

A base pair is one individual position on a chromosome. It is made up of a pair of alleles that are bonded together, one from the chromosome’s forward strand and one from its reverse strand. Each allele has the value A, C, G or T, so the value of the base pair normally is shown as a pair of the values, e.g. AA or CT.

Surprisingly, ISOGG doesn’t have a table of the number of base pairs per chromosome, so you have to go to the Chromosome page on Wikipedia for it.

The autosomes (chromosomes 1 to 22) plus X total 3,022,102,095 base pairs, and chromosome 1 alone has 247,199,719 base pairs.

The number of base pairs given are very exact. They are all defined precisely so that references can be made to the specific base pairs of interest and everyone can “talk the same language” and be referring to the same segment on the chromosome.

Unfortunately, science is still working to fully define the human genome, so these base pair definitions are continuing to be updated. The National Library of Medicine maintains the Genome definitions. Here for example is GRCh37 from 2009, also known as Build 37 and as hg19. It was preceded by Build 36 (hg18) and by many other definitions before that. Build 37 itself has undergone 29 revisions and in 2013 was replaced by GRCh38 which itself is now at revision 14.

Obviously, DNA companies cannot be constantly changing their base pair positions several times each year. Fortunately, the companies all decided to stick with a common version of Build 37 to define their allele locations. This is good because it allows DNA testers to transfer their raw data between platforms.

For the purpose of relative matching, only about 700,000 out of the 3 billion locations are tested, because these are the ones that are most likely to have differences between people or be of medical interest. These SNPs (Single nucleotide polymorphisms) generally are well-defined and should remain in the future builds of the human genome, although their position will continue to change as other positions between them are added and removed.


Why Do We Need to Map Mbp to cM?

For ancestry research, the cM value is important to have. Because of the way segment matching works, you can have segments that match simply by chance, where either allele of one person is matching either allele of the other person at each position of the segment. The cM value is a good measure of how likely this is to happen.

If you were comparing segments that are phased (separated correctly into their two parents) for both people, then this by chance matching wouldn’t happen. The ISOGG wiki has a nice graph of the probability a match survives phasing which indicates that segments under 15 cM are subject to being a by chance match, often referred to as a false match.

When using triangulated segments, you are comparing 3 people’s segments with each other and they must all match. The likelihood of false matches in this case is reduced and people like Jim Bartlett have indicated that by chance triangulated matches may start happening under 7 cM.

Therefore centimorgans tell us what segments are “too small” for our analysis.

If centimorgans were always available, then we’d be happy. But they are not always available.

Let us say you match two people, one on a 20 cM segment and the other on a 25 cM segment and the two segments overlap.

image

The problem is in determining how many centimorgans the overlapping region between 30 Mbp and 50 Mbp is. You don’t know that. You only know that the overlapping region is 20 Mbp. You would need to calculate the centimorgans if you want them.


The Relationship Between Base Pairs and Centimorgans

Base pairs are large numbers expressing the position on the chromosome in millions to hundreds of millions. To simplify and approximate them, we can divide the values by 1 million and refer to Megabase pairs (Mbp).

So there are about 3,022 Mbp in the autosomes and 247 Mbp in chromosome 1 alone. Compare this to about 3,600 cM in the autosomes and 270 cM in chromosome 1. The Mbp and cM are similar. A simple rule of thumb is that the number of cM is roughly the same as the length of the segment in Mbp.

But there is quite a bit of variation. For example, the table above of my matches with my uncle can be re-expressed as:

image

As you can see, the ratio between cM of a segment and Mbp for the segment for this small sample of segments ranges from 0.83 to 2.37. This is because recombination rates vary considerably depending on which part of what chromosome you are looking at.
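
Computing the ratio for a segment is simple arithmetic, sketched here in Python. The segment used is the Family Tree DNA chromosome 1 segment quoted in the June 16 post (bp 203,910,220 to 209,092,631, 7.594626 cM), not one of the matches with my uncle, and the function name cm_per_mbp is made up for the example.

```python
def cm_per_mbp(start_bp, end_bp, cm):
    """Ratio of a segment's cM to its length in megabase pairs."""
    mbp = (end_bp - start_bp) / 1_000_000
    return cm / mbp

ratio = cm_per_mbp(203_910_220, 209_092_631, 7.594626)
print(round(ratio, 2))
```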

Family Tree DNA wrote this about how they determine the cM value for a DNA segment:

The Family Tree DNA bioinformatics team works with centiMorgan (cM) data from the International HapMap project.

Current knowledge of centiMorgan values across the human genome comes from the International HapMap project testing. The project tested father-mother-child trios from global population groups. Using this information, they mapped recombination rates across the human genome.

The tables they may be using could be the ones the NCBI made available in 2008. Here’s the beginning of Chromosome 1:

image

This is saying the cM/Mbp ratio is very low at the beginning of Chromosome 1, but once you reach position 711,153, the ratio has risen to 2 cM per Mbp. 

To calculate the cM for a segment, subtract the cM value at the start position from the cM value at the end position. e.g. from 554,461 to 730,720, the cM would be calculated as 0.042 – 0.001 = 0.041 cM.

If you have a position between those listed, then you would interpolate to determine the cM for the position desired.


Checking the Calculations

Jonny Perl, the developer of DNA Painter, recently mentioned to me the work of Amy Williams, a computer scientist and geneticist at Cornell. Jonny provided me with Amy’s Minimal viable genetic map which reduces the HapMap from nearly 3.4 million entries to just over 32,000 entries. Jonny told me that this file works quite well for 23andMe and MyHeritage matches.

I took this map data and checked it against all the chromosome 1 segments from a Family Tree DNA test that I’ll call “Terry”. I was surprised to see poor results.

So I re-did this but used my own test’s matches at Family Tree DNA, 23andMe, MyHeritage and GEDmatch. I found that Jonny was correct and both 23andMe and MyHeritage gave good results, but Family Tree DNA and GEDmatch gave poor results.

image

Family Tree DNA and GEDmatch had a standard deviation between 1.2 and 1.7 cM which meant they were only accurate 95% of the time to within 3 cM. If you are trying to see if a segment is above or below a 7 cM or even a 15 cM threshold, an accuracy of +/- 3 cM is really not very good. 

Not only that, the GEDmatch cM values on average were 1.1 cM different from the map calculations, so there must be a bias in the GEDmatch values.
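
A comparison like this comes down to the mean and standard deviation of the differences between the company’s cM and the map’s cM for each segment. Here is a minimal Python sketch. The numbers are made-up toy values, not my match data, the function name compare_to_map is invented, and using the sample standard deviation is an assumption.

```python
import statistics

def compare_to_map(company_cm, map_cm):
    """Return (mean difference, standard deviation) of company cM minus map cM."""
    diffs = [c - m for c, m in zip(company_cm, map_cm)]
    return statistics.mean(diffs), statistics.stdev(diffs)

# Toy values only: three segments as reported by a company and as calculated
# from a genetic map.
company = [11.0, 12.0, 13.0]
mapped = [10.0, 10.0, 10.0]
bias, spread = compare_to_map(company, mapped)
print(bias, spread)
```

A mean difference well away from zero points to a bias, while the standard deviation shows how far individual segments stray from the map.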

I then took a look at the extreme values, the ones where the map’s calculations were furthest from the company’s calculations, by cM difference, and by percentage difference:

image

Out of 5221 segments, every one of MyHeritage’s calculations was within 0.1 cM of the Map calculation. They were not exact however, so MyHeritage must have been using a mapping very close to what Amy produced.

23andMe were not quite as close, but were still close enough that Amy’s mapping could be used for them. 

Whereas Family Tree DNA and GEDmatch must be using a different mapping.


Methods to Calculate cM from Mbp

There are a few different Mbp to cM calculators available out there. First there’s Amy’s own calculator using her minimal map, called Lookup segment cM. I can take my extreme valued segments from above. Of the 10, 8 are different. If I plug those 8 segments into Amy’s calculator, I get:

image

These are all the same as I got when using and interpolating Amy’s table myself, except for one value which came out as 3.4 cM instead of 3.7 cM.  I’m not sure why the difference in that one, but 7 out of 8 ain’t bad.

Amy allowed Jonny Perl to port her program so you’ll also find it as the cM Estimator tool on his DNA Painter site.

There is an online service for estimating recombination rates along the genome called MareyMap Online. Genealogists will normally want to select Homo sapiens, and could choose mean, male or female. They have 3 different estimation methods you can choose from:  Sliding window, Loess, and Cubic Splines. Once you do that, you can then calculate the Genetic cM position from a physical position (bp):

image

I used the default settings (Loess method) and found the genetic positions for each of the 16 start and end points. I did this for both the mean values and the male values. Subtracting the genetic start point from the end point gives the centimorgan value for the segment.

Hendrick Wendland created a MapS Converter program that you were at one time able to download. Seems the download is not working currently. But I still had it on my computer and tried it out.

image

It has a nice feature of being able to convert between Build 36 and Build 37 base pair locations, and you can select between mean, female and male.

I compared the results of the three tools to the extreme values I had. I marked those closest to the company value in green and those not closest but still within 1 cM of the company value in yellow:

image

Well the results are all over the board. The data I was using are the extreme outliers for Amy’s mappings so it was a tough test.

It is possible that MapS is giving similar results as GEDmatch, so maybe GEDmatch is using a similar mapping. And MareyMap Online’s male values give the closest match to the extreme FTDNA values. More testing with a bigger dataset would be needed to confirm these statements.

What this all says is that there are many different mapping tables that can be used and it looks like each company has chosen their own mapping method.


Why I Investigated This

My program Double Match Triangulator (DMT) screens small segments out by using a cM threshold for single segments and another one for triangulations. For triangulations, 3 segments are involved, and they must all be at least the default 7 cM (which the user can set to something else). The segment match data gives cM values. But they don’t always give the size of the triangulation itself. You still may have three 15 cM segments that all overlap, and that overlap which is the triangulating region might be only 2 cM. It would be nice to filter those out.
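
The filtering idea can be sketched in Python like this. It is not DMT’s actual code. The segment positions are hypothetical, the names triangulated_region and passes_threshold are invented, and the cM function passed in is just the rough 1 cM per Mbp rule of thumb standing in for a real company-specific map.

```python
def triangulated_region(segments):
    """Common overlap (start_bp, end_bp) of the segments, or None if none."""
    start = max(s for s, _ in segments)
    end = min(e for _, e in segments)
    return (start, end) if start < end else None

def passes_threshold(segments, cm_between, min_cm=7.0):
    """True if the overlapping region itself is at least min_cm long."""
    region = triangulated_region(segments)
    if region is None:
        return False
    return cm_between(*region) >= min_cm

def rough_cm(start_bp, end_bp):
    """Placeholder map: the 1 cM per Mbp rule of thumb, not a real genetic map."""
    return (end_bp - start_bp) / 1_000_000

# Three hypothetical 15 Mbp segments whose common overlap is only 2 Mbp.
small_overlap = [(10_000_000, 25_000_000),
                 (12_000_000, 27_000_000),
                 (23_000_000, 38_000_000)]
print(triangulated_region(small_overlap))
print(passes_threshold(small_overlap, rough_cm))
```

Swapping rough_cm for an interpolation against a map built from the company’s own cM values would make the threshold test reflect that company’s mapping.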

Also there are some inferred segments where the cM cannot be calculated because it is the segment adjacent to a triangulation where Person A no longer matches. If that extension cM could be calculated, then those that are too small could be filtered out.

When Jonny provided me with Amy’s data, I started to implement it into DMT but did not like the results for Family Tree DNA data. I likely wouldn’t have noticed it if I had first tried it for MyHeritage or 23andMe data.

Going forward, if I do decide to try to determine the centimorgans of the unknown triangulating sections and inferred match extensions, I’ll take the cM values that the company provided with the matches and build an internal map of Mbp to cM. Then I’ll use that map to interpolate the cMs of the needed segments. At least that way, the estimate that is used will be reflective of the company’s mapping.

This is not a major thing for DMT. Adding it would slow DMT down and I don’t feel the benefits of determining the centimorgans of triangulations and inferred segment extensions are worth the slowdown.