Louis Kessler’s Behold Blog

Catching Up - Sat, 8 Oct 2022

I was shocked to notice that I hadn’t blogged in over 3 months. I think it’s time to catch up with what’s been going on.


The Last 3 Months

To be honest, I’ve had another non-genealogy project distracting me and taking some of my genealogy/programming time away. Also, we had a beautiful summer here in Winnipeg, and I’ve been doing lots of bike riding and swimming. It’s much harder to be indoors on my computer for programming when it’s so nice outside.


Double Match Triangulator

I’ve actually released 6 minor versions (5.0.1 to 5.0.6) since June. Most have had to deal with changes at GEDmatch. They seem to have been making a lot of improvements there and their changes can break DMT. I always try to address anything no longer working as soon as I find out about it.

Meanwhile, I have been involved for over a year on a DNA project with my wife’s 3rd cousin. He had close to 60 descendants of my wife’s great-grandfather and five suspected siblings, and we are trying in spite of endogamy to determine whether they are indeed siblings. I have been trying to use DMT to help with the analysis and the data has been an excellent test bed for DMT. It has led to improvements included in Version 5.0 that was released in May.

I still have some work to do on that study, and that may test DMT some more.


Behold

During the Spring, I was able to spend a couple of months working on Behold, trying to finish up Version 1.3 and get the Everything Report to be how I wanted it. But then the DMT work and the summer got in my way.

For the longest time, I expected Behold would become the genealogy editor that I would use for recording my genealogy. But I never got past it being just a genealogy data viewer. A few years back, in 2018, I decided I couldn’t wait for myself and Behold any more and I needed an editor for my genealogy. I decided to use MyHeritage online and its Family Tree Builder software on my computer. I am very happy that I did. That’s because there’s nothing like the smart matches that do a lot of research for you, pointing at records that more often than not are correct, and connecting you with other researchers and their trees.

As a result, my ideas for Behold have changed. I now have all my family tree data up on MyHeritage, and some of my data up on Ancestry, FamilySearch, WikiTree, Geni, Geneanet and GenealogyOnline. Of those, FamilySearch, WikiTree and Geni are one-world trees, meaning other people are editing and updating my family there as well.

What I need now, rather than an editor, is a tool to help me evaluate and compare the information I have on MyHeritage to the information that’s on the other trees. I need to be able to find out what’s changed from my own information so that I can determine what needs to be updated. And I’d like some help to keep the thousands of profiles I have on the various systems in sync along with the photos and records attached to them.

There used to be a program called AncestorSync. It was designed to keep all your trees in sync on the various systems. It would have been perfect. But unfortunately, the team stopped developing the program.

I have no desire to try to get Behold to actually sync your data. That would be a lot of work as I’d have to make agreements with the various tree operators to be allowed to update their systems, and I’d have to be meticulous in learning to use their APIs (Application Programming Interfaces) to make the updates without mistakes. That’s a bit more than I want to take on.

Instead, I could do the next best thing: Set up Behold to help you manually update another tree with your data from a different tree. I know I myself want that function, and I would think that a lot of other people might find that very useful as well. We’ll see how it goes.


Conferencing and Social Media

After over 2 1/2 years of our Covid world, in-person genealogy conferences are finally starting up again. During the 2010’s, I attended and gave talks at 11 International Genealogical Conferences, 6 in the United States, 3 in Canada, 2 on the high seas around Australia and New Zealand, and 1 in The Netherlands.

2020 changed things and the world started going virtual. Now we’re inundated with a cornucopia of genealogical webinars available to us while sitting at home in our pajamas on our computer. I know I have watched hundreds of genealogy presentations in the past 30 months through providers such as Legacy Family Tree Webinars, the Virtual Genealogical Association, RootsTech, Dear Myrtle, Family History Fanatics, WikiTree, GeneaBlogger, Ed Thompson, the Association of Professional Genealogists, and presentations by the FGS and local genealogy societies – not just where I live but from all over the world. I have also been a speaker at several of these online conferences.

And last year I took an excellent online SLIG (Salt Lake Institute of Genealogy) course to help me with my Russian research.

I have got to the point where my head is 98% full and I’ll be starting to be more selective of what webinars I watch and conferences I go to. There will have to be something quite new and appealing to attract me now.

Social media is sort of the same thing. Genealogy groups on Facebook exploded in popularity in the past 5 years. I participated quite a bit at the beginning, but they have become less useful to me since then.


DNA

I’ve done just about everything I can with my own DNA. I’ve tested at all the sites including Big-Y and mtDNA at Family Tree DNA. I’ve used all the tools at all the sites as well as many 3rd party DNA tools. Endogamy and ancestry that goes back only to the early 1800’s (Romania and Russian Empire) have limited my ability to connect to many of my DNA relatives. I have even taken a few WGS (Whole Genome Sequencing) tests to see what they could do for me.


Looking to the Future

My genealogy has taken great steps in the past couple of years, due to the help of MyHeritage hints and also hundreds of Russian and Romanian records of my family found for me by some excellent researchers. I’ve got to get this all organized and fully documented.

I’ll be continuing to work on Behold to help me with this, and will update DMT as needed. I’ve got my GenSoftReviews site that I’ll keep maintaining as well as my personal lkessler.com website. Maintaining websites, by the way, does take a fair bit of time. Especially when an issue like updating a PHP version takes hold.

I’m still very excited about the future, because we never know what’s in store.

My First MyHeritage Theory of Family Relativity - Wed, 29 Jun 2022

I opened my MyHeritage account in 2014 and I’ve been a subscriber to their Complete Plan since February 2018. It was then that I started using MyHeritage as my primary site for storing my family tree information.

I took a MyHeritage DNA test in 2007 and uploaded my uncle’s test from FTDNA in 2018. I linked both my and my uncle’s tests to my tree.

By the end of 2019, I likely had 1000 genetic relatives in my tree. Today that number is probably close to 1500.

But I did not have any Theories of Family Relativity.  Why not?

Over at Ancestry, I have only a small version of my tree, maybe 200 people. I DNA tested there as well. Ancestry has its Thru Lines which are similar to MyHeritage’s Theories. I have about 20 Thru Lines over at Ancestry and about 5 of them helped me connect to new relatives.

So again, why do I not have any Theories at MyHeritage?

In February 2021, I submitted a support request to MyHeritage asking that question. A member of their DNA Support Team replied back that it didn’t make sense to him either. The final answer was that the Research and Development team did not have a solution yet for me. They said they’ll be giving priority to resolve this and said that they are injecting new Theories of Family Relativity soon and hopefully I will have some Theories.

The next Theory release in the summer of 2021 did not have any theories for me.

This morning, I saw this June 29, 2022 blog post from MyHeritage: New Update to Theory of Family Relativity. In it they state 25 million new Theories were added.  328 thousand kits that didn’t have any Theories now have at least one. And 233 thousand users will have at least one theory following this update.

I didn’t get my hopes up, but when I logged into MyHeritage, I saw this:

image

So I am one of the new Theorists!

I wrote the above before I looked to see what the Theory/ies might be. How many do I have? Are they accurate? Might they connect me to a new relative I have not yet found?


My Theories

Now for the reveal:  It turns out that I have my first 2 ever Theories of Family Relativity:

image


Theory 1

The first Theory for L.R. is very interesting to me. (Click on graphic for a larger image)

image

I do know from records that the father of my great-grandfather Haim Herzanu was Leib, so the connection with the first web site from Israel evaluated at 75% makes sense.

The same website also makes the 3rd connection between Moscu Hertzman and Leonard Hertzman evaluated at 100%. This Moscu in the tree from Israel is said to have been born in Dorohoi or Hertza in Romania, which is where my Herzanu ancestors are from.

It’s that middle private family site that seems to have something wrong. It has Leiba Hertzeanu as a brother of Leib Hertzeanu, which wouldn’t happen. Also, the former was born in 1796 and the latter in 1848, and it’s unlikely brothers would be born 52 years apart. It is possible that the latter Leib was a grandson of the former.

But I’m not complaining. This information from this Israeli site may have a set of relatives that I do not know about, and I may be able to ultimately connect to the L.R. who is my DNA match. So there’s some enjoyable research work that will come out of this. I’m definitely going to have to contact the owner of that site from Israel.

My Uncle is on this side of my family, and he has this same Theory.


Theory 2 with 4 Paths

The 2nd Theory for D. Z. is equally interesting: It actually has 4 different paths.

Path 1, 71% confidence:

image

Path 2, 75% confidence:

image

Path 3, 67% confidence:

image

Path 4, 20% confidence:

image

These are great. Two different Israeli web sites, the Geni site, and a private site are involved in those 4 paths.

My Uncle’s DNA does not have this Theory. I’m not sure why not. I would think it should since it is along the same line as the 1st Theory. This could be something I can get MyHeritage to check into.


Conclusion

None of these theories are anything close to proof, but they certainly are good suppositions that will allow me to explore and contact the other site owners to share information and any documents we have. If these ancestors are truly from Dorohoi/Hertza, I know where records are obtainable from there that may be able to validly connect my family with those in these other trees.

I can finally see what the excitement is with this Theory of Family Relativity technology, which matches trees to records to DNA. It provides plenty of avenues for you to explore.

Building a Base Pair to Centimorgan Map - Thu, 16 Jun 2022

My last post defined base pairs and centimorgans, explained their relationship with each other, checked the accuracy of one genetic map, and described 3 converters that will calculate cM from base pairs.

Before leaving this topic, I wanted to document what I tried in an attempt to create an accurate bp to cM map using segment match files.


Segment Match Files

Segment Match Files contain all the matches for a person. You can download them from Family Tree DNA, 23andMe, MyHeritage DNA or GEDmatch-Tier1.

For each segment match, they provide at least the name of the person you match with as well as the chromosome, starting and ending base pair, cM, and number of SNPs. Here is an example of the beginning of a segment match file from Family Tree DNA:

image

These values reflect Family Tree DNA’s bp to cM map, with the centimorgan value shown to many decimal places. It says, for example, that the segment on chromosome 1 from bp 203,910,220 to bp 209,092,631 is 7.594626 cM.

There are also a lot of segments given in Family Tree DNA’s segment match files. My file lists 188,438 segments for the 32,449 people that I match to.

23andMe’s segment match file looks like this:

image

and it has more information to the right about the person matched to. It also gives an accurate cM value (e.g. 19.8441906) which it calls the “Genetic Distance”.

However, I only have 10,828 segments in my match file because 23andMe limits matches to 1,500 people, which can be increased to 5,000 with a subscription to their Plus service.

MyHeritage DNA’s segment match file looks like this:

image

They do not have an accurate cM value. It is fine for most purposes but is rounded off to a tenth of a cM, e.g. 86.4.

My MyHeritage file has 75,028 segments for 19,162 people.

Finally, GEDmatch’s segment match file looks like this:

image

Like MyHeritage, GEDmatch also rounds their cM values to tenths of a cM.

My GEDmatch file only has 10,000 segments which is the limit GEDmatch allows. Those are for 1,955 people.


What is the Goal?

I want to come up with a map that, for a particular company, will map a bp position to a cM genomic position on each chromosome. Then if you have the bp at the start of a segment and the bp at the end of the segment, you can determine the genomic positions at the start and the end of the segment. The cM of the segment can then be determined by subtracting the starting genomic position from the ending genomic position.

So we want a table that looks like:

image

This table is from the one that Amy Williams and Jonny Perl use.

So if we have a segment from 564,598 bp to 1,100,217 bp, then that segment would be 2.743511 – 1.478148 = 1.265363 cM.

If we had a start or end position in between two of those values, then we could interpolate.  e.g.: at bp = 850,000, the cM would be:

cM = 2.028035
               + (2.595322 – 2.028035) * (850,000 – 785,050) / (957,898 – 785,050)

which equals 2.241201 cM.

This system works well when the programmer is using a database which has a fast lookup for entries on either side of the lookup value 850,000.
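
Here is a minimal sketch in Python of that table lookup and interpolation. It is only an illustration: the four table rows are the ones quoted in this post, the function names are made up for the example, and a sorted list with a binary search stands in for the database lookup.

```python
from bisect import bisect_right

# (bp, cM) rows quoted in this post. A real map has many more rows per chromosome.
GENETIC_MAP = [
    (564_598, 1.478148),
    (785_050, 2.028035),
    (957_898, 2.595322),
    (1_100_217, 2.743511),
]
MAP_BP = [bp for bp, _ in GENETIC_MAP]


def genomic_position(bp):
    """Linearly interpolate the cM genomic position of a base pair."""
    if not MAP_BP[0] <= bp <= MAP_BP[-1]:
        raise ValueError("bp is outside the range covered by the table")
    i = bisect_right(MAP_BP, bp) - 1   # last row whose bp is <= the lookup value
    i = min(i, len(MAP_BP) - 2)        # keep a right-hand neighbour at the final row
    (bp0, cm0), (bp1, cm1) = GENETIC_MAP[i], GENETIC_MAP[i + 1]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


def segment_cm(start_bp, end_bp):
    """Segment length in cM: ending position minus starting position."""
    return genomic_position(end_bp) - genomic_position(start_bp)


print(round(genomic_position(850_000), 6))
print(round(segment_cm(564_598, 1_100_217), 6))
```

The binary search finds the rows on either side of the lookup value without scanning the whole table, which is the same job the database index does.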

Alternatively, this can be approximated and simplified by interpolating values every 100,000 bp and setting them up in a simple array:

image

These are now interpolated and no longer exact. Here’s a comparison of the Table values versus the Array values:

image

You can barely see the differences between the two. So the array values should be good enough to get segment cM within 0.1 cM. 

The advantage of storing this in an array is that it simplifies programming, uses less memory and is faster to look up and calculate. With bp = 850,000, we know without lookup to use the [8] and [9] entries, and the interpolation becomes:

cM = 2.077101 + (2.405301 – 2.077101) * (850,000 – 800,000) / 100,000

equalling 2.241201 cM

which in this case happens to be exactly what the result was for the table method. That’s only because there are no table points between 800,000 and 900,000. If there were, the results would slightly differ.
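
The array version can be sketched the same way. This is a rough Python illustration under a few assumptions: the array is rebuilt from the four table rows quoted in this post, entries the excerpt does not cover are left as None, and the names are invented for the example.

```python
STEP = 100_000

# (bp, cM) rows quoted in this post.
TABLE = [
    (564_598, 1.478148),
    (785_050, 2.028035),
    (957_898, 2.595322),
    (1_100_217, 2.743511),
]


def table_cm(bp):
    """Interpolate directly in the table; None if bp is outside the excerpt."""
    for (bp0, cm0), (bp1, cm1) in zip(TABLE, TABLE[1:]):
        if bp0 <= bp <= bp1:
            return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)
    return None


# CM_ARRAY[k] holds the cM position at bp = k * STEP.
CM_ARRAY = [table_cm(k * STEP) for k in range(12)]


def array_cm(bp):
    """Interpolate between the two array entries that bracket bp."""
    k = bp // STEP                     # the index comes from arithmetic, not a search
    if k + 1 >= len(CM_ARRAY) or CM_ARRAY[k] is None or CM_ARRAY[k + 1] is None:
        raise ValueError("bp is outside the range covered by the array")
    lo, hi = CM_ARRAY[k], CM_ARRAY[k + 1]
    return lo + (hi - lo) * (bp - k * STEP) / STEP


print(round(array_cm(850_000), 6), round(table_cm(850_000), 6))
print(round(array_cm(950_000), 6), round(table_cm(950_000), 6))
```

The first line prints the same value twice. The second shows a small difference, because the table row at 957,898 falls between the array entries at 900,000 and 1,000,000.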

Okay. That’s what we need. How do we get the values?


First Attempt:  Optimization

The idea here is to do this:

For each chromosome, create an array with bp values from 0 to the length of the chromosome in steps of 100,000.  As a starting guess, assign cM values that increase by 0.1 cM for each 100,000 bp.

image

Now we take each of the matches in our segment match file and compare the actual with the cM value calculated in this table and we square the difference.

image

We sum the Diff Sq column. And our optimization goal is to minimize the total sum of squares by changing the cM values assigned to the 100,000 bp values.

In Excel, I used their Solver tool, setting the objective as the Min of the total sum of squares cell, by allowing the algorithm to change any of the cM cells except the 0 cell. What I got was this:

image

Excel only allows 200 variable cells.

If you try 200 at once, it takes forever. If you try about 20 at once, it can solve the problem in a few minutes but gives some cM values lower than the previous one which is not possible. So then you have to add constraints to prevent this from happening.

This isn’t the best sort of problem for Excel to solve. Better would be to use a statistical package like R or to custom program the optimization.
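
As a rough idea of what a custom-programmed version might look like, here is a small Python sketch. It is not what I did in Excel. It assumes one particular approach: instead of fitting the cumulative cM values directly, it fits the cM increase within each 100,000 bp bin and never lets an increase go below zero, so the resulting map can never decrease. The fitting uses simple coordinate descent, and all names are made up for the example.

```python
BIN = 100_000


def overlap_fractions(start, end, n_bins):
    """Fraction of each BIN-sized bin that the segment [start, end] covers."""
    row = []
    for j in range(n_bins):
        lo, hi = j * BIN, (j + 1) * BIN
        row.append(max(0, min(end, hi) - max(start, lo)) / BIN)
    return row


def fit_bin_cm(segments, n_bins, sweeps=2000):
    """Least-squares fit of the cM increase per bin, each kept >= 0.

    segments is a list of (start_bp, end_bp, cM) tuples.
    """
    rows = [overlap_fractions(s, e, n_bins) for s, e, _ in segments]
    d = [0.1] * n_bins                 # starting guess: 0.1 cM per 100,000 bp
    resid = [cm - sum(a * x for a, x in zip(row, d))
             for row, (_, _, cm) in zip(rows, segments)]
    for _ in range(sweeps):
        for j in range(n_bins):
            col = [row[j] for row in rows]
            norm = sum(a * a for a in col)
            if norm == 0:
                continue               # no segment touches this bin
            # Best value for this bin with the others held fixed, clamped at 0.
            new = max(0.0, d[j] + sum(a * r for a, r in zip(col, resid)) / norm)
            delta = new - d[j]
            if delta:
                resid = [r - a * delta for a, r in zip(col, resid)]
                d[j] = new
    return d


def cumulative_map(d):
    """Turn per-bin increases into genomic positions at 0, BIN, 2*BIN, ..."""
    positions = [0.0]
    for x in d:
        positions.append(positions[-1] + x)
    return positions
```

Bins that no segment touches simply keep the starting guess, so real data would need enough overlapping segments to pin down every bin.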


Second Attempt – Following the Segment Trail

So then I thought I’d try a different tack.

How about starting with the first match on the chromosome? For me at Family Tree DNA, on chromosome 1 that is a match from the base pair starting location 72,526. I have segments that match with 16 different people starting at that location, and they end at various locations from 3,493,819 to 4,932,655. Those segments range in length from 6.210586 cM to 10.2785 cM.

image

Base pair 3,493,819 therefore is at a genomic position 6.210586 cM higher than base pair 72,526.

If for each of those end locations, I find other segments that start at that location, then I can add those segment lengths to 6.210586 to get the genomic position of the ending base pair location.

Also, for all of those end locations, if I find other segments not starting at 72,526 that end at one of them, then I can subtract those segment lengths from 6.210586 to get the genomic position of the starting base pair location.

I can continue this with each base pair that is assigned a genomic position until it runs out.
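
The procedure can be sketched in a few lines of Python. This is only an illustration of the idea, not the code I used. The function name is made up, the first two segments in the example use the numbers quoted above, and the remaining three are hypothetical.

```python
from collections import defaultdict, deque


def chain_positions(segments, seed_bp):
    """Assign cM genomic positions by following segments out from seed_bp.

    segments is a list of (start_bp, end_bp, cM) tuples.
    Returns {bp: genomic position}. Positions never reached are left out.
    """
    by_start = defaultdict(list)
    by_end = defaultdict(list)
    for start, end, cm in segments:
        by_start[start].append((end, cm))
        by_end[end].append((start, cm))

    position = {seed_bp: 0.0}
    queue = deque([seed_bp])
    while queue:
        bp = queue.popleft()
        here = position[bp]
        # Segments starting here: add their length to get the end position.
        for end, cm in by_start.get(bp, ()):
            if end not in position:
                position[end] = here + cm
                queue.append(end)
        # Segments ending here: subtract their length to get the start position.
        for start, cm in by_end.get(bp, ()):
            if start not in position:
                position[start] = here - cm
                queue.append(start)
    return position


example = [
    (72_526, 3_493_819, 6.210586),     # quoted above
    (72_526, 4_932_655, 10.2785),      # quoted above
    (3_493_819, 6_000_000, 4.0),       # hypothetical
    (2_000_000, 4_932_655, 5.0),       # hypothetical
    (9_000_000, 9_500_000, 1.0),       # hypothetical, not connected to the rest
]
positions = chain_positions(example, 72_526)
print(sorted(positions.items()))
```

The last segment shares no start or end location with the others, so its two base pairs never get a genomic position.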

I tried this for Family Tree DNA data. I took 10 segment match files and combined them together. I extracted Chromosome 1 and sorted by start location and end location. I eliminated duplicates because for the same start and end base pairs, the cM was always the same. That gave me 65,627 segments that covered positions 72,526 to 249,222,527. 

Those 65,627 segments had 131,254 start and end positions. There were 38,225 unique positions, so each unique position was used on average in 3.4 segments.
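
The combining step might look something like this in Python. It is a sketch under assumptions: each file is taken to be already parsed into (chromosome, start bp, end bp, cM) tuples, the function name is invented, and the sample rows are hypothetical apart from the two segments quoted earlier.

```python
def combine_segments(files, chromosome):
    """Merge parsed segment match files for one chromosome.

    files is an iterable of lists of (chromosome, start_bp, end_bp, cM) tuples.
    Returns the de-duplicated, sorted segments and the count of unique positions.
    """
    unique = {}
    for rows in files:
        for chrom, start, end, cm in rows:
            if chrom == chromosome:
                # The same start and end always carry the same cM, so keep one copy.
                unique[(start, end)] = cm
    segments = sorted((s, e, cm) for (s, e), cm in unique.items())
    endpoints = {bp for s, e, _ in segments for bp in (s, e)}
    return segments, len(endpoints)


file_a = [
    ("1", 72_526, 3_493_819, 6.210586),
    ("1", 72_526, 4_932_655, 10.2785),
    ("2", 100_000, 200_000, 1.0),      # hypothetical, wrong chromosome
]
file_b = [
    ("1", 72_526, 3_493_819, 6.210586),  # duplicate of a row in file_a
    ("1", 3_493_819, 6_000_000, 4.0),    # hypothetical
]
segments, n_positions = combine_segments([file_a, file_b], "1")
print(len(segments), n_positions, 2 * len(segments) / n_positions)
```

The last number printed is the average number of segments using each unique position, the same ratio as the 3.4 above.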

I assigned base pair location 72,526 the genomic position 0.  With the 10 files I had 21 unique segments starting at that position, compared to the 16 just for me that I show above which had 14 unique.  I assigned the 21 genomic positions to the end points.

From those 21 end points, the file had 22 segments that started at one of them and 12 segments that ended at one of them that I hadn’t encountered already.

I assigned the new genomic positions to the other ends of those segments, and now I had 92 new starting segments and 177 new ending segments to process.

It took 20 iterations of this procedure until I ran out of segments to process.  By then, I had assigned genomic positions to 35,082 or 92% of the unique positions. Here is what the first few final assignments looked like:

image

The –999 values are those that were not assigned. If we remove those, we then have a very accurate table that can be used for determining cM length from a start base pair and end base pair for Family Tree DNA data.

Compare this to the first table in the “What is the Goal” section above. That table was not accurate enough for Family Tree DNA and you can see that the genomic position at base pair 957,898 was 2.6 cM when for Family Tree DNA, it should have been closer to 0.8 cM.

Unfortunately, I couldn’t get this method to work at 23andMe because I had a lot fewer segments to work with due to their 1,500 person limit, and I only had 5 files from other people to combine mine with. For chromosome 1, I only had 1,801 segments to work with and could only chain 76 of them together. More data would be needed for this technique to work at 23andMe.

At MyHeritage and GEDmatch, the problem is that the cM values for the segments are only given to 1 decimal place. That means each value only has an accuracy of +/- 0.05 cM.  And the successive adding and subtracting of these for 20 iterations will accumulate the error: in the worst case, a position reached through 20 chained segments could be off by as much as 20 × 0.05 = 1 cM.


Conclusion

Well, that was fun, but solving this problem is not my main goal in life. I think for now I’ll just leave it here so that someone else who gets the urge will have some ideas to try.