
Louis Kessler’s Behold Blog

Finally, Interesting Possibilities to Sync Your Data - Fri, 17 May 2019

Although I don’t use Family Tree Maker (FTM) myself, I am very interested in its capabilities and syncing abilities. FTM and RootsMagic are the only two programs that Ancestry has allowed to use the API that gives access to the Ancestry.com online family trees. Therefore they are the only two programs that can directly download data from, upload data to, and sync between your family tree files on your computer and your trees up at Ancestry.


RootsMagic

RootsMagic currently has its TreeShare function to share data between what you have in RootsMagic on your computer and what you have on Ancestry. It will compare the two for you and show you what’s different, but it will not sync them for you. You’ll have to do that manually in RootsMagic, one person at a time, using the differences.

image

That is likely because RootsMagic doesn’t know which data is the data you’ve most recently updated and wants you to verify any changes either way. That is a good idea, but if you are only making changes on RootsMagic, you’ll want everything uploaded and synced to Ancestry. If you are only making changes on Ancestry, you’ll want everything downloaded and synced to RootsMagic.

With regards to FamilySearch, RootsMagic does a very similar thing. So basically, you can match your RootsMagic records to FamilySearch and sync them one at a time, and then do the same with Ancestry. But you can’t do it all at once, or sync Ancestry and FamilySearch with each other.

With regards to MyHeritage, RootsMagic only incorporates their hints, and not their actual tree data.


Family Tree Maker

Family Tree Maker takes the sync with Ancestry a bit further than RootsMagic, offering full sync capabilities up and down.

image

For FamilySearch, FTM up to now only incorporates their hints and allows merging of FamilySearch data into your FTM data, again one person at a time. But Family Tree Maker has just announced their latest upgrade, and it includes some new FamilySearch functionality.

The upcoming feature that looks very interesting, and that I’ll want to try, is their “download a branch from the FamilySearch Family Tree”. This seems to be the ability to bring in new people, many at a time, from FamilySearch into your tree.


Family Tree Builder

MyHeritage’s free Family Tree Builder download already has full syncing with MyHeritage’s online family trees.

image

They do not have any integration with their own Geni one-world tree, which is too bad.

But in March, MyHeritage announced a new FamilySearch Tree Sync (beta), which allows FamilySearch users to synchronize their family trees with MyHeritage. Unfortunately, I was not allowed to join the beta and test it out, as currently only members of the Church of Jesus Christ of Latter-day Saints are allowed. Hopefully they’ll remove that restriction in the future, or at least once the beta is completed.


Slowly … Too Slowly

So you can see that progress is being made. We have three different software programs and three different online sites that are slowly adding some syncing capabilities. Unfortunately, they are not doing it the same way, and working with your data on the six offline and online platforms is different under each system.

The very promising AncestorSync program was one of the entrants in the RootsTech 2012 Developer Challenge, along with Behold. I thought AncestorSync should have won the competition. Dovy Paukstys, the mastermind behind the program, had great ideas for it. It was going to be the program that would sync all your data with whatever desktop program you used and all your online data at Ancestry, FamilySearch, MyHeritage, Geni and wherever else. And it would do it with very simple functionality. Wow.

This was the AncestorSync website front page in 2013 retrieved from archive.org.
image

They had made quite a bit of progress. Here is what they were supporting by 2013 (checkmarks) and what they were planning to implement (triangles):

image

Be sure to read Tamura Jones’ article from 2012 about AncestorSync Connect, which detailed a lot of the things that AncestorSync was trying to do.

Then read Tamura’s 2017 article that tells what happened to AncestorSync and describes the short-lived attempt of Heirlooms Origins to create what they called the Universal Genealogy Transfer Tool.


So What’s Needed?

I know what I want to see. I want my genealogy software on my computer to be able to download the information from the online sites or other programs, show the information side by side, and allow me to select what I want in my data and what information from the other trees I want to ignore. Then it should be able to upload my data the way I want it back to the online sites, overwriting the data there with my (understood to be) correct data. Then I can periodically re-download the online data to get new information that was added online, remembering the data from online that I wanted to ignore, and do this “select what I want” step again.

I would think it might look something like this:

image

where the items from each source (Ancestry, MyHeritage, FamilySearch and other trees or GEDCOMs that you load in) would be a different color until you accept them into your tree or mark them to ignore in the future.

By having all your data from all the various trees together, you’ll easily be able to see what is the same, what conflicts, and what new sources are brought in to look at, and you can make decisions based on all the sources you have as to what is correct and what is not.

Hmm. That above example looks remarkably similar to Behold’s report.

I think we’ll get there. Not right away, but eventually the genealogical world will realize how fragmented our data has become, and will ultimately decide that they need to see all their data from all sites together.

Determining VCF Accuracy - Mon, 13 May 2019

In my last post, I was able to create a raw data file from the Whole Genome Sequencing (WGS) BAM file using the WGS Extract program. It seemed to work quite well.

But in the post before that, WGS – The Raw VCF file and the gVCF file, I was trying to see if I could create a raw data file from the Variant Call Format (VCF) file. I ended that post with a procedure that I thought could generate a raw data file, which was:

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need to set this up initially. Wilhelm HO has a good set of them included with his DNA Kit Studio.
  3. The positions of variants in your gVCF file should be marked as no-calls. Many of these variants are false, but we don’t want them to break a match.
  4. The positions of variants in your filtered VCF should be marked as having that variant. This will overwrite most of the no-calls marked in step 3 with reliable, filtered values.

When I wrote that, I had thought that the gVCF file contained more variants than the Raw VCF file. During my analysis since then, I found out that is not true. The Raw VCF contains all the unfiltered variants: everything that might be considered a variant is in the Raw VCF file. The gVCF includes the same SNP variants that are in the Raw VCF file, but also includes all the insertions/deletions as well as about 10% of the non-variant positions. It is the non-variant positions that make the gVCF such a large file.

So right away, the Raw VCF file can be used instead of the gVCF file in step 3 of the above proposed procedure and will give the same results. That is a good thing, since the Raw VCF file is much smaller than the gVCF file, so it will be faster to process. Also, the Raw VCF file and the filtered VCF file include the same fields; my gVCF included different fields and would need to be processed differently from the other two.

(An aside: I also found out that the gVCF supplied to me by Dante did not have enough information in it to determine what the variant is. It gives the REF and ALT field values, but does not include the AC field. The AC field gives the count of the ALT allele, either 1 or 2.

  • If REF=A, ALT=C, AC=1, then the variant value is AC.
  • If REF=A, ALT=C, AC=2, then the variant value is CC.
  • If REF=A, ALT=C, and AC is not given, then the variant value could be either AC or CC.

For me to make any use of my gVCF file, not just for this purpose but for any purpose, I would have to go back and ask Dante to recreate it for me and include the AC field in the variant records. End aside.)
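
To make the AC logic concrete, here is a small sketch of my own in Python (the function name is mine, not from any tool):

  def genotype_from_ac(ref, alt, ac=None):
      """Work out the diploid genotype of a biallelic SNP from a VCF record's
      REF and ALT alleles plus the AC (alternate allele count) value."""
      if ac == 1:
          return ref + alt    # heterozygous: one reference allele, one alternate
      if ac == 2:
          return alt + alt    # homozygous alternate: two copies of the alternate
      return None             # AC missing: could be either REF+ALT or ALT+ALT

  # The three cases from the bullets above:
  print(genotype_from_ac("A", "C", 1))   # AC
  print(genotype_from_ac("A", "C", 2))   # CC
  print(genotype_from_ac("A", "C"))      # None (ambiguous)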


Estimating Type 1 and Type 2 Errors

We now need to see if the above procedure using the Raw VCF file in step 3 and the filtered VCF file in step 4 will be accurate enough to use.

We are dealing with two types of errors.

Type 1: False Positive: The SNP is not a variant, but the VCF file specifies that it is a variant.

Type 2: False Negative:  The SNP is a variant, but the VCF file specifies that it is not a variant.

Both are errors that we want to minimize, since either error will give us an incorrect value.

To determine the Type 1 and Type 2 error rates, I used the 959,368 SNPs that the WGS Extract program produced for me from my BAM file. That program uses a well-developed and respected genomics library of analysis functions called samtools, so the values it extracted from my WGS via my BAM file are as good as they can get. It is essential that I have values that are as correct as possible for this analysis, so I removed 2,305 values that might be wrong because some of my chip test results disagreed with them. I also removed 477 values that WGS Extract included but that were at insertion or deletion positions.

From the remaining values, I could only use positions where I could determine the reference value. This included 458,894 variant positions, which always state the reference value, as well as the 10% or so of non-variant reference values that I could determine from my gVCF file. That amounted to 42,552 non-variants.

Assuming these variant and non-variant positions all have accurate values from the WGS extract, we can now compute the two types of errors for my filtered VCF file and for my Raw VCF file.

image
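
For anyone who wants to see how such a tally can be computed, here is a rough sketch of mine in Python. It assumes the trusted genotypes (e.g. from WGS Extract), the reference alleles, and the set of positions each VCF calls as variants have already been loaded; all the names are my own:

  def vcf_error_rates(trusted_genotypes, reference_alleles, vcf_variant_positions):
      """trusted_genotypes: {position: genotype} of believed-correct calls
      reference_alleles: {position: reference allele} for those same positions
      vcf_variant_positions: set of positions the VCF reports as variants
      Returns (type1_rate, type2_rate)."""
      type1 = type2 = variants = non_variants = 0
      for pos, genotype in trusted_genotypes.items():
          is_variant = genotype != reference_alleles[pos] * 2   # any non-reference allele present
          in_vcf = pos in vcf_variant_positions
          if is_variant:
              variants += 1
              if not in_vcf:
                  type2 += 1   # Type 2 (false negative): a true variant the VCF leaves out
          else:
              non_variants += 1
              if in_vcf:
                  type1 += 1   # Type 1 (false positive): the VCF claims a variant that isn't one
      return type1 / max(non_variants, 1), type2 / max(variants, 1)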

In creating a VCF, the filtering is designed to eliminate as many Type 1 errors as possible, so that the variants you are given are almost surely true variants. The Raw VCF only had 0.13% Type 1 errors, and the filtering reduced this to a very small 0.08%.

Type 1 and Type 2 errors work against each other. Doing anything to decrease the number of Type 1 errors will increase the number of Type 2 errors, and vice versa.

The Raw VCF file turns out to have only 0.06% Type 2 errors, quite an acceptable percentage. But the filtering increases this to a whopping 0.76%.

This value of 0.76% represents the number of true variants that are left out of the filtered VCF file. This is what causes the problem with using the filtered VCF file to produce a raw data file. When the SNPs that are not in the filtered VCF file are replaced by reference values, they will be wrong. These extra errors are enough to cause some matching segments to no longer match, and a comparison of a person’s raw DNA with his raw DNA generated from a filtered VCF file will not match well enough.

If instead the Raw VCF file is used, the Type 2 errors are considerably reduced, and the Type 1 errors are only slightly increased, staying well under worrisome levels.

Since there are approximately the same number of variants as non-variants among our SNPs, the two error rates can be averaged to give you an idea of the percentage of SNPs expected to have an erroneous value. Using the Raw VCF instead of the filtered VCF reduces the overall error rate from 0.42% to 0.09%, a 79% reduction in errors.

This could be reduced a tiny bit more. If the Raw VCF non-variants are all marked as no-calls, and then the Filtered VCF non-variants are replaced by the reference values, then 20 of the 55 Type 1 Errors in my example above, instead of being wrong, will be marked as no-calls. No-calls are not really correct, but they aren’t wrong either. For the sake of reducing the average error rate from 0.09% to 0.07%, it’s likely not worth the extra effort of processing both VCF files.


Conclusion

Taking all of the above into account, my final suggested procedure to create a raw data file from a VCF file uses only the Raw VCF file and not the filtered VCF file, as follows:

  1. Make a list of all the SNPs you want raw data for.
  2. Initially assign them all the human genome reference values. Note: none of the VCF files give all of these to you, so you need to set this up initially. Wilhelm HO has a good set of them included with his DNA Kit Studio.
  3. Mark the positions of the variants in your Raw VCF with the value of that variant. These will overwrite the reference values assigned in step 2.

Voila! So from a Raw VCF file, use this procedure. Do not use a filtered VCF file.
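
For anyone who wants to script those three steps, here is a minimal sketch in Python. The file names, the layout of the reference-value list, and the assumption that the Raw VCF carries an AC value in its INFO column (as mine does) are all my own, so adjust to your files:

  # Sketch of the 3-step procedure, under assumed (hypothetical) file layouts:
  #   snp_reference.txt - tab-separated: rsid, chromosome, position, reference genotype
  #   raw.vcf           - the unfiltered ("Raw") VCF from the WGS provider

  # Steps 1 and 2: the list of wanted SNPs, each initialized to its reference genotype
  raw_data = {}                             # (chrom, pos) -> [rsid, genotype]
  with open("snp_reference.txt") as f:
      for line in f:
          rsid, chrom, pos, ref_gt = line.rstrip("\n").split("\t")
          raw_data[(chrom, pos)] = [rsid, ref_gt]

  # Step 3: overwrite with the variant genotypes found in the Raw VCF
  with open("raw.vcf") as f:
      for line in f:
          if line.startswith("#"):
              continue                      # skip VCF header lines
          fields = line.rstrip("\n").split("\t")
          chrom, pos, ref, alt, info = fields[0], fields[1], fields[3], fields[4], fields[7]
          key = (chrom, pos)
          if key not in raw_data or len(ref) != 1 or len(alt) != 1:
              continue                      # keep to wanted SNPs; skip indels and multi-allelic records
          tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
          if tags.get("AC") == "1":
              raw_data[key][1] = ref + alt  # heterozygous
          elif tags.get("AC") == "2":
              raw_data[key][1] = alt + alt  # homozygous alternate
          else:
              raw_data[key][1] = "--"       # ambiguous record: mark it as a no-call

  # Write the result in a 23andMe-style layout: rsid, chromosome, position, genotype
  with open("raw_data_from_vcf.txt", "w") as out:
      for (chrom, pos), (rsid, genotype) in raw_data.items():
          out.write(f"{rsid}\t{chrom}\t{pos}\t{genotype}\n")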

If you have a BAM file, use WGS Extract from yesterday’s post.




Update: May 14: Ann Turner pointed out to me (in relation to my “Aside” above) that in addition to the AC (allele count) field, the GT (genotype) field could supply the information to correctly identify what the variant is. Unfortunately, the gVCF file Dante supplied me with has missing values for that field.
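
For reference, when the GT field is populated it appears in the sample column as something like 0/1 (heterozygous) or 1/1 (homozygous alternate), and for a biallelic SNP it can be interpreted along these lines (a small sketch of mine):

  def genotype_from_gt(ref, alt, gt):
      """Interpret a VCF GT value such as '0/1' or '1|1' for a biallelic SNP.
      Returns a two-letter genotype, or None when the value is missing ('./.')."""
      alleles = gt.replace("|", "/").split("/")
      if "." in alleles:
          return None                      # missing genotype, as in my Dante gVCF
      lookup = {"0": ref, "1": alt}
      return "".join(lookup[a] for a in alleles)

  print(genotype_from_gt("A", "C", "0/1"))   # AC
  print(genotype_from_gt("A", "C", "1/1"))   # CC
  print(genotype_from_gt("A", "C", "./."))   # None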

I’ve looked at all the other fields in my gVCF file. Entries that leave out the BaseQRankSum and ClippingRankSum fields often indicate a homozygous variant, but I’ve found several thousand SNPs among the variants that are exceptions, which is too many to use this as a "rule".

Wilhelm HO is working on implementing the sort of procedure I suggest in his DNA Kit Studio. It likely will be included when he releases Version 2.4. His tool will then be able to produce a raw data file from a VCF file, and will also extract an mtDNA file for you that you can upload to James Lick’s site for mtDNA haplogroup analysis.

Creating a Raw Data File from a WGS BAM file - Sun, 12 May 2019

In my last post, I was wondering if I could create a raw data file from my Whole Genome Sequencing (WGS) results that could be uploaded to GEDmatch or to a DNA testing company. I was trying to use one of the Variant Call Format (VCF) files. Those only include the positions where you vary from the human reference, so logically you would think that all the locations not listed must have human reference values. But that was giving less than adequate results.

While I was exploring that, a beta of a program called WGS Extract was announced. It works in Windows, and you can get it here.

image

This is not a program for the fainthearted. The download is over 2 GB because it includes the reference genome in hg19 (Build 37) and hg38 (Build 38) formats. It also includes a Windows version of samtools, which it runs in the background, as well as the full Python language.

I was so overwhelmed by everything it brought that I had to ask the author how to run the program. I was embarrassed to find out that all I had to do was run the “start.bat” file in the main directory of the download, which opens a command window that automatically starts the program for you, bringing up the screen I show above.

WGS Extract has a few interesting functions, but let me talk here about the one labeled “Autosomes and X chromosome” with the button “Generate file in 23andmeV3 format”. I selected my BAM (Binary Sequence Alignment Map) file, a 110 GB file I received by mail on a 500 GB hard drive (with some other files) from Dante. I pressed the Generate file button, and presto, 1 hour and 4 minutes later, a raw data file in 23andMe v3 format was generated, as well as a zipped (compressed) version of the same file.

This was perfect for me. I had already tested at 5 companies, and had downloads of FTDNA, MyHeritage, Ancestry, Living DNA and 23andMe v5 raw data files. I had previously combined these 5 files into what I call my All 5 file.

The file WGS Extract produced had 959,368 SNPs in it. That’s a higher number of SNPs than most chips produce, and since it was based on the 23andMe v3 chip, I knew there should be quite a few SNPs in it that hadn’t been tested by my other 5 companies.

You know me. I did some analysis:

image

The overlap (i.e. SNPs in common) varied from a high of 693,729 with my MyHeritage test, to a low of 183,165 with Living DNA. These are excellent overlap numbers – a bit of everything.

Each test had a number of no-calls, so I compared all the other values with what WGS Extract gave me, and there was 98.1% agreement. That’s about a 2% disagreement that lies either in the chip tests or in the WGS test, but from this I cannot tell whether it’s the chips or the WGS that have the incorrect values. In each case, though, one of them does.

When I compare this file to my All 5 file, which has 1,389,750 SNPs in it, I see that there are an extra 211,747 SNPs in my WGS file. That means I’ll be able to create a new file, an All 6 file, that will have 1,601,497 SNPs in it.

More SNPs don’t mean more matches. In fact they usually mean fewer matches, but better matches. The matches that are more likely to be false are the ones that get excluded.

In addition to including the new SNPs, I also wanted to compare the 747,621 SNPs that the WGS file has in common with my All 5 file and update them where needed. As noted in the above table, I had 2,305 SNPs whose values disagreed, so I changed them to no-calls. A no-call is the same as an unknown value, and for matching purposes is always considered to be a match. Having more no-calls will make you more “matchy”, and like having less overlap, you’ll have more false matches. The new SNPs added included another 905 no-calls. But then, of the 20,329 no-calls I had in my All 5 file, the WGS test had values for 9,993 of them. (I sketch these merging rules in code a little further below.)

So my number of no-calls became:

20,329 + 2,305 + 905 - 9,993 = 13,546, a reduction of 6,783.

I started with 20,329 no-calls in 1,389,750 SNPs (1.5%),
and reduced that to 13,546 no-calls in 1,601,497 SNPs (0.8%).
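
Here is the sketch of those merging rules I mentioned above, in Python. It treats each kit as a dictionary keyed by (chromosome, position), uses “--” for a no-call the way 23andMe-style files do, and ignores strand flips and allele ordering; the function name is mine:

  def merge_kits(base, new):
      """Merge two raw-data dicts of (chromosome, position) -> genotype.
      Agreement keeps the value, a real value fills in a no-call or a new position,
      and a disagreement between the two kits becomes a no-call ("--")."""
      merged = dict(base)
      for key, genotype in new.items():
          existing = merged.get(key)
          if existing is None or existing == "--":
              merged[key] = genotype        # new SNP, or fills in a previous no-call
          elif genotype != "--" and genotype != existing:
              merged[key] = "--"            # conflicting values: mark as a no-call
          # otherwise the kits agree, or the new value is a no-call; keep what we have
      return merged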

A few days ago, I was wondering how much work it would take to get raw data for the SNPs needed for genealogical purposes out of my WGS test. A few days later, with this great program, it turns out to be no work at all. (It probably was a lot of work for the author, though.)

I have uploaded both the 23andMe v3 file and my new All 6 file to GEDmatch to see how each does at matching. I’ve marked both as research. But I expect that once the matching process is completed, I’ll make my All 6 file my main file and relegate my All 5 file back to research mode.

Here are the stats at GEDmatch for those who know what these are:

WGS Extract SNPs: original 959,368; usable 888,234; slimmed 617,355
All 5 SNPs: original 1,389,750; usable 1,128,146; slimmed 813,196
All 6 SNPs: original 1,601,497; usable 1,332,260; slimmed 951,871