Login to participate
  
Register   Lost ID/password?

Louis Kessler’s Behold Blog

My Whole Genome Sequencing. The VCF File - Wed, 6 Feb 2019

I received my results from my Dante Labs Whole Genome test last week. I purchased the test last August when I was able to get it for $399 USD. There were two health reports that I requested that are written in ancient Latin as far as my understanding of them goes. Then there were the VCF files which I was more interested in. The FASTQ and BAM files will be sent to me on a hard drive in a few weeks.

A Variant Call Format (VCF) file basically contains the differences between me and the “standard reference human”. There were two VCF files included in my results. One with my individual SNPs which was 143 MB and the other with Insertions and Deletions which was 43 MB. The individual SNP file is of most interest, because it is that file that contains the autosomal SNP data that DNA testing companies use for genealogical matching.

These files are in gz compressed format. When expanded (to 869 MB and 224 MB) they are standard text files and a bit of the individual SNP file looks like this:

image

My VCF file has a header section of 141 lines. The first line of the file (not shown above) indicates that this file’s format is Version 4.2 of VCF. Another important line in the header is line 139 above, which specifies the reference genome to be ucsc,bg19.fasta.  The ucsc is for the University of California Santa Cruz Genomics Institute who maintain and make available genome information at genome.ucsc.edu. The bg19 refers to the hg19 assembly of the human genome, which is also call Build 37, and is the version of the genome currently used by most of the DNA testing companies. And fasta is a format that lists all the reference values of the genome.

The header in my VCF file followed by 3,442,712 lines that represent each SNP where I am different from the reference value. “SNP” is an abbreviation for Single-nucleotide polymorphism. The “polymorphism” refers to something that can have more than one form, so when you hear SNP, think of a position on the genome where humans can differ from each other.

Each line contains:

  • #CHROM, the chromosome number of the SNP.  My file includes data for Chromosomes 1 to 22, X and Y.
  • POS, the position of the SNP on the chromosome
  • ID, the RSID of the SNP, i.e. a name it is given to reference it.  In my VCF file from Dante, no RSIDs are given and the ID is shown as a period on every line. That’s not a problem, since most DNA match is done by position, not RSIDs which can change positions between Builds.
  • REF, the value of that position on that chromosome in the reference genome and is one of A, C, G and T. This is usually the SNP value that most people have, e.g. if REF = A, then the pair AA with be the reference value for that SNP, i.e. A from their father and A from their mother.
  • ALT, the alternative values that I have. Usually it is one value, one of A, C, G and T and is different from the REF value. Occasionally it is two values, both different from the REF value, e.g. REF = A, ALT = C,T
  • QUAL, is a number estimating the quality of the read that was done in my test for that SNP. A higher number is better quality. 
  • FILTER, is an evaluation as to whether that SNPs value is reliable. My file only included SNPs with a filter value of PASS.
  • INFO and FORMAT, contains detailed information about the read at that SNP. The most important field is the AC field. If AC=2, then the ALT value will be both values of the pair. Otherwise the REF value will be the leftover value. e.g:
  • REF=A, ALT=C, AC=2, then SNP=CC
  • REF=A, ALT=C, AC=1, then SNP=AC
  • REF=A, ALC=C,T, AC=1, then SNP=CT

So from this file, using the REF, ALT and AC values on each line, I can compute the SNP value for the position given on the chromosome.

These are the counts of each computed SNP value for my file:

image

Remember that the above counts of homozygous readings (where both alleles are the same: AA, CC, GG or TT) do not include any SNPs which have the same reference value. If they are the same as the reference value, then they are not included in the VCF file.

Also note that since I’m a male, one allele should be shown for the X and Y chromosomes. I should not have any heterozygous (alleles are different) readings there. These might either be errors in the reads, or maybe they are reading the pseudo-autosomal regions on the X and Y where crossover might occur. I’m not sure why the number of my homozygous variants for Y are so low. But for genealogical matching purposes, I’m more interested in 1 to 22 and X.

The 1000 Genomes Project Consortium in 2015 found over 84.7 million SNPs among 2,504 individuals from 26 populations. They also found that “a typical genome differs from the reference human genome at 4.1 million to 5.0 million “sites” out of the 3.3 billion base pairs, so that’s only 0.14%. That means that 99.86% of our genomes are identical. These “sites” will include my 3,442,712 SNPs in the table above, as well as the 867,091 inserts and deletions from my other VCF file. So my total is 4,309,803 sites, which is in the correct range.

   

Comparing VCF values to my Raw Data

I’ve tested my DNA with 5 companies that have provided me with raw DNA results. The companies tested and gave me the results for from 618,640 SNPs (Living DNA) to 720,816 SNPs (MyHeritage DNA). There was overlap in what SNPs the companies tested. When I took the results of all 5 tests and combined them into one raw data file, I ended up with 1,389,750 unique SNPs.

A whole genome test is a test of all your DNA. My Dante WGS results provide me with values for all positions on all my chromosomes. These will come in 2 huge files I will receive soon on a hard drive.

The VCF files that I’m talking about in this article tell me what differs from the reference, so it is logical to assume that all values that are not in the VCF file are the same as the reference. Through deduction, you would think that I could state with certainty that the positions not specified in the VCF file would have the reference value. But that won’t always be true because the VCF contains only the SNPs that have “PASS” as the Filter value. We don’t know what the values are for those that are not marked as PASS from just the VCF. In fact, I don’t even know how many are not marked PASS, whether it is a lot or a few. Since this is a 30x (30 times coverage) WGS test, I would assume that the vast majority of the positions have been read correctly. Once I get the FASTA and BAM files, I’ll see if I can look at this in more detail.

My VCF file contains 471,923 SNPs that are in my combined raw data. So 34.0% of my combined raw data are specified in the VCF file. The other 3,837,880 SNPs in the VCF file are SNPs that none of the 5 DNA testing companies had tested. We’ll ignore those for now.

Here’s a summary of the 471,923 SNPs in common between my VCF file and my combined raw data file:

image

Of these, 98.0% were the same as they were in my combined raw data file.

The “New” column represent 6,321 SNPs that were no-calls in my combined raw data file, so my VCF allows me to define those.

The “Verify” column represents 228 SNPs that had disagreements between two or more of the raw data files, so I had set them to a no-call. The VCF could prove to be a tie-breaker in this case, but I’ll just continue to call these no-calls just to be safe.

The “Diff” column represent 2,798 SNPs that had a value in my combined raw data file, but the VCF value disagrees with it.

I could use this information to improve my raw data. I could assign values to the 6,321 no-calls, but I should then also turn 2,798 assigned values into no-calls. That would still reduce my overall number of no-calls down by 3,523, from 20,688 (1.5%) to 17,165 (1.2%).


How Can Genetic Genealogists Use a VCF file

Two ways:

1. Upload the VCF file to a DNA matching service that accepts it.

2. Use it to create a raw data file which you can then upload to a DNA matching service that accepts it.


Uploading a VCF file to GEDmatch Genesis

One would hope that if they did a whole genome test, they would be able to upload their whole genome data to one of the companies that do DNA matching.

The only company that currently takes VCF uploads is GEDmatch Genesis. I was patient and waited the 5 minutes until the browser responded after I hit the Upload button. Then it didn’t take very long to for GEDmatch to load the file and it provided this processing:

image

I made that kit “Research” and waited a day until GEDmatch completed the matching for the kit. Once the results came back, I found a problem.

The GEDmatch File Diagnostic Utility run on my combined raw data which I had previously uploaded gives this:

  image

When I run the diagnostics on my VCF file from Dante, I get this:

image

As correctly reported by the diagnostics, the All 5 file has 1,389,750 SNPs in it, and the WGS file has 3,442,712 SNPs in it.

The diagnostic then reports that my All 5 files has 1,128,146 usable SNPs which are then slimmed to 813,196 SNPs. The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most “bang for the buck”.

But my VCF file only had 590,334 useable SNPs which get slimmed to only 231,588 SNPs. That is way less than my All 5 file has. A WGS tests the whole genome, so it should give more SNPs than any other test or even combined tests give. So something was wrong.

Also, when I did a One to Many of my WGS kit, it matched most closely to my All 5 kit, which it should. But then it was closely followed by a whole bunch of kits of other people who are matching me close to identically. All those kits appear to be other whole genome tests.

It then became obvious to me that GEDmatch Genesis is only using the variant SNPs from the VCF file.  The reason why I get complete matches with other WGS kits is that if two people both have a variant at a position, then there is an extremely high probability that your variant is the same. And all GEDmatch is comparing between WGS files are variants.

The procedure that GEDmatch or anyone else who wants to load a VCF file needs to do is this:

  1. If a line in the VCF file has one REF value and one ALT value, then
    • If the INFO field contains:  “AC=1”, then you take the two of them.  e.g.  REF=T, ALT=C, then value is TC (or CT if you sort alphabetically)
    • If the INFO field contains:  “AC=2”, then you use the ALT value twice.  e.g.  REF=T, ALT=C, then value is CC.
  2. If a line in the VCF file has one REF and two ALT values, then you take both the ALT values.  e.g.  REF=T, ALT=C,G, then value is CG.  There are only a few hundred of these in my VCF file.
  3. If a SNP that they use is not in the VCF file, then use the reference. e.g. REF=C, to give the value CC.  They’ll need to have a reference table with the Build 37 genome reference values for all the SNPs that they use. This table would be the same for everyone.

I reported this to GEDmatch and John Olson replied back and confirmed that they are not adding the reference values. He said the VCF upload will have to wait until they get caught up on their Genesis conversion issues.


Using DNA Kit Studio to Create a Raw Data File from a VCF

Wilhelm H. created a wonderful little program called DNA Kit Studio that includes a VCF to RAW converter in it.

image

It originally did not accept my VCF from Dante. I contacted Wilhelm and the reason was that Dante did not include RSID values. Wilhelm made the change and sent me a beta of the program for me to try. It now created the raw data file, and correctly did steps 1a, 1b, and 2, above.  But he, like GEDmatch, also was not including the reference genome value for the other positions.

I gave Wilhelm links to a couple of open source sites that have most of the reference values for the 23andMe and Ancestry SNPs that the companies test for. And likely when I get the rest of my whole genome data (the Fasta and BAM files), I’ll figure out how to determine all the reference values myself.

If you can’t wait for Wilhelm to finish his update to his VCF to RAW converter, or if you don’t want to do the task yourself, you could use Wilhelm’s service and he’ll convert it for you for a small fee.


Conclusion:  Is a WGS Test Useful for Matching?

For the purposes of matching, it really only takes a raw data file from any of the major DNA testing companies to get you going. GEDmatch and some of the testing companies will accept uploads and you can get into most databases with just the one test.

You will get slightly more accurate matches at GEDmatch Genesis if you take a test from two companies, one using the old chip (AncestryDNA, Family Tree DNA or MyHeritage DNA) and one using the new chip (23andMe or Living DNA) and then use a tool like DNA Kit Studio to combine them before uploading.

But currently, I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes.

Also see:

Comparing Genetic Clusters - Tue, 22 Jan 2019

In my last post, I described how I obtained my results from three Genetic Clustering tools: 

Pretty pictures are nice to look at. But what we really want is to be able use the results. The goal here is to see if these cluster algorithms actually do segment your family into groups that are related through an ancestor you can identify. And do these tools all identify the same cluster, or different clusters? Do they contradict each other, or can you use them all together to get even better information?

I took the three sets of results and combined them into one spreadsheet. These were all from AncestryDNA.  All three included matches down to 40 cM.

Currently, I have 242 matches of at least 40 cM at AncestryDNA.

Genetic Affairs included 226 matches and it put them into 34 clusters. They excluded my number one match of 411 cM because it exceeded their default upper limit of 400. The other 15 were excluded because a cluster could not be determined for them.

Collins’ Leeds included 223 matches and it put them into 37 clusters. It also excluded my number one match, as well as 18 others that it could not determine clusters for.

Shared Clustering included all 242 matches and it put them into 7 clusters.

Of my 242 matches, I know exactly how I’m related to just 8 of them.  I can divide them up into my 4 grandparents as follows:

  • 3 are Braunstein, my paternal grandfather’s side.
  • 2 are Focsaner, my paternal grandmother’s side.
  • 0 are Girman, my maternal grandfather’s side.
  • 3 are Goretsky, my maternal grandmother’s side.

So lets take these 8 people and see where each clustering technique puts them. Listed in the table is the cluster number assigned to each relative.

image

Genetic Affairs put all three of my Braunstein relatives in cluster 4. The Focsaner and Goretsky relatives were put in different clusters. Collins’ Leeds used different clusters for all of my relatives except for two Braunsteins that it put into its cluster 24. Shared Clustering put all my Braunsteins in its cluster 2, both my Focsaners in its cluster 5 and all my Goretskys in its cluster 1. So far so good.

Now let’s take all the rest of my 242 matches, and if they fall into a Braunstein cluster, I’ll color it blue.  i.e. Genetic Affairs cluster 4, Collin’s Leeds clusters 24 and 3, and Shared Clustering cluster 2 will be colored blue. Similarly Focsaner clusters will be brown and Goretsky clusters green.

My results are in the table below.  To intepret this table:

Look at Relative 9.  It was assigned cluster 4 by Genetic Affairs, cluster 24 by Collin’s Leeds, and cluster 2 by Shared Clustering. From my known relatives, those clusters all correspond to a Braunstein relative and are colored blue. They all match.

Whereas Relative 11 was assigned cluster 9 by Genetic Affairs, which was a Focsaner cluster and is colored brown. The assignment was cluster 31 by Collin’s Leeds and cluster 1 by Shared Clustering. The latter two are Goretsky clusters, so they are colored green. There is disagreement here, so Relative 11 is colored yellow.

(This is the tallest graphic I’ve ever included in my blog)

image

I have 9 relatives (Relative 9 to 17) that all 3 clustering techniques have assigned a grandparent cluster.  Unfortunately, they only all agree on the grandparent in 4 of those 9 cases.

I have 28 relatives (Relative 18 to 45) that two of the techniques have assigned a grandparent cluster. 18 of those assignments agree. 10 do not.

Then there’s 113 relatives (Relative 46 to 158) that one technique has assigned a grandparent cluster. Since there’s only one, there is no telling if the others would agree or disagree.

Leftover and not shown are 84 relatives where none of the techniques assigned a grandparent cluster.


Conclusion

The goal here is to be able to assign a grandparent to my matches whose relationship I do not know. Using the 8 relatives I do know, and assigning their grandparents to the clusters they were assigned, I can get 1 to 3 cluster assignments for 150 of matches.

Unfortunately, only 4 out 9 (44%) grandparent assignments agree for me when all three techniques have assignments, and 18 out of 28 (64%) agree for me when two of the techniques have assignments. That’s a bit more disagreement than I was hoping I’d get from different genetic clustering techniques.

I do have a lot of endogamy in my ancestry. I would expect that people who have more distinct lines than me to get more agreement between the clustering techniques than I have.

Genetic Clusters and DNAGedcom - Sun, 20 Jan 2019

Over the past 6 months, everyone has been jumping on board using genetic clustering techniques to help them partition their DNA matches into their ancestral origins.

The basic idea is to compare all the people that each of your DNA matches also match to. These are not segment matches being compared, but are people who are considered to be DNA matches to each of your DNA matches. These are known as the DNA testers who are In Common With (ICW) someone, i.e. they match each other.


The Leeds Method

The genetic clustering revolution started last summer with Dana Leeds who came out with the technique now named after her called The Leeds Method. It is primarily aimed at AncestryDNA testers, but will work with all companies. It really helps at AncestryDNA because they do not provide segment match information and have fewer tools that help you identify commonalities between your matches. The Leeds Method specifically is designed to partition your matches into groups representing each of your grandparents families.

Dana’s technique is a manual procedure and takes time to go one-by-one through your matches at AncestryDNA and add them to a spreadsheet. But it is a great exercise as it gives you a real feeling for how your matches relate to each other. It often works even if you have endogamy in your matches as I do.

When I tried it, I came out with this:

image

I was able to place the majority of the people in a Cluster (column). The first four columns I could identify as belonging to my four grandparents, which I show by their surnames in the top row. This technique can include relatives down to your closest 4th cousins, so I limited my matches to those who were 50 cM or more. The procedure takes several hours to do by hand. And you’ll probably, like me, do it incorrectly the first couple of times.

Some smart programmers were inspired by Dana’s method. Doing it by hand is laborious, so why not automate it, they thought? And while at it, why not figure out a great way to visualize it.


Genetic Affairs

In November, Evert-Jan Blom put his mind to this and developed Genetic Affairs. It is an online program that logs into your Ancestry account, gathers the ICW data, and produces your clusters in a large table that lists all your DNA matches on both the left and the top. If they match in a cluster, they are given a color for the cluster. If they are a match outside all clusters, they are colored grey.

Your match table gets emailed to you when it is ready. Mine looked like this:

image

Although there are lots of grey squares (representing my endogamy), there are also 34 colored clusters. Using my results from the Leeds Method helped me identify the ancestors for several of the clusters and I was able to figure out where some of the others were from as well. I ended up being able to assign about 90 of my AncestryDNA relatives to a particular ancestor in 12 of the clusters.


DNAGedcom – First Download Attempt

In December, over at DNAGedcom, they added the Collins’ Leeds Method 3D. Kitty Cooper describes it nicely. To try it, I had to resubscribe to DNAGedcom for their $5 a month fee.

Then I had to download some data. You do this with the DNAGedcom Client running on your Windows computer. And then, if you have as many matches as I do, you wait patiently. Here is one of the progress windows:

image

I’ve got 2,872 pages of AncestryDNA matches to download, which equals about 143,600 people.

The defaults were “Quicker Match Gather” selected, “Skip Distant Cousin Matches” unselected, and 0 for “Minimum cM”. I left them all as their default. I thought that’s probably a mistake, but what the heck. My computer’s not doing much right now anyway.

There are three steps to download AncestryDNA data using DNAGedcom Client:

Step 1, Gather Matches. I started the “Gather Matches” step of the program at 9:54 a.m. I was able to do other work on the computer while it was running (i.e. work on Version 3.0 of Double Match Triangulator). The Gather Matches step finished at 12:49 p.m. and the resulting DNAGedcom.db (database file) is 80.6 MB. So in total, it took almost 3 hours. That worked at a speed of just over 16 pages of 50 people per minute. The database uses on average 589 bytes per person.

The Gather Matches step also created a 46.9 MB file called m_Louis_kessler.csv.  This file contains a title line row plus 143,583 rows, one for each of my Ancestry matches. The columns are: 

  • testid – some unique long identifier representing me.
  • matchid – some unique long identifier representing my match.
  • name – the name of the tester I match to.
  • admin – the name of the person who is administrator for the test. The name is the same as the admin for 120,536 people (83.9%) in my file.
  • people – the number of people in their tree. Of my matches, 59,581 (41.5%) have trees. 42,406 (29.5%) have at least 10 people in their tree. 20,393 (14.2%) have at least 100 people in their tree. 28 trees have over 100,000 people in it. The largest has 277,652 people.  The total number of people in my 59,581 matches’ trees is 42,595,797, averaging 715 per tree. You’d think there should be a good number of relatives in there for me to find.
  • range -  Relationship range, I have 28 second and third cousins, 13,769 fourth cousins, and 129,786 distant cousins.
  • confidence – a number that’s 100 for my first 2 matches and goes down to 21.198 for my last match.
  • shared cM – for me ranges from 410.8 down to 6.
  • shared segments – for me, from 23 down to 1.  I have 19,191 sharing just one segment. Here’s an XY plot:
    image
  • last login – there is nothing in this column for me
  • starred – if I’ve starred the person, true or false
  • viewed – if I’ve viewed the person, true or false
  • private – if the person is private. 9,084 (6.3%) of mine are marked private
  • hint – not sure what this is, but all of mine are false
  • archived – there is nothing in this column for me
  • note – there is nothing in this column for me. I don’t use Ancestry notes.
  • imageurl – a link to the DNA tester’s profile picture at Ancestry. I have profile pictures on 16,016 (11.2%) of my matches.
  • profileurl – there is nothing in this column for me
  • treeurl – a link to the DNA tester’s tree at Ancestry. I have trees linked from 60,090 (41.9%) of my matches.
  • scanned – this column has today’s date for every match, since I started from scratch and got all my matches today. If I run DNAGedcom client again in the future, I can use this column to identify the new matches since my earlier run.
  • membersince – there is nothing in this column for me
  • ethnicregions – there is nothing in this column for me
  • ethnictraceregions – there is nothing in this column for me
  • matchurl – a link to the DNA tester’s match page at AncestryDNA.

I bet if I unclick the option “Quicker Match Gather”, that the columns that are now empty for me will get filled. I don’t need them for the clustering, so I’ll try that some other time.

Step 2, Gather Trees. I started the “Gather Trees” step of the program at 12:57 p.m.to gather 51,006 trees. This ran all afternoon and finally finished in the evening 11:42 p.m. That step took 10 hours and 45 minutes. That was an average of 79 trees per minute. The database size grew to 291.6 MB, which is an increase for this step of 211.0 MB. So the average tree needed about 5,994 bytes in the database.

Step 2 finished by creating a 162 MB file named a_Louis_Kessler.csv. This file has a title line followed by 1,028,007 rows of data. That’s pretty close to Excel’s limit of 1,048,576 rows. A few more trees, and I wouldn’t have been able to open that file with Excel but would have had to manually divide it into pieces with a text editor first and then load it in parts. The columns in this file are:

  • testid – some unique long identifier representing me.
  • matchid – some unique long identifier representing my match. There are only 50,780 different matchids in the file. This is a bit less than the 51,006 “trees” DNAGedcom said it was loading. I’m not sure why.
  • name – the name of the tester I match to. Because there are 50,780 different testers in the file, the average tester has 20 lines. These cannot be the full trees of the people. If they were, I’d be looking at 715 people per tree (see “people” in the “Gathering Matches” section, above). So it seems obvious that these are just the ancestors of each person from their trees.
  • admin – the name of the person who is administrator for the test
  • surname – the surname of an ancestor of the tester. 875,392 (85.2%) of these had a surname in it. The rest were blank. The most common surnames for me were Cohen (6,408), Smith (4,072), Miller (2,919), Schwartz (2,698), Goldberg (2,632), Brown (2,628) and Levy (2,546), which is a good mix of what happen to be the most common Jewish and non-Jewish surnames. With regards to some of my own ancestors surnames, there’s Braunstein (145), Focsaner (0), Goretsky (6), Silverberg (115). I took a look at some of these and I cannot connect any of them to my ancestors. There are other spellings of these names as well. As far as Kessler goes, there’s 392, but that does not matter here because I’m not DNA related to Kessler, who was my father’s stepfather.
  • given – the given name of the ancestor. 111,506 (10.8%) of these say “Private” and have a blank surname. Only 17,854 (1.7%) of the given names are blank.  The most common given names for me were John (15,487), Mary (15,223), Sarah (14,220), Elizabeth (11,829), Samuel (11,348), Joseph (11,260) and William (11,239). I personally found it interesting that Louis was in 16th place with 5,783.
  • birthdate – 704,463 (68.5%) have values.
  • deathdate – 581,187 (56.6%) have values.
  • birthplace – 679,101 (67.8%) have values. The most common are Russia (52,579), New York (12,887), Poland (11,403), Austria (9,439), Germany (8,676).  All my ancestors come from either Romania (4316), specifically Tecuci (5), Dorohoi (23) or Ukraine (1,435), specifically Mezhirichi (3) or Odessa (736).
  • deathplace – 558,889 (54.3%) have values.
  • relid – this looks like it is the ancestor’s ahnentafel number, which is 1 for the tester, 2 for the tester’s father, 3 for the mother, 4 for their paternal grandfather, 5, for paternal grandmother, etc. The highest number is 1023 which is the person’s mother’s mother’s … mother (with 9 mothers – i.e. 7th great-grandmother, 9th generation). For the 50,780 people, all of them list themselves. So that must be the reason for the cutdown from 51,006. Those 226 people must not have had themself in their tree. The average parent is included 47,528 times (93.6%), grandparent 40,070 (78.9%), 3rd Gen: 22,364 (44.0%), 4th Gen: 9,736 (19.2%), 5th Gen: 3,374 (6.6%), 6th Gen: 1,266 (2.5%), 7th Gen: 542 (1.1%), 8th Gen: 254 (0.5%), 9th Gen: 124 (0.2%) 
  • source – there is nothing in this column for me.

Step 3:  Gather ICW.  ICW stands for the “In Common With” people. This is what is used to cluster all the people using the various clustering techniques. I started this procedure at 11:42 p.m. The screen indicated that it was going to find the ICW for all 143,583 people. It wasn’t progressing very quickly. By 12:10 a.m, it had only completed 98. So I went to bed. When I checked in the morning at 8:19 a.m., DNAGedcom had completed only 1,393 (0.9%) of the ICWs. It was averaging 160 people per hour. The database had grown from 291.6 MB to 854.5 MB, and increase of 562.9 MB which is 5.7 MB per person. If it continued at this speed, it was going to take 37 days and nights for it to complete, and the database would become 819 GB in size. Now yes, maybe as it goes to the more distant relatives, it might speed up and contribute less to the database. So I thought I’d see if it would. DNAGedcom even as it was running, allowed me check the Skip Distant Cousin Matches. So I thought maybe it would recognize that and stop after my 13,797 second, third and fourth cousins. I then let it continue run for the day while I was out of the house. When I checked it at 6:02 p.m., it had only done 2,891 (2%). It was not going any faster and was still averaging about 160 people per hour. It still said its goal was 143,583 people. I didn’t want to let it run another 3 days to see if it would stop at 13,797, so I disappointingly hit the cancel button. And then I was again disappointed when I saw that no file was generated. I knew it was supposed to create an ICW file when it completed this step. That is the file that is used for the clustering procedures. I was hoping DNAGedcom would still generate this file with what it already had processed up to cancelling, but it didn’t.


DNAGedcom – Second Try

I saved my old files and this time from the beginning, selected “Skip Distant Cousins” and also set Minimum cM to 20, which is the level AncestryDNA starts its Distant Cousins at. I then thought I might as well uncheck the “Quicker Match Gather” and see what additional information might be retrieved.

Step 1: Gather Matches. Started at 6:10 p.m. It processed 2,880 Ancestry pages and finished by 8:04 p.m. So that was an hour quicker than previously. The database was 13.0 MB, one fifth the size it was previously. What I found extremely interesting was that it processed 2,880 pages. Just a day ago, it processed only 2,872 pages. Those 8 pages represent about 400 more matches that I have gained in just one day! AncestryDNA must have sold a lot of tests during the holiday period and the results are starting to come in!

The Gather Matches file m_Louis_kessler.csv is now just 6.2 MB in size. It now lists just my 13,843 fourth cousins and closer.  I had 13,797 just a day earlier. With the “Quicker Match Gather” turned off, I now have information in these columns:

  • profileurl – this now has values
  • membersince – lists the year the tester became a member of Ancestry. 28.7% were between 2010 and 2015. 11.7% were 2016. 22.0% were 2017. 22.3% were 2018. 15.4% were 2009 and earlier with the earliest year being 2000 by 110 people (except for 80 people listed obviously wrongly shown as becoming a member in the year 1900)
  • ethnicregions – these are lists of the top ethnicities of each match. For me, the number one ethnicity of my matches is “EuropeJe”, i.e. European Jewish and 10,707 (74.4%) are listed as that.  Another 2975 (21.5%) have “EuropeJe” followed by one or more other ethnicities (e.g. Slavic, EuropeS, Baltic, Germany, etc.), and 156 (1.1%) have “EuropeJe” in their list but not listed first. Only 1 match listed as: “Celtic, EuropeW, EuropeE, EuropeS, AngloSaxon” does not have “EuropeJe” in their list of ethnic regions.
  • ethnictraceregions – this had values for 8,804 (63.6%) of my matches, which were all over the map with no discernible trends.
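These percentages are easy to compute yourself from the match file. Here is a minimal Python sketch, assuming the file and column names as described above, and treating the ethnicregions values as a comma-separated list:

```python
import csv
from collections import Counter

def tally_top_regions(path, column="ethnicregions"):
    """Count how often each ethnicity appears FIRST in the given
    comma-separated column of a DNAGedcom match file."""
    top = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            regions = [r.strip() for r in (row.get(column) or "").split(",")
                       if r.strip()]
            if regions:
                top[regions[0]] += 1
    return top

# Example: tally_top_regions("m_Louis_kessler.csv").most_common(5)
```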

Step 2: Gather Trees:  I started this at 8:13 p.m. It finished gathering 4,819 trees at 9:15 p.m. That was an average of 78 trees per minute, which was the same speed as previously, only there were fewer trees to process this time round. The database grew by 21.1 MB to 27.3 MB.

This time Step 2 created a_Louis_Kessler.csv as a 10.9 MB file with 72,491 lines for 4,802 people.

Step 3: Gather ICW.  I started this at 9:21 p.m. Two hours later, at 11:17 p.m., it was only at 302 of 13,843, just 2.1%. I calculated that it would take 98 more hours. I was seriously considering stopping it, upping the Minimum cM limit, and trying this a third time. But then, at 11:22 p.m., the DNAGedcom progress indicator changed to 100%, saying: “Finished Gathering ICW / creating FIles 100% Complete 0 of 0”

image

What was strange here was the statement: “0 of 0”. It was supposed to create an ICW file that I could use, but no such file had been created. It looked like DNAGedcom had finished. But before I gave up hope, I opened Task Manager:

image

Task Manager showed DNAGedcom was still using CPU and writing to disk. Maybe it was still creating that ICW file, but just not telling me that it was.

Sure enough, at 11:45 p.m., it completed and created the icw_Louis_Kessler.csv file.  That took 23 minutes. Finally DNAGedcom displayed:

image

DNAGedcom’s progress indicator is misleading. When you Skip Distant Cousin Matches or set a Minimum cM, it should show your progress relative to that, and not suddenly jump from 2.1% to Completed. It should then tell you that it is creating the ICW file. The statement “Creating Files 100% Complete 0 of 0” is not an indicator that something is still being created, especially when the phrase “100% Complete” is in the middle of it.

Nonetheless, I now have my ICW file. It is 119.2 MB and contains 924,299 lines. The columns are:

  • matchid – the unique long identifier representing this match
  • matchname – the name of the tester of this match
  • matchadmin – the person administering this match
  • icwid – the unique long identifier representing the ICW match
  • icwname – the ICW’s name
  • icwadmin – the ICW’s admin
  • Source – the value for all lines is “Ancestry”. DNAGedcom works with other companies’ data as well, hence this column.

My file contains 13,767 different matches who are ICW an average of 67 other matches each. The most any one match is ICW is 3,749 (27.2% of my 4th cousins and closer).  I have 240 matches that are ICW 1,000 or more people, 93 matches ICW only 1 other match, and 128 matches ICW only 2 other matches.
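Those statistics simply come from counting how many rows each matchid has in the file. A small sketch, assuming the column names listed above:

```python
import csv
from collections import Counter

def icw_counts(path):
    """For each match, count how many other matches they are ICW with,
    i.e. how many rows carry their matchid in a DNAGedcom ICW file."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["matchid"]] += 1
    return counts

# counts = icw_counts("icw_Louis_Kessler.csv")
# sum(counts.values()) / len(counts)   # average ICWs per match
# max(counts.values())                 # the most-connected match
```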


Collins’ Leeds Method 3D

It will now be interesting to see what DNAGedcom’s new clustering algorithm does with my ICW information.

image

It gives me this, which very much corresponds to the manual Leeds method I first used:

image

In order to attempt to do something similar to what Genetic Affairs does, I lowered the Minimum cM to 40 and got this:

image

So yay! Clusters are real and are usable to partition your DNA relatives into groups, even if you come from endogamy like I do.
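To illustrate the general idea only (this is a toy, not the actual Collins’ Leeds Method 3D algorithm, which does much more with cM thresholds): if you treat each ICW row as a link between two matches, the simplest possible clustering is the connected components of that graph, computed here with union-find. With endogamy, this naive approach tends to merge everything into one blob, which is exactly why the real tools are more sophisticated:

```python
def cluster_icw(pairs):
    """Group matches into clusters by treating each (matchid, icwid)
    pair as an edge and taking connected components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# cluster_icw([("A", "B"), ("B", "C"), ("D", "E")])
# → [{"A", "B", "C"}, {"D", "E"}]
```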


Shared Clustering

Just this month, another clustering approach was developed using Ancestry data from DNAGedcom downloads. This is an open source program called Shared Clustering, built by Jonathan Brecher and available on GitHub.

On his “Shared Clustering versus other clustering tools” page, Jonathan states:

As of this writing, the clusters generated by Shared Clustering are significantly better than those generated by most other tools. In this context, "better" means that the clusters are more useful to the genealogical researcher.

and he follows by describing the reasons in detail. Obviously, this is something that needs to be tried.

I downloaded and ran the setup.exe program from the Shared Clustering GitHub site. Windows gives you a warning because the author did not code sign the program, but by pressing “more info” on the warning, I could then click “run anyway”. After ignoring two more such warnings, it installed.

The main screen starts with an Introduction:

image

Since I had my ICW file from DNAGedcom, I went to the Cluster page. I put the path to my ICW file in the Saved data file box, and it automatically entered a cluster output file in the same directory with the name: Louis_Kessler-clusters.xlsx

image

It ran very quickly, but before it finished, it gave this error:

image

Obviously, it couldn’t handle the size of my ICW file. So I went to the advanced options:

image

I changed both of the 20 values to 40 and ran it again. This time it worked. I went to the directory with the output files and opened up the xlsx file. Here’s what it looked like at 15% magnification:

image

There are 243 people included in 7 clusters. I can’t offhand identify my ancestors for that dark cluster 4 in the middle.

Interestingly, it gives extra information at the beginning of each row:

image

This includes tree information which is not in my ICW file, so it must be reading one of the other DNAGedcom files as well (or the DNAGedcom database). It also includes a list of correlated clusters for each person.


Conclusion

I’ve compared the visual results from:

  1. The Leeds Method
  2. Genetic Affairs
  3. Collins’ Leeds Method 3D
  4. Shared Clustering

Each gives useful results that can help you cluster your AncestryDNA matches into possible groups that have a common ancestor.  Tools like these are important because AncestryDNA does not give you access to your segment matches or provide a chromosome browser. So clustering can help you determine possible ancestral lines that groups of DNA relatives may share with you, which will help you direct your genealogical research to connect yourself with them.

These ideas for using In Common With (ICW) data translate well to Chromosome Mapping and Triangulation techniques. Double Match Triangulator already shows you ICW data on the People page for all the B People in a combined run, adding who among them triangulates with each other. More uses of ICW data will be in DMT 3.0 as I work to finish and release it.