
Louis Kessler’s Behold Blog

WeGene - Fri, 21 Sep 2018

WeGene is a Chinese company that up to now has done DNA testing in Shenzhen, China. They have grown to about 300,000 customers. They are going to open a second lab in Hong Kong so they can start offering services abroad. An article a couple of days ago on genomeweb gives more of the details.

Their main site www.wegene.com is in Chinese. They do autosomal testing somewhat akin to what 23andMe does, giving you a lot of medical information. The article says that WeGene provides a DNA relatives report. If that is true, then they’d be only the 5th major DNA testing company to do so. (LivingDNA does not yet provide you with your DNA matches.) The price they charge for an autosomal test is 499 Chinese Yuan, which is about $73 USD.

image

They have an English site as well at: www.wegene.com/en/

image

When I saw the boxes at the bottom right to Import 23andMe and AncestryDNA data, I was intrigued. So I created an account and imported my 23andMe data. There was no cost for this.

First I tried uploading the unzipped .txt file, but WeGene didn’t take it; you have to upload the .zip file. I then did, and it gave me this confirmation message:

image

I also uploaded my AncestryDNA with the same result. They sent me confirmation emails that look like this:

image

which translated says:

Hello:
WeGene Customer Service sent you a private message on WeGene Microgene - Focus on Personal Genome Detection and Analysis
———
Dear User, your data Louis Kessler has been updated to our site. The full report will be generated in a few hours. Thanks for your patience.
Please click on the link to continue: https://www.wegene.com/inbox/read/nnnnnn
Please do not reply to this email. This email is not monitored and you will not receive any response. For help, please log in to the website.
WeGene Microgene - Focus on Personal Genome Detection and Analysis

When you click on the link, it takes you to your messages at WeGene where a message has been left that says:

Dear User, your data Louis Kessler has been updated to our site. The full report will be generated in a few hours. Thanks for your patience.

I did a few other things and went back to the site about an hour later to look around and there were haplogroup reports already ready:

image

The paternal Haplogroup was a mouthful: R1a1a1b2a2b1a.  Now I have never seen it written that way, so I can’t tell you if it is correct or not. I’ve had Big-Y 500 done at Family Tree DNA, and from R1a they put me right into a M198 sub designation so it’s a bit of a different reference notation.

The maternal haplogroup exactly matched my Family Tree DNA mtFull sequence haplogroup of K1a1b1a, so that is definitely good.

The famous people they show in the pictures are from just the top letter of the Haplogroups R and K, which is about as close to me as someone with the same astrological sign. And “The Adams” should be John Adams, and “Mary Streep” should be “Meryl Streep”.  But they obviously are trying to cater to Western cultured customers by including a movie star, journalist and a president all from the United States, to be among the displayed people in the haplogroups.

Then I noticed I had two sets of reports available, likely one for each of my uploads. The second gave the same verbose paternal haplogroup. But it gave a maternal haplogroup of L3. Ann Turner explained this to me on Facebook:

“AncestryDNA (v2 only) has just a couple hundred mtDNA SNPs, and most of those appear to have been selected for their medical relevance. L3 might be the best WeGene can do. All European haplogroups do descend from L3.”

image

Unfortunately, I had no way of telling which of the two reports was from my 23andMe upload and which was from my AncestryDNA upload. Both are simply listed as: Louis Kessler. So I’ve now changed my name on the test profiles to include the company in parentheses after my name. I assigned AncestryDNA to the one that gave me the L3 maternal haplogroup.

Another hour later and the Ancestry Composition reports appeared. I got these for my 23andMe and AncestryDNA uploads:

image

image

So 23andMe gave me 16.53% Ashkenazi and 39.00% Balkan = 55.53% that are ethnically and regionally correct, and AncestryDNA gave 18.64% Ashkenazi and 47.04% Balkan = 65.68% ethnically and regionally correct.  The Middle Eastern when expanded is mostly Egyptian, but for me to be that, you’d have to go back to biblical times.

image 

Other companies I tested with gave me at least 83.8% (at MyHeritage DNA) and as much as 99.2% Ashkenazi (at 23andMe), so I very much doubt the results here from WeGene. What is really unfortunate is that they give the values to two decimal places that seemingly make it look very accurate, when in my case, the values aren’t even close.

What I’d really be interested in is if just from free raw data uploads, they will provide me with my DNA matches. If so, it will be interesting to see if I match to anyone from what is likely a mostly Chinese customer base.

They offer an Application Programming Interface (API) at https://api.wegene.com/ to encourage developers to develop utilities that use their data. The API documentation says you can get access to User Information, raw data information, health risk data, the ancestral information I show above, and a health report. Unfortunately I don’t see anything in it about doing anything with your matches to other people. In fact I don’t see anything at their site about them providing you DNA matches. The only place I have seen that mentioned was in the article I linked to at the top of this post. Hopefully the article is correct.

So the English site only has the Ancestry Analysis and a Pocket DNA menu item that enables you to look at your various health issues. I’m not too interested in that, but you can feel free to explore it if you go there.

I then went to the main Chinese site at www.wegene.com and found that I could log in there as well. Translating the main page gives:

image

The MyGenes menu item is health info. Community is the discussion forum.

The Explore menu looks more interesting:

image

It gives:

  1. Institute: which is a number of survey questions to fill out so they can correlate your DNA to your characteristics.
  2. Micro-Interpretation: which is more health stuff.
  3. Application: which is a list of 3rd party applications available that use their API. There are 4 listed:
  4. Surname Outgroup: Shows the distribution of your surnames and places of origin relative to others and gives you historical events and migration information related to them. It says 71,037 people have been involved in this. This is for Chinese people.
  5. Genetic Relationships: This is where you enter your family tree and compare yourself to others. It looks like you have to enter the person in the tree and they have to have had a DNA test as well either at WeGene or uploaded there for you to use this.
    image
  6. Raw Data:  For me, this screen only says: 
    “Only data in wegen format is supported.”

Having seen this, I now suspect that the article’s reference to a DNA relatives report was about this Genetic Relationships function. Unless there’s something I’m not seeing, I believe they’ll only compare the DNA of people you put in your tree and agree to be friends with. So I no longer believe that they actually give you a list of all your DNA matches. Too bad.

Overall, very interesting. We’ll see what they do when they expand their services outside of China. I’m sure WeGene will be looking at what the other companies are doing and will then work to expand their offerings for people more interested in their genealogical research.

The Benefits of Combining Your DNA Raw Data - Mon, 17 Sep 2018

In my last post, I compared the raw data from 5 DNA testing companies. I ended that article by describing how I then took my 5 sets of raw data and combined them into a more complete set. I ended up with a file that contained 1,389,750 different SNPs of which just 20,688 (1.5%) were no-calls. I then described how I uploaded it to GEDmatch Genesis.
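To make the idea of a combined kit concrete, here is a minimal Python sketch of one way to merge several sets of raw data and then count the no-calls. It is only an illustration: the function names, the toy RSIDs, the no-call markers (`--` and `00`) and the rule of keeping the first real call seen for each RSID are all assumptions for the example, not a description of how the combined file above was actually built.

```python
# Genotype strings treated as no-calls (an assumption; check each company's file).
NO_CALLS = {"--", "00", ""}


def merge_kits(kits):
    """Merge {rsid: genotype} dicts, keeping the first real call seen per RSID."""
    merged = {}
    for kit in kits:
        for rsid, genotype in kit.items():
            # Add new RSIDs, and let a later kit fill in an earlier no-call.
            if rsid not in merged or merged[rsid] in NO_CALLS:
                merged[rsid] = genotype
    return merged


def no_call_percent(kit):
    """Percentage of SNPs in the kit whose genotype is a no-call."""
    if not kit:
        return 0.0
    no_calls = sum(1 for genotype in kit.values() if genotype in NO_CALLS)
    return 100.0 * no_calls / len(kit)


# Toy data with made-up RSIDs and genotypes.
kit_a = {"rs1": "AA", "rs2": "--", "rs3": "CT"}
kit_b = {"rs2": "GG", "rs3": "CC", "rs4": "--"}

combined = merge_kits([kit_a, kit_b])
print(len(combined), "SNPs,", round(no_call_percent(combined), 1), "% no-calls")
```

Note that this sketch silently keeps the first call when two companies disagree (rs3 above). A more careful tool might flag such conflicts or turn them into no-calls instead.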

GEDmatch Genesis has since fully processed my combined raw data file. Let me from here on call it my All 5 kit.

I had earlier this year (in January) uploaded my 23andMe raw data to GEDmatch Genesis. That file had 613,899 SNPs with 16,906 (2.8%) no calls.

So the question is whether or not there are any tangible benefits from combining your raw data together and using that instead of using the raw data from one company. Let’s take a little adventure and see.


We Want to Ensure We Have Adequate Overlap

We want accurate matches. One of the challenges the people at GEDmatch are taking on is to overcome the differences in the coverage of SNPs provided by the DNA testing companies. The situation became worse when the new v5 chip was released and 23andMe and Living DNA both started using it. The new chip tests many different SNPs than the old chips do and only about 14% of the SNPs are in common with the old chip. This reduces the accuracy of matches of tests from 23andMe or Living DNA when compared to tests from Family Tree DNA, AncestryDNA or MyHeritage DNA.

That 14% I allude to is known as overlap. GEDmatch defines overlap as:

‘Overlap’ is the number of positions that exist in common between both kits, without regard to whether they match or not. The amount of overlap, along with the largest cM amount, is usually a good indication of the relative quality of the match.
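Read literally, that definition is just the size of the intersection of the two kits’ positions. Here is a small Python sketch of that reading. The function name and the toy data are made up for illustration, and this is not GEDmatch’s actual code.

```python
def overlap(kit_a, kit_b):
    """Number of positions present in both kits, whether or not the genotypes match."""
    return len(kit_a.keys() & kit_b.keys())


# Toy kits keyed by (chromosome, position); all values are made up.
kit_a = {("1", 1000): "AA", ("1", 2000): "CT", ("2", 500): "GG"}
kit_b = {("1", 2000): "CC", ("2", 500): "GG", ("3", 700): "AT"}

# Position ("1", 2000) counts toward the overlap even though CT and CC differ.
print(overlap(kit_a, kit_b))
```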

Overlap as a measurement was included in GEDmatch Genesis’ new One-to-Many report.

Important note: GEDmatch Genesis has two One-To-Many reports. The second has (Proto) written in red, obviously meaning it is a prototype or experimental. (Genesis actually has a third One-To-Many report in its Tier 1 Utilities, but that one does not give overlaps.)

The first report does not give correct overlap values. The second report with Proto does.

image

On the GEDmatch Facebook group, Aaron Wells, one of the developers at GEDmatch, said this about the Proto version of the One-To-Many report:

The "Proto" version is one that will be the final version of One-To-Many once migration is complete. There is now a brand new database, which will be the final database of matches. It was first populated with all native genesis kit matches to other native genesis kits. Next, the original gedmatch matches are being moved over as we speak, populating this new database. The "Proto" one-to-many is pulling matches from the new and final database.

The One-to-Many Proto report shows the overlap in its own column:

image

The above table shows the beginning of my GEDmatch Genesis One-to-Many(Proto) report for my 23andMe kit. The first line I match to is my All 5 kit, and that matches at 3572.4 cM which is 100%, so thankfully I match myself. The other lines show matches with other people. So far my closest match at GEDmatch Genesis is only 88.9 cM and I don’t know how I am related to that person or any of my other matches there.

Now let’s take a look at the overlap column of this report. You’ll see a number of 23andMe kits that I match to with an overlap of about 317,000. You’ll see 3 FTDNA kits I match to and one Ancestry kit I match to with overlaps of about 70,000 shaded light red. And you’ll see a 23andMe kit and an unspecified kit with overlaps of about 50,000 shown in a darker red. I did not add the red shading. It is on the report from GEDmatch Genesis. They write:

Matches with low overlap have that field highlighted with a pink or red background, depending on the overlap value.
Matches with very low overlap are not shown.

So obviously, the quality of the matches in red is suspect. We want to see if a combined raw data file can improve on that.

The overlap of my 23andMe kit with most of the other people’s 23andMe kits is good, as would be expected. They both use the same chip and effectively test the same SNPs. Theoretically there should be 100% overlap, but I have found when comparing different people’s Family Tree DNA raw data that they average 702,000 SNPs with one being as high as 708,092 and one as low as 680,544 SNPs. So there are variations of a few percent in coverage that may occur even from the same chip.

My 23andMe v5 test results included 613,899 SNPs and other people testing at 23andMe on the v5 chip should get about the same. So the overlap should be something over 600,000, but GEDmatch Genesis is only showing about 317,000. Ann Turner told me on the Facebook GEDmatch User Group that GEDmatch Genesis discards SNPs with a low minor allele frequency, i.e. only the SNPs whose values vary the most often will be used. So GEDmatch Genesis is using the 50% of SNPs that give them the best bang for the buck. This means that the overlap values that are reported by them are about half of the overlap numbers you’ll find in the Autosomal SNP comparison chart at the ISOGG Wiki.



How Much Overlap is Not Enough?

The overlap of the v5 chip kits against non-v5 chip kits is not good. When I analyze my One-to-Many report in Excel, I can group the amounts of overlap by GEDmatch’s color coding. They use dark red for my overlaps of less than about 55,000, a lighter red for overlaps between 55,000 and 72,000, and an even lighter red up to 100,000. Anything above 100,000 is not shaded and is deemed okay.
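If you export the report and want to do the same grouping in code rather than in Excel, a sketch like the following would work. The cut-offs are the approximate ones I observed above, not values published by GEDmatch. The function name, the shade labels and the handling of values that fall exactly on a boundary are assumptions, and apart from 71,660 and 308,924 (which come from one of my matches) the sample overlaps are made up.

```python
from collections import Counter


def overlap_shade(overlap):
    """Classify an overlap count using the approximate cut-offs observed in the report."""
    if overlap < 55_000:
        return "dark red"
    if overlap < 72_000:
        return "light red"
    if overlap < 100_000:
        return "lighter red"
    return "unshaded"


# Sample overlap values for a handful of matches.
overlaps = [48_900, 71_660, 95_000, 308_924, 317_000]

# Tally how many matches fall into each shading group.
counts = Counter(overlap_shade(value) for value in overlaps)
for shade, count in counts.items():
    print(shade, count)
```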

People who upload to GEDmatch Genesis are asked to supply the name of the company they tested with. It’s free form text, but most of the company names listed are identifiable.

The One-To-Many(Proto) report gives the 3,000 closest matches. So here are the overlaps for the 3,000 matches of my 23andMe test against other people’s kits from various companies.

image

Over half of my matches also tested with 23andMe. But almost 34% of them have low overlap because those were from tests done before 23andMe started using the v5 chip a couple of years ago.

Almost all of the Ancestry DNA, Family Tree DNA and MyHeritage DNA tests have low overlap with my 23andMe kit.

Almost all the Living DNA kits had good overlap with my 23andMe kit. That’s because Living DNA has only used the v5 chip and tests almost the same SNPs as 23andMe.

The bottom line is that over half (54.5%) of my matches are deemed to not have enough overlap.


The Improvement You Get Using Combined Raw Data

My All 5 kit includes all the SNPs tested, not only from 23andMe, but also from the other 4 major companies: Ancestry DNA, Family Tree DNA, Living DNA and MyHeritage DNA.

Analyzing my 3,000 matches of my All 5 kit in the same manner gives these results:

image

Well that is a lot better. Almost all the low overlap matches are gone.


How has the Overlap Changed?

Of the 3,000 matches to my 23andMe kit, 611 of them were not also matches to my All 5 kit. They got dropped off the list and replaced by new ones. That means over 20% of the matches were refuted or at least found to be not as good as they were originally stated to be after I improved my overlap count with the combined raw data.

If I compare the 2,389 matches that were in common to both kits, I get this very interesting graph:

image

The bottom axis is the overlap that the match has with my 23andMe kit and the left axis is the overlap the same match has with my All 5 kit.

The diagonal line is where the overlap is the same. In all cases the All 5 kit was at least as good an overlap as the 23andMe kit. That is expected because it includes all my 23andMe SNPs.

Around 300,000 and 300,000 you’ll see a blob of yellow (Living DNA) and blue (23andMe). These are the v5 tests. My 23andMe kit already had all the SNPs from that test, so my combined kit made very little improvement there.

But that line of dots on the left is what got improved. Those are all the matches that had under 100,000 overlap with my 23andMe test. They all got improved significantly when matched with my All 5 kit.

In case you’re wondering, those 3 highest green dots were matches to kits from DNA Land (at the top), an FTDNA kit merged with a 23andMe kit (next highest), and a Genes for Good kit (third highest).


But Have My Matches Themselves Changed?

Let’s now compare those 2,389 individual matches. GEDmatch Genesis’ One-to-Many report gives Total cM, Largest segment cM and a Gen value, which is an estimate of the number of generations back to your common ancestor.

By using a combined raw data file, this is how many of my match numbers changed:

image

More than half the matches had exactly the same values with my combined raw data as they had previously.  Those were mostly matches with v5 tests where my 23andMe data already had good overlap and gave good results.

For the matches that changed, the average Total cM went down 7.0 cM, the average Largest segment cM went down 1.5 cM, and the gen value increased by 0.3 generations. 

This means the addition of extra SNPs made the relationships with the matches a little bit weaker. It also means that a match without good overlap slightly overestimates the strength of the match.

The elimination of some matches in combination with the reduction of the Total cM of some matches resulted in reducing the Total cM of the 3,000th match from 54.4 cM with the 23andMe data, to 35.8 cM with the combined data.



One Specific Match

Let’s see what effect a combined raw data file has on one specific match. Here’s my top Ancestry DNA match, who is displayed on the 2nd line of the first table at the beginning of this post.

When compared with my 23andMe kit, this Ancestry DNA kit is an 88.9 cM match, with largest segment 19.0 cM and 3.7 generations. It has an overlap of 71,660 which is said not to be good.

Here is that Ancestry kit’s One-to-One comparison with my 23andMe kit for chromosomes 1 to 6:

image

When compared with my All 5 kit, the same Ancestry DNA kit is a total match of 79.1 cM, with largest segment 19.1 cM and 3.8 generations. It now has an overlap of 308,924.

The main difference is the dropping of the 8.8 cM match on chromosome 6 from the match list. Here’s the One-to-One comparison with my All 5 kit:

image

You will notice that there appear to be more red lines with the All 5 kit. The red lines are Base Pairs with No Matches. This indicates again that adding more SNPs to your raw data and getting more overlap will tend to disprove matches that might otherwise have been thought to be valid.

Also notice the information in the diagrams about the two segment matches on chromosome 3. Those segments only included 578 and 437 SNPs with the 23andMe kit, but that went up to 2,050 and 1,604 SNPs with the All 5 kit.  That’s an increase in SNP density by a factor of roughly 3.5 to 3.7.


Conclusion and Recommendation

Combining raw data from different companies does seem to provide some degree of increased accuracy in your match information and eliminates a few incorrect segments. Using it will drop off some of your more distant matches from your match list and replace them with others.

The real issue is overlap. The v5 chip from 23andMe and Living DNA works well against other v5 chips but not as well against the older chips. The same goes for the non-v5 chips of AncestryDNA, Family Tree DNA and MyHeritage DNA. They work well for comparisons with each other, but not as well against the v5 chip.

You might want to consider testing once at 23andMe or Living DNA with the v5 chip, and once with one of the other companies using the older chips. Then you can combine your two sets of raw data with this nice free tool by Wilhelm Halys called: DNA Kit Studio.

Then whenever you want to transfer or upload your raw data to a DNA testing or analysis site, you can use this combined raw data for slightly more reliable results.

Will you really notice the difference? To be honest, I doubt it.

Comparing Raw Data from 5 DNA Testing Companies - Fri, 31 Aug 2018

In March 2017, I compared my DNA raw data from Family Tree DNA against my DNA raw data from MyHeritage DNA.  I had tested with FTDNA at home on Nov 25, 2016 and with MyHeritage DNA at RootsTech on Feb 10, 2017.

Since then, I ordered tests and tested at home with 23andMe on Dec 2, 2017, AncestryDNA on Dec 12, 2017, and Living DNA on June 23, 2018. So I now have five sets of my own Raw Data from different testing companies that I can compare.

You never know what the companies are doing, so just to make sure, I downloaded my Build 37 raw data from Family Tree DNA again and compared it with the download I did on Jan 12, 2017. Nothing had changed. The files were identical. That’s good. 
   

Raw Data File Contents

All 5 companies list your SNP (Single Nucleotide Polymorphism) data, one per line. Some companies include some lines of text description at the top, followed by a title line naming the fields, followed by the SNP data. Here for example is the beginning of my Ancestry DNA raw data file:

image

And this is the beginning of my Family Tree DNA raw data file:

image

Here’s a comparison of the five DNA tests I took and the raw data files I got from them:

image

Family Tree DNA and MyHeritage DNA files are both set up similarly as .csv files (comma delimited) with fields enclosed in double quotes. The other 3 companies use plain text files separating fields with a space or tab. Both types of files can easily be loaded into Excel and the fields will be placed properly into columns for you.

The first field for each SNP in all the files is the RSID (Reference SNP cluster Identifier) which basically is a name for the SNP. I checked, and in each raw data file, no RSID was listed more than once.

The RSID is followed by the chromosome number and the position of the SNP on that chromosome, measured in base pairs on the forward strand. The position of the SNP can change when the powers that be come out with a new “build” of the genome. Several years ago, Build 36 was the standard, but most companies now use Build 37. They have already come out with a Build 38, but so far all of the companies are sticking to Build 37 because it really is a lot of work to change for little gain with regard to matching people to each other. All 5 of my raw data files are from Build 37, so (theoretically at least) the chromosome and position of any SNP should match. I’ll check that later in this article in the section: “RSIDs with more than one Position”.

The value of the SNP is called “result” by Family Tree DNA and MyHeritage DNA, “allele1 and allele2” by AncestryDNA, and “genotype” by 23andMe and Living DNA. Ancestry DNA puts a space between the two allele values. The other companies list the two alleles together as a single 2 character string.

The SNPs from all five companies are listed by chromosome and then by position within the chromosome. Chromosomes 1 to 22 (the autosomes) are listed first. The sex chromosomes X and Y and the mitochondrial MT follow.  Ancestry DNA numbers X as 23 and 25, Y as 24 and MT as 26. Ancestry uses 25 for the few SNPs that they probe that are in the pseudoautosomal region of the X and Y chromosomes. These are the tips of the X that actually combine with the Y chromosome just like autosomal genes do.
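If you want to compare chromosome labels across files, Ancestry DNA’s numbers first need to be translated. Here is a minimal Python sketch of that mapping. The function and constant names are made up for illustration, and folding 25 into X is a simplification that throws away the pseudoautosomal distinction.

```python
# Ancestry DNA's numeric labels mapped to the letters the other companies use.
# Mapping 25 to X is a simplification: it drops the pseudoautosomal distinction.
ANCESTRY_LABELS = {"23": "X", "24": "Y", "25": "X", "26": "MT"}


def normalize_chromosome(label):
    """Return a common chromosome label; autosomes 1 to 22 pass through unchanged."""
    label = str(label).strip().upper()
    return ANCESTRY_LABELS.get(label, label)


for raw in ["1", "22", "23", "24", "25", "26", "X", "MT"]:
    print(raw, "->", normalize_chromosome(raw))
```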

Family Tree DNA embeds a 2nd title line between the last SNP on the 22nd chromosome and the first SNP on the X chromosome. Don’t get caught by this. Be sure to remove this second title line if you are analyzing a Family Tree DNA raw data file in a spreadsheet or with programming.
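For those handling the files with programming, here is a minimal Python sketch of a reader that copes with both file layouts and with that repeated title line. It rests on some assumptions that you should check against your own files: that description lines start with `#`, that any title line has `rsid` (in any letter case) as its first field, and that the first three fields are always RSID, chromosome and position. The function name and the sample lines are made up.

```python
import csv


def parse_raw_data(lines):
    """Yield (rsid, chromosome, position, genotype) tuples from raw data lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and description lines (assumed to start with '#')
        if "," in line:
            fields = next(csv.reader([line]))  # comma delimited, fields in double quotes
        else:
            fields = line.split()  # space or tab delimited
        if fields[0].lower() == "rsid":
            continue  # title line, including a second one embedded before the X chromosome
        rsid, chromosome, position = fields[0], fields[1], int(fields[2])
        genotype = "".join(fields[3:])  # joins two separate allele columns into one string
        yield rsid, chromosome, position, genotype


# Made-up sample lines in the two layouts.
csv_style = [
    "RSID,CHROMOSOME,POSITION,RESULT",
    '"rs1","1","100","AA"',
    "RSID,CHROMOSOME,POSITION,RESULT",
    '"rs2","X","200","CT"',
]
tab_style = [
    "# description line",
    "rsid\tchromosome\tposition\tallele1\tallele2",
    "rs3\t23\t300\tT\tC",
]

print(list(parse_raw_data(csv_style)))
print(list(parse_raw_data(tab_style)))
```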


RSIDs and SNPedia

The RSID, which you can think of as the name of the SNP, is usually represented by the letters “rs” followed by a number. The SNPedia has information on a fair percentage of these RSIDs and you can look them up to find out what that particular SNP has been found to do.  For example, the entry for rs1815739 in SNPedia will tell you that this SNP is on chromosome 11 at position 66560624, is part of Gene ACTN3, and is said to have an effect on muscle performance. Values of (C,C) could contribute to better performing muscles, (C, T) is a mix of muscle types, and (T,T)  could contribute to impaired muscle performance. Medical interpretation of SNPs is not something I have any experience with, so I will make no attempt to do that.

When testing companies test SNPs that do not already have an RSID defined, they often invent their own. 23andMe has used “i” followed by a number. Family Tree DNA and MyHeritage DNA have used “VG” followed by the chromosome number followed by “S” followed by a number. And Living DNA came up with a whole set of different RSID names, each of which must have some meaning to them. In my raw data, I found the following number of SNPs with these prefixes:

image

At the time I’m writing this, the number of SNPs defined in SNPedia is 109,335. SNPedia says that 49,082 of those are tested by Ancestry.com’s v2 platform and 24,761 by 23andMe’s v5 platform with 16,453 in common between them. There are 13,916 tested by Family Tree DNA. MyHeritage DNA has about 12,000 entries and Living DNA has about 22,000. They say that 1,504 of their defined SNPs are common to most platforms.


Number of SNPs by Chromosome

All companies read and provide raw data for the SNPs from the autosomes (chromosomes 1 to 22) as well as the X chromosome. MyHeritage DNA, Ancestry DNA and 23andMe provide Y chromosome SNPs. Ancestry DNA and 23andMe provide mitochondrial (MT) SNPs.

Below is the number of SNPs by chromosome in my raw data:

image

You’ll notice that the FTDNA and MyHeritage numbers of SNPs are identical for all the autosomes and differ by only 16 for the X chromosome. That’s because both companies use the same chip and the same lab, Gene By Gene (the parent company of Family Tree DNA). Differences in the reads between the two are indicative of the error rate in one set of raw data. My analysis last year that compared the two sets of raw data found 42 differences out of 702,442 autosomal SNPs, indicating an error rate less than 0.01%. MyHeritage does include some Y chromosome results in its raw data, but Family Tree DNA does not.


Ancestry’s X Chromosome in More Detail

Ancestry divides its X data into what it calls chromosomes 23 and 25. The latter is said to represent the pseudoautosomal region which I described earlier. My 27,973 X SNPs from my Ancestry DNA raw data are made up of 27,473 chromosome 23 SNPs and just 500 pseudoautosomal chromosome 25 SNPs.

This is the range of positions and counts of my designated chromosome 23 versus chromosome 25 SNPs:

image

Ancestry DNA’s Chromosome 25 regions in my raw data include 339 SNPs up to position 2,697,868 which is the starting tip of the X chromosome and is the first pseudoautosomal region. And then there’s 63 SNPs at the ending tip of the chromosome in the second pseudoautosomal region.

For some reason, Ancestry DNA assigns 13 SNPs between positions 2,700,157 and 8,549,940 to chromosome 25, even though they are outside the official region (up to 2.7 Mbp) and it also assigns 1,256 SNPs in that stretch to chromosome 23. Then between 88,720,459 and 92,164,248, they have another 84 SNPs assigned to chromosome 25, and I’m not sure why.

The SNP designated 25 at position 117,610,641 in my raw data file is all alone and is likely an incorrect entry by Ancestry DNA.

138 of those Ancestry chromosome 25 SNPs are also included in my raw data from 23andMe, who simply include them as an X chromosome SNP and don’t differentiate them like Ancestry DNA does.


SNPs in common between companies

It is quite important to know how many SNPs are shared between companies. I compared my 5 sets of raw data in pairs and counted the SNPs shared. The numbers on the diagonal in bold are the number of SNPs in my raw data just from that company. The numbers below the diagonal are the number shared. The percentages above the diagonal are the percent shared out of the total SNPs that the two companies have = #shared / (#c1 + #c2 – #shared)

image

The first table shows the shared autosomal SNPs that I have between my raw data files from the five companies.

Below that are the comparable numbers from the Autosomal SNP comparison chart at the ISOGG Wiki. The FTDNA number 698,179 that I’ve marked in their chart has to be wrong because it can’t be less than the number FTDNA shares with MyHeritage. The numbers are fairly close to mine. I know from looking at several different people’s raw data from Family Tree DNA, that there is variation in the number of SNPs included in one company’s raw data from test to test.

Family Tree DNA and MyHeritage DNA provide identical autosomal SNPs. They share about 44% with AncestryDNA. 23andMe and Living DNA who both use the v5 chip share over 90% with each other, but only about 14% with the other companies. Only 110,231 autosomal SNPs were included in my raw data by all five companies.
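The pairwise counts and percentages can be reproduced with a few lines of Python. This is a sketch of the formula given above, #shared / (#c1 + #c2 – #shared), applied to sets of RSIDs. The function name, the kit names and the toy RSIDs are made up for illustration.

```python
from itertools import combinations


def shared_stats(kits):
    """For each pair of kits, return (number shared, percent shared of the combined total)."""
    stats = {}
    for (name_a, snps_a), (name_b, snps_b) in combinations(kits.items(), 2):
        shared = len(snps_a & snps_b)
        total = len(snps_a) + len(snps_b) - shared  # SNPs found in either kit
        percent = 100.0 * shared / total if total else 0.0
        stats[(name_a, name_b)] = (shared, percent)
    return stats


# Toy kits: each is just a set of RSIDs.
kits = {
    "A": {"rs1", "rs2", "rs3", "rs4"},
    "B": {"rs3", "rs4", "rs5"},
    "C": {"rs4", "rs6"},
}

for pair, (shared, percent) in shared_stats(kits).items():
    print(pair, shared, f"{percent:.1f}%")

# SNPs included by every kit, analogous to the count of SNPs common to all five companies.
in_all = set.intersection(*kits.values())
print(len(in_all))
```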

Those low overlap percentages are what make it difficult to find matching segments between data from the v5 chip and data from the old chip. Some companies like Family Tree DNA do not yet accept transfers of raw data from 23andMe or Living DNA because of that. MyHeritage DNA uses imputation to estimate the missing SNPs. GEDmatch is still working to develop a more reliable method to compare v5 chip data with earlier data through its GEDmatch Genesis project.

Here’s the same data, but for the X chromosome:

image

The ISOGG Wiki doesn’t yet have X data in their table for MyHeritage DNA, Living DNA or the new v5 chip of 23andMe.

Here are my tables for the Y chromosome and for mitochondrial.

image


RSIDs with more than one Position

All my raw data files were from Build 37 of the genome. So every RSID should map to one SNP on one specific chromosome at one position. That was true within any one set of raw data, where every RSID was just given once.

But once you combine multiple sets of raw data, you’ll find the same RSID tested in different files. This is the count of the number of RSIDs by the number of files each was found in:

image

So you would expect those RSIDs that are in more than one raw data file to be at the same position on the same chromosome in each file. It turns out that in my files 68 of those RSIDs are not at exactly the same position.

All but 1 are differences with the 23andMe raw data. And most of them are minor.

29 differences have the 23andMe position being just 1 less than the Living DNA position, e.g. RSID rs498648 is on chromosome 1. In my 23andMe raw data file it is at position 176,957,452 and in my Living DNA file, it is at position 176,957,453. Now this is just 1 position different and isn’t important at all for genealogical purposes. But for a programmer who may want to develop tools for handling raw data, even a difference of one can cause a problem. None of these 29 differences have RSIDs that are in the other 3 raw data files or in SNPedia, so I can’t tell which one might be the correct one.

34 of the differences are very small ones on the mt chromosome where 23andMe is 1 more (31 times), or 2 more (twice) or 3 more (once) than the Ancestry DNA position. e.g. for RSID rs118203886 Ancestry DNA lists position 611 on chromosome 26, and 23andMe lists position 613 on chromosome MT. Of these RSIDs, 32 are listed in SNPedia and SNPedia agrees with Ancestry DNA in all cases.

One more difference is SNP rs3857360 which is in both my Family Tree DNA and my MyHeritage DNA raw data files as position 102,989,428 on chromosome 5, but has a position one higher at 23andMe. This SNP is not in SNPedia.

But there are four differences between 23andMe and Living DNA that concern me the most because the RSID is used for two completely different locations. These 4 are:

image

Two of the values at 23andMe are no-calls. Of the other two, one doesn’t match: it is TT at 23andMe and AA at Living DNA. That already suggests that these might be different SNPs that one of the companies has named incorrectly. None of these four SNPs are in SNPedia.
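A check like the one in this section can be sketched in Python as follows. The function name and data layout are assumptions for illustration, and the sketch assumes the chromosome labels have already been brought to one convention (otherwise Ancestry DNA’s 26 and 23andMe’s MT would always look like a mismatch). The rs498648 positions are the ones quoted above; rs0000001 is a made-up RSID that agrees in both files.

```python
from collections import defaultdict


def rsids_with_conflicting_positions(files):
    """files maps a file name to an iterable of (rsid, chromosome, position) records."""
    locations = defaultdict(set)
    for records in files.values():
        for rsid, chromosome, position in records:
            locations[rsid].add((chromosome, position))
    # Keep only the RSIDs that were seen at more than one location.
    return {rsid: sorted(found) for rsid, found in locations.items() if len(found) > 1}


files = {
    "23andMe": [("rs498648", "1", 176957452), ("rs0000001", "2", 5000)],
    "Living DNA": [("rs498648", "1", 176957453), ("rs0000001", "2", 5000)],
}

for rsid, found in rsids_with_conflicting_positions(files).items():
    print(rsid, found)
```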


Positions with more than one RSID

So there were only 68 RSIDs with different positions, and only 4 of them were bad.

However, there are many more positions that have more than one RSID.

I found quite a number of positions on a chromosome where different RSIDs were used for the same SNP.

image

From my 5 raw data files, I had as many as 4 different RSIDs at a specific position.

For example, Chromosome 7, position 117,174,424 has these RSIDs:

  1. rs78440224 in AncestryDNA and Living DNA raw data
  2. i5010947 in 23andMe raw data
  3. i5053851 in 23andMe raw data
  4. VG07S45007 in Family Tree DNA and MyHeritage DNA raw data.

And if you look up rs78440224 in SNPedia, sure enough, they say that SNP is named i5010947 and i5053851 by 23andMe. It doesn’t happen to mention the fourth name though. (And I was happy to see that all four of those SNPs in my raw data have the value GG, which is not the cystic fibrosis carrier.)

The i5010947 and i5053851 RSIDs in the 23andMe raw data file mean that there are two names for the same SNP in the same file. Cases like this will cause the position to occur more than once in the raw data file.
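A tool that reads raw data could find these cases by grouping on chromosome and position instead of on RSID. The following is only a sketch under assumptions: the function name and the record layout (company, RSID, chromosome, position) are made up for illustration, and the sample records are the chromosome 7 example from above.

```python
from collections import defaultdict


def rsids_by_position(records):
    """Return {(chromosome, position): [rsids]} for positions with 2+ names.

    `records` is an iterable of (company, rsid, chromosome, position).
    """
    names = defaultdict(set)
    for _company, rsid, chrom, pos in records:
        names[(chrom, pos)].add(rsid)
    return {loc: sorted(ids) for loc, ids in names.items() if len(ids) > 1}


records = [
    ("AncestryDNA", "rs78440224", "7", 117174424),
    ("Living DNA", "rs78440224", "7", 117174424),
    ("23andMe", "i5010947", "7", 117174424),
    ("23andMe", "i5053851", "7", 117174424),
    ("Family Tree DNA", "VG07S45007", "7", 117174424),
    ("MyHeritage DNA", "VG07S45007", "7", 117174424),
]
shared = rsids_by_position(records)
print(shared[("7", 117174424)])
```

Grouping this way also catches the case where one file lists the same position twice under two names.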


Analysis of the Allele Values

This is what we’ve really been trying to get to. Let’s first see what allele values each company reports.

image

The allele pair corresponding to the alleles on the forward strand of both parents’ chromosomes is given as two letters, with A, C, G and T being the possible choices. Ancestry lists the two alleles as two separate letters, but I’ve put them together in the above table.

Since it is unknown which of the two letters belongs to which parent, the order of display of the two letters is arbitrary. The standard practice is to order the two letters alphabetically, so if you have the choice of AC or CA, then you would use AC. For the most part, the companies follow this standard, but you can see very odd exceptions, e.g. MyHeritage DNA and AncestryDNA both using TC and TG instead of CT and GT. Living DNA often uses both orderings, and unless they’ve thought up something innovative, I doubt the order for a specific value means anything.

23andMe includes values for insertions (II) and deletions (DD) and even has a few deletion/insertions (DI).

Two dashes “--” represent no-calls. These are positions where the values were not able to be determined. AncestryDNA uses two zeros: “00”. For matching purposes, no calls are treated as a match.

When a single letter is given, it is for a chromosome that is not in a pair. Since I’m a male, I have a single X chromosome from my mother and a single Y from my father and everybody’s mt chromosome comes just from their mother. 23andMe uses the single letter designation in this case, but the other companies duplicate the letter.

In order to compare allele values between companies to see if the readings are the same (in the next section), I’ll need to standardize the notation. I’ve chosen to use 2 letters and order them alphabetically as in the “Standardized” column of the above table.
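The standardizing rules described in this section could be sketched in Python roughly as follows. This is an illustration, not the procedure used to build the table: the function name is made up, and treating an empty value or a single "-" or "0" as a no call is an assumption.

```python
NO_CALL = "--"


def standardize(genotype):
    """Return a two-letter, alphabetically ordered genotype, or '--'."""
    g = genotype.strip().upper()
    # Two hyphens (most companies) and two zeros (AncestryDNA) are no calls.
    # Assumption: an empty value or a single "-" or "0" is also a no call.
    if g in ("", "-", "--", "0", "00"):
        return NO_CALL
    # 23andMe gives one letter for unpaired chromosomes; duplicate it.
    if len(g) == 1:
        g = g * 2
    # Order the two letters alphabetically, so TC becomes CT.
    return "".join(sorted(g))


for raw in ["TC", "TG", "AC", "A", "00", "--", "DI"]:
    print(raw, "->", standardize(raw))
```

Insertions and deletions go through the same sort, so DD, DI and II come out unchanged.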

When a value cannot be determined during the test, it is given what is known as a no call and is denoted by two hyphens by most companies, but by two zeros by AncestryDNA. The percentage of no calls is a very important statistic and indicates the quality of the test results. A no call percentage of 3% or more is on the high side and the company may be willing to get new results from your sample or get you to re-test. My results from the five companies range from a low of 0.4% no calls at AncestryDNA to a high of 3.0% at 23andMe.
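The no call percentage is simply the number of no calls divided by the number of readings. A small Python sketch, with a made-up function name and a tiny made-up list of genotypes rather than my real data:

```python
def no_call_percentage(genotypes):
    """Percentage of readings that are no calls ('--' or '00')."""
    values = list(genotypes)
    if not values:
        return 0.0
    no_calls = sum(1 for g in values if g in ("--", "00"))
    return 100.0 * no_calls / len(values)


# Illustrative values only: one no call out of four readings.
sample = ["AA", "--", "CT", "GG"]
print(no_call_percentage(sample))
```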

Below is my standardized table of counts for my autosomal chromosomes:

image

It’s interesting that Living DNA did not find any AT or CG values.

For the X chromosome below, I’ve marked the invalid values. Since I’m male and I only have one X chromosome, values with two different letters are impossible.

image

Next is the Y chromosome. There is a high number of no calls and invalids in the MyHeritage Y DNA data.

image

Only AncestryDNA and 23andMe include the mt chromosome in the raw data:

image


Comparing reads between companies

Now the interesting question. Do the different companies give the same values?

To do this, I re-sorted my combined file of results by chromosome and position, and merged the results for identical positions (SNPs with different RSIDs) together. If any of the readings of the SNPs at the same position conflicted, I was prepared to mark the value at the position as a no-call, but fortunately none did.

I did my analysis and summarized it with the following table:

image

So this table includes the 1,389,750 unique positions that were tested by my five companies. There were 3,346,178 readings in total, so that’s an average of 2.4 readings per position.

I’ve grouped the positions by the number of companies that read from that position in my five sets of raw data and then by the number of those reads that were no calls.

For example, the first line says that 111,872 positions were read by all five companies. Only 19 of those have a disagreement among the 5 companies. For those 19 positions where there are disagreements, I would change the value to a no call, so 19 x 5 = 95 values will get changed to a no call.

The second line says that 2,353 positions were read by all five companies, but in each case one of the companies had a no call. Only 14 of those have a disagreement among the 5 companies. A no call does not count as a disagreement. For the 2,339 agreements, the no call can be given the value that the other companies agreed upon. For the 14 positions where there are disagreements, I would change the value to a no call, so 14 x 4 = 56 values will get changed to a no call.
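The rule described in the last two paragraphs can be written as a short function. This is a sketch with a made-up name, and it assumes the readings have already been put into the standardized two-letter notation with "--" for a no call.

```python
NO_CALL = "--"


def consensus(readings):
    """Combine the standardized readings for one position into one value.

    A no call never counts as a disagreement. If every called value is
    the same, that value is used (which also fills in the no calls).
    If the called values disagree, the position becomes a no call.
    """
    called = {g for g in readings if g != NO_CALL}
    if len(called) == 1:
        return called.pop()
    return NO_CALL  # either nothing was called or the companies disagree


print(consensus(["AA", "AA", "--", "AA", "AA"]))  # four agree, one no call
print(consensus(["TT", "AA"]))                    # disagreement
```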

In total, 2 or more companies disagree at only 665 positions, which is only 0.05% of the positions. That’s very good!

By doing this, I can assign 42,230 values to no call readings and only have to assign no calls to 1,692 readings. That reduces the number of no call readings from 73,127 to 73,127 – 42,230 + 1,692 = 32,589. So I have effectively reduced my percentage of no calls down from 2.19% to 0.97% of the readings the companies supplied to me.
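For anyone who wants to check the arithmetic, here it is in a few lines of Python. The numbers are the ones quoted above. Only the variable names are made up.

```python
total_readings = 3_346_178   # readings supplied by the five companies
no_calls_before = 73_127     # no call readings before merging
filled_in = 42_230           # no calls given the value the others agreed on
newly_no_call = 1_692        # readings changed to no call by disagreements

no_calls_after = no_calls_before - filled_in + newly_no_call
pct_before = round(100 * no_calls_before / total_readings, 2)
pct_after = round(100 * no_calls_after / total_readings, 2)

print(no_calls_after)  # 32589
print(pct_before)      # 2.19
print(pct_after)       # 0.97
```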


Creating a combined raw data file

Well, it seems like I should take the next step and create a raw data file from these 1,389,750 positions.

Earlier I noted, but forgot to correct, the X and Y values that are impossible for me because their two letters differ. Changing those to no calls adds 304 no calls to my X values and 57 no calls to my Y values.
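That correction could look something like the following in Python. This is a sketch under the same rule I used, namely that for a male any X or Y value with two different letters is invalid. The function name and the parameter for which chromosomes are single-copy are made up for illustration.

```python
NO_CALL = "--"


def fix_single_copy(chromosome, genotype, single_copy=("X", "Y")):
    """Turn an impossible two-different-letter value into a no call.

    `genotype` is assumed to be in the standardized two-letter form.
    `single_copy` lists the chromosomes treated as unpaired (male case).
    """
    if chromosome in single_copy and genotype[0] != genotype[1]:
        return NO_CALL
    return genotype


print(fix_single_copy("X", "AG"))  # invalid for a male, becomes a no call
print(fix_single_copy("X", "GG"))  # valid, kept
print(fix_single_copy("7", "AG"))  # autosomal, kept
```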

Here’s a summary of what I’ve got, with a comparison of what I got from the five sets of raw data from the five companies. My percent no calls is shown on the bottom line.

image

Note that 23andMe gave me 4,301 mt readings but I only have 2,483. That’s because 23andMe’s mt data included many SNPs with identical positions and I merged SNPs with the same position into one. In all cases, the SNPs that got merged all had the same value.

Now which company’s raw data should I emulate? The goal would be to create a raw data file that other utilities can read. Since I’ve got v5 chip data, I likely should use either 23andMe or Living DNA’s format. 23andMe is the only company that includes insertions and deletions, so I’ll use their format and follow 23andMe’s naming convention and name the file: genome_Louis_Kessler_v5_Full_20180831124000.txt.

23andMe uses tabs rather than spaces in between fields, so I used my text editor and converted all the spaces to tabs.

The first 10 data lines in the original 23andMe data file I got from them were:

image

And the first 10 lines of the file I have manufactured are:

image

Note there are extra SNPs from the other companies, and that SNP i713426 whose value was a no call in the 23andMe file is now filled in because the value AA was provided to me by Living DNA.

So this file is 35 MB in size and has 1,389,770 lines that include my 1,389,750 SNPs plus 19 description lines and one title line at the top.
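A file laid out like that could also be written by a program instead of a text editor. Here is a minimal Python sketch. The function name is made up, the exact wording of the title line is my assumption about the 23andMe layout, and the SNP in the example is a made-up placeholder, not one of my real values.

```python
import io


def write_raw_data(out, snps, description_lines=()):
    """Write tab-separated SNP lines in a 23andMe-style layout.

    `snps` is an iterable of (rsid, chromosome, position, genotype).
    Description lines and the title line start with '#'.
    """
    for line in description_lines:
        out.write(f"# {line}\n")
    # Assumption: title line naming the four tab-separated columns.
    out.write("# rsid\tchromosome\tposition\tgenotype\n")
    for rsid, chrom, pos, genotype in snps:
        out.write(f"{rsid}\t{chrom}\t{pos}\t{genotype}\n")


# Placeholder data for illustration only.
buffer = io.StringIO()
write_raw_data(buffer, [("rs0000001", "1", 1000, "AA")],
               description_lines=["Combined raw data (example header)"])
lines = buffer.getvalue().splitlines()
print(lines[-1].split("\t"))
```

To write a real file, the same function can be given a file opened with `open(path, "w", encoding="ascii", newline="\n")` in place of the `StringIO` buffer.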

And if you’re curious, the Excel file that I used to do all this analysis for this article is 186 MB in size.


Uploading to GEDmatch Genesis

I entered the file upload information and pressed the Upload button.

image 

It would not take the first file I tried uploading. I compared it to my raw data file from 23andMe and noticed that my file was UTF-8 with a byte order mark at the front of it. I saved the file as an ANSI/ASCII file and then GEDmatch Genesis accepted it without error and identified it as a 23andMe kit type V3.
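I fixed this by re-saving from my text editor, but the byte order mark can also be removed with a few lines of code. A minimal Python sketch, with a made-up function name and a short made-up sample instead of the real file:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A short sample standing in for the start of the raw data file.
sample = codecs.BOM_UTF8 + b"# rsid\tchromosome\tposition\tgenotype\n"
cleaned = strip_utf8_bom(sample)

# Raises UnicodeDecodeError if any non-ASCII bytes are still in the data.
cleaned.decode("ascii")
print(cleaned[:6])
```

For a real file, the bytes would be read with `open(path, "rb")` and the cleaned bytes written back with `open(path, "wb")`.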

I’m not sure what I’ll do with it yet on GEDmatch Genesis. Maybe I’ll determine how it compares there with my 23andMe kit that I uploaded there back in January.

Any suggestions?


Meanwhile…Full Genome!!!

A full genome for less than $1000. That was the magic goal that labs had been trying to achieve.

A couple of days ago, while researching information for this article, I discovered an unbelievable deal by Dante Labs. They currently are offering Whole Genome Sequencing 30x marked down to $499 from $1000. I don’t know if that price is permanent or not, but it may be. They currently have a coupon code Dazzle4Rare you can use at checkout to save another $100. Global shipping is free.

That Dante Labs deal is so good, I couldn’t resist so I purchased a whole genome sequence for myself for just $399. I should get the test kit next week and then it will take about 10 weeks to process after the lab gets my sample.

Apparently during Amazon Prime Day, they offered it for $349.

So once I get the results back, I’ll see if I can compare the 1,389,750 SNPs that I put together here with the same alleles in my full genome and see what it tells me.

Followup: Sept 4, 2018: I was informed that in the last year or so, MyHeritage DNA stopped reporting about 30,000 SNPs from their tests and includes them now in your raw data as no calls. If I did the test again, I now might get close to 50,000 no calls (6.9%) rather than the 18,700 (2.6%) that I observed.

Followup: June 5, 2020:  I updated the first table comparing my 5 raw data files to include the date I took them and the chip that was used.

Followup: Nov 11, 2021:  Readers of this article might also be interested in my article from Apr 10, 2020: Determining the Accuracy of DNA Tests