
Louis Kessler’s Behold Blog

Upload Your Raw DNA Data to Borland Genetics - Mon, 25 May 2020

There’s another website I recommend you upload your raw DNA data to: Borland Genetics.

image

See this video: Introducing Borland Genetics Web Tools

In a way, Borland Genetics is similar to GEDmatch in that they accept uploads of raw data and don’t do their own testing. Once uploaded, you can then see who you match to and other information about your match. Borland Genetics has a non-graphic chromosome browser that lists your segment matches in detail.    
   
But Borland Genetics has a somewhat different focus from all the other match sites. This site is geared to help you reconstruct the DNA of your ancestors and includes many tools to help you do so. And you can search for matches of your reconstructed relatives, and your reconstructed relatives will also show up in the match lists of other people.

Once you upload your raw data and the raw data from some tests done by a few of your relatives, you’re ready to use the exotically named tools that include:

  • Ultimate Phaser
  • Extract Segments
  • Missing Parent
  • Two-Parent Phase
  • Phoenix (partially reconstructs a parent using raw data of a child and relatives on that parent’s side)
  • Darkside (partially reconstructs a parent using raw data of a child and relatives that are not on that parent’s side)
  • Reverse Phase (partially reconstructs grandparents using a parent, a child, and a “phase map” from DNA Painter) 

Coming soon is the ominously named Creeper, which will be guided by an Expert System that uses a bodiless computerized voice to instruct you what your next steps should be.

There’s also the Humpty Dumpty merge utility that can combine multiple sets of raw data for the same person, and a few other tools.

The above tools are all free at Borland Genetics, and there are a few additional premium tools available with a subscription. You can use them to create DNA kits for your relatives. You can then download the kits if you want to analyze them yourself or upload them to other sites that allow uploads of constructed raw data.

By comparison, GEDmatch has only two tools for ancestor reconstruction: one called Lazarus and one called My Evil Twin. Both tools are part of GEDmatch Tier 1, so you need a subscription to use them. Also, you can only use the results on GEDmatch, because GEDmatch does not allow you to download raw data.


Kevin Borland

The mastermind behind this site is Kevin Borland. Kevin started building the tools he needed for himself for his own genetic genealogy research a few years ago and then decided, since there wasn’t one already, to build a site for DNA reconstruction. See this delightful Linda Kvist interview of Kevin from Apr 16, 2020.

In March 2020, Kevin formally created Borland Genetics Inc. and partnered with two others to ensure that this work would continue forward.

If you are a fan of the BYU TV show Relative Race (and if you are a genealogist, you should be), then you should know that Kevin was the first relative visited by team Green in Season 2.  See him at the end of Season 2 Episode 1 starting about 32:24.


Creating Relatives

I have not been as manic as many genetic genealogists in getting relatives to test. I have only tested myself and my uncle (my father’s brother). So with only two sets of raw data, what can I do with that at Borland Genetics?

Well, first I uploaded and created profiles for myself and my uncle.

The database is still very small, currently sitting at about 2500 kits. Not counting my uncle, I have 207 matches with the largest being 54 cM. My uncle has 86 matches with the largest being 51 cM. This is interesting because most sites have more matches for my uncle than for me, since he is 1 generation further back.  I don’t know any of the people either of us match with. None of them are likely to be any closer than 4th cousins.

My uncle and I share 1805.7 cM. The chromosome browser indicates we have no FIR (fully identical regions) so it’s very likely that despite endogamy, I’m only matching my uncle on my father’s side.

The chromosome browser suggests three Ultimate Phaser options for me to try:

image

To interpret the results of these, you sort of have to know what you’re doing.

So let me instead try to create some relatives. For that I can first use the Phoenix tool.

image

It allows me to select either myself or my uncle as the donor. I select myself as the donor and press Continue.

image

Here I enter information for my father and press Continue.

image

I now can select all my matches who I know are related on my father’s side. You’ll notice the fourth entry lists the “Source” as “Borland Genetics” which means it is a kit the person created, likely of a relative who never tested anywhere.

In my case, my uncle is the only one I know to be on my father’s side, so I select just him. I then scroll all the way down to the bottom of my match list to press Continue.

image

And while I’m waiting, I can click play to listen to some of Kevin’s music.  After only about 2 minutes (the time was a big overestimate) the music stopped and I was presented with:

image

I now can go to my father’s kit and see what was created for him. His kit type is listed as “Mono” because only one allele (my paternal chromosome) can be determined. The Coverage is listed as 25% because I used his full brother who shares 50% with him, and thus 25% with me.

image

His match list will populate as if he was a person who had tested himself.

I can download my father’s kit:

image

which gives me a text file with the results at every tested position:

image

The pairs of values are all the same because this is a mono kit. Also be sure to use only those SNPs within the reconstructed segments list. There must be an option somewhere to just download the reconstructed segments, but I can’t see it. (Kevin??)
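Restricting a downloaded kit to the reconstructed segments could be done with a few lines of code. The sketch below is only an illustration. It assumes the file is tab-separated with rsid, chromosome, position and genotype columns (the layout several testing companies use) and that the segment list is available as chromosome, start and end positions. Neither assumption is confirmed for the Borland Genetics download, and the names Snp, Segment, parse_raw_data and keep_reconstructed are made up for this example.

```python
from typing import Iterable, List, NamedTuple


class Snp(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


class Segment(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_raw_data(lines: Iterable[str]) -> List[Snp]:
    """Parse raw data lines, skipping blank lines and '#' comment lines.

    Assumed layout (hypothetical): rsid <TAB> chromosome <TAB> position <TAB> genotype
    """
    snps = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")[:4]
        snps.append(Snp(rsid, chromosome, int(position), genotype))
    return snps


def keep_reconstructed(snps: Iterable[Snp], segments: List[Segment]) -> List[Snp]:
    """Keep only the SNPs that fall inside one of the reconstructed segments."""
    return [
        snp
        for snp in snps
        if any(
            seg.chromosome == snp.chromosome and seg.start <= snp.position <= seg.end
            for seg in segments
        )
    ]
```

For a real kit you would pass the opened file to parse_raw_data and build the Segment list from the reconstructed segments shown on the site.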

In a very similar manner (which I won’t show here because it is, well, similar), I can use the Darkside tool to create a kit for my Mother using myself as the child and my Uncle as the family member on the opposite side of the tree.


Reconstructing Ancestral Bits

Now I have kits for myself, my uncle, my father and my mother. Can I do anything else?

Well yes! I can use my analysis from DNA Painter to define my segments by ancestor.

image

I just happened to have the DNA Painter analysis done already, which I used Double Match Triangulator for. Using DMT, I created a DNA Painter file from my 23andMe data for just my father’s side:

image

I labelled them based on the ancestor I identified, e.g. FMM = my father’s mother’s mother. I downloaded the segments from DNA Painter and clicked “Choose File” in Borland Genetics and it gave me my 5 ancestors with the same labeling to choose from.

  image

I select “FF”, click on “Extract Selected Segments” and up comes a screen to create a Donor Profile for my paternal grandfather!

image

Wowzers! I have now just created a DNA profile for a long-dead ancestor, and I can do the same for 4 more of my ancestors on my father’s side.

Just a couple of days ago, I think I was asking Kevin for this type of analysis. Only today when writing this post, did I see that he already had it.


Summary

I only have my own and my uncle’s raw data to work with, yet I can still do quite a bit. For people who have parents, siblings and dozens of others tested … well I’m enviously drooling at the thought of what you can do at Borland Genetics with all that.

There is a lot more to the Borland Genetics site than I have discussed here. There are projects you can create or join. Family tree information. Links to WikiTree. You can send messages to other users. There are advanced utilities you can get through subscription.

The site is still under development and Kevin is regularly adding to it. Kevin started a Borland Genetics channel on YouTube, and over the past 2 years he made an excellent 20 episode series of YouTube videos on Applied Genetics. And he runs the Borland Genetics Users Group on Facebook, now with 738 members. I don’t know how he finds the time.

So now, go and upload your raw data kits to Borland Genetics, help build up their database of matches, and try out all the neat analysis it can do for you.

OneDrive’s Poison Setting - Fri, 8 May 2020

OneDrive’s default setting of no limit for network upload and download rates has caused years of Internet problems at my house. Unbeknownst to us, it would from time to time consume most or all of the Internet bandwidth, affecting me when on my ethernet-connected desktop computer and affecting everyone else in my house connected with their devices to our Wi-fi. It is now obvious to me that this hogging of bandwidth happened following any significant upload of pictures or files from my desktop computer to OneDrive, and the effect sometimes lasted for days!

Yikes! I’m flabbergasted at how we finally discovered the reason behind our Internet connection problems. A number of times in the past few years, we’ve found the Wi-fi and TV in the house to be spotty. We had got used to unplugging the power on the company-supplied modem and waiting the 3 or 4 minutes for it to reset. Often that seemed to improve things, or maybe the reset just made us feel it had done so – we don’t really know. We’ve called our supplier several times, and they came over, inspected our lines, checked our modem. In all cases, the problem repaired itself, if not immediately, then over the course of a few days.

It didn’t get really bad too often. But it did about 2 months ago, just after my wife and I got back from a wonderful Caribbean cruise (which we followed up with 2 weeks of just-in-case self-isolation at home). I had to replace my computer, and very shortly after the new one was installed, we had several days of Internet/TV problems.

I called my service provider (BellMTS) and I told them about the poor service we were having and they tried to help over the phone. We rebooted the modem several times but that wasn’t helping.

image

They sent a serviceman to check the wiring from our house to the distribution boxes on our block. We thought that might have helped and it was not long after that it seemed everything was pretty good.

We had very few problems over the next 6 weeks, but just last night, I was in the middle of an Association of Professional Genealogists Zoom webinar (Mary Kircher Roddy – Bagging a Live One; Reverse Genealogy in Action), when suddenly I lost my Internet in my other windows and my family lost the Internet on their devices. Our TV was even glitching. However the Zoom webinar continued on uninterrupted. I could not at all figure this out.

After the webinar ended, I called my Internet/TV provider and things seemed to improve. The next morning, the troubles reoccurred. I called my provider again. They sent a serviceman. He came into the house (respecting social distancing) and cut the cable at our box so they could test the wiring leading to our house. He was away for over an hour doing that. When he came back, they had set up some sort of new connectors. He reconnected us. But no, we still had the problem. He then found what he thought was a poorly wired cable at the back of the modem. He fixed that, but still the problem. Then he replaced our modem and the power supply and the cabling. Still the problem.

We were monitoring the problem using speedtest.net. We’ve got what’s called the Fibe 25 plan**. We should be getting up to 25 Mbps (megabits per second) download and up to 3 Mbps upload. We were getting between 1 and 2 Mbps download and 1 Mbps upload. Not good.

After several more attempted resets and diagnostic checks, we were now 3 hours into this service call. The serviceman’s next idea was the one that worked. He said turn off all devices connected to the Internet. Then turn them on one-by-one and we might find it is a device we have that’s causing the problem. We did so and when we got to my ethernet connected computer, it was the one slowing everything. The serviceman said there it is, found the reason. He couldn’t stay any more and left.

I checked and sure enough, when my computer was on, we got almost no Internet, but when it was off, everything was fine. Here was the speed test with my computer off:

image

When I went to the network settings to see if it was a problem with my ethernet cable, I could see a large amount of Activity, with the Sent and Received values changing quite quickly:

image

My first thought was that maybe my computer was hacked. I opened Task Manager and sorted by the Network column to see what was causing all the Network traffic. There was my answer, in number 1 place consuming the vast majority of my network was: Microsoft OneDrive.

My older daughter immediately commented that she had long ago stopped using the free 1 TB of OneDrive space we each get by being Microsoft 365 subscribers because she found it hogged all her resources.

Eureka! 2 months ago what had I done? I had uploaded all my pictures and videos from our trip to OneDrive. And what was I doing while watching that Zoom webinar last night? I was uploading several folders of pictures and videos to OneDrive. What I wasn’t doing during the 6 weeks in between was any significant uploading to OneDrive.

In Task Manager, I ended the OneDrive task. Sure enough my download speed from speed test went back up to good numbers, and our Internet/TV problem had finally been isolated.

It didn’t take me long to search the Internet to find that OneDrive had network settings. The default was (horrors) a couple of “Don’t limit” settings. The “Limit to” boxes, which were not selected, both had suggested defaults of 125 KB/s (kilobytes per second). I did some calculations and selected them and set the upload value to 100 KB/s and left the download value at 125 KB/s: 

image

Note that these are in KB/s whereas Speedtest gives Mbps. The former is thousands of bytes and the latter is millions of bits. There are 8 bits in a byte. So 125 KB/s = 1.0 Mbps, which is about 4% of my 25 Mbps download capacity, and 100 KB/s = 0.8 Mbps, which is less than 30% of my 3 Mbps upload capacity. Now when OneDrive is syncing, there should be plenty left for everyone else. Yes, OneDrive will take several times longer to upload now. But I and my family should no longer have it affecting our Internet and TV in a significant way any more.
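The conversion is easy to check with a few lines of Python. This is just a sketch of the arithmetic above, assuming 1 KB = 1000 bytes and 1 Mbps = 1,000,000 bits per second; the function names are made up for the example.

```python
def kbytes_per_sec_to_mbps(kb_per_s: float) -> float:
    """Convert kilobytes per second to megabits per second (1 KB = 1000 bytes assumed)."""
    bits_per_second = kb_per_s * 1000 * 8
    return bits_per_second / 1_000_000


def share_of_plan(kb_per_s: float, plan_mbps: float) -> float:
    """Fraction of a plan's capacity that a given OneDrive limit would use."""
    return kbytes_per_sec_to_mbps(kb_per_s) / plan_mbps


# Limits chosen in the post, against the Fibe 25 plan (25 Mbps down, 3 Mbps up)
download_share = share_of_plan(125, 25)
upload_share = share_of_plan(100, 3)
print(f"download limit uses {download_share:.0%} of capacity")
print(f"upload limit uses {upload_share:.0%} of capacity")
```

The same two functions can be reused to pick new limits if the plan changes.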

Also notice there’s an “Adjust automatically” setting. Maybe that is the one to choose, but unfortunately they don’t also have that setting on the Download rate, which is maybe more important.

My wife and daughters have complained to me for a number of years claiming my computer was slowing the Internet. Up to now, I did not see how that could be. Yes, as it turns out, it was technically coming from my computer, but the culprit in fact was OneDrive’s poison setting. I am someone who turns off my desktop computer when I am not using it and also every night I don’t have it working on anything. No wonder our problems were spotty. When my computer was off, OneDrive could not take over. So my family was right all along.

Well that’s now fixed. I will let my TV/Internet provider know about this so that they can save their time and their customers’ time when someone else has a similar intermittent internet problem which may be OneDrive. I will also let Microsoft know through their feedback form and hopefully they will one day decide to either change their default network traffic settings to something that would not affect the capacity of most home Internet providers, or change the algorithm so that “unlimited” has a lower priority than all other network activity. Maybe that “adjust automatically” setting is the magic algorithm. If so, it could be the default but it should also be added as an option on the Download rate, to eliminate OneDrive’s greediness.

Are you listening Microsoft?

And I’d recommend anyone who uses OneDrive to check out if you have no limit on your OneDrive Network settings. If you do, change them and you might see the speed and reliability of your Internet improve dramatically.


—-

**Note:  The Fibe 25 plan is the maximum now available from BellMTS in our neighborhood. They are currently (and I mean currently since my front lawn is all marked up) installing fiber lines in our neighborhood that will allow much higher capacity. Once installed, I should have access to their faster plans, and will likely subscribe to their Fibe 500 plan for only $20 more per month. That will give up to 500 Mbps download (20x faster) and 500 Mbps upload (167x faster). They have even faster plans, but that should be enough because our wi-fi is 20 MHz which is only capable of 450 Mbps. My ethernet cable (which was hardwired in from the TV downstairs to my upstairs office when we built the house 34 years ago) is capable of 1.0 Gbps which is 1000 Mbps. Once we switch plans, I’ll likely give OneDrive higher limits (maybe 100 Mbps both ways) and it will be a new world for us at home on the Internet. 

Determining the Accuracy of DNA Tests - Fri, 10 Apr 2020

In my last post, New Version of WGS Extract, I used WGS_Extract to create 4 extracts from 3 BAM (Binary Sequence Alignment Map) files from my 2 WGS (Whole Genome Sequencing) tests.

These extracts each contain about 2 million SNPs that are tested by the five major consumer DNA testing companies: Ancestry DNA, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA.

Almost two years ago, I posted: Comparing Raw DNA from 5 DNA Testing Companies to see how different the values were. Last year, in Determining VCF Accuracy, I estimated Type I and Type II error rates from two VCF (Variant Call Format) files that I got from my WGS (Whole Genome Sequencing) test.

But in those articles, I was not able to estimate how accurate each of the tests was. To do so, you need to know what the correct values are, in order to be able to benchmark the tests. But with my 4 WGS extracts and my 5 company results, I now have enough information to make an attempt at this.

For this accuracy estimation, I’m going to look at just the autosomal SNPs, those from chromosome 1 to 22. I’ll exclude the X, Y and mt chromosomes because they each have their own properties that make them quite different from the autosomes.

Let me first summarize what I’ve got. Here are the counts of my autosomal allele values from each of my standard DNA tests. I’m not including test version numbers, because different places list them differently, so instead I’m including when I tested:

image

Comparing the above table to the one from my Comparing Raw DNA article last year, all values are the same except the 23andMe column. Last year’s article totalled 613,899 instead of 613,462, a difference of 437. I’m not sure why there’s this difference, but I do know this new value is correct. Whatever mistake I might have made should not have significantly affected my earlier analysis.

I find it odd that 23andMe and Living DNA both have half as many AC and AG values as the other companies. I also find it odd that Ancestry DNA has twice as many of the AT and CG values as the other companies, and that Living DNA has no AT or CG values. I have no explanation for this.

23andMe is the only company that identified and included any insertions and deletions (INDELs), the II, DD and DI values, that it found.

The double dash “--” values are called “no calls”. Those are positions tested that the company algorithm could not determine a value for. The percentage of no calls ranges from a low of 0.4% in my Ancestry DNA data to a high of 2.8% in my FTDNA data. Matching algorithms tend to treat no calls as a match to any value.
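To make the no-call idea concrete, here is a rough sketch of how a no-call rate could be computed and how a matcher might treat a no call as a wildcard. It is a simplification under stated assumptions, not any company’s actual algorithm: genotypes are two-letter strings, a no call is written as "--", and two genotypes count as a half-identical match when they share at least one allele. The function names are made up for the example.

```python
from typing import Iterable

NO_CALL = "--"


def no_call_rate(genotypes: Iterable[str]) -> float:
    """Fraction of tested positions where no value could be determined."""
    values = list(genotypes)
    if not values:
        return 0.0
    return sum(g == NO_CALL for g in values) / len(values)


def half_identical(a: str, b: str) -> bool:
    """True if two genotypes can share an allele; a no call matches anything."""
    if a == NO_CALL or b == NO_CALL:
        return True
    return bool(set(a) & set(b))
```

For example, half_identical("--", "CT") is treated as a match, while "AA" against "TT" is not.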

Below are the counts from my WGS tests:

image

I have done two WGS tests at Dante Labs: a Short Reads test and a Long Reads test.

For the Short Reads test, Dante used the program BWA (Burrows-Wheeler Aligner) to create a Build 37 BAM file. I then used WGS Extract to extract all the SNPs it could.

For my Long Reads test, I used the program BWA to create a Build 37 BAM file (see: Aligning My Genome). But BWA was not supposed to be good for Long Reads WGS, so I had YSeq use the program minimap2 to create a Build 37 BAM file.

The WGS Extract program would not work on my Long Reads file until I added the -B parameter to the mpileup program. The -B parameter disables the BAQ (Base Alignment Quality) computation, which is meant to reduce the false SNPs caused by misalignment. Because I had to add -B to get the Long Reads to work, I also did a run with -B added to my Short Reads so that I could see the effect of the -B parameter on the accuracy.

When I used WGS Extract a year ago (see: Creating a Raw Data File from a WGS BAM file), it produced a file for me with 959,368 SNPs from my Short Reads WGS file and I was able to use it to improve my combined raw data file.

  

Accuracy Determination

Now I’ll use the above two sets of data to determine accuracy. By accuracy, I’m interested in knowing if a test is saying that a particular position has a specific value, e.g. CT, then what is the probability that the CT reading is correct?

I will ignore all no calls in this analysis. If a test says it doesn’t know, it isn’t wrong. Having no calls is preferable to having incorrect values.

I will also ignore the 4518 SNPs where 23andMe says there is an insertion or deletion (II or DD or DI). The reason is that few of the other standard tests have values on those SNPs (which is good) but almost all the WGS test results do have a value there (which is conflicting information and bad!). Somehow WGS Extract needs to find a way to identify the INDELs so that it doesn’t incorrectly report them as seemingly valid SNPs. Of course some of 23andMe’s reported INDELs might be wrong, but I don’t have multiple sources reporting the INDELs to be able to tell for sure. I do have my VCF INDEL file from my Short Reads WGS, but then it’s just one word against another. A quick comparison showed that some 23andMe reported INDELs are in my VCF INDEL file, but some are not.

So first I’ll determine the accuracy of the standard DNA tests, then of the WGS tests.



The Accuracy of Standard Microarray DNA Tests

I have 4 WGS extracts from 2 WGS tests using different alignment or extraction methods. There are 1,851,128 out of the over 2 million autosomal positions where all 4 WGS readings were the same and were not no calls and the 23andMe value was not an insertion or deletion.

Since all 4 extracts agree, let’s assume the agreed upon values are correct.

I compared these with the values from each of my 5 standard tests:

image

That’s not bad. An error rate of 0.5% or less. Fewer than 1 error in 197 values. FTDNA and MyHeritage’s tests were the best with an error rate of about 1 out of 600 values.
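The comparison behind a table like this can be sketched in a few lines. This is only an illustration of the approach described above, not the code actually used: each data set is assumed to be a dictionary from a position key (such as an rsid) to a two-letter genotype, allele order is ignored (CT is treated the same as TC), strand differences are not handled, and the function names are made up.

```python
from typing import Dict, Iterable, List, Tuple

NO_CALL = "--"


def normalize(genotype: str) -> str:
    """Ignore allele order so that 'TC' and 'CT' compare equal."""
    return "".join(sorted(genotype))


def unanimous_reference(
    wgs_extracts: List[Dict[str, str]], exclude: Iterable[str] = ()
) -> Dict[str, str]:
    """Positions where every WGS extract has the same called value."""
    excluded = set(exclude)  # e.g. positions 23andMe reports as INDELs
    shared = set.intersection(*(set(e) for e in wgs_extracts)) - excluded
    reference = {}
    for key in shared:
        calls = {normalize(e[key]) for e in wgs_extracts}
        if len(calls) == 1 and NO_CALL not in calls:
            reference[key] = calls.pop()
    return reference


def error_rate(test: Dict[str, str], reference: Dict[str, str]) -> Tuple[int, int]:
    """Return (errors, positions compared), skipping no calls and untested positions."""
    errors = compared = 0
    for key, genotype in test.items():
        if genotype == NO_CALL or key not in reference:
            continue
        compared += 1
        if normalize(genotype) != reference[key]:
            errors += 1
    return errors, compared
```

Dividing errors by positions compared gives the error rate for one test.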

These tests are all known as microarray tests. They do not test every position, but only test certain positions. They are very different from WGS and are expected to have a lower error rate than WGS tests. Of course, they often include as much as 3% no calls in their results, but that’s the tradeoff required to help them minimize their Type I false positive errors.



The Accuracy of Whole Genome Sequencing Tests

WGS tests have several factors involved in their accuracy. One is the accuracy of their individual reads which in the case of Long Read WGS is said to be much worse than Short Read WGS, maybe even as bad as 1 in 20. But those inaccurate reads are offset by excellent alignment algorithms that have been tuned to handle high error rates. This is a necessary requirement anyway because the algorithms need to handle insertions and deletions as well.

Another factor in accuracy is coverage rate, and 30x is considered to be what will give reasonably accurate results. If you have 30 segments mapped over a SNP, and 13 of them say “A” and 16 of them say “T” and 1 says “C”, then the value is likely “AT”. If 27 are “A” and 3 are “T” then the value is likely “AA”. They’ve been doing this for a long time and know the probabilities and they’ve got this down to a science (pun intended).
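The read-counting idea can be illustrated with a toy genotype caller. Real variant callers use probabilistic models that weigh base and mapping quality; the sketch below is not one of them. It simply assumes an allele counts as present when at least 20% of the reads covering the position show it. That 20% threshold and the function name are arbitrary choices for this example.

```python
from collections import Counter


def call_genotype(bases: str, min_fraction: float = 0.2) -> str:
    """Call a genotype from the bases read at one position.

    bases: one letter per read covering the SNP, e.g. "AAATTA..."
    min_fraction: assumed cutoff for treating an allele as real rather than a read error
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return "--"  # no coverage, so no call
    supported = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    if len(supported) == 1:
        return supported[0] * 2  # homozygous
    if len(supported) == 2:
        return "".join(supported)  # heterozygous
    return "--"  # ambiguous, so no call


print(call_genotype("A" * 13 + "T" * 16 + "C"))  # the 13 A / 16 T / 1 C case
print(call_genotype("A" * 27 + "T" * 3))         # the 27 A / 3 T case
```

With these inputs the lone C and the three T reads fall under the cutoff and are treated as read errors.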

So my question is: what is the accuracy of the SNPs in my four WGS extracts? To determine this, I’ll do the opposite of what I did before. I’m going to find all the SNPs where at least 3 of my standard DNA tests gave the same value and the others either gave a no call or did not test that SNP. From my above analysis, each of my standard tests should have an error rate no worse than 1 in 200, so three or more different tests with all the same value should not be wrong very often. I’ll compare them with every position in my 4 extracts that has a value and is not a no call. Here are my results:

image

So my Short Reads test gave really good results. Only 1 in over 1300 disagreed with my standard tests. That’s quite acceptable. The -B option seemed to have little effect on the accuracy.
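The rule used to pick the benchmark values (at least 3 standard tests agree, and the rest are no calls or did not test the SNP) can be written down as a small function. This is a sketch of that rule only, with assumptions: a no call is "--", an untested SNP is represented by None, allele order is ignored, and the function name is hypothetical.

```python
from typing import Optional, Sequence

NO_CALL = "--"


def agreed_value(calls: Sequence[Optional[str]], min_agree: int = 3) -> Optional[str]:
    """Return the consensus genotype for one SNP, or None if the rule is not met.

    calls: one entry per standard test; None means the test does not include the SNP.
    """
    called = ["".join(sorted(c)) for c in calls if c is not None and c != NO_CALL]
    if len(called) >= min_agree and len(set(called)) == 1:
        return called[0]
    return None
```

For example, agreed_value(["CT", "TC", "CT", "--", None]) gives "CT", while a single disagreeing test makes the function return None.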

But those Long Reads tests – ooohh! I’m very disappointed. 7.7% of the values in my Long Reads BAM file created with BWA were different from my standard tests. Using minimap2 instead of BWA only reduced that to 6.6%. This is not acceptable for SNP analysis purposes. The penalty for getting the wrong health interpretation of a SNP can be disastrous.

I’m very disappointed in this Long Reads result. Even though Long Reads are known to have higher error rates in individual readings, I would have thought that the longer reads along with good alignment algorithms that take into account possible errors, would give good values once you have a 30x coverage. If 1 out of 10 values are read wrong, then 27 out of 30 values should be correct.
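That expectation can be put into numbers with a simple binomial calculation. The sketch below assumes each of the 30 reads is wrong independently with probability 0.1 and looks at a single homozygous position. Real read errors are not independent and alignment adds its own errors, so this only shows why random read errors alone seem unlikely to explain the result.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more wrong reads out of n, each wrong with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


reads = 30
read_error = 0.10  # assumed per-read error rate (1 in 10)

expected_correct = reads * (1 - read_error)
prob_majority_wrong = prob_at_least(reads // 2 + 1, reads, read_error)

print(f"expected correct reads: {expected_correct:.0f} of {reads}")
print(f"chance that more than half the reads are wrong: {prob_majority_wrong:.2e}")
```

Under these assumptions the chance of a wrong majority at any one position is tiny, which is consistent with something other than random read errors being at work.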

So something else is happening here. This high error rate can come from one of several places. It could be read errors, transcription errors, algorithm errors, problems in any of the programs in the pipelines to create the BAM files, or problems in the programs that WGS Extract uses, such as the mpileup program.

So then can the Short Reads test values still be used? Well, I still have one outstanding problem with them. That’s with regards to INDELs as reported in my 23andMe test. Unfortunately, the results out of WGS Extract give SNP values at almost all of the INDEL positions. In the table below, I compare only the INDEL positions out of all the 23andMe positions that match each test:

image

Now I’m still not sure if the 23andMe value is correct or if the long read value is correct, but reporting a SNP value where there is an INDEL could be happening as much as 0.8% of the time, at least in the values reported by WGS Extract. This is something that needs to be looked at by the WGS Extract people to see if they can prevent this.



Conclusions

For genealogical purposes and relative matching on the various sites including GEDmatch, the standard microarray-based DNA tests are good enough.

Don’t ever expect that your DNA raw data is perfect. There are going to be incorrect values in it. Most matching algorithms for genealogists allow for an error every 100 SNPs or so. Some even introduce new errors with imputation. As long as errors are kept to under 1 in 100 or so, differences in analysis for genealogical purposes should be small. But because of these inaccuracies, nothing is exact.

It is worthwhile if you upload to a site, to improve the quality of your data by using a combined file made up of all the agreeing values from your DNA tests.  See my post on The Benefits of Combining Your Raw DNA Data.

WGS tests are worthwhile for medical purposes, but are probably overkill for genealogy. The WGS files you need to work with are huge, requiring a powerful computer with large amounts of free disk space. Downloading your data takes days and uploading your data to an analysis site is impossible on most home internet services. The programs to analyze these files are made for geneticists and are designed for the Unix platform.

There are not many programs designed for genealogists that analyze WGS data. The program WGS Extract is excellent, but you will need to know what you are doing. Until they find a way to filter out the INDELs, you’ll have to be careful in using the raw data files that the program produces.




Followup Nov 20, 2021:  I found that a raw data file download from a company can change over time, and I posted an article: Your DNA Raw Data May Have Changed. I now think this is the reason for the discrepancy I mention above in my 23andMe counts.