
Louis Kessler’s Behold Blog

OneDrive’s Poison Setting - Fri, 8 May 2020

OneDrive’s default setting of no limit on network upload and download rates has caused years of Internet problems at my house. Unbeknownst to us, it would from time to time consume most or all of our Internet bandwidth, affecting me on my ethernet-connected desktop computer and affecting everyone else in the house connected to our Wi-fi with their devices. It is now obvious to me that this hogging of bandwidth happened after any significant upload of pictures or files from my desktop computer to OneDrive, and the effect sometimes lasted for days!

Yikes! I’m flabbergasted at how we finally discovered the reason behind our Internet connection problems. A number of times in the past few years, we’ve found the Wi-fi and TV in the house to be spotty. We had gotten used to unplugging the power on the company-supplied modem and waiting the 3 or 4 minutes for it to reset. Often that seemed to improve things, or maybe the reset just made us feel it had – we don’t really know. We’ve called our supplier several times, and they came over, inspected our lines and checked our modem. In every case, the problem repaired itself, if not immediately, then over the course of a few days.

It didn’t get really bad too often. But it did about 2 months ago, just after my wife and I got back from a wonderful Caribbean cruise (which we followed up with 2 weeks of just-in-case self-isolation at home). I had to replace my computer, and very shortly after the new one was installed, we had several days of Internet/TV problems.

I called my service provider (BellMTS), told them about the poor service we were getting, and they tried to help over the phone. We rebooted the modem several times, but that didn’t help.

image

They sent a serviceman to check the wiring from our house to the distribution boxes on our block. We thought that might have helped, and not long after, everything seemed pretty good.

We had very few problems over the next 6 weeks. But just last night, I was in the middle of an Association of Professional Genealogists Zoom webinar (Mary Kircher Roddy – Bagging a Live One: Reverse Genealogy in Action) when suddenly I lost the Internet in my other windows and my family lost the Internet on their devices. Our TV was even glitching. Yet the Zoom webinar continued uninterrupted. I could not figure this out at all.

After the webinar ended, I called my Internet/TV provider and things seemed to improve. The next morning, the troubles reoccurred. I called my provider again. They sent a serviceman. He came into the house (respecting social distancing) and cut the cable at our box so they could test the wiring leading to our house. He was away for over an hour doing that. When he came back, they had set up some sort of new connectors. He reconnected us. But no, we still had the problem. He then found what he thought was a poorly wired cable at the back of the modem. He fixed that, but the problem remained. Then he replaced our modem, the power supply and the cabling. Still the problem.

We were monitoring the problem using speedtest.net. We’ve got what’s called the Fibe 25 plan**. We should be getting up to 25 Mbps (megabits per second) download and up to 3 Mbps upload. We were getting between 1 and 2 Mbps download and 1 Mbps upload. Not good.

After several more attempted resets and diagnostic checks, we were now 3 hours into this service call. The serviceman’s next idea was the one that worked. He said to turn off all devices connected to the Internet, then turn them back on one by one, and we might find it was one of our devices causing the problem. We did so, and when we got to my ethernet-connected computer, it was the one slowing everything down. The serviceman said there it is, found the reason. He couldn’t stay any longer and left.

I checked and sure enough, when my computer was on, we got almost no Internet, but when it was off, everything was fine. Here was the speed test with my computer off:

image

When I went to the network settings to see if it was a problem with my ethernet cable, I could see a large amount of Activity, with the Sent and Received values changing quite quickly:

image

My first thought was that maybe my computer was hacked. I opened Task Manager and sorted by the Network column to see what was causing all the network traffic. There was my answer: in first place, consuming the vast majority of my network, was Microsoft OneDrive.

My older daughter immediately commented that she had long ago stopped using the free 1 TB of OneDrive space we each get by being Microsoft 365 subscribers because she found it hogged all her resources.

Eureka! What had I done 2 months ago? I had uploaded all my pictures and videos from our trip to OneDrive. And what was I doing while watching that Zoom webinar last night? Uploading several folders of pictures and videos to OneDrive. And what hadn’t I done during the 6 weeks in between? Any significant uploads to OneDrive.

In Task Manager, I ended the OneDrive task. Sure enough, my Speedtest download speed went back up to good numbers, and our Internet/TV problem had finally been isolated.

It didn’t take me long to search the Internet and find that OneDrive has network settings. The default was (horrors) a pair of “Don’t limit” settings. The “Limit to” boxes, which were not selected, both had suggested defaults of 125 KB/s (kilobytes per second). I did some calculations, selected them both, set the upload value to 100 KB/s and left the download value at 125 KB/s:

image

Note that these are in KB/s whereas Speedtest gives Mbps. The former is thousands of bytes per second and the latter is millions of bits per second. There are 8 bits in a byte, so 125 KB/s = 1.0 Mbps, which is about 4% of my 25 Mbps download capacity, and 100 KB/s = 0.8 Mbps, which is less than 30% of my 3 Mbps upload capacity. Now when OneDrive is syncing, there should be plenty left for everyone else. Yes, OneDrive will take several times longer to upload now. But it should no longer affect our Internet and TV in any significant way.
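To double-check the arithmetic, here’s the conversion in a few lines of Python (the plan speeds are mine; substitute your own):

```python
# Convert OneDrive's KB/s (kilobytes per second) limits to the
# Mbps (megabits per second) that Speedtest reports:
# 1 KB = 1000 bytes and 1 byte = 8 bits.
def kbps_to_mbps(kb_per_s: float) -> float:
    return kb_per_s * 8 / 1000

download_limit = kbps_to_mbps(125)   # 1.0 Mbps
upload_limit = kbps_to_mbps(100)     # 0.8 Mbps

# Share of my Fibe 25 plan's capacity (25 Mbps down, 3 Mbps up):
print(f"Download: {download_limit} Mbps = {download_limit / 25:.0%} of capacity")
print(f"Upload:   {upload_limit} Mbps = {upload_limit / 3:.0%} of capacity")
```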

Also notice there’s an “Adjust automatically” setting. Maybe that is the one to choose, but unfortunately there is no such setting for the Download rate, which is maybe more important.

My wife and daughters have complained to me for a number of years, claiming my computer was slowing the Internet. Until now, I did not see how that could be. As it turns out, it was technically coming from my computer, but the culprit in fact was OneDrive’s poison setting. I am someone who turns off my desktop computer when I am not using it, and every night when it isn’t working on anything. No wonder our problems were spotty: when my computer was off, OneDrive could not take over. So my family was right all along.

Well, that’s now fixed. I will let my TV/Internet provider know about this so that they can save their own and their customers’ time when someone else has a similar intermittent Internet problem that may be OneDrive. I will also let Microsoft know through their feedback form, and hopefully one day they will decide either to change their default network traffic settings to something that would not swamp the capacity of most home Internet connections, or to change the algorithm so that “unlimited” has a lower priority than all other network activity. Maybe that “Adjust automatically” setting is the magic algorithm. If so, it could be the default, but it should also be added as an option on the Download rate, to eliminate OneDrive’s greediness.

Are you listening, Microsoft?

And I’d recommend that anyone who uses OneDrive check whether you have no limit set in your OneDrive Network settings. If so, change it, and you might see the speed and reliability of your Internet improve dramatically.


—-

**Note:  The Fibe 25 plan is the maximum currently available from BellMTS in our neighborhood. They are right now (and I mean right now, since my front lawn is all marked up) installing fiber lines in our neighborhood that will allow much higher capacity. Once installed, I should have access to their faster plans, and will likely subscribe to their Fibe 500 plan for only $20 more per month. That will give up to 500 Mbps download (20x faster) and 500 Mbps upload (167x faster). They have even faster plans, but that should be enough, because our wi-fi is 20 MHz, which is only capable of 450 Mbps. My ethernet cable (hardwired from the TV downstairs to my upstairs office when we built the house 34 years ago) is capable of 1.0 Gbps, which is 1000 Mbps. Once we switch plans, I’ll likely give OneDrive higher limits (maybe 100 Mbps both ways) and it will be a new world for us at home on the Internet.

Determining the Accuracy of DNA Tests - Fri, 10 Apr 2020

In my last post, New Version of WGS Extract, I used WGS_Extract to create 4 extracts from 3 BAM (Binary Sequence Alignment Map) files from my 2 WGS (Whole Genome Sequencing) tests.

These extracts each contain about 2 million SNPs that are tested by the five major consumer DNA testing companies: Ancestry DNA, 23andMe, Family Tree DNA, MyHeritage DNA and Living DNA.

Almost two years ago, I posted: Comparing Raw DNA from 5 DNA Testing Companies to see how different the values were. Last year, in Determining VCF Accuracy, I estimated Type I and Type II error rates from two VCF (Variant Call Format) files that I got from my WGS (Whole Genome Sequencing) test.

But in those articles, I was not able to estimate how accurate each of the tests was. To do so, you need to know what the correct values are, in order to benchmark the tests. Now, with my 4 WGS extracts and my 5 company results, I have enough information to make an attempt at this.

For this accuracy estimation, I’m going to look at just the autosomal SNPs, those from chromosomes 1 to 22. I’ll exclude the X, Y and mt chromosomes because they each have their own properties that make them quite different from the autosomes.

Let me first summarize what I’ve got. Here are the counts of my autosomal allele values from each of my standard DNA tests. I’m not including test version numbers, because different places list them differently, so instead I’m including when I tested:

image

Comparing the above table to the one from my Comparing Raw DNA article last year, all values are the same except the 23andMe column. Last year’s article totalled 613,899 instead of 613,462, a difference of 437. I’m not sure why there’s this difference, but I do know this new value is correct. Whatever mistake I might have made should not have significantly affected my earlier analysis.

I find it odd that 23andMe and Living DNA both have half as many AC and AG values as the other companies. I also find it odd that Ancestry DNA has twice as many of the AT and CG values as the other companies, and that Living DNA has no AT or CG values. I have no explanation for this.

23andMe is the only company that identified and included the insertions and deletions (INDELs) it found: the II, DD and DI values.

The double dash “--” values are called “no calls”. Those are tested positions for which the company’s algorithm could not determine a value. The percentage of no calls ranges from a low of 0.4% in my Ancestry DNA data to a high of 2.8% in my FTDNA data. Matching algorithms tend to treat no calls as a match to any value.
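As a sketch of that last point, a genotype comparison that treats no calls as wildcards might look like this (the two-character genotype strings follow the raw data format; the function name is my own):

```python
def genotypes_match(g1: str, g2: str) -> bool:
    """Compare two genotype values, treating a no call ("--")
    as a wildcard that matches anything."""
    if g1 == "--" or g2 == "--":
        return True
    # A genotype is an unordered pair of alleles: "CT" matches "TC".
    return sorted(g1) == sorted(g2)

print(genotypes_match("--", "AG"))   # True: a no call matches any value
print(genotypes_match("CT", "TC"))   # True: allele order doesn't matter
print(genotypes_match("AA", "AG"))   # False
```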

Below are the counts from my WGS tests:

image

I have done two WGS tests at Dante Labs: a Short Reads test and a Long Reads test.

For the Short Reads test, Dante used the program BWA (Burrows-Wheeler Aligner) to create a Build 37 BAM file. I then used WGS Extract to extract all the SNPs it could.

For my Long Reads test, I used the program BWA myself to create a Build 37 BAM file (see: Aligning My Genome). But BWA is said not to handle Long Reads WGS well, so I also had YSeq use the program minimap2 to create a Build 37 BAM file.

The WGS Extract program would not work on my Long Reads file until I added the -B parameter to the mpileup program. The -B parameter disables the BAQ (Base Alignment Quality) computation, which is normally used to reduce false SNPs caused by misalignment. Because I had to add -B to get the Long Reads to work, I also did a run with -B added to my Short Reads so that I could see the effect of the -B parameter on the accuracy.

When I used WGS Extract a year ago (see: Creating a Raw Data File from a WGS BAM file), it produced a file for me with 959,368 SNPs from my Short Reads WGS file and I was able to use it to improve my combined raw data file.

  

Accuracy Determination

Now I’ll use the above two sets of data to determine accuracy. By accuracy, I mean: if a test says that a particular position has a specific value, e.g. CT, then what is the probability that the CT reading is correct?

I will ignore all no calls in this analysis. If a test says it doesn’t know, then it isn’t wrong. Having no calls is preferable to having incorrect values.

I will also ignore the 4518 SNPs where 23andMe says there is an insertion or deletion (II, DD or DI). The reason is that few of the other standard tests have values at those SNPs (which is good), but almost all the WGS test results do have a value there (which is conflicting information, and bad!). Somehow WGS Extract needs to find a way to identify the INDELs so that it doesn’t incorrectly report them as seemingly valid SNPs. Of course, some of 23andMe’s reported INDELs might be wrong, but I don’t have multiple sources reporting the INDELs to be able to tell for sure. I do have my VCF INDEL file from my Short Reads WGS, but then it’s just one word against another. A quick comparison showed that some 23andMe-reported INDELs are in my VCF INDEL file, but some are not.
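Assuming each test’s results have been loaded into a dict keyed by (chromosome, position), the exclusion of no calls and of 23andMe’s INDEL positions might be sketched like this (the names are mine, not from WGS Extract):

```python
INDEL_CODES = {"II", "DD", "DI"}   # 23andMe's insertion/deletion values

def comparable_positions(test_values, ttam_values):
    """Keep only positions with a real genotype, dropping no calls
    and any position that 23andMe reports as an INDEL.
    Both arguments are dicts keyed by (chromosome, position)."""
    kept = {}
    for pos, value in test_values.items():
        if value == "--":
            continue                           # ignore no calls
        if ttam_values.get(pos) in INDEL_CODES:
            continue                           # ignore INDEL positions
        kept[pos] = value
    return kept
```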

So first I’ll determine the accuracy of the standard DNA tests, then of the WGS tests.



The Accuracy of Standard Microarray DNA Tests

I have 4 BAM files from 2 WGS tests using different alignment or extraction methods. There are 1,851,128 out of the over 2 million autosomal positions where all 4 WGS readings were the same, none were no calls, and the 23andMe value was not an insertion or deletion.

Since all 4 BAM files agree, let’s assume the agreed upon values are correct.
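That consensus step can be sketched as follows, again assuming each extract has been loaded into a dict keyed by (chromosome, position):

```python
def wgs_consensus(extracts):
    """Benchmark values: positions where every extract reports the
    same genotype and none of them is a no call ("--")."""
    consensus = {}
    for pos, value in extracts[0].items():
        if value == "--":
            continue
        # Keep the position only if every other extract agrees exactly.
        if all(e.get(pos) == value for e in extracts[1:]):
            consensus[pos] = value
    return consensus
```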

I compared these with the values from each of my 5 standard tests:

image

That’s not bad: an error rate of 0.5% or less, fewer than 1 error in 197 values. FTDNA’s and MyHeritage’s tests were the best, with an error rate of about 1 out of 600 values.

These tests are all known as microarray tests. They do not test every position, only certain ones. They are very different from WGS tests and are expected to have a lower error rate. Of course, they often include up to 3% no calls in their results, but that’s the tradeoff required to help minimize their Type I false-positive errors.



The Accuracy of Whole Genome Sequencing Tests

WGS tests have several factors involved in their accuracy. One is the accuracy of the individual reads, which in the case of Long Read WGS is said to be much worse than Short Read WGS, maybe even as bad as 1 error in 20. But those inaccurate reads are offset by excellent alignment algorithms that have been tuned to handle high error rates. That is a necessary requirement anyway, because the algorithms need to handle insertions and deletions as well.

Another factor in accuracy is the coverage rate; 30x is considered enough to give reasonably accurate results. If you have 30 reads mapped over a SNP, and 13 of them say “A”, 16 say “T” and 1 says “C”, then the value is likely “AT”. If 27 are “A” and 3 are “T”, then the value is likely “AA”. They’ve been doing this for a long time, know the probabilities, and have got this down to a science (pun intended).
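A deliberately naive version of that calling logic, just to illustrate the idea (real variant callers use per-base quality scores and likelihood models, not a simple threshold like this):

```python
from collections import Counter

def call_genotype(bases, min_fraction=0.2):
    """Call a diploid genotype from the bases read at one position:
    keep the up-to-two most common alleles that appear in at least
    min_fraction of the reads."""
    counts = Counter(bases)
    total = sum(counts.values())
    alleles = [b for b, n in counts.most_common(2) if n / total >= min_fraction]
    if len(alleles) == 1:
        return alleles[0] * 2          # homozygous, e.g. "AA"
    return "".join(sorted(alleles))    # heterozygous, e.g. "AT"

print(call_genotype("A" * 13 + "T" * 16 + "C"))   # AT
print(call_genotype("A" * 27 + "T" * 3))          # AA
```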

So my question is: what is the accuracy of my WGS Extract SNPs from my four BAM files? To determine this, I’ll do the opposite of what I did before. I’m going to find all the SNPs where at least 3 of my standard DNA tests gave the same value and the others either gave a no call or did not test that SNP. From my above analysis, each of my standard tests should have an error rate of no more than about 1 in 200, so three or more different tests all giving the same value should not be wrong very often. I’ll compare them with every position in my 4 BAM files that has a value and is not a no call. Here are my results:

image
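The way I selected those benchmark positions (at least three standard tests agreeing, and no test disagreeing) might be sketched like this, with each test again loaded as a dict keyed by (chromosome, position):

```python
from collections import Counter

def standard_test_benchmark(tests):
    """Keep positions where at least 3 tests report the same genotype
    and no test reports a different one; no calls ("--") and untested
    positions are ignored."""
    benchmark = {}
    for pos in set().union(*(t.keys() for t in tests)):
        values = [t[pos] for t in tests if pos in t and t[pos] != "--"]
        counts = Counter(values)
        # Exactly one distinct value, reported by 3 or more tests.
        if len(counts) == 1 and counts.most_common(1)[0][1] >= 3:
            benchmark[pos] = values[0]
    return benchmark
```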

So my Short Reads test gave really good results. Only 1 value in over 1300 disagreed with my standard tests. That’s quite acceptable. The -B option (passed to mpileup) seemed to have little effect on the accuracy.

But those Long Reads tests – ooohh!  I’m very disappointed. 7.7% of the values in my Long Reads BAM file created with BWA were different from my standard tests. Using minimap2 instead of BWA only reduced that to 6.6%. This is not acceptable for SNP analysis purposes. The penalty for getting the wrong health interpretation of a SNP can be disastrous.

Even though Long Reads are known to have higher error rates in individual readings, I would have thought that the longer reads, along with good alignment algorithms that take possible errors into account, would give good values once you have 30x coverage. If 1 out of 10 values is read wrong, then 27 out of 30 values should still be correct.

So something else is happening here. This high error rate could come from one of several places: read errors, transcription errors, algorithm errors, problems in any of the programs in the pipelines that create the BAM files, or problems in the programs that WGS Extract uses, such as mpileup.

So can the Short Reads test values still be used? Well, I still have one outstanding problem with them, and that’s the INDELs reported in my 23andMe test. Unfortunately, WGS Extract’s output gives SNP values at almost all of the INDEL positions. In the table below, I compare just the INDEL positions among the 23andMe positions that each test covers:

image

Now I’m still not sure whether the 23andMe value or the long read value is correct, but reporting a SNP value where there is an INDEL could be happening as much as 0.8% of the time, at least in the values reported by WGS Extract. This is something the WGS Extract people need to look at, to see if they can prevent it.



Conclusions

For genealogical purposes and relative matching on the various sites including GEDmatch, the standard microarray-based DNA tests are good enough.

Don’t ever expect your DNA raw data to be perfect. There are going to be incorrect values in it. Most matching algorithms for genealogists allow for an error every 100 SNPs or so. Some even introduce new errors with imputation. As long as errors are kept to under about 1 in 100, differences in analysis for genealogical purposes should be small. But because of these inaccuracies, nothing is exact.

If you upload to a site, it is worthwhile to improve the quality of your data by using a combined file made up of all the agreeing values from your DNA tests. See my post on The Benefits of Combining Your Raw DNA Data.

WGS tests are worthwhile for medical purposes, but are probably overkill for genealogy. The WGS files you need to work with are huge, requiring a powerful computer with large amounts of free disk space. Downloading your data takes days, and uploading it to an analysis site is impossible on most home Internet services. The programs that analyze these files are made for geneticists and are designed for the Unix platform.

There are not many programs designed for genealogists that analyze WGS data. The program WGS Extract is excellent, but you will need to know what you are doing. Until they find a way to filter out the INDELs, you’ll have to be careful in using the raw data files that the program produces.




Followup Nov 20, 2021:  I found that a raw data file download from a company can change over time, and I posted an article: Your DNA Raw Data May Have Changed. I now think this is the reason for the discrepancy I mention above in my 23andMe counts.

New Version of WGS Extract - Mon, 6 Apr 2020

Back in May 2019, I wrote about a program called WGS Extract that produces, from your Whole Genome Sequencing (WGS) test, a file of autosomal SNPs in 23andMe format that you can upload to sites like GEDmatch, Family Tree DNA, MyHeritage DNA or Living DNA.

The mastermind behind this program, who prefers to remain anonymous, made a new version available last month. You can get it here: https://wgsextract.github.io/ Last year’s program was 2 GB. This one is now 4.5 GB. The download took about 45 minutes. And that is a compressed zip file, which took about 3 minutes to unzip into 8,984 files totaling 4.9 GB. It didn’t expand much because the majority of the space is used by 5 already-compressed human genome reference files, each about 850 MB:

  1. hg38.fa.gz
  2. hs37d5.fa.gz
  3. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
  4. human_g1k_v37.fasta.gz
  5. hg19.fa.gz

I don’t know the technical details of what’s different in each of these references, except that 1 and 3 are Build 38, and 2, 4 and 5 are Build 37. For genealogical purposes, our DNA testing companies use Build 37.

Also included among the files and very useful are raw file templates from various companies which include the majority of the SNPs from each of the tests:

  1. 23andMe_V3.txt   (959286 SNPs)
  2. 23andMe_V4.txt   (601885 SNPs)
  3. 23andMe_V4_1.txt   (596806 SNPs)
  4. 23andMe_V5.txt   (638466 SNPs)
  5. 23andMe_V5_1.txt   (634165 SNPs)
  6. MyHeritage_V1.csv   (720922 SNPs)
  7. MyHeritage_V2.csv   (610128 SNPs)
  8. FTDNA_V1_Affy.csv   (548011 SNPs)
  9. FTDNA_V2.csv   (720449 SNPs)
  10. FTDNA_V3.csv   (630074 SNPs)
  11. FTDNA_V3_1.csv   (613624 SNPs)
  12. Ancestry_V1.txt   (701478 SNPs)
  13. Ancestry_V1_1.txt   (682549 SNPs)
  14. Ancestry_V2.txt   (668942 SNPs)
  15. Ancestry_V2_1.txt   (637639 SNPs)
  16. LDNA_V1.txt   (618640 SNPs)
  17. LDNA_V2.txt   (698655 SNPs)

There are 4 summary files:

  1. 23andMe_SNPs_API.txt   (1498050 SNPs) which likely combines the SNPs from all five 23andMe tests.
  2. All_SNPs_combined_RECOMMENDED_hg19_ref.tab.gz   (2081060)
  3. All_SNPs_combined_RECOMMENDED_GRCh37_ref.tab.gz   (2081060)
  4. All_SNPs_combined_RECOMMENDED_hg38_ref.tab.gz   (2080323)

The last 3 appear to be a combination of all the SNPs from all the raw file templates. The hg19 and GRCh37 files appear to be the same, but differ in how the chromosomes are specified: as 1 or as chr1, as MT or as chrM. I’m not sure how the hg38 file was derived, but it may have been a translation of all positions from Build 37 to Build 38, excluding the 737 SNPs that are in Build 37 but not in Build 38.
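If you ever need to compare files that use the two naming conventions, a small normalizer does the trick (this helper is my own, and it needs Python 3.9+ for removeprefix):

```python
def normalize_chromosome(name: str) -> str:
    """Map hg19-style chromosome names ("chr1" ... "chrM") to
    GRCh37-style names ("1" ... "MT"); GRCh37 names pass through."""
    name = name.removeprefix("chr")
    return "MT" if name == "M" else name

print(normalize_chromosome("chr1"))   # 1
print(normalize_chromosome("chrM"))   # MT
print(normalize_chromosome("22"))     # 22
```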



Running WGS Extract

The program can now run on any computer. Just run the appropriate script:

  1. Linux_START.sh
  2. MacOS_START.sh
  3. Windows_START.bat

Since I have Windows, I ran the third script. I had to tell Microsoft Defender SmartScreen to allow it to run. It starts up a Command window which then starts up the WGS Extract Window:

image

There are now three tabs:  “Settings”, “Extract Data” and “Other”.  Above is the Settings Page.

Here is the Extract Data page:

image

The Mitochondrial DNA and Y-DNA functions are both new.

And this is the Other page:

image

All the functionality on this 3rd page is new.



Two WGS Tests

When I first checked out WGS Extract last year, I only had my Dante Short Reads WGS test. See: Creating a Raw Data File from a WGS BAM file.  Since then, I have taken a Dante Long Reads WGS test.


Three BAM Files

The raw reads from a WGS test are provided as FASTQ files. These need to be mapped to their correct places on the genome. A file containing the mapping of each read to where it belongs in the genome is called a BAM file (Binary Sequence Alignment Map). It’s these BAM files that WGS Extract reads.

I have 3 BAM files I can use:

  1. The BAM file Dante provided with my Short Reads WGS test. They used a program called BWA (the Burrows-Wheeler Aligner) to produce my BAM.
  2. Dante did not provide a BAM file with my Long Reads WGS test. So I did the alignment myself using BWA to produce a BAM from this test. I documented that in my Aligning My Genome post.
  3. I found out that the program minimap2 produces more accurate alignments than BWA for Long Reads. I tried to run it myself, but the job was taking too long. Then I heard that YSeq offered a mapping service using minimap2, so I had them create a minimap2-based BAM file from my Long Reads WGS test.

Let’s now try a few things.



Show statistics

On the Settings page, we first load our BAM file and select an output directory. Loading the huge BAM file is surprisingly quick, taking only about 5 seconds.

We can now go to the “Other” page and press “Show statistics on coverage, read length etc.”

Here are my statistics from my Short Reads test. (Click image to enlarge)

image

My Short Reads test consisted of almost 1.5 billion reads. 86.44% of them were mapped to a Build 37 human reference genome. That gave an average of 41x coverage over every base pair. The average read length was 100 base pairs.
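Those numbers hang together: coverage is roughly mapped reads times read length divided by genome size. A back-of-the-envelope check (the 3.1 billion bp genome size is my approximation):

```python
# Approximate coverage = mapped reads x read length / genome size
total_reads = 1_500_000_000      # almost 1.5 billion short reads
mapped_fraction = 0.8644         # 86.44% of reads mapped
read_length = 100                # base pairs per read
genome_size = 3_100_000_000      # ~3.1 billion bp, Build 37 (approximate)

coverage = total_reads * mapped_fraction * read_length / genome_size
print(f"~{coverage:.0f}x")       # close to the reported 41x
```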

Here are my statistics from my Long Reads test with BWA mapping:

image

Long Reads WGS tests are known to have a higher percentage of errors than Short Reads WGS tests. But because their reads are longer, they can still be mapped fairly well to the human genome.

I had over 20 million long reads. 76.17% of them were mapped, which is lower than the 86% from my Short Reads test. This resulted in an average coverage of 25x, versus the 41x from my Short Reads test. The average length of the mapped reads was 3627 base pairs, 36 times longer than in my Short Reads test.

Here are the stats from my Long Reads test aligned by YSEQ using minimap2:

image

I have no idea why the Samtools stats routine decided to show the chromosomes in alphabetical order for just this run and not the other two above. That is definitely strange. But the stats themselves seem okay. This does show the improvement minimap2 made over BWA, since the average read depth is now up to 36x and the average read length of mapped reads has increased to 5639. I expect BWA must have had trouble aligning the longer reads due to the errors in them, whereas minimap2 knows better how to handle them.



Haplogroups and Microbiome

First I run the Y-DNA analysis from the “Other” page using my Long Reads. WGS Extract now includes a version of the Python program Yleaf (available on GitHub) to perform this analysis. It takes about 5 minutes and then gives me this:

image

That’s interesting. I know I’m R1a, but my more detailed groups from the various companies start taking me into M198, Y2630, BY24978 and other such designations. I’ve not seen it strung out as an R1a1a1b2a2b1a3 designation before. At any rate, it doesn’t matter too much. My Y-DNA does not help me much for genealogical purposes.

For Mitochondrial DNA, WGS Extract gave me this:
image

That’s okay. I already know my mt haplogroup is K1a1b1a. It doesn’t help me much for genealogical purposes either.

There is also an option in WGS Extract to create an oral microbiome file that can be uploaded to app.cosmosid.com. This option extracts your unmapped reads, which might be bacterial. I’m not interested in this, so I didn’t try it.



Creating A DNA Raw Data File

Going to the Extract Data page in WGS Extract, I now press “Generate files in several autosomal formats”. It gives me this screen:

image

When the screen first popped up, everything was checked. I clicked “Deselect everything” and then checked just the combined file at the top.

I did this for my Short Reads BAM file. When I pressed the Generate button at the bottom, the following info box popped up.

image

I pressed OK and the run started. After 65 minutes it completed and produced a text file with a 23andMe raw data header and over 2 million data lines that look like this:

image

It also produced a zipped version of the same file, since some of the upload sites request compressed raw data files.


A Glitch and a Fix

I wanted to do the same with my two Long Reads BAM files. When I tried, it was taking considerably longer than an hour. So I let it run all night. It was still running the next morning. It was still running in the afternoon. Why would a Long Reads BAM file take over 20 times longer than a Short Reads BAM file to run? They are both about the same size. The Long Reads file of course has longer reads, but fewer of them.

I started wondering what was going on. I contacted the author. I posted about this on Facebook and got some helpful ideas. Finally I found the temporary files area that WGS Extract uses. I could tell that for my Long Reads BAMs, the temporary file with the results was not being created. I isolated the problem to the program mpileup, which was the one failing. I searched the web for “mpileup long reads nanopore” and found this post: mpileup problem with processing nanopore alignment. It suggested using the mpileup -B option.

The mpileup -B option stands for “no-BAQ”. The Samtools mpileup documentation explains BAQ as Base Alignment Quality: a calculation of the probability that a read base is misaligned. Enabling the BAQ calculation “greatly helps to reduce false SNPs caused by misalignments”.

I tried adding the -B option, and now WGS Extract worked! It took 75 minutes to run for my BWA Long Reads file and 115 minutes for my YSEQ minimap2 Long Reads file. I then ran my Short Reads file with the -B option, and it ran in only 20 minutes. I’ll compare that run with my original Short Reads run without the -B option, and that should give me an estimate of how many false SNPs might have been introduced.


Next Steps

I’ll compare these 4 WGS Extract files with each other and with my 5 raw data files from my standard DNA tests in my next blog post. I’ll see if I can determine error rates, and I’ll see how much I can improve the combined raw data file that I’ll upload to GEDmatch.