
Louis Kessler’s Behold Blog

New Version of WGS Extract - Mon, 6 Apr 2020

Back in May 2019, I wrote about a program called WGS Extract, which produces from your Whole Genome Sequencing (WGS) test a file with autosomal SNPs in 23andMe format that you can upload to sites like GEDmatch, Family Tree DNA, MyHeritage DNA, or Living DNA.

The mastermind behind this program, who prefers to remain anonymous, made a new version available last month. You can get it here: https://wgsextract.github.io/ The program last year was 2 GB. This one is now 4.5 GB. The download took about 45 minutes. That is a compressed zip file, which took about 3 minutes to unzip into 8,984 files totaling 4.9 GB. It didn’t expand much because the majority of the space was used by 5 already compressed human genome reference files, each about 850 MB:

  1. hg38.fa.gz
  2. hs37d5.fa.gz
  3. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
  4. human_g1k_v37.fasta.gz
  5. hg19.fa.gz

I don’t know the technical aspects about what’s different in each of these references, except that 1 and 3 are Build 38 and 2, 4 and 5 are Build 37. For genealogical purposes, our DNA testing companies use Build 37.

Also included among the files and very useful are raw file templates from various companies which include the majority of the SNPs from each of the tests:

  1. 23andMe_V3.txt   (959286 SNPs)
  2. 23andMe_V4.txt   (601885 SNPs)
  3. 23andMe_V4_1.txt   (596806 SNPs)
  4. 23andMe_V5.txt   (638466 SNPs)
  5. 23andMe_V5_1.txt   (634165 SNPs)
  6. MyHeritage_V1.csv   (720922 SNPs)
  7. MyHeritage_V2.csv   (610128 SNPs)
  8. FTDNA_V1_Affy.csv   (548011 SNPs)
  9. FTDNA_V2.csv   (720449 SNPs)
  10. FTDNA_V3.csv   (630074 SNPs)
  11. FTDNA_V3_1.csv   (613624 SNPs)
  12. Ancestry_V1.txt   (701478 SNPs)
  13. Ancestry_V1_1.txt   (682549 SNPs)
  14. Ancestry_V2.txt   (668942 SNPs)
  15. Ancestry_V2_1.txt   (637639 SNPs)
  16. LDNA_V1.txt   (618640 SNPs)
  17. LDNA_V2.txt   (698655 SNPs)

There are 4 summary files:

  1. 23andMe_SNPs_API.txt   (1498050 SNPs) which likely combines the SNPs from all five 23andMe tests.
  2. All_SNPs_combined_RECOMMENDED_hg19_ref.tab.gz   (2081060)
  3. All_SNPs_combined_RECOMMENDED_GRCh37_ref.tab.gz   (2081060)
  4. All_SNPs_combined_RECOMMENDED_hg38_ref.tab.gz   (2080323)

The last 3 appear to be a combination of all the SNPs from all the raw file templates. The hg19 and GRCh37 files appear to be the same, but differ in how the chromosome is specified, as 1 or as chr1, as MT or as chrM. I’m not sure how the hg38 file was derived, but it may have been a translation of all addresses from Build 37 to Build 38, excluding 737 SNPs that are in Build 37 but not Build 38.
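The naming difference between the hg19 and GRCh37 files can be sketched in a few lines of Python. This is only an illustration of the 1 versus chr1 and MT versus chrM convention described above. The function names are made up for this example and are not part of WGS Extract.

```python
def to_hg19_name(chrom: str) -> str:
    """Convert a GRCh37-style name ('1', 'X', 'MT') to hg19 style ('chr1', 'chrX', 'chrM')."""
    if chrom.startswith("chr"):
        return chrom  # already in hg19 style
    return "chrM" if chrom == "MT" else "chr" + chrom


def to_grch37_name(chrom: str) -> str:
    """Convert an hg19-style name ('chr1', 'chrX', 'chrM') to GRCh37 style ('1', 'X', 'MT')."""
    if not chrom.startswith("chr"):
        return chrom  # already in GRCh37 style
    name = chrom[3:]
    return "MT" if name == "M" else name


for name in ["1", "22", "X", "MT"]:
    print(name, "->", to_hg19_name(name))
```

This only renames chromosomes. It does not translate positions between builds, which is what the hg38 file would have needed.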



Running WGS Extract

The program can now run on any computer. Just run the appropriate script:

  1. Linux_START.sh
  2. MacOS_START.sh
  3. Windows_START.bat

Since I have Windows, I ran the third script. I had to tell Microsoft Defender SmartScreen to allow it to run. It starts up a Command window which then starts up the WGS Extract Window:

image

There are now three tabs:  “Settings”, “Extract Data” and “Other”.  Above is the Settings Page.

Here is the Extract Data page:

image

The Mitochondrial DNA and Y-DNA functions are both new.

And this is the Other page:

image

All the functionality on this 3rd page is new.



Two WGS Tests

When I first checked out WGS Extract last year, I only had my Dante Short Reads WGS test. See: Creating a Raw Data File from a WGS BAM file. Since then, I have taken a Dante Long Reads WGS test.


Three BAM Files

The raw reads from a WGS test are provided as FASTQ files. These need to be put into the correct place on my genome. A file containing the mappings of each of my reads to where it is in my genome is called a BAM file (Binary Alignment Map). It’s these BAM files that WGS Extract reads.

I have 3 BAM files I can use:

  1. The BAM file Dante provided with my Short Reads WGS test. They used a program called BWA (the Burrows-Wheeler Aligner) to produce my BAM.
  2. Dante did not provide a BAM file with my Long Reads WGS test. So I did the alignment myself using BWA to produce a BAM from this test. I documented that in my Aligning My Genome post.
  3. I found out that the program minimap2 produced more accurate alignment than BWA for Long Reads. I tried to run that myself but the job was taking too long. Then I heard that YSeq offered the mapping service using minimap2, so I had them create a minimap2-based BAM file from my Long Reads WGS test.

Let’s now try a few things.



Show statistics

From the Settings page, we first load our BAM file and select an output directory. Loading the huge BAM file is surprisingly quick, taking only about 5 seconds.

We can now go to the “Other” page and press “Show statistics on coverage, read length etc.”

Here are my statistics from my Short Reads test. (Click image to enlarge)

image

My Short Reads test consisted of almost 1.5 billion reads. 86.44% of them were able to be mapped to a Build 37 human reference genome. That gave an average of 41x coverage over every base pair.  The average read length was 100 base pairs.
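As a rough check on that coverage figure, average coverage is approximately the number of mapped bases divided by the size of the genome. Here is a minimal sketch of that arithmetic. The genome size of 3.1 billion base pairs is my own round-number assumption for Build 37, and the function name is made up for this example. The statistics tool may well compute coverage differently.

```python
def mean_coverage(total_reads: float, mapped_fraction: float,
                  mean_read_length: float, genome_size: float = 3.1e9) -> float:
    """Estimate average coverage as mapped bases divided by genome size.

    genome_size defaults to an assumed ~3.1 billion base pairs.
    """
    mapped_bases = total_reads * mapped_fraction * mean_read_length
    return mapped_bases / genome_size


# Short Reads test: almost 1.5 billion reads, 86.44% mapped, 100 bp each
short_reads = mean_coverage(1.5e9, 0.8644, 100)
print(f"{short_reads:.1f}x")
```

With these rounded inputs the estimate comes out at about 42x, which is in line with the 41x reported.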

Here are my statistics from my Long Reads test with BWA mapping:

image

Long Reads WGS tests are known to have a higher percentage of errors in them than Short Reads WGS tests. But because their reads are longer, they can still be mapped fairly well to the human genome.

I had over 20 million long reads. 76.17% of the reads were able to be mapped, which is lower than the 86% from my short read test. This resulted in an average coverage of 25x versus the 41x from my short read test. The average read length of the mapped reads was 3627 base pairs, which is 36 times longer than my short read test.

Here are the stats from my Long Reads test aligned by YSEQ using minimap2:

image

I have no idea why the Samtools stats routine decided to show the chromosome in alphabetical order just for this run but not the other two above. That is definitely strange. But the stats themselves seem okay. This does show the improvement that minimap2 made over BWA since the average read depth is now up to 36x and the average read length of mapped reads is increased to 5639. I expect that BWA must have had trouble aligning the longer reads due to the errors in them, whereas minimap2 knows better how to handle these.



Haplogroups and Microbiome

First I run Y-DNA from the “Other” page using my Long Reads. WGS Extract now includes a version of the Python program Yleaf, which is available on GitHub, to perform this analysis. It takes about 5 minutes and then gives me this.

image

That’s interesting. I know I’m R1a, but my more detailed groups from the various companies then start taking me into M198, Y2630, BY24978 and other such designations. I’ve not seen it strung out as an R1a1a1b2a2b1a3 designation before. At any rate, it doesn’t matter too much. My Y-DNA does not help me much for genealogy purposes.

For Mitochondrial DNA, WGS Extract gave me this:
image

That’s okay. I already know my mt haplogroup is K1a1b1a. It doesn’t help me much for genealogical purposes either.

There was also an option in WGS Extract to create an oral microbiome that can be uploaded to app.cosmosid.com. This option will extract your unmapped reads which might be bacterial. I’m not interested in this so I didn’t try it.



Creating A DNA Raw Data File

Going to the Extract Data page in WGS Extract, I now press the “Generate files in several autosomal formats” button. It gives me this screen:

image

When the screen first popped up, everything was checked. I clicked “Deselect everything” and then checked just the combined file at the top.

I did this for my Short Reads BAM file. When I pressed the Generate button at the bottom, the following info box popped up.

image

I pressed OK and the run started. After 65 minutes it completed and produced a text file with a 23andMe raw data header and over 2 million data lines that looks like this:

image

It also produced a zipped version of the same file, since some of the upload sites request compressed raw data files.
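For anyone wanting to work with such a file in a script, here is a minimal Python sketch of reading the 23andMe raw data layout: comment lines starting with #, followed by tab-separated rsid, chromosome, position and genotype. The function name and the sample rows are made up for illustration and are not real SNPs from my file.

```python
from typing import Dict, Iterable, Tuple


def parse_raw_data(lines: Iterable[str]) -> Dict[str, Tuple[str, int, str]]:
    """Parse 23andMe-style raw data lines into {rsid: (chromosome, position, genotype)}."""
    snps = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        rsid, chrom, pos, genotype = line.split("\t")
        snps[rsid] = (chrom, int(pos), genotype)
    return snps


# Made-up sample rows, only to show the layout
sample = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs0000001\t1\t12345\tAA\n",
    "rs0000002\tMT\t678\tG\n",
]
parsed = parse_raw_data(sample)
print(len(parsed), "SNPs read")
```

To read a real file, you could pass an open file handle instead of the sample list, since a file handle also yields one line at a time.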


A Glitch and a Fix

I wanted to do the same with my two Long Read BAM files. When I tried, it was taking considerably longer than an hour. So I let it run all night. It was still running the next morning. It was still running in the afternoon. Why would a Long Reads BAM file take over 20 times longer than a Short Reads BAM file to run? They both are about the same size. The Long Reads file of course has longer reads but fewer of them.

I started wondering what was going on. I contacted the author. I posted about this on Facebook and got some helpful ideas. Finally I found the temporary files area that WGS Extract used. I was able to tell that for my Long Read BAMs, the temporary file with the results was not being created. I isolated the problem to the program mpileup, which was the one failing. I searched the web for “mpileup long reads nanopore” and found this post: mpileup problem with processing nanopore alignment. It suggested using the mpileup -B option.

The mpileup -B option stands for “no-BAQ”. The Samtools mpileup documentation explains BAQ to be Base Alignment Quality. This is a calculation of the probability that a read base is misaligned. Allowing the BAQ calculation “greatly helps to reduce false SNPs caused by misalignments”.

I tried adding the -B option, and now WGS Extract worked! It took 75 minutes to run for my BWA Long Reads file and 115 minutes for my YSEQ minimap2 Long Reads file. I then ran my Short Reads file with the -B option and it ran in only 20 minutes. I’ll compare that last run with my earlier Short Reads run without the -B option, and that should give me an estimate as to how many false SNPs might have been introduced.
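To show where the option goes, here is a small Python sketch that builds a samtools mpileup command line with or without -B. This is not necessarily the exact command WGS Extract runs internally. The function name and the BAM file name are hypothetical, and -f is the samtools option for naming the reference FASTA.

```python
from typing import List


def mpileup_command(bam_path: str, reference_path: str,
                    disable_baq: bool = True) -> List[str]:
    """Build a samtools mpileup argument list, optionally adding -B (no-BAQ)."""
    cmd = ["samtools", "mpileup"]
    if disable_baq:
        cmd.append("-B")  # skip the Base Alignment Quality calculation
    cmd += ["-f", reference_path, bam_path]
    return cmd


# Hypothetical file names, for illustration only
print(" ".join(mpileup_command("my_long_reads.bam", "hs37d5.fa.gz")))
```

The list could then be passed to subprocess.run, assuming samtools is installed and on the path.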


Next Steps

I’ll compare these 4 WGS Extract files with each other and with my 5 raw data files from my standard DNA tests in my next blog post. I’ll see if I can determine error rates, and I’ll see how much I can improve the combined raw data file that I’ll upload to GEDmatch.

When Everything Fails At Once… - Sun, 22 Mar 2020

Remember the words inscribed in large friendly letters on the cover of the book called The Hitchhiker’s Guide to the Galaxy:

DON’T PANIC

I returned 9 days ago from a two week vacation with my wife and some good friends on a cruise to the southern Caribbean. While away, we had a great time, but every day we heard more and more news of what was happening with the coronavirus back home and worldwide.

On the ship, extra precautions were being taken. Double the amount of cleaning was being done, and purell sanitizer was offered to (and taken by) everyone when entering and leaving all public areas. The sanitizer had been a standard procedure on cruise ships for many years. I joked that this cruise would be one where I gained 20 pounds: 10 from food and 10 from purell. Our cruise completed normally and we had a terrific time. There was no indication that anyone at all had got sick on our cruise.

We flew home from Fort Lauderdale to Toronto to Winnipeg. Surprisingly to us, the airports were full of people as were our flights. None of the airport employees asked us anything related to the coronavirus and gave no indication that there was even a problem. I don’t think we saw 2 dozen people with masks on out of the thousands we saw.

After a cab ride home at midnight, our daughters filled us in on what was happening everywhere. Since we were coming from an international location, my wife and I began our at-least 2 week period of self-isolation to ensure that we are not the ones to pass the virus onto everyone else. We both feel completely fine but that does not matter. Better safe than sorry.


Failure Number 1 – My Phone

On the second day of the cruise, I just happened to have my smartphone in the pocket of my bathing suit as I stepped into the ship’s pool. I realized after less than two seconds and immediately jumped out. I turned on the phone and it was water stained but worked. I shook it out as best as I could and left it to dry.

I thought I had got off lucky. I was able to use my phone for the rest of the day. All the data and photos were there. It still took pictures. The screen was water stained but that wasn’t so bad. But then that night, when I plugged it in to recharge, it wouldn’t. The battery had kicked the bucket. Once the battery completely ran out, the phone would work only when plugged in.

Don’t panic!

I had been planning to use my phone to take all my vacation pictures. Obviously that wouldn’t be possible now. I went down to the ship’s photo gallery. They had some cameras for sale but I was so lucky that they had one last one left of the inexpensive variety. I bought the display model of a Nikon Coolpix W100 for $140 plus $45 for a 64 GB SD card. I took over 1000 photos of our vacation over the remainder of our cruise, including some terrific underwater photos since the camera is waterproof.

image

Before the cruise was over, my phone decided to get into a mode where it wouldn’t start up until I did a data backup to either an SD card (which the phone didn’t support) or a USB drive which I didn’t have with me.

Somehow, with some fiddling, the phone then decided it needed to download an updated operating system so I wrongly let it do that. Bad move! It was obvious that action failed as then the phone would no longer get past the logo screen. 

At home, Saturday at 11 pm, I ordered a new phone for $340 from Amazon. It arrived at my house on Monday afternoon and I’m back in action. The only things on my old phone were about a month of pictures, including the first 3 days of our vacation. If it’s not too expensive, I might try to see if a data recovery company can retrieve the pictures for me. If not, oh well.


Failure Number 2 – My Desktop Computer

I had left my computer running while I was gone. I was hoping for it to do a de novo assembly of my genome from my long read WGS (Whole Genome Sequencing) test.  I had tried this a few months ago, running on Ubuntu under Windows. When I first tried, it had run for 4 days but when I realized it was going to take several days longer I canned it. Knowing I was going to be away for 14 days was the perfect opportunity to let it run. I started it up the day before I left and it was still running fine the next morning when I headed to the airport.

When I got back, I was faced with the blue screen of death. Obviously something happened. “Boot Device Not Found”.

image

Don’t panic!

I went into the BIOS and it sees my D drive with all my data, but not my C drive. My C drive is a 256 GB SSD (Solid State Drive) which includes the Windows Operating System as well as all my software. My data was all on my D drive (big sigh of relief!) but I also have an up-to-date backup on my network drive from my use of Windows File History running constantly in the background. So I wasn’t worried at all about my data. Programs can be reinstalled. Data without backups are lost forever.

I spent the rest of Saturday seeing if I could get that C drive recognized. No luck. My conclusion is that my SSD simply failed, which can happen. I had a great computer but it was about 8 years old. The SSD drive was a separate purchase that I installed when I bought it to speed up startup and all operations and programs. My computer was as dead as a doorknob.

Saturday night, along with the phone I purchased at Amazon, I also purchased a new desktop at Amazon. Might as well get a slight upgrade while I’m at it.  From my current HP Envy 700-209, a 4-core 4th generation i7 with 12 GB RAM, 256 GB SSD and 2 TB hard drive, I decided on a refurbished/renewed HP Z420 Xeon Workstation with 32 GB RAM, 512 GB SSD and a 2 TB hard drive for $990. It comes with 64-bit Windows 10 installed on the SSD drive. I’ve always had excellent luck with refurbished computers. The supplying company makes doubly sure that they are working well before you get them and the price savings are significant.

On Tuesday, the computer was shipped from Austin Texas to Nashville Tennessee. It went through Canada customs Thursday morning arriving here in Winnipeg at 9 a.m. and at my house just before noon.

First step, hook it up and a problem: My monitors have different cables than its video card needs. I ordered the less expensive video card with it, an NVIDIA Quadro K600. It did not come with the cables. I’m not a gamer so I don’t need a high-powered card. I made sure it could handle two monitors but I didn’t think about the cables. As it turns out, comparing it with my old NVIDIA GeForce GTX 645 card, I see my old card is a better card. So first step, switch my old card into my new computer.

image

Now start it up, update the video driver, and get all the Windows updates. (The latter took about a half a dozen checks for updates and 3 hours of time.)

Next turn it off and remove my 2 TB drive from my old computer to an empty slot in my new computer and connect it up. That will give me a D drive and an E drive, each with 2 TB which should last me for a while.

That was good enough for Thursday. Friday and Saturday, I spent configuring Windows the way I like it and updating all my software, including:

  1. Set myself up as the user with my Microsoft account.
  2. Change my user files to point to where they are on my old D drive.
  3. Set my new E drive to be my OneDrive files and my workplace for analysis of my huge (100 GB plus) genome data files.
  4. Reinstall the Microsoft Office suite from my Office 365 subscription.
  5. Set my system short dates and long dates the way I like them:
    2020-03-22 and Sun Mar 22, 2020
    image
  6. Set up my mail with Outlook. Connect it to my previous .pst file (15 GB) containing all my important sent and received emails back to 2002.
  7. Reinstall and set up MailWasher Pro to pre-scan my mail for spam.
  8. Reinstall Diskeeper. If you don’t use this program, I highly recommend it. It defragments your drives in the background, speeds up your computer and reduces the chance of crashes. Here’s my stats for the past two days:
    image
  9. Reindex all my files and email messages with Windows indexer:
    image
  10. Change my screen and sleep settings to “never” turn off.
  11. Get my printer and scanner working and reinstall scanner software.
  12. Reinstall Snagit, the screen capture program I use.
  13. Reinstall UltraEdit, the text editor I use.
  14. Reinstall BeyondCompare, the file comparison utility I use. I also use it for FTPing any changes I make to my websites to my webhost Netfirms.
  15. Reinstall TopStyle 5, the program I use for editing my websites. (Sadly no longer supported, but it still works fine for me)
  16. Reinstall IIS (Internet Information Server) and PHP/MySQL on my computer so that I can test my website changes locally.
  17. Reinstall Chrome and Firefox so that I can test my sites in other browsers.
  18. Delete all games that came with Windows.
  19. File Explorer: Change settings to always show file extensions. For 20 years, Windows has had this default wrong. image
  20. Set up Your Phone, so I can easily transfer info to my desktop.
  21. Set up File History to continuously back up my files in the background, so if this ever happens again, I’ll still be able to recover.
    image
    (and occasionally it saves me when I need to get a previous copy of a file)
  22. Reinstall Family Tree Builder so I can continue working on my local copy of my MyHeritage family tree. I hope Behold will one day replace FTB as the program I use once I add editing and if MyHeritage allows me to connect to their database. I also have a host of other genealogy software programs that I’ve purchased so that I can evaluate how they work. I’ll reinstall them when I have a need for them again. These include: RootsMagic, Family Tree Maker, Legacy, PAF and many others.
  23. My final goal for the rest of today and tomorrow is to reinstall my Delphi development environment so that I can get back to work on Behold. This includes installation of three 3rd party packages and is not the easiest procedure in the world. Also Dr. Explain for creating my help files and Inno Setup for creating installation programs. I’ll also have to make sure my Code Signing certificate was not on my C drive. If so, I’ll have to reinstall it.
  24. Any other programs I had purchased, I’ll install as I find I need them, e.g. Xenu which I use as a link checker, or PDF-XChange Editor which I use for editing or creating PDF files, or Power Director for editing videos. I’ll reinstall the Windows Subsystem for Linux and Ubuntu when I get back to analyzing my genome.
  25. One program I’m going to stop using and not reinstall is Windows Photo Gallery. Windows stopped supporting it a few years ago, but it was the most fantastic program for identifying and tagging faces in photos.  I know the replacement, Microsoft Photos, does not have the face identification, but hopefully it will be good enough for all else that I need. Maybe I’ll have to eventually add that functionality to Behold if I can get my myriad of other things to do with it done first.

Every computer needs a good enema from time to time. You don’t like it to be forced on you, but like cleaning up your files or your entire office or your whole residence, you’ll be better off for it.

How would you cope if both your phone and computer failed at the same time?

Just don’t panic!


Followup: After a few weeks of fiddling, I was able to get my old phone started again while plugged in, and was able to transfer my one month of photos from it to my computer via USB. So in the end, nothing important was lost.

Computers 23 years ago - Tue, 25 Feb 2020

#Delphi25 #Delphi25th – I came across an email I sent to a friend of mine on February 6, 1997 (at 1:17 AM). I’ll just give it here without commentary, but it should amuse and bring back recollections of people who were early PC users.
 image

You should find this message to be a little different. I am sending it using Microsoft Mail & News through my Concentric Network connection, rather than using my Blue Wave mail reader through my Muddy Waters connection. This gets around my problem of not being able to attach files, as you had tried for me. In future E-mails, I can attach pictures for you. I presume you can read GIFs, or would you prefer JPG or TIF?

I will still be keeping my MWCS account until the end of 1997, but I am switching over more and more to my Concentric account. I am still not entirely happy with Windows-based Newsreaders yet, and find Blue Wave much more convenient for reading newsgroups. Hopefully, by the end of the year I will have this sorted out.

I bit the bullet, and switched over to Windows 95 at home. I first had to upgrade my machine. I bought 16 MB more memory (to give me 24 MB) for $99 at Supervalue (of all places!) and bought a 2 GB hard drive for $360 (also at Supervalue!) less a $30 US mail-in rebate on the Hard Drive and a $30 sweatshirt thrown in due to a Supervalue coupon when over $200 is spent. My 260 MB drive that I bought 3 1/2 years ago already had Stacker on it to make it 600 MB, and I only had 80 MB free. I wanted to get rid of Stacker before going to Windows 95.

It only took me 3 1/2 hours to install the RAM and the Hard Drive myself at home! It wasn’t without problems, but the operation was a success. I had hooked up my old and new Drives as master and slave and everything worked. The next night, I took another 3 1/2 hours to transfer everything from my old drive to my new one, removing the old drive, and getting the system working from the new drive - again not without problems, but completed that evening. I am very proud of myself! The next evening, it took about an hour to get Windows 95 installed, and to customize it to the way I liked.

This hardware upgrade should be good for another couple of years. I only have the power supply, base, keyboard, mouse, and monitor as original parts. All the rest has been since upgraded.

Windows 95 - Well I actually like 90% of it better than Windows 3.1, and am only finicky about 10% of it. I know, I know, buy a Mac you will say. Well I hope you are prepared to buy a new operating system every six months like Jobs says you’ll have to. I still agree Macs are a good system, but there is much more software available for PCs, Macs are 40% more expensive, and they still use that horrible character font that they used in the early 80’s - yecch!

In the meantime, I have kept myself very, very, very, very, very, very, very, very busy. I have been working hard on many different fronts, after work playing hard with the kids until their bedtimes (usually closer to 10 p.m. than to 8), most often working on the Computer from 10 to 11 to 12 to (yikes) 1 or 2 sometimes - Got my web pages up (http://www.concentric.net/~Ikessler); have responded to about 50 e-mail messages and inquiries about it; designed a tender proposal for the photographic work for our Cemetery Photography Project (http://www.concentric.net/~Ikessler/cemphoto.shtml); and I’ve started learning how to use Borland Delphi to develop my BEHOLD program (http://www.concentric.net/Ikessler/behold.shtml)

Whew! I’m getting tired just thinking about all this!

Take care.  Louis