In just over a week, on Tuesday July 14, 2020 at 8:00 pm EDT, I’ll be giving a live online talk for the Virtual Genealogical Association @VirtualGenAssoc
The description of my talk is:
Presenter Louis Kessler explains those mysterious files that we download from DNA testing companies, helps us to understand what’s in them, and shows us the ways we can make use of them. He will also discuss whether Whole Genome Sequencing (WGS) tests are worthwhile for genealogists.
I hope you come and join me for this.
To register for my presentation, you’ll need to be a member of the Virtual Genealogical Association. Annual Dues are only $20 USD, and that gives you free registration for a year to any of their regular webinars as well as handouts and other benefits. Upcoming webinars include:
Tuesday, July 14 at 8 pm EDT - Louis Kessler presents “Your DNA Raw Data & What You Can Do With It”
Sunday, July 26 at 1 pm EDT - Sara Gredler presents “Successfully Searching the Old Fulton New York Postcards Website”
Saturday, August 1, 2020 EDT - Jessica Trotter presents “Occupational Records: Finding Work-Related Paper Trails”
Friday, August 7, 2020 at 8:00 pm EDT - Ute Brandenburg presents “Research in East and West Prussia“
Tuesday, August 18, 2020 at 8:00 pm EDT - Caroline Guntur presents “Introduction to Swedish Genealogy”
Sunday, August 23, 2020 at 1 pm EDT - Julie Goucher presents “Researching Displaced People”
Saturday, Sept 5, 2020 at 11:00 am EDT - Sara Campbell presents “Using Historic Maps of New England and Beyond”
Tuesday, Sept 15, 2020 at 8:00 pm EDT - Tammy Tipler-Priolo presents “Simple Steps to Writing Your Ancestors’ Biographies”
Sunday, Sept 20, 2020 at 1:00 pm EDT - Tamara Hallo presents “How to Get the Most Out of FamilySearch.org”
Friday, Sept 25, 2020 at 8:00 pm EDT - Annette Lyttle presents “Finding & Using Digitized Manuscript Collections for Genealogical Research”
Saturday, Oct 3, 2020 at 11:00 am EDT - Patricia Coleman presents “Beginning with DNA Painter: Chromosome Mapping”
Sunday, Oct 11, 2020 at 1:00 pm EDT - Kristin Brooks Barcomb presents “Understanding & Correlating U.S. World War I Records & Resources”
Tuesday, Oct 20, 2020 at 8:00 pm EDT - Christine Johns Cohen presents “Lineage & Hereditary Societies: Why, Where, When, What & How?”
Sunday, November 22, 2020 at 1:00 pm EST - Judy Nimer Muhn presents “Researching French-Canadians in North America”
Tuesday, November 24, 2020 at 8:00 pm EST - Marian B. Wood presents “Curate Your Genealogy Collection – Before Joining Your Ancestors!”
Tuesday, Dec 1, 2020 at 8:00 pm EST - Diane L. Richard presents “The Organizational Power of Timelines”
Friday, Dec 4, 2020 at 8:00 pm EST - Nancy Loe presents “Using Macs and iPads for Genealogy”
Sunday, Dec 13, 2020 at 1:00 pm EST - Jean Wilcox Hibben presents “Family History Can Heal Family Present”
Notice they vary the day of the week and the time of the day to accommodate people all over the world with different schedules.
If you are unable to attend a talk live, members have access to recordings of the last six months of webinars. Some of the past webinars you can still access if you join now include:
Pam Vestal presented “20 Practical Strategies to Find What You Need & Use What You Find”
Mary Cubba Hojnacki presented “Beginning Italian Research”
Alec Ferretti presented “Strategies To Analyze Endogamous DNA”
Renate Yarborough Sanders presented “Researching Formerly Enslaved Ancestors: It Takes a Village”
Lisa A. Alzo presented “Finding Your Femme Fatales: Exploring the Dark Side of Female Ancestors”
Lisa Lisson presented “How To Be A Frugal Genealogist”
Michelle Tucker Chubenko presented “Using the Resources of the U.S. Holocaust Memorial Museum”
Cheri Hudson Passey presented “Evidence: Direct, Indirect or Negative? It Depends!”
Kate Eakman presented “William A. James’ 30 May 1944 Death Certificate”
While you’re at it, clear off your calendars from Nov 13 to 15 for the VGA’s annual Virtual Conference. Many great speakers and topics. There is a $59 fee for members and $79 for non-members. If the Conference interests you, then why not join the VGA right now for $20 and enjoy a year of upcoming webinars and 6 months of past webinars for free!
I’ve been a member of the Virtual Genealogical Association since it started in April 2018. They are always on the lookout for interesting speakers with interesting topics. If you would like to propose a talk, they are now accepting submissions for 2021 webinars and the 2021 Virtual Conference. Deadline for submission is August 30, 2020.
I’ve written over 1100 genealogy-related blog posts since I started blogging in 2002. But very rarely have I written about my own genealogy research.
My own research, as it happens, is actually going okay now.
This blog was started to document the development and progress of my software program Behold, which I’m building to assist me with my genealogy. About 8 years ago, I started attending international conferences and became a genealogy speaker myself. Then about 4 years ago, DNA testing started to become a thing, and I jumped in fully, finding everything about it fascinating; I wrote my program Double Match Triangulator to help decipher matches. About 2 years ago, the Facebook era of genealogy groups began, and I joined and started participating in many groups that were of interest to me and relevant to my own family research.
I got interested in my genealogy in my late teens, when one of my father’s aunts was in from Los Angeles and started drawing a tree showing herself and her 8 brothers and sisters. Then I started researching. The first program I entered my data into was Reunion for Windows. When Reunion sold their Windows product to Sierra in 1997, I became a beta tester for their release of the program, which they called Generations. I had added all my genealogy data into Generations by 1999 and was using it to display my information until 2002, when Genealogy.com purchased it along with Family Origins and Ultimate Family Tree, and then subsequently dropped all three programs in favour of their own product, Family Tree Maker.
What I had was a GEDCOM with my family tree information updated up to 1999. And until about 2 years ago, I had made no updates to that at all, waiting for Behold to become the program I’d enter all my genealogy data into. Working full time, the onset of DNA testing, becoming involved in genealogy conferencing and speaking, plus family and life in general prevented that from happening.
But then a simple step recently rebooted me and my genealogy work.
The MyHeritage Step
In February 2018, I took advantage of a half-price subscription for MyHeritage’s Complete Plan. I loaded my 16-year-old GEDCOM up to MyHeritage. I downloaded their free Family Tree Builder program, which syncs with their online system, and I went to it.
The special price enticed me, but I liked what I saw in MyHeritage. They had lots of users. Billions of records. They had plenty of innovation, especially in their Smart Matching. And they were less America-centric than Ancestry. All my ancestors came from Romania and Ukraine, ending up here in Canada, so I have eastern European needs. I’ll need to write names in Romanian, Russian, Hebrew and Yiddish, and language handling is one of MyHeritage’s strong points.
The one place MyHeritage was weak was Canada. So I subscribed to Ancestry as well, but just their Canadian edition. The main database I wanted from Ancestry was the passenger lists for arrivals at Canadian ports.
Once I uploaded my 1400 people I had from 2002 via GEDCOM, MyHeritage’s Smart Matches started working for me. Over the course of a year, I added about 500 people to my tree and attached 5000 source records to them.
Filling Out My Tree
The sides of my family I am researching include my 5 grandparents and my wife’s 4 grandparents. My father’s parents are both from Romania. My mother’s parents are both from Ukraine as are all my wife’s grandparents.
My 5th grandparent is my father’s step-father Kessler. He is my mystery side. I know very little about him and his first wife. I don’t even know where he came from other than some unidentifiable place Ogec somewhere in Russia. He has no living blood relatives that I know of, and since no one I know is related to him, I can’t even use DNA to help me on his or his first wife’s side.
In addition to my 9 grandparents, I am also sort of doing a one-place study of Mezhirichi in the Ukraine, where my mother’s father came from. That town interests me more than my other grandparents’ towns because in the 1920s, a synagogue called the Mezericher Shul was formed in Winnipeg, made up only of immigrants from that town, including my mother’s father. I am trying to trace all the people in Winnipeg whose parents or grandparents went to that synagogue back to their roots in Mezhirichi. I’m sure many of us are related in ways that we don’t know. So to be more precise, this is not really a one-place study of Mezhirichi, but a study of the families of the people who attended this Winnipeg synagogue, who likely came from Mezhirichi.
On my wife’s father’s mother’s side is a cousin in the United States who has done an extensive study of that side of the family. He wrote a 255-page book listing about 1000 people who descended from his and my wife’s common ancestors. He graciously allowed me to add the data to my MyHeritage tree as another way to preserve his research. I enjoyed the month and a half I spent manually adding people and their birth and death years to my family tree. That was enough to let MyHeritage’s Smart Matches do the dirty work of finding record matches, easily allowing me to add dates and places from the records to our people.
Shortly after that, I ran into a problem. MyHeritage is supposed to privatize living person information. And when you look at a living person in the tree, it looks like they have been privatized. But it isn’t quite so:
It shows the surname of the person, and the spouse’s maiden name. That wasn’t too bad, but the real problem was the Smart Matches. When someone Smart Matches to you over living people that they may have in their tree, they get all the information you have: names, dates, places, children, etc. I had a cousin email me and tell me he got a Smart Match from my tree, and his birthday was displayed to him. He wasn’t happy, and neither was I.
I was really hoping I wouldn’t have to delete all the living people from my online tree, keeping them only in my local files on my computer. Fortunately, there was a solution. When editing a person in Family Tree Builder, the “More” tab contains a privatization selection for the person. You check the box to make the person private:
They had no automated way to check this selection for all living people, so I manually opened each of my 1500 living people and marked them private one by one: another week-long project.
Once those private people synced up to MyHeritage, the living couples now displayed as:
That’s much better. Every person still has a box online, but they are all now marked as “Unknown” rather than “private” with a surname. Also, no more information about living people is given to anyone through Smart Matches. As a consequence, I also don’t get Smart Matches for any of my privatized people. But this might be a blessing in disguise. Now the Smart Matches I get are only for my deceased people, who are the ones I’m most interested in researching and tracing further back. And the number of Smart Matches I now get is manageable. I can clean them out in a few days, until I get a few hundred more a few weeks later.
Cousin Bait
I love this term cousin bait. You don’t want to put your data in one place. You want to put it everywhere you can. And you don’t want to put it all up for everyone to see and take. You want to make enough available to get people to contact you, so you can communicate with them and then share what you both have.
My family webpage is well indexed on Google. For instance, searching for “Braunstein Tecuci” on Google brings my page up in 3rd place out of 11,500 results:
Over those 20 years, I’ve had about 200 people email me inquiring about some of the names and places that I identify. And maybe one third of those have been actual relatives whom I’ve shared data with.
The 2nd best resource I’ve used for a long time to find family has been the JewishGen Family Finder (JGFF). I have just 17 entries, but those have been enough to get maybe 100 people to contact me to see if we have part of our family tree in common. And again, in maybe a third of those cases, we did.
Also, 2 decades ago, I uploaded my GEDCOM to JewishGen’s Family Tree of the Jewish People. As of March 2017, the collection had 7,310,620 records from 6,266 family trees. I’ve recently updated my tree there with my MyHeritage tree.
One of the best successes from my family webpage and through JewishGen was my connection to about 10 relatives on my father’s mother’s Focsaner side. We all have been emailing each other for many years and have been sharing information about our common family. I have only met one of these relatives in person, when our family went to New York City for a vacation about 10 years ago. But despite most of us never having met, and being 3rd cousins or further, we feel like we’re close family.
In the past 2 years, I have also added some of my own family tree (not my wife’s) to other sites, usually just my ancestors.
Ancestry: Just ancestors, but I’ve connected them down to any DNA matches who are relatives. This has given me a number of useful ThruLines that have led me to identify a couple of DNA testers who were relatives that I didn’t have in my tree.
Family Search: I just added my ancestors, but I’m connecting them to anyone else in this one-world tree who I know are relatives.
Geni: Same as for Family Search.
Wikitree: I’ve only put myself and my parents in so far. If in the future I notice a relative, I’ll connect to them.
Geneanet: About a year ago, I uploaded my tree from MyHeritage, so I have about 4,000 people in my tree there.
GenealogieOnline: Just ancestors.
Family Tree DNA: Just ancestors, but connected down to DNA matches.
GEDmatch: Up to yesterday, just ancestors.
Unfortunately, other than the ThruLines results at Ancestry, these trees have not led to people contacting me. So they are not as good at being cousin bait as I hoped they would be.
But yesterday, GEDmatch added their MRCA Search Tool, which compares the GEDCOM file you uploaded to GEDmatch with the GEDCOM files of your DNA matches. So I downloaded my GEDCOM from MyHeritage (which already had all living people privatized), uploaded it to GEDmatch, and ran their new tool.
The GEDmatch tool compared 766 of my DNA matches’ trees to mine, and 933 of my uncle’s DNA matches’ trees to my uncle in my tree. Mine is a very problematic family for these sorts of comparisons. All my ancestors are Jewish, so I have endogamy to deal with on the DNA side; and they are all from Romania or Ukraine, so on the tree side I have a lack of records and the ability to only go back about 5 generations. Somewhat as expected, the result was that neither I nor my uncle had any MRCA matches.
Other Findings
Of course, one goal every genealogist has is to expand our ancestral tree as much as we can. With all my ancestors coming from Romania and Ukraine, the records there only start in the early to mid 1800s. I can only hope to go back about 5 generations with the known records available.
Over the past few years, I found some researchers who have been able to acquire records for me and translate them from the Romanian or Russian they are written in.
Researcher Gheorge Mireuta obtained 10 birth and death records from Tecuci, Romania on my father’s father’s side.
Sorin Goldenberg obtained about 70 records from the Dorohoi region of Romania on my father’s mother’s side.
Viktoria Chymshyt has obtained records from the Mezhirichi area of Ukraine, trying to find people for me on my mother’s father’s side, but we haven’t been successful yet.
Boris Malasky has obtained about 70 records on two of my wife’s sides from Kodnya and Zhitomir in the Ukraine.
This record research is really the only possible way to expand my tree into the “old country” and provide the physical evidence to back it up.
I really love MyHeritage’s Fan View. It gives me a good picture of where I am. Here’s the Fan View of my tree today:
And a new record I just got a few days ago from Sorin Goldenberg gave me the first names of the parents of my great-great-great-grandfather Manashcu Naftulovici.
So Naftuli and Sura are the first two ancestors I’ve identified in my 6th generation! Their son Manashcu was the first in his line to start using a surname, and he selected the patronym: Naftulovici.
My wife’s Fan View is currently this:
We have two of her 7th generation ancestors identified in records acquired from Boris Malasky.
Still To Do
In one word, lots! All genealogists know this is a never-ending task. Every new ancestor you find leads to two new questions.
But my three major tasks over the next few years will be:
1. Going through and organizing the dozens of boxes in my closet and basement, and the binders on my bookshelf, of unorganized genealogical material and pictures from my early years of research and from my parents and my wife’s parents and grandparents.
2. Digitizing what’s valuable from #1.
3. Entering the data obtained from #1 into my family tree, along with source citations.
That should keep me busy for a while.
And in the meantime, I’ll still be developing Behold so that it will continue to assist me as I go.
I have now taken 5 DNA microarray (chip) tests with Family Tree DNA, 23andMe, Ancestry DNA, MyHeritage DNA and Living DNA. I have also taken two Whole Genome Sequencing (WGS) tests with Dante Labs, one short-read and one long-read.
I analyzed the accuracy of these tests by comparing individual SNP values in my article Determining the Accuracy of DNA Tests. The chip tests don’t all test the same SNPs, but there’s enough commonality that they can be compared, and an error rate can be estimated. For my chip tests, that error rate turned out to be less than 0.5%.
The WGS test results don’t give you individual positions. They give you a set of reads, which are segments that are somewhere along the genome. Short read WGS tests give you segments that may be 100 to 150 bases long. Long read WGS tests can give segments that average 10000 bases long with the longest being into the megabases (millions of bases). But you don’t know where those segments are located on the genome.
To determine where the WGS reads are on the genome, there are two methods available:
1. Alignment: Each read is matched to where it best fits in the human reference genome. The WGS testing companies often do the alignment for you and give your results to you in a BAM (Binary sequence Alignment Map) file. The alignment cannot be perfect because:
You have variants that are different from the human reference genome as well as INDELs (insertions and deletions),
The WGS data has errors in the reads, sometimes changing values, adding extra values or deleting values.
The algorithms used for alignment are not perfect and sometimes make assumptions.
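As a toy illustration of what alignment is doing (real aligners like BWA use clever indexed data structures, not a brute-force scan like this), here is a sketch that slides a read along a reference and picks the position with the fewest mismatches:

```python
def best_alignment(read, reference):
    """Slide the read along the reference and return the position with
    the fewest mismatches (brute force, for illustration only)."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches

ref = "ACGTACGTTAGCCATG"
print(best_alignment("AGCC", ref))   # (9, 0): perfect match at position 9
print(best_alignment("TAGGC", ref))  # (8, 1): best fit has one mismatch
```

Even in this tiny example, the aligner has to assume the lowest-mismatch position is the right one, which is exactly where the imperfections listed above creep in.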
Comparing my BAM file results from my short read WGS test (aligned using the BWA alignment tool), the SNPs I could compare were even more accurate than my chip tests, with an error rate of less than 0.1%. That sounds very good, but still, 1 in every 1300 results was wrong, meaning that in 700,000 SNPs, there could be over 500 errors.
The WGS_Extract tool that I was using to extract the SNP values from the BAM file didn’t handle INDELs properly, so I couldn’t check the accuracy of those. Despite their high accuracy for individual SNPs, short read WGS tests are not as good at identifying INDELs correctly; e.g. the YouTube video (see below) states 85% to 95% accuracy, which is a high 5% to 15% error rate.
For my long reads WGS test, I had two alignments done: one using a program called BWA, and one using minimap2, which was supposed to be better for long reads. I was very disappointed to find quite a high error rate on the SNPs I could compare: 7.7% and 6.6% for the two programs.
Thus, alignment techniques and the algorithms that implement them are not bad, but they are far from perfect. They match your reads to a reference genome and have to assume that the best fit is where your read goes.
2. De Novo Assembly, or just Assembly: This is where you only take the WGS reads themselves, and match them up with each other, piecing them together like a big picture puzzle.
Actually, it’s tougher than a picture puzzle. The best analogy I’ve seen is it’s like taking 100 copies of today’s issue of the New York Times newspaper, and shredding them into small random pieces where you can only see a few words from a few lines. Just to make it a bit more difficult, half the papers are the morning edition, and half are the afternoon edition, where 90% of the articles are the same, but the other 10% have a different article in the same location in the paper. On top of that, somehow one copy of yesterday’s paper accidentally got in the mix. Now you have to reassemble one complete newspaper from all these pieces. And as a bonus, try to create both the morning edition and the afternoon edition.
You likely will mix up some morning edition articles with some afternoon edition articles, unless you get some pretty big pieces that include 2 of the same edition’s articles in that piece. (Think about this!)
So the two editions are like your paternal and maternal chromosomes, and the one copy of the previous day’s paper is like a 1% error rate that your reassembling has to deal with. Add in shredded versions of six different issues of the newspaper for a 6% error rate.
A genome assembler matches one read to another and tries to put them together. The longest stretches of continuous values that it can assemble are called contigs. Ideally, we would want to assemble 24 contigs: one for each of the 22 autosomes, plus the X chromosome and the mitochondrial (mtDNA) chromosome. A male will have a 25th, that being his Y chromosome.
When assemblers can’t connect a full chromosome together (which none can do yet for humans), you can run another program that uses a technique called scaffolding to connect the contigs. That is done by mapping the contigs to the human reference genome and using the reference as the scaffold (or connections).
Assembly with short read WGS has not been able to give good results. As with alignment, the reads are too short to span repeats, and thus give way too many contigs. Long reads are normally used for assembly, and despite their high error rate for individual base pairs, sophisticated error correction techniques and minimum distance algorithms have been developed to do something reasonable. However, chromosome-scale contigs are still not there yet, and many smart researchers are working to solve this, e.g. this article from Nov 2019 describing a method using a connection graph.
I attempted an assembly of my long reads WGS about 6 months ago using a program called miniasm. I let it run on my computer for 4 days but I had to stop it. So I waited until before a 2 week vacation and started it, but while it was running my computer crashed.
I realized that this is too long to occupy my computer to do an assembly that likely will not give good results. And I was not happy running it in Unix on my Windows machine. I was interested in a Windows solution.
Algorithms for Genome Assembly
I’ve always been a programmer who likes the challenge of developing an algorithm to solve a problem. I have a BSc Honours in Statistics and an MSc in Computer Science, and my specialty and interest was in probability and optimization.
I have developed and/or implemented many computer algorithms, including detection of loops in family trees for Behold, matching algorithms in Double Match Triangulator, simulation of sports and stock market performance (winning me over $25,000 in various newspaper contests) and, from my university days, my at-the-time world-class chess program, Brute Force.
Currently, for the next version of Behold, I am implementing the probability of a DNA match and, conditional upon matching, the expected match length, for autosomal, X, Y and mtDNA, between selected people and everyone else in your family tree. In doing so, I also have to determine all the ways the selected people are related and statistically combine the results. All this data will be available to the user, along with all the ways these people are related. It should be great.
But back to genome assembly. The problem with assembly algorithms today is that they have to use long reads, and long reads have very high error rates. So they must attempt some sort of approximate matching that allows for errors, and then use the consensus approach, i.e. take the values that most reads aligning to the same position agree on. It is not clean. It is not simple. There is a lot of error correction, and many assumptions must be made.
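The consensus idea itself is simple to sketch. Here is a toy example of my own (not any real assembler’s code) that takes the majority base in each column of reads already aligned to the same region:

```python
from collections import Counter

def consensus(aligned_reads):
    """Take the most common base at each position across reads
    already aligned to the same region (a toy consensus call)."""
    result = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)

# Three error-prone reads covering the same stretch; the consensus
# out-votes the single-read errors.
reads = ["ACGTACGT",
         "ACGAACGT",   # error at position 3
         "ACGTACCT"]   # error at position 6
print(consensus(reads))  # ACGTACGT
```

The hard part, of course, is not this vote; it is the approximate matching needed to decide which error-prone reads line up with which in the first place.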
Ahh, but wouldn’t it be simple if we could just take one read, and match the start to another read and the end to a third read. If you have enough coverage, and if the reads are accurate enough, then this would work fine.
In fact this is how they put together the first human genomes, painstakingly connecting the segments that they had one by one.
But alas, the long reads WGS tests are not accurate enough to do this. So something else had to be done.
Chapter 3 of the Bioinformatics Algorithms book is “How Do We Assemble Genomes?” That is where I got the exploding newspaper analogy that I expanded on above. The chapter is amazing, turning the problem into graph theory, bringing in the famous Königsberg Bridge Problem solved by mathematician Leonhard Euler, and explaining why a de Bruijn graph is the best solution for error-prone reads.
This looked like quite a task to implement. There are many assembly algorithms already developed using this technique, and I don’t think there’s anything I can do here that those working on this haven’t already done.
Accurate Long Reads WGS!!!
A couple of months ago, another innovation caught my attention. The company PacBio developed a DNA test they call PacBio HiFi SMRT (Single Molecule, Real-Time) WGS, whose reads are both long (up to 25 kb) and highly accurate (about 99.8%).
Whoa! The world has just changed.
No longer was extensive error correction required. The video above talks about the HiCanu assembler and how it was modified to take full advantage of this improved test. Not only that, but the practice of using short reads to “polish” the data is no longer required, and is actually discouraged with HiFi reads, as the polishing can actually introduce errors.
What does this mean? Well, to me this indicates that the original ideas of simply connecting ends might just work again. I have not seen any write-up about this being attempted anywhere yet. The assembly algorithm designers have been using advanced techniques like de Bruijn graphs for so long, they might never have thought to take a step back and think that a simpler solution may now work.
So I thought I’d take that step back and see if I can develop that simpler solution.
A Simple Single-Pass Assembler for Windows
For 25 years I’ve developed software using the programming language Delphi on Windows. Most bioinformatics tools are written in Python for Unix. I’m too much of an old horse who is too busy to learn new tricks. So Delphi it will be for me.
The algorithm with perfect reads seemed fairly simple to me. Make the first read a contig. Check the next read. Does the start or end of the read match anywhere within the contig? If so, extend the contig. If not, make the read a contig. Continue sequentially just one time through the reads, and after the last read, you should be done!
Once I got going, I found it only slightly more complicated than that. I also had to check whether the start and end of each contig matched anywhere within the read, and whether the read contained the contig or the contig contained the read. I set a minimum overlap length, thinking that I’d want to ensure the read and the contig matched at least that much. Then any repeats smaller than that overlap would be bridged.
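To make the idea concrete, here is a toy single-pass overlap assembler in Python. This is a sketch of the approach, not Sembler itself (which is written in Delphi); it assumes error-free reads, merges each read into the first contig it matches, and leaves out contig-to-contig merging:

```python
def assemble(reads, seed=12):
    """Toy single-pass overlap assembler. For each read, try to merge it
    into an existing contig; otherwise start a new contig. Assumes
    error-free reads, and checks only the first matching contig."""
    contigs = []
    for read in reads:
        placed = False
        for i, contig in enumerate(contigs):
            if read in contig:                      # read already covered
                placed = True
            elif contig in read:                    # read spans the whole contig
                contigs[i] = read
                placed = True
            else:
                pos = contig.find(read[:seed])      # read may extend contig right
                if pos != -1 and contig.endswith(read[:len(contig) - pos]):
                    contigs[i] = contig + read[len(contig) - pos:]
                    placed = True
                else:
                    pos = read.find(contig[:seed])  # read may extend contig left
                    if pos != -1 and read.endswith(contig[:len(read) - pos]):
                        contigs[i] = read + contig[len(read) - pos:]
                        placed = True
            if placed:
                break
        if not placed:
            contigs.append(read)
    return contigs

# Three overlapping error-free fragments merge into a single contig:
print(assemble(["ACGTACGG", "TACGGTCAA", "GTCAATTGC"], seed=4))
# ['ACGTACGGTCAATTGC']
```

The `seed` parameter plays the role of the minimum overlap length discussed above: any repeat shorter than the seed gets bridged rather than confusing the match.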
First I needed some sample data. In Chapter 9 of the Bioinformatics Algorithms book, the Epilogue on Mismatch-Tolerant Read Mapping gives a challenge problem that includes a 798 KB partial dataset of the bacterial genome Mycoplasma pneumoniae, with 816,396 values in it, all either A, C, G or T.
This is what that dataset looks like in my text viewer. It’s just one long line with 816,396 values in it:
The challenge problem also included a file of 40,000 short reads from that dataset, all of length 100. That gives 4 million data points, for a coverage of 4.9x over the 816,396 bases in the genome.
However, not a single one of the 40,000 reads was in the genome exactly. The challenge was to find the number of reads that had at most 1 mismatch.
Since I wanted a perfect dataset of reads to start with, I needed to create my own. I also wanted them to be like long reads, all with differing lengths. So after a bit of trial and error, I ended up using a base-10 lognormal distribution with a mean of 3 and a standard deviation of 0.25 to generate 9000 random read lengths. The longest read length was 11,599. The shortest was 124. The mean was 1174.
So those 9000 reads average 1174 bases and total 10.6 million data points, giving 13.0x coverage of the genome, which is quite a bit more than the 4.9x coverage of their example short reads. This is good, because it makes it more likely that my reads will cover the entire genome without gaps.
I then generated random start positions for those 9000 reads, and extracted the actual genome values at that position for that read length, and put those into my own reads file. So now I had a set of reads with no errors to develop with.
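For anyone wanting to reproduce this kind of test data, a sketch of the procedure in Python follows. The genome here is just a stand-in random sequence of the same size (I used the actual Mycoplasma pneumoniae dataset from the book):

```python
import random

random.seed(42)  # reproducible

# A stand-in random "genome" the same size as the Mycoplasma dataset.
genome = "".join(random.choice("ACGT") for _ in range(816_396))

# 9000 error-free reads with base-10 lognormal lengths (mean 3, sd 0.25).
reads = []
for _ in range(9000):
    length = round(10 ** random.gauss(3, 0.25))
    start = random.randrange(len(genome) - length)
    reads.append(genome[start:start + length])

mean_length = sum(map(len, reads)) / len(reads)
print(mean_length)  # around 1180 expected for this distribution
```

A base-10 lognormal with mean 3 gives a median read length of 1000 (10³), with a long right tail, which is roughly the shape real long-read length distributions have.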
This is what my set of reads look like in my text viewer. There are 9000 lines, each of varying length:
To do the matching, I didn’t know how much of the start and the end of each read would be needed to find a match in another read. So I wrote a small routine to take the first n positions at the start and end of the first read, and find out how many other reads they are contained in:
I started at length 5. The first 5 values of the first read matched somewhere in 14,281 other reads. Obviously length 5 is too small. Increasing to the first 11 values, we see the start of the first read only matches 10 other reads and the end only matches 8. This does not decrease any more as we increase the segment size indicating that we likely found all the occurrences of that sequence in all the reads. With 13.0x coverage, you would expect on average 13 matches over any segment. I have 1 + 10 = 11 occurrences of the first part of the first read, and 1 + 8 = 9 occurrences of the last part of the first read. That’s a very possible result with 13.0x coverage.
So for this genome and the sample data I have, I’ll set my segment length to 12 and select the first 12 values and last 12 values of each read for my comparisons.
The reason such a small 12-value segment can be used is that there are 4 possible values, A, C, G and T, at each position. And 4 to the power of 12 is 16,777,216, meaning there are that many ways to make a 12-letter string out of those 4 values. Our genome is only 816,396 bases long, so there is very little chance that any segment of length 12 randomly appears more than once. For a human genome of 3 billion bases, a slightly longer comparison segment will be required; maybe length 17 or 18 will do it.
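That arithmetic can be sketched like this (the slack factor of 20 is just my own illustrative choice for “comfortably larger than the genome”):

```python
def min_seed_length(genome_size, slack=20):
    """Smallest k with 4**k comfortably larger than the genome, so a
    random k-mer is unlikely to repeat by chance. The slack factor of
    20 is just an illustrative safety margin."""
    k = 1
    while 4 ** k < genome_size * slack:
        k += 1
    return k

print(4 ** 12)                         # 16777216 possible 12-mers
print(min_seed_length(816_396))        # 12 for this bacterial genome
print(min_seed_length(3_000_000_000))  # 18 for a human-sized genome
```

Real genomes are not random sequences and contain many repeats, so this is a lower bound on the seed length, not a guarantee.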
Running My Assembler: Sembler
After about 4 days of development, testing, debugging and enhancement, I got my simple assembler running. I call it: Sembler. This version ended up with about 200 lines of code, but half of that is for reporting progress.
This is its algorithm. For every read, Sembler checks the read’s first and last 12 positions against each contig created so far, checks the first and last 12 positions of each contig against the read, and checks whether the read is completely contained in the contig or the contig completely contained in the read. Depending on the situation, it then either extends a contig, combines two contigs, or creates a new contig.
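The core of that logic might look something like the sketch below. This is my reconstruction, not Sembler’s actual code; for brevity it omits the contig-to-contig merging case, and the parameter names are my own:

```python
def try_extend(contig, read, seg=12, min_overlap=100):
    """Try to merge read into contig; return the merged string or None."""
    if read in contig:                 # read fully contained in contig
        return contig
    if contig in read:                 # contig fully contained in read
        return read
    # Start of read found inside contig: read may extend contig rightward.
    pos = contig.find(read[:seg])
    if pos != -1:
        overlap = len(contig) - pos
        if overlap >= min_overlap and contig[pos:] == read[:overlap]:
            return contig + read[overlap:]
    # End of read found inside contig: read may extend contig leftward.
    pos = contig.find(read[-seg:])
    if pos != -1:
        overlap = pos + seg
        if overlap >= min_overlap and contig[:overlap] == read[-overlap:]:
            return read[:-overlap] + contig
    return None

def assemble(reads, seg=12, min_overlap=100):
    contigs = []
    for read in reads:
        for i, c in enumerate(contigs):
            merged = try_extend(c, read, seg, min_overlap)
            if merged is not None:
                contigs[i] = merged
                break
        else:
            contigs.append(read)       # no overlap found: start a new contig
    return contigs
```

The false-match rejection discussed later happens at the `contig[pos:] == read[:overlap]` comparison: a short seed can land anywhere, but the full overlap check throws out coincidental hits.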
Sembler reports its progress as it goes. Here is how it starts on my test data:
The first line shows the settings used. The next lines show the reads used. Since the minimum read length for this run was 1200, reads 2, 6, 7, 9, … were not used because they were too short.
Up to read 66 no overlaps were found, so a contig was created from each read. At read 66, and again at read 77, the first 12 values of the read matched somewhere in one of the contigs. The rest of the contig matched the read after those 12 values, but the read was longer, and Sembler used the extra values to extend that contig to the right.
If we go down further to reads starting at 1004 we see:
We have now built up 177 contigs and they grow to a maximum of 179 contigs by read 1018. At this point, the contigs cover much of the genome and it is getting tougher for new reads not to be overlapping with at least one of the contigs.
The start of read 1024 matches somewhere in contig 78 and the end of read 1024 matches somewhere in contig 90. So this read has connected the two contigs. Contig 90 is merged into contig 78, and the last contig 179 is moved into contig 90’s spot just so that there aren’t any empty contigs to deal with.
Below is the end of the output from running reads with length >= 1200:
We get down to read 8997 which ends up merging contig 3 into contig 1, and contig 4 becomes contig 3. So we are left with just 3 contigs.
The run took 19.156 seconds.
Normally, you don’t know the genome. This procedure is designed to create the genome for you. But since I am still developing to get this to work, I had Sembler look up the final contigs in the genome to ensure it has done this correctly. The three contigs it came up with were:
Contig 1 from position 82 to 658275,
Contig 3 from position 658471 to 764383, and
Contig 2 from position 764404 to 816396.
So positions 1 to 81 were not identified, because there were no reads with length at least 1200 that started before position 82. And there was a small gap of length 195 between 658275 and 658471 which no reads covered and another gap of length 20 between 764383 and 764404 that no reads covered.
Despite the 7.7x coverage, there were still a couple of small gaps. We need a few more reads to fill in those gaps. One way of doing so is to lower the minimum read length. So I lowered the minimum read length to 1000 and got this:
Success! We now have a single contig from position 82 to 816396.
Optimizing the Assembler
I have so far done nothing to optimize Sembler’s code. The program compares character strings. It uses a Pos function to locate one string within another. There are many ways to improve this to make it run faster, but getting the algorithm working correctly was the first necessity. I have a lot of experience at optimizing code, so if I carry this program further, I will be sure to do so.
But just as important as optimizing the code is optimizing the algorithm. Starting parameters are very important. Let’s look at what tweaks can be made.
As we increase the minimum length of the reads we include, we reduce the number of reads we are using. This reduces coverage, reduces the number of compares we do and takes less time. The maximum contigs we have to deal with decreases and that maximum happens later during the reads.
But if our value for the minimum length is too high, we don’t get enough coverage to fill in all the gaps and we end up with more than one contig. The most important thing here is to try to end up with just one contig.
Based on the one contig requirement, our optimum for this set of reads for this genome is to select a minimum length of 1000.
Now let’s set the minimum length to 1000 and vary the segment length:
Varying the segment length we compare doesn’t change the result. The segment is only used to find a potential matching contig. If it is too short, the start or end of the read will match random locations in the contigs, but those false matches are rejected when the rest of the read is compared, which is why the solution doesn’t change. All those extra checks, however, can dramatically increase the execution time if the segment length is too small.
These are perfect reads I’m working with right now, with no errors. Once errors are considered, we’ll want to keep the segment length as small as possible to minimize the chance that the start segment or end segment contains an error. If it does, then that read will be rejected when the rest of the read is compared, effectively eliminating the use of that read.
Now let’s put the segment length back to 12 and vary the minimum overlap which by default I had set to 100:
These results surprise me somewhat. I was expecting a minimum overlap of 50 and especially of 0 to fail and give lots of contigs. I’ll have to think about this a bit. Maybe it is because I’m using perfect reads with no errors in them.
Nonetheless, this shows that if the minimum overlap is too high, then some of our matches will be excluded, causing gaps. We don’t want the minimum overlap too low either, or we may match two contigs that are side by side but don’t have enough “proof” to connect them. That isn’t a problem in this “perfect reads” case, but once errors are introduced, some overlap will likely be wanted as a double check.
Next Steps
This procedure works.
Is it fast enough for a full WGS dataset? We’re talking about tens to hundreds of millions of reads rather than just 9000. And we’re talking about a genome that is 3 billion positions rather than just 800,000. So instead of 200 max contigs, we’ll likely have 200,000 max contigs. So it could theoretically take a million times longer to solve than the little problem I have here.
If with optimization I can get the comparisons to be 20 times faster, then we’re talking about a million seconds, which is 278 hours, i.e. about 11½ days. That’s a bit longer than I was hoping. But this is just a back-of-the-envelope calculation. I’m not multithreading, and there are faster machines this can run on. If a procedure can be made available that will do a full de novo assembly of a human genome in less than 24 hours, that would be an achievement.
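Spelled out, the back-of-the-envelope numbers are (all values rough):

```python
# Rough scaling of the 19-second test run to a full human genome.
base_run = 19.156        # seconds for the 9,000-read test
scale = 1_000_000        # ~1,000x more reads times ~1,000x more contigs
speedup = 20             # hoped-for gain from code optimization
seconds = base_run * scale / speedup
print(round(seconds / 3600), "hours, about", round(seconds / 86400), "days")
```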
I have so far only tested with perfect data. It wouldn’t be too hard to test the addition of imperfections. I could change every 1000th value in my sample reads to something else and use that as a 0.1% error rate like WGS short reads. I could change every 500th for a 0.2% error rate like PacBio HiFi reads. And I can change every 20th for a 5% error rate like WGS long reads. I already have some ideas to change my exact comparison to an approximate comparison that will allow for a specific error rate. The tricky part will be getting it to be fast.
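Injecting those error rates into the perfect reads could be as simple as the following sketch (my own naming; substituting every k-th base approximates a 1/k error rate):

```python
def add_errors(read, k):
    """Substitute every k-th base with a different base (deterministic)."""
    bases = "ACGT"
    out = list(read)
    for i in range(k - 1, len(out), k):
        out[i] = bases[(bases.index(out[i]) + 1) % 4]  # rotate to next base
    return "".join(out)

# k=1000 gives ~0.1% errors (WGS short reads), k=500 ~0.2% (PacBio HiFi),
# and k=20 ~5% (WGS long reads).
```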
It might be worthwhile running my program as it is against my WGS short reads. I showed above that the minimum overlap may not need to be as high as I originally thought, so maybe the WGS short reads will be able to assemble somewhat. There will likely be many regions where the repeats are longer than the short reads, and this procedure will not be able to span them. The elephant in the room is: can I process my entire WGS short reads file in a reasonable amount of time (i.e. 1 day, not 11½ days)? And how many contigs will I get? If it’s 200, that will be pretty good, since that would average only 10 per chromosome. But if there are 2,000 contigs, then that’s not quite as good.
I should try to get a set of PacBio HiFi human reads. That is what this procedure is geared towards. PacBio HiFi are the reads that I think, with enough coverage, might just result in 25 contigs, one for each chromosome plus Y plus mt. Then it wouldn’t be too hard to add a phasing step to separate those contigs into 46 phased chromosomes + mt for women, or 44 phased chromosomes + X + Y + mt for men. I think the PacBio HiFi reads have a chance of being able to do this.
Finally, I would love to get a set of PacBio HiFi reads for myself. I don’t have my parents with me any more, and they never tested, and I’d love to phase my full genome to them. Also, I could then do some analysis and see how well (or poorly) the WGS alignment techniques I used compare to the (hopefully) accurate genome that I’ll have assembled for myself.
Maybe this won’t all happen this year. But I’m sure it will eventually, whether based on my Sembler algorithm or on some other creation by one of the many hyper-smart bioinformatics programmers out there.
If PacBio HiFi reads prove to be the revolution in genetic testing that they are promising to be, then for sure the whole world of WGS testing will change in the next few years.