Unicode files, whether encoded as UTF-8, UTF-16 or UTF-32, often start off with a few special bytes called the BOM (Byte Order Mark). This makes it easy for Windows or other programs, like Behold, to figure out the character set and display the text correctly.
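As a rough illustration of how a program can recognize a BOM, here is a minimal sketch in Python (Behold itself is written in Delphi, so this is not Behold's code). The function name `sniff_bom` is made up for this example; the byte signatures come from Python's standard `codecs` module.

```python
import codecs

# Known BOM signatures. The UTF-32 entries come first because the UTF-32 LE
# BOM (FF FE 00 00) starts with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Only the first four bytes of the file are needed for this check, and the returned length tells the caller how many bytes to skip before reading the text itself.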
GEDCOM files are files like any other. If they use UTF-8 or another Unicode character set, they should have a BOM at the start of the file to indicate this. But unfortunately, many GEDCOMs that I’ve seen in this format are missing the BOM. What that means is that a text reader, and Behold for that matter, will by default assume that the file is standard ANSI text, and it will come out as a garbled mess.
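To show what that garbling looks like, here is a small Python sketch. It assumes that "ANSI" means Windows code page 1252, which is the usual default on Western-language Windows systems; the sample name is invented.

```python
name = "Müller, Zoë"

# The bytes a program would write to a UTF-8 file with no BOM.
raw = name.encode("utf-8")

# What a reader shows if it assumes ANSI (here: Windows-1252) instead.
garbled = raw.decode("cp1252")

print(garbled)  # prints: MÃ¼ller, ZoÃ«
```

Each accented letter takes two bytes in UTF-8, and the ANSI reader displays each of those bytes as a separate character.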
Because of this, Behold needs a way to check any file without a BOM, to see what character set it uses before it tries to process it. But I had no idea of the best way to do that.
So I went to my good friend Stackoverflow and asked the Question: How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?
I am just in shock at how great a tool Stackoverflow is for me as a programmer. It wasn’t long before I had four very helpful answers, and I then closed the question because I had obtained the solution I needed.
This was the conclusion:
ShreevatsaR’s answer led me to search on Google for “universal encoding detector delphi”. To my surprise, this post was listed in the #1 position after being live for only about 45 minutes! That is fast googlebotting!! It is also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet Charset Detector at SourceForge, written in Delphi and based on Mozilla’s i18n component.
Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!
Once I add this to Behold, it will mean that Behold will be able to read GEDCOM files in unidentified character sets, and that will eliminate the missing BOM problem.
But I will have to custom-build ANSEL support into it. Many early genealogy programs produced GEDCOMs in ANSEL, and that character set is no longer used in current-day Windows operating systems.
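For anyone curious what guessing without a BOM looks like at its very simplest, here is a minimal Python sketch. It is not how ChsDet or Mozilla's detector works; those rely on much more elaborate statistical analysis. This version only checks whether the bytes are plain ASCII or valid UTF-8, and otherwise falls back to an assumed default. The function name `guess_encoding` and the `cp1252` fallback are assumptions made for this example, and ANSEL is not handled at all.

```python
def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Very rough guess at the encoding of bytes that have no BOM.

    Hypothetical helper for illustration only; it does not detect
    UTF-16 without a BOM, ANSEL, or any other single-byte code page.
    """
    # Pure 7-bit data reads the same in ASCII, ANSI and UTF-8.
    if data.isascii():
        return "ascii"
    try:
        # Non-ASCII text in an ANSI code page is rarely valid UTF-8,
        # so a clean strict decode is strong evidence for UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the assumed default.
        return fallback
```

One caveat: this should be run on the whole file or on a sample cut at a line boundary, since a sample that ends in the middle of a multi-byte UTF-8 character would fail the strict decode and be misreported as the fallback.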