Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012
Louis
What do you make of Tamura Jones's comment (http://www.tamurajones.net/SiblingTortureTest.xhtml):
None of these produced any errors or warnings, except Behold 1.04.
Behold warned that the date 13 Apr 2012 is non-standard, and should be 13 APR 2012; the warning is that the abbreviation should be in ALL-CAPITALS.
That is what the specification seems to say, but it does not;
Chapter 2 of the GEDCOM 5.5.1 specification clearly states that "All controlled line_value choices should be considered as case insensitive",
and that values should be converted to all uppercase or all lowercase prior to comparing.
That means that Apr is fine, and that means that you may even write aPR or aPr.
It seems to me that all lower or all upper is correct but not mixed, as Tamura suggests.
Brett
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Fri, 13 Apr 2012
Brett,
See my blog post: How To Get A Developer To Fix A Bug.
No, Tamura's correct. The statement: "values should be converted to all uppercase or all lowercase prior to comparing" means that aPR and aPr should both be changed to APR (if uppercase is used for comparison) or to apr (if lowercase is used for comparison). Either way, aPR and aPr are equivalent to apr, APR and Apr.
Louis
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012
So what does:
values should be converted to all uppercase or all lowercase prior to comparing.
actually mean?
Is this when:
1. comparing two dates within a program, such as to work out age or
2. two supposedly identical GEDCOMs are compared for differences?
If 1 above, how does a user know it is being done correctly by the program?
If 2 above, how do we change a GEDCOM to the same case in both files without a large (and possibly manual) conversion?
Brett
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Fri, 13 Apr 2012
I think it simply means comparing for the purpose of interpreting its value.
For a DATE value, I don't just compare the month-part to JAN, FEB, MAR,..., but I compare the uppercased value of the month-part to JAN, FEB, MAR,...
For a TYPE value, I don't just compare the value to STILLBORN, but I compare the uppercased value of the value to STILLBORN.
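A minimal Python sketch of the comparison described above. The function names `month_number` and `is_stillborn` are hypothetical and are not taken from Behold; the month list is the set of Gregorian abbreviations referred to in this thread.

```python
from typing import Optional

# Gregorian month abbreviations, in calendar order.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def month_number(month_part: str) -> Optional[int]:
    """Return 1-12 for a month abbreviation written in any case, else None."""
    folded = month_part.upper()  # fold to one case prior to comparing
    if folded in MONTHS:
        return MONTHS.index(folded) + 1
    return None


def is_stillborn(type_value: str) -> bool:
    """Compare the uppercased value, not the raw value, to STILLBORN."""
    return type_value.upper() == "STILLBORN"
```

With this, `month_number("aPr")` and `month_number("APR")` both give 4, and the text as it appeared in the file is left untouched.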
Louis
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012
I assume this applies to BET, ABT, etc., in that they can be written Bet, bet, etc., but are compared upper- or lowercased.
Brett
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Sat, 14 Apr 2012
Yes. All parts of the date. And that actually simplifies the work that Behold is doing.
Personally, I think it is a great idea that the GEDCOM designers had. I should have discovered it earlier, but now that I have, I'll make use of it.
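To illustrate applying the same folding to all parts of a date, here is a rough Python sketch. It assumes a simple whitespace-separated date value with no parenthesised date phrase; the function name `classify_date_tokens` and the token kinds are made up for this example.

```python
from typing import List, Tuple

# Date keywords and month abbreviations, all held in upper case.
DATE_KEYWORDS = {"ABT", "CAL", "EST", "BEF", "AFT", "BET", "AND", "FROM", "TO"}
MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}


def classify_date_tokens(date_value: str) -> List[Tuple[str, str]]:
    """Label each token of a date value, folding case before comparing."""
    result = []
    for token in date_value.split():
        folded = token.upper()
        if folded in DATE_KEYWORDS:
            result.append(("keyword", folded))
        elif folded in MONTHS:
            result.append(("month", folded))
        elif token.isdecimal():
            result.append(("number", token))
        else:
            result.append(("other", token))  # left as written
    return result
```

Because every token goes through the same fold-then-compare step, "Bet", "bet" and "BET" all end up as the same keyword, and no separate table of case variants is needed.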
Louis
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Mon, 30 Jun 2014
Subsequent thinking on this makes me now believe that GEDCOM intended that only LINE_VALUEs that are an enumerated list of choices were to be allowed to be mixed upper and lower case.
A DATE_VALUE is a line value, but it is not made up of an enumerated list of choices. It is made up of a substructure, with some components of the substructure (such as the month) being enumerated. I no longer believe that GEDCOM intended these complex structures to be allowed in mixed case; rather, they should be written precisely as defined (upper case).
Whether or not this is true, at least a warning should be given, because there may be programs that will not interpret all of "JAN", "Jan", "jan" and "jAn" to be the month of January.
See also: http://www.beholdgenealogy.com/blog/?p=1087
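One way to accept any case while still warning, as suggested in this post, might look like the following Python sketch. The function `check_month` and the wording of its messages are hypothetical and are not Behold's actual output.

```python
from typing import Optional, Tuple

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def check_month(month_part: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (month number or None, warning text or None)."""
    folded = month_part.upper()
    if folded not in MONTHS:
        return None, 'unrecognised month "%s"' % month_part
    warning = None
    if month_part != folded:
        # The value is understood, but other programs may be stricter.
        warning = 'month "%s" is not upper case; some programs may expect "%s"' % (
            month_part, folded)
    return MONTHS.index(folded) + 1, warning
```

The value is still interpreted either way; the warning only tells the user that the file may not read the same in every program.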
Louis
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Mon, 30 Jun 2014
By enumerated list of choices, do you mean 'controlled' as referred to in the specification:
All controlled line_value choices should be considered as case insensitive.
This means that the values should be converted to all uppercase or all lowercase prior to comparing.
The terms UPPERCASE and UpperCase are considered equal. TAGS are always UPPERCASE.
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Tue, 1 Jul 2014
Yes. However, GEDCOM does not define the difference between "controlled" and "uncontrolled" line values.
My interpretation is that controlled line values are line values that are restricted to a specified set of allowed optional values. Anything more complicated than that is likely deemed not to be controlled, since that is the logical meaning of the word "controlled".
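Under that interpretation, a controlled line value could be modelled as a lookup in a fixed set per tag. The Python sketch below is only an illustration: the table `CONTROLLED_VALUES` and the function `is_valid_controlled` are hypothetical, and the value lists shown for SEX and PEDI are assumptions that should be checked against the specification.

```python
from typing import Optional

# Illustrative table only: tag -> allowed values, stored in upper case.
CONTROLLED_VALUES = {
    "SEX": {"M", "F", "U"},
    "PEDI": {"ADOPTED", "BIRTH", "FOSTER", "SEALING"},
}


def is_valid_controlled(tag: str, value: str) -> Optional[bool]:
    """True/False for a controlled tag; None if the tag is not in the table."""
    allowed = CONTROLLED_VALUES.get(tag)
    if allowed is None:
        return None  # treated as uncontrolled: no case folding, no check
    return value.strip().upper() in allowed
```

Anything not in the table, such as free text or a structured value like a date, falls outside this check, which is the distinction being drawn here.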
Louis
Joined: Mon, 24 Nov 2014
10 blog comments, 13 forum posts
Posted: Mon, 24 Nov 2014
Just signed up but I have been mulling over this issue for a bit.
To me, the operative words in interpreting the standard (5.5.1) are "prior to comparing":
Until now, my interpretation has been (and I will still need more convincing to alter it) that the case of the actual value in the original does not matter.
IMO, the standard addresses the issue of whether data should be rejected due to differences in case. By specifying that the value from the original should be converted to either upper or lower case 'prior to comparing', it makes clear that any and all combinations are acceptable as long as the complete string matches the string specified in the standard, in a case-insensitive way. :-)
If the 'orIginal' string was to be expected to have a specific case formation - all upper or lower case or even leading capital - that is the way the standard should have expressed it.
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Sat, 29 Nov 2014
Arnold,
I think I've flip flopped again back to Tamura's view and your view. Despite my doubts about this, I think I'm going to have to allow any case in the line value without giving a warning message.
This is extra work for the programmer implementing GEDCOM, because the case of the line value must be retained for user data, but individual values within it that are to be compared to a set of enumerated values must be converted to one case prior to comparison.
Maybe the GEDCOM developers thought this would be easier for the programmer, but it is an extra step versus requiring the case to be as specified in the documentation. And it is something open to interpretation somewhat as this thread shows.
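The extra step described above, keeping the original text while folding only the enumerated part for comparison, could be sketched in Python as follows. It assumes a plain "day month year" date; the names `ParsedDate` and `parse_exact_date` are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


@dataclass(frozen=True)
class ParsedDate:
    raw: str    # line value exactly as it appeared in the file
    day: int
    month: int
    year: int


def parse_exact_date(line_value: str) -> Optional[ParsedDate]:
    """Parse 'day month year', folding only the month for comparison."""
    parts = line_value.split()
    if len(parts) != 3:
        return None
    day_text, month_text, year_text = parts
    folded = month_text.upper()  # the copy used for comparing
    if not (day_text.isdecimal() and year_text.isdecimal() and folded in MONTHS):
        return None
    return ParsedDate(line_value, int(day_text),
                      MONTHS.index(folded) + 1, int(year_text))
```

Here `parse_exact_date("13 Apr 2012")` keeps `raw` as "13 Apr 2012" for display or re-export, while `month` is 4 regardless of how the month was capitalised.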
Louis
Joined: Mon, 24 Nov 2014
10 blog comments, 13 forum posts
Posted: Sat, 29 Nov 2014
Yes, isn't it nice to have a 'standard' ;-)
The main reason I commented was because, before I found Behold, I had tried different validation apps to check my data and when I found nothing I thought was thorough enough - and also because the ones I did find disagreed with each other and the apps that produced the GEDCOM - I started building my very own validator.
If nothing else, it showed up the 'standard' for its many ambiguities and missed specs :-(
and it is nowhere near as thorough as I would want to be and may never get there.
As some wiseacre said: the advantage of having standards is that we now have so many to choose from :-)
Joined: Sun, 9 Mar 2003
288 blog comments, 245 forum posts
Posted: Sat, 29 Nov 2014
Arnold,
Not too long after Version 1.1 is out I'll be adding Consistency checking, which should really give your data a shakedown. Following that, when I add saving to GEDCOM, I'm going to be implementing even better GEDCOM checking on input.
Despite its ambiguities, GEDCOM has been remarkably successful. Here, 15 years after the last GEDCOM standard was released, almost all genealogy software has GEDCOM input and output. It may not be perfect, but just the idea that everyone uses it says something for it.
Louis