Re: [Gedcom-parse-devel] DATE is not LR (I think)
From: Peter V. <Pet...@ad...> - 2001-12-28 09:32:27
Hi Perry,
I don't know the definition of an LR parser very well, but I think
that the issues you give here are no problem for yacc/bison...
prapp wrote:
>
> I think that GEDCOM dates are not LR for two reasons:
>
> 1) date_phrase matches, for example, "1 SEP 1993 or 1994"
> but you don't know that is a date_phrase til you get past the 1993
> (and find out it is not a valid date)
Not true. According to the Gedcom spec, date phrases have to be
enclosed in parentheses. Again, the spec is not very consistent in
its notation, because it says:
DATE_VALUE := ... | (<DATE_PHRASE>)
DATE_PHRASE := (<TEXT>)
and so you'd end up with double parentheses, but still, the DATE_PHRASE
explanation says "The date phrase is enclosed in matching parentheses",
so the parentheses (single then) are for real.
So the date "1 SEP 1993 or 1994" cannot even be a date phrase; it is
simply not valid according to the spec.
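
To make that reading concrete, here is a minimal sketch in Python (not
gedcom-parse code; the helper name is_date_phrase is made up for this
illustration, and requiring non-empty text inside the parentheses is an
assumption). It accepts a value as a date phrase only when the whole
value sits inside one pair of parentheses:

```python
def is_date_phrase(value: str) -> bool:
    """Hypothetical helper: True if value is non-empty text in parentheses.

    Follows the reading that DATE_PHRASE is written as (<TEXT>) with a
    single pair of parentheses around the text.
    """
    value = value.strip()
    # Need "(", at least one character of text, and ")".
    return len(value) > 2 and value.startswith("(") and value.endswith(")")
```

Under this reading, "(1 SEP 1993 or 1994)" would be a date phrase, while
the bare "1 SEP 1993 or 1994" is rejected.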
>
> 2) "25" looks like the start of a day month year, but is actually
> just a year
That should be no problem, I think. It would be a bigger problem if
a day could stand on its own as a date, because then you couldn't
distinguish between the single day and the single year.
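The parsers that yacc/bison generate use a single token of lookahead,
and on that basis this date can be parsed without trouble: after the
number, if the next token is a month, the number was a day; otherwise
it was a year.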
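
The single-token lookahead argument can be sketched as follows. This is
a hand-written Python illustration, not a generated LR parser and not
gedcom-parse code; the function name parse_simple_date and the
simplified grammar (YEAR, MONTH YEAR, or DAY MONTH YEAR with Gregorian
month names only) are assumptions made for the example. The point is
only that the decision between "day" and "year" needs nothing more than
a peek at the next token:

```python
MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}


def parse_simple_date(text):
    """Parse 'YEAR', 'MONTH YEAR' or 'DAY MONTH YEAR' with one-token lookahead.

    Returns (day, month, year); parts that are absent are None.
    Raises ValueError on anything else.
    """
    tokens = text.split()
    pos = 0

    def peek():
        # The single lookahead token; None stands for end of input.
        return tokens[pos] if pos < len(tokens) else None

    def number():
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isdigit():
            raise ValueError(f"expected a number, got {tok!r}")
        pos += 1
        return int(tok)

    day = month = None
    if peek() in MONTHS:
        # MONTH YEAR
        month = tokens[pos]
        pos += 1
        year = number()
    else:
        first = number()
        if peek() in MONTHS:
            # Lookahead is a month, so the number just read was a day.
            day = first
            month = tokens[pos]
            pos += 1
            year = number()
        else:
            # No month follows, so the number just read was a year.
            year = first

    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return day, month, year
```

With this sketch, "25" comes out as a year on its own, "1 SEP 1993" as
day, month and year, and "1 SEP 1993 or 1994" is rejected at the token
"or" without any backtracking.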
>
> If I am correct, it will be more difficult to parse dates with
> a bison grammar, yes ?
I don't see any problems at the moment, but maybe I will run into
some later today :-)
>
> In fact, I decided to do a custom date parse, because I don't know enough
> to handle the phrase backtracking.
> (Also because I'm revising an existing date parser in LifeLines which is
> custom, and is a freeform, non-LR parser).
I agree that it will be difficult to have the date parsing of
gedcom-parse integrated into LifeLines.
I was thinking of having a separate bison parser for the dates, but I
won't do that in a first try, because I want to concentrate on the
parsing itself (a separate parser would again need its own lexer,
which gets too messy to handle now).
>
> (I think I'll use some context-sensitivity in calendars eventually -- if I
> add support for the Islam calendar eventually, I can recognize it by the
> month name,
> and then can expect AH or BH as an optional trailer instead of AD or BC).
From your experience with Gedcom files, what do you think the
Gedcom spec means in the description of YEAR_GREG? Is the
optional suffix "(B.C.)" or "B.C."? In my interpretation, it should
be the first one, but what have you seen in actual Gedcom files?
>
> I'm not planning to worry about a BC-equivalent for the Hebrew or Roman
> calendars :)
> (both go back pretty far, and probably have no standard trailer for such)
I agree with that. Also because the Roman calendar is not defined
at all in the spec... Although frankly neither is the Islam
calendar. But I agree that this would be a useful addition to the
Gedcom standard...
Best regards,
Peter.
--
===================================================================
Peter Verthez Software engineer
Email at work: mailto:Pet...@al...
at home: mailto:Pet...@ad...
WWW: http://gallery.uunet.be/Peter.Verthez
===================================================================
Don't believe anything you read, hear or think.