Re: [Gramps-devel] Finnish date handler

Hi,

On Thursday 20 April 2006 17:35, James A. Treacy wrote:
> On Wed, Apr 19, 2006 at 07:26:16PM -0700, Alex Roitman wrote:
> > It did, e.g. for en_US (any locale that you don't have or that we
> > lack a handler for will fall back to this as well). The bce string
> > is composed from these:
> >     bce = ["BC", "B\.C", "B\.C\.", "BCE", "B\.C\.E", "B\.C\.E"]
> >     self._bce_str = '(' + '|'.join(self.bce) + ')'
> > and the regex is such:
> >     self._bce_re   = re.compile("(.*)\s+%s( ?.*)" % self._bce_str)
> >
> > So what happens for en_US is that if B.C.E. is displayed, then
> > B\.C is matched by regex and ".E." is appended  to the date.
> > So from '90-11-04 B.C.E.' we get '90-11-04.E.' to work with.
>
> > This then fails to parse, and we end up with the text-only date.
> >
> > Should we go from the longest to the shortest match in bce list?

If there's this kind of a problem, I think all of these (qual, cal, bce)
lists should be sorted from longest to shortest.  It's probably enough
just to specify that all date handlers should list these from longest to
shortest, rather than to force that in code.

(Maybe adding a comment about that to each of the existing date
handlers would be enough?)
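
To illustrate the ordering point, here is a minimal sketch, not the
actual Gramps code.  The helper names build_bce_re and strip_bce are
hypothetical, and the list adds a trailing-dot "B\.C\.E\." form that is
an assumption.  Sorting by the length of the pattern source is only a
heuristic, since an escape such as "\." counts as two characters.

```python
import re

# Illustrative list only; the final "B.C.E." form is an assumption.
bce = ["BC", r"B\.C", r"B\.C\.", "BCE", r"B\.C\.E", r"B\.C\.E\."]

def build_bce_re(alternatives, longest_first=True):
    """Compile the date regex, optionally ordering alternatives longest first."""
    if longest_first:
        # Heuristic: order by length of the pattern source text.
        alternatives = sorted(alternatives, key=len, reverse=True)
    return re.compile(r"(.*)\s+(%s)( ?.*)" % "|".join(alternatives))

def strip_bce(regex, text):
    """Return (text without the BCE marker, whether a marker was found)."""
    match = regex.match(text)
    if match is None:
        return text, False
    return match.group(1) + match.group(3), True

as_listed = build_bce_re(bce, longest_first=False)
sorted_re = build_bce_re(bce)

print(strip_bce(as_listed, "90-11-04 B.C.E."))  # ('90-11-04.E.', True)
print(strip_bce(sorted_re, "90-11-04 B.C.E."))  # ('90-11-04', True)
```

With the list as written, the short "B\.C" branch wins and leaves ".E."
behind; with the longest branch first, the whole marker is consumed.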

> > I did that and it seemed to help. If you have a better suggestion,
> > please feel free to implement it.

I looked into the Python regexp quickref and there was no option to force
the regexp to pick the longest matching pattern alternative.  However,
"man 7 regex" says:
       In  the  event  that  an RE could match more than one
       substring of a given string, the RE matches the one starting
       earliest  in  the string.  If the RE could match more than
       one substring starting  at  that  point,  it  matches  the
       longest.

That would indicate that the longest alternative branch should also be
selected (for POSIX regexps, at least).
Could somebody check this with Python developers?
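
As far as the Python "re" documentation describes it, the POSIX rule
quoted above does not carry over: alternatives separated by '|' are
tried left to right, and the first one that lets the match succeed is
taken even if a later one would be longer.  A small check, using
illustrative patterns only:

```python
import re

text = "B.C.E."

# Short alternative listed first: it wins, even though a longer one would fit.
short_first = re.match(r"B\.C|B\.C\.E\.", text)
print(short_first.group())  # B.C

# Longer alternative listed first: now the whole marker is consumed.
long_first = re.match(r"B\.C\.E\.|B\.C", text)
print(long_first.group())  # B.C.E.
```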

> Is there any reason you don't use the following?
>   self._bce_re = re.compile("(.*)\s+(B[.]?C[.]?(E[.]?)?)( ?.*)")
> That will cover the problem with greediness and cases where
> someone isn't consistent in putting in a '.'.

It will also allow things like B.CE and BC.E, so maybe something
like the following would be better:
	(BCE?|B[.]C([.]E[.]?)?)

However, the sorting would still be required for bce alternatives
specified by non-English date handlers.
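
A quick way to compare the two marker patterns is to test them in
isolation against a handful of spellings.  This is only a sketch: the
sample list and the helper name accepted are made up for illustration,
and fullmatch is used so that each spelling must be consumed entirely.

```python
import re

# The two candidate marker patterns from this thread.
loose = re.compile(r"B[.]?C[.]?(E[.]?)?")
strict = re.compile(r"(BCE?|B[.]C([.]E[.]?)?)")

samples = ["BC", "BCE", "B.C", "B.C.", "B.C.E", "B.C.E.", "B.CE", "BC.E"]

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if pattern.fullmatch(s)]

print(accepted(loose, samples))   # all eight samples
print(accepted(strict, samples))  # ['BC', 'BCE', 'B.C', 'B.C.E', 'B.C.E.']
```

The stricter pattern rejects the mixed forms B.CE and BC.E, but it also
no longer accepts "B.C." as a whole, so that form may need its own
branch if it should stay supported.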

> Also, is there guaranteed to be a space before the BCE? If not,
> perhaps the '\s+' should come out.

Well, I think it's good to require that the qualifiers are separated,
otherwise they could match things like "abcepr" which would result
in matching bce + apr (short for April after "bce" is removed).
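
The "abcepr" case can be reproduced with a small sketch.  It assumes
case-insensitive matching (the example only works that way), and the
helper name remainder and the named groups are hypothetical.

```python
import re

# Marker pattern from above, with non-capturing groups for readability.
BCE = r"(?:BCE?|B[.]C(?:[.]E[.]?)?)"

# Assumption: matching is case-insensitive, as the "abcepr" example implies.
with_space = re.compile(r"(?P<head>.*)\s+" + BCE + r"(?P<tail> ?.*)", re.IGNORECASE)
without_space = re.compile(r"(?P<head>.*)" + BCE + r"(?P<tail> ?.*)", re.IGNORECASE)

def remainder(regex, text):
    """Return what is left once the marker is removed, or None if no marker."""
    match = regex.match(text)
    if match is None:
        return None
    return match.group("head") + match.group("tail")

print(remainder(without_space, "abcepr"))     # apr
print(remainder(with_space, "abcepr"))        # None
print(remainder(with_space, "90-11-04 BCE"))  # 90-11-04
```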

	- Eero
