From: Eero T. <ee...@us...> - 2006-04-20 19:08:35
Hi,

On Thursday 20 April 2006 17:35, James A. Treacy wrote:
> On Wed, Apr 19, 2006 at 07:26:16PM -0700, Alex Roitman wrote:
> > It did, e.g. for en_US (any locale that you don't have or that we
> > lack a handler for will fall back to this as well). The bce string
> > is composed from these:
> >   bce = ["BC", "B\.C", "B\.C\.", "BCE", "B\.C\.E", "B\.C\.E"]
> >   self._bce_str = '(' + '|'.join(self.bce) + ')'
> > and the regex is such:
> >   self._bce_re = re.compile("(.*)\s+%s( ?.*)" % self._bce_str)
> >
> > So what happens for en_US is that if B.C.E. is displayed, then
> > B\.C is matched by regex and ".E." is appended to the date.
> > So from '90-11-04 B.C.E.' we get '90-11-04.E.' to work with.
> > This then fails to parse, and we end up with the text-only date.
> >
> > Should we go from the longest to the shortest match in bce list?

If there's this kind of a problem, I think all of these (qual, cal, bce)
lists should be sorted from longest to shortest.

It's probably enough just to specify that all date handlers should list
these from longest to shortest, rather than force that in code. (Maybe
adding a comment about that to each of the existing date handlers would
be enough?)

> > I did that and it seemed to help. If you have a better suggestion,
> > please feel free to implement it.

I looked into the Python regexp quickref and there was no option to force
the regexp to pick the longest matching pattern alternative.

However, "man 7 regex" says:
    In the event that an RE could match more than one substring of a
    given string, the RE matches the one starting earliest in the
    string. If the RE could match more than one substring starting at
    that point, it matches the longest.

This would indicate that the longest alternative branch should also be
selected. Could somebody check this with Python developers?

> Is there any reason you don't use the following?
>   self._bce_re = re.compile("(.*)\s+(B[.]?C[.]?(E[.]?)?)( ?.*)")
> That will cover the problem with greediness and cases where
> someone isn't consistent in putting in a '.'.

It will also allow things like B.CE and BC.E, so maybe something like
the following would be better:
    (BCE?|B[.]C([.]E[.]?)?)

However, the sorting would still be required for bce alternatives
specified by non-English date handlers.

> Also, is there guaranteed to be a space before the BCE? If not,
> perhaps the '\s+' should come out.

Well, I think it's good to require that the qualifiers are separated,
otherwise they could match things like "abcepr", which would result in
matching bce + apr (short for April, after "bce" is removed).

	- Eero
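
For anyone who wants to reproduce the behaviour discussed above, here is a
minimal sketch. It assumes only Python's standard re module, which tries
alternatives left to right and takes the first one that matches rather than
the longest. The helper names build_bce_re and strip_bce are hypothetical
and are not part of any date handler; sorting by the length of the pattern
string is only a rough stand-in for "longest match first".

```python
import re

# Alternatives exactly as quoted in the thread (shortest listed first).
bce = ["BC", r"B\.C", r"B\.C\.", "BCE", r"B\.C\.E", r"B\.C\.E"]


def build_bce_re(alternatives):
    """Compile the date + BCE regex the same way the quoted code does."""
    bce_str = "(" + "|".join(alternatives) + ")"
    return re.compile(r"(.*)\s+%s( ?.*)" % bce_str)


def strip_bce(regex, text):
    """Hypothetical helper: drop the matched BCE marker, keep the rest."""
    match = regex.match(text)
    if match is None:
        return None
    return match.group(1) + match.group(3)


# Original order: "B\.C" wins before the longer alternatives are tried.
unsorted_re = build_bce_re(bce)
# Assumed fix: list the longest pattern strings first.
sorted_re = build_bce_re(sorted(bce, key=len, reverse=True))

print(strip_bce(unsorted_re, "90-11-04 B.C.E."))  # 90-11-04.E.
print(strip_bce(unsorted_re, "90-11-04 B.C."))    # 90-11-04.
print(strip_bce(sorted_re, "90-11-04 B.C."))      # 90-11-04

# The two hand-written marker patterns from the thread, checked on their own.
loose = re.compile(r"B[.]?C[.]?(E[.]?)?")      # James's suggestion
strict = re.compile(r"BCE?|B[.]C([.]E[.]?)?")  # the stricter variant

for marker in ["BC", "BCE", "B.C", "B.C.E.", "B.CE", "BC.E"]:
    print(marker, bool(loose.fullmatch(marker)), bool(strict.fullmatch(marker)))
```

With the list as quoted, even the sorted version leaves a trailing "." for
'90-11-04 B.C.E.', because no listed alternative ends in a final dot after
the E. Also note that the stricter pattern, as written, does not fully match
"B.C." (with the trailing dot), while the looser one does.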