From: Alex R. <sh...@gr...> - 2006-04-19 15:55:26
|
Eero, Just wanted to let you know that the Finnish date handler is not 100% OK. I thought it was, but it turned out that it was not imported due to the typo and en_US used as a fallback. You can run it like this, from the "src" directory in CVS: $ LANG=3Dfi_FI.utf8 python date_test.py Here's the result: ['fi_FI', 'utf'] <Date_fi.DateDisplayFI instance at 0xa6fa856c> <Date_fi.DateParserFI instance at 0xa7089dcc> [snip] RESULT: slash-dates: 60 dates ok, 0 failed. partial date: 144 dates ok, 0 failed. Non-gregorian: 270 dates ok, 34 failed. B. C. E.: 30 dates ok, 6 failed. basic test: 2022 dates ok, 0 failed. Here's where it fails: input was: calendar: CAL_GREGORIAN modifier: MOD_AFTER quality: QUAL_NONE dateval: (4, 11, -90, False) text: 'Comment. Format: VVVV-KK-PP (ISO)' DateDisplay gives: '4.11.90 ekr. j=C3=A4lkeen' parsed date was: calendar: CAL_GREGORIAN modifier: MOD_NONE quality: QUAL_NONE dateval: (4, 11, -90, False) text: '4.11.90 ekr. j=C3=A4lkeen' this gives:'4.11.90 ekr.' and similar cases. Also: input was: calendar: CAL_JULIAN modifier: MOD_RANGE quality: QUAL_NONE dateval: (4, 10, 1789, False, 5, 11, 1876, False) text: 'Comment. Format: VVVV-KK-PP (ISO)' DateDisplay gives: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4 (Juliaanine= n)' parsed date was: calendar: CAL_GREGORIAN modifier: MOD_TEXTONLY quality: QUAL_NONE dateval: (0, 0, 0, False) text: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' this gives:'4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' I can't quite find my way in named regex, so I thought I'd let you know. In both cases, it seems that the date is displayed correctly, but the it is not parsed correctly. Let me know if I can be of any help here, Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-19 16:24:13
|
Hi, On Wednesday 19 April 2006 18:55, Alex Roitman wrote: > Just wanted to let you know that the Finnish date handler is > not 100% OK. I thought it was, but it turned out that it was > not imported due to the typo and en_US used as a fallback. > You can run it like this, from the "src" directory in CVS: > $ LANG=3Dfi_FI.utf8 python date_test.py Thanks, I'll look into into and (try to :)) fix it. - Eero > Here's the result: > ['fi_FI', 'utf'] > <Date_fi.DateDisplayFI instance at 0xa6fa856c> > <Date_fi.DateParserFI instance at 0xa7089dcc> > [snip] > RESULT: > slash-dates: 60 dates ok, 0 failed. > partial date: 144 dates ok, 0 failed. > Non-gregorian: 270 dates ok, 34 failed. > B. C. E.: 30 dates ok, 6 failed. > basic test: 2022 dates ok, 0 failed. > > Here's where it fails: > input was: > calendar: CAL_GREGORIAN > modifier: MOD_AFTER > quality: QUAL_NONE > dateval: (4, 11, -90, False) > text: 'Comment. Format: VVVV-KK-PP (ISO)' > DateDisplay gives: '4.11.90 ekr. j=C3=A4lkeen' > parsed date was: > calendar: CAL_GREGORIAN > modifier: MOD_NONE > quality: QUAL_NONE > dateval: (4, 11, -90, False) > text: '4.11.90 ekr. j=C3=A4lkeen' > this gives:'4.11.90 ekr.' > and similar cases. > > Also: > input was: > calendar: CAL_JULIAN > modifier: MOD_RANGE > quality: QUAL_NONE > dateval: (4, 10, 1789, False, 5, 11, 1876, False) > text: 'Comment. Format: VVVV-KK-PP (ISO)' > DateDisplay gives: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4 (Juliaa= ninen)' > parsed date was: > calendar: CAL_GREGORIAN > modifier: MOD_TEXTONLY > quality: QUAL_NONE > dateval: (0, 0, 0, False) > text: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' > this gives:'4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' > > I can't quite find my way in named regex, so I thought I'd let > you know. In both cases, it seems that the date is displayed > correctly, but the it is not parsed correctly. > > Let me know if I can be of any help here, > Alex |
From: Eero T. <ee...@us...> - 2006-04-19 21:02:20
Attachments:
DateParser.py.diff
|
Hi, On Wednesday 19 April 2006 18:55, Alex Roitman wrote: > Just wanted to let you know that the Finnish date handler is > not 100% OK. I thought it was, but it turned out that it was > not imported due to the typo and en_US used as a fallback. > You can run it like this, from the "src" directory in CVS: > $ LANG=3Dfi_FI.utf8 python date_test.py > > I can't quite find my way in named regex, so I thought I'd let > you know. In both cases, it seems that the date is displayed > correctly, but the it is not parsed correctly. > > Let me know if I can be of any help here, > > Here's where it fails: > input was: > calendar: CAL_GREGORIAN > modifier: MOD_AFTER > quality: QUAL_NONE > dateval: (4, 11, -90, False) > text: 'Comment. Format: VVVV-KK-PP (ISO)' > DateDisplay gives: '4.11.90 ekr. j=C3=A4lkeen' > parsed date was: > calendar: CAL_GREGORIAN > modifier: MOD_NONE > quality: QUAL_NONE > dateval: (4, 11, -90, False) > text: '4.11.90 ekr. j=C3=A4lkeen' > this gives:'4.11.90 ekr.' > and similar cases. DateParser was discarding the stuff after bce pattern (here "ekr."). Attached patch fixes that. The patch contains same fix also for quality and calendar type patterns and matches. Although nothing currently requires it, I think it would be good to be fixed. Could you test that it doesn't break calendars for any of the other locales? Btw. I noticed that last BCE strings in the DateParser were same, I guess that's not the intention? > Also: > input was: > calendar: CAL_JULIAN > modifier: MOD_RANGE > quality: QUAL_NONE > dateval: (4, 10, 1789, False, 5, 11, 1876, False) > text: 'Comment. Format: VVVV-KK-PP (ISO)' > DateDisplay gives: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4 (Juliaa= ninen)' > parsed date was: > calendar: CAL_GREGORIAN > modifier: MOD_TEXTONLY > quality: QUAL_NONE > dateval: (0, 0, 0, False) > text: '4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' > this gives:'4.10.1789 ja 5.11.1876 v=C3=A4lill=C3=A4' The problem seems to be that the calendar names in DateDisplayFI were set as unicode. If the "u" is removed, then above cases work. However, 4 cases for hebrew and french calendar are still failing although other calendars are working. The errors look like this: input was: calendar: CAL_HEBREW modifier: MOD_NONE quality: QUAL_NONE dateval: (4, 13, 1789, False) text: 'Comment. Format: VVVV-KK-PP (ISO)' DateDisplay gives: '1789-13-04 (heprealainen)' parsed date was: calendar: CAL_GREGORIAN modifier: MOD_TEXTONLY quality: QUAL_NONE dateval: (0, 0, 0, False) text: '1789-13-04' this gives:'1789-13-04' Could you explain me why things in DateParserFI should use unicode strings and things in DateDisplayFI or parser patterns should not use unicode? - Eero |
From: Alex R. <sh...@gr...> - 2006-04-20 02:26:33
|
Eero, On Thu, 2006-04-20 at 00:19 +0300, Eero Tamminen wrote: > DateParser was discarding the stuff after bce pattern (here "ekr."). > Attached patch fixes that. Thanks! > The patch contains same fix also for quality and calendar type patterns > and matches. Although nothing currently requires it, I think it would be > good to be fixed. Could you test that it doesn't break calendars for any > of the other locales? It did, e.g. for en_US (any locale that you don't have or that we lack a handler for will fall back to this as well). The bce string is composed from these: bce =3D ["BC", "B\.C", "B\.C\.", "BCE", "B\.C\.E", "B\.C\.E"] self._bce_str =3D '(' + '|'.join(self.bce) + ')' and the regex is such: self._bce_re =3D re.compile("(.*)\s+%s( ?.*)" % self._bce_str) So what happens for en_US is that if B.C.E. is displayed, then B\.C is matched by regex and ".E." is appended to the date. So from '90-11-04 B.C.E.' we get '90-11-04.E.' to work with. This then fails to parse, and we end up with the text-only date. Should we go from the longest to the shortest match in bce list? I did that and it seemed to help. If you have a better suggestion, please feel free to implement it. But now it started to fail for de_DE, because German bce string used groups and this changed the way regex was matched. I fixed that for now, but it is important that localized bce strings don't have anything in the parentheses. > Btw. I noticed that last BCE strings in the DateParser were same, I guess > that's not the intention? I think the last one was meant to end with '\.', thanks for catching. > Could you explain me why things in DateParserFI should use unicode > strings and things in DateDisplayFI or parser patterns should not use > unicode? No :-) This is black magic for me, nothing makes sense. I suspect that the regex in python is not 100% bug-free when it comes to mixing unicode and plain strings, but I can't really pinpoint anything. Alex P.S. The Finnish handler still has 30 errors on non-gregorian dates. --=20 Alexander Roitman http://www.gramps-project.org |
From: James A. T. <tr...@de...> - 2006-04-20 14:35:34
|
On Wed, Apr 19, 2006 at 07:26:16PM -0700, Alex Roitman wrote: > It did, e.g. for en_US (any locale that you don't have or that we > lack a handler for will fall back to this as well). The bce string > is composed from these: > bce = ["BC", "B\.C", "B\.C\.", "BCE", "B\.C\.E", "B\.C\.E"] > self._bce_str = '(' + '|'.join(self.bce) + ')' > and the regex is such: > self._bce_re = re.compile("(.*)\s+%s( ?.*)" % self._bce_str) > > So what happens for en_US is that if B.C.E. is displayed, then > B\.C is matched by regex and ".E." is appended to the date. > So from '90-11-04 B.C.E.' we get '90-11-04.E.' to work with. > This then fails to parse, and we end up with the text-only date. > > Should we go from the longest to the shortest match in bce list? > I did that and it seemed to help. If you have a better suggestion, > please feel free to implement it. Is there any reason you don't use the following? self._bce_re = re.compile("(.*)\s+(B[.]?C[.]?(E[.]?)?)( ?.*)") That will cover the problem with greediness and cases where someone isn't consistent in putting in a '.'. Also, is there guaranteed to be a space before the BCE? If not, perhaps the '\s+' should come out. -- James Treacy tr...@de... |
From: Alex R. <sh...@gr...> - 2006-04-21 00:30:19
|
On Wed, 2006-04-19 at 19:26 -0700, Alex Roitman wrote: > On Thu, 2006-04-20 at 00:19 +0300, Eero Tamminen wrote: > > Could you explain me why things in DateParserFI should use unicode > > strings and things in DateDisplayFI or parser patterns should not use > > unicode? >=20 > No :-) This is black magic for me, nothing makes sense. I suspect > that the regex in python is not 100% bug-free when it comes to mixing > unicode and plain strings, but I can't really pinpoint anything. I figured that changing everything to unicode helps :-) Committed to the CVS, all tests are passing now! Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-21 18:51:57
|
Hi, On Friday 21 April 2006 03:30, Alex Roitman wrote: > > > Could you explain me why things in DateParserFI should use unicode > > > strings and things in DateDisplayFI or parser patterns should not use > > > unicode? > > > > No :-) This is black magic for me, nothing makes sense. I suspect > > that the regex in python is not 100% bug-free when it comes to mixing > > unicode and plain strings, but I can't really pinpoint anything. > > I figured that changing everything to unicode helps :-) > Committed to the CVS, all tests are passing now! Ok , thanks! I was getting a bit stumped on that... :-) - Eero |
From: Eero T. <ee...@us...> - 2006-04-20 19:08:35
|
Hi, On Thursday 20 April 2006 17:35, James A. Treacy wrote: > On Wed, Apr 19, 2006 at 07:26:16PM -0700, Alex Roitman wrote: > > It did, e.g. for en_US (any locale that you don't have or that we > > lack a handler for will fall back to this as well). The bce string > > is composed from these: > > bce =3D ["BC", "B\.C", "B\.C\.", "BCE", "B\.C\.E", "B\.C\.E"] > > self._bce_str =3D '(' + '|'.join(self.bce) + ')' > > and the regex is such: > > self._bce_re =3D re.compile("(.*)\s+%s( ?.*)" % self._bce_str) > > > > So what happens for en_US is that if B.C.E. is displayed, then > > B\.C is matched by regex and ".E." is appended to the date. > > So from '90-11-04 B.C.E.' we get '90-11-04.E.' to work with. > > > This then fails to parse, and we end up with the text-only date. > > > > Should we go from the longest to the shortest match in bce list? If there's this kind of a problem, I think all of these (qual, cal, bce) lists should be sorted from longest to shortest. It's probably enough just to specify that all date handlers should list these from longest to shortest, than force that in code. (Maybe adding a comment about that to each of the existing date handlers would be enough?) > > I did that and it seemed to help. If you have a better suggestion, > > please feel free to implement it. I looked into Python regexp quickref and there was no option to force the regexp to pick the longest matching pattern alternative. However, "man 7 regex": In the event that an RE could match more than one sub=AD string of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Would indicate that also the longest alternative branch should be selecte= d. Could somebody check this with Python developers? > Is there any reason you don't use the following? > self._bce_re =3D re.compile("(.*)\s+(B[.]?C[.]?(E[.]?)?)( ?.*)") > That will cover the problem with greediness and cases where > someone isn't consistent in putting in a '.'. It will also allow things like B.CE and BC.E, so maybe something like following would be better: (BCE?|B[.]C([.]E[.]?)?) However, the sorting would still be required for bce alternatives specifi= ed by non-english date handers. > Also, is there guaranteed to be a space before the BCE? If not, > perhaps the '\s+' should come out. Well, I think it's good to require that the qualifiers are separated, otherwise they could match things like "abcepr" which would result in matching bce + apr (short for April after "bce" is removed). - Eero |
From: Alex R. <sh...@gr...> - 2006-04-20 19:32:18
|
On Thu, 2006-04-20 at 22:25 +0300, Eero Tamminen wrote: > > Should we go from the longest to the shortest match in bce list? >=20 > If there's this kind of a problem, I think all of these (qual, cal, bce) > lists should be sorted from longest to shortest. It's probably enough > just to specify that all date handlers should list these from longest to > shortest, than force that in code. Not necessarily all of them from longest to shortest, but the entries that match same thing need to be sorted that way. E.g. BCE and BC. But BCE and "before common era" could be ordered either way. I don't quite see your distinction between specifying and forcing. The code should have it right. What did you mean on this one? > > Is there any reason you don't use the following? > > self._bce_re =3D re.compile("(.*)\s+(B[.]?C[.]?(E[.]?)?)( ?.*)") > > That will cover the problem with greediness and cases where > > someone isn't consistent in putting in a '.'. >=20 > It will also allow things like B.CE and BC.E, so maybe something > like following would be better: > (BCE?|B[.]C([.]E[.]?)?) See, the problems with these are that in the localized handler self._bce_re must have the exact same groupls defined. If we just compose _bce_re from the list of 'B\.C\.E\.', 'B C E', etc, then localized handlers can just re-define bce list and be done. If the number of groups varies then every handler needs to rewrite a lot of the parser, or suffer the breakage. It seems that the solution I propose will work for all handlers: bce list is defined so that entries matching same strings are ordered longest to shortest. Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-20 20:15:27
|
Hi, On Thursday 20 April 2006 22:32, Alex Roitman wrote: > > It will also allow things like B.CE and BC.E, so maybe something > > like following would be better: > > (BCE?|B[.]C([.]E[.]?)?) > > See, the problems with these are that in the localized handler > self._bce_re must have the exact same groupls defined. > If we just compose _bce_re from the list of 'B\.C\.E\.', 'B C E', > etc, then localized handlers can just re-define bce list and be done. > If the number of groups varies then every handler needs to > rewrite a lot of the parser, or suffer the breakage. Ah, I hadn't realized that subgroups produce separate matches. > It seems that the solution I propose will work for all handlers: > bce list is defined so that entries matching same strings are > ordered longest to shortest. Shouldn't actually all the other lists be ordered similarly? Currently only month_to_int and modifier_to_int are reverse sorted in DateParser, but also islamic_to_int, calendar_to_int, quality_to_int have items which are subsets of their other items. And date parses for other locales might have similar problem also for other dicts. Easiest would be to have an auxiliary method: def re_group_in_reverse_order(self, keys): "creates such a RE group out of given keys that longest is matched first" keys.sort() keys.reverse() return '(' + '|'.join(keys) + ')' Btw. the init_strings() method redundantly stores the regex patterns to self. If needed, you can get the pattern for the compiled regex "pattern" variable. I.e. all of these: self._rfc_mon_str = '(' + '|'.join(self._rfc_mons_to_int.keys()) + ')' etc could be replaced with: rfc_mon_str = '(' + '|'.join(self._rfc_mons_to_int.keys()) + ')' - Eero |
From: James A. T. <tr...@de...> - 2006-04-20 21:41:55
|
On Thu, Apr 20, 2006 at 11:32:13PM +0300, Eero Tamminen wrote: > > See, the problems with these are that in the localized handler > > self._bce_re must have the exact same groupls defined. > > If we just compose _bce_re from the list of 'B\.C\.E\.', 'B C E', > > etc, then localized handlers can just re-define bce list and be done. > > If the number of groups varies then every handler needs to > > rewrite a lot of the parser, or suffer the breakage. > > Ah, I hadn't realized that subgroups produce separate matches. This isn't necessarily a problem as subgroups can be non-capturing. Just start the group you don't want to see in the result with '?:', e.g. (?:abc) If there is concern with changing the group numbering, named groups can be used. E.g. >>> self.bce_re = re.compile("(?P<pre>.*)\s+(?P<bce>B[.]C[.](E[.])?)(?P<post> ?.*)") >>> m = self.bce_re.search('before B.C.E. after') >>> m.group('bce') 'B.C.E.' >>> m.group('pre') 'before' I'm not suggesting the code be switched to use this. I just wanted you to be aware of alternatives which might be useful. -- James Treacy tr...@de... |
From: Martin H. <Mar...@gm...> - 2006-04-21 18:47:11
|
James, Eero, now that the date handlers pass all tests and we are close to a new release I would suggest to leave that as is for now (it is working) before starting to create complicated regex. Well, those new regex is not complicated for developers but maybe too complicates for new translators. Simply translating a list is much more easy. Just my 2 cent. Cheers, Martin. > Von: "James A. Treacy" <tr...@de...> > On Thu, Apr 20, 2006 at 11:32:13PM +0300, Eero Tamminen wrote: > If there is concern with changing the group numbering, named > groups can be used. E.g. > >>> self.bce_re = > re.compile("(?P<pre>.*)\s+(?P<bce>B[.]C[.](E[.])?)(?P<post> ?.*)") > >>> m = self.bce_re.search('before B.C.E. after') > >>> m.group('bce') > 'B.C.E.' > >>> m.group('pre') > 'before' -- GMX Produkte empfehlen und ganz einfach Geld verdienen! Satte Provisionen für GMX Partner: http://www.gmx.net/de/go/partner |
From: Alex R. <sh...@gr...> - 2006-04-21 19:19:58
|
On Fri, 2006-04-21 at 20:46 +0200, Martin Hawlisch wrote: > James, Eero, >=20 > now that the date handlers pass all tests and we are close to a new relea= se > I would suggest to leave that as is for now (it is working) before starti= ng > to create complicated regex. Well, those new regex is not complicated for > developers but maybe too complicates for new translators. Simply translat= ing > a list is much more easy. I would second that, Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-22 10:54:58
Attachments:
dateparse.diff
|
Hi, On Friday 21 April 2006 21:46, Martin Hawlisch wrote: > now that the date handlers pass all tests and we are close to a new > release I would suggest to leave that as is for now (it is working) > before starting to create complicated regex. Well, those new regex is not > complicated for developers but maybe too complicates for new translators. > Simply translating a list is much more easy. There are still bugs in the date parsing that I mentioned in the earlier mail, because it seems that with regex group alternatives, the longest alternative is not automatically matched, just the first matching one. The automatic tester doesn't cover these because they are specific to strings used by the locale specific parsers. 1. Attached patch should fix following issues in the localized date parser: - In RU parser the _range1 alternatives were in an order which matches shorter form first - In ES parser the _range1 alternatives where in an order which matches the shorter ('ent' instead of 'ent.') first - In FR parser the range alternatives where in an order which matches the shorter ('ent' instead of 'ent.') first 2. In the DateParser.py it fixes similar issue with (e.g. with Islamic calendar) by improving the code in the following way: - Adding a utility method that sorts the keys for the regex so that longest key is first and quotes the '.' characters. The method returns a string containing the keys as a RE group - Changing init_strings() method to use the new utility method. (current code does this operation only for some arrays which can be overriden by the locale specific parsers and in about different way for each one...) - Removing the quoting from BCE strings as its now done by the utility method The patch modifieds the DE, FI & RU date parsers accordingly by removing quotes from their string lists (as they are now quoted automatically). Note that for 1), the RU & ES parsers could be fixed also by calling the new (self.re_longest_first) method instead of join() for the string arrays. 3. Additionally the patch contains following DataParser.py modifications: - Does not add the temporary strings to self in strings_init() (makes code slightly more readable and insignificantly faster & less memory using) - Moves the gregorian validity check from parse_calendar() to _parse_greg_julian() method as it doesn't seem to be used anywhere. This "improvement" could be dropped in case the other calendars will later get their own validation functions - Eero |
From: Alex R. <sh...@gr...> - 2006-04-22 18:53:14
|
Eero, This is a long patch with many changes all over. Before we consider putting it into 2.0.11, can you tell what errors does it fix? On Sat, 2006-04-22 at 14:11 +0300, Eero Tamminen wrote: > There are still bugs in the date parsing that I mentioned in the earlier > mail, because it seems that with regex group alternatives, the longest > alternative is not automatically matched, just the first matching one. > The automatic tester doesn't cover these because they are specific > to strings used by the locale specific parsers. I am not sure I understand this. The tester sets the dates by directly setting the Date() instance. Then it displays it. Then it parses the displayed value. If the two displayed values are the same and if the parsed Date() instance is equivalent to the original Date() instance then we're in a good shape. I am hesitant to introduce this much change without a good reason. Alex > 1. Attached patch should fix following issues in the localized date parse= r: > - In RU parser the _range1 alternatives were in an order which matches > shorter form first > - In ES parser the _range1 alternatives where in an order which matches > the shorter ('ent' instead of 'ent.') first > - In FR parser the range alternatives where in an order which matches > the shorter ('ent' instead of 'ent.') first >=20 >=20 > 2. In the DateParser.py it fixes similar issue with (e.g. with Islamic > calendar) by improving the code in the following way: > - Adding a utility method that sorts the keys for the regex so that > longest key is first and quotes the '.' characters. The method returns > a string containing the keys as a RE group > - Changing init_strings() method to use the new utility method. > (current code does this operation only for some arrays which can be > overriden by the locale specific parsers and in about different way fo= r > each one...) > - Removing the quoting from BCE strings as its now done by the utility > method >=20 > The patch modifieds the DE, FI & RU date parsers accordingly by removing > quotes from their string lists (as they are now quoted automatically). >=20 > Note that for 1), the RU & ES parsers could be fixed also by calling the = new > (self.re_longest_first) method instead of join() for the string arrays. >=20 >=20 > 3. Additionally the patch contains following DataParser.py modifications: > - Does not add the temporary strings to self in strings_init() > (makes code slightly more readable and insignificantly faster & less > memory using) > - Moves the gregorian validity check from parse_calendar() to > _parse_greg_julian() method as it doesn't seem to be used anywhere. > This "improvement" could be dropped in case the other calendars will > later get their own validation functions --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-22 20:57:09
Attachments:
dateparse.diff
|
Hi, On Saturday 22 April 2006 21:53, Alex Roitman wrote: > This is a long patch with many changes all over. > Before we consider putting it into 2.0.11, can you tell > what errors does it fix? The problem is that if you have a regex like "(cd|cde)", Python will match "cd" instead of "cde" from string "abcde", as we found out with bce. However, bce is not the only thing with this problem, it's just the only one that the current test code happened to excercise. In several Date handlers the alternatives were not in the longest first order. In DateParser.py itself alternatives for some other things are taken from dict keys and I don't think Python orders them in longest first order... The current DateParser sorts some of the dict keys before joining them (and in different way for no aparent reason, "April" is before "Apr" regardless whether you short according to key lenght or first sort keys alphabetically and then reverse) , but not all. I fixed this by adding a trivial method to do the sorting (and '.' quoting) and then changed all joinings to use that method in init_strings(). > On Sat, 2006-04-22 at 14:11 +0300, Eero Tamminen wrote: > > There are still bugs in the date parsing that I mentioned in the > > earlier mail, because it seems that with regex group alternatives, the > > longest alternative is not automatically matched, just the first > > matching one. The automatic tester doesn't cover these because they are > > specific to strings used by the locale specific parsers. > > I am not sure I understand this. The tester sets the dates by directly > setting the Date() instance. Then it displays it. Then it parses the > displayed value. If the two displayed values are the same and if > the parsed Date() instance is equivalent to the original Date() instance > then we're in a good shape. These tests test only whether the Date parsers can handle the output of Date displayers, not whether the parsers can handle all the alternatives for given type that the localizer thought he had handled by listing them in the Date handler specific dicts. Testing that would require there to be test code that is specific to each Date handler... > I am hesitant to introduce this much change without a good reason. The changes are pretty trivial: - make all joinings to use the new method which guarantees that the joined keys are in correct (longest first) order - removing '\' chars from the arrays & dicts which are now quated by the new method - "s/self._//g" for all *_str variables in the init_strings() method (this is the one accouting for most of changed lined, but it's easy to see that no regexps were change by this) - fix the order of alternate groupings in Date handlers where they were not longest first Attached is a version from which I left out the gregorian function changes. I hope this makes reviewing it easier. - Eero > Alex > > > 1. Attached patch should fix following issues in the localized date > > parser: - In RU parser the _range1 alternatives were in an order which > > matches shorter form first > > - In ES parser the _range1 alternatives where in an order which matches > > the shorter ('ent' instead of 'ent.') first > > - In FR parser the range alternatives where in an order which matches > > the shorter ('ent' instead of 'ent.') first > > > > > > 2. In the DateParser.py it fixes similar issue with (e.g. with Islamic > > calendar) by improving the code in the following way: > > - Adding a utility method that sorts the keys for the regex so that > > longest key is first and quotes the '.' characters. The method > > returns a string containing the keys as a RE group > > - Changing init_strings() method to use the new utility method. > > (current code does this operation only for some arrays which can be > > overriden by the locale specific parsers and in about different way > > for each one...) > > - Removing the quoting from BCE strings as its now done by the utility > > method > > > > The patch modifieds the DE, FI & RU date parsers accordingly by > > removing quotes from their string lists (as they are now quoted > > automatically). > > > > Note that for 1), the RU & ES parsers could be fixed also by calling > > the new (self.re_longest_first) method instead of join() for the string > > arrays. > > > > > > 3. Additionally the patch contains following DataParser.py > > modifications: - Does not add the temporary strings to self in > > strings_init() (makes code slightly more readable and insignificantly > > faster & less memory using) > > - Moves the gregorian validity check from parse_calendar() to > > _parse_greg_julian() method as it doesn't seem to be used anywhere. > > This "improvement" could be dropped in case the other calendars will > > later get their own validation functions |
From: Alex R. <sh...@gr...> - 2006-04-22 23:02:23
|
Eero, OK, I see the need for fixing the order now. > The changes are pretty trivial: > - make all joinings to use the new method which guarantees that the joine= d > keys are in correct (longest first) order No problem here. > - removing '\' chars from the arrays & dicts which are now quated by > the new method No problem. > - "s/self._//g" for all *_str variables in the init_strings() method > (this is the one accouting for most of changed lined, but it's easy > to see that no regexps were change by this) But this breaks German, French, and Dutch handlers that use self._*_str attributes! Let's not do cosmetic changes between the rc2 and the release. We can take our time and pretty things up after the release. Or we could have done it weel before 2.0.11. But now I think it's doing more harm than good. Inevitably, something gets overlooked and then we're stuck with the broken released code. > - fix the order of alternate groupings in Date handlers where they were > not longest first OK > Attached is a version from which I left out the gregorian function change= s. > I hope this makes reviewing it easier. Yes. I'll apply the parts that don't cause the breakage. Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Alex R. <sh...@gr...> - 2006-04-22 23:34:02
|
Eero, On Sat, 2006-04-22 at 16:02 -0700, Alex Roitman wrote: > I'll apply the parts that don't cause the breakage. I have applied and committed these and some other changes, all of the same sort. Sorry if I was too harsh in my previous response. But I really think that converting from class attributes to local vars is not as important as fixing the bugs at this point. Thanks a lot for your patches, even though I only applied a small subset :-) Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-23 16:36:30
|
Hi, On Sunday 23 April 2006 02:02, Alex Roitman wrote: > OK, I see the need for fixing the order now. > > > - fix the order of alternate groupings in Date handlers where they were > > not longest first > > OK Hm, seems I was a bit too hasty. I thought about it a bit more and the alternate grouping should be a problem only if the RE pattern allows the alternate group to be followed by any character (that is in the end of any alternatives). If it's followed by e.g. whitespace, the RE pattern should match correctly. So, something like "(entre|ent|ent\.)\s+" should actually always match the right alternative, whereas "(entre|ent|ent\.)\s*" wouldn't, it would take just "ent" from "ent.". I.e. actually the ru/fr/es alternative groups wouldn't have needed changing, in all of them the alternatives group is followed by at leat one whitespace in the pattern. > > The changes are pretty trivial: > > - make all joinings to use the new method which guarantees that the > > joined keys are in correct (longest first) order > > No problem here. > > > - removing '\' chars from the arrays & dicts which are now quated by > > the new method > > No problem. > > > - "s/self._//g" for all *_str variables in the init_strings() method > > (this is the one accouting for most of changed lined, but it's easy > > to see that no regexps were change by this) > > But this breaks German, French, and Dutch handlers that use self._*_str > attributes! Hmph. I hadn't noticed that. (All of German, French and Dutch date handlers are using the months in such a regex that the longest first sorting is required. I.e. the sorting is still a good thing for DateParser, although the locale specific Date handler changes were not necessary.) > Let's not do cosmetic changes between the rc2 and the > release. We can take our time and pretty things up after the release. > Or we could have done it weel before 2.0.11. But now I think > it's doing more harm than good. Inevitably, something gets overlooked > and then we're stuck with the broken released code. Sorry about that. Happily the date tests were in place to catch my blunderings. :-) When there's the right time to pretty up things, I'd like to see that: - Things which are used outside the class, do not start with underscore. I notice that I had used _bce_str in Date_fi, that would be nice to clean up in addition to _*_str stuff used by de/fr/nl Date handlers. This is an API for translators... > > Attached is a version from which I left out the gregorian function > > changes. I hope this makes reviewing it easier. > > Yes. I'll apply the parts that don't cause the breakage. - Eero |
From: Alex R. <sh...@gr...> - 2006-04-23 18:20:43
|
Eero, On Sun, 2006-04-23 at 19:53 +0300, Eero Tamminen wrote: > Hm, seems I was a bit too hasty. I thought about it a bit more and the > alternate grouping should be a problem only if the RE pattern allows the > alternate group to be followed by any character (that is in the end of > any alternatives). If it's followed by e.g. whitespace, the RE pattern > should match correctly. >=20 > So, something like "(entre|ent|ent\.)\s+" should actually always match th= e > right alternative, whereas "(entre|ent|ent\.)\s*" wouldn't, it would take > just "ent" from "ent.". I.e. actually the ru/fr/es alternative groups=20 > wouldn't have needed changing, in all of them the alternatives group is > followed by at leat one whitespace in the pattern. But since they are already changed now, it seems that this does not hurt either, so we can leave things as they are. Right? > > > - "s/self._//g" for all *_str variables in the init_strings() method > > > (this is the one accouting for most of changed lined, but it's easy > > > to see that no regexps were change by this) > > > > But this breaks German, French, and Dutch handlers that use self._*_str > > attributes! >=20 > Hmph. I hadn't noticed that. >=20 > (All of German, French and Dutch date handlers are using the months in su= ch > a regex that the longest first sorting is required. I.e. the sorting is > still a good thing for DateParser, although the locale specific Date hand= ler > changes were not necessary.) True, and this is committed. > When there's the right time to pretty up things, I'd like to see that: > - Things which are used outside the class, do not start with underscore. > I notice that I had used _bce_str in Date_fi, that would be nice to > clean up in addition to _*_str stuff used by de/fr/nl Date handlers. > This is an API for translators... This is a tough call though. The DateParserXX classes are derived from DateParser, so in a way self._*_str is not used outside the class. This is because self._*_str is a class attribtue of DateParserXX through the inheritance. But we can come up with any conventions we want :-) The formal API is kinda hard to do, because some date handlers need to rewrite more of the DateParser methods than others. For example, Russian handler only redefines the constants, but French and Dutch write their own parsing methods. So it's not always clear in advance what is a DateParser private method versus what should be exposed and formalized as an API. If you have time and inclination to work this out, the good time would start right after 2.0.11 is released. Should be in about a week from now. Thanks, Alex --=20 Alexander Roitman http://www.gramps-project.org |
From: Eero T. <ee...@us...> - 2006-04-23 20:00:33
|
Hi, On Sunday 23 April 2006 21:20, Alex Roitman wrote: > > Hm, seems I was a bit too hasty. I thought about it a bit more and the > > alternate grouping should be a problem only if the RE pattern allows > > the alternate group to be followed by any character (that is in the end > > of any alternatives). If it's followed by e.g. whitespace, the RE > > pattern should match correctly. > > > > So, something like "(entre|ent|ent\.)\s+" should actually always match > > the right alternative, whereas "(entre|ent|ent\.)\s*" wouldn't, it > > would take just "ent" from "ent.". I.e. actually the ru/fr/es > > alternative groups wouldn't have needed changing, in all of them the > > alternatives group is followed by at leat one whitespace in the > > pattern. > > But since they are already changed now, it seems that this does not > hurt either, so we can leave things as they are. Right? Sure. :-) > > When there's the right time to pretty up things, I'd like to see that: > > - Things which are used outside the class, do not start with > > underscore. I notice that I had used _bce_str in Date_fi, that would be > > nice to clean up in addition to _*_str stuff used by de/fr/nl Date > > handlers. This is an API for translators... > > This is a tough call though. The DateParserXX classes are derived > from DateParser, so in a way self._*_str is not used outside the class. > This is because self._*_str is a class attribtue of DateParserXX > through the inheritance. > > But we can come up with any conventions we want :-) > > The formal API is kinda hard to do, because some date handlers need > to rewrite more of the DateParser methods than others. For example, > Russian handler only redefines the constants, but French and Dutch > write their own parsing methods. > > So it's not always clear in advance what is a DateParser private > method versus what should be exposed and formalized as an API. > If you have time and inclination to work this out, the good time > would start right after 2.0.11 is released. Should be in about a > week from now. Well, the other things that are overridden by locale specific date handlers are arrays without underscore prefix, so I think the *_str things should also be without underscore. They could also be documented, currently there are only 9 locale specific date handlers, but translations for 20 languages. - Eero |