From: SourceForge.net <no...@so...> - 2010-12-06 13:00:28
Bugs item #3085863, was opened at 2010-10-12 13:52
Message generated for change (Comment added) made by nijtmans
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3085863&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 44. UTF-8 Strings
Group: development: 8.6b1.1
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Lars Hellström (lars_h)
Assigned to: Jan Nijtmans (nijtmans)
Summary: tclUniData 9 years old

Initial Comment:
It seems the tables in tclUniData.c (i.e., which class the various characters belong to) are based on a UnicodeData.txt file that is at least 9 years old, since they have been unchanged for at least that long. This means, for example, that \u0220 (LATIN CAPITAL LETTER N WITH LONG RIGHT LEG) is not considered alphabetic by Tcl:

    % string is alpha \u0220
    0

despite it being listed as having class Lu in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.

A possible factor could be that tools/uniParse.tcl states its input should be the file ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt, which doesn't exist anymore; there is only a UnicodeData.txt, which presumably serves the same purpose. Another factor could be worry that updating these tables would reopen the "Unicode beyond \uFFFF" can of worms, but the CVS comment for tclUniData.c v1.4 says it was generated from the UnicodeData.txt for Unicode 3.1.0, and (if I recall correctly) that is precisely the first version that added non-BMP characters, so we have already dodged that particular bullet.

----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2010-12-06 14:00

Message:
It seems everything is OK, so closing.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2010-10-24 09:25

Message:
Probably not worth your time. 8.4 is probably EOLed now unless someone finds something seriously wrong with it (a security hole or a crash bug).

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-23 09:34

Message:
Added more tests, and backported to 8.5. Any interest in 8.4 too?

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-15 17:28

Message:
Checked in on HEAD. Left open because:
- More tests should be added.
- Backport to 8.5/8.4?

But before doing that, I would like to receive more feedback that this is really OK and does not introduce any problems. I cannot find anything wrong, but 9 years of changes is a long time...

----------------------------------------------------------------------

Comment By: Lars Hellström (lars_h)
Date: 2010-10-14 14:12

Message:
Changing uni::shift to 6 (it is currently 5, but for some reason a 6 is hardcoded in what uniParse.tcl writes to stdout) lets you get by with 208 groups. Each +1 increase in this doubles the length of the groupMap, though, so perhaps it is cheaper to let the entries in the pageMap swell to 16 bits.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-13 17:17

Message:
After this patch:

    % string is alpha \u0220
    1

So it looks like it works!

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-13 17:07

Message:
New attempt. I didn't try your code (which is obviously the right way); I just stripped the out-of-range characters from UnicodeData.txt before running the tools. Here is the result, a new patch. The table has some 2048 elements now, so that looks more reasonable, but still more than 256. Is this better?
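[Editor's note: the pageMap/groupMap/OFFSET_BITS scheme discussed above can be illustrated in C. This is a minimal sketch with made-up toy tables, not the real tclUniData.c data; it shows why the BMP needs 0x10000 >> OFFSET_BITS pageMap entries (2048 at shift 5) and why each +1 of the shift halves pageMap while doubling the size of every page block in groupMap.]

    #include <assert.h>
    #include <stdio.h>

    /* Two-stage lookup sketch: the codepoint's high bits index pageMap,
     * which selects a block of 1 << OFFSET_BITS entries in groupMap;
     * the low bits pick the entry inside that block.
     * OFFSET_BITS corresponds to uni::shift in uniParse.tcl. */
    #define OFFSET_BITS 5
    #define PAGE_SIZE   (1 << OFFSET_BITS)

    /* Hypothetical tables covering codepoints 0..63 only:
     * page 0 marks everything "alphabetic", page 1 marks nothing. */
    static const unsigned short pageMap[2] = { 0, 1 };
    static unsigned char groupMap[2 * PAGE_SIZE];

    static int IsAlphaSketch(int ch) {
        int page = pageMap[ch >> OFFSET_BITS];
        return groupMap[(page << OFFSET_BITS) | (ch & (PAGE_SIZE - 1))];
    }

    int main(void) {
        int i;
        for (i = 0; i < PAGE_SIZE; i++) {
            groupMap[i] = 1;              /* page 0: alphabetic   */
            groupMap[PAGE_SIZE + i] = 0;  /* page 1: not          */
        }
        printf("pageMap entries for the BMP at shift %d: %d\n",
               OFFSET_BITS, 0x10000 >> OFFSET_BITS);
        assert(IsAlphaSketch(0x07) == 1); /* lands in page 0 */
        assert(IsAlphaSketch(0x27) == 0); /* lands in page 1 */
        printf("ok\n");
        return 0;
    }

With more than 256 distinct pages, the pageMap entries no longer fit in an unsigned char, which is the widening Jan describes; raising the shift to 6 instead keeps the page count under 256 at the cost of 64-entry page blocks, which is Lars's trade-off.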
----------------------------------------------------------------------

Comment By: Lars Hellström (lars_h)
Date: 2010-10-13 14:20

Message:
There is something slightly odd here. The current pageMap vector is indexed by an 11-bit integer (11 = 16 - OFFSET_BITS), meaning 2048 distinct elements can be accessed, but the vector has 5886 elements! Presumably it also contains data for non-BMP characters :-), even though Tcl can't access it :-(.

Looking at uniParse.tcl, it indeed has no provision for ignoring data when the codepoint is out of range. And what is perhaps worse: it only uses the first four hex digits as the codepoint (the index variable). That's a bug! (So the data for the inaccessible characters was probably wrong anyway.) Suggested fix: change the lines

    scan [lindex $items 0] %4x index
    set index [format 0x%0.4x $index]

to

    scan [lindex $items 0] %x index
    if {$index > 0xFFFF} then {
        # Ignore non-BMP characters, as long as Tcl doesn't support them
        continue
    }
    set index [format 0x%0.4x $index]

I wouldn't be surprised if that solves the more-than-256-groups problem as well.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-12 17:37

Message:
Jeff, I did what Lars suggested, re-generating the necessary files with the latest UnicodeData.txt. The only 'real' thing I had to change is the type of the static variable pageMap, from unsigned char to unsigned short, because there are more than 256 maps now. Here is the patch (after some more manual changes to regc_locale.c). All tests seem to run fine. Please evaluate: is there anything I am missing?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3085863&group_id=10894
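[Editor's note: the %4x truncation bug Lars diagnoses can be reproduced outside Tcl, since C's sscanf treats a field width in %4x the same way Tcl's scan does: it consumes at most four hex digits. The sample line below is abridged, not a literal UnicodeData.txt record.]

    #include <assert.h>
    #include <stdio.h>

    /* A five-digit non-BMP codepoint such as U+1D400 is silently
     * truncated to its first four digits by a %4x conversion, so
     * uniParse.tcl filed its data under the wrong BMP codepoint. */
    int main(void) {
        unsigned truncated, full;
        const char *line = "1D400;MATHEMATICAL BOLD CAPITAL A;Lu";

        sscanf(line, "%4x", &truncated); /* stops after "1D40"      */
        sscanf(line, "%x", &full);       /* reads all five digits   */

        printf("%%4x -> 0x%X, %%x -> 0x%X\n", truncated, full);
        assert(truncated == 0x1D40);
        assert(full == 0x1D400);
        return 0;
    }

This is why the fix above switches to a plain %x scan and then explicitly skips codepoints above 0xFFFF instead of letting them be misread as BMP entries.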