From: SourceForge.net <no...@so...> - 2010-12-06 13:00:28
Bugs item #3085863, was opened at 2010-10-12 13:52
Message generated for change (Comment added) made by nijtmans
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3085863&group_id=10894

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: 44. UTF-8 Strings
Group: development: 8.6b1.1
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Lars Hellström (lars_h)
Assigned to: Jan Nijtmans (nijtmans)
Summary: tclUniData 9 years old

Initial Comment:
It seems the tables in tclUniData.c (i.e., which class the various characters belong to) are based on a UnicodeData.txt file that is at least 9 years old, since they have been unchanged for at least that long. This means, for example, that \u0220 (LATIN CAPITAL LETTER N WITH LONG RIGHT LEG) is not considered alphabetic by Tcl:

    % string is alpha \u0220
    0

despite it being listed as having class Lu in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.

A possible factor could be that tools/uniParse.tcl states its input should be the file ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt, which doesn't exist anymore; there is only a UnicodeData.txt, which presumably serves the same purpose. Another factor could be worry that updating these tables would reopen the "Unicode beyond \uFFFF" can of worms, but the CVS comment for tclUniData.c v1.4 says it was generated from the UnicodeData.txt for Unicode 3.1.0, and (if I recall correctly) that is precisely the first version that added non-BMP characters, so we have already dodged that particular bullet.

----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2010-12-06 14:00

Message:
It seems everything is OK, so closing.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2010-10-24 09:25

Message:
Probably not worth your time. 8.4 is probably EOLed now unless someone finds something seriously wrong with it (a security hole or a crash bug).

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-23 09:34

Message:
Added more tests, and backported to 8.5. Any interest in 8.4 too?

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-15 17:28

Message:
Checked in on HEAD. Left open because:
- More tests should be added.
- Backport to 8.5/8.4?

But before doing that, I would like to receive more feedback that this is really OK and does not introduce any problems. I cannot find anything wrong, but 9 years of changes is a long time...

----------------------------------------------------------------------

Comment By: Lars Hellström (lars_h)
Date: 2010-10-14 14:12

Message:
Changing uni::shift to 6 (it is currently 5, but for some reason a 6 is hardcoded in what uniParse.tcl writes to stdout) lets you get by with 208 groups. Each +1 increase in this doubles the length of the groupMap, though, so perhaps it is cheaper to let the entries in the pageMap swell to 16 bits.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-13 17:17

Message:
After this patch:

    % string is alpha \u0220
    1

So it looks like it works!

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-13 17:07

Message:
New attempt. I didn't try your code (which is obviously the right way); I just stripped the out-of-range characters from UnicodeData.txt before running the tools. Here is the result, a new patch. The table has some 2048 elements now, so that looks more reasonable, but still more than 256. Is this better?
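[Editor's note: the pageMap/groupMap/OFFSET_BITS scheme discussed above can be illustrated in C. This is a minimal sketch with made-up toy tables, not the real tclUniData.c data; it shows why the BMP needs 0x10000 >> OFFSET_BITS pageMap entries (2048 at shift 5) and why each +1 of the shift halves pageMap while doubling the size of every page block in groupMap.]

    #include <assert.h>
    #include <stdio.h>

    /* Two-stage lookup sketch: the codepoint's high bits index pageMap,
     * which selects a block of 1 << OFFSET_BITS entries in groupMap;
     * the low bits pick the entry inside that block.
     * OFFSET_BITS corresponds to uni::shift in uniParse.tcl. */
    #define OFFSET_BITS 5
    #define PAGE_SIZE   (1 << OFFSET_BITS)

    /* Hypothetical tables covering codepoints 0..63 only:
     * page 0 marks everything "alphabetic", page 1 marks nothing. */
    static const unsigned short pageMap[2] = { 0, 1 };
    static unsigned char groupMap[2 * PAGE_SIZE];

    static int IsAlphaSketch(int ch) {
        int page = pageMap[ch >> OFFSET_BITS];
        return groupMap[(page << OFFSET_BITS) | (ch & (PAGE_SIZE - 1))];
    }

    int main(void) {
        int i;
        for (i = 0; i < PAGE_SIZE; i++) {
            groupMap[i] = 1;              /* page 0: alphabetic   */
            groupMap[PAGE_SIZE + i] = 0;  /* page 1: not          */
        }
        printf("pageMap entries for the BMP at shift %d: %d\n",
               OFFSET_BITS, 0x10000 >> OFFSET_BITS);
        assert(IsAlphaSketch(0x07) == 1); /* lands in page 0 */
        assert(IsAlphaSketch(0x27) == 0); /* lands in page 1 */
        printf("ok\n");
        return 0;
    }

With more than 256 distinct pages, the pageMap entries no longer fit in an unsigned char, which is the widening Jan describes; raising the shift to 6 instead keeps the page count under 256 at the cost of 64-entry page blocks, which is Lars's trade-off.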
----------------------------------------------------------------------

Comment By: Lars Hellström (lars_h)
Date: 2010-10-13 14:20

Message:
There is something slightly odd here. The current pageMap vector is indexed by an 11-bit integer (11 = 16 - OFFSET_BITS), meaning 2048 distinct elements can be accessed, but the vector has 5886 elements! Presumably it also contains data for non-BMP characters :-), even though Tcl can't access it :-(.

Looking at uniParse.tcl, it indeed has no provision for ignoring data when the codepoint is out of range. And what is perhaps worse: it only uses the first four hex digits as the codepoint (the index variable). That's a bug! (So the data for the inaccessible characters was probably wrong anyway.) Suggested fix: change the lines

    scan [lindex $items 0] %4x index
    set index [format 0x%0.4x $index]

to

    scan [lindex $items 0] %x index
    if {$index > 0xFFFF} then {
        # Ignore non-BMP characters, as long as Tcl doesn't support them
        continue
    }
    set index [format 0x%0.4x $index]

I wouldn't be surprised if that solves the more-than-256-groups problem as well.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2010-10-12 17:37

Message:
Jeff, I did what Lars suggested, re-generating the necessary files with the latest UnicodeData.txt. The only 'real' thing I had to change is the type of the static variable pageMap, from unsigned char to unsigned short, because there are more than 256 maps now. Here is the patch (after some more manual changes to regc_locale.c). All tests seem to run fine. Please evaluate: is there anything I am missing?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3085863&group_id=10894
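[Editor's note: the %4x truncation bug Lars diagnoses can be reproduced outside Tcl, since C's sscanf treats a field width in %4x the same way Tcl's scan does: it consumes at most four hex digits. The sample line below is abridged, not a literal UnicodeData.txt record.]

    #include <assert.h>
    #include <stdio.h>

    /* A five-digit non-BMP codepoint such as U+1D400 is silently
     * truncated to its first four digits by a %4x conversion, so
     * uniParse.tcl filed its data under the wrong BMP codepoint. */
    int main(void) {
        unsigned truncated, full;
        const char *line = "1D400;MATHEMATICAL BOLD CAPITAL A;Lu";

        sscanf(line, "%4x", &truncated); /* stops after "1D40"      */
        sscanf(line, "%x", &full);       /* reads all five digits   */

        printf("%%4x -> 0x%X, %%x -> 0x%X\n", truncated, full);
        assert(truncated == 0x1D40);
        assert(full == 0x1D400);
        return 0;
    }

This is why the fix above switches to a plain %x scan and then explicitly skips codepoints above 0xFFFF instead of letting them be misread as BMP entries.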