From: <apn...@ya...> - 2025-09-01 05:33:43
Not looked in detail. Rushing to finish off a few things before I leave for a couple of weeks on unexpected travel with sporadic connectivity, but ...
I agree with changing 0x323c0 -> 0x110000. I had noticed that but did not know the reason for picking the last assigned character as opposed to last valid code point so left it alone.
Regarding invalid code points, I do not have strong opinions and would not object to any changes. As Harald commented in one of the tickets, Tcl does not check the validity of strings passed through the C API, and it should be up to the application or extension to ensure only valid data comes in (I think we already do this at the script level, except possibly for surrogates). Depending on the API and the specific code point, Tcl currently may interpret these as Cp1252, replace them with U+FFFD, or leave them as is. Garbage in, garbage out... I would prefer that Tcl be consistent in its handling, but other than that, any changes should be viewed as affecting something applications should not have relied on anyway (invalid data is undefined behavior).
-----Original Message-----
From: Jan Nijtmans <jan...@gm...>
Sent: Sunday, August 31, 2025 3:11 PM
To: apn...@ya...
Cc: Tcl Core List <tcl...@li...>
Subject: Re: [TCLCORE] CFV: TIP 726
On Sat, 30 Aug 2025 at 06:43, Ashok wrote:
> Regarding (1) -
>
> As a matter of principle, I prefer to check validity of data before being passed to external libraries (I treat utf8proc as such). When evaluating various libraries, some, like the very fast SIMD based library, expect valid data, others don't, and some differ in their treatment depending on function being called (including utf8proc). I don't think we want to be in the business of revisiting the Tcl implementation on a library change.
>
> But aside from those general principles, there are specific 9.0 compatibility issues with your implementation that I see looking at the code. For example, 9.0 Tcl_UniCharToUpper (and similar) map 0xFFFFFF to 0x1FFFFF while your code seemingly maps it to 0xFFFFFF. On a tangential note, utf8proc definitions of character classifications are not exactly the same as Tcl's historical classifications (including 9.0). I happen to think utf8proc is more appropriate and in line with the Unicode standard but as stated before, TIP 726 strives for full compatibility with Tcl 9.0. Any changes to character classification would be a compatibility change (no matter how small) and require a separate TIP.
You have a point here. The Tcl core only calls those
Tcl_UniCharToXXX() functions with values <= 0x10FFFF
(since that's the maximum value that Tcl_UtfToUniChar() can produce).
There are 3 ranges to consider more closely:
1) 0xD800 - 0xDFFF. Since Tcl 8.6 outputs the same value as input,
   for compatibility we want 9.1 to do the same. It does.
2) 0x110000 - 0x1FFFFF. Personally, I don't mind much what
   Tcl_UniCharToXXX() does for values in this range. But since 9.0
   outputs the same value as input, it makes sense to keep doing this
   in 9.1. If utf8proc does something different (like returning -1),
   then yes, we should do a range check.
3) Values above 0x1FFFFF. There are currently no testcases for that
   (those should be added, but that's not TIP #726's fault).
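To make the three ranges concrete, here is a minimal sketch of a range check placed in front of an external case-mapping routine. Everything named Sketch... is hypothetical: SketchLibToUpper() is an ASCII-only stand-in that returns -1 on invalid input (it is not utf8proc's API), and the 21-bit mask is an assumption taken from the 0xFFFFFF -> 0x1FFFFF mapping reported for 9.0 earlier in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for an external case-mapping routine.
 * It only knows ASCII and returns -1 for input it considers invalid. */
static int32_t SketchLibToUpper(int32_t ch)
{
    if (ch < 0 || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF)) {
        return -1;
    }
    if (ch >= 'a' && ch <= 'z') {
        return ch - ('a' - 'A');
    }
    return ch;
}

/* Hypothetical wrapper: never hands surrogates or out-of-range values
 * to the library, so its treatment of invalid input does not matter. */
static int SketchUniCharToUpper(int ch)
{
    /* Assumption: keep only the low 21 bits (range 3). */
    unsigned int u = (unsigned int) ch & 0x1FFFFFu;

    /* Range 1 (surrogates) and range 2 (0x110000 - 0x1FFFFF):
     * return the value unchanged. */
    if (u >= 0x110000u || (u >= 0xD800u && u <= 0xDFFFu)) {
        return (int) u;
    }
    return (int) SketchLibToUpper((int32_t) u);
}
```

With this shape, swapping the library later would not change what the wrapper returns for any of the three ranges, since none of those values ever reach it.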
Just one more remark for now. In tclUtf.c, I see:
#define UNICODE_OUT_OF_RANGE(ch) (((ch) & 0x1FFFFF) >= 0x323C0)
Why 0x323C0? In Tcl 8.6 and 9.0, this number was generated from
the UnicodeData.txt file. It was simply the last character present
in the table. It is dangerous to keep this number: what if a future Unicode
version has more characters than that in the 3rd plane (or adds a 4th plane)?
So I suggest changing this value to 0x110000 (or 0x40000, with the
remark that this should be increased if the 4th plane gets any characters).
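As a rough illustration of the difference, the sketch below puts the two bounds side by side. The SKETCH_... macros and SketchBoundsDisagree() are hypothetical names used only here; they copy the shape of the macro quoted above and are evaluated in isolation, without whatever surrounding logic tclUtf.c applies.

```c
/* Bound quoted above: derived from the last entry in UnicodeData.txt. */
#define SKETCH_OUT_OF_RANGE_TABLE(ch) (((ch) & 0x1FFFFF) >= 0x323C0)

/* Suggested bound: one past the last valid code point, U+10FFFF. */
#define SKETCH_OUT_OF_RANGE_FIXED(ch) (((ch) & 0x1FFFFF) >= 0x110000)

/* Hypothetical helper: returns 1 when the two bounds classify ch
 * differently, 0 when they agree. Expects a non-negative ch. */
static int SketchBoundsDisagree(int ch)
{
    return SKETCH_OUT_OF_RANGE_TABLE(ch) != SKETCH_OUT_OF_RANGE_FIXED(ch);
}
```

For example, SketchBoundsDisagree(0x40000) returns 1: the table-derived bound treats that value as out of range, while the fixed bound does not, and the fixed bound would not need updating when a new Unicode version assigns characters there.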
Hope this helps,
Jan Nijtmans