Thread: [TCLCORE] CFV: TIP 726

This is a CFV for TIP 726: Commands for Unicode normalization
<https://core.tcl-lang.org/tips/doc/trunk/tip/726.md>

CFV ends 2025-08-29 00:00 UTC.

My vote: Yes

/Ashok

On Tue, 19 Aug 2025 at 04:19, Ashok wrote:
> This is a CFV for TIP 726: Commands for Unicode normalization

My vote.

TIP #726   YES

On my wish-list: "string tolower -locale <locale> <string>".   Like
the C++ std::tolower(). But that would be a separate TIP
(I don't even know if utf8proc can do that).  ;-)

Hope this helps,
        Jan Nijtmans

On Tue, 19 Aug 2025 at 10:49, Jan Nijtmans wrote:
> My vote.
>
> TIP #726   YES

Some more minor review remarks:
1)  UNICODE_OUT_OF_RANGE can be completely eliminated, because
     utf8proc_category() already checks if its parameter is out of range.
2)  Right-shifts are more efficient here than left-shifts: It saves a
     '!='-operator in the implementation. The original code did
     that already. You removed a comment saying that :-)

See:
      https://core.tcl-lang.org/tcl/info/453ab178cb1751b7

Regards,
       Jan Nijtmans

Jan,

I've moved your changes to the tip-726-jan branch for the reasons below. I do not plan to merge them as part of TIP 726.

Regarding (1) - 

As a matter of principle, I prefer to check validity of data before being passed to external libraries (I treat utf8proc as such). When evaluating various libraries, some, like the very fast SIMD based library, expect valid data, others don't, and some differ in their treatment depending on function being called (including utf8proc). I don't think we want to be in the business of revisiting the Tcl implementation on a library change. 

But aside from those general principles, there are specific 9.0 compatibility issues with your implementation that I see looking at the code. For example, 9.0 Tcl_UniCharToUpper (and similar) map 0xFFFFFF to 0x1FFFFF while your code seemingly maps it to 0xFFFFFF. On a tangential note, utf8proc definitions of character classifications are not exactly the same as Tcl's historical classifications (including 9.0). I happen to think utf8proc is more appropriate and in line with the Unicode standard but as stated before, TIP 726 strives for full compatibility with Tcl 9.0. Any changes to character classification would be a compatibility change (no matter how small) and require a separate TIP.
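For readers following along, a minimal sketch of the "check before calling the library" approach described above. ExampleLibToUpper() is a hypothetical stand-in (ASCII only), not utf8proc's API, and EXAMPLE_LIMIT is an assumed bound; the only behaviour taken from the thread is that 0xFFFFFF comes back as 0x1FFFFF.

```c
#include <assert.h>

/* Assumed bound: one past the last valid code point, U+10FFFF. */
#define EXAMPLE_LIMIT 0x110000

/* Hypothetical stand-in for an external library call. It handles ASCII
 * only and does not represent utf8proc's actual interface. */
static int ExampleLibToUpper(int ch)
{
    return (ch >= 'a' && ch <= 'z') ? ch - ('a' - 'A') : ch;
}

/* Clear the bits above 0x1FFFFF, then call the library only when the
 * remaining value is a valid code point. Out-of-range values are
 * returned (masked) without ever reaching the library. */
static int GuardedToUpper(int ch)
{
    int cp = ch & 0x1FFFFF;

    if (cp < EXAMPLE_LIMIT) {
        cp = ExampleLibToUpper(cp);
    }
    return cp;
}
```

With this shape, a change of library does not change what happens to invalid input, because the library never sees it.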

Regarding (2) - 

It is not clear what you mean by more efficient when the compiled instruction stream is identical at least for clang x64 and gcc arm64. See https://godbolt.org/z/8afa54eh3.  I wrote in the style I did because (in my mind of course) it was more natural and clearer to look up the bit in the mask than shifting the mask to look at the bit. I verified with only two compilers and architectures but see no reason to claim either shift would be more efficient. Compilers are smarter than me.
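To make the two styles under discussion concrete, here is a small sketch, not the code from tclUtf.c. EXAMPLE_MASK and the function names are hypothetical; the point is only that both expressions test the same bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical category mask: bits 1, 2 and 5 are set. */
#define EXAMPLE_MASK ((uint32_t)((1u << 1) | (1u << 2) | (1u << 5)))

/* Left-shift form: build a one-bit probe and look it up in the mask.
 * This needs the explicit "!= 0" to produce a 0/1 result. */
static int InMaskLeftShift(unsigned category)
{
    return (EXAMPLE_MASK & ((uint32_t)1 << category)) != 0;
}

/* Right-shift form: shift the mask so the wanted bit lands in
 * position 0, then keep only that bit. */
static int InMaskRightShift(unsigned category)
{
    return (int)((EXAMPLE_MASK >> category) & 1u);
}

/* Returns 1 when both forms agree for every category in 0..31. */
static int FormsAgree(void)
{
    for (unsigned c = 0; c < 32; c++) {
        if (InMaskLeftShift(c) != InMaskRightShift(c)) {
            return 0;
        }
    }
    return 1;
}
```

Both functions return the same value for every category below 32, so the choice between them is one of readability and of what a given compiler emits at a given optimization level.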

As always, thanks a lot for reviewing, but with a (mild) request to make future proposed changes to TIPs on separate branches once the vote is done and the branch is about to be merged.

/Ashok

On Sat, 30 Aug 2025 at 06:43, Ashok wrote:
> Regarding (1) -

I'll come back later on this one.

> Regarding (2) -
>
> It is not clear what you mean by more efficient when the compiled instruction stream is identical at least for clang x64 and gcc arm64. See https://godbolt.org/z/8afa54eh3.  I wrote in the style I did because (in my mind of course) it was more natural and clearer to look up the bit in the mask than shifting the mask to look at the bit. I verified with only two compilers and architectures but see no reason to claim either shift would be more efficient. Compilers are smarter than me.

More efficient means more efficient, so fewer instructions. Try the
same experiment without the "-O2", and you will see that the
non-optimized version is longer than the optimized one. The optimizer
is smart enough to realize that a right-shift is better here, so it
changes the left-shift to a right-shift. The original author of the
code in tclUtf.c was smart enough to realize this, and I realize it
too. We don't want less efficient code in debug mode (in which the
optimizer is disabled).

I object to making such a change in existing code, without
realizing what was behind it. You could have asked before.

Sorry,
       Jan Nijtmans

> We don't want less efficient code in debug mode (in which the optimizer is disabled).

So if I understand you correctly, the generated code in a release build is identical but your objection is that the debug builds are not?

This is just absurd. Sorry.

Turning off optimization generates so much rubbish code (from an efficiency point of view) that a couple of instructions are completely immaterial. Worrying about speed in a non-optimized build is something I have never ever heard of and that too in interpreters that are never really speed demons in any case.

Being smart is not about bit twiddling any more. That time passed with the compilers that arrived about the turn of the century or before.

Do as you please now that the code has been merged into the trunk. Revert the shifts, or not, whatever. Not a productive use of my time to go back and forth on this.

/Ashok

And to one more point (yeah, I know I said this is not a productive use of my time!) ...

> I object to making such a change in existing code,

I didn’t just randomly make changes in existing code. I was modifying those lines to call the utf8proc functions and used the expression forms that I saw as natural.

Now really ‘nuff said from my side.

/Ashok

On Sat, 30 Aug 2025 at 06:43, Ashok wrote:
> Regarding (1) -
>
> As a matter of principle, I prefer to check validity of data before being passed to external libraries (I treat utf8proc as such). When evaluating various libraries, some, like the very fast SIMD based library, expect valid data, others don't, and some differ in their treatment depending on function being called (including utf8proc). I don't think we want to be in the business of revisiting the Tcl implementation on a library change.
>
> But aside from those general principles, there are specific 9.0 compatibility issues with your implementation that I see looking at the code. For example, 9.0 Tcl_UniCharToUpper (and similar) map 0xFFFFFF to 0x1FFFFF while your code seemingly maps it to 0xFFFFFF. On a tangential note, utf8proc definitions of character classifications are not exactly the same as Tcl's historical classifications (including 9.0). I happen to think utf8proc is more appropriate and in line with the Unicode standard but as stated before, TIP 726 strives for full compatibility with Tcl 9.0. Any changes to character classification would be a compatibility change (no matter how small) and require a separate TIP.

You have a point here. The Tcl core only calls those
Tcl_UniCharToXXX() functions with values <= 0x10FFFF
(since that's the maximum number that Tcl_UtfToUniChar() can produce).
There are 3 ranges to consider more closely:
  1) 0xD800 - 0xDFFF. Since Tcl 8.6 outputs the same value as input,
     for compatibility we want 9.1 to do the same. It does.
  2) 0x110000 - 0x1FFFFF. Personally, I don't mind much what
     Tcl_UniCharToXXX() does for values in this range. But since 9.0
     outputs the same value as input, it makes sense to keep doing
     this in 9.1. If utf8proc does something different (like
     returning -1), yes, we should do a range check.
  3) Values above 0x1FFFFF. There are currently no test cases for
     that (those should be added, but that's not TIP #726's fault).

Just one more remark for now. In tclUtf.c, I see:
    #define UNICODE_OUT_OF_RANGE(ch) (((ch) & 0x1FFFFF) >= 0x323C0)
Why 0x323C0? In Tcl 8.6 and 9.0, this number was generated from
the UnicodeData.txt file. It was simply the last character present
in the table. It is dangerous to keep this number: What if a future Unicode
version has more characters than that in the 3rd plane (or adds a 4th plane)?

So I suggest changing this value to 0x110000 (or 0x40000, with the
remark that this should be increased if the 4th plane gets any characters)

Hope this helps,
    Jan Nijtmans

Not looked in detail. Rushing to finish off a few things before I leave for a couple of weeks on unexpected travel with sporadic connectivity, but ...

I agree with changing 0x323c0 -> 0x110000. I had noticed that but did not know the reason for picking the last assigned character as opposed to last valid code point so left it alone.

Regarding invalid code points, I do not have strong opinions and would not object to any changes. As Harald commented in one of the tickets Tcl does not check for validity for strings passed through the C API and it should be up to the application or extension to ensure only valid data comes in (I think we already do this at the script level except possibly for surrogates). Tcl currently may interpret these as Cp1252, replace with U+FFFD, or leave as is depending on the API and specific code point. Garbage in, garbage out... I would prefer Tcl be consistent in handling but other than that any changes should be viewed as something applications should not have relied on anyways (invalid data is undefined behavior).

On 19.08.2025 at 04:19, apnmbx-public--- via Tcl-Core wrote:
> This is a CFV for TIP 726: Commands for Unicode normalization
> <https://core.tcl-lang.org/tips/doc/trunk/tip/726.md>
>
> CFV ends 2025-08-29 00:00 UTC.

Yes!

Thanks for all,
Harald

TIP 726: YES

On Aug 18, 2025, at 10:19 PM, apnmbx-public--- via Tcl-Core <tcl...@li...> wrote:
> This is a CFV for TIP 726: Commands for Unicode normalization
> <https://core.tcl-lang.org/tips/doc/trunk/tip/726.md>
>
> CFV ends 2025-08-29 00:00 UTC.
>
> My vote: Yes
>
> /Ashok
> _______________________________________________
> Tcl-Core mailing list
> Tcl...@li...
> https://lists.sourceforge.net/lists/listinfo/tcl-core

TIP #726: YES

- Marc

On Mon, Aug 18, 2025 at 9:19 PM apnmbx-public--- via Tcl-Core
<tcl...@li...> wrote:

> This is a CFV for TIP 726: Commands for Unicode normalization
> <https://core.tcl-lang.org/tips/doc/trunk/tip/726.md>
>
> CFV ends 2025-08-29 00:00 UTC.
>
> My vote: Yes
>
> /Ashok

TIP #726:  YES

-- Steve

On 19 Aug 2025 at 10:20 AM +0800, apnmbx-public--- via Tcl-Core <tcl...@li...> wrote:
> This is a CFV for TIP 726: Commands for Unicode normalization
>
> CFV ends 2025-08-29 00:00 UTC.
>
> My vote: Yes
>
> /Ashok

apnmbx-public--- writes:
> This is a CFV for TIP 726: Commands for Unicode normalization
> <https://core.tcl-lang.org/tips/doc/trunk/tip/726.md>

My vote: Yes

Basically.

rolf
