This patch introduces a Tcl_UtfNcasecmp() function and
uses it as a replacement for strcasecmp in [lsearch],
[lsort], and [switch], when the -nocase option is given.
"make genstubs" is needed after applying the patch.
A TIP for this still has to be written.
Patch to add Tcl_UtfNcasecmp
Logged In: YES
user_id=79902
We could try doing this in two stages.
Stage #1: Define a TclUtfNCaseCmp function (or something
similar) that is *not* in any stubs table, but which is used
to fix the guts of the various Tcl commands listed. This
part can be done now.
Stage #2: Change the name of the function to something
exported and put it in the stubs table. This is the stage
that needs documentation work and a TIP.
Logged In: YES
user_id=124643
Whoops, there is a typo in my initial comment:
Tcl_UtfNcasecmp() already exists, and my patch adds a
Tcl_Utfcasecmp() function (without "N").
Patch to add Tcl_UtfCasecmp
Logged In: YES
user_id=124643
I am uploading a revised version of the patch that changes
the name of the new API to Tcl_UtfCasecmp and removes a
misleading comment that was a copy-n-paste oversight.
Meanwhile I timed [lsort -nocase] and found it to take about
twice as long after adding this patch.
Logged In: YES
user_id=80530
the new test cmdIL-5.6 has values
I assume are in the utf-8 encoding.
Since the *.test files get [source]d
in [encoding system], that's not
really portable. Would be better to
use \uHHHH backslash substitution
to construct the test strings and results.
Logged In: YES
user_id=80530
otherwise this looks ok.
I agree with dkf that a
two-stage implementation
is the way to go. Get the fix
in immediately. Make it public
as approved.
Logged In: YES
user_id=113501
Why does [lsort -nocase] take twice as long with this patch?
Logged In: YES
user_id=79902
'cos usually strcasecmp() assumes one-byte-per-char and
probably ISO8859-* so case conversion can be done very
quickly. The updated version has to de-UTF8 each character
before case conversion.
Logged In: YES
user_id=124643
Look at the implementation of Tcl_UtfCasecmp. It has to do
much more work than strcasecmp does to give correct results
for non-ASCII characters.
Logged In: YES
user_id=124643
The new version of the patch does special-casing for ASCII
characters. It makes [lsearch] only slightly slower on lists
of pure ASCII strings, and also improves performance for
strings that contain a mix of ASCII and other characters.
It also renames follows dkf's advice and makes the function
private so that it can be applied without a TIP.
Version 3 of the patch
Logged In: YES
user_id=80530
It looks like the latest patch
doesn't limit its case-equivalencies
to alphabetic characters.
For example, the characters
'[' (\x5b) and '{' (\x7b) get treated
as -nocase equivalents because
they are the same with the 0x20
bit forced set.
Logged In: YES
user_id=80530
demo session:
% lsort -nocase {@ `}
@ `
% lsort -nocase {` @}
` @
Logged In: YES
user_id=124643
OK, so this time it should really be correct, as I have
verified that TclUtfCasecmp produces exactly the same
results as strcasecmp for any combination of characters
between 0x01 and 0x7f, inclusive. I've also added dgp's demo
to the test suite.
I haven't yet timed this version, as I wrote it at home and
the code I used for the performance comparision is at work.
I'll do that tomorrow morning and report the results here.
Version 4 of the patch
Logged In: YES
user_id=80530
Here's an alternative patch.
Please check its performance
against the earlier patch.
Logged In: YES
user_id=124643
This is the fastest version I've been able to come up with.
The patch now adds the same ASCII optimisations to
Tcl_UtfNcmp and Tcl_UtfNcasecmp.
Logged In: YES
user_id=68433
Originator: NO
See also #1902648. This is a portability problem as well as a correctness problem.