make -nocase behave correct with non-ASCII characters

The Tool Command Language implementation

Brought to you by: andreas_kupries, apnadkarni, bgriffin, das, and 10 others

#422 make -nocase behave correct with non-ASCII characters

Status: open

Owner: Donal K. Fellows

Labels: 44. UTF-8 Strings (7)

Priority: 5

Updated: 2005-07-13

Created: 2005-07-12

Creator: Reinhard Max

Private: No

This patch introduces a Tcl_UtfNcasecmp() function and
uses it as a replacement for strcasecmp in [lsearch],
[lsort], and [switch], when the -nocase option is given.

"make genstubs" is needed after applying the patch.

A TIP for this still has to be written.

Discussion

Reinhard Max - 2005-07-12

Patch to add Tcl_UtfNcasecmp

Tcl_Utfcasecmp.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2005-07-13

Logged In: YES
user_id=79902

We could try doing this in two stages.

Stage #1: Define a TclUtfNCaseCmp function (or something
similar) that is *not* in any stubs table, but which is used
to fix the guts of the various Tcl commands listed. This
part can be done now.

Stage #2: Change the name of the function to something
exported and put it in the stubs table. This is the stage
that needs documentation work and a TIP.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2005-07-13

assigned_to: nijtmans --> dkf
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-13

Logged In: YES
user_id=124643

Whoops, there is a typo in my initial comment:
Tcl_UtfNcasecmp() already exists, and my patch adds a
Tcl_Utfcasecmp() function (without "N").

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-13

Patch to add Tcl_UtfCasecmp

Tcl_Utfcasecmp.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-13

Logged In: YES
user_id=124643

I am uploading a revised version of the patch that changes
the name of the new API to Tcl_UtfCasecmp and removes a
misleading comment that was a copy-n-paste oversight.

Meanwhile I timed [lsort -nocase] and found it to take about
twice as long after adding this patch.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-13

Logged In: YES
user_id=80530

the new test cmdIL-5.6 has values
I assume are in the utf-8 encoding.

Since the *.test files get [source]d
in [encoding system], that's not
really portable. Would be better to
use \uHHHH backslash substitution
to construct the test strings and results.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-13

Logged In: YES
user_id=80530

otherwise this looks ok.

I agree with dkf that a
two-stage implementation
is the way to go. Get the fix
in immediately. Make it public
as approved.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Joe Mistachkin - 2005-07-14

Logged In: YES
user_id=113501

Why does [lsort -nocase] take twice as long with this patch?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2005-07-14

Logged In: YES
user_id=79902

'cos usually strcasecmp() assumes one-byte-per-char and
probably ISO8859-* so case conversion can be done very
quickly. The updated version has to de-UTF8 each character
before case conversion.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-14

Logged In: YES
user_id=124643

Look at the implementation of Tcl_UtfCasecmp. It has to do
much more work than strcasecmp does to give correct results
for non-ASCII characters.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-14

Logged In: YES
user_id=124643

The new version of the patch does special-casing for ASCII
characters. It makes [lsearch] only slightly slower on lists
of pure ASCII strings, and also improves performance for
strings that contain a mix of ASCII and other characters.

It also renames follows dkf's advice and makes the function
private so that it can be applied without a TIP.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-14

Version 3 of the patch

TclUtfcasecmp-3.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-14

Logged In: YES
user_id=80530

It looks like the latest patch
doesn't limit its case-equivalencies
to alphabetic characters.

For example, the characters
'[' (\x5b) and '{' (\x7b) get treated
as -nocase equivalents because
they are the same with the 0x20
bit forced set.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-14

Logged In: YES
user_id=80530

demo session:

% lsort -nocase {@ `}
@ `
% lsort -nocase {` @}
` @

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-14

Logged In: YES
user_id=124643

OK, so this time it should really be correct, as I have
verified that TclUtfCasecmp produces exactly the same
results as strcasecmp for any combination of characters
between 0x01 and 0x7f, inclusive. I've also added dgp's demo
to the test suite.

I haven't yet timed this version, as I wrote it at home and
the code I used for the performance comparision is at work.
I'll do that tomorrow morning and report the results here.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-14

Version 4 of the patch

TclUtfCasecmp-4.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-18

Logged In: YES
user_id=80530

Here's an alternative patch.
Please check its performance
against the earlier patch.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Don Porter - 2005-07-18

1236896.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-18

Logged In: YES
user_id=124643

This is the fastest version I've been able to come up with.
The patch now adds the same ASCII optimisations to
Tcl_UtfNcmp and Tcl_UtfNcasecmp.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Reinhard Max - 2005-07-18

TclUtfCasecmp-10.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Joe English - 2008-02-28

Logged In: YES
user_id=68433
Originator: NO

See also #1902648. This is a portability problem as well as a correctness problem.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link: