Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#422 make -nocase behave correct with non-ASCII characters

open
5
2005-07-13
2005-07-12
Reinhard Max
No

This patch introduces a Tcl_UtfNcasecmp() function and
uses it as a replacement for strcasecmp in [lsearch],
[lsort], and [switch], when the -nocase option is given.

"make genstubs" is needed after applying the patch.

A TIP for this still has to be written.

Discussion

  • Reinhard Max
    Reinhard Max
    2005-07-12

    Patch to add Tcl_UtfNcasecmp

     
    Attachments
  • Logged In: YES
    user_id=79902

    We could try doing this in two stages.

    Stage #1: Define a TclUtfNCaseCmp function (or something
    similar) that is *not* in any stubs table, but which is used
    to fix the guts of the various Tcl commands listed. This
    part can be done now.

    Stage #2: Change the name of the function to something
    exported and put it in the stubs table. This is the stage
    that needs documentation work and a TIP.

     
    • assigned_to: nijtmans --> dkf
     
  • Reinhard Max
    Reinhard Max
    2005-07-13

    Logged In: YES
    user_id=124643

    Whoops, there is a typo in my initial comment:
    Tcl_UtfNcasecmp() already exists, and my patch adds a
    Tcl_Utfcasecmp() function (without "N").

     
  • Reinhard Max
    Reinhard Max
    2005-07-13

    Patch to add Tcl_UtfCasecmp

     
    Attachments
  • Reinhard Max
    Reinhard Max
    2005-07-13

    Logged In: YES
    user_id=124643

    I am uploading a revised version of the patch that changes
    the name of the new API to Tcl_UtfCasecmp and removes a
    misleading comment that was a copy-n-paste oversight.

    Meanwhile I timed [lsort -nocase] and found it to take about
    twice as long after adding this patch.

     
  • Don Porter
    Don Porter
    2005-07-13

    Logged In: YES
    user_id=80530

    the new test cmdIL-5.6 has values
    I assume are in the utf-8 encoding.

    Since the *.test files get [source]d
    in [encoding system], that's not
    really portable. Would be better to
    use \uHHHH backslash substitution
    to construct the test strings and results.

     
  • Don Porter
    Don Porter
    2005-07-13

    Logged In: YES
    user_id=80530

    otherwise this looks ok.

    I agree with dkf that a
    two-stage implementation
    is the way to go. Get the fix
    in immediately. Make it public
    as approved.

     
  • Joe Mistachkin
    Joe Mistachkin
    2005-07-14

    Logged In: YES
    user_id=113501

    Why does [lsort -nocase] take twice as long with this patch?

     
  • Logged In: YES
    user_id=79902

    'cos usually strcasecmp() assumes one-byte-per-char and
    probably ISO8859-* so case conversion can be done very
    quickly. The updated version has to de-UTF8 each character
    before case conversion.

     
  • Reinhard Max
    Reinhard Max
    2005-07-14

    Logged In: YES
    user_id=124643

    Look at the implementation of Tcl_UtfCasecmp. It has to do
    much more work than strcasecmp does to give correct results
    for non-ASCII characters.

     
  • Reinhard Max
    Reinhard Max
    2005-07-14

    Logged In: YES
    user_id=124643

    The new version of the patch does special-casing for ASCII
    characters. It makes [lsearch] only slightly slower on lists
    of pure ASCII strings, and also improves performance for
    strings that contain a mix of ASCII and other characters.

    It also renames follows dkf's advice and makes the function
    private so that it can be applied without a TIP.

     
  • Reinhard Max
    Reinhard Max
    2005-07-14

    Version 3 of the patch

     
    Attachments
  • Don Porter
    Don Porter
    2005-07-14

    Logged In: YES
    user_id=80530

    It looks like the latest patch
    doesn't limit its case-equivalencies
    to alphabetic characters.

    For example, the characters
    '[' (\x5b) and '{' (\x7b) get treated
    as -nocase equivalents because
    they are the same with the 0x20
    bit forced set.

     
  • Don Porter
    Don Porter
    2005-07-14

    Logged In: YES
    user_id=80530

    demo session:

    % lsort -nocase {@ `}
    @ `
    % lsort -nocase {` @}
    ` @

     
  • Reinhard Max
    Reinhard Max
    2005-07-14

    Logged In: YES
    user_id=124643

    OK, so this time it should really be correct, as I have
    verified that TclUtfCasecmp produces exactly the same
    results as strcasecmp for any combination of characters
    between 0x01 and 0x7f, inclusive. I've also added dgp's demo
    to the test suite.

    I haven't yet timed this version, as I wrote it at home and
    the code I used for the performance comparision is at work.
    I'll do that tomorrow morning and report the results here.

     
  • Reinhard Max
    Reinhard Max
    2005-07-14

    Version 4 of the patch

     
    Attachments
  • Don Porter
    Don Porter
    2005-07-18

    Logged In: YES
    user_id=80530

    Here's an alternative patch.
    Please check its performance
    against the earlier patch.

     
  • Don Porter
    Don Porter
    2005-07-18

     
    Attachments
  • Reinhard Max
    Reinhard Max
    2005-07-18

    Logged In: YES
    user_id=124643

    This is the fastest version I've been able to come up with.
    The patch now adds the same ASCII optimisations to
    Tcl_UtfNcmp and Tcl_UtfNcasecmp.

     
  • Reinhard Max
    Reinhard Max
    2005-07-18

     
  • Joe English
    Joe English
    2008-02-28

    Logged In: YES
    user_id=68433
    Originator: NO

    See also #1902648. This is a portability problem as well as a correctness problem.