#4291 string equality testing does not cover encoding variants

Status: open
Priority: 7
Updated: 2009-12-11
Created: 2009-02-04
Creator: Don Porter
Private: No

% set a [encoding convertfrom identity \x21]
!
% set b [encoding convertfrom identity \xc0\xA1]
!
% expr {$a eq $b}
0
% string equal $a $b
0

Discussion

  • Donal K. Fellows

    • labels: 105659 --> 10. Objects
     
  • Donal K. Fellows

    • assigned_to: dkf --> msofer
     
  • Don Porter

    Don Porter - 2009-02-04

    extended example showing the inconsistency
    due to shimmering, and differing definitions
    of equality for different intreps:

    % set a [encoding convertfrom identity \x21]
    !
    % set b [encoding convertfrom identity \xc0\xa1]
    !
    % set s string; # Force direct evaluation - no compile!
    string
    % $s equal $a $b
    0
    % string length $a; # Convert to the "string" objType
    1
    % string length $b
    1
    % $s equal $a $b
    1

     
  • Donal K. Fellows

    IMO the core problem is the identity encoding.

     
  • Don Porter

    Don Porter - 2009-02-04

    here the identity encoding is simply
    a tool for introducing the encoding
    variants to be tested.

    The core problem is that Tcl has not
    made up its mind whether such variants
    are to be accepted or rejected.

     
  • Alexandre Ferrieux

    It seems we are clearly violating advice given in (uncompiled) StringEqualCmd:

    * Remember to keep code here in some sync with the byte-compiled versions
    * in tclExecute.c (INST_STR_EQ, INST_STR_NEQ and INST_STR_CMP as well as

    Indeed the code near that comment tries two shimmer-less comparisons (ByteArray and String) before resorting to GetStringFromObj(). While the equivalent code in TEBC goes straight to it...
    One first idea would be to respect the above advice and stick to perfect eval-compile symmetry.
    However, shimmering is just a matter of speed and shouldn't affect the external EIAS semantics.
    So it might be better to remove that comment, since the lack of symmetry helps highlight a nasty bug.

    Now to the bug itself: here we have two pure-strep's (results from [encoding convertfrom identity]), which are two different UTF-8 strings. By calling [string length] on them we compute their String(unicode) intrep, which happen to be the same. Then when we call the non-compiled variant of [string equal] we hit the shimmer-less case, comparing on the (equal) unicode strings.
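
    As a rough model of the paragraph above (written in Python, not taken from Tcl's C sources): the hypothetical `lenient_decode` below accepts overlong two-byte UTF-8 sequences, which is assumed here to mirror the behaviour the transcript shows. The two byte strings differ, yet they decode to the same single character.

```python
def lenient_decode(data: bytes) -> str:
    """Decode 1- and 2-byte UTF-8 sequences WITHOUT rejecting overlong forms.

    Hypothetical helper for illustration only; it is not Tcl's decoder.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # Two-byte sequence: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
            # No check that the result needed two bytes (overlong accepted).
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            # Anything else: pass the byte through as a Latin-1 character.
            out.append(chr(b))
            i += 1
    return "".join(out)


a_bytes = b"\x21"        # canonical encoding of "!"
b_bytes = b"\xc0\xa1"    # overlong two-byte encoding of the same code point

print(a_bytes == b_bytes)                                   # byte-level comparison
print(lenient_decode(a_bytes) == lenient_decode(b_bytes))   # character-level comparison
```

    The byte-level comparison corresponds to comparing the string reps; the character-level comparison corresponds to comparing the unicode intreps.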

    So it seems we have a situation similar to non-canonical lists that would be compared on their List intreps.
    The solution would be to add similarly an "isCanonical" flag to the String intrep, and take that flag into account in the "fast-track" comparisons (ie forbid such comparisons on all but canonical Strings).

    More generally this would be the case of any intrep that is "fast-tracked" in one of the equality tests and fails to record deviation from canonicity.
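
    The proposal above can be sketched as a toy model. Everything here (`StrObj`, `equal_fast`, `equal_checked`, the `canonical` field) is a hypothetical name chosen for illustration, and the decoder is an assumption that accepts overlong two-byte forms; none of it is Tcl's actual code.

```python
def _decode(data: bytes) -> str:
    """Lenient decoder (assumption): accepts overlong 2-byte UTF-8 forms."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if (0xC0 <= b <= 0xDF and i + 1 < len(data)
                and (data[i + 1] & 0xC0) == 0x80):
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


class StrObj:
    """Toy value with a byte string rep and an optional character intrep."""

    def __init__(self, raw: bytes):
        self.raw = raw          # the string rep
        self.chars = None       # the intrep, filled in lazily
        self.canonical = None   # unknown until the intrep exists

    def length(self) -> int:
        """Shimmer to the character intrep, recording canonicity."""
        if self.chars is None:
            self.chars = _decode(self.raw)
            # Canonical means re-encoding reproduces the original bytes.
            self.canonical = self.chars.encode("utf-8") == self.raw
        return len(self.chars)


def equal_fast(x: StrObj, y: StrObj) -> bool:
    """Fast track with no canonicity check: the behaviour reported as buggy."""
    if x.chars is not None and y.chars is not None:
        return x.chars == y.chars
    return x.raw == y.raw


def equal_checked(x: StrObj, y: StrObj) -> bool:
    """Fast track allowed only when both intreps are canonical."""
    if x.canonical and y.canonical:
        return x.chars == y.chars
    return x.raw == y.raw


a = StrObj(b"\x21")
b = StrObj(b"\xc0\xa1")
before = equal_fast(a, b)    # compares string reps
a.length(); b.length()       # both values shimmer
after = equal_fast(a, b)     # now compares intreps
print(before, after, equal_checked(a, b))
```

    In this model the unchecked fast track changes its answer once both values have shimmered, while the checked variant keeps comparing the string reps because `b` is not canonical.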

     
  • Joe English

    Joe English - 2009-02-05

    > The core problem is that Tcl has not
    > made up its mind whether such variants
    > are to be accepted or rejected.

    My vote: rejected. More specifically: an invalid UTF-8 octet sequence as a Tcl_Obj* string value leads to undefined behavior. (Not an error, _undefined behavior_. Tcl should not be required to detect such conditions, either.)
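
    For comparison only, and not as a statement about what Tcl does or should check: a strict UTF-8 decoder classifies the second sequence from the report as invalid. The sketch uses Python's built-in decoder; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")   # strict by default: rejects overlong forms
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x21"))       # the canonical "!"
print(is_valid_utf8(b"\xc0\xa1"))   # the overlong variant from the report
```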

    And [encoding convertfrom identity] has to go.

     
  • miguel sofer

    miguel sofer - 2009-05-01
    • priority: 5 --> 6
     
  • miguel sofer

    miguel sofer - 2009-05-01
    • priority: 6 --> 7
     
  • Donal K. Fellows

    I agree about eliminating the identity “encoding”; it's nothing but trouble.

     
  • miguel sofer

    miguel sofer - 2009-12-11

    just passing it around ....

     
  • miguel sofer

    miguel sofer - 2009-12-11
    • assigned_to: msofer --> dgp
     