It seems we are violating the advice given in a comment in the (uncompiled) StringEqualCmd:
* Remember to keep code here in some sync with the byte-compiled versions
* in tclExecute.c (INST_STR_EQ, INST_STR_NEQ and INST_STR_CMP as well as
Indeed, the code near that comment tries two shimmer-less comparisons (ByteArray and String) before resorting to GetStringFromObj(), while the equivalent code in TEBC goes straight to it...
A first idea would be to respect the above advice and stick to perfect eval-compile symmetry.
However, shimmering is just a matter of speed and shouldn't affect the external EIAS semantics.
So it might be better to remove that comment, since the lack of symmetry helps highlight a nasty bug.
Now to the bug itself: here we have two pure string-rep values (results of [encoding convertfrom identity]) that hold two different UTF-8 byte sequences. Calling [string length] on them computes their String (unicode) intreps, which happen to be the same. Then, when we call the non-compiled variant of [string equal], we hit the shimmer-less case, which compares the (equal) unicode strings.
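To make the mechanism concrete, here is a minimal standalone sketch in C. It is not code from the Tcl core; lenient_decode, bytes_equal and codepoints_equal are hypothetical names, and the decoder only handles 1- and 2-byte sequences. It assumes a decoder that, like the one at issue here, does not reject the overlong form \xc0\xa1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical lenient decoder (illustration only, not Tcl's).
 * Decodes one 1- or 2-byte UTF-8 sequence WITHOUT rejecting overlong
 * forms. Returns the number of bytes consumed, or 0 if the input is
 * outside what this sketch handles.
 */
size_t lenient_decode(const unsigned char *s, size_t len, unsigned *cp)
{
    assert(s != NULL && cp != NULL);
    if (len >= 1 && s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
        return 2;
    }
    return 0;
}

/* Equality on the raw bytes, as a comparison of string reps would see it. */
int bytes_equal(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Equality on decoded code points, as a comparison of unicode intreps
 * would see it. */
int codepoints_equal(const unsigned char *a, size_t alen,
                     const unsigned char *b, size_t blen)
{
    size_t i = 0, j = 0;
    while (i < alen && j < blen) {
        unsigned ca, cb;
        size_t na = lenient_decode(a + i, alen - i, &ca);
        size_t nb = lenient_decode(b + j, blen - j, &cb);
        if (na == 0 || nb == 0 || ca != cb) {
            return 0;
        }
        i += na;
        j += nb;
    }
    return i == alen && j == blen;
}
```

With a = {0x21} and b = {0xC0, 0xA1}, bytes_equal() returns 0 and codepoints_equal() returns 1: the same two values are unequal or equal depending on which representation gets compared.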
So it seems we have a situation similar to non-canonical lists that would be compared on their List intreps.
The solution would be to similarly add an "isCanonical" flag to the String intrep, and to take that flag into account in the "fast-track" comparisons (i.e. forbid such comparisons on all but canonical Strings).
More generally, this applies to any intrep that is "fast-tracked" in one of the equality tests and fails to record deviation from canonicity.
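A rough sketch of what such a guard could look like, in C. Everything here is hypothetical: SketchString is a stand-in and not Tcl's real Tcl_Obj or String struct, and the isCanonical field does not exist in the core. The sketch assumes the flag is set by whoever builds the intrep, and is nonzero only when regenerating the bytes from the unicode array would reproduce the string rep exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a value carrying a string rep plus an
 * optional unicode intrep. Not Tcl's actual data structure. */
typedef struct {
    const unsigned char *bytes;    /* string rep (always present here) */
    size_t numBytes;
    const unsigned short *unicode; /* intrep, or NULL when absent */
    size_t numChars;
    int isCanonical;               /* assumed flag: intrep round-trips
                                    * to exactly `bytes` */
} SketchString;

/* Equality that takes the fast track only for canonical intreps. */
int sketch_equal(const SketchString *a, const SketchString *b)
{
    assert(a != NULL && b != NULL);

    /* Fast track: both intreps exist and both are canonical, so
     * comparing code units agrees with comparing the string reps. */
    if (a->unicode != NULL && b->unicode != NULL
            && a->isCanonical && b->isCanonical) {
        return a->numChars == b->numChars
            && memcmp(a->unicode, b->unicode,
                      a->numChars * sizeof(unsigned short)) == 0;
    }

    /* Otherwise fall back to the string reps, which define equality
     * under EIAS. */
    return a->numBytes == b->numBytes
        && memcmp(a->bytes, b->bytes, a->numBytes) == 0;
}
```

With this guard, a value whose string rep is \xc0\xa1 would carry isCanonical == 0, so comparing it with "!" falls through to the byte comparison and yields 0 whether or not [string length] has been called first.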
> The core problem is that Tcl has not
> made up its mind whether such variants
> are to be accepted or rejected.
My vote: rejected. More specifically: invalid UTF-8 octet sequences in a Tcl_Obj* string value lead to undefined behavior. (Not an error, _undefined behavior_. Tcl should not be required to detect such conditions, either.)
And [encoding convertfrom identity] has to go.
extended example showing the inconsistency
due to shimmering, and differing definitions
of equality for different intreps:
% set a [encoding convertfrom identity \x21]
!
% set b [encoding convertfrom identity \xc0\xa1]
!
% set s string; # Force direct evaluation - no compile!
string
% $s equal $a $b
0
% string length $a; # Convert to the "string" objType
1
% string length $b
1
% $s equal $a $b
1
IMO the core problem is the identity encoding.
here the identity encoding is simply
a tool for introducing the encoding
variants to be tested.
The core problem is that Tcl has not
made up its mind whether such variants
are to be accepted or rejected.
I agree about eliminating the identity “encoding”; it's nothing but trouble.
just passing it around ....