#3944 string trim not trimming Unicode spaces

Milestone: obsolete: 8.6b3
Status: closed-fixed
Priority: 5
Updated: 2012-11-05
Created: 2008-03-05
Private: No

A string containing a leading Unicode no-break space (\u00A0) did not have that character removed by [string trim ...] (called with no chars argument).
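A minimal reproduction sketch of the report above. The variable names are illustrative, and the printed "after" length depends on the Tcl version in use, so no expected output is given.

```tcl
# A string with a leading no-break space (U+00A0) followed by "abc".
set s "\u00A0abc"

# Default form of string trim: no chars argument.
set trimmed [string trim $s]

# If the no-break space was removed, "after" is 3; otherwise it stays 4.
puts "Tcl [info patchlevel]: before=[string length $s] after=[string length $trimmed]"
```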

In Unicode there are several "whitespace" characters that [string trim] does not remove by default. (See http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29 for a working definition of white space.)

Ironically, [string is space ...] will return 1 for most of these characters, signifying that at some level these characters are known to be spaces.

Interestingly, the Unicode NEL (Next Line) character, \u0085, gets 0 from [string is space \u0085].
So do the following two:
U+180E MONGOLIAN VOWEL SEPARATOR
U+205F MEDIUM MATHEMATICAL SPACE
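To check how a given Tcl build classifies the characters mentioned here, a small probe along these lines may help. The proc name spaceReport is hypothetical, and the results differ between Tcl versions, so none are shown.

```tcl
# Hypothetical helper: map each code point to the result of [string is space].
proc spaceReport {codePoints} {
    set report [dict create]
    foreach cp $codePoints {
        # format %c turns the integer code point into a one-character string.
        dict set report $cp [string is space [format %c $cp]]
    }
    return $report
}

# SPACE, NEL, NO-BREAK SPACE, MONGOLIAN VOWEL SEPARATOR, MEDIUM MATHEMATICAL SPACE
set candidates {0x0020 0x0085 0x00A0 0x180E 0x205F}
dict for {cp isSpace} [spaceReport $candidates] {
    puts [format "U+%04X  string is space -> %d" $cp $isSpace]
}
```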

There are three issues, then:

1. [string is space ...] returns the wrong results for certain Unicode characters.

2. [string trim ...] (the form without characters specified) is not deleting Unicode space characters.

3. The documentation for [string trim ...] is not precise about which space characters are deleted (or not deleted).

The behavior of [string trim ...] should (in my opinion) be defined as removing those characters for which [string is space ...] returns 1.
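The proposed definition can be sketched in script form as follows. This is only an illustration of the idea, not the eventual core implementation; the proc name trimBySpaceClass is hypothetical, and what it removes depends on what [string is space] reports in the Tcl version at hand.

```tcl
# Hypothetical helper: strip leading and trailing characters for which
# [string is space] returns 1, leaving interior characters untouched.
proc trimBySpaceClass {str} {
    set first 0
    set last [expr {[string length $str] - 1}]
    # Advance past leading space characters.
    while {$first <= $last && [string is space [string index $str $first]]} {
        incr first
    }
    # Back up over trailing space characters.
    while {$last >= $first && [string is space [string index $str $last]]} {
        incr last -1
    }
    return [string range $str $first $last]
}

puts [string length [trimBySpaceClass "\u00A0 abc \t\n"]]
```

A script-level loop is slower than the built-in command, but it makes the dependence on [string is space] explicit.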

If [string trim ...] only trims ASCII white space characters, then those characters should be explicitly described and listed in the documentation.
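Independently of the default behavior, the set of characters to remove can be passed explicitly as the chars argument. The sketch below assumes an ASCII set of space, tab, newline, and carriage return, and an extended set containing the characters named in this report; both variable names are illustrative.

```tcl
# Assumed ASCII white space set: space, tab, newline, carriage return.
set asciiSpace " \t\n\r"

# Extended set: adds NO-BREAK SPACE, NEL, MONGOLIAN VOWEL SEPARATOR,
# and MEDIUM MATHEMATICAL SPACE to the ASCII set.
set extended "${asciiSpace}\u00A0\u0085\u180E\u205F"

# NBSP, space, "hello", space, NEL: nine characters in total.
set s "\u00A0 hello \u0085"

puts [string length [string trim $s $asciiSpace]]   ;# 9: the outermost characters are not in the ASCII set
puts [string length [string trim $s $extended]]     ;# 5: only "hello" remains
```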

Discussion

  • Jan Nijtmans

    Jan Nijtmans - 2012-09-21
    • milestone: 806152 --> obsolete: 8.6b3
     
  • Donal K. Fellows

    So, which Unicode character class are you thinking should satisfy the predicate? (I can't quite remember the details, but I do remember that \u0020 and \u00a0 are in different classes; the latter is, I believe, usually not regarded as a space even though it has a non-printing glyph...)

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-09-23

    > So, which Unicode character class are you thinking should satisfy the
    > predicate?
    I'm not sure yet, but the closest match would be Unicode's
    "Whitespace" property. See:
    <http://www.unicode.org/Public/6.1.0/ucd/PropList.txt>

    The current "string is space" doesn't correspond to
    the Unicode definition of "space" anyway, because
    CR and LF are not spaces either. "Whitespace" would
    be closer.

    Anyway, whatever the change, I would prefer to do it through
    a TIP. Adding <0085> could be done without
    a TIP, but <>

    But there are more candidates. For example,
    U+200B was considered a space in Unicode
    2.1, but no longer is.
    http://www.unicode.org/Public/2.1-Update4/PropList-2.1.9.txt

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-10-19

    See: <http://www.tcl.tk/cgi-bin/tct/tip/413>

     
  • Jan Nijtmans

    Jan Nijtmans - 2012-11-05
    • status: open --> closed-fixed
     
  • Jan Nijtmans

    Jan Nijtmans - 2012-11-05

    Fixed in (upcoming) Tcl 8.6

     
