[Tcl-bugs] [ tcl-Bugs-1908077 ] string trim not trimming Unicode spaces

Bugs item #1908077, was opened at 2008-03-05 07:04
Message generated for change (Comment added) made by nijtmans
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1908077&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 44. UTF-8 Strings
Group: development: 8.6b3
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: David Scott Cargo (escargo)
Assigned to: Jan Nijtmans (nijtmans)
Summary: string trim not trimming Unicode spaces

Initial Comment:
A string containing a leading Unicode nonbreaking space (\u00A0) did not have that character removed by [string trim ...] (no chars specified for removal).

In Unicode there are several "whitespace" characters that [string trim] does not remove by default. (See http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29
for a working definition of white space.)

Ironically, [string is space ...] will return 1 for most of these characters, signifying that at some level these characters are known to be spaces.

Interestingly, the Unicode NEL (Next Line), \u0085, gets 0 from [string is space \u0085].
So do the following two:
U180E MONGOLIAN VOWEL SEPARATOR
U205F MEDIUM MATHEMATICAL SPACE
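
A small script along these lines could be used to check the code points mentioned above. It is only a sketch: the proc name probeSpaces is made up for illustration, and the results are expected to differ between Tcl versions, so no output is shown.

```tcl
# probeSpaces is a hypothetical helper name, not part of Tcl.
# For each hex code point, return a list {code isSpace trimmed}, where
# isSpace is the result of [string is space] and trimmed says whether
# the default [string trim] removed the character from the front of "<ch>x".
proc probeSpaces {codes} {
    set rows {}
    foreach code $codes {
        set ch [format %c 0x$code]
        set isSpace [string is space -strict $ch]
        set trimmed [expr {[string trim "${ch}x"] eq "x"}]
        lappend rows [list $code $isSpace $trimmed]
    }
    return $rows
}

# Code points mentioned in this report, plus plain SPACE as a baseline.
foreach row [probeSpaces {0020 0085 00A0 180E 205F}] {
    lassign $row code isSpace trimmed
    puts "U+$code  is-space=$isSpace  trimmed=$trimmed"
}
```

Any row where is-space and trimmed disagree shows the inconsistency described in this report.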

There are 3 issues then.

[string is space ...] returns the wrong results for certain Unicode characters.

[string trim ...] (the form without characters specified) is not deleting Unicode space characters.

The documentation for [string trim ...] is not precise about which space characters are deleted (or not deleted).

The behavior of [string trim ...] should (in my opinion) be defined as removing those characters for which [string is space ...] is 1.

If [string trim ...] only trims ASCII white space characters, then they should be explicitly described and listed.
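
The proposed behavior (trim exactly those characters that [string is space] accepts) can be sketched in script form as below. This is an illustration only, not how Tcl implements [string trim]; trimSpaces is a hypothetical name.

```tcl
# trimSpaces is a hypothetical helper, not part of Tcl.
# It removes leading and trailing characters for which
# [string is space] returns 1, and leaves everything else alone.
proc trimSpaces {s} {
    set first 0
    set last [expr {[string length $s] - 1}]
    # Skip leading characters classified as space.
    while {$first <= $last && [string is space -strict [string index $s $first]]} {
        incr first
    }
    # Skip trailing characters classified as space.
    while {$last >= $first && [string is space -strict [string index $s $last]]} {
        incr last -1
    }
    # If the string was all spaces, first > last and the range is empty.
    return [string range $s $first $last]
}

puts "<[trimSpaces "  hello world\t\n"]>"
```

With such a helper, whatever [string is space] reports for a character like \u00A0 directly decides whether that character is trimmed.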

----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2012-11-05 06:45

Message:
Fixed in (upcoming) Tcl 8.6

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-10-19 00:22

Message:
See: <http://www.tcl.tk/cgi-bin/tct/tip/413>

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-09-23 01:26

Message:
> So, which Unicode character class are you thinking should satisfy the
> predicate?
I'm not sure yet, but the closest would be Unicode's
"Whitespace" property. See:
<http://www.unicode.org/Public/6.1.0/ucd/PropList.txt>

The current "string is space" doesn't correspond to
the Unicode definition of "space" anyway, because
CR and LF are not spaces either. "Whitespace" would
be closer.

Anyway, whatever change, I prefer to do that through
a TIP. Adding <0085> could be done without
a TIP, but <>

But there are more candidates. For example,
200B was considered a space in Unicode
2.1, but no longer is.
http://www.unicode.org/Public/2.1-Update4/PropList-2.1.9.txt

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-09-22 19:17

Message:
So, which Unicode character class are you thinking should satisfy the
predicate? (I can't quite remember the details, but I do remember that
\u0020 and \u00a0 are in different classes; the latter is I believe usually
not regarded as a space even though it has a non-printing glyph...)

----------------------------------------------------------------------
