From: SourceForge.net <no...@so...> - 2012-11-05 14:45:39
Bugs item #1908077, was opened at 2008-03-05 07:04
Message generated for change (Comment added) made by nijtmans
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1908077&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: 44. UTF-8 Strings
Group: development: 8.6b3
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: David Scott Cargo (escargo)
Assigned to: Jan Nijtmans (nijtmans)
Summary: string trim not trimming Unicode spaces

Initial Comment:
A string containing a leading Unicode nonbreaking space (\u00A0) did not have that character removed by [string trim ...] (no chars specified for removal).

In Unicode there are several "whitespace" characters that [string trim] does not remove by default. (See http://en.wikipedia.org/wiki/Whitespace_%28computer_science%29 for a working definition of white space.)

Ironically, [string is space ...] will return 1 for most of these characters, signifying that at some level these characters are known to be spaces.

Interestingly, the Unicode NEL (Next Line), \u0085, gets 0 from [string is space \u0085]. So do the following two:

U180E MONGOLIAN VOWEL SEPARATOR
U205F MEDIUM MATHEMATICAL SPACE

There are 3 issues then.

1. [string is space ...] returns the wrong results for certain Unicode characters.
2. [string trim ...] (the form without characters specified) is not deleting Unicode space characters.
3. The documentation for [string trim ...] is not precise about which space characters are deleted (or not deleted).

The behavior of [string trim ...] should (in my opinion) be defined as removing those characters for which [string is space .] is 1. If [string trim ...] only trims ASCII white space characters, then they should be explicitly described and listed.
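
The behavior proposed in the initial comment can be sketched in a few lines of Tcl. This is only an illustration of the suggested rule (trim every leading and trailing character for which [string is space] returns 1), not the change that was made to Tcl itself. The proc name trimUnicodeSpace is hypothetical, and which characters it trims depends on what [string is space] reports in the Tcl version at hand.

```tcl
# Hypothetical helper, not part of Tcl: remove leading and trailing
# characters that [string is space] classifies as whitespace.
proc trimUnicodeSpace {s} {
    set first 0
    set last [expr {[string length $s] - 1}]

    # Skip leading characters accepted by the predicate.
    while {$first <= $last
           && [string is space -strict [string index $s $first]]} {
        incr first
    }

    # Skip trailing characters accepted by the predicate.
    while {$last >= $first
           && [string is space -strict [string index $s $last]]} {
        incr last -1
    }

    # An empty or all-whitespace input yields an empty range.
    return [string range $s $first $last]
}

# Compare the built-in default trim with the predicate-based trim
# on a string wrapped in nonbreaking spaces.
set s "\u00A0hello\u00A0"
puts [string length [string trim $s]]
puts [string length [trimUnicodeSpace $s]]
```

If the two lengths differ, the interpreter shows the mismatch between [string trim] and [string is space] that this report describes.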
----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2012-11-05 06:45

Message:
Fixed in (upcoming) Tcl 8.6

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-10-19 00:22

Message:
See: <http://www.tcl.tk/cgi-bin/tct/tip/413>

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-09-23 01:26

Message:
> So, which Unicode character class are you thinking should satisfy the
> predicate?

I'm not sure yet, but the closest would be Unicode's "Whitespace" property. See:
<http://www.unicode.org/Public/6.1.0/ucd/PropList.txt>

The current "string is space" doesn't correspond to the Unicode definition of "space" anyway, because CR and LF are not spaces either. "Whitespace" would be closer. Anyway, whatever the change, I prefer to do it through a TIP.

Adding <0085> could be done without a TIP, but <>

But there are more candidates. For example, 200B was considered a space in Unicode 2.1, but is not anymore.
http://www.unicode.org/Public/2.1-Update4/PropList-2.1.9.txt

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2012-09-22 19:17

Message:
So, which Unicode character class are you thinking should satisfy the predicate? (I can't quite remember the details, but I do remember that \u0020 and \u00a0 are in different classes; the latter is, I believe, usually not regarded as a space even though it has a non-printing glyph...)

----------------------------------------------------------------------