From: SourceForge.net <no...@so...> - 2012-01-09 20:41:26
Bugs item #3464428, was opened at 2011-12-23 07:11
Message generated for change (Comment added) made by nijtmans
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3464428&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 44. UTF-8 Strings
Group: current: 8.5.11
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Jan Nijtmans (nijtmans)
Assigned to: Jan Nijtmans (nijtmans)
Summary: string is graph \u0120 is wrong

Initial Comment:
This should give 1, but it gives 0

----------------------------------------------------------------------

>Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-09 12:41

Message:
Fixed on all open branches

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2012-01-04 05:51

Message:
Re-opened, because it appears that some other characters give
wrong results with "string is graph" (and "string is control"):

%string is graph \u00a0
1
%string is graph \u2028
1
%string is graph \u2029
1

Comparing this with the Ruby implementation:

>/[[:graph:]]/.match("\u00a0")
=> nil
>/[[:graph:]]/.match("\u2028")
=> nil
>/[[:graph:]]/.match("\u2029")
=> nil

This tells us that space (\u0020) is not the only character that
needs to be excluded from "string is graph": all characters in
the Unicode SPACE categories ("Zs", "Zl" and "Zp") need to be
excluded, which is indeed more logical.

And, while we are at it anyway:

%string is control \u00ad
0
%string is control \ue000
0

Compare this with Ruby:

/[[:cntrl:]]/.match("\u00ad")
=> #<TypeError: can't dump MatchData>
/[[:cntrl:]]/.match("\ue000")
=> #<TypeError: can't dump MatchData>

This tells us that for "string is control" we need to take the
Unicode classes "Cf" and "Co" into account in addition to "Cc"
(NOT "Cs", because that's for surrogates; Ruby agrees with that!)

All of those characters are outside of the ASCII range, which
explains why this was never noticed before.

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2011-12-23 16:32

Message:
fixed on all open branches

----------------------------------------------------------------------
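
For readers who want to check the category rules from the 2012-01-04 comment outside of Tcl, here is a minimal sketch in Python using the standard unicodedata module. It is not the code that went into Tcl. The helper names is_graph and is_control are hypothetical, and treating every "C*" category as non-graphic is an assumption of this sketch, not something stated in the thread.

```python
import unicodedata

# Category sets named in the comment thread.
SPACE_CATEGORIES = {"Zs", "Zl", "Zp"}
CONTROL_CATEGORIES = {"Cc", "Cf", "Co"}  # deliberately not "Cs" (surrogates)


def is_control(ch: str) -> bool:
    """Hypothetical stand-in for "string is control" on one character."""
    return unicodedata.category(ch) in CONTROL_CATEGORIES


def is_graph(ch: str) -> bool:
    """Hypothetical stand-in for "string is graph" on one character.

    Excludes the three space categories, as the thread proposes.
    Assumption of this sketch: all "C*" categories (control, format,
    private use, surrogate, unassigned) are also treated as non-graphic.
    """
    cat = unicodedata.category(ch)
    return cat not in SPACE_CATEGORIES and not cat.startswith("C")


if __name__ == "__main__":
    # Print the category and both classifications for the thread's characters.
    for cp in (0x0020, 0x00A0, 0x0120, 0x2028, 0x2029, 0x00AD, 0xE000):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.category(ch)} "
              f"graph={is_graph(ch)} control={is_control(ch)}")
```

Under these rules \u0120 (category "Lu") counts as graphic, \u00a0, \u2028 and \u2029 do not, and \u00ad and \ue000 count as control characters while a lone surrogate does not.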