% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi
Expected result is the original string.
This, however, is subject to definition and Tcl's
specification. If one keeps `perlre' as the reference,
this is a bug. Perlre (v5.6.1) says: 'print -- Any
alphanumeric or punctuation (special) character or space.'
Logged In: YES
user_id=79902
It seems that Perl defines [[:print:]] as
[[:space:][:graph:]] and we define it as [[:alnum:]] (and
yes, it is documented that way.) Or perhaps [:blank:]
instead of [:space:], the documentation being a bit hazy in
that respect.
The question is, what *should* we do?
The following procedure helps with checking these sorts of
things out:
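proc matches args {
    # Build a bracket expression such as [[:alnum:][:punct:]] from the class names
    set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
    # Test each printable ASCII character; print 1 for a match, 0 otherwise
    for {set i 32} {$i<127} {incr i} {
        set c [format %c $i]
        puts -nonewline "$c-[regexp $RE $c]\t"
    }
    puts ""
}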
Logged In: YES
user_id=80530
Should we also consider consistency with [string is print]?
Logged In: YES
user_id=79902
Good point. Yes. And [string is print] follows C's isprint()
IIRC, so it is (almost certainly) the RE engine that is wrong.
OK, this should be fixed.
Logged In: YES
user_id=939324
Perhaps more important than the Perl definition is the POSIX
definition, according to which [:print:] = [:alnum:] U
[:punct:] U SPACE. Curiously, the current behaviour with
[:print:] = [:alnum:] is documented in Welch, Jones, and
Hobbs with no mention of it being a bug.
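The POSIX-style definition quoted above can be checked against the C library for 7-bit codes. This is a minimal sketch, not anything taken from Tcl: the function name `print_mismatches_ascii` is made up for illustration, and the result depends on the active locale (a program that never calls setlocale() runs in the "C" locale).

```c
#include <ctype.h>

/* Counts the 7-bit codes for which isprint() disagrees with the union
 * alnum | punct | SPACE (0x20). The function name is hypothetical. */
int print_mismatches_ascii(void)
{
    int c, mismatches = 0;

    for (c = 0; c < 128; c++) {
        int by_union = isalnum(c) || ispunct(c) || c == ' ';
        if ((isprint(c) != 0) != (by_union != 0)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

In the "C" locale this should return 0, since isprint() there covers exactly the graphic characters plus the space character.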
Logged In: YES
user_id=79902
Bug is located at line 817 of the HEAD regc_locale.c, and
consists of a missing arm for the CC_PRINT case. I believe
that the fix required (based on the POSIX definition, thanks
Bill!) is the attached patch, which I'd appreciate people
testing... :-)
Logged In: YES
user_id=939324
Donal,
I think that your patch does what you intended, but what you
intended and what I intended aren't the same. What you've
got treats [:print:] as [:alnum:] U [:punct:] U [:space:],
which is what Perl does. What I intended was [:alnum:] U
[:punct:] U SPACE, where SPACE = 0x20. (The simple test is
to see whether [[:print:]] matches tab.) That is my
understanding of the POSIX standard. I just checked a few
other regexp engines. TRE, which makes a point of strict
POSIX conformance, follows my interpretation, as do GNU awk,
java.util.regex, ruby, vim, zsh, and, interestingly, pcre. So I would say
that Perl has got it wrong. The necessary fix is just to
delete the two bits involving NUM_SPACE_RANGE from your patch.
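The "does [[:print:]] match tab" test mentioned above can also be run against the platform's own POSIX regex implementation for comparison. This is only a sketch: it uses regcomp()/regexec() from <regex.h>, not Tcl's engine, and the helper name `print_class_matches` is an assumption made for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the bracket expression [[:print:]] matches somewhere in s,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * The function name is hypothetical. */
int print_class_matches(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "[[:print:]]", REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling print_class_matches("\t") and print_class_matches(" ") shows which interpretation the local library follows: under the strict reading, tab does not match and space does.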
Logged In: YES
user_id=939324
GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.
Logged In: YES
user_id=79902
What about [:space:] characters outside the classic ASCII
range? That's a total of 20 characters, and I'm not willing
to automatically just go with non-UNICODE-aware tools on
this. I ask this because it seems unreasonable to me to just
assume that old stuff is holy (an approach that has happened
in this area in the past; as a point to help understanding,
[:digit:] isn't the same as [0-9], and this is good.)
My characters of concern are:
\u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000
Logged In: YES
user_id=79902
The following C program indicates that there are large
numbers of characters that satisfy isprint() but neither
isalnum() nor ispunct()
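#include <ctype.h>
#include <stdio.h>
#include <locale.h>

/* Caution: the <ctype.h> functions are only specified for unsigned char
 * values and EOF, so results above 255 depend on the C library. */
int main(void) {
    /* j holds the last matching code; its initial value never equals i-1 */
    unsigned int i, j = 1000000000;

    setlocale(LC_ALL, "en_GB.UTF-8");
    for (i = 0; i < 65536; i++) {
        if (isprint(i) && !isalnum(i) && !ispunct(i)) {
            if (i != j+1) {
                printf("%04x-", i);     /* start of a new range */
            }
            j = i;
        } else if (i == j+1) {
            printf("%04x\n", j);        /* end of the current range */
        }
    }
    return 0;
}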
Interestingly, there are also many characters that are
isalnum||ispunct but not isprint. That seems very strange to
me; perhaps we need to find a real spec and use that instead
of guessing... :-)
Logged In: YES
user_id=939324
>Interestingly, there are also many characters that are
>isalnum||ispunct but not isprint. That seems very strange to
>me; perhaps we need to find a real spec and use that instead
>of guessing... :-)
My experience suggests that a lot of software has been
rather sloppily extended to handle Unicode with the result
that for many features the behavior is not only non-standard
and contrary to "common sense" but downright bizarre. For instance, try
a range like [a-ALPHA] in your favorite regexp engine (other
than Tcl). The common sense correct result is that this
should match the characters U+0061 through U+03B1. Another
plausible result would be an error because it crosses
Unicode blocks (this is the gawk behaviour). But in addition
to these I have found several other things, including
matches that include not only alpha but the entire Greek range!
Anyhow, the other problem is that I'm pretty sure that there
isn't any standard governing the extension of the POSIX
classes to Unicode. POSIX states some principles but they
are very general, basically just that you have to preserve
the ASCII classes. Unicode has classes of its own in the
form of the General Character Properties, but they aren't
the same and don't map to the POSIX classes in an obvious way.
Logged In: YES
user_id=939324
Regarding [:space:], I checked out the classification
provided by the glibc wide character class functions by
running the program below in a variety of locales and then
filtering its output (ClassResults) with:
egrep "space T|Locale" ClassResults > SpaceResults
#include <stdlib.h>
#include <stdio.h>
#include <wctype.h>
#include <wchar.h>
#include <locale.h>

int main(int ac, char *av[]) {
    wchar_t i;

    /* Use the locale named by the environment (LANG, LC_ALL, ...) */
    setlocale(LC_ALL,"");
    printf("Locale: %s\n",setlocale(LC_ALL,NULL));
    /* One line per code point: T/F for each wide character class */
    for(i=0;i<0xFFFF;i++) {
        printf("U+%04X:\t",(unsigned int)i);
        printf("alpha %s\t",(iswalpha(i)? "T":"F"));
        printf("alnum %s\t",(iswalnum(i)? "T":"F"));
        printf("digit %s\t",(iswdigit(i)? "T":"F"));
        printf("cntrl %s\t",(iswcntrl(i)? "T":"F"));
        printf("punct %s\t",(iswpunct(i)? "T":"F"));
        printf("upper %s\t",(iswupper(i)? "T":"F"));
        printf("lower %s\t",(iswlower(i)? "T":"F"));
        printf("blank %s\t",(iswblank(i)? "T":"F"));
        printf("space %s\t",(iswspace(i)? "T":"F"));
        printf("graph %s\t",(iswgraph(i)? "T":"F"));
        printf("print %s\t",(iswprint(i)? "T":"F"));
        printf("xdigit %s\n",(iswxdigit(i)? "T":"F"));
    }
    exit(0);
}
In the C locale I got the expected:
Locale: C
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
In all of the other locales that I tried
(ca_ES, de_DE, en_US, hi_IN, ja_JP, kk_KZ, th_TH, zh_TW)
I got one and the same result, shown here for hi_IN:
Locale: hi_IN
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2028: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+2029: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
So, at least as far as glibc and the locale definitions
distributed with it are concerned there is a standard set of
space characters. The list is not the same as the characters
with Unicode General Property Zs or Z, nor with Bidi
property WS. Somebody has evidently worked through the
plausible candidates with their usage in mind.
Bill
Logged In: YES
user_id=939324
Using the same techniques as in my previous message, I get
a uniform list of characters that are in [:print:] but not
in [:alnum:] or [:punct:].
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+FE45: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F
U+FE46: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F
Here is the diff against [:space:]:
> U+0009: alpha F alnum F digit F cntrl T punct F upper F lower F blank T space T graph F print F xdigit F
> U+000A: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000B: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000C: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000D: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
13a20,21
> U+2028: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+2029: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
16,17d23
< U+FE45: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F
< U+FE46: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F
It looks like [:print:] consists of [:graph:] plus [:space:]
minus (ASCII [:space:] - SPACE) plus U+FE45 and U+FE46,
which are the sesame points. This seems sensible.
Logged In: YES
user_id=939324
I forgot to account for U+2028 and U+2029. These are the
abstract line and paragraph separators. I guess it makes
sense for them to be excluded from [:print:] even though the
other non-ASCII [:space:] characters are included since as I
understand it they have no corresponding glyphs but are
purely abstract.
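One way to read the rows listed above is that every [:space:] character with print F also has cntrl T. Assuming that reading (it is an inference from the rows shown, not a documented glibc rule), the sketch below counts where iswprint() disagrees with "graph, or space that is not a control character" over a range of code points in the current locale. The function name `print_rule_disagreements` is hypothetical.

```c
#include <wctype.h>

/* Counts code points in [lo, hi] where iswprint() differs from
 * iswgraph() || (iswspace() && !iswcntrl()) in the current locale.
 * The rule being tested is an assumption, not a library guarantee. */
long print_rule_disagreements(wint_t lo, wint_t hi)
{
    long count = 0;
    wint_t c;

    for (c = lo; c <= hi; c++) {
        int guess = iswgraph(c) || (iswspace(c) && !iswcntrl(c));
        if ((iswprint(c) != 0) != (guess != 0)) {
            count++;
        }
        if (c == hi) {
            break;              /* avoid wrap-around when hi is the maximum */
        }
    }
    return count;
}
```

A nonzero count for some locale would point at characters, such as U+FE45 and U+FE46 in the listing, that need a closer look.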
Logged In: YES
user_id=79902
Reading around the web, I find that there's not much
agreement on what isprint() means at all outside the ASCII
domain. That really sucks.
So I'm defining it now. The [:print:] category shall now
contain all characters that are in any of the following
UNICODE categories:
Letter (L*)
Number (N*)
Punctuation (P*)
Symbol (S*)
Space (Zs) but not other kinds of whitespace
http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
Fixed in the HEAD with the attached patch. Backport candidate?
Unidiff vs HEAD
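The category list above can be restated as a small predicate over General Category values. This is only an illustration of the stated rule, not the attached patch: the function name `category_is_print` and the two-letter string interface are assumptions, and looking up the category of a given code point needs Unicode data that is not shown here.

```c
#include <string.h>

/* gc is a two-letter Unicode General Category value such as "Lu" or "Zs".
 * Returns 1 if that category is in the [:print:] definition listed above
 * (L*, N*, P*, S*, Zs), otherwise 0. The function name is hypothetical. */
int category_is_print(const char *gc)
{
    if (gc == NULL || strlen(gc) != 2) {
        return 0;
    }
    switch (gc[0]) {
    case 'L': case 'N': case 'P': case 'S':
        return 1;
    case 'Z':
        return gc[1] == 's';    /* Zs only; Zl and Zp are excluded */
    default:
        return 0;
    }
}
```

Under this rule "Zs" is accepted while "Zl" (line separator, U+2028) and "Zp" (paragraph separator, U+2029) are rejected, which matches the treatment of those two characters discussed earlier in the thread.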
Logged In: YES
user_id=1312539
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).