Tcl / Read-Only Bugs / #3320 [:print:] wrong behaviour

Donal K. Fellows - 2005-12-09

Logged In: YES
user_id=79902

It seems that Perl defines [[:print:]] as
[[:space:][:graph:]] (or perhaps with [:blank:] instead of
[:space:], the documentation being a bit hazy in that
respect), whereas we define it as [[:alnum:]] (and yes, it
is documented that way).

The question is, what *should* we do?

The following procedure helps with checking these sorts of
things out:
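proc matches args {
    # Build one bracket expression from the class names given as arguments
    set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
    # Try each printable ASCII character and print "char-result" pairs
    for {set i 32} {$i<127} {incr i} {
        set c [format %c $i]
        puts -nonewline "$c-[regexp $RE $c]\t"
    }
    puts ""
}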

Don Porter - 2005-12-09

Logged In: YES
user_id=80530

Should we also consider consistency with [string is print]?

Donal K. Fellows - 2005-12-09

Logged In: YES
user_id=79902

Good point. Yes. And [string is print] follows C's isprint()
IIRC, so it is (almost certainly) the RE engine that is wrong.

OK, this should be fixed.

William John Poser - 2006-02-15

Logged In: YES
user_id=939324

Perhaps more important than the Perl definition is the POSIX
definition, according to which [:print:] = [:alnum:] U
[:punct:] U SPACE. Curiously, the current behaviour with
[:print:] = [:alnum:] is documented in Welch, Jones, and
Hobbs with no mention of it being a bug.
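
As a rough illustration of the definition quoted above, here is a minimal C sketch restricted to 7-bit ASCII. The function names posix_print_ascii and count_disagreements are hypothetical, and the comparison only says something about the C library's isprint() in the current locale, not about Tcl's RE engine.

```c
#include <ctype.h>

/* Hypothetical helper: [:alnum:] U [:punct:] U SPACE, ASCII input only. */
int posix_print_ascii(int c)
{
    if (c < 0 || c > 127)
        return 0;               /* outside the range this sketch covers */
    return isalnum(c) || ispunct(c) || c == ' ';
}

/* Count the ASCII codes where the definition above and isprint() differ. */
int count_disagreements(void)
{
    int c, n = 0;

    for (c = 0; c < 128; c++) {
        if (posix_print_ascii(c) != (isprint(c) != 0))
            n++;
    }
    return n;
}
```

In the default "C" locale the two should agree on every ASCII code, so count_disagreements() is expected to return 0 there.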

Donal K. Fellows - 2006-02-16

Logged In: YES
user_id=79902

Bug is located at line 817 of the HEAD regc_locale.c, and
consists of a missing arm for the CC_PRINT case. I believe
that the fix required (based on the POSIX definition, thanks
Bill!) is the attached patch, which I'd appreciate people
testing... :-)

Donal K. Fellows - 2006-02-16

priority: 5 --> 7

William John Poser - 2006-02-16

Logged In: YES
user_id=939324

Donal,

I think that your patch does what you intended, but what you
intended and what I intended aren't the same. What you've
got treats [:print:] as [:alnum:] U [:punct:] U [:space:],
which is what Perl does. What I intended was [:alnum:] U
[:punct:] U SPACE, where SPACE = 0x20. (The simple test is
to see whether [[:print:]] matches tab.) That is my
understanding of the POSIX standard. I just checked a few
other regexp engines. TRE, which makes a point of strict
POSIX conformance, follows my interpretation, as do GNU awk,
java.util.regex, ruby, vim, zsh, and, interestingly, pcre. So I would say
that Perl has got it wrong. The necessary fix is just to
delete the two bits involving NUM_SPACE_RANGE from your patch.
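
The tab test can also be run against the system's own POSIX regex library. The sketch below uses regcomp()/regexec() from <regex.h>; the helper name class_matches is hypothetical, and the result describes the C library's engine, not Tcl's.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the string s consists of exactly one character in the
 * named class, 0 if it does not, and -1 if the pattern fails to compile. */
int class_matches(const char *cls, const char *s)
{
    char pattern[64];
    regex_t re;
    int rc;

    /* e.g. cls = "print" gives the pattern ^[[:print:]]$ */
    snprintf(pattern, sizeof pattern, "^[[:%s:]]$", cls);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Calling class_matches("print", "\t") and class_matches("print", " ") gives the two data points of interest; an engine that follows the SPACE-only reading should reject the tab and accept the space.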

William John Poser - 2006-02-16

Logged In: YES
user_id=939324

GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

Donal K. Fellows - 2006-02-16

Logged In: YES
user_id=79902

What about [:space:] characters outside the classic ASCII
range? That's a total of 20 characters, and I'm not willing
to automatically just go with non-UNICODE-aware tools on
this. I ask this because it seems unreasonable to me to just
assume that old stuff is holy (an approach that has happened
in this area in the past; as a point to help understanding,
[:digit:] isn't the same as [0-9], and this is good.)

My characters of concern are:
\u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000

Donal K. Fellows - 2006-02-24

Logged In: YES
user_id=79902

The following C program indicates that there are large
numbers of characters that satisfy isprint() but neither
isalnum() nor ispunct()

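#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/*
 * Print the ranges of BMP characters that are printable but neither
 * alphanumeric nor punctuation. The wide-character isw*() functions are
 * used because isprint() and friends are only defined for values that fit
 * in an unsigned char (or EOF).
 */
int main(void) {
    unsigned int i, j = 1000000000;    /* j: last matching character seen */
    setlocale(LC_ALL, "en_GB.UTF-8");
    for (i=0 ; i<65536 ; i++) {
        if (iswprint(i) && !iswalnum(i) && !iswpunct(i)) {
            if (i != j+1) {
                printf("%04x-", i);    /* start of a new range */
            }
            j = i;
        } else if (i == j+1) {
            printf("%04x\n", j);       /* end of the current range */
        }
    }
    if (j == 65535) {
        printf("%04x\n", j);           /* close a range that runs to the end */
    }
    return 0;
}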

Interestingly, there are also many characters that are
isalnum||ispunct but not isprint. That seems very strange to
me; perhaps we need to find a real spec and use that instead
of guessing... :-)

William John Poser - 2006-03-02

Logged In: YES
user_id=939324

>Interestingly, there are also many characters that are
>isalnum||ispunct but not isprint. That seems very strange to
>me; perhaps we need to find a real spec and use that instead
>of guessing... :-)

My experience suggests that a lot of software has been
rather sloppily extended to handle Unicode with the result
that for many features the behavior is not only non-standard
and not "common sense" but downright bizarre. For instance, try
a range like [a-ALPHA] in your favorite regexp engine (other
than Tcl). The common sense correct result is that this
should match the characters U+0061 through U+03B1. Another
plausible result would be an error because it crosses
Unicode blocks (this is the gawk behaviour). But in addition
to these I have found several other things, including
matches that include not only alpha but the entire Greek range!

Anyhow, the other problem is that I'm pretty sure that there
isn't any standard governing the extension of the POSIX
classes to Unicode. POSIX states some principles but they
are very general, basically just that you have to preserve
the ASCII classes. Unicode has classes of its own in the
form of the General Character Properties, but they aren't
the same and don't map to the POSIX classes in an obvious way.

William John Poser - 2006-03-03

Logged In: YES
user_id=939324

Regarding [:space:], I checked out the classification
provided by the glibc wide character class functions by
running the program below in a variety of locales, saving
its output as ClassResults, and then filtering it with:

egrep "space T|Locale" ClassResults > SpaceResults

#include <stdlib.h>
#include <stdio.h>
#include <wctype.h>
#include <wchar.h>
#include <locale.h>

/* Print the wide-character class membership of each code point below U+FFFF. */
int main(int ac, char *av[]) {
    wchar_t i;
    setlocale(LC_ALL,"");    /* take the locale from the environment */
    printf("Locale: %s\n",setlocale(LC_ALL,NULL));
    for(i=0;i<0xFFFF;i++) {
        printf("U+%04X:\t",(unsigned int)i);
        printf("alpha %s\t",(iswalpha(i)? "T":"F"));
        printf("alnum %s\t",(iswalnum(i)? "T":"F"));
        printf("digit %s\t",(iswdigit(i)? "T":"F"));
        printf("cntrl %s\t",(iswcntrl(i)? "T":"F"));
        printf("punct %s\t",(iswpunct(i)? "T":"F"));
        printf("upper %s\t",(iswupper(i)? "T":"F"));
        printf("lower %s\t",(iswlower(i)? "T":"F"));
        printf("blank %s\t",(iswblank(i)? "T":"F"));
        printf("space %s\t",(iswspace(i)? "T":"F"));
        printf("graph %s\t",(iswgraph(i)? "T":"F"));
        printf("print %s\t",(iswprint(i)? "T":"F"));
        printf("xdigit %s\n",(iswxdigit(i)? "T":"F"));
    }
    exit(0);
}

In the C locale I got the expected:
Locale: C
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F

In all of the other locales that I tried
(ca_ES,de_DE,en_US,hi_IN,ja_JP,kk_KZ,th_TH,zh_TW)
I got the same result:

Locale: hi_IN
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2028: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+2029: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F

So, at least as far as glibc and the locale definitions
distributed with it are concerned there is a standard set of
space characters. The list is not the same as the characters
with Unicode General Property Zs or Z, nor with Bidi
property WS. Somebody has evidently worked through the
plausible candidates with their usage in mind.
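
For reference, the list above can be written down as a small range table. This C sketch is only a transcription of the non-C-locale output shown in this comment (BMP only); the names reported_space, in_reported_space and count_reported_space are hypothetical, and nothing here queries glibc itself.

```c
#include <stddef.h>

struct cp_range { unsigned int lo, hi; };

/* Transcribed from the listing above; inclusive ranges of code points. */
static const struct cp_range reported_space[] = {
    {0x0009, 0x000D}, {0x0020, 0x0020}, {0x1680, 0x1680},
    {0x2000, 0x2006}, {0x2008, 0x200B}, {0x2028, 0x2029},
    {0x205F, 0x205F}, {0x3000, 0x3000}
};

/* Returns 1 if cp appears in the transcribed list, else 0. */
int in_reported_space(unsigned int cp)
{
    size_t k;

    for (k = 0; k < sizeof reported_space / sizeof reported_space[0]; k++) {
        if (cp >= reported_space[k].lo && cp <= reported_space[k].hi)
            return 1;
    }
    return 0;
}

/* Total number of code points covered by the table. */
int count_reported_space(void)
{
    size_t k;
    int n = 0;

    for (k = 0; k < sizeof reported_space / sizeof reported_space[0]; k++)
        n += (int)(reported_space[k].hi - reported_space[k].lo + 1u);
    return n;
}
```

Written this way it is easy to see that U+00A0, U+2007 and U+202F, the no-break spaces, are not in the reported list, while U+205F is.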

Bill

William John Poser - 2006-03-03

Logged In: YES
user_id=939324

Using the same techniques as in my previous message, I get
a uniform list of characters that are in [:print:] but not
in [:alnum:] or [:punct:].

U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+FE45: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F
U+FE46: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F

Here is the diff against [:space:]:

> U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
> U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
> U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
> U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
> U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
13a20,21
> U+2028: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
> U+2029: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
16,17d23
< U+FE45: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F
< U+FE46: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F

It looks like [:print:] consists of [:graph:] plus [:space:]
minus (ASCII [:space:] - SPACE) plus U+FE45 and U+FE46,
which are the sesame points. This seems sensible.

William John Poser - 2006-03-03

Logged In: YES
user_id=939324

I forgot to account for U+2028 and U+2029. These are the
abstract line and paragraph separators. I guess it makes
sense for them to be excluded from [:print:], even though the
other non-ASCII [:space:] characters are included, since as I
understand it they have no corresponding glyphs but are
purely abstract.
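
One way to summarise the tables is that every [:space:] character left out of [:print:] is also flagged cntrl, and U+FE45 and U+FE46 are already graph. The C sketch below encodes that reading as "graph, or space that is not cntrl" and counts where it differs from iswprint() in the current locale. This is only my reading of the output above, not a rule taken from glibc or POSIX, and the function names are hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical rule inferred from the tables: graph, or non-control space. */
int print_like(unsigned int cp)
{
    wint_t wc = (wint_t)cp;

    return iswgraph(wc) || (iswspace(wc) && !iswcntrl(wc));
}

/* Count the code points in [lo, hi] (capped at the BMP) where the rule
 * above and iswprint() disagree in the current locale. */
int count_print_mismatches(unsigned int lo, unsigned int hi)
{
    unsigned int cp;
    int n = 0;

    for (cp = lo; cp <= hi && cp <= 0xFFFFu; cp++) {
        if (print_like(cp) != (iswprint((wint_t)cp) != 0))
            n++;
    }
    return n;
}
```

Running count_print_mismatches(0, 0xFFFF) after setlocale(LC_ALL, "") would show whether the rule holds in a given locale; for ASCII in the default "C" locale it should report no mismatches.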

Donal K. Fellows - 2006-04-12

assigned_to: pvgoran --> dkf

status: open --> open-fixed

Donal K. Fellows - 2006-04-12

Logged In: YES
user_id=79902

Reading around the web, I find that there's not much
agreement on what isprint() means at all outside the ASCII
domain. That really sucks.

So I'm defining it now. The [:print:] category shall now
contain all characters that are in any of the following
UNICODE categories:
Letter (L*)
Number (N*)
Punctuation (P*)
Symbol (S*)
Space (Zs) but not other kinds of whitespace

http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
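
The rule above can be stated as a test on the two-letter General Category abbreviation. This is a minimal C sketch of the rule as listed, not the code in the patch; the function name is_print_category is hypothetical, and looking up the category of a given character is left out.

```c
#include <string.h>

/* Returns 1 if a character with General Category gc (e.g. "Lu", "Nd",
 * "Zs") belongs to [:print:] under the rule listed above, else 0. */
int is_print_category(const char *gc)
{
    if (gc == NULL || gc[0] == '\0')
        return 0;
    switch (gc[0]) {
    case 'L':   /* Letter */
    case 'N':   /* Number */
    case 'P':   /* Punctuation */
    case 'S':   /* Symbol */
        return 1;
    case 'Z':   /* only Space (Zs), not line or paragraph separators */
        return strcmp(gc, "Zs") == 0;
    default:
        return 0;
    }
}
```

With this rule U+2028 (Zl) and U+2029 (Zp) stay out of [:print:], while the Zs spaces come in.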

Fixed in the HEAD with the attached patch. Backport candidate?

Donal K. Fellows - 2006-04-12

Unidiff vs HEAD

re_print.diff

Donal K. Fellows - 2006-08-09

status: open-fixed --> pending-fixed

SourceForge Robot - 2006-08-24

Logged In: YES
user_id=1312539

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

SourceForge Robot - 2006-08-24

status: pending-fixed --> closed-fixed