#3320 [:print:] wrong behaviour

obsolete: 8.4.11
closed-fixed
7
2006-08-24
2005-12-09
No

% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi

Expected result is the original string.

This, however, is subject to definition and Tcl's
specification. If one keeps `perlre' as the reference,
this is a bug. Perlre (v5.6.1) says: 'print -- Any
alphanumeric or punctuation (special) character or space.'
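
As a rough cross-check outside Tcl, the sketch below counts the bytes of a string that C's isprint() rejects. It assumes the default "C" locale, and the helper name count_nonprint is made up for illustration; it is not part of Tcl or of any patch discussed here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes in s that isprint() rejects.
 * Assumes the default "C" locale, where isprint() accepts exactly
 * the codes 0x20 through 0x7E. */
size_t count_nonprint(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Cast first: passing a negative char to isprint() is undefined. */
        if (!isprint((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}
```

Under that assumption, count_nonprint("moi+moi+moi") returns 0, so a filter that removes non-printable characters would leave the string from the report unchanged.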

Discussion

  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    It seems that Perl defines [[:print:]] as
    [[:space:][:graph:]] and we define it as [[:alnum:]] (and
    yes, it is documented that way.) Or perhaps [:blank:]
    instead of [:space:], the documentation being a bit hazy in
    that respect.

    The question is, what *should* we do?

    The following procedure helps with checking these sorts of
    things out:
    proc matches args {
        set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
        for {set i 32} {$i<127} {incr i} {
            set c [format %c $i]
            puts -nonewline "$c-[regexp $RE $c]\t"
        }
        puts ""
    }

     
  • Don Porter

    Don Porter - 2005-12-09

    Logged In: YES
    user_id=80530

    Should we also consider consistency with [string is print]?

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    Good point. Yes. And [string is print] follows C's isprint()
    IIRC, so it is (almost certainly) the RE engine that is wrong.

    OK, this should be fixed.

     
  • William John Poser

    Logged In: YES
    user_id=939324

    Perhaps more important than the Perl definition is the POSIX
    definition, according to which [:print:] = [:alnum:] U
    [:punct:] U SPACE. Curiously, the current behaviour with
    [:print:] = [:alnum:] is documented in Welch, Jones, and
    Hobbs with no mention of it being a bug.
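
The identity quoted above, [:print:] = [:alnum:] U [:punct:] U SPACE, can be checked against the C library for 7-bit ASCII. This is a minimal sketch that assumes the default "C" locale; the function name posix_print_identity_holds is hypothetical.

```c
#include <ctype.h>

/* Hypothetical check: returns 1 if, for every 7-bit ASCII code,
 * isprint() agrees with isalnum() || ispunct() || SPACE (0x20),
 * and 0 otherwise. Assumes the default "C" locale. */
int posix_print_identity_holds(void)
{
    for (int c = 0; c < 128; c++) {
        int by_union = isalnum(c) || ispunct(c) || c == ' ';
        int by_isprint = isprint(c) != 0;
        if (by_union != by_isprint) {
            return 0;
        }
    }
    return 1;
}
```

This says nothing about characters outside ASCII; it only confirms that the three-way union and isprint() coincide on the range the "C" locale covers.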

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    Bug is located at line 817 of the HEAD regc_locale.c, and
    consists of a missing arm for the CC_PRINT case. I believe
    that the fix required (based on the POSIX definition, thanks
    Bill!) is the attached patch, which I'd appreciate people
    testing... :-)

     
  • Donal K. Fellows

    • priority: 5 --> 7
     
  • William John Poser

    Logged In: YES
    user_id=939324

    Donal,

    I think that your patch does what you intended, but what you
    intended and what I intended aren't the same. What you've
    got treats [:print:] as [:alnum:] U [:punct:] U [:space:],
    which is what Perl does. What I intended was [:alnum:] U
    [:punct:] U SPACE, where SPACE = 0x20. (The simple test is
    to see whether [[:print:]] matches tab.) That is my
    understanding of the POSIX standard. I just checked a few
    other regexp engines. TRE, which makes a point of strict
    POSIX conformance, follows my interpretation, as do GNU awk,
    java.util.regex,
    ruby, vim, zsh, and, interestingly, pcre. So I would say
    that Perl has got it wrong. The necessary fix is just to
    delete the two bits involving NUM_SPACE_RANGE from your patch.
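
The two readings differ only in how much whitespace they admit, which is why tab is a convenient test. The sketch below expresses both as C predicates and counts the 7-bit codes on which they disagree. It assumes the default "C" locale, and all three function names are invented for illustration.

```c
#include <ctype.h>

/* Reading 1: alnum U punct U SPACE, where SPACE is only 0x20. */
int in_print_space_only(int c)
{
    return isalnum(c) || ispunct(c) || c == ' ';
}

/* Reading 2: alnum U punct U [:space:], i.e. every whitespace character. */
int in_print_all_space(int c)
{
    return isalnum(c) || ispunct(c) || isspace(c);
}

/* Count the 7-bit ASCII codes on which the two readings disagree. */
int count_disagreements(void)
{
    int n = 0;
    for (int c = 0; c < 128; c++) {
        if (in_print_space_only(c) != in_print_all_space(c)) {
            n++;
        }
    }
    return n;
}
```

In the "C" locale the disagreements are the five whitespace control characters: tab, newline, vertical tab, form feed and carriage return.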

     
  • William John Poser

    Logged In: YES
    user_id=939324

    GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    What about [:space:] characters outside the classic ASCII
    range? That's a total of 20 characters, and I'm not willing
    to automatically just go with non-UNICODE-aware tools on
    this. I ask this because it seems unreasonable to me to just
    assume that old stuff is holy (an approach that has happened
    in this area in the past; as a point to help understanding,
    [:digit:] isn't the same as [0-9], and this is good.)

    My characters of concern are:
    \u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000

     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    The following C program indicates that there are large
    numbers of characters that satisfy isprint() but neither
    isalnum() nor ispunct()

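    #include <ctype.h>
    #include <stdio.h>
    #include <locale.h>
    /* Note: the <ctype.h> functions are only defined for unsigned char
     * values and EOF; results for larger i depend on the C library. */
    int main() {
        unsigned int i,j=1000000000;
        setlocale(LC_ALL, "en_GB.UTF-8");
        for (i=0 ; i<65536 ; i++) {
            if (isprint(i) && !isalnum(i) && !ispunct(i)) {
                /* Start of a new run: print its first code point. */
                if (i!=j+1) {
                    printf("%04x-", i);
                }
                j = i;
            } else if (i == j+1) {
                /* Run just ended: print its last code point. */
                printf("%04x\n", j);
            }
        }
        return 0;
    }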

    Interestingly, there are also many characters that are
    isalnum||ispunct but not isprint. That seems very strange to
    me; perhaps we need to find a real spec and use that instead
    of guessing... :-)

     
  • William John Poser

    Logged In: YES
    user_id=939324

    >Interestingly, there are also many characters that are
    >isalnum||ispunct but not isprint. That seems very strange to
    >me; perhaps we need to find a real spec and use that instead
    >of guessing... :-)

    My experience suggests that a lot of software has been
    rather sloppily extended to handle Unicode with the result
    that for many features the behavior is not only non-standard
    and not "common sense" but downright bizarre. For instance, try
    a range like [a-ALPHA] in your favorite regexp engine (other
    than Tcl). The common sense correct result is that this
    should match the characters U+0061 through U+03B1. Another
    plausible result would be an error because it crosses
    Unicode blocks (this is the gawk behaviour). But in addition
    to these I have found several other things, including
    matches that include not only alpha but the entire Greek range!

    Anyhow, the other problem is that I'm pretty sure that there
    isn't any standard governing the extension of the POSIX
    classes to Unicode. POSIX states some principles but they
    are very general, basically just that you have to preserve
    the ASCII classes. Unicode has classes of its own in the
    form of the General Character Properties, but they aren't
    the same and don't map to the POSIX classes in an obvious way.

     
  • William John Poser

    Logged In: YES
    user_id=939324

    Regarding [:space:], I checked out the classification
    provided by the glibc wide character class functions using
    the following program in a variety of locales followed by:

    egrep "space T|Locale" ClassResults > SpaceResults

    #include <stdlib.h>
    #include <stdio.h>
    #include <wctype.h>
    #include <wchar.h>
    #include <locale.h>

    int main(int ac, char *av[]) {
        wchar_t i;
        setlocale(LC_ALL,"");
        printf("Locale: %s\n",setlocale(LC_ALL,NULL));
        for(i=0;i<0xFFFF;i++) {
            printf("U+%04X:\t",i);
            printf("alpha %s\t",(iswalpha(i)? "T":"F"));
            printf("alnum %s\t",(iswalnum(i)? "T":"F"));
            printf("digit %s\t",(iswdigit(i)? "T":"F"));
            printf("cntrl %s\t",(iswcntrl(i)? "T":"F"));
            printf("punct %s\t",(iswpunct(i)? "T":"F"));
            printf("upper %s\t",(iswupper(i)? "T":"F"));
            printf("lower %s\t",(iswlower(i)? "T":"F"));
            printf("blank %s\t",(iswblank(i)? "T":"F"));
            printf("space %s\t",(iswspace(i)? "T":"F"));
            printf("graph %s\t",(iswgraph(i)? "T":"F"));
            printf("print %s\t",(iswprint(i)? "T":"F"));
            printf("xdigit %s\n",(iswxdigit(i)? "T":"F"));
        }
        exit(0);
    }

    In the C locale I got the expected:
    Locale: C
    U+0009: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank T space T graph F print F xdigit F
    U+000A: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000B: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000C: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000D: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+0020: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F

    In all of the other locales that I tried
    (ca_ES,de_DE,en_US,hi_IN,ja_JP,kk_KZ,th_TH,zh_TW)
    I got the same result:

    Locale: hi_IN
    U+0009: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank T space T graph F print F xdigit F
    U+000A: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000B: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000C: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+000D: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+0020: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+1680: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2000: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2001: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2002: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2003: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2004: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2005: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2006: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2008: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2009: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+200A: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+200B: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2028: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+2029: alpha F alnum F digit F cntrl T punct F upper F
    lower F blank F space T graph F print F xdigit F
    U+205F: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+3000: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F

    So, at least as far as glibc and the locale definitions
    distributed with it are concerned there is a standard set of
    space characters. The list is not the same as the characters
    with Unicode General Property Zs or Z, nor with Bidi
    property WS. Somebody has evidently worked through the
    plausible candidates with their usage in mind.

    Bill

     
  • William John Poser

    Logged In: YES
    user_id=939324

    Using the same techniques as in my previous message, I get
    a uniform list of characters that are in [:print:] but not
    in [:alnum:] or [:punct:].

    U+0020: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+1680: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2000: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2001: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2002: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2003: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2004: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2005: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2006: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2008: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+2009: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+200A: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+200B: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+205F: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+3000: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank T space T graph F print T xdigit F
    U+FE45: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank F space F graph T print T xdigit F
    U+FE46: alpha F alnum F digit F cntrl F punct F upper F
    lower F blank F space F graph T print T xdigit F

    Here is the diff against [:space:]:

    > U+0009: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank T space T graph F print F xdigit F
    > U+000A: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    > U+000B: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    > U+000C: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    > U+000D: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    13a20,21
    > U+2028: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    > U+2029: alpha F alnum F digit F cntrl T punct F upper F
      lower F blank F space T graph F print F xdigit F
    16,17d23
    < U+FE45: alpha F alnum F digit F cntrl F punct F upper F
      lower F blank F space F graph T print T xdigit F
    < U+FE46: alpha F alnum F digit F cntrl F punct F upper F
      lower F blank F space F graph T print T xdigit F

    It looks like [:print:] consists of [:graph:] plus [:space:]
    minus (ASCII [:space:] - SPACE) plus U+FE45 and U+FE46,
    which are the sesame points. This seems sensible.

     
  • William John Poser

    Logged In: YES
    user_id=939324

    I forgot to account for U+2028 and U+2029. These are the
    abstract line and paragraph separators. I guess it makes
    sense for them to be excluded from [:print:] even though the
    other non-ASCII [:space:] characters are included since as I
    understand it they have no corresponding glyphs but are
    purely abstract.

     
  • Donal K. Fellows

    • assigned_to: pvgoran --> dkf
    • status: open --> open-fixed
     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    Reading around the web, I find that there's not much
    agreement on what isprint() means at all outside the ASCII
    domain. That really sucks.

    So I'm defining it now. The [:print:] category shall now
    contain all characters that are in any of the following
    UNICODE categories:
    Letter (L*)
    Number (N*)
    Punctuation (P*)
    Symbol (S*)
    Space (Zs) but not other kinds of whitespace

    http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

    Fixed in the HEAD with the attached patch. Backport candidate?
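
The category rule stated above can be sketched as a predicate over Unicode General Category abbreviations. This is only an illustration of the rule as written in this comment, not the code from the attached patch; the function name category_is_print is hypothetical, and the sketch assumes the caller already knows a character's two-letter category (for example "Lu", "Nd", "Zs" or "Zl").

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: returns 1 if a character whose Unicode
 * General Category abbreviation is gc would fall in [:print:] under
 * the rule above (L*, N*, P*, S*, and Zs only), and 0 otherwise. */
int category_is_print(const char *gc)
{
    if (gc == NULL || strlen(gc) != 2) {
        return 0;
    }
    switch (gc[0]) {
    case 'L': /* Letter */
    case 'N': /* Number */
    case 'P': /* Punctuation */
    case 'S': /* Symbol */
        return 1;
    case 'Z': /* Separator: only Zs (space), not Zl or Zp */
        return gc[1] == 's';
    default:  /* Marks (M*), controls and other (C*), anything else */
        return 0;
    }
}
```

Under this rule U+0020 and U+3000 (both Zs) are printable, while U+2028 (Zl), U+2029 (Zp) and tab (Cc) are not, which matches the exclusions discussed earlier in the thread.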

     
  • Donal K. Fellows

    Unidiff vs HEAD

     
  • Donal K. Fellows

    • status: open-fixed --> pending-fixed
     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-fixed --> closed-fixed