[Tcl-bugs] [ tcl-Bugs-1376892 ] [:print:] wrong behaviour

Bugs item #1376892, was opened at 2005-12-09 05:23
Message generated for change (Comment added) made by dkf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1376892&group_id=10894

Category: 42. Regexp
Group: obsolete: 8.4.11
Status: Open
Resolution: None
Priority: 7
Submitted By: Petteri Kettunen (petterik)
Assigned to: Pavel Goran (pvgoran)
Summary: [:print:] wrong behaviour

Initial Comment:
% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi

Expected result is the original string.

This, however, is subject to definition and Tcl's
specification. If one keeps `perlre' as the reference,
this is a bug. Perlre (v5.6.1) says: 'print -- Any
alphanumeric or punctuation (special) character or space.'
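
To see which characters are affected on a given build, a minimal sketch along these lines may help. The helper name strippedChars is hypothetical, and what it prints depends on the Tcl version in use.

```tcl
# Hypothetical helper: return the characters in the printable ASCII range
# (32..126) that the regexp class [[:print:]] fails to match.
proc strippedChars {} {
    set missed {}
    for {set i 32} {$i < 127} {incr i} {
        set c [format %c $i]
        if {![regexp {[[:print:]]} $c]} {
            lappend missed $c
        }
    }
    return $missed
}

# An empty list means every character from space to tilde counts as printable.
puts [strippedChars]
```

Any character listed here is one that the regsub above would delete from the input string.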

----------------------------------------------------------------------

>Comment By: Donal K. Fellows (dkf)
Date: 2006-02-16 09:22

Message:
Logged In: YES 
user_id=79902

What about [:space:] characters outside the classic ASCII
range? That's a total of 20 characters, and I'm not willing
to automatically just go with non-UNICODE-aware tools on
this. I ask this because it seems unreasonable to me to just
assume that old stuff is holy (an approach that has been taken
in this area in the past; as a point to help understanding,
[:digit:] isn't the same as [0-9], and this is good.)

My characters of concern are:
  \u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000
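
One way to inspect how a particular build classifies these code points is sketched below. It only reports what the regexp engine and [string is space] say; it does not assume which answer is right, and the variable name codes is just illustrative.

```tcl
# Code points taken from the list above.
set codes {0x00a0 0x1680}
for {set i 0x2000} {$i <= 0x200b} {incr i} {
    lappend codes [format 0x%04x $i]
}
lappend codes 0x2028 0x2029 0x202f 0x3000

# For each code point, report whether the RE classes and [string is space]
# accept it (1) or reject it (0).
foreach code $codes {
    set c [format %c $code]
    puts [format "U+%04X  re-space=%d  re-print=%d  string-is-space=%d" \
        $code \
        [regexp {[[:space:]]} $c] \
        [regexp {[[:print:]]} $c] \
        [string is space $c]]
}
```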

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-16 09:07

Message:
Logged In: YES 
user_id=939324

GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-16 08:15

Message:
Logged In: YES 
user_id=939324

Donal,

I think that your patch does what you intended, but what you
intended and what I intended aren't the same. What you've
got treats [:print:] as [:alnum:] U [:punct:] U [:space:],
which is what Perl does. What I intended was [:alnum:] U
[:punct:] U SPACE, where SPACE = 0x20. (The simple test is
to see whether [[:print:]] matches tab.) That is my
understanding of the POSIX standard. I just checked a few
other regexp engines. TRE, which makes a point of strict
POSIX conformance, follows my interpretation, as do GNU awk,
java.util.regex, ruby, vim, zsh, and, interestingly, pcre. So I would say
that Perl has got it wrong. The necessary fix is just to
delete the two bits involving NUM_SPACE_RANGE from your patch.
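
The tab test can be scripted roughly as follows. This is only a sketch: the helper name printReading is hypothetical, and the labels it returns are shorthand for the readings discussed in this thread, not official terms.

```tcl
# Hypothetical helper: classify how this build's [[:print:]] treats
# whitespace, using the tab test plus a plain space and a punctuation mark.
proc printReading {} {
    set tab   [regexp {[[:print:]]} "\t"]
    set space [regexp {[[:print:]]} " "]
    set punct [regexp {[[:print:]]} "+"]
    if {!$punct} {
        return "alnum only (the behaviour reported in this bug)"
    } elseif {$tab} {
        return "alnum U punct U \[:space:\] (the Perl reading)"
    } elseif {$space} {
        return "alnum U punct U SPACE (the POSIX reading)"
    } else {
        return "alnum U punct, no space at all"
    }
}

puts [printReading]
```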

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2006-02-16 06:53

Message:
Logged In: YES 
user_id=79902

Bug is located at line 817 of the HEAD regc_locale.c, and
consists of a missing arm for the CC_PRINT case. I believe
that the fix required (based on the POSIX definition, thanks
Bill!) is the attached patch, which I'd appreciate people
testing... :-)

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-15 19:08

Message:
Logged In: YES 
user_id=939324

Perhaps more important than the Perl definition is the POSIX
definition, according to which [:print:] = [:alnum:] U
[:punct:] U SPACE. Curiously, the current behaviour with
[:print:] = [:alnum:] is documented in Welch, Jones, and
Hobbs with no mention of it being a bug.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2005-12-09 16:42

Message:
Logged In: YES 
user_id=79902

Good point. Yes. And [string is print] follows C's isprint()
IIRC, so it is (almost certainly) the RE engine that is wrong.

OK, this should be fixed.
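
A rough way to check that consistency is to compare the two predicates character by character. The helper name printMismatches is hypothetical, and the range 0..255 is an arbitrary choice for illustration.

```tcl
# Hypothetical helper: list the code points between lo and hi (inclusive)
# where the regexp class and the string-is-print test disagree.
proc printMismatches {lo hi} {
    set out {}
    for {set i $lo} {$i <= $hi} {incr i} {
        set c [format %c $i]
        set re [regexp {[[:print:]]} $c]
        set si [string is print $c]
        if {$re != $si} {
            lappend out [format "U+%04X re=%d string-is=%d" $i $re $si]
        }
    }
    return $out
}

foreach line [printMismatches 0 255] {
    puts $line
}
```

No output means the RE engine and [string is print] agree over the tested range.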

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-12-09 16:34

Message:
Logged In: YES 
user_id=80530

Should we also consider consistency with [string is print]?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2005-12-09 09:19

Message:
Logged In: YES 
user_id=79902

It seems that Perl defines [[:print:]] as
[[:space:][:graph:]] and we define it as [[:alnum:]] (and
yes, it is documented that way.) Or perhaps [:blank:]
instead of [:space:], the documentation being a bit hazy in
that respect.

The question is, what *should* we do?

The following procedure helps with checking these sorts of
things out:
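   # Print, for each printable ASCII character, whether it matches the
   # bracket expression built from the named classes, e.g. "matches space graph".
   proc matches args {
      # Joining with ":][:" turns {space graph} into [[:space:][:graph:]]
      set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
      for {set i 32} {$i<127} {incr i} {
         set c [format %c $i]
         puts -nonewline "$c-[regexp $RE $c]\t"
      }
      puts ""
   }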

----------------------------------------------------------------------
