From: SourceForge.net <no...@so...> - 2006-02-16 09:22:27
Bugs item #1376892, was opened at 2005-12-09 05:23
Message generated for change (Comment added) made by dkf
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1376892&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: 42. Regexp
Group: obsolete: 8.4.11
Status: Open
Resolution: None
Priority: 7
Submitted By: Petteri Kettunen (petterik)
Assigned to: Pavel Goran (pvgoran)
Summary: [:print:] wrong behaviour

Initial Comment:
% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi

Expected result is the original string. This, however, is
subject to definition and Tcl's specification. If one keeps
`perlre' as the reference, this is a bug. Perlre (v5.6.1) says:

'print -- Any alphanumeric or punctuation (special)
character or space.'

----------------------------------------------------------------------

>Comment By: Donal K. Fellows (dkf)
Date: 2006-02-16 09:22

Message:
Logged In: YES
user_id=79902

What about [:space:] characters outside the classic ASCII
range? That's a total of 20 characters, and I'm not willing
to automatically just go with non-UNICODE-aware tools on this.

I ask this because it seems unreasonable to me to just
assume that old stuff is holy (an approach that has happened
in this area in the past; as a point to help understanding,
[:digit:] isn't the same as [0-9], and this is good.)

My characters of concern are:
\u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-16 09:07

Message:
Logged In: YES
user_id=939324

GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-16 08:15

Message:
Logged In: YES
user_id=939324

Donal, I think that your patch does what you intended, but
what you intended and what I intended aren't the same. What
you've got treats [:print:] as [:alnum:] U [:punct:] U
[:space:], which is what Perl does. What I intended was
[:alnum:] U [:punct:] U SPACE, where SPACE = 0x20. (The
simple test is to see whether [[:print:]] matches tab.) That
is my understanding of the POSIX standard.

I just checked a few other regexp engines. TRE, which makes
a point of strict POSIX conformance, follows my
interpretation, as do GNU awk, java.util.regex, ruby, vim,
zsh, and, interestingly, pcre. So I would say that Perl has
got it wrong.

The necessary fix is just to delete the two bits involving
NUM_SPACE_RANGE from your patch.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2006-02-16 06:53

Message:
Logged In: YES
user_id=79902

Bug is located at line 817 of the HEAD regc_locale.c, and
consists of a missing arm for the CC_PRINT case. I believe
that the fix required (based on the POSIX definition, thanks
Bill!) is the attached patch, which I'd appreciate people
testing... :-)

----------------------------------------------------------------------

Comment By: William John Poser (billposer)
Date: 2006-02-15 19:08

Message:
Logged In: YES
user_id=939324

Perhaps more important than the Perl definition is the POSIX
definition, according to which
[:print:] = [:alnum:] U [:punct:] U SPACE.

Curiously, the current behaviour with [:print:] = [:alnum:]
is documented in Welch, Jones, and Hobbs with no mention of
it being a bug.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2005-12-09 16:42

Message:
Logged In: YES
user_id=79902

Good point. Yes. And [string is print] follows C's isprint()
IIRC, so it is (almost certainly) the RE engine that is
wrong. OK, this should be fixed.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2005-12-09 16:34

Message:
Logged In: YES
user_id=80530

should we also consider consistency with
[string is print] ?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2005-12-09 09:19

Message:
Logged In: YES
user_id=79902

It seems that Perl defines [[:print:]] as
[[:space:][:graph:]] and we define it as [[:alnum:]] (and
yes, it is documented that way.) Or perhaps [:blank:]
instead of [:space:], the documentation being a bit hazy in
that respect.

The question is, what *should* we do?

The following procedure helps with checking these sorts of
things out:

  proc matches args {
      set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
      for {set i 32} {$i<127} {incr i} {
          set c [format %c $i]
          puts -nonewline "$c-[regexp $RE $c]\t"
      }
      puts ""
  }

----------------------------------------------------------------------
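
The tab test suggested in the thread, together with the comparison
against [string is print], can be sketched roughly as follows. This is
only an illustration: `printcheck` is a hypothetical helper that is not
part of the tracker thread or the attached patch, and the probe
characters are an arbitrary choice. The values reported for "+", space
and tab depend on the Tcl version in use, so no expected output is
given.

```tcl
# Hypothetical helper (an assumption, not from the thread): for each
# character, record its code point, whether the regexp engine puts it
# in [[:print:]], and whether [string is print] accepts it.
proc printcheck {chars} {
    set result {}
    foreach c $chars {
        # 1 if the single character is matched by [[:print:]]
        set re [regexp {^[[:print:]]$} $c]
        # 1 if [string is print] classifies the character as printable
        set si [string is print -strict $c]
        lappend result [list [scan $c %c] $re $si]
    }
    return $result
}

# Probe an alphanumeric, a punctuation character, space, and tab.
foreach row [printcheck [list a + " " "\t"]] {
    foreach {code re si} $row break
    puts [format "U+%04X  regexp=%d  string-is=%d" $code $re $si]
}
```

Under the POSIX reading argued for above, the tab row would show
regexp=0 while the space row would show regexp=1; a row where the two
columns differ marks a character on which the RE engine and
[string is print] disagree.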