#676 Perl lexer limited to basic latin delimiters?

Bug
open-wont-fix
Neil Hodgson
Scintilla (790)
3
2008-05-08
2008-05-04
No

Apparently Scintilla's Perl lexer does not recognize regular expression (and other) delimiters beyond code point 0x7E, although these are perfectly legal in Perl (and sometimes come in very handy).

For example, this substitution will not be highlighted correctly:

s¦foo¦bar¦;

Note that ¦ is the BROKEN BAR character U+00A6, not to be confused with | (U+007C VERTICAL LINE).

A more complex expression like this will throw the lexer completely off track:

s¦(?:<p>)?(<div class="foo(?:bar)?">)(?:<p>)?(\s|<a [^<>]+><!--[^<>]+--></a>)¦<p></p>$1$2¦g;

The same will happen with any other delimiter from outside the basic latin block, e.g. U+00F7 DIVISION SIGN.

I'm using Geany 0.13 on Linux, based on Scintilla 1.75. Other Scintilla-based applications, like Notepad++ or Komodo Edit, show similar problems.

Discussion

  • Kein-Hong Man
    2008-05-05

    Logged In: YES
    user_id=785951
    Originator: NO

    I'm mostly to blame for recent fixes in the Perl lexer.

    I agree that code points beyond 0x7E should be supported. I've always thought it should work for Latin-1 type charsets, but apparently it does not. I'll look into fixing the bug.

    However, in UTF-8, the following fails (with and without the 'use') in Perl:
    use utf8;
    s¦foo¦bar¦;

    Can you confirm that the behaviour is valid only for Latin-1 type single-byte charsets?

     
  • Eric Promislow
    2008-05-07

    Logged In: YES
    user_id=63713
    Originator: NO

    I've been putting this off for years as well. The Scintilla lexer uses this code:

    static inline bool isNonQuote(char ch) {
        return !isascii(ch) || isalnum(ch) || ch == '_';
    }

    ...

    if (ch == 's' && !isNonQuote(chNext)) {
        state = SCE_PL_REGSUBST;
        Quote.New(2);
    } else if (ch == 'm' && !isNonQuote(chNext)) {

    Komodo uses similar code:

    static inline bool isSafeAlnum(char ch) {
        return ((unsigned int) ch >= 128) || isalnum(ch) || ch == '_';
    }
    ...

    } else if (ch == 's' && !isSafeAlnum(chNext) && !isEOLChar(chNext)) {
        state = SCE_PL_REGSUBST;
        Quote.New(2);

    The old code assumed that any character > 127 was acting as an identifier
    character, which is not true for Perl. This was convenient for dealing with
    UTF-8-encoded buffers -- if it looks like a high bit character, don't deal
    with it any further: the encoded character behind the utf-8 bytes would be
    treated like an identifier character. The problem is that Perl
    gives the high-bit characters different roles.

    Is there a table lurking in the Perl lexer that will show which characters
    play which roles?

     
  • Logged In: NO

    UTF-8 seems to be a bit of a problem when it comes to delimiters in Perl. When I save a script containing my first example re (s¦foo¦bar¦;) encoded in UTF-8, it will fail to compile whether or not I use the utf8 pragma. It's just the error messages that are different:

    - "Unrecognized character \xA6" without use utf8.
    - "Substitution replacement not terminated" with use utf8.

    This may be construed as a bug in Perl. Consequently delimiters beyond code point 0xFF do not work at all. I tried U+1D09 LATIN SMALL LETTER TURNED I, but no luck here. This is contrary to what the Camel book says in Chapter 2, Bits and Pieces, Pick Your Own Quotes:

    "... any nonalphanumeric, nonwhitespace delimiter may be used in place of /."

    Depends on what your definition of "any" is, of course... Anyway, I think Scintilla's Perl lexer should do the sensible thing and extend its range of recognised delimiters to the Latin-1 Supplement block, regardless of the current document's encoding.

     
  • Logged In: YES
    user_id=2079111
    Originator: YES

    Sorry, forgot to log in, the previous comment is mine.

     
  • Kein-Hong Man
    2008-05-07

    Logged In: YES
    user_id=785951
    Originator: NO

    Well, vim 7.0 fails for both examples too.

    Your statement of "construed as a bug in Perl" is incorrect. Reading the 5.10 PODs, Perl has only limited support for UTF-8 source code: they mostly discuss UTF-8 string data and regex contents, but never the regex delimiter. I will not patch Scintilla for UTF-8 regex delimiters that are 2 bytes or more unless Perl's behaviour changes in that direction. So I am only considering Latin-1 type charset support in the following.

    That said, Eric has pointed out the problem nicely. I am reluctant to deal with it too, and there is no easy solution. To me, it is a rather dusty corner of the Perl lexing, one that I would prefer to avoid, or shirk cowardly from.

    Now, the following is a substitution regex in Latin-1 ("sáfooábar" reports a failed substitution):

    sáfooábará;

    If I start the source file with "use utf8;" and write the above in UTF-8, it is apparently lexed as a bareword ("sáfooábar" doesn't raise an error). Yet trying "s¦foo¦bar¦;" in UTF-8 fails -- the broken bar is recognized as a non-identifier character, but then the attempt to use it as a delimiter borks. So, does Perl have clean support for UTF-8 in toke.c? I think not. Perl seems to lex a UTF-8 bareword or identifier, yet other parts cannot recognize UTF-8 characters properly.

    So, should we now need a UTF-8 flag plus full UTF-8 smarts in Scintilla? For now, I will do some low-priority testing and pondering. Anyone else is welcome to supply a patch to Neil. So far, I can't think of a clean solution that I want to implement.

    For the Latin-1 case, how widespread is this "s¦foo¦bar¦;" practice anyway? Normally, I would have thought that people use what they can type quickly on their keyboards as delimiters. Can you provide example(s) of FLOSS Perl-based projects where such delimiters are used?

     
  • Eric Promislow
    2008-05-07

    Logged In: YES
    user_id=63713
    Originator: NO

    I'm with Kein-Hong on this. Most of the lexer assumes
    7-bit character values for significant parts of the
    language, and treats other characters as identifier
    characters, if the current context isn't obvious
    (strings, comments, regexes, datasections, etc.).
    Rewriting it would take a lot of work, requiring us
    to replace character comparisons with string comparisons,
    add more memory management, etc.

    What are the compelling reasons for using high-bit
    characters as regex delimiters? I think I'll
    cite Conway's Perl Best Practices where he concisely
    says (page 246): "Don't use any delimiters other
    than /.../ or m{...}". Of course the lexer is
    far more lenient than that, but with an open-ended
    language like Perl I'm OK drawing the line somewhere.

    I recommend we close this bug with a WONTFIX
    resolution.

     
  • Logged In: YES
    user_id=2079111
    Originator: YES

    Thanks a lot for your feedback on this. I have to admit that I wasn't aware of the implications of using high-bit delimiters regarding UTF-8 as I have never until today used anything except ISO-8859-1 to save my Perl scripts.

    I can't cite any public projects that use the broken bar delimiter. Actually I'm just a hobby programmer, and I thought it a smart idea to use this much-neglected character as my regular expression delimiter of choice when I was doing a lot of HTML processing, simply because it saves me escaping all those slashes and because I like to take as much as I can of the freedom that Perl gives me. I defined my own keyboard shortcut for it in TextPad, my former editor, working around its share of highlighting problems by defining ¦ as another quote character. Recently I became a happy Linux and Geany user, although I'm not quite as happy about the way many of my Perl scripts look now.

    BTW, I can access ¦ with AltGr + Shift + < in Linux on my keyboard. That's standard in X on a German keyboard and it's quite easy to remember since AltGr + < produces | (in any operating system).

    sáfooábará; makes an interesting point: obviously Perl regards á as an alphanumeric character, and so one that can't be a delimiter. The definition of alphanumeric that applies here seems to be based on Unicode properties, not the classic [0-9a-zA-Z]. This would pretty much limit the range of Latin-1 Supplement characters available as legal delimiters.

    Anyway, if there's no easy fix for this really very minor issue, I guess I just have to live with it, heed the Best Practices advice and change all my delimiters back to /. This may turn out to be an interesting challenge for a Perl script...

     
  • Kein-Hong Man
    2008-05-08

    Logged In: YES
    user_id=785951
    Originator: NO

    I guess we should either follow Eric's recommendation or leave it here as a low-priority 'thingy'. I think it's not a clear-cut bug. If "s¦foo¦bar¦;" is not lexed properly, it's not the end of the world, but of course delimited stretches in the wrong places tend to be a nuisance.

    To Sebastian: sorry, there are many dusty corners in Perl lexing that make it difficult to highlight Perl perfectly. Perl has a >300KB tokenizer and a hacked parser; this is difficult to replicate perfectly whilst trying hard to limit complexity.

    I will put it in my TODO list, but can't promise anything and there is no timeframe. Gotta study toke.c first. Scintilla's Perl lexer has a number of known failure cases, and like your case, many of these are dusty corners or they can be avoided. It is unlikely we will be able to fix all failure cases, since we must balance cost and complexity versus the size of the 'win'. Plus, some behaviour can change from version to version.

    I maintain over 2000 lines of test cases and notes for the Perl lexer at:
    http://groups.google.com/group/scintilla-interest/files
    The file is perl-test-khman.zip. So for now, I will document your issue and at a future date investigate whether it is a sane rule that can be implemented with reasonable cost and which interacts reasonably with utf-8 features.

     
  • Neil Hodgson
    2008-05-08

    • priority: 5 --> 3
    • assigned_to: nobody --> nyamatongwe
    • status: open --> open-wont-fix
     
  • Neil Hodgson
    2008-05-08

    Logged In: YES
    user_id=12579
    Originator: NO

    For SinkWorld, I tried to write a class similar to Accessor that would present the document as a sequence of characters instead of bytes and coordinate setting styles over the byte ranges that made up those characters. The implementation didn't work out but it still looks like an OK approach.

    Marking as WontFix.