#676 Perl lexer limited to basic latin delimiters?

Bug
open-wont-fix
Neil Hodgson
Scintilla (790)
3
2008-05-08
2008-05-04
No

Apparently Scintilla's Perl lexer does not recognize regular expression (and other) delimiters beyond code point 0x7E, although these are perfectly legal in Perl (and sometimes come in very handy).

For example, this substitution will not be highlighted correctly:

s¦foo¦bar¦;

Note that ¦ is the BROKEN BAR character U+00A6, not to be confused with | (U+007C VERTICAL LINE).

A more complex expression like this will throw the lexer completely off track:

s¦(?:<p>)?(<div class="foo(?:bar)?">)(?:<p>)?(\s|<a [^<>]+><!--[^<>]+--></a>)¦<p></p>$1$2¦g;

The same will happen with any other delimiter from outside the basic latin block, e.g. U+00F7 DIVISION SIGN.

I'm using Geany 0.13 on Linux, based on Scintilla 1.75. Other Scintilla-based applications, like Notepad++ or Komodo Edit, show similar problems.

Discussion

  • Kein-Hong Man
    2008-05-05

    Logged In: YES
    user_id=785951
    Originator: NO

    I'm mostly to blame for recent fixes in the Perl lexer.

    I agree that code points beyond 0x7E should be supported. I've always thought it should work for Latin-1 type charsets, but apparently it does not. I'll look into fixing the bug.

    However, in UTF-8, the following fails (with and without the 'use') in Perl:
    use utf8;
    s¦foo¦bar¦;

    Can you confirm that the behaviour is valid only for Latin-1 type single-byte charsets?

     
  • Eric Promislow
    2008-05-07

    Logged In: YES
    user_id=63713
    Originator: NO

    I've been putting this off for years as well. The Scintilla lexer uses this code:

    static inline bool isNonQuote(char ch) {
        return !isascii(ch) || isalnum(ch) || ch == '_';
    }

    ...

    if (ch == 's' && !isNonQuote(chNext)) {
        state = SCE_PL_REGSUBST;
        Quote.New(2);
    } else if (ch == 'm' && !isNonQuote(chNext)) {

    Komodo uses similar code:

    static inline bool isSafeAlnum(char ch) {
        return ((unsigned int) ch >= 128) || isalnum(ch) || ch == '_';
    }
    ...

    } else if (ch == 's' && !isSafeAlnum(chNext) && !isEOLChar(chNext)) {
        state = SCE_PL_REGSUBST;
        Quote.New(2);

    The old code assumed that any character > 127 was acting as an identifier
    character, which is not true for Perl. This was convenient for dealing with
    UTF-8-encoded buffers -- if it looks like a high bit character, don't deal
    with it any further: the encoded character behind the utf-8 bytes would be
    treated like an identifier character. The problem is that Perl
    gives the high-bit characters different roles.

    Is there a table lurking in the Perl lexer that will show which characters
    play which roles?

     
  • Logged In: NO

    UTF-8 seems to be a bit of a problem when it comes to delimiters in Perl. When I save a script containing my first example re (s¦foo¦bar¦;) encoded in UTF-8, it will fail to compile whether or not I use the utf8 pragma. It's just the error messages that are different:

    - "Unrecognized character \xA6" without use utf8.
    - "Substitution replacement not terminated" with use utf8.

    This may be construed as a bug in Perl. Consequently delimiters beyond code point 0xFF do not work at all. I tried U+1D09 LATIN SMALL LETTER TURNED I, but no luck here. This is contrary to what the Camel book says in Chapter 2, Bits and Pieces, Pick Your Own Quotes:

    "... any nonalphanumeric, nonwhitespace delimiter may be used in place of /."

    Depends on what your definition of "any" is, of course... Anyway, I think Scintilla's Perl lexer should do the sensible thing and extend its range of recognised delimiters to the Latin-1 Supplement block, regardless of the current document's encoding.

     
  • Logged In: YES
    user_id=2079111
    Originator: YES

    Sorry, forgot to log in, the previous comment is mine.

     
  • Kein-Hong Man
    2008-05-07

    Logged In: YES
    user_id=785951
    Originator: NO

    Well, vim 7.0 fails for both examples too.

    Your statement of "construed as a bug in Perl" is incorrect. Reading the 5.10 PODs, Perl has only limited support for UTF-8 source code: they mostly discuss UTF-8 string data and regex contents, but never the regex delimiter. I will not patch Scintilla for UTF-8 regex delimiters that are 2 bytes or more unless Perl's behaviour changes in that direction. So I am only considering Latin-1 type charset support in the following.

    That said, Eric has pointed out the problem nicely. I am reluctant to deal with it too, and there is no easy solution. To me, it is a rather dusty corner of the Perl lexing, one that I would prefer to avoid, or shirk cowardly from.

    Now, the following is a substitution regex in Latin-1 ("sáfooábar" reports a failed substitution):

    sáfooábará;

    If I start the source file with "use utf8;" and write the above in UTF-8, it is apparently lexed as a bareword ("sáfooábar" doesn't raise an error). Yet trying "s¦foo¦bar¦;" in UTF-8 fails -- the broken bar is recognized as a non-identifier character, but then the attempt to use it as a delimiter borks. So, does Perl have clean support for UTF-8 in toke.c? I think not. Perl seems to lex a UTF-8 bareword or identifier, yet other parts cannot recognize UTF-8 characters properly.

    So, should we now need a UTF-8 flag plus full UTF-8 smarts in Scintilla? For now, I will do some low-priority testing and pondering. Anyone else is welcome to supply a patch to Neil. So far, I can't think of a clean solution that I want to implement.

    For the Latin-1 case, how widespread is this "s¦foo¦bar¦;" practice anyway? Normally, I would have thought that people use what they can type quickly on their keyboards as delimiters. Can you provide example(s) of FLOSS Perl-based projects where such delimiters are used?

     
  • Eric Promislow
    2008-05-07

    Logged In: YES
    user_id=63713
    Originator: NO

    I'm with Kein-Hong on this. Most of the lexer assumes
    7-bit character values for significant parts of the
    language, and treats other characters as identifier
    characters, if the current context isn't obvious
    (strings, comments, regexes, datasections, etc.).
    Rewriting it would take a lot of work, requiring us
    to replace character comparisons with string comparisons,
    add more memory management, etc.

    What are the compelling reasons for using high-bit
    characters as regex delimiters? I think I'll
    cite Conway's Perl Best Practices where he concisely
    says (page 246): "Don't use any delimiters other
    than /.../ or m{...}". Of course the lexer is
    far more lenient than that, but with an open-ended
    language like Perl I'm OK drawing the line somewhere.

    I recommend we close this bug with a WONTFIX
    resolution.

     
  • Logged In: YES
    user_id=2079111
    Originator: YES

    Thanks a lot for your feedback on this. I have to admit that I wasn't aware of the implications of using high-bit delimiters regarding UTF-8 as I have never until today used anything except ISO-8859-1 to save my Perl scripts.

    I can't cite any public projects that use the broken bar delimiter. Actually I'm just a hobby programmer, and I thought it a smart idea to use this much-neglected character as my regular expression delimiter of choice when I was doing a lot of HTML processing, simply because it saves me escaping all those slashes and because I like to take as much as I can of the freedom that Perl gives me. I defined my own keyboard shortcut for it in TextPad, my former editor, working around its share of highlighting problems by defining ¦ as another quote character. Recently I became a happy Linux and Geany user, although I'm not quite as happy about the way many of my Perl scripts look now.

    BTW, I can access ¦ with AltGr + Shift + < in Linux on my keyboard. That's standard in X on a German keyboard and it's quite easy to remember since AltGr + < produces | (in any operating system).

    sáfooábará; makes an interesting point: obviously Perl regards á as an alphanumeric character, and so one that can't be a delimiter. The definition of alphanumeric that applies here seems to be based on Unicode properties, not the classic [0-9a-zA-Z]. This would pretty much limit the range of Latin-1 Supplement characters available as legal delimiters.

    Anyway, if there's no easy fix for this really very minor issue, I guess I just have to live with it, heed the Best Practices advice and change all my delimiters back to /. This may turn out to be an interesting challenge for a Perl script...

     
  • Kein-Hong Man
    2008-05-08

    Logged In: YES
    user_id=785951
    Originator: NO

    I guess we should either follow Eric's recommendation or leave it here as a low-priority 'thingy'. I think it's not a clear-cut bug. If "s¦foo¦bar¦;" is not lexed properly, it's not the end of the world, but of course delimited stretches in the wrong places tend to be a nuisance.

    To Sebastian: sorry, there are many dusty corners in Perl lexing that make it difficult to highlight Perl perfectly. Perl has a >300KB tokenizer and a hacked parser; this is difficult to replicate perfectly whilst trying hard to limit complexity.

    I will put it in my TODO list, but can't promise anything and there is no timeframe. Gotta study toke.c first. Scintilla's Perl lexer has a number of known failure cases, and like your case, many of these are dusty corners or they can be avoided. It is unlikely we will be able to fix all failure cases, since we must balance cost and complexity versus the size of the 'win'. Plus, some behaviour can change from version to version.

    I maintain over 2000 lines of test cases and notes for the Perl lexer at:
    http://groups.google.com/group/scintilla-interest/files
    The file is perl-test-khman.zip. So for now, I will document your issue and at a future date investigate whether it is a sane rule that can be implemented with reasonable cost and which interacts reasonably with utf-8 features.

     
  • Neil Hodgson
    2008-05-08

    • priority: 5 --> 3
    • assigned_to: nobody --> nyamatongwe
    • status: open --> open-wont-fix
     
  • Neil Hodgson
    2008-05-08

    Logged In: YES
    user_id=12579
    Originator: NO

    For SinkWorld, I tried to write a class similar to Accessor that would present the document as a sequence of characters instead of bytes and coordinate setting styles over the byte ranges that made up those characters. The implementation didn't work out but it still looks like an OK approach.

    Marking as WontFix.