Scintilla / Feature Requests / #1328 [PATCH] Raku (Perl 6) Lexer Development Created

Zufu Liu - 2019-12-05

Not related to the content of the lexer itself, but:
1. GPL is not suitable: applications that use Scintilla may not be licensed under the GPL.
2. 130 is already assigned to SCLEX_HOLLYWOOD (4.2.2, pending release)
3. The website for Raku is now https://raku.org/

Related [feature-requests:#1207]


Mark Reay - 2019-12-05

Thanks Zufu,

I have replaced the GPL with "License for Scintilla and SciTE"

Used the most recent commit by Neil (1b8ce5991cb9) and incremented SCLEX_RAKU to 131

Updated the documentation links to https://docs.raku.org/* (same content, different name)

Attached are two new patch files based on the latest pre 4.2.2 Scintilla release:

Patch / Build Scintilla 4.2.2 (pre)

$ hg clone http://hg.code.sf.net/p/scintilla/code -r 1b8ce5991cb9 scintilla
$ cd scintilla
$ hg import scintilla-4.2.2-pre_raku_patch.diff
$ cd gtk/
$ make

Patch / Build SciTE 4.2.1

$ wget https://excellmedia.dl.sourceforge.net/project/scintilla/SciTE/4.2.1/scite421.tgz
$ tar xzf scite421.tgz
$ cd scite
$ patch -s -p0 < scite_4.2.1_raku_patch2.diff
$ cd gtk/
$ make
$ sudo make install

Mark.

scintilla_4.2.2-pre_raku_patch.diff

scite_4.2.1_raku_patch2.diff

Neil Hodgson - 2019-12-05

The property names added by this lexer should be namespaced with the lexer name, similar to the Perl lexer, so fold.raku.comment.multiline and fold.raku.comment.pod. Properties that are global since they are used in other lexers, fold, fold.compact and fold.comment, should omit descriptions so they are not treated as lexer-specific.

There are non-ASCII characters in comments which can lead to problems with Microsoft Visual C++ in non-English locales. It is generally simplest to replace the literal characters with their Unicode description so the source code is pure ASCII.

The license is normally included by reference so consumers don't have to check whether this file's license text differs from License.txt.

The unnamed namespace does the same job as 'static' so makes the use of 'static' redundant.
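As a minimal sketch of that point (the helper names here are hypothetical and not taken from LexRaku.cxx), both functions below have internal linkage, and the 'static' on the second adds nothing:

```cpp
namespace {

// Internal linkage comes from the unnamed namespace; no 'static' is needed.
constexpr bool IsSpaceOrTab(int ch) noexcept {
    return ch == ' ' || ch == '\t';
}

// 'static' is legal here but redundant: the enclosing unnamed namespace
// already limits visibility to this translation unit.
static constexpr bool IsLineEnd(int ch) noexcept {
    return ch == '\r' || ch == '\n';
}

}  // namespace
```

Either form keeps the helpers out of the global symbol table; the choice between them is mostly about consistency within the file.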

There are some warnings from various tools. cppcheck is worth running although you should ignore 'constParameter' warnings.

For this warning from cppcheck 1.89, you are allowed to use the 'switch' statement. Coverity also doesn't like this code.

scintilla\lexers\LexRaku.cxx:252:10: warning: Opposite inner 'if' condition leads to a dead code block. [oppositeInnerCondition]
    if (ch == 0x201C) return 0x201D; // LEFT DOUBLE QUOTATION MARK
    ^
scintilla\lexers\LexRaku.cxx:244:9: note: outer condition: ch<8192
    if (ch < 0x2000) {
    ^
scintilla\lexers\LexRaku.cxx:252:10: note: opposite inner condition: ch==8220
    if (ch == 0x201C) return 0x201D; // LEFT DOUBLE QUOTATION MARK
    ^
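For anyone reading along, here is a reduced sketch of the pattern cppcheck is complaining about, next to a flat 'switch'. Only the 0x201C to 0x201D mapping comes from the warning; the function names and the parenthesis case are assumptions added for illustration.

```cpp
// The flagged shape: 0x201C is not below 0x2000, so the inner test can
// never succeed and the mapping is silently dead code.
constexpr int CloseCharNested(int ch) noexcept {
    if (ch < 0x2000) {
        if (ch == 0x201C) return 0x201D; // unreachable
    }
    return 0;
}

// A flat switch has no enclosing range test for a case to contradict.
constexpr int CloseCharSwitch(int ch) noexcept {
    switch (ch) {
    case 0x28: return 0x29;     // LEFT PARENTHESIS
    case 0x201C: return 0x201D; // LEFT DOUBLE QUOTATION MARK
    default: return 0;
    }
}
```

The nested version returns 0 for 0x201C, which is exactly the kind of quiet failure the dead-code warning points at.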

Scope limiting can be useful but it's a question of taste.

scintilla\lexers\LexRaku.cxx:1197:15: warning: The scope of the variable 'lengthToEnd' can be reduced. [variableScope]
    Sci_Position lengthToEnd; // length until the end of range
    ^

Unused variables are clutter and can also reveal unfinished plans.

scintilla\lexers\LexRaku.cxx:1194:14: warning: Unused variable: str [unusedVariable]
    std::string str; // temp string value
    ^

It is possible that control flow ensures this is always set, but it's difficult to tell.

C:\u\hg\scintilla\lexers\LexRaku.cxx(1623) : warning C4701: potentially uninitialized local variable 'hereState' used
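A small sketch of the usual remedy, assuming the variable is only assigned on some branches. The enum and function are hypothetical stand-ins; only the name 'hereState' comes from the warning.

```cpp
// Hypothetical kinds, standing in for whatever the lexer distinguishes.
enum HereKind { hereNone, hereQ, hereQQ };

constexpr int PickHereState(HereKind kind) noexcept {
    // Declaring 'int hereState;' with no value and assigning it only in
    // some branches is the shape that triggers C4701. A default at the
    // declaration makes every path well defined.
    int hereState = 0;
    if (kind == hereQ) {
        hereState = 1;
    } else if (kind == hereQQ) {
        hereState = 2;
    }
    return hereState;
}
```

Initialising at the declaration also documents what the "nothing matched" value is meant to be.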

Debugging code is non-portable, never works for anyone else, and implies maintenance that doesn't happen, so DebugPrintSectionUnicode and printf shouldn't be included.

Mark Reay - 2019-12-06

Thanks Neil,

I have made the following changes:

removed comments from global property names

used Raku specific property names:

fold.raku.comment.multiline

fold.raku.comment.pod

removed non-ASCII characters from comments

use the short license reference lines

removed anonymous namespace and used static functions

have replaced the whole code block for testing bracket characters with a switch statement

removed unused var: str

initialised 'hereState'

removed debugging code

parsed with cppcheck 1.89

I have also taken the opportunity to add better Unicode character mapping for Raku. On a personal level, I am not a fan of a language with such broad character support. It's one thing to allow diverse language characters, but Raku also interprets number glyphs as numbers. It's all a bit broad.

Attached are the new patch files for Scintilla (rev: 1b8ce5991cb9) and SciTE 4.2.1

Mark.

Last edit: Mark Reay 2019-12-06

scintilla_4.2.2-pre_raku_patch2.diff

scite_4.2.1_raku_patch3.diff

Zufu Liu - 2019-12-06

Hi Mark, personally, I think it's preferable to use an anonymous namespace instead of static for the entire file (after the includes and before LexerModule lmRaku).
The LexerRaku::IsWordChar() method could probably be simplified by using IsIdStart(), IsIdContinue(), etc. from the CharacterCategory header.


Mark Reay - 2019-12-06

Hi Zufu,

Good point. I just changed the Raku::IsWordChar() function. That makes it very compact and much more efficient, now it's just:

const CharacterCategory cc = ccMap.CategoryFor(ch);
switch (cc) {
    // Letters
    case ccLu:
    case ccLl:
    case ccLt:
    case ccLm:
    case ccLo:
        return true;
    default:
        return false;
}

I also put the anonymous namespace back, I think it makes sense too, if that's okay.

Attached is LexRaku.cxx, as that contains the only changes.

Thanks,

Mark.

LexRaku.cxx

Mark Reay - 2019-12-07

I think I was a bit tired when I implemented the CharacterCategory check last night. I've now used the following instead:

const CharacterCategory cc = CategoriseCharacter(ch);
switch (cc) {
    // etc...
}

The categorisation also works well with allowed numbers so the two functions I've updated are:

LexerRaku::IsWordChar

LexerRaku::IsNumberChar

I will work to simplify GetBracketCloseChar next. Valid opening and closing delimiters can be any bi-directional pair of Unicode characters, as described in the first section of: http://www.unicode.org/Public/5.1.0/ucd/BidiMirroring.txt

Mark.

Last edit: Mark Reay 2019-12-07

LexRaku.cxx

'const' is redundant on return types. From clang-tidy:

C:\u\hg\scintilla\lexers\LexRaku.cxx(413,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool IsValidOpener(const int ch, DelimPair &dp, int type = RAKUDELIM_BRACKET) noexcept {
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(815,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool IsOperatorChar(const int ch);
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(816,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool IsWordChar(const int ch, bool alowNumber = true);
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(817,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool IsWordStartChar(const int ch);
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(818,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool IsNumberChar(const int ch, int base = 10);
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(819,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool ProcessRegexTwinCapture(StyleContext &sc, const Sci_Position length,
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(822,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const bool ProcessValidRegQlangStart(StyleContext &sc, Sci_Position length, const int type,
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(824,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
        const Sci_Position LengthToNonWordChar(StyleContext &sc, Sci_Position length,
        ^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(837,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::IsOperatorChar(const int ch) {
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(872,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::IsWordChar(const int ch, bool alowNumber) {
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(897,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::IsWordStartChar(const int ch) {
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(910,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::IsNumberChar(const int ch, int base) {
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(983,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::ProcessRegexTwinCapture(StyleContext &sc, const Sci_Position length,
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(1031,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const bool LexerRaku::ProcessValidRegQlangStart(StyleContext &sc, Sci_Position length, const int type,
^~~~~~
C:\u\hg\scintilla\lexers\LexRaku.cxx(1128,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
const Sci_Position LexerRaku::LengthToNonWordChar(StyleContext &sc, Sci_Position length,
^~~~~~
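A quick way to see why the qualifier is flagged, as a sketch with hypothetical declarations that mirror the shape of the ones above: a call to a function returning a non-class type is a prvalue, and its type drops the top-level 'const'.

```cpp
#include <type_traits>

// Hypothetical declarations, for illustration only (never defined or called
// outside an unevaluated context).
const bool WithConst(int ch);
bool WithoutConst(int ch);

// The call expression has type 'bool' in both cases, so the caller cannot
// observe the 'const' on the return type.
static_assert(std::is_same<decltype(WithConst(0)), bool>::value,
    "call to a 'const bool' function yields plain bool");
static_assert(std::is_same<decltype(WithConst(0)), decltype(WithoutConst(0))>::value,
    "both calls have the same type");
```

This only holds for non-class return types such as bool and Sci_Position, which covers every function listed in the warnings.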

'alowNumber' should probably be spelled 'allowNumber'.

Scintilla uses a fixed #include order with C++ library headers after C library headers. The order is defined in scripts/HeaderOrder.txt and checked by scripts/HeaderCheck.py. <vector> goes between <string> and <map>.

I have reduced the size of the switch statement that was in GetBracketCloseChar. What was 178 cases has now been simplified to three CharacterCategory types. Some opening characters have matching closing characters that are not simply opener + 1. These have been handled as explicit cases, as follows:

int GetBracketCloseChar(const int ch) noexcept {
    const CharacterCategory cc = CategoriseCharacter(ch);
    switch (cc) {
        case ccSm:
            switch (ch) {
                case 0x3C: return 0x3E; // LESS-THAN SIGN
                case 0x2208: return 0x220B; // ELEMENT OF
                case 0x2209: return 0x220C; // NOT AN ELEMENT OF
                case 0x220A: return 0x220D; // SMALL ELEMENT OF
                case 0x2215: return 0x29F5; // DIVISION SLASH
                case 0x2243: return 0x22CD; // ASYMPTOTICALLY EQUAL TO
                case 0x2298: return 0x29B8; // CIRCLED DIVISION SLASH
                case 0x22A6: return 0x2ADE; // ASSERTION
                case 0x22A8: return 0x2AE4; // TRUE
                case 0x22A9: return 0x2AE3; // FORCES
                case 0x22AB: return 0x2AE5; // DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE
                case 0x22F2: return 0x22FA; // ELEMENT OF WITH LONG HORIZONTAL STROKE
                case 0x22F3: return 0x22FB; // ELEMENT OF WITH VERTICAL BAR AT END OF HORIZONTAL STROKE
                case 0x22F4: return 0x22FC; // SMALL ELEMENT OF WITH VERTICAL BAR AT END OF HORIZONTAL STROKE
                case 0x22F6: return 0x22FD; // ELEMENT OF WITH OVERBAR
                case 0x22F7: return 0x22FE; // SMALL ELEMENT OF WITH OVERBAR
                case 0xFF1C: return 0xFF1E; // FULLWIDTH LESS-THAN SIGN
            }
            break;
        case ccPs:
            switch (ch) {
                case 0x5B: return 0x5D; // LEFT SQUARE BRACKET
                case 0x7B: return 0x7D; // LEFT CURLY BRACKET
                case 0x298D: return 0x2990; // LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
                case 0x298F: return 0x298E; // LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
                case 0xFF3B: return 0xFF3D; // FULLWIDTH LEFT SQUARE BRACKET
                case 0xFF5B: return 0xFF5D; // FULLWIDTH LEFT CURLY BRACKET
            }
            break;
        case ccPi:
            break;
        default: return 0;
    }
    return ch + 1;
}

I have also removed the superfluous 'const' qualifiers from the function return types, and moved <vector> to its appropriate position.

Last edit: Mark Reay 2019-12-10

LexRaku.cxx

Neil Hodgson - 2019-12-13

The latest version seems reasonable to me.

Due to some upcoming changes to the way lexers work, the Raku lexer won't be committed until those changes have been committed. This will most likely occur in a couple of weeks and may require minor changes to this lexer.


Neil Hodgson - 2020-01-03

New lexing features have been committed and the ILexer interface updated to ILexer5 which adds new metadata retrieval calls. A patch with the changes needed is attached as RakuILexer5.patch.

The new lexer testing framework uses example files which are controlled by SciTE.properties files to produce expected output in .styled files. A minimal example x.p6 is attached as RakuTest.patch. The test file should include an example of each possible style.

Is the numeric '0' supposed to be in SCE_RAKU_DEFAULT instead of SCE_RAKU_NUMBER?

RakuILexer5.patch

RakuTest.patch


Mark Reay - 2020-01-03

Thank you Neil,

I have gone over the lexer for final checks and fixed a few bugs:

numeral '0' is now SCE_RAKU_NUMBER, as it should be

multi-line comments with multiple start and end delimiters were not terminating correctly

folding of multi-line comments is now also detected properly

I have updated the new style test files with tests for all style types:

SciTE.properties

x.p6

x.p6.styled

The raku.properties file has been updated for SciTE. Just fixed the keywords line wrapping.

Attached are the patch files for both Scintilla (tip: 295a6e54d582) and SciTE 4.2.3

Mark.

Last edit: Mark Reay 2020-01-03

scintilla_ILexer5_raku_patch.diff

scite_4.2.3_raku_patch.diff

Neil Hodgson - 2020-01-04

Committed as [bcb951], [3be72c], [604485], [cb4c65].

In ProcessValidRegQlangStart, the decrementing of length inside a loop appears wrong as using the unchanging startPos turns a linear change into a multiplicative change and a possible early termination. Maybe something like

length = lengthOriginal - (sc.currentPos - startPos);
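To make the difference concrete, here is a stand-alone sketch that compares the two updates over a loop advancing one character per pass. The loop shape and function names are assumptions for illustration and are not the actual body of ProcessValidRegQlangStart.

```cpp
// Remaining length after 'steps' one-character advances, computed two ways.

constexpr long RemainingCumulative(long lengthOriginal, long steps) noexcept {
    const long startPos = 0;
    long currentPos = startPos;
    long length = lengthOriginal;
    for (long i = 0; i < steps; i++) {
        currentPos++;
        // Subtracts the whole distance from startPos on every pass, so the
        // total removed is 1 + 2 + ... + steps.
        length -= (currentPos - startPos);
    }
    return length;
}

constexpr long RemainingRecomputed(long lengthOriginal, long steps) noexcept {
    const long startPos = 0;
    long currentPos = startPos;
    long length = lengthOriginal;
    for (long i = 0; i < steps; i++) {
        currentPos++;
        // Recomputed from the original length, so the change stays linear.
        length = lengthOriginal - (currentPos - startPos);
    }
    return length;
}
```

With an original length of 20 and four steps, the cumulative form leaves 10 while the recomputed form leaves 16, which is where the early termination would come from.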

When using a file with \r\n line ends, as is common on Windows, there are often mismatched styles on the \r and \n when turning on visible line ends (SciTE: View | End of Line). The most common is at the end of a '#' line comment where the \r is green and the \n grey. While this is not an error and happens with some other lexers, it is a source of problems, and lexers that style both the \r and \n with the same style are more robust.
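One possible way to keep the pair together, shown as a sketch over a plain character buffer rather than a StyleContext: the function name and signature are hypothetical, and including the line end in the comment run is just one of the consistent choices.

```cpp
// Returns the position just past a line comment that begins at 'start',
// consuming the whole line end so "\r\n" is never split between two styles.
constexpr int CommentRunEnd(const char *text, int length, int start) noexcept {
    int pos = start;
    while (pos < length && text[pos] != '\r' && text[pos] != '\n') {
        pos++;
    }
    // Take "\r\n" as a unit, or a lone '\r' or '\n'.
    if (pos < length && text[pos] == '\r') {
        pos++;
    }
    if (pos < length && text[pos] == '\n') {
        pos++;
    }
    return pos;
}
```

Everything in the range from 'start' up to the returned position would then receive a single style, so the \r and \n can never disagree.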

Related

Commit: [604485]
Commit: [cb4c65]
Commit: [3be72c]
Commit: [bcb951]

Mark Reay - 2020-01-04

I have found two places where CRLF was not being handled properly and corrected them for:

SCE_RAKU_COMMENTLINE

SCE_RAKU_HEREDOC_Q / QQ

The length calculation in ProcessValidRegQlangStart was clearly in error. I have replaced it with:

length = startLen - (sc.currentPos - startPos);

scintilla_ILexer5_raku_crlf_fix.diff

Neil Hodgson - 2020-01-05

Committed as [4bdfd4].

Related

Commit: [4bdfd4]

Neil Hodgson - 2020-01-08

labels: lexer raku perl6 --> lexer, raku, perl, scintilla

Group: Completed --> Committed

Neil Hodgson - 2020-01-15

status: open --> closed
