
#1328 [PATCH] Raku (Perl 6) Lexer Development Created

Committed
closed
nobody
5
2020-01-15
2019-12-04
Mark Reay
No

Dear Scintilla,

I have been working on a Raku (Perl 6) lexer implementation and have attached a version for Scintilla 4.2.1. So far it supports the following:

  • Comments, both line and embedded (multi-line)
  • POD, no inline highlighting as yet...
  • Heredoc block string, with variable highlighting (with qq)
  • Strings, with variable highlighting (with ")
  • Q Language, including adverbs (also basic q and qq)
  • Regex, including adverbs
  • Numbers
  • Bareword / identifiers
  • Types
  • Variables: my, positional, associative, callable
  • Operators (blanket cover, no sequence checking)
  • Folding:
    • Comments: line / multi-line
    • POD sections
    • Code blocks {}

Patch / Build Scintilla 4.2.1

$ hg clone http://hg.code.sf.net/p/scintilla/code -r rel-4-2-1 scintilla
$ cd scintilla
$ cp ../scintilla-4.2.1_raku_patch.diff .
$ hg import scintilla-4.2.1_raku_patch.diff
$ cd gtk/
$ make

Patch / Build SciTE 4.2.1

$ wget https://excellmedia.dl.sourceforge.net/project/scintilla/SciTE/4.2.1/scite421.tgz
$ tar xzf scite421.tgz
$ cd scite
$ cp ../scite-4.2.1_raku_patch.diff .
$ patch -s -p0 < scite-4.2.1_raku_patch.diff
$ cd gtk/
$ make
$ sudo make install

There is still work to do, but I thought it would be worth opening it up to the community. I also have an implementation for the Geany editor.

Best regards,

Mark.

2 Attachments

Related

Feature Requests: #1207

Discussion

  • Zufu Liu

    Zufu Liu - 2019-12-05

    Not related to the content of the lexer itself, but:
    1. GPL is not suitable: applications that use Scintilla are not necessarily GPL-licensed.
    2. 130 is already assigned to SCLEX_HOLLYWOOD (4.2.2, pending release)
    3. The website for Raku is now https://raku.org/

    Related [feature-requests:#1207]

     


  • Neil Hodgson

    Neil Hodgson - 2019-12-05

    The property names added by this lexer should be namespaced with the lexer name, similar to the Perl lexer, so fold.raku.comment.multiline and fold.raku.comment.pod. Properties that are global since they are used in other lexers, fold, fold.compact and fold.comment, should omit descriptions so they are not treated as lexer-specific.
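
The naming rule can be sketched roughly as follows. This is only an illustration: `FoldOptions`, `PropertyTable` and `MakeRakuProperties` are hypothetical stand-ins written for this note and are not Scintilla's actual option-handling classes. Only the property names come from the comment above. The description strings are invented.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical option storage; field names are illustrative only.
struct FoldOptions {
    bool fold = false;
    bool foldCompact = false;
    bool foldComment = false;
    bool foldCommentMultiline = false;
    bool foldCommentPOD = false;
};

// Hypothetical stand-in for a property registry (not Scintilla's real class).
class PropertyTable {
    struct Entry {
        bool FoldOptions::*member;
        std::string description;
    };
    std::map<std::string, Entry> entries;
public:
    void Define(const std::string &name, bool FoldOptions::*member,
                const std::string &description = "") {
        assert(!name.empty());
        entries[name] = Entry{member, description};
    }
    // Returns true when the name is known and the value was stored.
    bool Set(FoldOptions &options, const std::string &name, bool value) const {
        const auto it = entries.find(name);
        if (it == entries.end())
            return false;
        options.*(it->second.member) = value;
        return true;
    }
    // Only properties with a description are reported as lexer-specific.
    bool IsLexerSpecific(const std::string &name) const {
        const auto it = entries.find(name);
        return it != entries.end() && !it->second.description.empty();
    }
};

PropertyTable MakeRakuProperties() {
    PropertyTable table;
    // Global properties shared with other lexers: no description.
    table.Define("fold", &FoldOptions::fold);
    table.Define("fold.compact", &FoldOptions::foldCompact);
    table.Define("fold.comment", &FoldOptions::foldComment);
    // Lexer-specific properties: namespaced with the lexer name and described.
    table.Define("fold.raku.comment.multiline", &FoldOptions::foldCommentMultiline,
                 "Fold multi-line comments.");
    table.Define("fold.raku.comment.pod", &FoldOptions::foldCommentPOD,
                 "Fold POD sections.");
    return table;
}
```

In this sketch, a missing description is what marks `fold`, `fold.compact` and `fold.comment` as shared rather than Raku-specific.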

    There are non-ASCII characters in comments which can lead to problems with Microsoft Visual C++ in non-English locales. It is generally simplest to replace the literal characters with their Unicode description so the source code is pure ASCII.

    The license is normally included by reference so consumers don't have to check whether this file's license text differs from License.txt.

    The unnamed namespace does the same job as 'static' so makes the use of 'static' redundant.
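
A minimal sketch of the point about linkage, using made-up helper names (`IsAsciiDigit`, `DigitRun`, `FindDigitRun`) that do not come from LexRaku.cxx:

```cpp
#include <cassert>

namespace {

// Internal linkage: visible only inside this translation unit,
// just as if the function had been declared 'static'.
bool IsAsciiDigit(int ch) noexcept {
    return ch >= '0' && ch <= '9';
}

// Types can be made file-local this way too, which 'static' cannot do.
struct DigitRun {
    int start = 0;
    int length = 0;
};

// Adding 'static' here would be redundant: the namespace already
// limits the linkage of everything declared inside it.
DigitRun FindDigitRun(const char *text) noexcept {
    assert(text != nullptr);
    DigitRun run;
    int i = 0;
    while (text[i] != '\0' && !IsAsciiDigit(text[i]))
        i++;
    run.start = i;
    while (IsAsciiDigit(text[i])) {
        i++;
        run.length++;
    }
    return run;
}

}
```

One unnamed namespace wrapped around the helpers covers functions and types alike, so no per-function `static` is needed.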

    There are some warnings from various tools. cppcheck is worth running although you should ignore 'constParameter' warnings.

    For this warning from cppcheck 1.89, you are allowed to use the 'switch' statement. Coverity also doesn't like this code.

    scintilla\lexers\LexRaku.cxx:252:10: warning: Opposite inner 'if' condition leads to a dead code block. [oppositeInnerCondition]
      if (ch == 0x201C) return 0x201D; // LEFT DOUBLE QUOTATION MARK
             ^
    scintilla\lexers\LexRaku.cxx:244:9: note: outer condition: ch<8192
     if (ch < 0x2000) {
            ^
    scintilla\lexers\LexRaku.cxx:252:10: note: opposite inner condition: ch==8220
      if (ch == 0x201C) return 0x201D; // LEFT DOUBLE QUOTATION MARK
             ^
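
To illustrate why the nested form is flagged and what a `switch` looks like instead, here is a small sketch. Both function names are hypothetical. Only the 0x201C/0x201D pair is taken from the warning above; the other code points are added for illustration.

```cpp
// Shape flagged by cppcheck: the inner test can never be true,
// because 0x201C is not below 0x2000.
int ClosingQuoteNested(int ch) noexcept {
    if (ch < 0x2000) {
        if (ch == 0x00AB) return 0x00BB; // LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        if (ch == 0x201C) return 0x201D; // dead code: unreachable in this branch
    }
    return 0;
}

// Flat switch: each opener is matched directly, so no range test
// can accidentally hide a case.
int ClosingQuoteFor(int ch) noexcept {
    switch (ch) {
    case 0x00AB: return 0x00BB; // LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    case 0x2018: return 0x2019; // LEFT SINGLE QUOTATION MARK
    case 0x201C: return 0x201D; // LEFT DOUBLE QUOTATION MARK
    case 0x2039: return 0x203A; // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    default: return 0;          // not a recognised opener
    }
}
```

The nested version silently returns 0 for U+201C, while the switch returns U+201D.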
    

    Scope limiting can be useful but it's a question of taste.

    scintilla\lexers\LexRaku.cxx:1197:15: warning: The scope of the variable 'lengthToEnd' can be reduced. [variableScope]
     Sci_Position lengthToEnd;  // length until the end of range
                  ^
    

    Unused variables are clutter and can also reveal unfinished plans.

    scintilla\lexers\LexRaku.cxx:1194:14: warning: Unused variable: str [unusedVariable]
     std::string str;    // temp string value
                 ^
    

    It is possible that control flow ensures this is always set, but it's difficult to tell.

    C:\u\hg\scintilla\lexers\LexRaku.cxx(1623) : warning C4701: potentially uninitialized local variable 'hereState' used
    

    Debugging code is non-portable, never works for anyone else, and implies maintenance that doesn't happen, so DebugPrintSectionUnicode and printf shouldn't be included.

     
  • Mark Reay

    Mark Reay - 2019-12-06

    Thanks Neil,

    I have made the following changes:

    • removed comments from global property names
    • used Raku specific property names:
      • fold.raku.comment.multiline
      • fold.raku.comment.pod
    • removed non-ASCII characters from comments
    • use the short license reference lines
    • removed anonymous namespace and used static functions
    • have replaced the whole code block for testing bracket characters with a switch statement
    • removed unused var: str
    • initialised 'hereState'
    • removed debugging code
    • parsed with cppcheck 1.89

    I have also taken the opportunity to add better Unicode character mapping for Raku. On a personal level, I am not a fan of a language with such broad character support. It's one thing to allow diverse language characters, but Raku also interprets number glyphs as numbers. It's all a bit broad.

    Attached are the new patch files for Scintilla (rev: 1b8ce5991cb9) and SciTE 4.2.1

    Mark.

     

    Last edit: Mark Reay 2019-12-06
  • Zufu Liu

    Zufu Liu - 2019-12-06

    Hi Mark, personally, I think it's preferable to use an anonymous namespace instead of static for the entire file (after the includes and before LexerModule lmRaku).
    The LexerRaku::IsWordChar() method looks like it can be simplified by using IsIdStart(), IsIdContinue(), etc. from the CharacterCategory header.

     
  • Mark Reay

    Mark Reay - 2019-12-06

    Hi Zufu,

    Good point. I have just changed the LexerRaku::IsWordChar() function. That makes it very compact and much more efficient; now it's just:

    const CharacterCategory cc = ccMap.CategoryFor(ch);
    switch (cc) {
        // Letters
        case ccLu:
        case ccLl:
        case ccLt:
        case ccLm:
        case ccLo:
            return true;
        default:
            return false;
    }
    

    I also put the anonymous namespace back; I think it makes sense too, if that's okay.

    Attached is LexRaku.cxx, as that contains the only changes.

    Thanks,

    Mark.

     
  • Mark Reay

    Mark Reay - 2019-12-07

    I think I was a bit tired when I implemented the CharacterCategory check last night. I've now used the following instead:

    const CharacterCategory cc = CategoriseCharacter(ch);
    switch (cc) {
        // etc...
    }
    

    The categorisation also works well with allowed numbers so the two functions I've updated are:

    • LexerRaku::IsWordChar
    • LexerRaku::IsNumberChar

    I will work to simplify GetBracketCloseChar next. Valid opening and closing delimiters can be any bi-directional pair of Unicode characters, as described in the first section of: http://www.unicode.org/Public/5.1.0/ucd/BidiMirroring.txt

    Mark.

     

    Last edit: Mark Reay 2019-12-07
  • Neil Hodgson

    Neil Hodgson - 2019-12-07

    'const' is redundant on return types. From clang-tidy:

    C:\u\hg\scintilla\lexers\LexRaku.cxx(413,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool IsValidOpener(const int ch, DelimPair &dp, int type = RAKUDELIM_BRACKET) noexcept {
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(815,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool IsOperatorChar(const int ch);
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(816,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool IsWordChar(const int ch, bool alowNumber = true);
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(817,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool IsWordStartChar(const int ch);
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(818,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool IsNumberChar(const int ch, int base = 10);
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(819,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool ProcessRegexTwinCapture(StyleContext &sc, const Sci_Position length,
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(822,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const bool ProcessValidRegQlangStart(StyleContext &sc, Sci_Position length, const int type,
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(824,2): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
            const Sci_Position LengthToNonWordChar(StyleContext &sc, Sci_Position length,
            ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(837,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::IsOperatorChar(const int ch) {
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(872,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::IsWordChar(const int ch, bool alowNumber) {
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(897,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::IsWordStartChar(const int ch) {
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(910,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::IsNumberChar(const int ch, int base) {
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(983,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::ProcessRegexTwinCapture(StyleContext &sc, const Sci_Position length,
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(1031,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const bool LexerRaku::ProcessValidRegQlangStart(StyleContext &sc, Sci_Position length, const int type,
    ^~~~~~
    C:\u\hg\scintilla\lexers\LexRaku.cxx(1128,1): warning G2C97499B: 'const' type qualifier on return type has no effect [clang-diagnostic-ignored-qualifiers]
    const Sci_Position LexerRaku::LengthToNonWordChar(StyleContext &sc, Sci_Position length,
    ^~~~~~
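
A small sketch of why the qualifier has no effect on a non-class return type. The two function names are made up for this note, and the first is deliberately written with the redundant `const`, so a compiler with the relevant warning enabled may report it.

```cpp
#include <type_traits>

// Deliberately carries the redundant qualifier that clang-tidy reports.
const bool IsSpaceQualified(int ch) noexcept { return ch == ' '; }

// Preferred form: no qualifier on the by-value return type.
bool IsSpacePlain(int ch) noexcept { return ch == ' '; }

// A call to either function is a prvalue of type plain 'bool':
// the top-level 'const' is dropped for non-class prvalues.
static_assert(std::is_same<decltype(IsSpaceQualified(0)), bool>::value,
              "const on a by-value bool return type is dropped");
static_assert(std::is_same<decltype(IsSpacePlain(0)), bool>::value,
              "plain bool return type");
```

Since callers see the same type either way, removing the qualifier changes nothing except the warning count.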
    

    'alowNumber' should probably be spelled 'allowNumber'.

    Scintilla uses a fixed #include order with C++ library headers after C library headers. The order is defined in scripts/HeaderOrder.txt and checked by scripts/HeaderCheck.py. <vector> goes between <string> and <map>.
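
As a rough sketch of that ordering: only the relative position of `<string>`, `<vector>` and `<map>` is taken from the comment above. The choice of `<cstdlib>` as the C library header and the `ParseAll` helper are assumptions made for illustration, and scripts/HeaderOrder.txt remains the authority.

```cpp
// C library headers come first.
#include <cstdlib>

// C++ library headers follow, with <vector> between <string> and <map>.
#include <string>
#include <vector>
#include <map>

// Hypothetical helper that simply exercises each included header.
std::map<std::string, long> ParseAll(const std::vector<std::string> &texts) {
    std::map<std::string, long> values;
    for (const std::string &text : texts)
        values[text] = std::strtol(text.c_str(), nullptr, 10);
    return values;
}
```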

     
  • Mark Reay

    Mark Reay - 2019-12-10

    I have reduced the size of the switch statement that was in GetBracketCloseChar. What was 178 cases has now been reduced to three CharacterCategory types. Some opening characters have matching closing characters that are not simply opener + 1. These are handled as explicit cases:

    int GetBracketCloseChar(const int ch) noexcept {
        const CharacterCategory cc = CategoriseCharacter(ch);
        switch (cc) {
            case ccSm:
                switch (ch) {
                    case 0x3C: return 0x3E; // LESS-THAN SIGN
                    case 0x2208: return 0x220B; // ELEMENT OF
                    case 0x2209: return 0x220C; // NOT AN ELEMENT OF
                    case 0x220A: return 0x220D; // SMALL ELEMENT OF
                    case 0x2215: return 0x29F5; // DIVISION SLASH
                    case 0x2243: return 0x22CD; // ASYMPTOTICALLY EQUAL TO
                    case 0x2298: return 0x29B8; // CIRCLED DIVISION SLASH
                    case 0x22A6: return 0x2ADE; // ASSERTION
                    case 0x22A8: return 0x2AE4; // TRUE
                    case 0x22A9: return 0x2AE3; // FORCES
                    case 0x22AB: return 0x2AE5; // DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE
                    case 0x22F2: return 0x22FA; // ELEMENT OF WITH LONG HORIZONTAL STROKE
                    case 0x22F3: return 0x22FB; // ELEMENT OF WITH VERTICAL BAR AT END OF HORIZONTAL STROKE
                    case 0x22F4: return 0x22FC; // SMALL ELEMENT OF WITH VERTICAL BAR AT END OF HORIZONTAL STROKE
                    case 0x22F6: return 0x22FD; // ELEMENT OF WITH OVERBAR
                    case 0x22F7: return 0x22FE; // SMALL ELEMENT OF WITH OVERBAR
                    case 0xFF1C: return 0xFF1E; // FULLWIDTH LESS-THAN SIGN
                }
                break;
            case ccPs:
                switch (ch) {
                    case 0x5B: return 0x5D; // LEFT SQUARE BRACKET
                    case 0x7B: return 0x7D; // LEFT CURLY BRACKET
                    case 0x298D: return 0x2990; // LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
                    case 0x298F: return 0x298E; // LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
                    case 0xFF3B: return 0xFF3D; // FULLWIDTH LEFT SQUARE BRACKET
                    case 0xFF5B: return 0xFF5D; // FULLWIDTH LEFT CURLY BRACKET
                }
                break;
            case ccPi:
                break;
            default: return 0;
        }
        return ch + 1;
    }
    

    I have also removed the superfluous 'const' qualifiers from function return types, and moved <vector> to its appropriate position.

     

    Last edit: Mark Reay 2019-12-10
  • Neil Hodgson

    Neil Hodgson - 2019-12-13

    The latest version seems reasonable to me.

    Due to some upcoming changes to the way lexers work, the Raku lexer won't be committed until those changes have been committed. This will most likely occur in a couple of weeks and may require minor changes to this lexer.

     
  • Neil Hodgson

    Neil Hodgson - 2020-01-03

    New lexing features have been committed and the ILexer interface updated to ILexer5 which adds new metadata retrieval calls. A patch with the changes needed is attached as RakuILexer5.patch.

    The new lexer testing framework uses example files which are controlled by SciTE.properties files to produce expected output in .styled files. A minimal example x.p6 is attached as RakuTest.patch. The test file should include an example of each possible style.

    Is the numeric '0' supposed to be in SCE_RAKU_DEFAULT instead of SCE_RAKU_NUMBER?

     
  • Mark Reay

    Mark Reay - 2020-01-03

    Thank you Neil,

    I have gone over the lexer for final checks and fixed a few bugs:

    • numeral '0' is now SCE_RAKU_NUMBER, as it should be
    • multi-line comments with multiple start and end delimiters were not terminating correctly
    • folds for multi-line comments are now also detected properly

    I have updated the new style test files with tests for all style types:

    • SciTE.properties
    • x.p6
    • x.p6.styled

    The raku.properties file has been updated for SciTE; the only change is a fix to the line wrapping of the keywords.

    Attached are the patch files for both Scintilla (tip: 295a6e54d582) and SciTE 4.2.3

    Mark.

     

    Last edit: Mark Reay 2020-01-03
  • Neil Hodgson

    Neil Hodgson - 2020-01-04

    Committed as [bcb951], [3be72c], [604485], [cb4c65].

    In ProcessValidRegQlangStart, the decrementing of length inside a loop appears wrong as using the unchanging startPos turns a linear change into a multiplicative change and a possible early termination. Maybe something like

    length = lengthOriginal - (sc.currentPos - startPos);
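
The difference between the two forms can be shown with a toy loop. This is a sketch under assumptions: the real loop in ProcessValidRegQlangStart is not reproduced in this thread, `Position` stands in for Sci_Position, and both function names are hypothetical.

```cpp
using Position = long; // stand-in for Sci_Position

// Problem shape: the full distance from the unchanging startPos is
// subtracted on every step, so the budget shrinks by 1, then 2, then 3, ...
Position RemainingCumulative(Position length, Position startPos, Position steps) {
    Position currentPos = startPos;
    for (Position i = 0; i < steps; i++) {
        currentPos++;
        length -= (currentPos - startPos);
    }
    return length;
}

// Suggested shape: recompute from the original length each time, so the
// remaining length drops by exactly one per character consumed.
Position RemainingRecomputed(Position lengthOriginal, Position startPos, Position steps) {
    Position currentPos = startPos;
    Position length = lengthOriginal;
    for (Position i = 0; i < steps; i++) {
        currentPos++;
        length = lengthOriginal - (currentPos - startPos);
    }
    return length;
}
```

Starting from a length of 100, four steps leave 90 in the cumulative form but 96 in the recomputed form. After twenty steps the cumulative form has gone negative, which is the early termination mentioned above.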
    

    When using a file with \r\n line ends, as is common on Windows, there are often mismatched styles on the \r and \n when visible line ends are turned on (SciTE: View | End of Line). The most common case is at the end of a '#' line comment, where the \r is green and the \n grey. While this is not an error and happens with some other lexers, it is a source of problems, and lexers that style both the \r and \n with the same style are more robust.
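
One way to keep the two line-end bytes in the same style is to end the comment state before either of them is styled. The following toy styler is only a sketch of that idea: it is not the Raku lexer, and `Style`, `StyleDefault`, `StyleCommentLine` and `StyleLineComments` are hypothetical names. Which style the line end receives matters less than both bytes sharing it.

```cpp
#include <string>
#include <vector>

enum Style : char { StyleDefault = 0, StyleCommentLine = 1 };

// Toy styler: '#' starts a line comment that runs to the end of the line.
// Both '\r' and '\n' leave the comment state before they are styled,
// so a "\r\n" pair always ends up in StyleDefault together.
std::vector<char> StyleLineComments(const std::string &text) {
    std::vector<char> styles(text.size(), StyleDefault);
    bool inComment = false;
    for (std::string::size_type i = 0; i < text.size(); i++) {
        const char ch = text[i];
        if (ch == '\r' || ch == '\n') {
            inComment = false; // end the comment before either line-end byte
        } else if (ch == '#') {
            inComment = true;
        }
        if (inComment)
            styles[i] = StyleCommentLine;
    }
    return styles;
}
```

A styler that only checks for '\n' would instead leave the '\r' in the comment style, which is the mismatch described above.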

     

    Related

    Commit: [604485]
    Commit: [cb4c65]
    Commit: [3be72c]
    Commit: [bcb951]

  • Mark Reay

    Mark Reay - 2020-01-04

    I have found two places where CRLF was not being handled properly and have corrected them for:

    • SCE_RAKU_COMMENTLINE
    • SCE_RAKU_HEREDOC_Q / QQ

    The length calculation in ProcessValidRegQlangStart was clearly in error. I have replaced it with:

    length = startLen - (sc.currentPos - startPos);
    
     
    • Neil Hodgson

      Neil Hodgson - 2020-01-05

      Committed as [4bdfd4].

       

      Related

      Commit: [4bdfd4]

  • Neil Hodgson

    Neil Hodgson - 2020-01-08
    • labels: lexer raku perl6 --> lexer, raku, perl, scintilla
    • Group: Completed --> Committed
     
  • Neil Hodgson

    Neil Hodgson - 2020-01-15
    • status: open --> closed
     
