
#1314 [Scintilla 3] Please add a lexer for ARM assembly

Committed
closed
5
2021-04-09
2019-10-11
Daniel
No

The current LexAsm.cxx actually is only suitable for assembly files for Intel and MASM. I would like to work with ARM assembly files (ARMv4T) using the GNU assembler.

It's not too different from the existing lexer though. The main differences are:
* Single-line comments need their character changed (@)
* Multi-line comment ability must be added (the usual C-style /* and */)
* IsAWordChar should be letters, digits, _, ., and $ (not ?)
* IsAWordStart is just letters, _, ., and $ (no digits)
* Label support (including numbered labels) would be nice but not necessary.
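
The character rules in the list above could be sketched as follows. This is a minimal illustration, not the code in LexAsm.cxx: the names `IsGnuArmWordStart`, `IsGnuArmWordChar`, and `ScanSymbol` are hypothetical, and only ASCII input is assumed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helpers that follow the rules listed in the request.
// They are NOT the functions currently found in LexAsm.cxx.
inline bool IsGnuArmWordStart(int ch) {
    // Letters, '_', '.', and '$' may start a symbol; digits may not.
    return ch > 0 && ch < 0x80 &&
        (std::isalpha(ch) || ch == '_' || ch == '.' || ch == '$');
}

inline bool IsGnuArmWordChar(int ch) {
    // Inside a symbol, digits are allowed as well; '?' is not.
    return ch > 0 && ch < 0x80 &&
        (std::isalnum(ch) || ch == '_' || ch == '.' || ch == '$');
}

// Returns the symbol that starts at pos, or an empty string if none starts there.
inline std::string ScanSymbol(const std::string &text, std::size_t pos) {
    if (pos >= text.size() ||
        !IsGnuArmWordStart(static_cast<unsigned char>(text[pos])))
        return std::string();
    std::size_t end = pos + 1;
    while (end < text.size() &&
           IsGnuArmWordChar(static_cast<unsigned char>(text[end])))
        ++end;
    return text.substr(pos, end - pos);
}
```

With these rules a name that contains digits is scanned as one token, while a token that begins with a digit is left for the number state.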

I am not a C++ programmer so unfortunately my ability to help on this one is quite limited. I think I could easily edit the small stuff in if someone else can do the multi-line comment support.

Discussion

1 2 3 > >> (Page 1 of 3)
  • Neil Hodgson

    Neil Hodgson - 2019-10-12
    • labels: --> scintilla, lexer, assembler, GNU
    • assigned_to: Neil Hodgson
     
  • Neil Hodgson

    Neil Hodgson - 2019-10-12

    There are currently 3 assembler lexers - LexA68K, LexAsm and LexMMIXAL. There was a proposal for a GNU assembler lexer [feature-requests:#1072] but discussion petered out.

     

    Related

    Feature Requests: #1072

  • Dejan Budimir

    Dejan Budimir - 2019-10-18

    Maybe we should be thinking about writing an extra tool, a lexer-compiler. SublimeText has an effective simple stack-based meta-language for defining lexers, but it's interpreted rather than compiled, though it seems efficient enough in practice.

    We could use flex/bison for that, but ideally, I'd want users to be able to write and experiment with their own lexers at run-time, without having to recompile, for now and future languages, so a bytecode interpreter would be the thing to aim for. Better yet, why not expose the lexlib via Lua?

    Anyway, I'll probably be looking into the ARM lexer soon. And I'll be trying to translate e.g. LexAsm.cxx into something more general along the way which would be expressive enough to subsume all (reasonably possible) assemblers. Really, writing assembler parsers should be as easy as it gets, right? Cheers!

     
    • Neil Hodgson

      Neil Hodgson - 2019-10-19

      Some downstream projects use various lexing tools. Probably the most widespread is Scintillua https://foicica.com/wiki/scintillua

       
      • Dejan Budimir

        Dejan Budimir - 2019-10-21

        Well that is amazing. I'll be reading up on this today. Thank you Neil.

        I've been looking at the assembly lexers and whether the change would really be so minimal (I was hoping to simply add a properties file more or less). There is an interesting variety in the approaches of our lexer writers though, which makes for gripping reading.

        The assembly guys were expectedly circumspect, but in the CSS lexer I found this poster-worthy gem:

        // look behind (from start of document to our start position) to determine current nesting level
        inline int NestingLevelLookBehind(Sci_PositionU startPos, Accessor &styler) {
            int ch;
            int nestingLevel = 0;
        
            for (Sci_PositionU i = 0; i < startPos; i++) {
                ch = styler.SafeGetCharAt(i);
                if (ch == '{')
                    nestingLevel++;
                else if (ch == '}')
                    nestingLevel--;
            }
        
            return nestingLevel;
        }
        

        which is called at (each and) every colorization call. In most cases it would be more efficient to have parsed the whole doc once in its entirety. I'm in the process of figuring out if this may be attributed to limitations with the available interface, or whether it was up to their free choice.

        In any case, I'm glad that after all these years there seems to still be plenty to occupy a newcomer contribution-wise.

         
        • Neil Hodgson

          Neil Hodgson - 2019-10-21

          It's just a simple implementation - the author of that change probably expects extended CSS files to be short. If it is a problem then keeping a nesting level in line state is easy.
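
A rough sketch of what keeping the nesting level in per-line state could look like. It is deliberately self-contained and does not use Scintilla's interfaces: `NestingCache` and its `lineState` vector are hypothetical stand-ins for the per-line state a lexer can store, and the brace counting ignores strings and comments just as the quoted function does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for per-line lexer state (not Scintilla's API):
// lineState[i] holds the brace nesting level at the END of line i.
struct NestingCache {
    std::vector<int> lineState;

    // Recompute lines [firstLine, lines.size()), starting from the level
    // cached for the previous line instead of rescanning from line 0.
    void Update(const std::vector<std::string> &lines, std::size_t firstLine) {
        if (firstLine > lineState.size())
            firstLine = lineState.size();  // nothing cached beyond this point
        int level = (firstLine > 0) ? lineState[firstLine - 1] : 0;
        lineState.resize(lines.size());
        for (std::size_t i = firstLine; i < lines.size(); ++i) {
            for (char ch : lines[i]) {
                if (ch == '{')
                    ++level;
                else if (ch == '}')
                    --level;
            }
            lineState[i] = level;
        }
    }
};
```

After an edit on line N, only lines from N onward are rescanned; the level at the start of line N comes from the cached value for line N-1.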

           
          • Dejan Budimir

            Dejan Budimir - 2019-10-22

            I've never had any complaints myself. (I guess you're not supposed to peek into your favourite restaurant's kitchen lest it put you off.) But this may need to be looked at for improvement given bloaty web frameworks.

            Regarding the keeping of a nesting level, I'm not yet clear on how Colorize is called. My guess was that something clever was going on, to facilitate styling only (more or less) the text that is actually being drawn, making jumps to arbitrary lines efficient. But a cursory glance suggests that to a styler a document looks like a non-seekable stream?

            Is EnsureStyledTo called for everything up to the current pos?

            I haven't given it much thought but it seems to me that "smart styling" would involve a lexer's ability to be styling forwards and backwards from a given pos.

             
            • Neil Hodgson

              Neil Hodgson - 2019-10-22

              The lexer is supposed to lex the range given it. It may read anywhere in the document if necessary. Writing styles will generally just be the range although it is possible to drop back to fix up a multi-line element when the end delimiter is found. There is a single range of validly styled text from 0 to endStyled.

               
              • Dejan Budimir

                Dejan Budimir - 2019-10-23

                It may read anywhere in the document if necessary.

                So it's the most flexible and least restrictive approach regarding access. Makes good sense of course.

                I was wondering whether I could assume there being any hard guarantees. For example, whether a lexer-call to style from pos A through B implies that pos 0 through A had already been styled.

                There is a single range of validly styled text from 0 to endStyled.

                I believe that's what I was looking for. That would make good sense too for the current conception of a lexer which is supposed to handle indentation/folding and not just, basically, character classification. Your earlier remark that

                keeping a nesting level in line state is easy

                is obviously why I asked, which now too makes sense.

                I'll be thinking about an alternate, possibly stricter interface for lexers. In particular I'd like to separate folding from highlighting. Because the highlighting of numbers, strings, etc. in the current couple of rows shouldn't depend on what happened 1k lines before. But that's a future problem. I'm reading my old ARM reference manuals right now, working on the OP's lexer. Cheers!

                 

                Last edit: Dejan Budimir 2019-10-23
  • Dejan Budimir

    Dejan Budimir - 2019-10-21

    Maybe a quick manual patch? Line 424 in asm.properties reads style.asm.0=$(style.asm.0) but should read style.as.0=$(style.asm.0).

     
    • Neil Hodgson

      Neil Hodgson - 2019-10-29

      Committed as [4501b1].

       

      Related

      Commit: [4501b1]

      • Dejan Budimir

        Dejan Budimir - 2019-10-29

        I do

        No worries then. I'll scrap my stuff and try to merge all assembly-related
        lexing into LexAsm.cxx now.

        I'm looking forward to a chrome-less version of SciTE. (I consider the
        annotations and code completion popups to belong to chrome as well.)
        Scintilla's read-only style is still considered experimental but it would
        be perfect for emulating lists that are actually inline with the text.

        I should have something by the end of the day today.

        I keep getting sidetracked. I've fixed some issues with the python scripts
        which I'll be posting soon.

         
  • Dejan Budimir

    Dejan Budimir - 2019-10-26

    The OP's feature request is a motion to make the existing lexer more configurable, but there are certain ARM-specifics that I think would warrant an independent lexer. So I'm not sure yet how to approach this.

    There's an additional (interesting!) complication with the introduction of the in-code syntax switch .intel_syntax. This, plus another (very interesting!) potential complication is the ability to switch into Thumb mode (for older targets that are from the pre-UAL era of major syntax differences). But I've never written Thumb so have no ready knowledge of the nuances.

    The last official migration guide that mentions GNU Assembler seems to be Version 0605.

    The GNU Assembler uses different comment chars depending on the target. There are also some historic quirks to it. I've been looking for documentation and, strangely, found the best one to be in a custom/non-official (?) version of its manual (PDF), so that is a huge help. But there seems to be a difference between 32bit ARM and 64bit ARM which isn't documented even there.

    For absolute starters -- to get things running -- it would be simple to add every possible mnemonic to a keyword list but the ARM syntax is heavily modular (PDF) and it feels like an utter waste of potential. E.g. almost every instruction can have one of 11 suffixes attached to mark it for conditional execution. (There's also the latest, most definitive version of above ARM Architecture Reference Manual here (PDF) but it's 45M and 7k9 pages and it crashed my PDF reader twice before it rendered).

    This is just a sampling of the design decisions to be made. I reckon it's good to have them mentioned here for any future contributors. Cheers!

     
    • Daniel

      Daniel - 2019-10-26

      Yes, .arm and .thumb (or .code32 and .code16) just set the code type and then affect all following code, so it's possible to have the mode set in an include file and then just trail out past that include directive. That said, all the Thumb instructions are a subset of the ARM instructions, even LSL and LSR and such are accepted in ARM code as being aliases for a MOV. So you could (I think) simply always lex for ARM code without worrying about it too much. The assembler will still give an error even if the visuals aren't perfect, same as if you typed "break" at an invalid place of a C file you'd see that as a keyword but the compiler would reject it.

      Also, the "conditional" codes come from a fixed list of 17 possibilities

      One part that makes it a little easier is that the conditionals always go at the same part of each particular instruction, so the best way to parse for keywords might be to parse the instruction prefix (eg ADD), parse for a possible conditional (eg EQ), parse for other possible flags (eg: S, L, etc), and then the whole thing can get marked as a keyword if it combines properly.
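
The base, then condition, then flag order described above could be sketched like this. Everything here is an assumption made for illustration: the tiny `bases` table, the `Mnemonic` and `BaseOp` types, and the `SplitMnemonic` name are hypothetical, only upper-case input is handled, and only a trailing `S` flag is modelled. It follows the older (pre-UAL) ordering from this comment; in UAL syntax the `S` comes before the condition.

```cpp
#include <string>

struct Mnemonic {
    std::string base;        // e.g. "ADD"
    std::string cond;        // e.g. "EQ"; empty when no condition suffix
    bool setsFlags = false;  // true when a trailing "S" was present
    bool valid = false;      // true when the whole word decomposed cleanly
};

struct BaseOp {
    const char *name;
    bool allowsS;  // whether a trailing "S" makes sense for this base
};

inline Mnemonic SplitMnemonic(const std::string &word) {
    // Small illustrative subset; a real lexer would read these from keyword lists.
    static const BaseOp bases[] = {
        {"ADD", true}, {"SUB", true}, {"MOV", true},
        {"CMP", false}, {"BL", false}, {"B", false},
    };
    // The 17 condition suffixes (HS and LO are synonyms of CS and CC).
    static const char *const conds[] = {
        "EQ", "NE", "CS", "HS", "CC", "LO", "MI", "PL", "VS",
        "VC", "HI", "LS", "GE", "LT", "GT", "LE", "AL",
    };
    for (const BaseOp &op : bases) {
        const std::string base(op.name);
        if (word.compare(0, base.size(), base) != 0)
            continue;  // word does not start with this base
        std::string rest = word.substr(base.size());
        Mnemonic m;
        m.base = base;
        for (const char *cond : conds) {
            if (rest.compare(0, 2, cond) == 0) {
                m.cond = cond;
                rest.erase(0, 2);
                break;
            }
        }
        if (op.allowsS && rest == "S") {
            m.setsFlags = true;
            rest.clear();
        }
        if (rest.empty()) {
            m.valid = true;
            return m;
        }
        // Otherwise try the next base, e.g. "BLE" fails as BL+E but works as B+LE.
    }
    return Mnemonic();
}
```

Because the parts are returned separately, a caller could style the whole word as one keyword or highlight the condition suffix on its own.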

      I'd say that the big deals, compared to the current lexer which I'm kinda able to use, are that comments should work right and symbol names should be parsed properly. The current lexer doesn't handle digits in symbol names so MODE3 is parsed as MODE and then the number 3.

       
  • Dejan Budimir

    Dejan Budimir - 2019-10-27

    simply always lex for ARM code without worrying about it too much [...] even if the visuals aren't perfect

    Indeed, I think this is how I'll do it for now. Though I'm uncertain whether I should extend/rewrite Scintilla's already existing lexer or add a new one. Writing a new one would be easier. But then you're saying that the current lexer is almost usable. And it groks Intel syntax, not GNU-as's usual AT&T. I was flirting with the idea of writing a GNU-as lexer that can do it all, but obviously that would be a much greater endeavour. And who cares if SciTE's "Language" menu gets too big? It should be seen as a mark of strength.

    Also, the "conditional" codes come from a fixed list of 17 possibilities

    Good catch. I've pulled the 11 out of a hat. There are 4 bits in the opcode reserved for this, so 16 would be the number to aim for (barring syntactical sugar), but the arm_arm.pdf (ARM Arch Ref vintage 2005) has a nice chapter on it.

    and then the whole thing can get marked as a keyword

    I do feel though that this should be made optional, if we're already identifying the mnemonic as a composite, I'm in favor of making use of that extra info instead of discarding it. I think the user should decide whether to color it the same or highlight it. That's why I'll be writing a new one. We can think about merging the assembly lexers later. I should have something by the end of the day today.

     
    • Neil Hodgson

      Neil Hodgson - 2019-10-27

      And who cares if SciTE's "Language" menu gets too big? It should be seen as a mark of strength.

      I do, as long menus are difficult to navigate. Languages are not added to the menu unless they are quite popular.

       
      • Dejan Budimir

        Dejan Budimir - 2020-04-13

        We could always argue about the popularity of TADS. Relative to assembly. (I personally love that it is there.)

         
  • Mike Spivey

    Mike Spivey - 2020-04-07

    Was there any progress with the main issue -- making the lexer LexAsm.cxx adaptable to different comment conventions? It's not hard to make the comment character a configuration option, and to make C-style comments an option too. I will prepare a patch if there is any appetite to include it.

    The current lexer (and the list of instructions that appear in Geany's config file) seems not to suit any particular machine or assembler. The best thing, I think, would be to follow closely the machine-independent parts of the conventions of the GNU assembler, and make things like comment syntax modifiable in the config file.

     
    • Neil Hodgson

      Neil Hodgson - 2020-04-08

      I can not recall any other recent discussions about the assembler lexer.

      C-style stream comments should be in the already defined SCE_ASM_COMMENTBLOCK style and can be active whenever the lexer identifier (GetIdentifier()) is SCLEX_AS.

      For the configurable comment character, make sure there is a default that matches current behaviour and name the property starting with "lexer.as". According to https://en.wikipedia.org/wiki/GNU_Assembler#Single-Line_comments it's "//" on AArch64, so it should probably be a string instead of a character. This property should probably be ignored in MASM (SCLEX_ASM) mode.

      The patch should include an example file and corresponding SciTE.properties in lexilla/test/examples/as to ensure that this feature gets tested against future changes. See lexilla/test/README for information on the lexer testing framework.

       
    • Dejan Budimir

      Dejan Budimir - 2020-04-11
      diff --git a/scintilla/lexers/LexAsm.cxx b/scintilla/lexers/LexAsm.cxx
      index 8c30087..5ed57a8 100644
      --- a/scintilla/lexers/LexAsm.cxx
      +++ b/scintilla/lexers/LexAsm.cxx
      @@ -80,6 +80,7 @@ struct OptionsAsm {
          std::string foldExplicitEnd;
          bool foldExplicitAnywhere;
          bool foldCompact;
      +   std::string commentChar;
          OptionsAsm() {
              delimiter = "";
              fold = false;
      @@ -90,6 +91,7 @@ struct OptionsAsm {
              foldExplicitEnd   = "";
              foldExplicitAnywhere = false;
              foldCompact = true;
      +       commentChar = "";
          }
       };
      
      @@ -134,6 +136,8 @@ struct OptionSetAsm : public OptionSet<OptionsAsm> {
      
              DefineProperty("fold.compact", &OptionsAsm::foldCompact);
      
      +       DefineProperty("lexer.asm.comment.char", &OptionsAsm::commentChar);
      +
              DefineWordListSets(asmWordListDesc);
          }
       };
      @@ -149,10 +153,9 @@ class LexerAsm : public DefaultLexer {
          WordList directives4foldend;
          OptionsAsm options;
          OptionSetAsm osAsm;
      -   int commentChar;
       public:
          LexerAsm(const char *languageName_, int language_, int commentChar_) : DefaultLexer(languageName_, language_) {
      -       commentChar = commentChar_;
      +       options.commentChar = commentChar_;
          }
          virtual ~LexerAsm() {
          }
      @@ -350,7 +353,7 @@ void SCI_METHOD LexerAsm::Lex(Sci_PositionU startPos, Sci_Position length, int i
      
              // Determine if a new state should be entered.
              if (sc.state == SCE_ASM_DEFAULT) {
      -           if (sc.ch == commentChar){
      +           if (sc.ch == options.commentChar[0]){
                      sc.SetState(SCE_ASM_COMMENT);
                  } else if (IsASCII(sc.ch) && (isdigit(sc.ch) || (sc.ch == '.' && IsASCII(sc.chNext) && isdigit(sc.chNext)))) {
                      sc.SetState(SCE_ASM_NUMBER);
      diff --git a/scite/src/Embedded.properties b/scite/src/Embedded.properties
      index dcf999e..a5b7935 100644
      --- a/scite/src/Embedded.properties
      +++ b/scite/src/Embedded.properties
      @@ -437,6 +437,8 @@ keywords2.$(file.patterns.asl)=$(keywordclass2.asl)
      
       module asm
      
      +# lexer.asm.comment.char=
      +
       file.patterns.asm=*.asm
       file.patterns.as=*.s
      
      diff --git a/scite/src/SciTEProps.cxx b/scite/src/SciTEProps.cxx
      index b611f5b..830d40c 100644
      --- a/scite/src/SciTEProps.cxx
      +++ b/scite/src/SciTEProps.cxx
      @@ -490,6 +490,7 @@ static const char *propertiesToForward[] = {
          "fold.verilog.flags",
          "fold.xml.at.tag.open",
          "html.tags.case.sensitive",
      +   "lexer.asm.comment.char",
          "lexer.asm.comment.delimiter",
          "lexer.baan.styling.within.preprocessor",
          "lexer.caml.magic",
      diff --git a/scite/src/asm.properties b/scite/src/asm.properties
      index c244450..6a51bc3 100644
      --- a/scite/src/asm.properties
      +++ b/scite/src/asm.properties
      @@ -3,6 +3,8 @@
       # Updated by Kein-Hong Man mkh@pl.jaring.my 2003-10
       # Updated by Marat Dukhan (mdukhan3.at.gatech.dot.edu) 10/4/2011
      
      +# lexer.asm.comment.char=
      +
       # Nasm files
       file.patterns.asm=*.asm
       file.patterns.as=*.s
      
       
  • Neil Hodgson

    Neil Hodgson - 2020-04-11

    This works for many cases.

    I think that this option should only operate on GNU as files, not MASM files. An empty value should also return to the standard '#'. Therefore, there should be two variables, the current LexerAsm::commentChar and a new OptionsAsm::commentPrefix. Near the start of LexerAsm::Lex this should be resolved (depending on GetIdentifier() and a non-empty value) to a local variable that is then used in the detection.

    The DefineProperty call should have an explanation string like other lexer-unique properties.

    Since AArch64 uses a 2 character string "//" testing against [0] won't work - use StyleContext::Match.
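
The resolution step described in this review might look roughly like the sketch below. It is standalone and does not touch the real lexer: `AsmDialect` stands in for the SCLEX_ASM / SCLEX_AS identifier check, `builtInChar` for the existing `LexerAsm::commentChar`, `optionValue` for the proposed `OptionsAsm::commentPrefix`, and `MatchesAt` is a hypothetical substitute for a multi-character match such as `StyleContext::Match`.

```cpp
#include <cstddef>
#include <string>

// Stand-in for checking the lexer identifier (SCLEX_ASM vs SCLEX_AS).
enum class AsmDialect { Masm, GnuAs };

// Decide which comment prefix is in effect for this lexing pass.
// The option only applies to GNU as, and an empty option falls back
// to the lexer's built-in character.
inline std::string ResolveCommentPrefix(AsmDialect dialect,
                                        char builtInChar,
                                        const std::string &optionValue) {
    if (dialect == AsmDialect::GnuAs && !optionValue.empty())
        return optionValue;
    return std::string(1, builtInChar);
}

// True when the (possibly multi-character) prefix occurs at pos.
inline bool MatchesAt(const std::string &text, std::size_t pos,
                      const std::string &prefix) {
    return !prefix.empty() && pos <= text.size() &&
        text.compare(pos, prefix.size(), prefix) == 0;
}
```

Comparing the whole prefix rather than its first character is what keeps a lone `/` (division) from being taken as the start of an AArch64 `//` comment.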

     
    • Dejan Budimir

      Dejan Budimir - 2020-04-12

      Valid points. This is a quick hack. For simple languages like assembly, we should consider writing a minimal lexer that can be configured more comprehensively from a properties file. Also, the existence of an xyz.properties file in, say, the user config folder should run opened .xyz files through that lexer by default.

       
      • Dejan Budimir

        Dejan Budimir - 2020-04-12

        What do I mean by simple language? Perhaps a language that can be lexed on a line-by-line basis, meaning that each line is lexed as an independent unit, meaning that you could start lexing a file in the middle at line #(X+1) without having lexed lines #1 through #X. This property would obviously allow for optimizations for very large files. Any thoughts?

         
        • Neil Hodgson

          Neil Hodgson - 2020-04-12

          If /* stream comments */ are allowed then assembler can't be lexed line by line.
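
A small sketch of why this is so: whether a line begins inside a comment depends on every line before it. The function name is hypothetical, and the scan is deliberately naive (it ignores strings and single-line comments), so it only illustrates the dependency rather than real lexing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For each line, report whether it BEGINS inside an unterminated
// /* ... */ comment. The answer for line N needs the state carried
// over from lines 0..N-1, so lines cannot be lexed in isolation.
inline std::vector<bool> StartsInsideComment(const std::vector<std::string> &lines) {
    std::vector<bool> result;
    bool inComment = false;
    for (const std::string &line : lines) {
        result.push_back(inComment);
        for (std::size_t i = 0; i + 1 < line.size(); ++i) {
            if (!inComment && line[i] == '/' && line[i + 1] == '*') {
                inComment = true;
                ++i;  // skip the '*'
            } else if (inComment && line[i] == '*' && line[i + 1] == '/') {
                inComment = false;
                ++i;  // skip the '/'
            }
        }
    }
    return result;
}
```

The same text, for instance `add r0, r0, #1`, is code or comment depending only on that carried state, which is the piece of information a per-line state slot would have to hold.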

           
          • Daniel

            Daniel - 2020-04-12

            GNU assembler allows multi-line /* and */ comments.
            (Other assemblers also might as well, but I only know about the GNU assembler.)

             