The current LexAsm.cxx is actually only suitable for Intel/MASM assembly files. I would like to work with ARM assembly files (ARMv4T) using the GNU assembler.
It's not too different from the existing lexer though. The main differences are:
* Single-line comments need their character changed (@)
* Multi-line comment ability must be added (the usual C-style /* and */)
* IsAWordChar should be letters, digits, _, ., and $ (not ?)
* IsAWordStart is just letters, _, ., and $ (no digits)
* Label support (including numbered labels) would be nice but not necessary.
I am not a C++ programmer so unfortunately my ability to help on this one is quite limited. I think I could easily edit the small stuff in if someone else can do the multi-line comment support.
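The word-character rules listed above could look roughly like the following. This is a standalone sketch, not the code in LexAsm.cxx: the two predicates reuse the names from the list, while WordLength is a hypothetical helper added only to show the predicates in use.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Characters allowed inside a GNU as symbol: letters, digits, '_', '.', '$'.
// '?' is deliberately not accepted.
inline bool IsAWordChar(int ch) {
    return ch >= 0 && ch < 0x80 &&
        (std::isalnum(ch) || ch == '_' || ch == '.' || ch == '$');
}

// Characters allowed at the start of a symbol: same set, minus digits.
inline bool IsAWordStart(int ch) {
    return ch >= 0 && ch < 0x80 &&
        (std::isalpha(ch) || ch == '_' || ch == '.' || ch == '$');
}

// Hypothetical helper: length of the symbol starting at pos, or 0 if the
// character at pos cannot start a symbol.
inline std::size_t WordLength(std::string_view text, std::size_t pos) {
    if (pos >= text.size() ||
        !IsAWordStart(static_cast<unsigned char>(text[pos]))) {
        return 0;
    }
    std::size_t end = pos + 1;
    while (end < text.size() &&
           IsAWordChar(static_cast<unsigned char>(text[end]))) {
        ++end;
    }
    return end - pos;
}
```

With these rules a symbol that ends in a digit is scanned as one token, and a token that begins with a digit is not treated as a symbol at all.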
There are currently 3 assembler lexers - LexA68K, LexAsm and LexMMIXAL. There was a proposal for a GNU assembler lexer [feature-requests:#1072] but discussion petered out.
Maybe we should be thinking about writing an extra tool, a lexer-compiler. SublimeText has an effective simple stack-based meta-language for defining lexers, but it's interpreted rather than compiled, though it seems efficient enough in practice.
We could use flex/bison for that, but ideally, I'd want users to be able to write and experiment with their own lexers at run-time, without having to recompile, for now and future languages, so a bytecode interpreter would be the thing to aim for. Better yet, why not expose the lexlib via Lua?
Anyway, I'll probably be looking into the ARM lexer soon. And I'll be trying to translate e.g. LexAsm.cxx into something more general along the way which would be expressive enough to subsume all (reasonably possible) assemblers. Really, writing assembler parsers should be as easy as it gets, right? Cheers!
Some downstream projects use various lexing tools. Probably the most widespread is Scintillua https://foicica.com/wiki/scintillua
Well that is amazing. I'll be reading up on this today. Thank you Neil.
I've been looking at the assembly lexers and whether the change would really be so minimal (I was hoping to simply add a properties file more or less). There is an interesting variety in the approaches of our lexer writers though, which makes for gripping reading.
The assembly guys were expectedly circumspect, but in the CSS lexer I found this poster-worthy gem:
which is called at (each and) every colorization call. In most cases it would be more efficient to have parsed the whole doc once in its entirety. I'm in the process of figuring out if this may be attributed to limitations with the available interface, or whether it was up to their free choice.
In any case, I'm glad that after all these years there seems to still be plenty to occupy a newcomer contribution-wise.
It's just a simple implementation - the author of that change probably expects extended CSS files to be short. If it is a problem then keeping a nesting level in line state is easy.
I've never had any complaints myself. ( I guess you're not supposed to peek into your favourite restaurant's kitchen lest it put you off. ) But this may need to be looked at for improvement given bloaty web frameworks.
Regarding the keeping of a nesting level, I'm not yet clear on how Colorize is called. My guess was that something clever was going on, to facilitate styling only (more or less) the text that is actually being drawn, making jumps to arbitrary lines efficient. But a cursory glance suggests that to a styler a document looks like a non-seekable stream?
Is EnsureStyledTo called for everything up to the current pos?
I haven't given it much thought but it seems to me that "smart styling" would involve a lexer's ability to be styling forwards and backwards from a given pos.
The lexer is supposed to lex the range given it. It may read anywhere in the document if necessary. Writing styles will generally just be the range although it is possible to drop back to fix up a multi-line element when the end delimiter is found. There is a single range of validly styled text from 0 to endStyled.
So it's the most flexible and least restrictive approach regarding access. Makes good sense of course.
I was wondering whether I could assume there are any hard guarantees. For example, whether a lexer call to style from pos A through B implies that pos 0 through A has already been styled.
I believe that's what I was looking for. That would make good sense too for the current conception of a lexer which is supposed to handle indentation/folding and not just, basically, character classification. Your earlier remark that
is obviously why I asked, which now too makes sense.
I'll be thinking about an alternate, possibly stricter interface for lexers. In particular I'd like to separate folding from highlighting. Because the highlighting of numbers, strings, etc. in the current couple of rows shouldn't depend on what happened 1k lines before. But that's a future problem. I'm reading my old ARM reference manuals right now, working on the OP's lexer. Cheers!
Last edit: Dejan Budimir 2019-10-23
Maybe a quick manual patch? Line 424 in asm.properties reads style.asm.0=$(style.asm.0) but should read style.as.0=$(style.asm.0).
Committed as [4501b1].
No worries then. I'll scrap my stuff and try to merge all assembly-related lexing into LexAsm.cxx now.
I'm looking forward to a chrome-less version of SciTE. (I consider the annotations and code completion popups to belong to chrome as well.) Scintilla's read-only style is still considered experimental but it would be perfect for emulating lists that are actually inline with the text.
I keep getting sidetracked. I've fixed some issues with the python scripts which I'll be posting soon.
The OP's feature request is a motion to make the existing lexer more configurable, but there are certain ARM-specifics that I think would warrant an independent lexer. So I'm not sure yet how to approach this.
There's an additional (interesting!) complication with the introduction of the in-code syntax switch .intel_syntax. This, plus another (very interesting!) potential complication is the ability to switch into Thumb mode (for older targets that are from the pre-UAL era of major syntax differences). But I've never written Thumb so have no ready knowledge of the nuances.
The last official migration guide that mentions GNU Assembler seems to be Version 0605.
The GNU Assembler uses different comment chars depending on the target. There are also some historic quirks to it. I've been looking for documentation and, strangely, found the best one to be in a custom/non-official (?) version of its manual (PDF), so that is a huge help. But there seems to be a difference between 32bit ARM and 64bit ARM which isn't documented even there.
For absolute starters -- to get things running -- it would be simple to add every possible mnemonic to a keyword list but the ARM syntax is heavily modular (PDF) and it feels like an utter waste of potential. E.g. almost every instruction can have one of 11 suffixes attached to mark it for conditional execution. (There's also the latest, most definitive version of above ARM Architecture Reference Manual here (PDF) but it's 45M and 7k9 pages and it crashed my PDF reader twice before it rendered).
This is just a sampling of the design decisions to be made. I reckon it's good to have them mentioned here for any future contributors. Cheers!
Yes, .arm and .thumb (or .code32 and .code16) just set the code type and then affect all following code, so it's possible to have the mode set in an include file and then just trail out past that include directive. That said, all the Thumb instructions are a subset of the ARM instructions, even LSL and LSR and such are accepted in ARM code as being aliases for a MOV. So you could (I think) simply always lex for ARM code without worrying about it too much. The assembler will still give an error even if the visuals aren't perfect, same as if you typed "break" at an invalid place of a C file you'd see that as a keyword but the compiler would reject it.
Also, the "conditional" codes come from a fixed list of 17 possibilities
One part that makes it a little easier is that the conditionals always go at the same part of each particular instruction, so the best way to parse for keywords might be to parse the instruction prefix (eg ADD), parse for a possible conditional (eg EQ), parse for other possible flags (eg: S, L, etc), and then the whole thing can get marked as a keyword if it combines properly.
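The prefix / conditional / flag parsing described above might be sketched like this (C++17). Everything here is illustrative rather than taken from any existing lexer: the base table is a tiny hypothetical sample, only the S flag is handled, input is assumed to be upper case, and the pre-UAL ordering (condition before S, as in ADDEQS) is assumed. The 17 condition strings are my reading of the "fixed list" (15 condition mnemonics plus the HS and LO aliases), since the list itself is not reproduced in the thread.

```cpp
#include <array>
#include <optional>
#include <string>
#include <string_view>

struct Mnemonic {
    std::string base;   // e.g. "ADD"
    std::string cond;   // e.g. "EQ", empty when unconditional
    bool setsFlags;     // true when a trailing S was present
};

// Assumed condition list: 15 mnemonics plus the HS/LO aliases.
constexpr std::array<std::string_view, 17> kConds = {
    "EQ", "NE", "CS", "HS", "CC", "LO", "MI", "PL", "VS",
    "VC", "HI", "LS", "GE", "LT", "GT", "LE", "AL"};

// Hypothetical sample of base mnemonics and whether they accept S.
struct BaseInfo { std::string_view name; bool allowsS; };
constexpr std::array<BaseInfo, 5> kBases = {{
    {"ADD", true}, {"SUB", true}, {"MOV", true}, {"BL", false}, {"B", false}}};

// Try each base; accept only if the remainder is [cond][S] with nothing left.
std::optional<Mnemonic> SplitMnemonic(std::string_view word) {
    for (const BaseInfo &base : kBases) {
        if (word.substr(0, base.name.size()) != base.name) continue;
        std::string_view rest = word.substr(base.name.size());
        std::string_view cond;
        if (rest.size() >= 2) {
            for (std::string_view c : kConds) {
                if (rest.substr(0, 2) == c) { cond = c; break; }
            }
        }
        rest.remove_prefix(cond.size());
        bool setsFlags = false;
        if (base.allowsS && rest == "S") {
            setsFlags = true;
            rest.remove_prefix(1);
        }
        if (rest.empty()) {
            return Mnemonic{std::string(base.name), std::string(cond), setsFlags};
        }
    }
    return std::nullopt;  // not a recognised combination
}
```

Requiring the remainder to be fully consumed is what resolves overlaps such as BLE, which only parses as B plus LE. A lexer could then style the whole word as a keyword, or style the condition separately if that is made an option.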
I'd say that the big deals, compared to the current lexer which I'm kinda able to use, are that comments should work right and symbol names should be parsed properly. The current lexer doesn't handle digits in symbol names so MODE3 is parsed as MODE and then the number 3.
Indeed, I think this is how I'll do it for now. Though I'm uncertain whether I should extend/rewrite Scintilla's already existing lexer or add a new one. Writing a new one would be easier. But then you're saying that the current lexer is almost usable. And it groks Intel syntax, not GNU as's usual AT&T. I was flirting with the idea of writing a GNU as lexer that can do it all, but obviously that would be a much greater endeavour. And who cares if SciTE's "Language" menu gets too big? It should be seen as a mark of strength.
Good catch. I pulled the 11 out of a hat. There are 4 bits in the opcode reserved for this, so 16 would be the number to aim for (barring syntactic sugar), but the arm_arm.pdf (ARM Arch Ref vintage 2005) has a nice chapter on it.
I do feel though that this should be made optional, if we're already identifying the mnemonic as a composite, I'm in favor of making use of that extra info instead of discarding it. I think the user should decide whether to color it the same or highlight it. That's why I'll be writing a new one. We can think about merging the assembly lexers later. I should have something by the end of the day today.
I do, as long menus are difficult to navigate. Languages are not added to the menu unless they are quite popular.
We could always argue about the popularity of TADS. Relative to assembly. (I personally love that it is there.)
Was there any progress with the main issue -- making the lexer LexAsm.cxx adaptable to different comment conventions? It's not hard to make the comment character a configuration option, and to make C-style comments an option too. I will prepare a patch if there is any appetite to include it.
The current lexer (and the list of instructions that appear in Geany's config file) seems not to suit any particular machine or assembler. The best thing, I think, would be to follow closely the machine-independent parts of the conventions of the GNU assembler, and make things like comment syntax modifiable in the config file.
I can not recall any other recent discussions about the assembler lexer.
C-style stream comments should be in the already defined SCE_ASM_COMMENTBLOCK style and can be active whenever the lexer identifier (GetIdentifier()) is SCLEX_AS.
For the configurable comment character, make sure there is a default that matches current behaviour and name the property starting with "lexer.as". According to https://en.wikipedia.org/wiki/GNU_Assembler#Single-Line_comments it's "//" on AArch64, so it should probably be a string instead of a character. This property should probably be ignored in MASM (SCLEX_ASM) mode.
The patch should include an example file and corresponding SciTE.properties in lexilla/test/examples/as to ensure that this feature gets tested against future changes. See lexilla/test/README for information on the lexer testing framework.
This works for many cases.
I think that this option should only operate on GNU as files, not MASM files. An empty value should also return to the standard '#'. Therefore, there should be two variables, the current LexerAsm::commentChar and a new OptionsAsm::commentPrefix. Near the start of LexerAsm::Lex this should be resolved (depending on GetIdentifier() and a non empty value) to a local variable that is then used in the detection.
The DefineProperty call should have an explanation string like other lexer-unique properties.
Since AArch64 uses a 2 character string "//" testing against [0] won't work - use StyleContext::Match.
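The resolution step suggested in this review could be sketched as below. This is a standalone illustration, not a patch against LexAsm.cxx: ResolveCommentPrefix and MatchesAt are hypothetical names, isGnuAs stands in for the GetIdentifier() check against SCLEX_AS, and MatchesAt only mimics the kind of multi-character test that StyleContext::Match would perform inside the real lexer.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Pick the comment introducer once, near the start of lexing.
// - commentChar is the lexer's built-in default for the current mode.
// - commentPrefix is the user-configured option; it is honoured only for
//   GNU as files and only when non-empty.
std::string ResolveCommentPrefix(bool isGnuAs, char commentChar,
                                 std::string_view commentPrefix) {
    if (isGnuAs && !commentPrefix.empty()) {
        return std::string(commentPrefix);
    }
    return std::string(1, commentChar);
}

// Compare the whole prefix, not just its first character, so that a
// two-character introducer such as "//" is detected correctly.
bool MatchesAt(std::string_view line, std::size_t pos, std::string_view prefix) {
    return pos <= line.size() && line.substr(pos, prefix.size()) == prefix;
}
```

An empty option falls back to the built-in character, and MASM mode ignores the option entirely, which keeps existing behaviour as the default.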
Valid points. This is a quick hack. For simple languages like assembly, we should consider writing a minimal lexer that can be configured more comprehensively from a properties file. Also, the existence of an xyz.properties file in, say, the user config folder should run opened .xyz files through that lexer by default.
What do I mean by simple language? Perhaps a language that can be lexed on a line-by-line basis, meaning that each line is lexed as an independent unit, meaning that you could start lexing a file in the middle at line #(X+1) without having lexed lines #1 through #X. This property would obviously allow for optimizations for very large files. Any thoughts?
If /* stream comments */ are allowed then assembler can't be lexed line by line.
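A small standalone sketch of why that is: whether a line is code or comment depends on a state carried over from the previous line. The function names are hypothetical, and the scan is deliberately simplified (it ignores string literals and single-line comments, which a real lexer would have to handle).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns whether the scanner is still inside a /* ... */ comment at the
// end of the line, given whether it was inside one at the start.
bool EndsInsideBlockComment(const std::string &line, bool startsInside) {
    bool inside = startsInside;
    for (std::size_t i = 0; i + 1 < line.size(); ++i) {
        if (!inside && line[i] == '/' && line[i + 1] == '*') {
            inside = true;
            ++i;  // skip the '*' so "/*/" is not read as open-then-close
        } else if (inside && line[i] == '*' && line[i + 1] == '/') {
            inside = false;
            ++i;  // skip the '/'
        }
    }
    return inside;
}

// Per-line end states, analogous to storing one flag per line so that
// lexing can restart at line N using the state saved for line N-1.
std::vector<bool> ComputeLineStates(const std::vector<std::string> &lines) {
    std::vector<bool> states;
    bool inside = false;
    for (const std::string &line : lines) {
        inside = EndsInsideBlockComment(line, inside);
        states.push_back(inside);
    }
    return states;
}
```

The same text, for example a line containing only an instruction, is styled differently depending on the incoming state, so a line cannot be lexed in isolation unless that one bit of state has been saved for the line before it.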
GNU assembler allows multi-line /* and */ comments. (Other assemblers also might as well, but I only know about the GNU assembler.)