From: Brian C. <B.C...@po...> - 2004-06-03 09:49:08
Just remembered something I meant to post a week or so ago. I happened to be
talking to Philip Hazel, who told me that he was intending to release a new
version of PCRE shortly (the Perl-Compatible Regular Expressions library). A
new feature he's going to include is the ability to indicate partial matches.
That is, if the expression runs out of input data but the RE has matched so
far, then it can indicate this condition. It's intended for match-as-you-type
applications.

http://www.pcre.org/

This suggests to me another approach which Joe could use for syntax
highlighting. In each state, you give a set of regular expressions with
colours and next state. As you type, you match expressions from an anchor
point; if more than one matches you colour using the first. When exactly one
expression matches and is a complete match, then you move the anchor point
on, switch state, and start again.

In fact, for simple highlighters you probably don't need a 'state' at all,
just the anchor point. A simple one-state XML highlighter might look like:

  <?.*?>              PI
  <!--.*-->           Comment
  <!\[CDATA\[.*\]\]>  Cdata
  <!.*>               Entity
  </.*>               EndTag
  <.*/>               EmptyTag
  <.*>                StartTag
  .                   Content

A more advanced one might want to use states to do more intelligent checks:
e.g. for comments,

  :State0
    <!--  Comment  :State1
  :State1
    -->   Comment  :State0
    --    Error    :State0
    .     Comment  :State1

That in itself simplifies the design of highlighters considerably (and
pretty much eliminates the need for 'recolor'), but it also adds a lot of
expressive power. Using back reference matching, you could for example
match the perl construct s/foo/bar/ using

  s(.).*\1.*\1

[which also matches s,foo,bar, s@foo@bar@ etc]. If you want to colour the
individual parts differently then you'd also need a way to indicate it:
e.g.

  s(.)(.*)(\1)(.*)(\1)   Keyword Delim Data Delim Data Delim

Means "color the first parenthesised expression as Delim, the second as
Data, the third as Delim, the fourth as Data, the fifth as Delim, and
everything else as Keyword"

Precompiling the REs at startup time should still make this a pretty
efficient approach.

What do you think?

Regards,

Brian.
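P.S. To make the match-as-you-type idea concrete: assuming the new release
exposes partial matching as a PCRE_PARTIAL option with a PCRE_ERROR_PARTIAL
return code (I'm guessing at those names, so treat this as an untested
sketch rather than gospel), checking one highlighter rule against the text
typed so far might look like:

--------------------------------------------------------------------
#include <pcre.h>
#include <stdio.h>
#include <string.h>

/* Classify the text typed so far against one highlighter rule. */
static const char *classify(pcre *re, const char *text)
{
    int ovector[30];
    int rc = pcre_exec(re, 0, text, (int)strlen(text), 0, PCRE_PARTIAL,
                       ovector, 30);
    if (rc >= 0)                  return "complete match";
    if (rc == PCRE_ERROR_PARTIAL) return "partial match - keep typing";
    return "no match";
}

int main(void)
{
    const char *err;
    int offset;
    pcre *re = pcre_compile("<!--.*-->", 0, &err, &offset, 0); /* Comment rule */

    if (!re) { fprintf(stderr, "compile error: %s\n", err); return 1; }

    printf("%s\n", classify(re, "<!-"));         /* expect: partial  */
    printf("%s\n", classify(re, "<!-- x -->"));  /* expect: complete */
    printf("%s\n", classify(re, "int x;"));      /* expect: no match */
    return 0;
}
--------------------------------------------------------------------

A highlighter would run every rule of the current state like this, and only
move the anchor point on once exactly one rule reports a complete match.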
From: Brian C. <B.C...@po...> - 2004-06-03 10:30:46
On Thu, Jun 03, 2004 at 10:49:02AM +0100, Brian Candler wrote:
> [which also matches s,foo,bar, s@foo@bar@ etc]. If you want to colour the
> individual parts differently then you'd also need a way to indicate it: e.g.
>
> s(.)(.*)(\1)(.*)(\1)   Keyword Delim Data Delim Data Delim

And to demonstrate how simple it is to use for colouring:

--------------------------------------------------------------------
#include <pcre.h>
#include <stdio.h>

int main(void)
{
    pcre *re;
    const char *err;
    int offset;
    int match;
    int ovector[30];
    int *op = ovector;

    re = pcre_compile("^s(.)(.*)(\\1)(.*)(\\1)$", 0, &err, &offset, 0);
    if (!re) {
        fprintf(stderr, "RE compile error at pos %d: %s\n", offset, err);
        return 1;
    }

    match = pcre_exec(re, 0, "s/foo/bar/", 10, 0, 0, ovector, 30);
    fprintf(stderr, "Match result %d\n", match);
    while (match > 0) {
        fprintf(stderr, "%d..%d\n", op[0], op[1]);
        op += 2;
        match--;
    }
    return 0;
}
--------------------------------------------------------------------

$ gcc -Wall -I/usr/local/include -L/usr/local/lib -o pcre pcre.c -lpcre && ./pcre
Match result 6
0..10
1..2
2..5
5..6
6..9
9..10

which conveniently can be used directly as offsets for colouring.

Cheers,

Brian.
From: <ja...@av...> - 2004-06-03 14:50:00
Brian Candler <B.C...@po...> wrote:
> I happened to be talking to Philip Hazel, who told me that he was intending
> to release a new version of PCRE shortly (the Perl-Compatible Regular
> Expressions library).

It's tempting :-)  Which means I'm thinking about it - there have been
numerous requests for pcre's for ^K F, plus I need something to parse
compiler error messages for ESC c.

There are still many issues to resolve: I either have to convert the edit
buffer (from the anchor point to the end of the window) into a big string to
pass to the RE parser, or have to do major hacking of the PCRE code (to make
it read streams).

Note also that even PCREs can't parse perl:

> the perl construct s/foo/bar/ using
>
> s(.).*\1.*\1

doesn't work for s<foo><bar>.

Still... I'm more likely (for expediency) to just expand the current parser
so that it can store \1 along with the state. Also, it would not be hard to
solve the above problem by adding a translation table: "<" is converted to
">" and then stored in \1.

Another idea is to use a real interpreted language for highlighting (and
other things in JOE). The entire state of the language (pgetc() becomes a
call with current continuation) would be stored along with each line. Maybe
it wouldn't be too bad (slow) if you use copy-on-write semantics (and
garbage collection) for managing all data within the language. Then I could
claim that it could parse anything.
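Just to illustrate what I mean by a translation table (a sketch only,
nothing like this exists in joe yet): when the parser captures the delimiter
that gets stored as \1, it would first map opening brackets to their closing
partner, something like:

--------------------------------------------------------------------
#include <stdio.h>

/* Map an opening delimiter to the character that should be stored in \1;
 * non-bracket delimiters (/, comma, @, ...) close with themselves. */
static int close_delim(int c)
{
    switch (c) {
    case '<': return '>';
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return c;
    }
}

int main(void)
{
    int opens[] = { '/', ',', '<', '{' };
    int i;

    for (i = 0; i < 4; i++)
        printf("s%c...  -> expect '%c' to end each part\n",
               opens[i], close_delim(opens[i]));
    return 0;
}
--------------------------------------------------------------------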
From: Brian C. <B.C...@po...> - 2004-06-03 15:42:50
On Thu, Jun 03, 2004 at 10:52:23AM -0400, ja...@av... wrote:
> Note also that even PCREs can't parse perl:
>
> > the perl construct s/foo/bar/ using
> >
> > s(.).*\1.*\1
>
> doesn't work for s<foo><bar>.

Oh, I wasn't aware of that particular format, although I expect you could
still write an RE to do it. Hmm, checks `man perlop`:

    If the PATTERN is delimited by bracketing quotes, the REPLACEMENT has
    its own pair of quotes, which may or may not be bracketing quotes,
    e.g., "s(foo)(bar)" or "s<foo>/bar/".

Yuk. Well, we can still match using two REs:

  s([^<{\(\[]).*\1.*\1
  s(<.*>|{.*}|\(.*\)|\[.*\])(<.*>|{.*}|\(.*\)|\[.*\]|([^<{\(\[]).*\3)

which, if you are feeling suitably perverse, could be combined into a single
RE. Note that you probably also want the PCRE_UNGREEDY flag set, or else
replace .* with .*? in each case.

Difficulties then arise with quoting: e.g. if I do s[foo][bar]
can I include \] within foo, and it doesn't terminate the string?

> Still... I'm more likely (for expediency) to just expand the current parser
> so that it can store \1 along with the state. Also, it would not be hard to
> solve the above problem by adding a translation table: "<" is converted to
> ">" and then stored in \1.

True, and your existing state machine is also very fast, and minimising
external dependencies is also good.

I still think it would be nice if:

(a) the language description could be easier to maintain, easier to read,
    and potentially more portable. (It's already been noted here that there
    doesn't seem to be a commonly-accepted format for syntax colouring
    specifications, and perhaps this is because nobody fully agrees on the
    balance between expressiveness and complexity/efficiency)

(b) the language description could be more powerful, in particular to be
    able to recognise nested balanced constructs like { ... } and
    <foo> ... </foo>

You could write a CFG parser for Joe, but perhaps that's going over the top.
A push/pop stack would work, but the state machine descriptions would still
be ungainly. Perhaps a preprocessor is the solution: enter the syntax
description in a 'nice' format (regular expressions for tokens, context-free
grammar for balancing constructs) which is compiled into a
state-machine-with-stack representation for joe to load at run-time.

Another idea is to modularise at the C level. If you can decouple the
existing state machine so that it becomes a handful of functions:

  void create(void **state)         -- build a new context
  int  chomp(void **state, int ch)  -- process one character, update state,
                                       return colouring action(s)
  void destroy(void **state)        -- dispose

then you could plug in other modules written in C (e.g. LALR parser, XML
parser). Languages which need nothing more than the simple DFA would just
use that (fast, compact). Languages which have lots of special cases, like
Perl, could have a custom-written parser. Perhaps the first line of the .jsf
file would be the name of the C module to use, and the rest would be passed
into that module to initialise it.

It would be good to have some way to compare state: e.g. after inserting a
character in a line and reprocessing to the end of the line, if the state is
identical to the old state at the start of the next line, then there's no
need to continue parsing further down the screen because nothing needs
recolouring. (Maybe joe has some of this already - I've not actually been
browsing the source code. It may be OK to run the simple DFA for a whole
screenful of text on every keystroke, but perhaps not a more complex
parser.)

> Another idea is to use a real interpreted language for highlighting (and
> other things in JOE). The entire state of the language (pgetc() becomes a
> call with current continuation) would be stored along with each line. Maybe
> it wouldn't be too bad (slow) if you use copy-on-write semantics (and
> garbage collection) for managing all data within the language. Then I could
> claim that it could parse anything.

That would certainly be cool. Perhaps Ruby with its continuations would do
the job nicely, and it links very easily into C. And it's a hell of a lot
easier to write than LISP :-)

Regards,

Brian.
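P.S. To make the create/chomp/destroy idea above a bit more concrete: the
module interface might be nothing more than a table of those three
functions. A do-nothing 'plain text' module would then look roughly like
this (all names invented, just a sketch):

--------------------------------------------------------------------
#include <stdio.h>

/* A module is just three entry points; 'chomp' returns a colour code. */
struct highlighter {
    void (*create)(void **state);           /* build a new context     */
    int  (*chomp)(void **state, int ch);    /* process one character   */
    void (*destroy)(void **state);          /* dispose of the context  */
};

/* Trivial module: no state, everything coloured as 0 ("plain"). */
static void plain_create(void **state)        { *state = 0; }
static int  plain_chomp(void **state, int ch) { (void)state; (void)ch; return 0; }
static void plain_destroy(void **state)       { (void)state; }

static const struct highlighter plain_module = {
    plain_create, plain_chomp, plain_destroy
};

/* joe would pick the module named in the .jsf file; here we just drive
 * the interface by hand over one line of text. */
int main(void)
{
    const char *line = "hello, world";
    const char *p;
    const struct highlighter *h = &plain_module;
    void *st;

    h->create(&st);
    for (p = line; *p; p++)
        printf("%c -> colour %d\n", *p, h->chomp(&st, (unsigned char)*p));
    h->destroy(&st);
    return 0;
}
--------------------------------------------------------------------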
From: Mikhael G. <mi...@ho...> - 2004-06-04 00:56:17
On 03 Jun 2004 16:42:45 +0100, Brian Candler wrote:
>
> Difficulties then arise with quoting: e.g. if I do s[foo][bar]
> can I include \] within foo, and it doesn't terminate the string?

Yes, you may quote any char with backslash, including backslash itself, so
this is ok:

  m{\}};
  m{\\\}};

and this is a syntax error:

  m{\};      # not terminated
  m{\\}};    # excessive char
  m{\\\\}};  # excessive char

To handle quoting, I think, it is enough to replace ".*" in your regular
expression that tries to match perl regular expressions with:

  ([^\\]|(\\.))*

Actually there are these multi-char quotes that may be handled too:

  \033  \x1B  \x{263a}  \c[  \N{name}

The quotes in regexps are the same as within double quoted strings, so ".*"
in your original regexp is actually just a perl string.

> > Another idea is to use a real interpreted language for highlighting
> > (and other things in JOE). The entire state of the language (pgetc()
> > becomes a call with current continuation) would be stored along with
> > each line. Maybe it wouldn't be too bad (slow) if you use
> > copy-on-write semantics (and garbage collection) for managing all
> > data within the language. Then I could claim that it could parse
> > anything.
>
> That would certainly be cool. Perhaps Ruby with its continuations would do
> the job nicely, and it links very easily into C. And it's a hell of a lot
> easier to write than LISP :-)

Oh, please, no mandatory Ruby or Lisp, they are slow! :-)

Not that I'm against these or Perl, I just worry about speed and memory.

Regards,
Mikhael.
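P.S. Here is a quick test of that replacement with PCRE, with the delimiter
hard-coded to braces for brevity and the body made ungreedy as Brian
suggested earlier. Only a sketch, but it behaves like perl on the three
cases above:

--------------------------------------------------------------------
#include <pcre.h>
#include <stdio.h>
#include <string.h>

static void check(pcre *re, const char *subj)
{
    int ovector[30];
    int rc = pcre_exec(re, 0, subj, (int)strlen(subj), 0, 0, ovector, 30);

    if (rc > 0)
        printf("%-8s -> matched \"%.*s\"\n", subj,
               ovector[1] - ovector[0], subj + ovector[0]);
    else
        printf("%-8s -> no match\n", subj);
}

int main(void)
{
    const char *err;
    int offset;
    /* ^m\{ ( (?: [^\\] | \\. )*? ) \}   -- ungreedy body, as discussed */
    pcre *re = pcre_compile("^m\\{((?:[^\\\\]|\\\\.)*?)\\}", 0, &err, &offset, 0);

    if (!re) { fprintf(stderr, "compile error: %s\n", err); return 1; }

    check(re, "m{\\}}");     /* ok: the whole thing matches               */
    check(re, "m{\\\\}}");   /* matches m{\\}; the extra } is left over   */
    check(re, "m{\\}");      /* not terminated: no match                  */
    return 0;
}
--------------------------------------------------------------------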
From: Brian C. <B.C...@po...> - 2004-06-04 08:33:48
On Fri, Jun 04, 2004 at 12:56:03AM +0000, Mikhael Goikhman wrote:
> > That would certainly be cool. Perhaps Ruby with its continuations would do
> > the job nicely, and it links very easily into C. And it's a hell of a lot
> > easier to write than LISP :-)
>
> Oh, please, no mandatory Ruby or Lisp, they are slow! :-)
>
> Not that I'm against these or Perl, I just worry about speed and memory.

Well, I wouldn't make it compulsory to use them. If you are processing a
file which can be handled adequately by the built-in deterministic state
machine, I'd use that. If you're doing something more difficult then I'd
want to be able to link either to a more advanced C parser, or to something
in a scripting language.

It's a common misconception that these languages are necessarily 'slow'.
Typically you have a high startup overhead when loading the interpreter,
which means it's certainly slow to write a standard shoot-once CGI, for
example. But once loaded, they can zip along nicely. With Ruby you can call
the interpreter from C (getting it to run a particular function in an
object, for example) and it will return happily, maintaining its internal
state for the next call, so you're not tearing down and rebuilding an
interpreter each time. Emacs users do lots of clever things in LISP, and
don't complain about lack of speed. And that was long before you could get a
2GHz processor for $50.

Actually, I can see two separate jobs which an external callout could do:

(1) Syntax marking/tokenising [not just colouring]. As the content of the
    edit buffer changes, bits of reparsing are done as necessary for the
    purposes of display, but also in case the editor functions need to know
    what sort of token we are currently in or next to. Each character would
    know which token it belongs to, a token type, and possibly some
    ancillary data (e.g. nesting depth).

    There is clearly optimisation needed here, so that each keystroke
    doesn't cause the entire edit buffer to be reparsed - or even the whole
    visible screen.

(2) A hook to intercept keystrokes and/or edit events. These keystrokes
    could take actions based on the current context, i.e. the results of
    the syntax marking phase are made available to it.

So for example, let's say we want an intelligent XML editor. We use the
syntax marking to be able to decide where there is a start tag, an end tag,
or an empty tag, and note the nesting depth (so given a start tag we can
locate the corresponding end tag, or vice versa).

The keystroke hook would then be used to intercept keypresses '<' and '>'
and implement rules such as:

- if you type '<', and you are not currently within a tag, then a new tag
  is created using a pop-up line:

    Tag: <tag attr="val">

  which inserts <tag attr="val"></tag> into the buffer and leaves the cursor
  between the two. Tab-completion will show you which tags and attributes
  are valid according to the DTD for this document.

- if you type '<' and are within an end tag, then skip to the matching
  start tag

- if you type '<' and are within any other tag, then skip to the previous
  tag

- if you type '>' and are within a start tag, then skip to the matching end
  tag

- if you type '>' anywhere else then skip to the next tag

Now, actually this would have very little impact on performance. The
keystroke hook just has to say

  def keystroke(k)
    case k
    when '<'
      puts "do some stuff"
    when '>'
      puts "do some other stuff"
    end
  end

and the overhead of calling this function and it testing a case statement is
actually very low. The re-parsing is the clever part.

I think that whenever you type and you are within a token (or adjacent to
it), you need to reparse from the start of that token. If the parser state
at the end of the token is the same as it was before, then you can stop
there. If not, you continue parsing until you get to a point where the
parser state matches what was there originally, or until you hit the end of
the screen. If this optimisation is not done carefully then it certainly
*could* be slow.

But even then, it would still be possible to write an XML parser in C, and
have the keystroke hook done in Ruby (say).

But maybe this is all just wishful thinking. One of the things I like so
much about joe is that it is very compact (372K for the entire source
tarball!) and does its job, editing text, with high speed and robustness.
All this extra icing perhaps belongs in a completely separate optional
package which can be added at build-time or linked dynamically at run-time.
Or perhaps you turn the whole thing inside-out and make the core editor
functions a library (libjoe); you can then call them *from* a scripting
language.

Cheers,

Brian.
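P.S. For what it's worth, here is the shape of the reparse loop I have in
mind. parse_line() and the state type are stand-ins (joe's real structures
will look nothing like this); the interesting part is the stopping
condition:

--------------------------------------------------------------------
#include <string.h>

typedef struct { int st; } pstate;      /* stand-in per-line parser state */

static int state_eq(pstate a, pstate b) { return a.st == b.st; }

/* Stand-in for "parse one line, given the state at its start". */
static pstate parse_line(pstate in, const char *line)
{
    pstate out = in;
    out.st = out.st * 31 + (int)strlen(line);   /* placeholder "work" */
    return out;
}

/* After an edit on line 'first', reparse downwards.  Stop as soon as the
 * state at the end of a line equals the state cached there before the
 * edit: nothing below it can need recolouring. */
static void reparse_from(const char **lines, pstate *cached, int first, int nlines)
{
    pstate zero = { 0 };
    pstate s = (first > 0) ? cached[first - 1] : zero;
    int i;

    for (i = first; i < nlines; i++) {
        pstate ns = parse_line(s, lines[i]);
        if (state_eq(ns, cached[i]))
            break;                      /* state converged: stop here    */
        cached[i] = ns;                 /* this line needs recolouring   */
        s = ns;
    }
}

int main(void)
{
    const char *lines[] = { "int main(void)", "{", "    return 0;", "}" };
    pstate cached[4] = { { -1 }, { -1 }, { -1 }, { -1 } };

    reparse_from(lines, cached, 0, 4);  /* initial parse: runs to the end   */
    reparse_from(lines, cached, 2, 4);  /* retype line 2: stops immediately */
    return 0;
}
--------------------------------------------------------------------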
From: Andy T. <tu...@mi...> - 2004-06-04 10:44:48
On Thu, Jun 03, 2004 at 04:42:45PM +0100, Brian Candler wrote:
> On Thu, Jun 03, 2004 at 10:52:23AM -0400, ja...@av... wrote:
> >
> > doesn't work for s<foo><bar>.
>
> Yuk. Well, we can still match using two REs:
>
> s([^<{\(\[]).*\1.*\1
> s(<.*>|{.*}|\(.*\)|\[.*\])(<.*>|{.*}|\(.*\)|\[.*\]|([^<{\(\[]).*\3)
>
> Difficulties then arise with quoting: e.g. if I do s[foo][bar]
> can I include \] within foo, and it doesn't terminate the string?

Yeah. But that's not too bad to get around. A negative look-behind assertion
can catch it.

Just stick a (?<!\\) in front of each closing char, e.g.:

  s(<.*(?<!\\)>|...)

--
Andy <tu...@mi...>
From: Brian C. <B.C...@po...> - 2004-06-04 11:36:37
On Fri, Jun 04, 2004 at 06:44:45AM -0400, Andy Turner wrote:
> > > doesn't work for s<foo><bar>.
> >
> > Yuk. Well, we can still match using two REs:
> >
> > s([^<{\(\[]).*\1.*\1
> > s(<.*>|{.*}|\(.*\)|\[.*\])(<.*>|{.*}|\(.*\)|\[.*\]|([^<{\(\[]).*\3)
> >
> > Difficulties then arise with quoting: e.g. if I do s[foo][bar]
> > can I include \] within foo, and it doesn't terminate the string?
>
> Yeah. But that's not too bad to get around. A negative look behind
> assertion can catch it.
>
> Just stick a (?<!\\) in front of each closing char, eg:
>
> s(<.*(?<!\\)>|...)

except that s<foo\\><bar> is valid, as well as s<foo\\\>bar><baz>

At this level of complexity it's almost certainly *simpler* just to move
from state to state:

  S0   s      -> S1
  S1   <      -> S2    # inside s<...>
  S2   \      -> S21   # escaped character
       >      -> S3
       .      -> S2
  S21  x      -> S22   # \xNN
       .      -> S2    # \ followed by any other single character
  S22  0-9a-f -> S23
  S23  0-9a-f -> S2
  S3   ... etc

but a more compact and easier-to-read notation for this would be very nice.
We also need some 'start capture' / 'end capture' / 'match backreference'.

Incidentally, I just had a brief E-mail exchange with Philip Hazel; he has
some arm problems right now which make it hard for him to type, so the new
PCRE is not likely to be released for a while. Also, because it's a
backtracking parser, it *does* really require the whole string to be copied
into a linear buffer first.

Regards,

Brian.
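P.S. Transcribed straight into C, that table comes out like this (the ERR
state is my addition for characters the table doesn't mention), which shows
both how mechanical it is and how much more verbose than the notation above:

--------------------------------------------------------------------
#include <ctype.h>
#include <stdio.h>

enum state { S0, S1, S2, S21, S22, S23, S3, ERR };

static enum state step(enum state s, int ch)
{
    switch (s) {
    case S0:  return ch == 's'  ? S1 : S0;
    case S1:  return ch == '<'  ? S2 : S0;
    case S2:  if (ch == '\\') return S21;        /* escaped character       */
              return ch == '>' ? S3 : S2;
    case S21: return ch == 'x' ? S22 : S2;       /* \xNN, or \ + one char   */
    case S22: return isxdigit(ch) ? S23 : ERR;
    case S23: return isxdigit(ch) ? S2  : ERR;
    default:  return s;                          /* S3, ERR: stay put       */
    }
}

int main(void)
{
    const char *input = "s<foo\\>bar\\x41>";     /* i.e. s<foo\>bar\x41>    */
    const char *p;
    enum state s = S0;

    for (p = input; *p; p++)
        s = step(s, (unsigned char)*p);
    printf("final state: %s\n", s == S3 ? "S3 (search part complete)" : "not S3");
    return 0;
}
--------------------------------------------------------------------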
From: <ja...@av...> - 2004-06-07 21:07:55
Latest CVS check-in has an improved highlighter: now you can store a string
along with the state. This means the perl and shell highlighters work
better:

  print <<EOF
  this " works
  EOF

and:

  $a =~ s<fo"o>(b"ar);

I haven't updated the Mason highlighter yet. There are still holes in the
perl highlighter to be patched.
From: Andy T. <tu...@mi...> - 2004-06-08 14:18:33
Attachments:
perlsyn.diff
Here's a patch to the Perl rules currently in CVS. It adds support for
Perl's quote-like operators (q, qq, qr, qx, qw). It also removes support for
"qr" as a keyword.

It doesn't support things like: q{foo{bar}baz}

--
Andy <tu...@mi...>
From: Andy T. <tu...@mi...> - 2004-06-09 23:02:02
Attachments:
perlsyn.diff
On Tue, Jun 08, 2004 at 10:18:29AM -0400, Andy Turner wrote:
> Here's a patch to the Perl rules currently in CVS. It adds support for
> Perl's quote-like operators (q, qq, qr, qx, qw). It also removes support
> for "qr" as a keyword. It doesn't support things like: q{foo{bar}baz}

Now attached is an updated version. It adds support for __END__, __DATA__
and =pod blocks.

--
Andy <tu...@mi...>