Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

Unlimited Recursion in the Syntax Highlighter

2008-10-29
2013-05-02
  • As you may know, JOE syntax-highlighting descriptions can have subroutines.  However, the depth of the call graph is limited to five.  The reason is that syntax highlighting in JOE is implemented with a state machine.  States in a subroutine are actually just appended to the main state machine, but in a different namespace for each invocation, so to speak.  Essentially, the return stack for each invocation of each subroutine is encoded in the state number.  As you can imagine, with a complicated call graph, the number of states can explode.  And recursion?  Fuggetaboutit!

    A while back, I ran into two cases where I would have liked to have recursive syntax highlighting.  The first was GCC machine descriptions, where some of the strings contains C code, but to know which ones, you have to be able to count complete RTL (lisp-like) expressions.  That means matching parentheses.  The second was makefiles, where it would be nice to know whether you are inside a variable reference and/or function call, which both look like either $(...) or ${...}, but variable references and function calls can contain variable references and function calls with unlimited nesting.

    A bit over a year ago, I changed the way subroutines work to allow recursion.  I never quite finished (Is software ever finished?), and the changes have sat on my hard drive since.  The recent activity by Joe has rekindled my interest in improving JOE.

    My changes work.  I just haven't yet tried to recover memory once it becomes unused.  Here is what I did:

    Now each subroutine gets its own state machine (high_syntax).  Each state machine is uniquely identified by the triplet of file name, subroutine name, and set of parameters defined.  If the same subroutine is called with the same parameters defined from two different places, then both calls will point to the same state machine.  Instead of calls and returns being resolved as ordinary state transitions, they are treated specially.  If a transition is a call, then it points to the state machine for the subroutine (high_cmd has a new field, high_syntax *call).  A call transition takes you to the first state of the subroutine's state machine.  So once you go to the new state machine, how do you get back?  You use a run-time stack.

    The problem is you can't just have one stack.  JOE doesn't parse the whole file from the beginning every time it needs to update the highlighting.  It saves the state of the highlighter at the beginning of every line (at most).  The entire run-time stack is part of the state of the highlighter.  The stack used to be encoded in the state number, but now it is not.  That means that there needs to be some way of recording what the entire run-time stack of the highlighter is at every line.  One way to do that would be to make a copy of the entire stack for each line, but that would be very wasteful.  Most of the time, many different lines will have identical stacks.

    Consider the case of makefiles, where we might call a function every time we see "$(" or "${", with a different parameter defined depending on which one we saw.  We would return once we saw the matching ")" or "}", which one determined by the parameter.  Lets say we edit a 1000 line file that contains such expressions, and none of the expressions have more than three levels of nesting.  If "0" is an empty stack, "P" is a frame representing a subexpression in parentheses, and "C" is a frame representing a subexpression in curly brackets, then the only possible return stacks are:

    0
    P
    C
    PP
    PC
    CP
    CC
    PPP
    PPC
    PCP
    PCC
    CPP
    CPC
    CCP
    CCC

    There are only 15 possibilities, therefore at least 986 of the lines must have non-unique return stacks.  We could have one copy of each unique stack and have each line point to theirs.  Furthermore, many of the stacks have common prefixes.  Both "PCP" and "PCC" start with "PC".  Those stacks could simply have an entry for the last frame and a pointer to their prefix.  What you end up with is a call tree:

    0
    +P
    |+P
    ||+P
    ||+C
    |+C
    |.+P
    |.+C
    +C
    .+P
    .|+P
    .|+C
    .+C
    ..+P
    ..+C

    This is the dynamic call tree, not to be confused with the static call graph.  The state saved for each line need only point to the node corresponding to its return stack.  Pointer equality can be used to determine whether two return stacks are the same (highlight_state has a new field, high_frame *stack).

    Initially, a given syntax has an empty call tree (high_syntax has a new field, high_frame *stack_base).  When a call is made, I look through the children of the current frame, if any.  If one of the children is already the frame I need, then I simply update the stack pointer to that child.  If such a child does not exist, I create a new one.

    A stack frame contains a pointer to its parent, a pointer to its first child, a pointer to its next sibling, a pointer to the current subroutine, and a pointer to the return state.  It is not enough to know what subroutine to return to.  You need to know what state to return to within that subroutine.  If one subroutine calls another subroutine in two places with different next states upon return, then the same child frame cannot be used for both calls.

    When a return occurs, I get the subroutine from the parent frame, the next state from the return state saved in the current frame, and then move the stack pointer to the parent frame.

    That's about it for how subroutines work after my changes.  Now JOE can recognize deterministic context-free languages instead of just regular languages.  :-)

    I also implemented a new transition option, reset, that pops the entire return stack and goes to the initial state.  I'm not sure how useful it is.  The main reason I implemented it was to use it as the default command, so that the default command would have the same behavior.  I think it's iffy to rely on the behavior of a default command though.

    Another change I made was to propagate parameters defined in the current routine to any subroutines called.  See https://sourceforge.net/forum/forum.php?thread_id=1651476&forum_id=74214 for the motivation.  In order to support the case where you do not want a parameter to be propagated, I added new syntax.  Now, when calling a subroutine, you can undefine a parameter by preceding it with '-' in the parameter list.

    That's it for feature changes.  Does anyone have any questions, comments, suggestions, or objections?

    What I haven't done is try to reclaim the memory of nodes in the call tree that no one is pointing to anymore.  I've looked into what it would take, and it seems doable.  Basically each node would need a reference count.  The trick is keeping the reference counts accurate.

    Though, do you think it is worth the effort?  If you open a file with very deep nesting of something that causes a call for each level of nesting, then a large call tree will be created.  If you then close the file, the call tree will remain in memory.  However, if you open another file that also has very deep nesting, then the existing call tree will be reused.  In other words, the size of a call tree is bounded by the size of the largest file ever opened.

    Unfortunately, I never finished writing make.jsf, but I have found a way to demonstrate that the new feature is useful.  Here is a comment from haskell.jsf:

    # Missing: nesting of nested comments. Needs joe support.
    # fudged in one level of nesting

    In my patch, I changed haskell.jsf to properly support nested comments.  Also, along with my patch, I attached some example jsf files along with text to highlight with them:

    matching.jsf matches nested parentheses, curly brackets, square brackets, and angle brackets.  It highlights closing delimiters that are properly matched and underlines those that are not.  It also demonstrates undefining parameters.

    expr.jsf highlights the contents of complete, parenthesized expressions, and rotates the color it uses each time.  I made it more complicated than it needs to be in order to demonstrate the reset option.

    Enjoy,
    Charles J. Tabony

     
    • The patch is number 2207149.

       
    • Joe Allen
      Joe Allen
      2008-10-29

      I think this is an excellent idea: I've been thinking that I should probably write a regex compiler for the existing FSM-based highlighter... but it looks like we need YACC instead :-)  Hmm.. the xml highlighter could check for unmatched tags...

      I'm trying to think if some form of garbage collection could be used instead of the reference counting.  Perhaps if the pool of nodes is full, try to garbage collect: mark and sweep the nodes based on the current state plus all states in the line attribute database (or maybe dump the database at that point).  If there is still not enough space, double the pool size.

       
    • Mark and sweep seems straightforward enough.  When a collection is triggered, mark all nodes as unused.  Iterate through the attribute databases.  For each attribute database referring to the current syntax, iterate through its valid entries.  For each entry, mark the frame being referenced as used.  Mark its parent and its parent's parent etc. until you hit the root or a frame that has already been marked as used.  Do the same for the current state.  Sweep.

      Like you say, an alternative would be to mark each of the databases as entirely invalid.  That would make the marking much faster, but would mean you would need to re-parse every file highlighted by the current syntax.

      I don't have much experience with memory management.  I'm guessing the pool would be an array of struct high_frame's, and references to frames would use indices instead of pointers.  I can think of two ways to handle allocations:

      One would be to have an offset into the array that indicates the boundary between valid and invalid entries.  An entry would be allocated by simply incrementing the offset.  The problem is that once you free any entries, if the set of valid entries is no longer contiguous, then you have to move some of the valid entries up.  To move an entry, you would have to update all references to it.  Is there an efficient way to do that?

      The other scheme would be to have a valid bit for each entry, either within each entry or in a second array.  An entry could be freed by simply clearing its valid bit.  However, you then get fragmentation.  That makes allocating an entry more time consuming, because you have to scan for the next invalid entry.  Though I suppose such scanning would be cheaper than the copying needed by the first scheme.

      Are there any other schemes I am unaware of or tricks to make either of these schemes more efficient?

      With the first scheme, you would still need a bit per entry to mark entries as used or unused.  With the second scheme, you could use the valid bits as used bits during collection.  You could either clear all of the valid bits before you do the marking, or, if you only collect when all entries are valid, you could use the trick of inverting the meaning of the bit.

      What are you envisioning?

       
    • Joe Allen
      Joe Allen
      2008-10-30

      So I'm playing with your patch now: it's awesome.  With it, JOE can highlight the aclocal.m4 file
      (you have it if you get JOE with cvs). I did have to add the some -parameter arguments to the calls in m4.jsf: change call(brace) to call(brace -quote) and call(quote) to call(quote -brace).  I'm still not sure if it is better to propagate by default or not.

      I'm going to have it print some statistics on how many struct high_frames actually get allocated. Maybe if they get reused enough we can just forget about trying to reclaim the memory.  It probably uses less than the many levels of template instantiations that it's replacing.

       
    • Joe Allen
      Joe Allen
      2008-10-30

      I've applied the patch: it's now in CVS.

      ESC x debug_joe has a line which says "Allocated N stack frames".  So far this number is quite low, so I'm not worried about memory management.

      I had to fix one thing: debug_joe was segfaulting because it was trying to print the name of the next_state in default_cmd (this is not a problem for the highlighter state machine since you have the reset bit set in default_cmd).  I should really add another pass which checks for unknown states and prints errors to the startup log.

      One big problem with this change: now many of the highlighters can be substantially improved... which is lots of work.

       
    • > So I'm playing with your patch now: it's awesome.

      Thanks!

      > I'm still not sure if it is better to propagate by default or not.

      Yeah.  I have seen a few cases where I don't want a parameter to be propagated.  The classic example is nested expressions that can use more than one type of delimiter, such as in makefiles.

      However, I can think of at least one case where propagation of parameters would be desired.  A while back I started writing a syntax-highlighting description for printf/scanf style escape sequences.  The idea would be to call that file from other files, such as c.jsf, that highlight languages that use such formatting strings.  GCC machine descriptions have embedded C code, but strings containing %'s, even in the C code, are more often instruction output templates than arguments to printf.  I want to pass a parameter when calling c.jsf and have it affect the highlighting of formatting sequences described in another file.

      The general case where you want parameters to be propagated is when you have A.jsf calling B.jsf calling C.jsf, where A wants to set a parameter to affect the behavior of C, but not B.  If parameters are not propagated, then B has to know it needs to propagate such a parameter and do so explicitly.  I could see this happening with HTML and friends.

      While both use cases exist, it seems to me that the knowledge of when you don't want to propagate a parameter is more localized, whereas the times you want to propagate a parameter, it is more likely to be one that does not concern the current code.  That is why I went with propagating parameters, even though not wanting to seems like the more common case.  Also, if you need to explicitly propagate a parameter, then you have to do it at each call that could lead to its use.  If you add an intermediate subroutine and forget to propagate a parameter, then things break.  If, instead, undefining a parameter is explicit, then you only need to break the chain once.

      If you want to change the behavior such that parameters are not propagated by default, then you could simplify the syntax of forwarding parameters by having "?param" in a parameter list pass "param" if and only if it is currently defined.

      > I've applied the patch: it's now in CVS.

      Cool!

      > ESC x debug_joe has a line which says "Allocated N stack frames".
      > So far this number is quite low, so I'm not worried about memory
      > management.

      I expect it to be quite low for most files.  Even with a complicated syntax-highlighting description that went crazy with subroutines, it would take code with some very deeply nested expressions to create enough frames worth worrying about, enough nesting that it would be difficult for a human to decipher.  But such code is possible, especially coming out of a code generator, and the memory used for the highlighter would not be reclaimed, even after closing such a file.  However, if you were to open another such file highlighted with the same syntax, then it would reuse those frames.

      Even if it got into the thousands of frames, each frame is 8 pointers, which is 40 bytes on a 64-bit machine, so each 1024 frames is 40k.  You'd need 25,000 frames to make a megabyte, which would probably be a very small fraction of the file being highlighted.

      You might still want to use a memory pool though, since frames are created at run-time and could lead to heap fragmentation.  Is it worth the code complexity and effort?

      > I had to fix one thing: debug_joe was segfaulting because it was
      > trying to print the name of the next_state in default_cmd

      D'oh!

      I noticed that diff.jsf is the only .jsf file in CVS that has a state (ndiff_re) where the first transition does not use '*'.  In the case of diff.jsf, it is actually not relying on the default command, because it has a transition for '\n', and the only transitions that lead to that state are transitions for '\n' that use the "noeat" option.  ndiff_re is a dummy state that is used for recoloring.  It could use '*' for its only transition just like every other dummy color state.

      Relying on the default command seems iffy to me.  I think JOE should warn about any states that do.

      > I should really add another pass which checks for unknown states
      > and prints errors to the startup log.

      I've thought about this.  I think the best solution is to create a second variable-length array alongside syntax->states, where each element has two bits for the corresponding state.

      One is "defined".  There would be a new argument to find_state() that is true or false depending on whether we are defining the state or referencing it.  If we are referencing it, and it doesn't exist yet, then clear the "defined" bit.  If we are defining it, set the "defined" bit.  At the end of load_dfa(), any state with the "defined" bit still clear has not been defined.

      The other is "reachable".  The "reachable" bit would be cleared initially.  At the end of load_dfa(), start at state[0] and recurse through the state graph.  For each node, set the "reachable" bit, iterate through the next states, and for each next state that does not have the "reachable" bit set, recurse.  Once the recursion is complete, any state with the "reachable" bit clear is unreachable.  Of course, that will only find states that are trivially unreachable.  There could be other states that are unreachable, but then you're talking about solving the halting problem.  :-)

      load_dfa() can then free the array before returning.

      > One big problem with this change: now many of the highlighters
      > can be substantially improved... which is lots of work.

      My bad.  :-)

      There are at least a couple of things to be careful of.  One is any language that has conditional directives or preprocessing.  C is a lost cause, since you could have things like

        if(something
      #ifdef foo
           && something else){
      #else
           || something completely different){
      #endif

      or

      #define FOO (unbalanced (expression

      The other is that you can't rely on being able to recolor an entire expression if it can span multiple lines.

      Oh, and I don't mind helping out with some of that work.  :-)