Q about htmlComment and pyparsing tricks

  • joe


    First off, pyparsing looks like a ridiculously cool tool. Thanks for writing it!

    I've got a quick question about one of your predefined parsers, namely htmlComment. I suspect the real answer to this question is "take a course in programming languages", or "read a book on parsing", which I probably should do anyway, but I want to get an idea of what it is that I don't know.

    htmlComment is defined as:

    htmlComment = Combine( Literal("<!--") + ZeroOrMore( CharsNotIn("-") |
                                                       (~Literal("-->") + Literal("-").leaveWhitespace() ) ) +
                            Literal("-->") ).streamline().setName("htmlComment enclosed in <!-- ... -->")

    If we strip off the stuff to make the output look nicer, we get:

    htmlComment = Literal("<!--") + ZeroOrMore( CharsNotIn("-")
                                | (~Literal("-->") + Literal("-").leaveWhitespace() ) ) + Literal("-->") 

    I'm curious as to why there is a "+ Literal("-").leaveWhitespace()" following the ~Literal("-->"). I know that if I _remove_ that from the definition, the parser never returns under certain circumstances:

    In [4]: htmlComment = Literal("<!--") + ZeroOrMore( CharsNotIn("-") | (~Literal("-->") ) ) + Literal("-->")

    In [5]: htmlComment.parseString("<!-- a -->")
    Out[5]: (['<!--', 'a ', '-->'], {})             # <--- works

    In [6]: htmlComment.parseString("<!-- a --->")  # <--- grinds on forever

    If I convert this to what I _think_ is the English-language explanation of the parser, it is: match the exact string "<!--", followed by zero or more of the following: (1) characters from the set that includes everything except "-", and (2) strings that start with "-" but are not equal to "-->", followed by "-", except that any whitespace between the "-->" and the "-" will not be deleted before making a match. (Is that correct?)

    Essentially, I'm curious as to why the parsing algorithm requires that "Literal("-").leaveWhitespace()" be appended to the ~Literal("-->") to prevent an infinite loop.

    thanks in advance!


    • Paul McGuire

      Joe -

      Well, I think this *is* covered in CS compiler courses, but don't worry about it.  I never took one either.  I *have* written countless parsing programs, in half a dozen different programming languages, and so pyparsing is the result of "I wish I'd had this when I started..." thinking.

      Before I answer your question, take a look at cStyleComment.  It has a very similar issue.  It reads an opening '/*', then any non-star characters or any '*' not followed by a slash, and finally a closing '*/'.

      The problem is that pyparsing does not use lookahead as part of its scanning process.  The terminators for both cStyleComment and htmlComment are multiple characters long.  While scanning through the comment body, cStyleComment must handle '*' differently from all other characters.  If a '*' is seen, then we have to check whether it is followed by a '/'.  If it isn't, then we are still in the body of the comment.  If it *is* followed by a slash, then we are no longer in the body of the comment, and we also need to back up our read position in the input string.  pyparsing does this automatically: when the And object created by ( "*" + ~Literal("/") ) fails on its second part, it raises an exception internally, and the handler reverts the position to where the And started.
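
      The back-up behaviour described above can be sketched in plain Python. This is a hand-written stand-in, not pyparsing's implementation: the function names are made up for illustration, and it works character by character with no whitespace skipping.

```python
def match_star_not_slash(text, pos):
    """Illustrative stand-in for ("*" + ~Literal("/")).

    Returns the position after the star on success, or None on failure.
    On failure nothing is consumed: the caller still holds the old position.
    """
    if text.startswith("*", pos):
        after_star = pos + 1
        # Negative lookahead: peek at the next character without consuming it.
        if not text.startswith("/", after_star):
            return after_star
    return None


def scan_c_comment(text, pos=0):
    """Return the index just past a /* ... */ comment at pos, or None."""
    if not text.startswith("/*", pos):
        return None
    pos += 2
    while pos < len(text):
        if text[pos] != "*":
            pos += 1  # any non-star character is part of the body
            continue
        nxt = match_star_not_slash(text, pos)
        if nxt is None:
            break  # star followed by "/": body is over, pos is left on the star
        pos = nxt
    # Because pos was not advanced past the star, the terminator still matches.
    return pos + 2 if text.startswith("*/", pos) else None


if __name__ == "__main__":
    sample = "/* a * b ** c */ rest"
    end = scan_c_comment(sample)
    print(repr(sample[:end]))
```

      The point to notice is that a failed alternative leaves the position where it was, so the closing '*/' can still be matched from the star.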

      htmlComment is implemented a little differently, probably because here the end-of-comment is 3 characters long, not just 2.  Here I *am* using a sort of lookahead, but it is because I have explicitly stated it in my grammar - if I were to read the comment body definition aloud, it would read "accept any non-dash character or, if it's not part of a closing '-->' terminator, then read a dash character".  The Literal("-") also answers your infinite-loop question: ~Literal("-->") on its own succeeds without consuming any input, so ZeroOrMore keeps matching it at the same position; reading the dash is what moves the parse forward.  (I don't recall why I call leaveWhitespace here, though.  I don't think it makes a difference.  You can experiment with leaving it out, and then create HTML comments with various '-->' fragments, with and without intervening whitespace.)
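
      To see why the dash matters in the lookahead form, here is a small plain-Python sketch. It is only a model of the grammar, not pyparsing itself: the helper names and the max_steps guard are invented for illustration, and whitespace skipping is ignored. A negative lookahead that succeeds consumes nothing, so a repetition built on it alone never advances.

```python
def non_dashes(text, pos):
    """Stand-in for CharsNotIn("-"): consume one or more non-dash characters."""
    end = pos
    while end < len(text) and text[end] != "-":
        end += 1
    return end if end > pos else None


def not_terminator(text, pos):
    """Stand-in for ~Literal("-->") alone: succeeds without consuming input."""
    return None if text.startswith("-->", pos) else pos


def dash_unless_terminator(text, pos):
    """Stand-in for ~Literal("-->") + Literal("-"): consumes exactly one dash."""
    if not_terminator(text, pos) is not None and text.startswith("-", pos):
        return pos + 1
    return None


def scan_body(text, pos, dash_rule, max_steps=50):
    """Repeat (non_dashes | dash_rule), like ZeroOrMore over the alternation.

    max_steps is a hypothetical safety guard so the demo terminates; it
    reports "stalled" where an unguarded loop would spin forever.
    """
    for _ in range(max_steps):
        nxt = non_dashes(text, pos)
        if nxt is None:
            nxt = dash_rule(text, pos)
        if nxt is None:
            return pos, "done"
        pos = nxt
    return pos, "stalled"


if __name__ == "__main__":
    start = len("<!--")
    for rule in (not_terminator, dash_unless_terminator):
        for sample in ("<!-- a -->", "<!-- a --->"):
            print(rule.__name__, repr(sample), scan_body(sample, start, rule))
```

      With the lookahead alone, "<!-- a --->" stalls on the first dash, because "---" is not the terminator and nothing is consumed. With the dash added, the scan steps over that dash and stops in front of the real "-->".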

      If I wanted to make htmlComment more consistent with cStyleComment, it would probably read:

      htmlComment = Combine( Literal("<!--") + ZeroOrMore( CharsNotIn("-") |
                                                           (Literal("-") + ~Literal("->") ) ) +
                             Literal("-->") ).streamline().setName("htmlComment enclosed in <!-- ... -->")

      Feel free to test this, again, making sure you don't accidentally accept "- ->" as a terminator.  This may also parse faster, and would be worth changing in the next release.  Let me know what you find.
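
      As a starting point for that kind of testing, the two body rules can be compared on a handful of inputs in plain Python. This is a character-level model only: the function names and sample strings are made up, and it does no whitespace skipping, so it says nothing about how pyparsing treats "- ->". The real grammars still need to be run against the same strings.

```python
def lookahead_first(text, pos):
    """Model of ~Literal("-->") + Literal("-"): the dash at pos is body text
    unless the terminator starts right here."""
    return not text.startswith("-->", pos)


def dash_first(text, pos):
    """Model of Literal("-") + ~Literal("->"): the dash at pos is body text
    unless it is immediately followed by "->"."""
    return not text.startswith("->", pos + 1)


def scan_html_comment(text, dash_ok):
    """Return the index just past a leading <!-- ... --> comment, or None."""
    if not text.startswith("<!--"):
        return None
    pos = 4
    while pos < len(text):
        if text[pos] != "-" or dash_ok(text, pos):
            pos += 1  # non-dash character, or a dash that belongs to the body
        else:
            break  # a dash that begins the terminator
    return pos + 3 if text.startswith("-->", pos) else None


# Hypothetical sample inputs, including the "- ->" case mentioned above.
SAMPLES = [
    "<!-- a -->",
    "<!-- a --->",
    "<!-- a-b -->",
    "<!---->",
    "<!-- a - -> b -->",
    "<!-- a - ->",
]


def compare(samples=SAMPLES):
    """Map each sample to (end index under rule 1, end index under rule 2)."""
    return {
        s: (scan_html_comment(s, lookahead_first),
            scan_html_comment(s, dash_first))
        for s in samples
    }


if __name__ == "__main__":
    for sample, ends in compare().items():
        print(repr(sample), ends)
```

      In this model the two rules agree on every sample, and neither accepts "- ->" as a terminator. Any difference in pyparsing would therefore have to come from whitespace handling, which is the part worth checking by hand.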

      -- Paul