Q about htmlComment and pyparsing tricks

Forum: Help/Open Discussion

Creator: joe

Created: 2005-05-25

Updated: 2013-05-14

joe - 2005-05-25

Hi,

First off, pyparsing looks like a ridiculously cool tool. Thanks for writing it!

I've got a quick question about one of your predefined parsers, namely htmlComment. I suspect the real answer to this question is "take a course in programming languages", or "read a book on parsing", which I probably should do anyway, but I want to get an idea of what it is that I don't know.

htmlComment is defined as:

htmlComment = Combine( Literal("<!--") + ZeroOrMore( CharsNotIn("-") |
                        (~Literal("-->") + Literal("-").leaveWhitespace() ) ) +
                        Literal("-->") ).streamline().setName("htmlComment enclosed in <!-- ... -->")

If we strip off the stuff to make the output look nicer, we get:

htmlComment = Literal("<!--") + ZeroOrMore( CharsNotIn("-") | (~Literal("-->") + Literal("-").leaveWhitespace() ) ) + Literal("-->")

I'm curious as to why there is a "+ Literal("-").leaveWhitespace()" following the ~Literal("-->"). I know that if I _remove_ that from the definition, the parser never returns under certain circumstances:

In [4]: htmlComment = Literal("<!--") + ZeroOrMore( CharsNotIn("-") | (~Literal("-->") ) ) + Literal("-->")

In [5]: htmlComment.parseString("")
Out[5]: ([''], {})             # <--- works

In [6]: htmlComment.parseString("") # <--- grinds on forever

If I convert this to what I _think_ is the English-language explanation of the parser, it is "match the exact string "<!--" ... "-->" followed by "-", except that any whitespace between the "-->" and the "-" will not be deleted before making a match". (Is that correct?)

Essentially, I'm curious as to why the parsing algorithm requires that "Literal("-").leaveWhitespace()" be appended to the ~Literal("-->") to prevent an infinite loop.

thanks in advance!

Joe

- Paul McGuire - 2005-05-25
  
  Joe -
  
  Well, I think this *is* covered in CS compiler courses, but don't worry about it. I never took one either. I *have* written countless parsing programs, in half a dozen different programming languages, and so pyparsing is the result of "I wish I'd had this when I started..." thinking.
  
  Before I answer your question, take a look at cStyleComment. It has a very similar issue. It reads an opening '/*', then any non-star characters, or any '*' not followed by a slash, followed by a '*/'.
  
  The problem is that pyparsing does not use lookahead as part of its scanning process. The terminators for both cStyleComment and htmlComment are multiple characters long. While scanning through the comment body, cStyleComment must handle '*' differently from all other characters. If a '*' is seen, then we have to check whether it is followed by a '/'. If it isn't, then we are still in the body of the comment. If it *is* followed by a slash, then we are not in the body of the comment, and we also need to back up our read position in the input string. pyparsing does this automatically: when the And object created by ( "*" + ~Literal("/") ) fails on its second part, it throws an exception internally, and the handler reverts the position to where the And started.
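
The cStyleComment behaviour described above can be tried with a short sketch. This is not necessarily the library's own definition of cStyleComment; it only follows the description in this post, and the names `comment_body` and `c_comment_sketch` are made up for illustration. It uses the camelCase method names from this thread, which newer pyparsing releases may flag as deprecated.

```python
from pyparsing import CharsNotIn, Combine, Literal, ZeroOrMore

# Body: runs of non-star characters, or a '*' that is not followed by '/'.
# When "*" matches but ~Literal("/") fails, the And fails as a whole, so
# the repetition stops with the read position still at that '*'.
comment_body = ZeroOrMore(CharsNotIn("*") | ("*" + ~Literal("/")))

# Combine joins the pieces into one token and turns off whitespace skipping
# inside the expression, so the lookahead inspects the very next character.
c_comment_sketch = Combine(Literal("/*") + comment_body + Literal("*/"))

text = "/* a * b ** c */ int x;"
tokens = c_comment_sketch.parseString(text)
print(tokens.asList())
```

The stars inside the body are consumed one at a time; only the final '*' is left for the closing Literal("*/"), because the And that tried to take it was abandoned.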
  
  htmlComment is implemented a little differently, probably because here, the end-of-comment is 3 characters long, not just 2. Here I *am* using a sort of lookahead, but it is because I have explicitly stated it in my grammar - if I were to read the comment body definition aloud, it would read "accept any non-dash character or, if it's not part of a closing '-->' terminator, then read a dash character". (I don't recall why I call leaveWhitespace here, though. I don't think it makes a difference. You can experiment with leaving it out, and then create HTML comments with various '-->' fragments, with and without intervening whitespace.)
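
Why the variant without Literal("-") never returns can be shown without pyparsing at all. The sketch below is plain Python, not pyparsing's implementation; `scan_body`, `consume_dash` and `max_steps` are hypothetical names, and the step cap exists only so the stuck case returns instead of spinning. The idea it illustrates: a negative lookahead succeeds without consuming input, so a repetition whose alternative is only a lookahead can keep "matching" at the same position.

```python
def scan_body(text, pos, consume_dash, max_steps=100):
    """Mimic ZeroOrMore(non-dash run | not-at-'-->' [then read '-'])."""
    for _ in range(max_steps):
        # Alternative 1: a run of one or more non-dash characters.
        end = pos
        while end < len(text) and text[end] != "-":
            end += 1
        if end > pos:
            pos = end
            continue
        # Alternative 2 starts with a negative lookahead for '-->'.
        if text.startswith("-->", pos):
            return pos, True      # both alternatives failed: repetition ends
        if consume_dash:
            if pos >= len(text):
                return pos, True  # no dash left to read: repetition ends
            pos += 1              # reading the dash advances the position
        # Otherwise the lookahead alone matched zero characters and pos is
        # unchanged, so the next iteration repeats exactly the same step.
    return pos, False             # step cap reached: position stopped moving


print(scan_body("<!-- a - b -->", 4, consume_dash=True))
print(scan_body("<!-- a - b -->", 4, consume_dash=False))
print(scan_body("<!-- a b -->", 4, consume_dash=False))
```

The second call gets stuck at the lone dash, while the third finishes because a body with no dashes never reaches the lookahead-only alternative. That matches the observation that the shortened definition works on some comments and grinds forever on others.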
  
  If I wanted to make HTML comment more consistent with cStyleComment, it would probably read
  
  htmlComment = Combine( Literal("<!--") + ZeroOrMore( CharsNotIn("-") | ( "-" + ~Literal("->") ) ) + Literal("-->") ).streamline().setName("htmlComment enclosed in <!-- ... -->")
  
  Feel free to test this, again, making sure you don't accidentally accept "- ->" as a terminator. This may also parse faster, and would be worth changing in the next release. Let me know what you find.
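
A minimal test harness for that experiment might look like the following. It compares the lookahead-first body from the original definition with a dash-first body written in the cStyleComment style; the dash-first expression is an assumption about what that form would look like, and `make_comment`, `accepts`, `lookahead_first` and `dash_first` are illustrative names only.

```python
from pyparsing import CharsNotIn, Combine, Literal, ParseException, ZeroOrMore


def make_comment(dash_alternative):
    # Body: runs of non-dash characters, or the given way of reading a dash.
    body = ZeroOrMore(CharsNotIn("-") | dash_alternative)
    return Combine(Literal("<!--") + body + Literal("-->"))


# Original style: check that we are not at '-->', then read one dash.
lookahead_first = make_comment(~Literal("-->") + Literal("-").leaveWhitespace())
# cStyleComment style (assumed): read a dash, then check '->' does not follow.
dash_first = make_comment("-" + ~Literal("->"))


def accepts(parser, text):
    try:
        parser.parseString(text)
        return True
    except ParseException:
        return False


samples = ["<!-- a - b -->", "<!-- a -- b -->", "<!-- a - ->", "<!-- a - -> b -->"]
results = {s: (accepts(lookahead_first, s), accepts(dash_first, s)) for s in samples}
for sample, outcome in results.items():
    print(repr(sample), outcome)
```

Both forms sit inside Combine here, which disables whitespace skipping for everything it contains, so "- ->" is read as ordinary body text rather than as a terminator. Outside Combine the dash-first lookahead could skip the space before checking for "->", which is the case worth testing by hand.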
  
  -- Paul
  