Hello,
I am using pyparsing to parse some light markup I created, but I am having trouble putting the final grammar together (i.e. the one that will parse an entire file). I don't think I am understanding the correct usage of OneOrMore. Here is what I want my final grammar to look like:
GRAMMAR = OneOrMore(LINK | WORD | POS | SYM | PINYIN | PUNC | NEWLINE | BIGC | LPAREN | RPAREN | LBRACK | RBRACK | HYPHEN)
LINK, WORD, POS, etc. are pretty basic Literal / Word stuff. When I run GRAMMAR.parseFile it gets stuck in an infinite loop for certain input (maybe 50% of the time), and I am not sure what is causing it. If I run something like:
def iter_over_tokens(G, s):
    while s.strip() != "":
        token = G.parseString(s)
        yield token
        s = s.replace("".join(token.asList()), "", 1)
which destructively parses the string without using OneOrMore, there is no problem. Does anyone have an idea why my grammar works using iter_over_tokens, but gives the infinite loop when I use OneOrMore (which I think should be the correct way)?
And thank you to Paul McGuire for such a great tool :]
Thanks,
Steve
Steve -
Welcome to pyparsing, and thanks for such nice compliments!
It's a bit difficult to say for sure why OneOrMore is infinitely looping without seeing more of your grammar, but here are some hunches.
1. NEWLINE is a funny-sounding expression in pyparsing. By default, pyparsing ignores whitespace, other than as a potential expression separator. So if NEWLINE happens to be something like Literal("\n"), forget it, it will never match. If your grammar *does* need to detect the end of an incoming line, use LineEnd() instead.
2. Your expression names sound very primitive/low-level. I would guess this parser (at least as much of it as I can see) is more of a tokenizer than a real parser. Look at some of the examples that ship with pyparsing, such as the simple SQL parser. The real power of pyparsing is in defining higher-order expressions, such as selectStatement, whereClause, orderByClause, etc., vs. word, punc, hyphen, etc. Of course, selectStatement et al. are all composed of smaller elements, such as Literal("select"), Literal("*"), and so on, but these primitive elements don't get much exposure at the overall grammar level.
3. I suspect that your Grammar is looping forever at the end of the input string. To force the OneOrMore to exit at the end of the string, try this grammar definition:
GRAMMAR = OneOrMore(~StringEnd() + (LINK | WORD | POS | SYM | PINYIN | PUNC | NEWLINE | BIGC | LPAREN | RPAREN | LBRACK | RBRACK | HYPHEN) )
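As a small illustration of the ~StringEnd() guard above, here is a minimal sketch. WORD and NUMBER are hypothetical stand-ins, since the thread does not show how the real token expressions are defined, and the input is made up. The camelCase method names match the ones used in this thread.

```python
from pyparsing import OneOrMore, StringEnd, Word, alphas, nums

# Hypothetical stand-ins for two of the token expressions in the thread.
WORD = Word(alphas)
NUMBER = Word(nums)

# ~StringEnd() is a negative lookahead: each repetition is attempted only
# while unparsed (non-whitespace) input remains.
GRAMMAR = OneOrMore(~StringEnd() + (WORD | NUMBER))

# The trailing newline is skipped as whitespace, so the lookahead sees the
# end of the string and the repetition stops.
tokens = GRAMMAR.parseString("hello world 42\n", parseAll=True).asList()
print(tokens)
```

The lookahead consumes nothing and adds no tokens of its own, so the result contains only what WORD and NUMBER matched.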
If this is not the case, you can have pyparsing emit debugging information using setDebug(). Try this (with your original grammar, but before calling parseString):
for exprName in "LINK WORD POS SYM PINYIN PUNC NEWLINE BIGC LPAREN RPAREN LBRACK RBRACK HYPHEN".split():
    expr = vars()[exprName]
    expr.setName(exprName)
    expr.setDebug()
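The same debugging idea can be sketched without the vars() lookup by keeping the expressions in a dict. The two expressions below are hypothetical stand-ins for the thread's tokens, not the real definitions.

```python
from pyparsing import OneOrMore, Word, alphas, nums

# Hypothetical stand-ins for the thread's token expressions.
expressions = {
    "WORD": Word(alphas),
    "NUMBER": Word(nums),
}

for name, expr in expressions.items():
    expr.setName(name)  # label shown in the debug trace
    expr.setDebug()     # report each match attempt, success and failure

GRAMMAR = OneOrMore(expressions["WORD"] | expressions["NUMBER"])

# Parsing now prints a trace line for every attempt made by WORD and NUMBER.
result = GRAMMAR.parseString("abc 123").asList()
```

Reading the trace from the bottom up usually shows which expression was being tried, and at which location, when the parser stopped making progress.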
4. I'm so glad to see you wrote the iter_over_tokens generator, it reaffirms to me the need for such a function. Pyparsing comes with a friendlier (and non-destructive) version, called scanString, but it's sort of a poor distant cousin to parseString. scanString is a generator that yields each matching token set, almost exactly as you have done with iter_over_tokens. The exception is that scanString is able to keep track of how far the parsing has gone, and can resume at the end of the last match, so there is no need to muck about with the original string contents.
Each call to scanString returns a tuple of:
- tokens - a ParseResults object containing the matched tokens
- start - the starting position of the match within the input string
- end - the ending position of the match within the input string
You could iterate over your string using the following code:
for tokens, start, end in G.scanString(s):
    print(tokens.asList())
Look at the scanExamples.py file that ships with pyparsing.
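To make the (tokens, start, end) tuples concrete, here is a minimal sketch with a hypothetical NUMBER expression and made-up input. It only assumes that scanString yields the three values described above.

```python
from pyparsing import Word, nums

NUMBER = Word(nums)  # hypothetical token expression
text = "abc 123 def 456"

# Collect every match together with its position in the original string.
matches = [(tokens.asList(), start, end)
           for tokens, start, end in NUMBER.scanString(text)]

# The input string is never modified; start and end index straight into it.
snippets = [text[start:end] for _, start, end in matches]

print(matches)
print(snippets)
```

Unlike iter_over_tokens, nothing is removed from the input, so the reported positions stay valid for the original text.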
5. Hey, since you are writing a markup language, then you might want to look at scanString's twin-brother, transformString. Using parseActions, you can define how each markup construct should be converted (to HTML? PDF? ReST?), then call G.transformString(s). transformString returns a new string, replacing any matched expressions with the results defined in their corresponding parse actions. Again, there are some examples of transformString given in scanExamples.py. (transformString uses scanString internally to locate the matching expressions, and takes care of the tedium of building up the resulting string from the matched and unmatched bits and pieces.)
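A rough sketch of the transformString idea follows. The [[target]] link syntax, the LINK definition and the link_to_html helper are all assumptions made for illustration; the thread does not show what the real markup looks like.

```python
from pyparsing import Suppress, Word, alphanums

# Hypothetical link markup: [[target]]. The real syntax is not shown in the thread.
TARGET = Word(alphanums + "./_-")
LINK = Suppress("[[") + TARGET + Suppress("]]")


def link_to_html(tokens):
    # tokens[0] is the link target; the returned string replaces the match.
    return '<a href="%s">%s</a>' % (tokens[0], tokens[0])


LINK.setParseAction(link_to_html)

# Text that does not match LINK is copied through unchanged.
html = LINK.transformString("See [[../166/x114.htm]] and [[../168/x229.htm]].")
print(html)
```

Each markup construct would get its own parse action in the same way, and the expressions can be combined with | before calling transformString.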
If you need more help or advice, please post a bit more of your grammar.
HTH,
-- Paul
When I was starting out with pyparsing (having not written parsers for a long time or having much experience in it) I often ran into infinite loops by having something like:
OneOrMore( EXPR1 | EXPR2 etc... )
# The same is true for using ZeroOrMore
If any part of the expression supplied to OneOrMore or ZeroOrMore is capable of matching without moving the parsing position forward within the input text, then the repetition can loop forever once it reaches that expression in the OR'ed list.
This could happen, for example, if one of the expressions is capable of matching nothing, e.g.
EXPR2 = Literal("word") | Empty()
or
EXPR2 = ZeroOrMore(cStyleComment)
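The hazard can be seen without actually triggering a loop. The sketch below uses made-up expressions: it shows that an alternative containing Empty() succeeds on input it cannot consume, and then a repetition whose alternatives all consume at least one character. The risky expression is deliberately never wrapped in OneOrMore here.

```python
from pyparsing import Empty, Literal, OneOrMore, Word, nums

# Can succeed without consuming anything: unsafe inside OneOrMore/ZeroOrMore.
risky = Literal("word") | Empty()

# "123" is not "word", so Empty() matches and no tokens are returned.
empty_match = risky.parseString("123").asList()

# Safer: every alternative must consume at least one character, so each
# repetition either advances or fails, and the loop ends.
item = Literal("word") | Word(nums)
safe = OneOrMore(item)
parsed = safe.parseString("word 123 word").asList()

print(empty_match, parsed)
```

If an element really is optional, it is generally safer to make the whole repetition optional (for example with ZeroOrMore around a consuming expression) than to put an empty-matching alternative inside it.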
Hello,
Thank you both for the help, I have everything working now. Using Paul's tips on debugging I was able to see that it was stuck parsing the final '\n' in the file. Changing the expression to OneOrMore(~StringEnd() + ...) worked.
You are right that my grammar is very low level. I understand the purpose of the high level expressions like in the SQL parser, but the text that I am processing does not have a lot of structure to work with (my hope is that parsing it will give me something easier to use). Now that I have everything in the low level grammar working I can make a stab at the higher level expressions.
I have one final (small) problem. I am using setResultsName and asXML to view the output of the parser. The first element in the XML output should always be of type 'link', but it comes out as the default 'ITEM'. For example:
<165-d65>
    <ITEM>../166/x114.htm</ITEM>
    <link>../168/x229.htm</link>
    <word>Click</word>
    <word>on</word>
    <word>any</word>
....
But if I put something else at the front of the file:
<165-d65>
    <word>Hello</word>
    <link>../166/x114.htm</link>
    <link>../168/x229.htm</link>
    <word>Click</word>
    <word>on</word>
    <word>any</word>
    <word>character</word>
...
Everything else in the XML output is fine. Also, when I look at the results of scanString, everything is given the proper results name (including that first link). Any ideas why this might happen?
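Since scanString reports the right results names, one possible stopgap is to build the tags from its output by hand. This is only a sketch under assumptions: the LINK and WORD definitions and the to_xml_lines helper are hypothetical, and it does not explain or fix the asXML behaviour itself.

```python
from xml.sax.saxutils import escape

from pyparsing import Regex, Word, alphas

# Hypothetical definitions; the real ones live in the linked code.
LINK = Regex(r"\S+\.htm").setResultsName("link")
WORD = Word(alphas).setResultsName("word")
G = LINK | WORD


def to_xml_lines(text):
    """Emit one <name>value</name> line per named match found by scanString."""
    lines = []
    for tokens, start, end in G.scanString(text):
        for name, value in tokens.asDict().items():
            lines.append("<%s>%s</%s>" % (name, escape(value), name))
    return lines


xml_lines = to_xml_lines("../166/x114.htm ../168/x229.htm Click on any")
print("\n".join(xml_lines))
```

Because each match is handled on its own, the first link is tagged the same way as every later one.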
Sorry I did not post my code sooner, maybe that would have helped more. You can find it (with some test input and example of the output) here:
http://lost-theory.org/python/pyparsing/
Thanks again for your help,
Steve
I've looked at the code a bit, before tracing the trouble back into pyparsing and getting a bit bewildered by some code in there. As far as I've traced it back it's something that happens inside pyparsing for some unknown reason... The text gets parsed correctly, but something goes wrong in internally constructing the ParseResults objects (I think).
Random confused stuff:
In asXML(), when constructing namedItems, the first link token internally gets an index of -1 where it should get an index of 0, which messes up the construction of namedItems. The problem seems to originate further back, when the ParseResults objects are created and/or inserted into higher-level ParseResults. Not sure if that makes much sense, but I hope that Paul McGuire makes some sense of it :)
Greatred -
I'm sorry not to have responded sooner, but I am under a huge load at work these days, and no end in sight for the next two months. So I'm not sure how much help I'll be, at least near term.
asXML has been one of my more problematic efforts, and the current version is fragile at best. I really appreciate your debugging efforts, and if you can crack this problem, you will be lauded high in the annals of pyparsingdom. :)
The -1 index is used when adding a ParseResults to a ParseResults, to reflect the nesting, but I thought that the indices were "fixed up" somewhere in the process (it *has* been a while since I wrote this code). ParseResults get very tricky to build and navigate, since the final structure isn't really known when you're in the middle of parsing, but of course, that is when the structure is being built.
Have you tried passing a wrapping tag in to asXML, as in asXML("DATA")? This could work around your "first tag" issues.
-- Paul
This isn't a fix, but it might be of some help in tracing the problem... I turned on debugging in the program, and I put this line at the start of ParseResults.__setitem__
print "ParseResults.__setitem__",repr((k,v))
Which resulted in the following stuff being printed:
ParseResults.__setitem__ ('body', (([([u'\xc0', u'\xd6'], {}), ([u'\xd8', u'\xf6'], {}), ([u'\xf8', u'\xfe'], {})], {}), -1))
ParseResults.__setitem__ ('body', (([([u'\xc0', u'\xd6'], {}), ([u'\xd8', u'\xf6'], {}), ([u'\xf8', u'\xfe'], {})], {}), 1))
Match link at loc 0 (1,1)
ParseResults.__setitem__ ('link', ((['../166/x114.htm'], {}), -1))
Matched link -> ['../166/x114.htm']
ParseResults.__setitem__ ('link', ((['../166/x114.htm'], {}), -1))
Match link at loc 24 (1,25)
ParseResults.__setitem__ ('link', ((['../168/x229.htm'], {}), -1))
Matched link -> ['../168/x229.htm']
ParseResults.__setitem__ ('link', ((['../168/x229.htm'], {}), -1))
ParseResults.__setitem__ ('link', ((['../168/x229.htm'], {}), 1))
...
Note the -1 values, and that the 2nd link has 3 calls (the last having a value of 1), but the 1st link only has 2 calls (both values being -1).