Thread: [Pyparsing] Proposed Visitor interface to parse strings
From: Paul M. <pt...@au...> - 2009-09-24 08:15:28
I am considering adding an interface to pyparsing along the lines of the Visitor pattern. My intent is to make it easier to work with the scanString method. Currently, when using scanString, one gets the tokens, start, and end for each matching text in the input string. This forces the caller to keep track of some low-level parsing state/locations if they need to do some processing of the intervening text, or some other stateful work. By writing a Visitor, this can be tracked in a more object-friendly way.

Here's how the Visitor would work. The concept is that, after creating a pyparsing grammar, one could define a class that implements a method visitExpr, which receives a ParseResults containing the matching tokens, and optionally a method visitIntervening, which receives a string containing the portion of the input string between matches - call this class ParseVisitor. The pyparsing grammar expression - let's call it expr - then accepts this visitor, and gives us a callable object. This new object can now be called with an input string, and the visitExpr and visitIntervening methods will get called as the input string is parsed.

Here is a sample:

from pyparsing import *

expr = Word(alphas)

tests = """\
ABC 123 DEF 456
ABC 123 DEF 456 XYZ
 ABC 123 DEF 456 XYZ
0 ABC 123 DEF 456 XYZ
""".splitlines()


class ParseVisitor(object):
    def visitExpr(self, tokens):
        print ">%s<" % tokens.asList(),
    def visitIntervening(self, strng):
        print "^%s^" % strng,


visitor = ParseVisitor()
processor = expr.accept(visitor)
for t in tests:
    print t
    processor(t)
    print
    print

Prints out:

ABC 123 DEF 456
^^ >['ABC']< ^ 123 ^ >['DEF']< ^ 456^

ABC 123 DEF 456 XYZ
^^ >['ABC']< ^ 123 ^ >['DEF']< ^ 456 ^ >['XYZ']<

 ABC 123 DEF 456 XYZ
^ ^ >['ABC']< ^ 123 ^ >['DEF']< ^ 456 ^ >['XYZ']<

0 ABC 123 DEF 456 XYZ
^0 ^ >['ABC']< ^ 123 ^ >['DEF']< ^ 456 ^ >['XYZ']<

(In the pure Visitor pattern, ParseVisitor would implement two different methods, both named visit, with one taking a ParseResults and the other taking a string. But since Python doesn't do function overloading, I've had to give these different names. But now, how nice and explicit the resulting class is!)

What do people think of this idea?

-- Paul
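The proposed behavior can be approximated on top of a scanString-style loop. What follows is only a minimal sketch under stated assumptions, not pyparsing's API: make_processor, scan_words, and RecordingVisitor are hypothetical names, and a regex-based scanner stands in for expr.scanString so that the example runs with the standard library alone. Reporting trailing text only when it is non-empty is an assumption inferred from the sample output.

```python
import re


def scan_words(text):
    """Stand-in for expr.scanString: yield (tokens, start, end) per match."""
    for m in re.finditer(r"[A-Za-z]+", text):
        yield [m.group()], m.start(), m.end()


def make_processor(scan, visitor):
    """Return a callable that feeds matches and intervening text to visitor.

    Hypothetical helper, not part of pyparsing. `scan` is any callable that
    yields (tokens, start, end) triples for an input string.
    """
    visit_gap = getattr(visitor, "visitIntervening", None)  # optional method

    def process(text):
        last_end = 0
        for tokens, start, end in scan(text):
            if visit_gap is not None:
                visit_gap(text[last_end:start])
            visitor.visitExpr(tokens)
            last_end = end
        # Assumption: trailing text is reported only when it is non-empty.
        if visit_gap is not None and last_end < len(text):
            visit_gap(text[last_end:])

    return process


class RecordingVisitor(object):
    """Collects the calls it receives instead of printing them."""

    def __init__(self):
        self.events = []

    def visitExpr(self, tokens):
        self.events.append(("expr", list(tokens)))

    def visitIntervening(self, strng):
        self.events.append(("gap", strng))


visitor = RecordingVisitor()
processor = make_processor(scan_words, visitor)
processor("ABC 123 DEF 456")
for event in visitor.events:
    print(event)
```

With pyparsing installed, passing expr.scanString as the scan argument should drive the same loop, with tokens arriving as ParseResults rather than plain lists.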
From: Helmut J. <jar...@ig...> - 2009-09-24 08:58:08
On 24 Sep, Paul McGuire wrote:
> What do people think of this idea?

Yes, that looks great!

One question though: would it be possible to let visitExpr and visitIntervening return a value (None or a bool) to indicate whether they would like to be called again for the remaining tokens?

Helmut.

--
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
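One possible reading of this suggestion, sketched under assumptions: each visit method is tracked separately, a return value of False switches that method off for the rest of the input, and None or True means "keep calling". The names process, scan_words, and FirstMatchOnly are hypothetical, and the regex scanner is a stand-in for scanString, not pyparsing code.

```python
import re


def scan_words(text):
    """Stand-in scanner: yield (tokens, start, end) triples like scanString."""
    for m in re.finditer(r"[A-Za-z]+", text):
        yield [m.group()], m.start(), m.end()


def process(text, visitor, scan=scan_words):
    """Drive the visitor; a method that returns False is not called again."""
    active = {"visitExpr": True, "visitIntervening": True}

    def call(name, arg):
        if active[name]:
            # None (no explicit return) and True both mean "keep calling".
            if getattr(visitor, name)(arg) is False:
                active[name] = False

    last_end = 0
    for tokens, start, end in scan(text):
        call("visitIntervening", text[last_end:start])
        call("visitExpr", tokens)
        last_end = end
    if last_end < len(text):
        call("visitIntervening", text[last_end:])


class FirstMatchOnly(object):
    """Wants only the first match, but every piece of intervening text."""

    def __init__(self):
        self.matches = []
        self.gaps = []

    def visitExpr(self, tokens):
        self.matches.append(tokens)
        return False  # do not call visitExpr again

    def visitIntervening(self, strng):
        self.gaps.append(strng)  # implicit None: keep calling


v = FirstMatchOnly()
process("ABC 123 DEF 456", v)
print(v.matches)
print(v.gaps)
```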
From: Paul M. <pt...@au...> - 2009-09-27 12:42:32
> would it be possible to let visitExpr and visitIntervening
> return a value (None or a bool)
> to indicate whether they would like to be called again
> for the remaining tokens?

Helmut,

Thanks for your suggestion. I've thought about it a bit; see what you think.

I think the default behavior would be that these methods, if defined, would continue to be called as long as there are parsed matches in the input text. Since Python methods that don't explicitly return anything actually return None, I would interpret a None return as the same as the default case, that is, to keep on matching.

To add an overt return value to indicate that no more calls should be made, we could return True to keep on calling, or False to stop calling. Oddly, this would have pyparsing treating returned values of True and None equally, which is a bit of a code smell. If I invert the meaning of the returned flag, False meaning keep calling and True meaning stop calling, then my flag asserts a negative, which is a different kind of smell to me.

Instead of returning a flag, the methods could raise an exception, and StopIteration seems like a logical choice. My first thought is to have either one of these visit methods raise StopIteration, and have that stop the parsing process altogether - this seems to me to be in line with the spirit of the original Visitor pattern, in which all visit() methods were roughly the same, differing only in argument signatures. Or I could track visitExpr and visitIntervening separately, and if one raises StopIteration, I could have pyparsing continue to call the other. But this feels weird; my instinct would be to have StopIteration just stop parsing altogether, whichever visit method raised it.

Here is an alternative: instead of adding this flag or exception as part of the interaction between pyparsing and your Visitor code, you could have your Visitor class handle the alternative logic.

Here is a class that, after having had a method called once, changes the instance's method to a do-nothing method.

class CallOnceVisitor(object):
    def method(self):
        print("method")
        # redefine method, since we only wanted the first call
        self.method = self.do_nothing

    def do_nothing(self):
        pass

co = CallOnceVisitor()
co.method()   # prints "method"
co.method()   # does nothing

You could do the same in visitExpr, by changing self.visitExpr to self.do_nothing. Pyparsing will still make the function calls, but they will just return immediately. But now you have more control over what happens when, and pyparsing's logic stays fairly simple-minded.

Overall, I think adding support for StopIteration (or a similar exception) is good, in which any visitXXX method can raise it and stop the parsing process. If finer control is needed, then I would put the burden back on the visitXXX method implementations to keep track, perhaps using techniques like in CallOnceVisitor.

-- Paul
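The StopIteration approach favored above might look roughly like the following. This is a sketch under assumptions rather than anything pyparsing implements: process, scan_words, and StopAfterTwo are hypothetical names, and the regex scanner again stands in for scanString. Either visit method may raise StopIteration, and doing so ends processing altogether.

```python
import re


def scan_words(text):
    """Stand-in scanner: yield (tokens, start, end) triples like scanString."""
    for m in re.finditer(r"[A-Za-z]+", text):
        yield [m.group()], m.start(), m.end()


def process(text, visitor, scan=scan_words):
    """Call the visit methods until input ends or one raises StopIteration."""
    last_end = 0
    try:
        for tokens, start, end in scan(text):
            visitor.visitIntervening(text[last_end:start])
            visitor.visitExpr(tokens)
            last_end = end
        if last_end < len(text):
            visitor.visitIntervening(text[last_end:])
    except StopIteration:
        # Raised by a visit method in the loop body, not by the scanner:
        # stop processing, whichever method raised it.
        pass


class StopAfterTwo(object):
    """Records matches and asks to stop once it has seen two of them."""

    def __init__(self):
        self.seen = []

    def visitExpr(self, tokens):
        self.seen.append(tokens[0])
        if len(self.seen) == 2:
            raise StopIteration

    def visitIntervening(self, strng):
        pass


v = StopAfterTwo()
process("ABC 123 DEF 456 XYZ", v)
print(v.seen)
```

Because the exception is caught in one place, the driver needs no per-method bookkeeping; a visitor that wants finer control can still swap its own methods out, as CallOnceVisitor does.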