Thread: [Pyparsing] Better way than operatorPrecedence to parse a regexp-like grammar?
Brought to you by:
ptmcg
From: Andrew W. <and...@gm...> - 2009-02-26 06:12:42
|
I am writing a proxy (http://xuproxy.sourceforge.net/) that includes header and content filters that use a regexp-like matching language (basically a hybrid of globs and regexps). The parser for this matching language is implemented using pyparsing (the actual matching code does not use pyparsing, only the parser for the patterns themselves). Currently, it uses the operatorPrecedence function (actually, a custom version of it, modified to deal with certain features of the matching language). There are two problems with this, though. The first is that it is slow (up to a second or more on complex patterns), even with packrat parsing enabled. The second is that very complex patterns cause the maximum recursion depth to be exceeded. Is there a way to parse a regexp-like language with pyparsing that is faster and involves less recursion than operatorPrecedence? The source for the parser module can be seen at http://freehg.org/u/andreww591/xuproxy/file/tip/lib/xuproxy/proxo_filter/matching_parser.py |
From: spir <den...@fr...> - 2009-02-26 07:50:12
|
Le Wed, 25 Feb 2009 23:14:44 -0700, Andrew Warkentin <and...@gm...> s'exprima ainsi: > I am writing a proxy (http://xuproxy.sourceforge.net/) that includes > header and content filters that use a regexp-like matching language > (basically a hybrid of globs and regexps). The parser for this matching > language is implemented using pyparsing (the actual matching code does > not use pyparsing, only the parser for the patterns themselves). > Currently, it uses the operatorPrecedence function (actually, a custom > version of it, modified to deal with certain features of the matching > language). There are two problems with this, though. The first is that > it is slow (up to a second or more on complex patterns), even with > packrat parsing enabled. The second is that very complex patterns cause > the maximum recursion depth to be exceeded. Is there a way to parse a > regexp-like language with pyparsing that is faster and involves less > recursion than operatorPrecedence? Maybe it is a stupid -- anyway ;-) Have you considered simply letting down operator precedence for your language? In favor of sequential semantics + (gouping). I had a similar problem and this worked fine for me (actually I just did not wish to cope with operator precedence, I had no perfomance issue), because: * Many expressions did not involve operators of different precedence level * Many where logically commutative, meaning one could rewrite "a | b c" to "b c | a" when "b c" cannot be a prefix of "a" * I often simply wrote intermediate (sub) expressions so that composite ones hold higher level non-terminals in which there is no more precedence: p = a b | c d ==> p1 = a b p2 = c d p = p1 | p2 * In other cases writing () was not such a burden. I may be wrong but I think operator precedence is inherently time-consuming precisely because it involves recursion. As recursion cannot be done at the same location in the input string (otherwise infinite loop -- the reason why left recursion is not possible), then packrat memoïzing will not help with this one aspect of the issue. As I understand it, it can help only in such case: a b | a c Then when b fails, a will not be evaluated twice. (Please tell me if I'm wrong.) Denis ------ la vita e estrany |
From: Paul M. <pt...@au...> - 2009-02-26 14:58:28
|
You should start by trying to tune up your definition of term, since this expression gets used a *lot* internally to the operatorPrecedence code. Here are some comments/questions/suggestions on cleaning up term: 1. term includes 2 references to numericRange, why? 2. variable() returns an expression that is a MatchFirst of 10 different characters. Just have this method return Regex("[0-9#]"), it will evaluate much faster. 3. numericRange tests for signedNumbers, and then for unsigned numbers. But your unsigned numbers would match the signedNumber expression, so the second alternative test will never match. Also, signedNumber the way you have defined it would match a single "-" character, probably not desired. Try this: signedNumber = Optional('-') + Word(nums) # or even just Regex(r"-?\d+") numericRange = ( (lbrack + Literal('#') + (signedNumber | '*').setResultsName('min') + Suppress(':') + (signedNumber | '*').setResultsName('max') + rbrack) ) (I also removed the Combine - you might want Group instead.) 4. I streamlined repetition a bit from: repetition = ( ( plus + lbrace + Word(nums).setResultsName("count") + rbrace ) | ( plus + lbrace + Word(nums).setResultsName("minCount")+","+ Word(nums).setResultsName("maxCount") + rbrace ) | plus ) to: repetition = plus + Optional( lbrace + ( ( Word(nums).setResultsName("minCount")+","+ Word(nums).setResultsName("maxCount") ) | Word(nums).setResultsName("count") ) + rbrace ) Which could look a little nicer as: repetition = plus + Optional( lbrace + ( ( Word(nums)("minCount")+","+ Word(nums)("maxCount") ) | Word(nums)("count") ) + rbrace ) (runs no faster, but I find it a little easier to read). Since repetition is the first precedence level, it gets used a lot, so any streamlining here helps. 5. space = OneOrMore(White()) Really? I doubt you are matching any these at the moment, since you aren't taking any steps to disable pyparsing's default behavior of skipping whitespace. But as the first alternative in the list of expressions in term, you are testing for it *many* times. 6. You might be able to reorder the options in term based on the likelihood of occurrence in the input text. Since this is a MatchFirst, testing for more common options ahead of rarer ones will shortcut the rest of the tests, with a performance win. You might also define: integer = Word(nums) And then use integer in all your related expressions, instead of repeating Word(nums) all the time - this will make your code a little easier to read, and the packratting will be a little more efficient, too. I also have some comments on operatorPrecedence itself, but I'll wait until you have gotten term to run a bit better before delving into oP. Just one note, instead of (in your list of precedence definitions): (Empty(), 2, opAssoc.LEFT, self.handleSequence), Try: (None, 2, opAssoc.LEFT, self.handleSequence), -- Paul |