Thread: [Pyparsing] Better way than operatorPrecedence to parse a regexp-like grammar?

Brought to you by: ptmcg

pyparsing-users

[Pyparsing] Better way than operatorPrecedence to parse a regexp-like grammar?

From: Andrew W. <and...@gm...> - 2009-02-26 06:12:42

I am writing a proxy (http://xuproxy.sourceforge.net/) that includes 
header and content filters that use a regexp-like matching language 
(basically a hybrid of globs and regexps). The parser for this matching 
language is implemented using pyparsing (the actual matching code does 
not use pyparsing, only the parser for the patterns themselves). 
Currently, it uses the operatorPrecedence function (actually, a custom 
version of it, modified to deal with certain features of the matching 
language). There are two problems with this, though. The first is that 
it is slow (up to a second or more on complex patterns), even with 
packrat parsing enabled. The second is that very complex patterns cause 
the maximum recursion depth to be exceeded. Is there a way to parse a 
regexp-like language  with pyparsing that is faster and involves less 
recursion than operatorPrecedence?

The source for the parser module can be seen at 
http://freehg.org/u/andreww591/xuproxy/file/tip/lib/xuproxy/proxo_filter/matching_parser.py

Re: [Pyparsing] Better way than operatorPrecedence to parse a regexp-like grammar?

From: spir <den...@fr...> - 2009-02-26 07:50:12

Le Wed, 25 Feb 2009 23:14:44 -0700,
Andrew Warkentin <and...@gm...> s'exprima ainsi:

> I am writing a proxy (http://xuproxy.sourceforge.net/) that includes 
> header and content filters that use a regexp-like matching language 
> (basically a hybrid of globs and regexps). The parser for this matching 
> language is implemented using pyparsing (the actual matching code does 
> not use pyparsing, only the parser for the patterns themselves). 
> Currently, it uses the operatorPrecedence function (actually, a custom 
> version of it, modified to deal with certain features of the matching 
> language). There are two problems with this, though. The first is that 
> it is slow (up to a second or more on complex patterns), even with 
> packrat parsing enabled. The second is that very complex patterns cause 
> the maximum recursion depth to be exceeded. Is there a way to parse a 
> regexp-like language  with pyparsing that is faster and involves less 
> recursion than operatorPrecedence?

Maybe it is a stupid -- anyway ;-)
Have you considered simply letting down operator precedence for your language? In favor of sequential semantics + (gouping).

I had a similar problem and this worked fine for me (actually I just did not wish to cope with operator precedence, I had no perfomance issue), because:

* Many expressions did not involve operators of different precedence level

* Many where logically commutative, meaning one could rewrite
"a | b c"   to   "b c | a"   when "b c" cannot be a prefix of "a"

* I often simply wrote intermediate (sub) expressions so that composite ones hold higher level non-terminals in which there is no more precedence:
p = a b | c d	==>
p1 = a b
p2 = c d
p = p1 | p2

* In other cases writing () was not such a burden.

I may be wrong but I think operator precedence is inherently time-consuming precisely because it involves recursion. As recursion cannot be done at the same location in the input string (otherwise infinite loop -- the reason why left recursion is not possible), then packrat memoïzing will not help with this one aspect of the issue. As I understand it, it can help only in such case:
	a b | a c
Then when b fails, a will not be evaluated twice.
(Please tell me if I'm wrong.)

Denis
------
la vita e estrany

Re: [Pyparsing] Better way than operatorPrecedence to parse aregexp-like grammar?

From: Paul M. <pt...@au...> - 2009-02-26 14:58:28

You should start by trying to tune up your definition of term, since this
expression gets used a *lot* internally to the operatorPrecedence code.
Here are some comments/questions/suggestions on cleaning up term:
1. term includes 2 references to numericRange, why?
2. variable() returns an expression that is a MatchFirst of 10 different
characters.  Just have this method return Regex("[0-9#]"), it will evaluate
much faster.
3. numericRange tests for signedNumbers, and then for unsigned numbers.  But
your unsigned numbers would match the signedNumber expression, so the second
alternative test will never match.  Also, signedNumber the way you have
defined it would match a single "-" character, probably not desired.  Try
this:
signedNumber = Optional('-') + Word(nums)  # or even just Regex(r"-?\d+")
numericRange = ( 
    (lbrack + Literal('#') + (signedNumber | '*').setResultsName('min') +
Suppress(':') + (signedNumber | '*').setResultsName('max') + rbrack)
    )
(I also removed the Combine - you might want Group instead.)

4. I streamlined repetition a bit from:

repetition = (
    ( plus + lbrace + Word(nums).setResultsName("count") + rbrace ) |
    ( plus + lbrace + Word(nums).setResultsName("minCount")+","+
Word(nums).setResultsName("maxCount") + rbrace ) |
    plus 
    )

to:
repetition = plus + Optional( lbrace + 
    ( ( Word(nums).setResultsName("minCount")+","+
Word(nums).setResultsName("maxCount") ) |
      Word(nums).setResultsName("count") )
    + rbrace 
    )

Which could look a little nicer as:
repetition = plus + Optional( lbrace + 
    ( ( Word(nums)("minCount")+","+ Word(nums)("maxCount") ) |
      Word(nums)("count") )
    + rbrace 
    )
(runs no faster, but I find it a little easier to read).  Since repetition
is the first precedence level, it gets used a lot, so any streamlining here
helps.


5. 		space = OneOrMore(White())
Really?  I doubt you are matching any these at the moment, since you aren't
taking any steps to disable pyparsing's default behavior of skipping
whitespace.  But as the first alternative in the list of expressions in
term, you are testing for it *many* times.

6. You might be able to reorder the options in term based on the likelihood
of occurrence in the input text.  Since this is a MatchFirst, testing for
more common options ahead of rarer ones will shortcut the rest of the tests,
with a performance win.

You might also define:
	integer = Word(nums)
And then use integer in all your related expressions, instead of repeating
Word(nums) all the time - this will make your code a little easier to read,
and the packratting will be a little more efficient, too.

I also have some comments on operatorPrecedence itself, but I'll wait until
you have gotten term to run a bit better before delving into oP.  Just one
note, instead of (in your list of precedence definitions):
			(Empty(), 2, opAssoc.LEFT, self.handleSequence),
Try:
			(None, 2, opAssoc.LEFT, self.handleSequence),


-- Paul