Match with a negative expression

  • Hello,

    I have been able to use pyparsing quite effectively for parsing natural language text containing some particular expressions. Until now.

    I need to match a phrase with 0 or more words. But it has to stop matching on one of a particular set of words. Here's what I mean.

    I define the following constraints:

    from pyparsing import CaselessLiteral, OneOrMore, Optional, Word, alphas, nums, oneOf

    Years = Word('1', nums, exact=4)
    Months = oneOf('Jan Feb Mar Apr')
    Days = Word(nums, min=1, max=2)
    Places = OneOrMore(Word(alphas, alphas + '.' + ','))

    I define the following parse rule:

    r = ((CaselessLiteral('arrived') ^
          CaselessLiteral('departed')) +
         Optional(oneOf('in on at from')) +
         Optional(Places) +
         Optional(Months) +
         Optional(Days) +
         Optional(Years))

    testString1 = 'Departed from Kansas Jan 4 1987'
    testString2 = 'Departed from Kansas City Jan 4 1987'
    testString3 = 'Arrived in Kansas Feb 4 1988'
    testString4 = 'Arrived in New York Mar 6 1989'

    When I do something like the following:

       for match in r.scanString(testString1):
           print(match)

    Places matches both 'Kansas' and 'Jan', so Months never gets matched. Days and Years match correctly.

    What I would like to be able to do is to define a ParserElement subclass like I have done with Places, but somehow tell it to exclude the words 'Jan', 'Feb', 'Mar', and 'Apr'. Then this definition would allow the month to match correctly.

    I tried using NotAny and CharsNotIn without any luck. Is there a way to specify a ParserElement subclass with the normal Word() syntax for what _should_ match but also with a specific set of words that must not cause a match?



    • Paul McGuire

      The short answer is, try changing Places to:

      Places = Group(OneOrMore(~Months+Word(alphas, alphas + '.' + ',')))

      (~ is operator shorthand for NotAny)

      What this does is, before accepting another Word, the parser first makes sure it is *not* one of the Months. If it is, the OneOrMore will stop reading Words and go on to the next part of your expression.

      The Group is there to keep all your Places words together - otherwise, you just end up with a list of tokens that you'll have to pick apart again later - this way pyparsing keeps track of them while you are parsing.
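
      To make the effect of the lookahead concrete, here is a minimal sketch that parses the same sentence with and without ~Months in front of each word. The names place_word, greedy_places, guarded_places and build_rule are made up for this illustration, and Optional(Years) is assumed to be the final term of the rule, based on the remark in the question that the year matches.

```python
from pyparsing import (CaselessLiteral, Group, OneOrMore, Optional,
                       Word, alphas, nums, oneOf)

Years = Word('1', nums, exact=4)
Months = oneOf('Jan Feb Mar Apr')
Days = Word(nums, min=1, max=2)

# Hypothetical helper name: one word of a place name.
place_word = Word(alphas, alphas + '.' + ',')

# Greedy version (as in the question, plus Group): also swallows 'Jan'.
greedy_places = Group(OneOrMore(place_word))
# Guarded version: the negative lookahead stops before any Months word.
guarded_places = Group(OneOrMore(~Months + place_word))


def build_rule(places):
    """Build the arrival/departure rule around a given Places expression.

    Optional(Years) as the last term is an assumption (see the lead-in).
    """
    return ((CaselessLiteral('arrived') ^ CaselessLiteral('departed'))
            + Optional(oneOf('in on at from'))
            + Optional(places)
            + Optional(Months)
            + Optional(Days)
            + Optional(Years))


text = 'Departed from Kansas City Jan 4 1987'
greedy = build_rule(greedy_places).parseString(text).asList()
guarded = build_rule(guarded_places).parseString(text).asList()

print(greedy)   # 'Jan' ends up inside the grouped place
print(guarded)  # 'Jan' is left over for Months
```

      One thing to watch for: Months as defined here matches plain text without checking for a word boundary, so a place name that merely starts with one of the abbreviations (say, 'Marseille') would most likely be rejected by the lookahead as well.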

      Glad to hear pyparsing is working well for you!
      -- Paul

    • Paul McGuire

      As a nicer-looking alternative to Group, you can specify Combine with a join string of " ", and adjacent=False, as in:

      Places = Combine(OneOrMore(~Months+Word(alphas, alphas + '.' + ','))," ",adjacent=False)

      This will give you parsing results like:
      ['departed', 'from', 'Kansas', 'Jan', '4', '1987']
      ['departed', 'from', 'Kansas City', 'Jan', '4', '1987']
      ['arrived', 'in', 'Kansas', 'Feb', '4', '1988']
      ['arrived', 'in', 'New York', 'Mar', '6', '1989']

      instead of
      ['departed', 'from', ['Kansas'], 'Jan', '4', '1987']
      ['departed', 'from', ['Kansas', 'City'], 'Jan', '4', '1987']
      ['arrived', 'in', ['Kansas'], 'Feb', '4', '1988']
      ['arrived', 'in', ['New', 'York'], 'Mar', '6', '1989']
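
      For anyone who wants to try the Combine variant end to end, the following is a small self-contained sketch over the four test sentences. The names rule, tests and results are chosen for this illustration only, and Optional(Years) as the last term of the rule is an assumption based on the question's remark that the year matches.

```python
from pyparsing import (CaselessLiteral, Combine, OneOrMore, Optional,
                       Word, alphas, nums, oneOf)

Years = Word('1', nums, exact=4)
Months = oneOf('Jan Feb Mar Apr')
Days = Word(nums, min=1, max=2)

# adjacent=False lets the words be separated by whitespace in the input;
# the ' ' join string glues them back together into a single token.
Places = Combine(OneOrMore(~Months + Word(alphas, alphas + '.' + ',')),
                 ' ', adjacent=False)

# Optional(Years) is assumed here; see the note above the code.
rule = ((CaselessLiteral('arrived') ^ CaselessLiteral('departed'))
        + Optional(oneOf('in on at from'))
        + Optional(Places)
        + Optional(Months)
        + Optional(Days)
        + Optional(Years))

tests = [
    'Departed from Kansas Jan 4 1987',
    'Departed from Kansas City Jan 4 1987',
    'Arrived in Kansas Feb 4 1988',
    'Arrived in New York Mar 6 1989',
]

results = [rule.parseString(t).asList() for t in tests]
for tokens in results:
    print(tokens)
```

      Note that CaselessLiteral returns the text it was defined with, which is why the first token comes back in lower case even though the input is capitalized.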

      -- Paul