Hello,
I have been able to use pyparsing quite effectively for parsing natural language text containing some particular expressions. Until now.
I need to match a phrase with 0 or more words. But it has to stop matching on one of a particular set of words. Here's what I mean.
I define the following constraints:
Years = Word('1', nums, exact=4)
Months = oneOf('Jan Feb Mar Apr')
Days = Word(nums,min=1,max=2)
Places = OneOrMore(Word(alphas, alphas + '.' + ','))
I define the following parse rule:
r = ((CaselessLiteral('arrived') ^
      CaselessLiteral('departed')) +
     Optional(oneOf('in on at from')) +
     Optional(Places) +
     Optional(Months) +
     Optional(Days) +
     Years)
testString1 = 'Departed from Kansas Jan 4 1987'
testString2 = 'Departed from Kansas City Jan 4 1987'
testString3 = 'Arrived in Kansas Feb 4 1988'
testString4 = 'Arrived in New York Mar 6 1989'
When I do something like the following:
for match in r.scanString(testString1):
    print(match)
the parser lets Places match both 'Kansas' and 'Jan', so Months never gets matched. Day and Year match correctly.
What I would like to be able to do is to define a ParserElement subclass like I have done with Places, but somehow tell it to exclude the words 'Jan', 'Feb', 'Mar', and 'Apr'. Then this definition would allow the month to match correctly.
I tried using NotAny and CharsNotIn without any luck. Is there a way to specify a ParserElement subclass with the normal Word() syntax for what _should_ match but also with a specific set of words that must not cause a match?
Thanks,
~Michael.
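The behaviour described in the question can be reproduced in isolation. The sketch below is only an illustration, assuming pyparsing is installed; the two-word input 'Kansas Jan' and the names greedy_tokens and month_matched are made up for the demonstration.

```python
from pyparsing import OneOrMore, ParseException, Word, alphas, oneOf

Months = oneOf('Jan Feb Mar Apr')
# Places as defined in the question: any run of alphabetic words.
Places = OneOrMore(Word(alphas, alphas + '.' + ','))

# On its own, Places consumes the month as just another word.
greedy_tokens = Places.parseString('Kansas Jan').asList()
print(greedy_tokens)

# An expression that requires a month after the place therefore fails,
# because nothing is left for Months once Places has finished.
try:
    (Places + Months).parseString('Kansas Jan')
    month_matched = True
except ParseException:
    month_matched = False
print(month_matched)
```

Since 'Jan' is a perfectly valid Word(alphas, ...), the repetition has no reason to stop before it.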
The short answer is, try changing Places to:
Places = Group(OneOrMore(~Months+Word(alphas, alphas + '.' + ',')))
(~ is operator shorthand for NotAny)
Before accepting another Word, this first makes sure it is *not* one of the Months. If it is, the OneOrMore stops reading Words and parsing goes on to the next part of your expression.
The Group is there to keep all your Places words together - otherwise, you just end up with a list of tokens that you'll have to pick apart again later - this way pyparsing keeps track of them while you are parsing.
Glad to hear pyparsing is working well for you!
-- Paul
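For reference, here is the suggested Places definition dropped into the grammar from the question, as a minimal runnable sketch. It assumes the four test strings are meant to be distinct, collects them in a list called tests (a name chosen here, not from the thread), and wraps the rule in parentheses so it can span several lines.

```python
from pyparsing import (CaselessLiteral, Group, OneOrMore, Optional, Word,
                       alphas, nums, oneOf)

Years = Word('1', nums, exact=4)
Months = oneOf('Jan Feb Mar Apr')
Days = Word(nums, min=1, max=2)
# ~Months is a negative lookahead: a word is only accepted as part of the
# place name if a month does not start at that position.
Places = Group(OneOrMore(~Months + Word(alphas, alphas + '.' + ',')))

r = ((CaselessLiteral('arrived') ^ CaselessLiteral('departed'))
     + Optional(oneOf('in on at from'))
     + Optional(Places)
     + Optional(Months)
     + Optional(Days)
     + Years)

tests = [
    'Departed from Kansas Jan 4 1987',
    'Departed from Kansas City Jan 4 1987',
    'Arrived in Kansas Feb 4 1988',
    'Arrived in New York Mar 6 1989',
]

results = [r.parseString(s).asList() for s in tests]
for tokens in results:
    print(tokens)
```

One caveat worth keeping in mind: oneOf matches 'Mar' as a literal prefix, so a place name such as 'Marseille' would also be rejected by ~Months. Matching the months as whole words (for example with Keyword) would be one way around that, if it matters for your data.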
As a nicer-looking alternative to Group, you can specify Combine with a join string of " ", and adjacent=False, as in:
Places = Combine(OneOrMore(~Months+Word(alphas, alphas + '.' + ','))," ",adjacent=False)
This will give you parsing results like:
['departed', 'from', 'Kansas', 'Jan', '4', '1987']
['departed', 'from', 'Kansas City', 'Jan', '4', '1987']
['arrived', 'in', 'Kansas', 'Feb', '4', '1988']
['arrived', 'in', 'New York', 'Mar', '6', '1989']
instead of
['departed', 'from', ['Kansas'], 'Jan', '4', '1987']
['departed', 'from', ['Kansas', 'City'], 'Jan', '4', '1987']
['arrived', 'in', ['Kansas'], 'Feb', '4', '1988']
['arrived', 'in', ['New', 'York'], 'Mar', '6', '1989']
-- Paul
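To tie this back to the scanString loop in the question, the following sketch uses the Combine version of Places to pull matches out of running text. The sentence in text is invented for this example, and found is just a helper list; scanString yields the matched tokens together with the start and end offsets of each match.

```python
from pyparsing import (CaselessLiteral, Combine, OneOrMore, Optional, Word,
                       alphas, nums, oneOf)

Years = Word('1', nums, exact=4)
Months = oneOf('Jan Feb Mar Apr')
Days = Word(nums, min=1, max=2)
# Combine joins the place words with a single space; adjacent=False allows
# whitespace between the words being combined.
Places = Combine(OneOrMore(~Months + Word(alphas, alphas + '.' + ',')),
                 " ", adjacent=False)

r = ((CaselessLiteral('arrived') ^ CaselessLiteral('departed'))
     + Optional(oneOf('in on at from'))
     + Optional(Places)
     + Optional(Months)
     + Optional(Days)
     + Years)

# Hypothetical running text, made up for this example.
text = ('He departed from Kansas City Jan 4 1987 '
        'and arrived in New York Mar 6 1989.')

found = []
for tokens, start, end in r.scanString(text):
    found.append(tokens.asList())
    print(tokens.asList())
```

Each place comes back as a single string, so there is no nested list to unpack afterwards.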