Thread: [Pyparsing] Efficiency of Keyword (and a couple other bits)
Brought to you by:
ptmcg
From: Corrin L. <Cor...@da...> - 2007-03-19 20:07:10
Hi,

I have pyparsing working fairly well, but it is going extremely slowly, so I'd like to know how to make it faster. (Current performance is roughly one line per second.)

Problem: I'm testing addresses to ensure they are correctly formatted. New Zealand Post has recently moved to computerised delivery. That means all business mail must be perfectly formed according to a grammar or else it doesn't get delivered. I implemented the grammar in pyparsing initially using rules like:

    STREET_DIRECTION = oneOf("NORTH EAST WEST SOUTH N E W S").setResultsName("Street Direction")
    STREET_SUFFIX = oneOf("STREET ROAD LANE ... ST RD LN ...").setResultsName("Street Suffix")
    STREET_NUMBER = Word(nums).setResultsName("Street Number")
    STREET_ALPHA = Optional(Word(alphas, exact=1)).setResultsName("Street Alpha")
    STREET_NAME = OneOrMore(Word(alphas)).setResultsName("Street Name")
    STREET = STREET_NUMBER + STREET_ALPHA + STREET_NAME + STREET_SUFFIX + STREET_DIRECTION + FollowedBy(FieldBreak)

However, I ran into a few problems. Firstly, the use of oneOf (which builds on Literal) meant that ST or RD appearing anywhere in an address matched a suffix, even when it was part of a larger word. I solved that problem by replacing oneOf with Keywords separated by bars, as in:

    cursor.execute(r'select distinct * from (select short_suffix from suffix_to_long UNION select long_suffix from suffix_to_long) as f')
    valid_suffix_str = "|".join(['Keyword("' + x[0] + '")' for x in cursor.fetchall()])
    STREET_SUFFIX = eval(valid_suffix_str).setResultsName("Street Suffix")

Disgusting, huh? But I couldn't find anything else that worked accurately :-(. So I guess my first question is: is there any better way of doing this? Or of speeding it up (because in comparison to oneOf, it is _really_ slow, even with enablePackrat)?

The second problem I ran into was that the parser was too greedy.
STREET_NAME did its best to suck up the STREET_SUFFIX without passing it over. I got around that by replacing the definition of STREET_NAME with a SkipTo(STREET_SUFFIX), but it is still greedier than necessary. I have these nice clear FieldBreaks that split up the address, but my pyparsing grammar does not take advantage of them for efficiency; e.g. there is no point looking for a street name that spans them. I just couldn't find any efficient way of forcing this STREET line to be locked to a single field.

I also ran into an intermittent problem with pyparsing's backtracking, where the city would be parsed as a suburb successfully, but the whole address would be rejected because the 'suburb' was not followed by a postcode. Pyparsing would correctly backtrack and find the correct parse, but the setResultsName("Suburb Name") resulted in both the suburb and the city being set to the same thing! (Un)fortunately I have changed the code since, and the current version does not exhibit this behaviour.

The last problem I ran into is with building names. A building name is defined as any string that is not a valid unit or a valid floor. E.g. "HARBOUR APARTMENTS" is a valid building name, as is "23 THE TERRACE", but "FLOOR 2" or "SUITE A" isn't. The only way I found to implement that was to create two instances of the parser and call the second instance inside setParseAction, but that's really slow too. I guess it's having to build a whole ParseResults when all I'm interested in is success or failure.

So, any ideas or suggestions welcomed, especially with respect to the Keyword issue.

Corrin Lakeland
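[Editor's note: the sketch below is not part of the original thread. It shows one way the eval()-based Keyword alternation, the greedy STREET_NAME, and the building-name check might be expressed directly in pyparsing. The suffix list, FLOOR/UNIT definitions, and test addresses are made-up samples; the FieldBreak handling from the original grammar is omitted.]

```python
from pyparsing import Keyword, MatchFirst, OneOrMore, Word, alphas, nums

# Hypothetical sample; in the original thread these come from a SQL query.
suffixes = ["STREET", "ROAD", "LANE", "ST", "RD", "LN"]

# MatchFirst builds the same "Keyword(...) | Keyword(...)" alternation the
# eval() trick produced, but without generating and evaluating source text.
STREET_SUFFIX = MatchFirst(
    [Keyword(s) for s in sorted(suffixes, key=len, reverse=True)]
).setResultsName("Street Suffix")

# A street-name word is any word that is NOT a suffix; the negative
# lookahead (~, i.e. NotAny) stops OneOrMore from swallowing the suffix.
STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas)).setResultsName("Street Name")

STREET_NUMBER = Word(nums).setResultsName("Street Number")
STREET = STREET_NUMBER + STREET_NAME + STREET_SUFFIX

result = STREET.parseString("23 MAIN STREET")
print(result["Street Suffix"])   # STREET

# The same ~ idea can replace the second parser instance used to reject
# "FLOOR 2" / "SUITE A" as building names, without a full second parse.
FLOOR = Keyword("FLOOR") + Word(nums)          # simplified, illustrative only
UNIT = Keyword("SUITE") + Word(alphas, exact=1)
BUILDING_NAME = ~(FLOOR | UNIT) + OneOrMore(Word(alphas + nums))

print(BUILDING_NAME.parseString("HARBOUR APARTMENTS").asList())
```

Because Keyword already enforces word boundaries, the length-sorting of the suffix list is belt-and-braces rather than strictly required here.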
From: Eike W. <eik...@gm...> - 2007-03-19 22:44:53
Hello Corrin!

On Monday 19 March 2007 21:06, Corrin Lakeland wrote:
> So, any ideas or suggestions welcomed, especially with respect to
> the Keyword issue.

There is the 'Keyword' parser; it probably does what you want. Usage:

    mathFuncs = Keyword('sin') | Keyword('cos') | Keyword('tan')

I use code similar to this in my toy language.

Regards,
Eike.
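[Editor's note: a quick demonstration, not from the thread, of the boundary behaviour that makes Keyword the right tool here: Literal matches its text anywhere, while Keyword requires the match to be a whole word.]

```python
from pyparsing import Keyword, Literal

# Literal("ST") happily matches the first two letters of "STREET" ...
lit = Literal("ST")
print(lit.parseString("STREET").asList())   # ['ST']

# ... but Keyword("ST") requires a word boundary after the match,
# so it fails on "STREET" and succeeds on a standalone "ST".
kw = Keyword("ST")
print(kw.parseString("ST").asList())        # ['ST']

try:
    kw.parseString("STREET")
except Exception as err:
    print("no match:", err)
```

This is exactly why the oneOf-based STREET_SUFFIX matched ST and RD inside longer words, and the Keyword-based version does not.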