Re: [Pyparsing] Efficency of Keyword (and a couple other bits)

Here you go:

www.nzpost.co.nz/NZPost/Images/addressing.nzpost/pdfs/AddressStandards.pdf

It is very long and tiresome, though you can probably get away with just
Chapter 4 and Appendix A.  Since I'm only doing validation I don't have
to worry about any imperfect input which helps simplify things a lot.

1) The unit line is a mess:

UNIT = Optional((UNIT_TYPE + UNIT_IDENTIFIER)) + Optional(FLOOR) +
       Optional(BUILDING_NAME)
That's made trickier since if none of the elements is present then the
whole line is skipped and BUILDING_NAME is defined pretty much as .*
Also, building name may go on a separate line, as in
ADDRESS = REST
        | (UNIT + SEPERATOR + REST)
        | ((Optional(UNIT_TYPE + UNIT_IDENTIFIER) + Optional(FLOOR)) +
           SEPERATOR + BUILDING_NAME + SEPERATOR + REST)

2) The street name really complicates processing a street line:

It starts with an optional UNIT_IDENTIFIER followed by a slash (2/22 Foo
St means FLAT 2, 22 Foo St).
A few streets don't have a street suffix and, annoyingly, often have a
street suffix in the name (The Terrace is the best known).
A few streets have a street direction at the end of the street name
(e.g. North, Upper, Extension). Fortunately, street suffix and street
direction are disjoint.

So, if I was using a hypothetical perfect parser generator, I could
write it like (skipping setResultsName):

UNIT_IDENTIFIER = Word(alphanums)
STREET_NUMBER = Word(nums)
STREET_ALPHA = alphas
STREET_NAME = OneOrMore(Word(alphas))
LONG_SUFFIX = "STREET" | "ROAD" | "DRIVE" | ...
SHORT_SUFFIX = "ST" | "RD" | "DR" | ...
STREET_SUFFIX = LONG_SUFFIX | SHORT_SUFFIX
STREET_DIRECTION = "NORTH" | "N" | "EAST" | "E" | "EXTENSION" | "EXT" |
                   "WEST" | "W"
STREET_LEFTPART = Optional(UNIT_IDENTIFIER + "/") + STREET_NUMBER +
                  Optional(STREET_ALPHA)
STREET_NORMAL = STREET_LEFTPART + STREET_NAME + Optional(STREET_SUFFIX)
HIGHWAY_NO = Word(alphanums)
STREET_SH = STREET_LEFTPART + ("SH"|"STATE HIGHWAY") + HIGHWAY_NO +
            Optional("SH"|"STATE HIGHWAY")
STREET = STREET_NORMAL | STREET_SH

Apart from the crazy cases of "THE TERRACE" which I handle by a whole
separate rule, the interesting part here is that ambiguity is best
resolved right to left.  Looking leftmost an address could start with a
number but it means either a street number or a unit number - we don't
know until we look for the slash.  For the street name we don't know for
sure the street name has ended until we see the end of field.  At that
point we can consider whether the thing before the end of field is a
street direction, or a street suffix, or part of the street name.
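
To make the right-to-left idea concrete, here is a minimal sketch in plain
Python (deliberately not pyparsing, so it runs with the standard library
alone). The word lists are abbreviated placeholders, the function name is
made up for illustration, and it assumes that a direction, when present,
comes after the suffix. "THE TERRACE"-style names and state highways would
still need their own rules.

```python
import re

# Hypothetical, abbreviated word lists; the full ones would come from the standard.
STREET_SUFFIXES = {"STREET", "ST", "ROAD", "RD", "DRIVE", "DR"}
STREET_DIRECTIONS = {"NORTH", "N", "EAST", "E", "WEST", "W", "EXTENSION", "EXT"}

# Optional "unit/" prefix, then the street number, then an optional single letter.
LEFTPART = re.compile(r"^(?:(?P<unit>[A-Z0-9]+)/)?(?P<number>\d+)(?P<alpha>[A-Z])?$")


def split_street_line(line):
    """Split a street line, resolving name/suffix/direction from the right."""
    tokens = line.upper().split()
    if len(tokens) < 2:
        raise ValueError("expected a street number and a street name")
    match = LEFTPART.match(tokens[0])
    if match is None:
        raise ValueError("bad unit/street number: %r" % tokens[0])
    rest = tokens[1:]
    direction = suffix = None
    # Peel optional trailing elements off the right-hand end, always
    # leaving at least one word for the street name itself.
    # Assumption: a direction, if present, follows the suffix.
    if len(rest) > 1 and rest[-1] in STREET_DIRECTIONS:
        direction = rest.pop()
    if len(rest) > 1 and rest[-1] in STREET_SUFFIXES:
        suffix = rest.pop()
    return {
        "unit": match.group("unit"),
        "number": match.group("number"),
        "alpha": match.group("alpha"),
        "name": " ".join(rest),
        "suffix": suffix,
        "direction": direction,
    }
```

With this, split_street_line("2/22 Foo St") gives unit "2", number "22",
name "FOO" and suffix "ST", while a one-word street such as "7 Broadway"
keeps its single word as the name because nothing is peeled off.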

The right to left thing also applies at a global scale.  Seeing "SUITE"
at the start of an address could be the beginning of the unit or of a
building for a rural address, or of the unit or a building for an urban
address!  It isn't until we get to the suburb that we know if we're
processing an urban or a rural address (as rural addresses have a suburb
of RD <number>).  However looking rightmost we can only see a postcode
or a country.
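
As a rough sketch of that urban/rural decision, again in plain Python: the
helper below scans the address fields from the right and calls the address
rural as soon as a whole field looks like "RD <number>". The function name,
the exact pattern and the sample addresses are my own assumptions for
illustration, not something taken from the standard.

```python
import re

# Assumption: a rural suburb field is exactly "RD" followed by a number.
RURAL_SUBURB = re.compile(r"^RD\s*\d+$")


def address_kind(fields):
    """Return "rural" if any field, scanned right to left, is "RD <number>"."""
    for field in reversed(fields):
        if RURAL_SUBURB.match(field.strip().upper()):
            return "rural"
    return "urban"


# Made-up sample addresses, one field per line of the address.
print(address_kind(["Suite 1", "12 Foo Rd", "RD 2", "Sometown 1234"]))
print(address_kind(["2/22 Foo St", "Somesuburb", "Sometown 1234"]))
```

Because the pattern has to match the whole field, a street line that merely
ends in "RD" (such as "12 Foo Rd") is not mistaken for a rural suburb.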

I considered reversing the entire input and parsing the whole lot
backwards, but that felt inelegant.

However, your prefix tree suggestion has given me an idea. I'm going to
calculate the frequency of each street suffix and direction, store that
alongside each entry in the list, and then sort on it using 'order by'
in the select statement.
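
For what it's worth, the same ordering can be sketched in memory without a
database. This is only an illustration of the idea: the function name and
the sample data are made up, and I'm assuming the point of the ordering is
that the most common alternatives get tried first.

```python
from collections import Counter


def order_by_frequency(observed, known):
    """Return `known` sorted so the most frequently observed words come first."""
    counts = Counter(word for word in observed if word in known)
    # sorted() is stable, so words with equal counts keep their order in `known`.
    return sorted(known, key=lambda word: -counts[word])


# Made-up sample: suffixes seen in some address file, and the list to reorder.
seen = ["ST", "RD", "ST", "DR", "ST", "RD"]
print(order_by_frequency(seen, ["DR", "RD", "ST"]))
```

The reordered list can then be used to build the alternation, so the
alternatives that match most often are listed first.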

-----Original Message-----
From: Ralph Corderoy [mailto:ra...@in...]
Sent: Wednesday, March 21, 2007 2:23 AM
To: Corrin Lakeland
Cc: pyp...@li...
Subject: Re: [Pyparsing] Efficency of Keyword (and a couple other bits)

Have you a link to the NZ address format?

Cheers,

Ralph.