Re: [Pyparsing] Efficency of Keyword (and a couple other bits)
From: Corrin L. <Cor...@da...> - 2007-03-20 20:26:08
Here you go:
www.nzpost.co.nz/NZPost/Images/addressing.nzpost/pdfs/AddressStandards.pdf

It is very long and tiresome, though you can probably get away with just
Chapter 4 and Appendix A. Since I'm only doing validation I don't have to
worry about imperfect input, which helps simplify things a lot.

1) The unit line is a mess:

    UNIT = Optional((UNIT_TYPE + UNIT_IDENTIFIER)) + Optional(FLOOR) + Optional(BUILDING_NAME)

That's made trickier since if none of the elements is present then the
whole line is skipped, and BUILDING_NAME is defined pretty much as .*

Also, the building name may go on a separate line, as in

    ADDRESS = REST
            | (UNIT + SEPERATOR + REST)
            | ((Optional(UNIT_TYPE + UNIT_IDENTIFIER) + Optional(FLOOR)) + SEPERATOR + BUILDING_NAME + SEPERATOR + REST)

2) The street name really complicates processing a street line:

It starts with an optional UNIT_IDENTIFIER followed by a slash (2/22 Foo St
means FLAT 2, 22 Foo St).

A few streets don't have a street suffix and, annoyingly, often have a
street suffix in the name (The Terrace is the best known).

A few streets have a street direction at the end of the street name (e.g.
North, Upper, Extension). Fortunately, street suffix and street direction
are disjoint.

So, if I were using a hypothetical perfect parser generator, I could write
it like this (skipping setResultsName):

    UNIT_IDENTIFIER = Word(alphanums)
    STREET_NUMBER = Word(nums)
    STREET_ALPHA = alphas
    STREET_NAME = OneOrMore(Word(alphas))
    LONG_SUFFIX = "STREET" | "ROAD" | "DRIVE" | ...
    SHORT_SUFFIX = "ST" | "RD" | "DR" | ...
    STREET_SUFFIX = LONG_SUFFIX | SHORT_SUFFIX
    STREET_DIRECTION = "NORTH" | "N" | "EAST" | "E" | "EXTENSION" | "EXT" | "WEST" | "W"
    STREET_LEFTPART = Optional(UNIT_IDENTIFIER + "/") + STREET_NUMBER + Optional(STREET_ALPHA)
    STREET_NORMAL = STREET_LEFTPART + STREET_NAME + Optional(STREET_SUFFIX)
    HIGHWAY_NO = Word(alphanums)
    STREET_SH = STREET_LEFTPART + ("SH"|"STATE HIGHWAY") + HIGHWAY_NO + Optional("SH"|"STATE HIGHWAY")
    STREET = STREET_NORMAL | STREET_SH

Apart from the crazy cases like "THE TERRACE", which I handle with a whole
separate rule, the interesting part here is that ambiguity is best resolved
right to left. Looking leftmost, an address could start with a number, but
that number is either a street number or a unit number, and we don't know
which until we look for the slash. For the street name, we don't know for
sure that it has ended until we see the end of the field. At that point we
can consider whether the thing before the end of the field is a street
direction, a street suffix, or part of the street name.

The right-to-left thing also applies at a global scale. Seeing "SUITE" at
the start of an address could be the beginning of the unit or of a building
for a rural address, or of the unit or a building for an urban address! It
isn't until we get to the suburb that we know whether we're processing an
urban or a rural address (as rural addresses have a suburb of RD <number>).
However, looking rightmost we can only see a postcode or a country.

I considered reversing the entire input and parsing the whole lot
backwards, but that felt inelegant. However, your prefix tree suggestion
has given me an idea: I'm going to calculate the frequency of each
different street suffix and direction, and add that information to the list
of each suffix, using 'order by' in the select statement.

-----Original Message-----
From: Ralph Corderoy [mailto:ra...@in...]
Sent: Wednesday, March 21, 2007 2:23 AM
To: Corrin Lakeland
Cc: pyp...@li...
Subject: Re: [Pyparsing] Efficency of Keyword (and a couple other bits)

Have you a link to the NZ address format?

Cheers,

Ralph.
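
For what it's worth, the right-to-left resolution of the street line described above can be sketched in plain Python without a parser generator. This is only an illustration under stated assumptions, not the grammar from the standard: parse_street_line, LEFTPART and the two word sets are hypothetical names, the word lists are tiny samples, the street alpha is assumed to be attached to the number (22A), and a direction is assumed to follow the suffix when both are present. State highways and "THE TERRACE"-style names are not handled.

```python
import re

# Illustrative subsets only (assumption); the full lists live in the standard.
STREET_SUFFIXES = {"STREET", "ST", "ROAD", "RD", "DRIVE", "DR"}
STREET_DIRECTIONS = {"NORTH", "N", "EAST", "E", "WEST", "W", "EXTENSION", "EXT"}

# Leftmost token: optional "unit/" prefix, street number, optional single letter.
# Only the slash tells us whether the first number is a unit or a street number.
LEFTPART = re.compile(
    r"^(?:(?P<unit>[A-Z0-9]+)/)?(?P<number>[0-9]+)(?P<alpha>[A-Z])?$"
)


def parse_street_line(line):
    """Split a street line into fields, resolving the name from right to left."""
    tokens = line.upper().split()
    if len(tokens) < 2:
        raise ValueError("expected a street number and a street name")

    left = LEFTPART.match(tokens[0])
    if left is None:
        raise ValueError("unrecognised street number: %r" % tokens[0])

    name = tokens[1:]
    direction = None
    suffix = None

    # Peel optional fields off the right-hand end, always leaving at least
    # one word for the street name itself.
    if len(name) > 1 and name[-1] in STREET_DIRECTIONS:
        direction = name.pop()
    if len(name) > 1 and name[-1] in STREET_SUFFIXES:
        suffix = name.pop()

    fields = left.groupdict()  # unit, number, alpha (None when absent)
    fields.update(name=" ".join(name), suffix=suffix, direction=direction)
    return fields
```

With these sample lists, "2/22 Foo St" comes back as unit 2, number 22, name FOO, suffix ST, while "5 North Road" keeps NORTH as the street name because ROAD is consumed as the suffix first and a name word must remain. Working from the end of the field means no backtracking is needed for the name, which is the point of resolving right to left.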