Corrin -
Address parsing is a tricky topic, and many mailing list companies spend a
lot of money developing proprietary solutions. It is helpful that New
Zealand has specified a standard format, so let's see if we can get
pyparsing to suss it out.
For your first question, here is a slightly cleaned-up version of your
street suffix generator (using the results from the db select to give us the
various possible street suffixes):
cursor.execute(
"select distinct * from (select short_suffix from suffix_to_long "
"UNION select long_suffix from suffix_to_long) as f")
STREET_SUFFIX = MatchFirst(
[ Keyword(x[0]) for x in cursor.fetchall() ]
).setResultsName("Street Suffix")
What's happening here is that, instead of using the '|' operators, we are
directly constructing a MatchFirst expression. Realize that expr1 | expr2
is just a short-cut for MatchFirst( [ expr1, expr2 ] ), so all we need to do
is build a list of all the Keyword expressions, and make a MatchFirst out of
them.
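As a quick check of that equivalence, here is a minimal sketch. The suffix list is a hypothetical stand-in for the rows your query returns, and it assumes a pyparsing version that still accepts the camelCase method names used in this message:

```python
from pyparsing import Keyword, MatchFirst

# Hypothetical suffixes standing in for the rows returned by the db query.
suffixes = ["STREET", "ST", "LANE", "ROAD"]

# Built with the '|' operator...
by_operator = Keyword("STREET") | Keyword("ST") | Keyword("LANE") | Keyword("ROAD")

# ...and built directly from a list, with no eval or string joining.
by_list = MatchFirst([Keyword(s) for s in suffixes])

# Both expressions should return the same tokens for each input.
for text in ("LANE", "ST", "STREET"):
    print(by_operator.parseString(text).asList(),
          by_list.parseString(text).asList())
```

Because every alternative is a Keyword, "ST" will not match the front of "STREET", so the order of the list does not matter here.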
This cleans up the eval and "|".join ugliness, but I don't think this will
help your speed issue very much. Instead, here is an approach that mimics
some of the internals of oneOf, by generating a Regex for us. It's actually
similar to your eval approach, but will generate a Regex string instead.
In this case, we want all of your alternatives in a Regex, as A|B|C|D|...,
so this will look fairly familiar to you:
"|".join( x[0] for x in cursor.fetchall() )
We need the Regex to treat these as keywords, so we will surround the
alternatives with the re "word break" indicator "\b". We don't want this to
be used just for the first and last alternatives, so we'll enclose the
alternatives in non-grouping parens, (?:...). This gives us a re string of:
r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )
Now pass this as the initializer argument to create a pyparsing Regex
expression, and you should get the benefits of oneOf speed and Keyword
matching. That is:
STREET_SUFFIX = Regex(
r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )
)
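To see what that pattern string does on its own, here is a small sketch using the standard re module directly (pyparsing's Regex takes the same pattern syntax). The rows list is a hypothetical stand-in for cursor.fetchall(), and the re.escape call is my own addition, in case a suffix ever contains a regex metacharacter:

```python
import re

# Hypothetical rows standing in for the result of cursor.fetchall().
rows = [("ST",), ("STREET",), ("LANE",), ("LN",)]

# Join the alternatives, wrap them in a non-capturing group, and
# anchor the whole group with word breaks on both sides.
pattern = r"\b(?:%s)\b" % "|".join(re.escape(row[0]) for row in rows)
suffix_re = re.compile(pattern)

print(pattern)                                          # \b(?:ST|STREET|LANE|LN)\b
print(suffix_re.findall("1234 STANLEY ST"))             # ['ST']
print(suffix_re.findall("FLOWER COVERED BRIDGE LANE"))  # ['LANE']
```

Note that the "ST" at the front of "STANLEY" is not matched, since there is no word break after it.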
Your second question was how to keep the street name expression from
reading past the end of the street name and consuming the street suffix
too. This is a common issue in pyparsing grammars, and there is a canned
solution, although it may cost us some parse-time performance.
The problem is that pyparsing does not do overall pattern matching and
backtracking the way a regular expression does - instead it marches through
the input string left-to-right, successively matching sequential
expressions, testing alternatives and repetition, throwing exceptions when
mismatches occur, etc. In the following example address:
1234 FLOWER COVERED BRIDGE LANE
you want an expression for the street name that takes "FLOWER COVERED
BRIDGE", and leaves "LANE" to be the street suffix. The logic in doing this
left-to-right is "take each alphabetic word, as long as it is not a valid
suffix, and accumulate it into the street name". In pyparsing, this will
look like:
STREET_NAME = OneOrMore(~STREET_SUFFIX +
Word(alphas)).setResultsName("Street Name")
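OneOrMore takes care of the repetition, but we want it to stop when it
reaches a STREET_SUFFIX. The ~STREET_SUFFIX is a negative lookahead (a
NotAny expression): it consumes no input, and simply fails if a suffix
comes next, which ends the repetition. I'm not really sure how to make
this any more efficient.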
One other note: this construct will return the example as a list: [
'FLOWER', 'COVERED', 'BRIDGE' ]. You can merge these for yourself by adding
a parse action:
STREET_NAME.setParseAction( lambda toks : " ".join(toks) )
or use a Combine wrapper:
STREET_NAME = Combine( OneOrMore(~STREET_SUFFIX + Word(alphas)),
joinString=' ',
adjacent=False ).setResultsName("Street Name")
whichever suits your eye better - they are essentially equivalent. (I'd
probably take the parse action...)
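Putting the pieces together, a minimal end-to-end sketch might look like the following. The three-suffix list and the "Street Number" expression are assumptions for illustration only (your real suffixes come from the database), and it again assumes a pyparsing version that accepts the camelCase names:

```python
from pyparsing import Keyword, MatchFirst, OneOrMore, Word, alphas, nums

# Hypothetical suffix list; the real one would come from the database.
STREET_SUFFIX = MatchFirst(
    [Keyword(s) for s in ("STREET", "LANE", "ROAD")]
).setResultsName("Street Suffix")

# Accumulate words for as long as the next word is not a street suffix,
# then merge them into a single string with a parse action.
STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas))
STREET_NAME.setParseAction(lambda toks: " ".join(toks))
STREET_NAME = STREET_NAME.setResultsName("Street Name")

# Hypothetical overall address expression, just enough to run the example.
address = Word(nums).setResultsName("Street Number") + STREET_NAME + STREET_SUFFIX

result = address.parseString("1234 FLOWER COVERED BRIDGE LANE")
print(result.asList())  # ['1234', 'FLOWER COVERED BRIDGE', 'LANE']
```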
Another note: this will break down with any pathologically named streets,
such as LANE LANE or STREET STREET. This sounds ridiculous, but here is a
true story: my freshman year in college, I lived in a dormitory donated by
an alumnus named Hall - yep, it was named "Hall Hall".
Yet another note: it appears that NZ Post requires addresses to be all
uppercase, so you might replace alphas with your own variable, uppers =
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. This will slightly speed up some of the
internal regexes.
Lastly, your question regarding building names. I'm not exactly clear from
your description how this needs to work, but since you are only testing for
success/failure, and you want to accept things that are NOT matches of unit
or floor, it seems that you might have some luck with something like:
BUILDING_NAME = ~( VALID_UNIT | VALID_FLOOR )
Some time ago I worked on a similar address parser; I think it was
in response to a c.l.py posting. I'll add it to the examples page on the
pyparsing wiki so you can compare it with your own efforts. There are some
odd cases, such as street numbers with 1/2 in them, that might be
interesting for you to incorporate into your project.
HTH,
-- Paul