Re: [Pyparsing] Efficency of Keyword (and a couple other bits)
From: Paul M. <pa...@al...> - 2007-03-20 04:14:20
Corrin -

Address parsing is a tricky topic, and many mailing list companies spend a lot of money developing proprietary solutions. It is helpful that New Zealand has specified a standard format; let's see if we can get pyparsing to suss it out.

For your first question, here is a slightly cleaned-up version of your street suffix generator (using the results from the db select to give us the various possible street suffixes):

    cursor.execute(r'select distinct * from (select short_suffix from suffix_to_long UNION select long_suffix from suffix_to_long) as f')
    STREET_SUFFIX = MatchFirst( [ Keyword(x[0]) for x in cursor.fetchall() ] ).setResultsName("Street Suffix")

What's happening here is that, instead of using the '|' operators, we are directly constructing a MatchFirst expression. Realize that expr1 | expr2 is just a short-cut for MatchFirst( [ expr1, expr2 ] ), so all we need to do is build a list of all the Keyword expressions, and make a MatchFirst out of them. This cleans up the eval and "|".join ugliness, but I don't think this will help your speed issue very much.

Instead, here is an approach that mimics some of the internals of oneOf, by generating a Regex for us. It's actually similar to your eval approach, but will generate a Regex string instead. In this case, we want all of your alternatives in a Regex, as A|B|C|D|..., so this will look fairly familiar to you:

    "|".join( x[0] for x in cursor.fetchall() )

We need the Regex to treat these as keywords, so we will surround the alternatives with the re "word break" indicator "\b". We don't want this to be used just for the first and last alternatives, so we'll enclose the alternatives in non-grouping parens, (?:...). This gives us a re string of:

    r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

Now pass this as the initializer argument to create a pyparsing Regex expression, and you should get the benefits of oneOf speed and Keyword matching.
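
To see why the non-grouping parens matter, here is a small sketch using only Python's re module. The suffix list is made up and simply stands in for the rows returned by cursor.fetchall(); the re.escape call is an extra precaution that is not part of the approach described above.

```python
import re

# Made-up suffixes standing in for the rows returned by cursor.fetchall().
suffixes = ["ST", "STREET", "LANE", "LN"]

# re.escape is an added precaution (an assumption, not in the original
# post) in case a suffix ever contains a regex metacharacter.
alternatives = "|".join(re.escape(s) for s in suffixes)

# Without the group, \b binds only to the first and last alternatives:
#   \bST | STREET | LANE | LN\b
ungrouped = re.compile(r"\b%s\b" % alternatives)

# With the non-grouping parens, \b applies to every alternative:
#   \b(?:ST|STREET|LANE|LN)\b
grouped = re.compile(r"\b(?:%s)\b" % alternatives)

# Ungrouped: "STREET" is free to match in the middle of a longer word.
print(bool(ungrouped.search("BACKSTREETS")))  # True
# Grouped: only whole words qualify, so nothing matches here.
print(bool(grouped.search("BACKSTREETS")))    # False

# Whole-word suffixes still match, including "STREET" even though the
# shorter "ST" is listed first (the trailing \b forces backtracking).
print(grouped.match("STREET").group())        # STREET
```

Because the trailing \b rejects a partial match such as "ST" inside "STREET", the order of the alternatives does not matter here.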
In pyparsing terms, that is:

    STREET_SUFFIX = Regex( r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() ) )

For your second question: how do we keep street names from reading past the end of the street name and consuming the street suffix too? Again, this is really a common issue in pyparsing grammars - there is a canned solution, although this may cost us some parse-time performance.

The problem is that pyparsing does not do overall pattern matching and backtracking the way a regular expression does - instead it marches through the input string left-to-right, successively matching sequential expressions, testing alternatives and repetition, throwing exceptions when mismatches occur, etc. In the following example address:

    1234 FLOWER COVERED BRIDGE LANE

you want an expression for the street name that takes "FLOWER COVERED BRIDGE", and leaves "LANE" to be the street suffix. The logic in doing this left-to-right is "take each alphabetic word, as long as it is not a valid suffix, and accumulate it into the street name". In pyparsing, this will look like:

    STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas)).setResultsName("Street Name")

OneOrMore takes care of the repetition, but we want it to stop when it reaches a STREET_SUFFIX. That is the job of the leading ~STREET_SUFFIX: it is a negative lookahead that consumes no input, and it makes the repetition fail (and so stop) as soon as the next word is a suffix. I'm not really sure how to make this any more efficient.

One other note: this construct will return the example as a list: [ 'FLOWER', 'COVERED', 'BRIDGE' ]. You can merge these for yourself by adding a parse action:

    STREET_NAME.setParseAction( lambda toks : " ".join(toks) )

or use a Combine wrapper:

    STREET_NAME = Combine( OneOrMore(~STREET_SUFFIX + Word(alphas)), joinString=' ', adjacent=False ).setResultsName("Street Name")

whichever suits your eye better - they are essentially equivalent. (I'd probably take the parse action...)

Another note: this will break down with any pathologically named streets, such as LANE LANE or STREET STREET.
This sounds ridiculous, but here is a true story: my freshman year in college, I lived in a dormitory donated by an alumnus named Hall - yep, it was named "Hall Hall".

Yet another note: it appears that the NZ Post requires addresses to be all uppercase, so you might change usage of alphas to your own variable:

    uppers = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

This will slightly speed up some of the internal regexes.

Lastly, your question regarding building names. I'm not exactly clear from your description how this needs to work, but since you are only testing for success/failure, and you want to accept things that are NOT matches of unit or floor, it seems that you might have some luck with something like:

    BUILDING_NAME = ~( VALID_UNIT | VALID_FLOOR )

Some time in the past, I worked on a similar address parser; I think it was in response to a c.l.py posting. I'll add it to the examples page on the pyparsing wiki so you can compare it with your own efforts. There are some odd cases, such as street numbers with 1/2 in them, that might be interesting for you to incorporate into your project.

HTH,
-- Paul
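
For reference, here is a minimal runnable sketch that puts the suffix Regex and the street-name lookahead together. The suffix list is made up (the real one comes from the database query), and the ADDRESS expression and its result names are assumptions added only for illustration. It uses the legacy camelCase method names seen in this thread (setParseAction, parseString), which current pyparsing releases still accept.

```python
from pyparsing import OneOrMore, ParseException, Regex, Word, alphas, nums

# Made-up suffix list; in the original post these come from a database query.
SUFFIXES = ["LANE", "LN", "STREET", "ST", "ROAD", "RD"]

# Keyword-like matching of any suffix, built as a single regular expression.
STREET_SUFFIX = Regex(r"\b(?:%s)\b" % "|".join(SUFFIXES))

# Take each alphabetic word as long as it is not a valid suffix.
# ~STREET_SUFFIX is a negative lookahead and consumes no input.
STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas))

# Merge the separate words back into a single string.
STREET_NAME.setParseAction(lambda toks: " ".join(toks))

# Hypothetical overall expression, just enough to exercise the pieces above.
ADDRESS = Word(nums)("number") + STREET_NAME("name") + STREET_SUFFIX("suffix")

result = ADDRESS.parseString("1234 FLOWER COVERED BRIDGE LANE")
print(result.asList())  # ['1234', 'FLOWER COVERED BRIDGE', 'LANE']

# A pathological street name defeats the lookahead: the first LANE is
# rejected as a street name word, so no street name can be matched.
try:
    ADDRESS.parseString("12 LANE LANE")
    pathological_ok = True
except ParseException:
    pathological_ok = False
print(pathological_ok)  # False
```

The second call shows the LANE LANE limitation mentioned above: the lookahead cannot tell a street that is named like a suffix from the suffix itself.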