Thread: [Pyparsing] Efficiency of Keyword (and a couple other bits)
Brought to you by:
ptmcg
From: Corrin L. <Cor...@da...> - 2007-03-19 20:07:10
Hi,

I have pyparsing working fairly well, but it is going extremely slowly, so I'd like to know how to make it faster. (Current performance is roughly one line per second.)

Problem: I'm testing addresses to ensure they are correctly formatted. New Zealand Post has recently moved to computerised delivery. That means all business mail must be perfectly formed according to a grammar or else it doesn't get delivered. I implemented the grammar in pyparsing initially using rules like:

    STREET_DIRECTION = oneOf("NORTH EAST WEST SOUTH N E W S").setResultsName("Street Direction")
    STREET_SUFFIX = oneOf("STREET ROAD LANE ... ST RD LN ...").setResultsName("Street Suffix")
    STREET_NUMBER = Word(nums).setResultsName("Street Number")
    STREET_ALPHA = Optional(Word(alphas, exact=1)).setResultsName("Street Alpha")
    STREET_NAME = OneOrMore(Word(alphas)).setResultsName("Street Name")
    STREET = STREET_NUMBER + STREET_ALPHA + STREET_NAME + STREET_SUFFIX + STREET_DIRECTION + FollowedBy(FieldBreak)

However, I ran into a few problems. Firstly, the use of oneOf (which builds on Literal) meant that ST or RD appearing anywhere in an address matched a suffix, even when it was part of a larger word. I solved that problem by replacing oneOf with Keywords separated by bars, as in:

    cursor.execute(r'select distinct * from (select short_suffix from suffix_to_long UNION select long_suffix from suffix_to_long) as f')
    valid_suffix_str = "|".join(['Keyword("' + x[0] + '")' for x in cursor.fetchall()])
    STREET_SUFFIX = eval(valid_suffix_str).setResultsName("Street Suffix")

Disgusting, huh? But I couldn't find anything else that worked accurately :-(. So I guess my first question is: is there any better way of doing this? Or of speeding it up (because in comparison to oneOf, it is _really_ slow, even with enablePackrat)?

The second problem I ran into was that the parser was too greedy.
STREET_NAME did its best to suck up the STREET_SUFFIX without passing it over. I got around that by replacing the definition of STREET_NAME with a SkipTo(STREET_SUFFIX), but it is still greedier than necessary. I have these nice clear FieldBreaks that split up the address, but my pyparsing grammar does not take advantage of them for efficiency; e.g. there is no point looking for a street name that spans them. I just couldn't find any efficient way of forcing this STREET line to be locked to a single field.

I also ran into an intermittent problem with pyparsing's backtracking, where the city would be parsed as a suburb successfully, but the whole address would be rejected because the 'suburb' was not followed by a postcode. Pyparsing would correctly backtrack and find the correct parse, but the setResultsName("Suburb Name") resulted in both the suburb and the city being set to the same thing! (Un)fortunately I have changed the code since, and the current version does not exhibit this behaviour.

The last problem I ran into is with building names. A building name is defined as any string that is not a valid unit or a valid floor. E.g. "HARBOUR APARTMENTS" is a valid building name, as is "23 THE TERRACE", but "FLOOR 2" or "SUITE A" isn't. The only way I found to implement that was to create two instances of the parser and call the second instance inside setParseAction, but that's really slow too. I guess it's having to build a whole ParseResults when all I'm interested in is success or failure.

So, any ideas or suggestions welcomed, especially with respect to the Keyword issue.

Corrin Lakeland
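[Editor's note: the sketch below is not part of the original thread. It shows one way the eval()-based Keyword alternation, the greedy STREET_NAME, and the building-name check might be expressed directly in pyparsing. The suffix list, FLOOR/UNIT definitions, and test addresses are made-up samples; the FieldBreak handling from the original grammar is omitted.]

```python
from pyparsing import Keyword, MatchFirst, OneOrMore, Word, alphas, nums

# Hypothetical sample; in the original thread these come from a SQL query.
suffixes = ["STREET", "ROAD", "LANE", "ST", "RD", "LN"]

# MatchFirst builds the same "Keyword(...) | Keyword(...)" alternation the
# eval() trick produced, but without generating and evaluating source text.
STREET_SUFFIX = MatchFirst(
    [Keyword(s) for s in sorted(suffixes, key=len, reverse=True)]
).setResultsName("Street Suffix")

# A street-name word is any word that is NOT a suffix; the negative
# lookahead (~, i.e. NotAny) stops OneOrMore from swallowing the suffix.
STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas)).setResultsName("Street Name")

STREET_NUMBER = Word(nums).setResultsName("Street Number")
STREET = STREET_NUMBER + STREET_NAME + STREET_SUFFIX

result = STREET.parseString("23 MAIN STREET")
print(result["Street Suffix"])   # STREET

# The same ~ idea can replace the second parser instance used to reject
# "FLOOR 2" / "SUITE A" as building names, without a full second parse.
FLOOR = Keyword("FLOOR") + Word(nums)          # simplified, illustrative only
UNIT = Keyword("SUITE") + Word(alphas, exact=1)
BUILDING_NAME = ~(FLOOR | UNIT) + OneOrMore(Word(alphas + nums))

print(BUILDING_NAME.parseString("HARBOUR APARTMENTS").asList())
```

Because Keyword already enforces word boundaries, the length-sorting of the suffix list is belt-and-braces rather than strictly required here.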
From: Eike W. <eik...@gm...> - 2007-03-19 22:44:53
Hello Corrin!

On Monday 19 March 2007 21:06, Corrin Lakeland wrote:
> So, any ideas or suggestions welcomed, especially with respect to
> the Keyword issue.

There is the 'Keyword' parser; it probably does what you want. Usage:

    mathFuncs = Keyword('sin') | Keyword('cos') | Keyword('tan')

I use code similar to this in my toy language.

Regards,
Eike.
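[Editor's note: a quick demonstration, not from the thread, of the boundary behaviour that makes Keyword the right tool here: Literal matches its text anywhere, while Keyword requires the match to be a whole word.]

```python
from pyparsing import Keyword, Literal

# Literal("ST") happily matches the first two letters of "STREET" ...
lit = Literal("ST")
print(lit.parseString("STREET").asList())   # ['ST']

# ... but Keyword("ST") requires a word boundary after the match,
# so it fails on "STREET" and succeeds on a standalone "ST".
kw = Keyword("ST")
print(kw.parseString("ST").asList())        # ['ST']

try:
    kw.parseString("STREET")
except Exception as err:
    print("no match:", err)
```

This is exactly why the oneOf-based STREET_SUFFIX matched ST and RD inside longer words, and the Keyword-based version does not.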