Thread: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)
From: Corrin L. <Cor...@da...> - 2007-03-19 22:55:18
Hi Eike,

Yes, I'm using Keyword("STREET") | Keyword("ROAD"). The SQL pulls a list of street types from the database, the Keyword(" + x + ") turns the list into keywords, and the eval creates a pyparsing rule for it. Not the easiest code to read, by a long way :( The problem is that five hundred strings separated by | is much less efficient than Word(alphas).

-----Original Message-----
From: pyp...@li... [mailto:pyp...@li...] On Behalf Of Eike Welk
Sent: Tuesday, March 20, 2007 11:45 AM
To: pyp...@li...
Subject: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)

Hello Corrin!

On Monday 19 March 2007 21:06, Corrin Lakeland wrote:
> So, any ideas or suggestions welcomed, especially with respect to the
> Keyword issue.

There is the 'Keyword' parser; it probably does what you want. Usage:

    mathFuncs = Keyword('sin') | Keyword('cos') | Keyword('tan')

I use code similar to this in my toy language.

Regards, Eike.
From: Paul M. <pa...@al...> - 2007-03-20 04:14:20
Corrin -

Address parsing is a tricky topic, and many mailing list companies spend a lot of money developing proprietary solutions. It is helpful that New Zealand has specified a standard format; let's see if we can get pyparsing to suss it out.

For your first question, here is a slightly cleaned-up version of your street suffix generator (using the results from the db select to give us the various possible street suffixes):

    cursor.execute(r'select distinct * from (select short_suffix from suffix_to_long UNION select long_suffix from suffix_to_long) as f')
    STREET_SUFFIX = MatchFirst(
        [ Keyword(x[0]) for x in cursor.fetchall() ]
        ).setResultsName("Street Suffix")

What's happening here is that, instead of using the '|' operators, we are directly constructing a MatchFirst expression. Realize that expr1 | expr2 is just a short-cut for MatchFirst( [ expr1, expr2 ] ), so all we need to do is build a list of all the Keyword expressions and make a MatchFirst out of them. This cleans up the eval and "|".join ugliness, but I don't think it will help your speed issue very much.

Instead, here is an approach that mimics some of the internals of oneOf, by generating a Regex for us. It's actually similar to your eval approach, but will generate a Regex string instead. In this case, we want all of your alternatives in a Regex, as A|B|C|D|..., so this will look fairly familiar to you:

    "|".join( x[0] for x in cursor.fetchall() )

We need the Regex to treat these as keywords, so we will surround the alternatives with the re "word break" indicator "\b". We don't want this applied only to the first and last alternatives, so we'll enclose the alternatives in non-grouping parens, (?:...). This gives us an re string of:

    r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

Now pass this as the initializer argument to create a pyparsing Regex expression, and you should get the benefits of oneOf speed and Keyword matching.
That is:

    STREET_SUFFIX = Regex( r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() ) )

For your second question: how to get street names not to read past the end of the street name and consume the street suffix too? Again, this is a really common issue in pyparsing grammars - there is a canned solution, although it may cost us some parse-time performance. The problem is that pyparsing does not do overall pattern matching and backtracking the way a regular expression does - instead it marches through the input string left-to-right, successively matching sequential expressions, testing alternatives and repetition, throwing exceptions when mismatches occur, etc. In the following example address:

    1234 FLOWER COVERED BRIDGE LANE

you want an expression for the street name that takes "FLOWER COVERED BRIDGE" and leaves "LANE" to be the street suffix. The logic for doing this left-to-right is "take each alphabetic word, as long as it is not a valid suffix, and accumulate it into the street name". In pyparsing, this looks like:

    STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas)).setResultsName("Street Name")

OneOrMore takes care of the repetition, but we want it to stop when it reaches a STREET_SUFFIX. I'm not really sure how to make this any more efficient.

One other note: this construct will return the example as a list: [ 'FLOWER', 'COVERED', 'BRIDGE' ]. You can merge these yourself by adding a parse action:

    STREET_NAME.setParseAction( lambda toks : " ".join(toks) )

or use a Combine wrapper:

    STREET_NAME = Combine( OneOrMore(~STREET_SUFFIX + Word(alphas)),
                           joinString=' ', adjacent=False ).setResultsName("Street Name")

whichever suits your eye better - they are essentially equivalent. (I'd probably take the parse action...)

Another note: this will break down with any pathologically named streets, such as LANE LANE or STREET STREET.
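[Editor's note: for readers without pyparsing at hand, the stop-at-suffix idea above can be sketched in plain Python with the stdlib re module. The suffix set here is a tiny hypothetical stand-in for the 500 database values, not the real NZ Post table.]

```python
import re

# Hypothetical stand-in for the 500 suffixes pulled from the database.
SUFFIXES = {"LANE", "STREET", "ROAD", "DRIVE"}

WORD = re.compile(r"[A-Za-z]+")

def split_street_name(words):
    """Accumulate alphabetic words until the first valid suffix,
    mirroring OneOrMore(~STREET_SUFFIX + Word(alphas))."""
    name = []
    for w in words:
        if w in SUFFIXES or not WORD.fullmatch(w):
            break
        name.append(w)
    return " ".join(name)

print(split_street_name("FLOWER COVERED BRIDGE LANE".split()))
# -> FLOWER COVERED BRIDGE
```

Note that the LANE LANE breakdown Paul mentions shows up here too: the very first word is a valid suffix, so the accumulated street name comes back empty.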
This sounds ridiculous, but here is a true story: my freshman year in college, I lived in a dormitory donated by an alumnus named Hall - yep, it was named "Hall Hall".

Yet another note: it appears that NZ Post requires addresses to be all uppercase, so you might change uses of alphas to your own variable uppers = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. This will slightly speed up some of the internal regexes.

Lastly, your question regarding building names. I'm not exactly clear from your description how this needs to work, but since you are only testing for success/failure, and you want to accept things that are NOT matches of unit or floor, it seems you might have some luck with something like:

    BUILDING_NAME = ~( VALID_UNIT | VALID_FLOOR )

Some time in the past I worked on a similar address parser; I think it was in response to a c.l.py posting. I'll add it to the examples page on the pyparsing wiki so you can compare it with your own efforts. There are some odd cases, such as street numbers with 1/2 in them, that might be interesting for you to incorporate into your project.

HTH,
-- Paul
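[Editor's note: Paul's word-boundary trick can be tried standalone. Here the database rows are faked with a hardcoded list of tuples, mimicking the shape of cursor.fetchall(); the suffixes are examples, not the full table.]

```python
import re

# Fake cursor.fetchall() result - each row is a one-column tuple.
rows = [("ST",), ("STREET",), ("RD",), ("ROAD",)]

# \b on each side makes every alternative behave like a pyparsing Keyword:
# "STREET" matches as a whole word, but not inside "STREETWISE".
street_suffix = re.compile(r"\b(?:%s)\b" % "|".join(x[0] for x in rows))

print(street_suffix.search("1234 FLOWER COVERED BRIDGE STREET").group())  # STREET
print(street_suffix.search("12 STREETWISE AVE"))                          # None
```

One nicety of the \b approach: even though the short form "ST" appears before "STREET" in the alternation, "FOO STREET" still matches the full word, because "ST" followed by a word character fails the trailing \b and the engine moves on to the next alternative.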
From: Ralph C. <ra...@in...> - 2007-03-20 13:23:08
Hi Corrin,

Paul wrote:
> We need the Regex to treat these as keywords, so we will surround the
> alternatives with the re "word break" indicator "\b". We don't want
> this to be used just for the first and last alternatives, so we'll
> enclose the alternatives in non-grouping parens, (?:...). This gives
> us a re string of:
>
> r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

I don't know if the re module optimises r'\b(?:apple|banana|cherry)\b' where each of the alternatives is a literal. If not, then there are gains to be made by concocting a better regexp, by noting that the matching tries each alternative in turn, failing on the first character that doesn't match the current alternative. If you've 500 alternatives, this is on average going to test 250 of them, assuming they occur with equal frequency.

However, say the alternatives are

    balloon beech apply chair apple about boat beach big

We sort these to see their pattern better:

    about apple apply balloon beach beech big boat chair

We can make the re engine see if the first character is an r'a' and, if not, skip all three words starting with r'a'. Similarly with r'b' and r'c'.

    pat = re.compile(r'''\b(?:
        a(?:bout|pple|pply)|
        b(?:alloon|each|eech|ig|oat)|
        c(?:hair)
        )\b''', re.VERBOSE)

For simplicity in constructing the regexp you may want to use a positive lookahead instead, to avoid chopping the first character from every word:

    pat = re.compile(r'''\b(?:
        (?=a)(?:about|apple|apply)|
        (?=b)(?:balloon|beach|beech|big|boat)|
        (?=c)(?:chair)
        )\b''', re.VERBOSE)

If all the words occur with equal frequency then the odds are that a word will start with r'b', so we should place that branch first, then the r'a' branch, then r'c':

    pat = re.compile(r'''\b(?:
        b(?:alloon|each|eech|ig|oat)|
        a(?:bout|pple|pply)|
        c(?:hair)
        )\b''', re.VERBOSE)

It could be that you've real data to sample and know that r'chair' occurs 70% of the time, so you'd place it first, despite it being a one-word branch.

If you've 50,000 words instead of 500, you can take this approach for the first two characters, e.g.

    pat = re.compile(r'''\b(?:
        a(?:b(?:out)|p(?:ple|ply))|
        b(?:a(?:lloon)|e(?:ach|ech)|i(?:g)|o(?:at))|
        c(?:h(?:air))
        )\b''', re.VERBOSE)

What we're doing is giving the re engine a prefix tree.

    http://en.wikipedia.org/wiki/Prefix_tree

Be careful to cope with one- and two-letter words properly, and don't optimise prematurely, but I'm assuming you're going to run this on hordes of data. Even so, it's worth using the timeit module to check your assumptions on real data; who knows, perhaps the re module does optimise this case these days!

Have you a link to the NZ address format?

Cheers, Ralph.
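[Editor's note: Ralph's hand-written prefix tree can be generated mechanically from a word list. This is a sketch of one way to do it - the grouping strategy and helper names are the editor's, not from the thread.]

```python
import re

def trie_regex(words):
    """Compile a \\b-delimited regex laid out as a prefix tree, so the
    re engine abandons a whole branch after one mismatched character."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def emit(node):
        end = "" in node
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch]
        if not alts:
            return ""                        # leaf: the word stops here
        if end:
            return "(?:%s)?" % "|".join(alts)  # word may stop or continue
        if len(alts) == 1:
            return alts[0]
        return "(?:%s)" % "|".join(alts)

    return re.compile(r"\b%s\b" % emit(trie))

pat = trie_regex("about apple apply balloon beach beech big boat chair".split())
print(pat.pattern)
```

For Ralph's nine example words this produces a pattern equivalent to his hand-built one; with real data you would feed it the 500 suffixes from the database instead.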
From: Eike W. <eik...@gm...> - 2007-03-20 16:39:57
Hello Paul!

On Tuesday 20 March 2007 05:14, Paul McGuire wrote:
> For your second question, how to get street names to not read past
> the end of the street name and consume the street suffix too?
> Again, this is really a common issue in pyparsing grammars - there
> is a canned solution, although this may cost us some parse-time
> performance.

You should turn that part into a FAQ: 'My Parser Is Too Greedy'. The first part of your very nicely worked out email could maybe go into the 'Tips' section of pyparsing's web presence: 'How to Match 5000 Keywords'.

Slightly OT: the very helpful 'operatorGrammar' is not mentioned on the 'usage notes' (HowToUsePyparsing) page. I think you should copy the informative passage from the 'news' page to the 'usage notes'.

Kind regards, Eike.
From: Corrin L. <Cor...@da...> - 2007-03-20 20:26:08
Here you go:

    www.nzpost.co.nz/NZPost/Images/addressing.nzpost/pdfs/AddressStandards.pdf

It is very long and tiresome, though you can probably get away with just Chapter 4 and Appendix A. Since I'm only doing validation, I don't have to worry about any imperfect input, which helps simplify things a lot.

1) The unit line is a mess:

    UNIT = Optional((UNIT_TYPE + UNIT_IDENTIFIER)) + Optional(FLOOR) + Optional(BUILDING_NAME)

That's made trickier since, if none of the elements is present, the whole line is skipped, and BUILDING_NAME is defined pretty much as .* Also, the building name may go on a separate line, as in

    ADDRESS = REST
            | (UNIT + SEPERATOR + REST)
            | ((Optional(UNIT_TYPE + UNIT_IDENTIFIER) + Optional(FLOOR)) + SEPERATOR + BUILDING_NAME + SEPERATOR + REST)

2) The street name really complicates processing a street line. It starts with an optional UNIT_IDENTIFIER followed by a slash (2/22 Foo St means FLAT 2, 22 Foo St). A few streets don't have a street suffix, and annoyingly often have a street suffix in the name (The Terrace is the most well known). A few streets have a street direction at the end of the street name (e.g. North, Upper, Extension). Fortunately, street suffix and street direction are disjoint. So, if I were using a hypothetical perfect parser generator, I could write it like this (skipping setResultsName):

    UNIT_IDENTIFIER  = Word(alphanums)
    STREET_NUMBER    = Word(nums)
    STREET_ALPHA     = alphas
    STREET_NAME      = OneOrMore(Word(alphas))
    LONG_SUFFIX      = "STREET" | "ROAD" | "DRIVE" | ...
    SHORT_SUFFIX     = "ST" | "RD" | "DR" | ...
    STREET_SUFFIX    = LONG_SUFFIX | SHORT_SUFFIX
    STREET_DIRECTION = "NORTH" | "N" | "EAST" | "E" | "EXTENSION" | "EXT" | "WEST" | "W"
    STREET_LEFTPART  = Optional(UNIT_IDENTIFIER + "/") + STREET_NUMBER + Optional(STREET_ALPHA)
    STREET_NORMAL    = STREET_LEFTPART + STREET_NAME + Optional(STREET_SUFFIX)
    HIGHWAY_NO       = Word(alphanums)
    STREET_SH        = STREET_LEFTPART + ("SH"|"STATE HIGHWAY") + HIGHWAY_NO + Optional("SH"|"STATE HIGHWAY")
    STREET           = STREET_NORMAL | STREET_SH

Apart from the crazy cases of "THE TERRACE", which I handle by a whole separate rule, the interesting part here is that ambiguity is best resolved right to left. Looking leftmost, an address could start with a number, but that means either a street number or a unit number - we don't know until we look for the slash. For the street name, we don't know for sure that it has ended until we see the end of the field. At that point we can consider whether the thing before the end of the field is a street direction, a street suffix, or part of the street name.

The right-to-left thing also applies at a global scale. Seeing "SUITE" at the start of an address could be the beginning of the unit or of a building for a rural address, or of the unit or a building for an urban address! It isn't until we get to the suburb that we know whether we're processing an urban or a rural address (as rural addresses have a suburb of RD <number>). However, looking rightmost, we can only see a postcode or a country. I considered reversing the entire input and parsing the whole lot backwards, but that felt inelegant.

However, your prefix tree suggestion has given me an idea: I'm going to calculate the frequency of each different street suffix and direction, and add that information to the list of each suffix, using 'order by' in the select statement.
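[Editor's note: the right-to-left disambiguation Corrin describes is what a backtracking regex does naturally once the suffix is anchored at the end of the field. A much-simplified sketch - the suffix list and field layout are hypothetical, nowhere near the full NZ Post rules:]

```python
import re

SUFFIX = r"STREET|ST|ROAD|RD|DRIVE|DR"

# Optional "unit/", street number with optional alpha, lazy street name,
# and a suffix that only matches when anchored at the end of the line.
street = re.compile(
    r"^(?:(?P<unit>[A-Z0-9]+)/)?"        # "2/" in 2/22 Foo St => FLAT 2
    r"(?P<number>\d+)(?P<alpha>[A-Z])?\s+"
    r"(?P<name>[A-Z ]+?)"
    r"(?:\s+(?P<suffix>%s))?$" % SUFFIX)

m = street.match("2/22 FOO ST")
print(m.group("unit"), m.group("number"), m.group("name"), m.group("suffix"))
# -> 2 22 FOO ST
```

Because the name group is lazy and the suffix is tied to $, "22 THE TERRACE" parses with no suffix and the whole of "THE TERRACE" as the name, as long as TERRACE is kept out of the suffix list - the "crazy case" Corrin handles with a separate rule.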
From: Corrin L. <Cor...@da...> - 2007-03-20 22:53:21
Just a brief *wow*... I tried the old method but with the keywords sorted by frequency, so RD came first, then ST, etc. It made about a 10% improvement in speed. Then I tried the Regex method, and it made a 200% improvement in speed! Pretty amazing to see :)

Corrin
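[Editor's note: the exact speed-ups reported here will vary by machine and data, but the comparison is easy to reproduce with the stdlib timeit module, as Ralph suggests. A sketch with 500 made-up alternatives standing in for the real suffix table:]

```python
import re
import timeit

# 500 made-up alternatives; only "LANE" appears in the test address.
suffixes = ["SFX%03d" % i for i in range(499)] + ["LANE"]
pattern = re.compile(r"\b(?:%s)\b" % "|".join(suffixes))

def keyword_scan(text):
    """Mimic trying 500 Keyword alternatives in turn, in pure Python."""
    for word in text.split():
        for s in suffixes:
            if word == s:
                return s
    return None

text = "1234 FLOWER COVERED BRIDGE LANE"
assert keyword_scan(text) == pattern.search(text).group() == "LANE"

loop_time = timeit.timeit(lambda: keyword_scan(text), number=1000)
re_time = timeit.timeit(lambda: pattern.search(text), number=1000)
print("pure-Python scan: %.4fs, compiled regex: %.4fs" % (loop_time, re_time))
```

The absolute numbers are not meaningful across machines; the point is that both approaches can be timed against the same inputs before deciding which to keep.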
From: Corrin L. <Cor...@da...> - 2007-03-22 19:58:01
Slightly non-scientific, since I didn't adjust for varying loads on the machine or disk caches, but running with and without the sort both gave 85.6 processing time. Looks like the key was choosing to use the re module, and the sorting didn't help. Interesting :)

Corrin

PS: matching the building name using ~(UNIT_TYPE) didn't work, since that didn't give me the option to call .setResultsName. What I've got at the moment is:

    BUILDING = OneOrMore(Word(nameletters)).setParseAction(rejectBuildingName).setResultsName("BuildingName")

And the parse action is:

    def rejectBuildingName(string, loc, tokens):
        """ Prevent building name of LWR GROUND and similar """
        building_name = ""
        for token in tokens:
            if token == self.SEPCHAR:
                if debug:
                    print "Rejected the building name " + building_name
                raise ParseException(string, loc,
                        "found a field separator in the building")
            if building_name != "":
                building_name += " "
            building_name += token
        if self.debug:
            print "Trying to reject the building name %s" % (building_name)
        if spare_parser != None:
            r = None
            try:
                r = spare_parser.NOT_A_BUILDING.parseString(building_name)
            except ParseException, pe:
                pass
            if r == None:
                if debug:
                    print "Looks like this building is not a floor or a unit"
            else:
                if debug:
                    print "Rejected as this building looks like a floor or a unit"
                raise ParseException(string, loc,
                        "Rejected %s as a building name - looks like a floor or a unit" % string)
        if debug:
            print "Looks like this building is okay"

    NOT_A_BUILDING = (UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE)

It feels like a very roundabout way of doing it to me, though it seems to work well enough.

-----Original Message-----
From: Ralph Corderoy [mailto:ra...@in...]
Sent: Thursday, March 22, 2007 12:11 AM
To: Corrin Lakeland
Cc: pyp...@li...
Subject: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)

Hi Corrin,

I'm glad you've got the speed-up you were after. Out of interest, how does the Regexp with the alternatives sorted by frequency compare with the Regexp with the alternatives sorted by reverse frequency? That would show whether the re module is optimising without you needing to sort.

Cheers, Ralph.
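[Editor's note: the roundabout parse action above boils down to "reject the candidate if it also parses as a unit, floor, box or bag line". With plain re, that pre-check can be a single anchored alternation; the keyword list below is hypothetical, standing in for UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE.]

```python
import re

# Hypothetical stand-ins for the UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE keywords.
NOT_A_BUILDING = re.compile(
    r"^(?:FLAT|UNIT|SUITE|FLOOR|LEVEL|LWR GROUND|PO BOX|PRIVATE BAG)\b")

def accept_building_name(candidate):
    """A building name is acceptable only if it does not also read as
    a unit, floor, box or bag line."""
    return NOT_A_BUILDING.match(candidate) is None

print(accept_building_name("TE PUNI TOWERS"))  # True
print(accept_building_name("LWR GROUND"))      # False
```

This is the same reject-if-it-parses-as-something-else logic as the parse action, just expressed as an upfront check rather than a raised ParseException.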