Re: [Pyparsing] Efficency of Keyword (and a couple other bits)
Brought to you by:
ptmcg
From: Ralph C. <ra...@in...> - 2007-03-20 13:23:08
|
Hi Corrine, Paul wrote: > We need the Regex to treat these as keywords, so we will surround the > alternatives with the re "word break" indicator "\b". We don't want > this to be used just for the first and last alternatives, so we'll > enclose the alternatives in non-grouping parens, (?:...). This gives > us a re string of: > > r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() ) I don't know if the re module optimises r'\b(?:apple|banana|cherry)\b' where each of the alternatives is a literal. If not, then there's gains to be made by concocting a better regexp by noting that the matching tries each alternative in turn, failing when on the first character that doesn't match the current alternative. If you've 500 alternatives this is on average going to test 250 alternatives assuming they occur with equal frequency. However, say the alternatives are balloon beech apply chair apple about boat beach big We sort these to see their pattern better. about apple apply balloon beach beech big boat chair We can make the re engine see if the first character is an r'a' and if not it doesn't need to attempt any of the three words starting with r'a'. Similarly with r'b' and r'c'. pat = re.compile(r'''\b(?: a(?:bout|pple|pply)| b(?:alloon|each|eech|ig|oat)| c(?:hair) )\b''', re.VERBOSE) For simplicity in constructing the regexp you may want to use a positive lookahead instead to avoid choppimng the first character from every word. pat = re.compile(r'''\b(?: (?=a)(?:about|apple|apply)| (?=b)(?:balloon|beach|beech|big|boat)| (?=c)(?:chair) )\b''', re.VERBOSE) If all the words occur with equal frequency then odds are that the word will start with r'b' so we should place that branch first. Then the r'a' branch. Then r'c'. pat = re.compile(r'''\b(?: b(?:alloon|each|eech|ig|oat)| a(?:bout|pple|pply)| c(?:hair) )\b''', re.VERBOSE) It could be that you've real data to sample and know that r'chair' occurs 70% of the time so you'd place it first, despite it being a one-word branch. If you've 50,000 words instead of 500 then you can take this approach for the first two characters, e.g. pat = re.compile(r'''\b(?: a(?:b(?:out)|p(?:ple|ply))| b(?:a(?:lloon)|e(?:ach|ech)|i(?:g)|o(?:at))| c(?:h(?:air)) )\b''', re.VERBOSE) What we're doing is giving the re engine a prefix tree. http://en.wikipedia.org/wiki/Prefix_tree Be careful to cope with one and two letter words properly and don't optimise prematurely, but I'm assuming you're going to run this on hoards of data. Even so, it's worth using the timeit module to check you assumptions on real data; who knows, perhaps the re module does optimise this case these days! Have you a link to the NZ address format? Cheers, Ralph. |