Re: [Pyparsing] Efficency of Keyword (and a couple other bits)

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Corrine,

Paul wrote:
> We need the Regex to treat these as keywords, so we will surround the
> alternatives with the re "word break" indicator "\b".  We don't want
> this to be used just for the first and last alternatives, so we'll
> enclose the alternatives in non-grouping parens, (?:...).  This gives
> us a re string of:
> 
>     r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

I don't know if the re module optimises r'\b(?:apple|banana|cherry)\b'
where each of the alternatives is a literal.  If not, then there's gains
to be made by concocting a better regexp by noting that the matching
tries each alternative in turn, failing when on the first character that
doesn't match the current alternative.  If you've 500 alternatives this
is on average going to test 250 alternatives assuming they occur with
equal frequency.

However, say the alternatives are

    balloon beech apply chair apple about boat beach big

We sort these to see their pattern better.

    about apple apply
    balloon beach beech big boat
    chair

We can make the re engine see if the first character is an r'a' and if
not it doesn't need to attempt any of the three words starting with
r'a'.  Similarly with r'b' and r'c'.

    pat = re.compile(r'''\b(?:
        a(?:bout|pple|pply)|
        b(?:alloon|each|eech|ig|oat)|
        c(?:hair)
        )\b''', re.VERBOSE)

For simplicity in constructing the regexp you may want to use a positive
lookahead instead to avoid choppimng the first character from every
word.

    pat = re.compile(r'''\b(?:
        (?=a)(?:about|apple|apply)|
        (?=b)(?:balloon|beach|beech|big|boat)|
        (?=c)(?:chair)
        )\b''', re.VERBOSE)

If all the words occur with equal frequency then odds are that the word
will start with r'b' so we should place that branch first.  Then the
r'a' branch.  Then r'c'.

    pat = re.compile(r'''\b(?:
        b(?:alloon|each|eech|ig|oat)|
        a(?:bout|pple|pply)|
        c(?:hair)
        )\b''', re.VERBOSE)

It could be that you've real data to sample and know that r'chair'
occurs 70% of the time so you'd place it first, despite it being a
one-word branch.

If you've 50,000 words instead of 500 then you can take this approach
for the first two characters, e.g.

    pat = re.compile(r'''\b(?:
        a(?:b(?:out)|p(?:ple|ply))|
        b(?:a(?:lloon)|e(?:ach|ech)|i(?:g)|o(?:at))|
        c(?:h(?:air))
        )\b''', re.VERBOSE)

What we're doing is giving the re engine a prefix tree.

    http://en.wikipedia.org/wiki/Prefix_tree

Be careful to cope with one and two letter words properly and don't
optimise prematurely, but I'm assuming you're going to run this on
hoards of data.  Even so, it's worth using the timeit module to check
you assumptions on real data;  who knows, perhaps the re module does
optimise this case these days!

Have you a link to the NZ address format?

Cheers,

Ralph.