Thread: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)
From: Corrin L. <Cor...@da...> - 2007-03-19 22:55:18
Hi Eike,

Yes, I'm using Keyword("STREET") | Keyword("ROAD"). The SQL pulls a list of street types from the database, the Keyword(" + x + ") turns the list into keywords, and the eval creates a pyparsing rule for it. Not the easiest code to read, by a long way :( The problem is that five hundred strings separated by | is much less efficient than Word(alphas).

-----Original Message-----
From: pyp...@li... [mailto:pyp...@li...] On Behalf Of Eike Welk
Sent: Tuesday, March 20, 2007 11:45 AM
To: pyp...@li...
Subject: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)

Hello Corrin!

On Monday 19 March 2007 21:06, Corrin Lakeland wrote:
> So, any ideas or suggestions welcomed, especially with respect to the
> Keyword issue.

There is the 'Keyword' parser; it probably does what you want. Usage:

    mathFuncs = Keyword('sin') | Keyword('cos') | Keyword('tan')

I use code similar to this in my toy language.

Regards, Eike.
From: Paul M. <pa...@al...> - 2007-03-20 04:14:20
Corrin -

Address parsing is a tricky topic, and many mailing list companies spend a lot of money developing proprietary solutions. It is helpful that New Zealand has specified a standard format; let's see if we can get pyparsing to suss it out.

For your first question, here is a slightly cleaned-up version of your street suffix generator (using the results from the db select to give us the various possible street suffixes):

    cursor.execute(r'select distinct * from (select short_suffix from suffix_to_long UNION select long_suffix from suffix_to_long) as f')
    STREET_SUFFIX = MatchFirst(
        [ Keyword(x[0]) for x in cursor.fetchall() ]
        ).setResultsName("Street Suffix")

What's happening here is that, instead of using the '|' operators, we are directly constructing a MatchFirst expression. Realize that expr1 | expr2 is just a short-cut for MatchFirst( [ expr1, expr2 ] ), so all we need to do is build a list of all the Keyword expressions and make a MatchFirst out of them. This cleans up the eval and "|".join ugliness, but I don't think it will help your speed issue very much.

Instead, here is an approach that mimics some of the internals of oneOf, by generating a Regex for us. It's actually similar to your eval approach, but will generate a Regex string instead. In this case, we want all of your alternatives in a Regex, as A|B|C|D|..., so this will look fairly familiar to you:

    "|".join( x[0] for x in cursor.fetchall() )

We need the Regex to treat these as keywords, so we will surround the alternatives with the re "word break" indicator "\b". We don't want this applied only to the first and last alternatives, so we'll enclose the alternatives in non-grouping parens, (?:...). This gives us an re string of:

    r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

Now pass this as the initializer argument to create a pyparsing Regex expression, and you should get the benefits of oneOf speed and Keyword matching.
That is:

    STREET_SUFFIX = Regex( r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() ) )

For your second question: how to get street names not to read past the end of the street name and consume the street suffix too? Again, this is a really common issue in pyparsing grammars - there is a canned solution, although it may cost us some parse-time performance. The problem is that pyparsing does not do overall pattern matching and backtracking the way a regular expression does - instead it marches through the input string left-to-right, successively matching sequential expressions, testing alternatives and repetition, throwing exceptions when mismatches occur, etc. In the following example address:

    1234 FLOWER COVERED BRIDGE LANE

you want an expression for the street name that takes "FLOWER COVERED BRIDGE" and leaves "LANE" to be the street suffix. The logic for doing this left-to-right is "take each alphabetic word, as long as it is not a valid suffix, and accumulate it into the street name". In pyparsing, this looks like:

    STREET_NAME = OneOrMore(~STREET_SUFFIX + Word(alphas)).setResultsName("Street Name")

OneOrMore takes care of the repetition, but we want it to stop when it reaches a STREET_SUFFIX. I'm not really sure how to make this any more efficient.

One other note: this construct will return the example as a list: [ 'FLOWER', 'COVERED', 'BRIDGE' ]. You can merge these yourself by adding a parse action:

    STREET_NAME.setParseAction( lambda toks : " ".join(toks) )

or use a Combine wrapper:

    STREET_NAME = Combine( OneOrMore(~STREET_SUFFIX + Word(alphas)),
                           joinString=' ', adjacent=False ).setResultsName("Street Name")

whichever suits your eye better - they are essentially equivalent. (I'd probably take the parse action...)

Another note: this will break down with any pathologically named streets, such as LANE LANE or STREET STREET.
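[Editor's note: for readers without pyparsing at hand, the stop-at-suffix idea above can be sketched in plain Python with the stdlib re module. The suffix set here is a tiny hypothetical stand-in for the 500 database values, not the real NZ Post table.]

```python
import re

# Hypothetical stand-in for the 500 suffixes pulled from the database.
SUFFIXES = {"LANE", "STREET", "ROAD", "DRIVE"}

WORD = re.compile(r"[A-Za-z]+")

def split_street_name(words):
    """Accumulate alphabetic words until the first valid suffix,
    mirroring OneOrMore(~STREET_SUFFIX + Word(alphas))."""
    name = []
    for w in words:
        if w in SUFFIXES or not WORD.fullmatch(w):
            break
        name.append(w)
    return " ".join(name)

print(split_street_name("FLOWER COVERED BRIDGE LANE".split()))
# -> FLOWER COVERED BRIDGE
```

Note that the LANE LANE breakdown Paul mentions shows up here too: the very first word is a valid suffix, so the accumulated street name comes back empty.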
This sounds ridiculous, but here is a true story: my freshman year in college, I lived in a dormitory donated by an alumnus named Hall - yep, it was named "Hall Hall".

Yet another note: it appears that NZ Post requires addresses to be all uppercase, so you might change uses of alphas to your own variable uppers = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. This will slightly speed up some of the internal regexes.

Lastly, your question regarding building names. I'm not exactly clear from your description how this needs to work, but since you are only testing for success/failure, and you want to accept things that are NOT matches of unit or floor, it seems you might have some luck with something like:

    BUILDING_NAME = ~( VALID_UNIT | VALID_FLOOR )

Some time in the past I worked on a similar address parser; I think it was in response to a c.l.py posting. I'll add it to the examples page on the pyparsing wiki so you can compare it with your own efforts. There are some odd cases, such as street numbers with 1/2 in them, that might be interesting for you to incorporate into your project.

HTH,
-- Paul
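[Editor's note: Paul's word-boundary trick can be tried standalone. Here the database rows are faked with a hardcoded list of tuples, mimicking the shape of cursor.fetchall(); the suffixes are examples, not the full table.]

```python
import re

# Fake cursor.fetchall() result - each row is a one-column tuple.
rows = [("ST",), ("STREET",), ("RD",), ("ROAD",)]

# \b on each side makes every alternative behave like a pyparsing Keyword:
# "STREET" matches as a whole word, but not inside "STREETWISE".
street_suffix = re.compile(r"\b(?:%s)\b" % "|".join(x[0] for x in rows))

print(street_suffix.search("1234 FLOWER COVERED BRIDGE STREET").group())  # STREET
print(street_suffix.search("12 STREETWISE AVE"))                          # None
```

One nicety of the \b approach: even though the short form "ST" appears before "STREET" in the alternation, "FOO STREET" still matches the full word, because "ST" followed by a word character fails the trailing \b and the engine moves on to the next alternative.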
From: Ralph C. <ra...@in...> - 2007-03-20 13:23:08
Hi Corrin,

Paul wrote:
> We need the Regex to treat these as keywords, so we will surround the
> alternatives with the re "word break" indicator "\b". We don't want
> this to be used just for the first and last alternatives, so we'll
> enclose the alternatives in non-grouping parens, (?:...). This gives
> us a re string of:
>
> r"\b(?:%s)\b" % "|".join( x[0] for x in cursor.fetchall() )

I don't know if the re module optimises r'\b(?:apple|banana|cherry)\b' where each of the alternatives is a literal. If not, then there are gains to be made by concocting a better regexp, by noting that the matching tries each alternative in turn, failing on the first character that doesn't match the current alternative. If you've 500 alternatives, this is on average going to test 250 of them, assuming they occur with equal frequency.

However, say the alternatives are

    balloon beech apply chair apple about boat beach big

We sort these to see their pattern better:

    about apple apply balloon beach beech big boat chair

We can make the re engine see if the first character is an r'a' and, if not, skip all three words starting with r'a'. Similarly with r'b' and r'c'.

    pat = re.compile(r'''\b(?:
        a(?:bout|pple|pply)|
        b(?:alloon|each|eech|ig|oat)|
        c(?:hair)
        )\b''', re.VERBOSE)

For simplicity in constructing the regexp you may want to use a positive lookahead instead, to avoid chopping the first character from every word:

    pat = re.compile(r'''\b(?:
        (?=a)(?:about|apple|apply)|
        (?=b)(?:balloon|beach|beech|big|boat)|
        (?=c)(?:chair)
        )\b''', re.VERBOSE)

If all the words occur with equal frequency then the odds are that a word will start with r'b', so we should place that branch first, then the r'a' branch, then r'c':

    pat = re.compile(r'''\b(?:
        b(?:alloon|each|eech|ig|oat)|
        a(?:bout|pple|pply)|
        c(?:hair)
        )\b''', re.VERBOSE)

It could be that you've real data to sample and know that r'chair' occurs 70% of the time, so you'd place it first, despite it being a one-word branch.

If you've 50,000 words instead of 500, you can take this approach for the first two characters, e.g.

    pat = re.compile(r'''\b(?:
        a(?:b(?:out)|p(?:ple|ply))|
        b(?:a(?:lloon)|e(?:ach|ech)|i(?:g)|o(?:at))|
        c(?:h(?:air))
        )\b''', re.VERBOSE)

What we're doing is giving the re engine a prefix tree.

    http://en.wikipedia.org/wiki/Prefix_tree

Be careful to cope with one- and two-letter words properly, and don't optimise prematurely, but I'm assuming you're going to run this on hordes of data. Even so, it's worth using the timeit module to check your assumptions on real data; who knows, perhaps the re module does optimise this case these days!

Have you a link to the NZ address format?

Cheers, Ralph.
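[Editor's note: Ralph's hand-written prefix tree can be generated mechanically from a word list. This is a sketch of one way to do it - the grouping strategy and helper names are the editor's, not from the thread.]

```python
import re

def trie_regex(words):
    """Compile a \\b-delimited regex laid out as a prefix tree, so the
    re engine abandons a whole branch after one mismatched character."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def emit(node):
        end = "" in node
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch]
        if not alts:
            return ""                        # leaf: the word stops here
        if end:
            return "(?:%s)?" % "|".join(alts)  # word may stop or continue
        if len(alts) == 1:
            return alts[0]
        return "(?:%s)" % "|".join(alts)

    return re.compile(r"\b%s\b" % emit(trie))

pat = trie_regex("about apple apply balloon beach beech big boat chair".split())
print(pat.pattern)
```

For Ralph's nine example words this produces a pattern equivalent to his hand-built one; with real data you would feed it the 500 suffixes from the database instead.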
From: Eike W. <eik...@gm...> - 2007-03-20 16:39:57
Hello Paul!

On Tuesday 20 March 2007 05:14, Paul McGuire wrote:
> For your second question, how to get street names to not read past
> the end of the street name and consume the street suffix too?
> Again, this is really a common issue in pyparsing grammars - there
> is a canned solution, although this may cost us some parse-time
> performance.

You should turn that part into a FAQ: 'My Parser Is Too Greedy'. The first part of your very nicely worked out email could maybe go into the 'Tips' section of pyparsing's web presence: 'How to Match 5000 Keywords'.

Slightly OT: the very helpful 'operatorGrammar' is not mentioned on the 'usage notes' (HowToUsePyparsing) page. I think you should copy the informative passage from the 'news' page to the 'usage notes'.

Kind regards, Eike.
From: Corrin L. <Cor...@da...> - 2007-03-20 20:26:08
Here you go:

    www.nzpost.co.nz/NZPost/Images/addressing.nzpost/pdfs/AddressStandards.pdf

It is very long and tiresome, though you can probably get away with just Chapter 4 and Appendix A. Since I'm only doing validation, I don't have to worry about any imperfect input, which helps simplify things a lot.

1) The unit line is a mess:

    UNIT = Optional((UNIT_TYPE + UNIT_IDENTIFIER)) + Optional(FLOOR) + Optional(BUILDING_NAME)

That's made trickier since, if none of the elements is present, the whole line is skipped, and BUILDING_NAME is defined pretty much as .* Also, the building name may go on a separate line, as in

    ADDRESS = REST
            | (UNIT + SEPERATOR + REST)
            | ((Optional(UNIT_TYPE + UNIT_IDENTIFIER) + Optional(FLOOR)) + SEPERATOR + BUILDING_NAME + SEPERATOR + REST)

2) The street name really complicates processing a street line. It starts with an optional UNIT_IDENTIFIER followed by a slash (2/22 Foo St means FLAT 2, 22 Foo St). A few streets don't have a street suffix, and annoyingly often have a street suffix in the name (The Terrace is the most well known). A few streets have a street direction at the end of the street name (e.g. North, Upper, Extension). Fortunately, street suffix and street direction are disjoint. So, if I were using a hypothetical perfect parser generator, I could write it like this (skipping setResultsName):

    UNIT_IDENTIFIER  = Word(alphanums)
    STREET_NUMBER    = Word(nums)
    STREET_ALPHA     = alphas
    STREET_NAME      = OneOrMore(Word(alphas))
    LONG_SUFFIX      = "STREET" | "ROAD" | "DRIVE" | ...
    SHORT_SUFFIX     = "ST" | "RD" | "DR" | ...
    STREET_SUFFIX    = LONG_SUFFIX | SHORT_SUFFIX
    STREET_DIRECTION = "NORTH" | "N" | "EAST" | "E" | "EXTENSION" | "EXT" | "WEST" | "W"
    STREET_LEFTPART  = Optional(UNIT_IDENTIFIER + "/") + STREET_NUMBER + Optional(STREET_ALPHA)
    STREET_NORMAL    = STREET_LEFTPART + STREET_NAME + Optional(STREET_SUFFIX)
    HIGHWAY_NO       = Word(alphanums)
    STREET_SH        = STREET_LEFTPART + ("SH"|"STATE HIGHWAY") + HIGHWAY_NO + Optional("SH"|"STATE HIGHWAY")
    STREET           = STREET_NORMAL | STREET_SH

Apart from the crazy cases of "THE TERRACE", which I handle by a whole separate rule, the interesting part here is that ambiguity is best resolved right to left. Looking leftmost, an address could start with a number, but that means either a street number or a unit number - we don't know until we look for the slash. For the street name, we don't know for sure that it has ended until we see the end of the field. At that point we can consider whether the thing before the end of the field is a street direction, a street suffix, or part of the street name.

The right-to-left thing also applies at a global scale. Seeing "SUITE" at the start of an address could be the beginning of the unit or of a building for a rural address, or of the unit or a building for an urban address! It isn't until we get to the suburb that we know whether we're processing an urban or a rural address (as rural addresses have a suburb of RD <number>). However, looking rightmost, we can only see a postcode or a country. I considered reversing the entire input and parsing the whole lot backwards, but that felt inelegant.

However, your prefix tree suggestion has given me an idea: I'm going to calculate the frequency of each different street suffix and direction, and add that information to the list of each suffix, using 'order by' in the select statement.
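[Editor's note: the right-to-left disambiguation Corrin describes is what a backtracking regex does naturally once the suffix is anchored at the end of the field. A much-simplified sketch - the suffix list and field layout are hypothetical, nowhere near the full NZ Post rules:]

```python
import re

SUFFIX = r"STREET|ST|ROAD|RD|DRIVE|DR"

# Optional "unit/", street number with optional alpha, lazy street name,
# and a suffix that only matches when anchored at the end of the line.
street = re.compile(
    r"^(?:(?P<unit>[A-Z0-9]+)/)?"        # "2/" in 2/22 Foo St => FLAT 2
    r"(?P<number>\d+)(?P<alpha>[A-Z])?\s+"
    r"(?P<name>[A-Z ]+?)"
    r"(?:\s+(?P<suffix>%s))?$" % SUFFIX)

m = street.match("2/22 FOO ST")
print(m.group("unit"), m.group("number"), m.group("name"), m.group("suffix"))
# -> 2 22 FOO ST
```

Because the name group is lazy and the suffix is tied to $, "22 THE TERRACE" parses with no suffix and the whole of "THE TERRACE" as the name, as long as TERRACE is kept out of the suffix list - the "crazy case" Corrin handles with a separate rule.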
From: Corrin L. <Cor...@da...> - 2007-03-20 22:53:21
Just a brief *wow*... I tried the old method but with the keywords sorted by frequency, so RD came first, then ST, etc. It made about a 10% improvement in speed. Then I tried the Regex method, and it made a 200% improvement in speed! Pretty amazing to see :)

Corrin
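[Editor's note: the exact speed-ups reported here will vary by machine and data, but the comparison is easy to reproduce with the stdlib timeit module, as Ralph suggests. A sketch with 500 made-up alternatives standing in for the real suffix table:]

```python
import re
import timeit

# 500 made-up alternatives; only "LANE" appears in the test address.
suffixes = ["SFX%03d" % i for i in range(499)] + ["LANE"]
pattern = re.compile(r"\b(?:%s)\b" % "|".join(suffixes))

def keyword_scan(text):
    """Mimic trying 500 Keyword alternatives in turn, in pure Python."""
    for word in text.split():
        for s in suffixes:
            if word == s:
                return s
    return None

text = "1234 FLOWER COVERED BRIDGE LANE"
assert keyword_scan(text) == pattern.search(text).group() == "LANE"

loop_time = timeit.timeit(lambda: keyword_scan(text), number=1000)
re_time = timeit.timeit(lambda: pattern.search(text), number=1000)
print("pure-Python scan: %.4fs, compiled regex: %.4fs" % (loop_time, re_time))
```

The absolute numbers are not meaningful across machines; the point is that both approaches can be timed against the same inputs before deciding which to keep.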
From: Corrin L. <Cor...@da...> - 2007-03-22 19:58:01
Slightly non-scientific, since I didn't adjust for varying loads on the machine or disk caches, but running with and without the sort both gave 85.6 processing time. Looks like the key was choosing to use the re module, and the sorting didn't help. Interesting :)

Corrin

PS: matching the building name using ~(UNIT_TYPE) didn't work, since that didn't give me the option to call .setResultsName. What I've got at the moment is:

    BUILDING = OneOrMore(Word(nameletters)).setParseAction(rejectBuildingName).setResultsName("BuildingName")

And the parse action is:

    def rejectBuildingName(string, loc, tokens):
        """ Prevent building name of LWR GROUND and similar """
        building_name = ""
        for token in tokens:
            if token == self.SEPCHAR:
                if debug:
                    print "Rejected the building name " + building_name
                raise ParseException(string, loc,
                        "found a field separator in the building")
            if building_name != "":
                building_name += " "
            building_name += token
        if self.debug:
            print "Trying to reject the building name %s" % (building_name)
        if spare_parser != None:
            r = None
            try:
                r = spare_parser.NOT_A_BUILDING.parseString(building_name)
            except ParseException, pe:
                pass
            if r == None:
                if debug:
                    print "Looks like this building is not a floor or a unit"
            else:
                if debug:
                    print "Rejected as this building looks like a floor or a unit"
                raise ParseException(string, loc,
                        "Rejected %s as a building name - looks like a floor or a unit" % string)
        if debug:
            print "Looks like this building is okay"

    NOT_A_BUILDING = (UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE)

It feels like a very roundabout way of doing it to me, though it seems to work well enough.

-----Original Message-----
From: Ralph Corderoy [mailto:ra...@in...]
Sent: Thursday, March 22, 2007 12:11 AM
To: Corrin Lakeland
Cc: pyp...@li...
Subject: Re: [Pyparsing] Efficiency of Keyword (and a couple other bits)

Hi Corrin,

I'm glad you've got the speed-up you were after. Out of interest, how does the Regexp with the alternatives sorted by frequency compare with the Regexp with the alternatives sorted by reverse frequency? That would show whether the re module is optimising without you needing to sort.

Cheers, Ralph.
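[Editor's note: the roundabout parse action above boils down to "reject the candidate if it also parses as a unit, floor, box or bag line". With plain re, that pre-check can be a single anchored alternation; the keyword list below is hypothetical, standing in for UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE.]

```python
import re

# Hypothetical stand-ins for the UNIT_TYPE | FLOOR | BOX_LINE | BAG_LINE keywords.
NOT_A_BUILDING = re.compile(
    r"^(?:FLAT|UNIT|SUITE|FLOOR|LEVEL|LWR GROUND|PO BOX|PRIVATE BAG)\b")

def accept_building_name(candidate):
    """A building name is acceptable only if it does not also read as
    a unit, floor, box or bag line."""
    return NOT_A_BUILDING.match(candidate) is None

print(accept_building_name("TE PUNI TOWERS"))  # True
print(accept_building_name("LWR GROUND"))      # False
```

This is the same reject-if-it-parses-as-something-else logic as the parse action, just expressed as an upfront check rather than a raised ParseException.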