Re: [Pyparsing] parsing a simple Language

Duncan, my friend, so good to hear from you again!  I'm glad pyparsing
continues to be of some use to you.  I must admit, you are the first I have
heard of to be parsing Tibetan with pyparsing.  I think I can propose a few
alternative solutions for you.

First of all, your immediate problem has to do with your use of 'max'.
'max=1' means just that, 1 AND NO MORE!  In your failing case, "sSmi", the
leading 's' is followed by 'S', another init character, which by the
definition of your init Word is not allowed; you exceeded the maximum ->
parser fail!  Fortunately, the simplest remedy is to use the 'exact'
argument instead of 'max':

init = Word('sSbB', exact=1).setName("initial")
med = Word('mMpP').setName("medial")
vow = Word('aeiou', exact=1).setName("vowel")

'exact' does not impose the same lookahead restriction that 'max' does.
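
As a rough analogy in plain 're' terms (a sketch of the behaviour described
above, not pyparsing's actual implementation), 'max=1' acts like one
character followed by a negative lookahead, while 'exact=1' is just the one
character.  The pattern names are made up for illustration:

```python
import re

# Analogy only: these patterns mimic the behaviour described above;
# they are not how pyparsing implements Word internally.
init_max = re.compile(r"[sSbB](?![sSbB])")   # roughly Word('sSbB', max=1)
init_exact = re.compile(r"[sSbB]")           # roughly Word('sSbB', exact=1)

text = "sSmi"
max_match = init_max.match(text)      # rejected: 's' is followed by 'S'
exact_match = init_exact.match(text)  # accepted: consumes only the 's'

print(max_match)
print(exact_match.group())
```

With 'exact', the parser takes the single 's' and leaves the 'S' for the
next syllable to pick up.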

If your test case is close enough to your Tibetan application, you might try
one of these other options.  You can merge your initial and medial
expressions into a single word, since what you describe is exactly the same
as the 2-argument constructor for Word.  Start by breaking out the
definition of syllable as:

syllable = Combine(
    init + ZeroOrMore(med) + Optional(vow)
    )
syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

The first two bits of your syllable can be merged into a single Word
expression:

syllable = Combine(
    Word('sSbB', 'mMpP') + Optional(vow)
    )
syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

Or if you can tolerate an even more liberal expression (which would match if
vowels were mixed in with medials, and not just added to the end):

syllable = Word('sSbB', 'mMpPaeiouAEIOU')

This will parse fairly quickly as well, since pyparsing is able to internally
convert this entire Word into the single regex "[sSbB][mMpPaeiouAEIOU]*".

If you still need the greater rigor of your original case (only a single
potential vowel at the end of the syllable, not mixed in with medials), you
might still try rolling your own Regex:

syllable = Regex(r"[sSbB][mMpP]*[aeiou]?")
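
To see the difference between the liberal and the stricter pattern, here is
a small sketch using Python's 're' module directly, without pyparsing.  The
sample string "smampi" is made up for illustration and is not real data:

```python
import re

# The two patterns discussed above, compiled with the standard re module.
liberal = re.compile(r"[sSbB][mMpPaeiouAEIOU]*")
strict = re.compile(r"[sSbB][mMpP]*[aeiou]?")

sample = "smampi"  # made-up input with a vowel between medials

# The liberal pattern swallows the whole string, vowels and medials mixed.
liberal_match = liberal.match(sample).group()

# The strict pattern stops after the first vowel.
strict_match = strict.match(sample).group()

# Scanning the original failing case splits it into two syllables.
syllables = strict.findall("sSmi")

print(liberal_match, strict_match, syllables)
```

The strict pattern yields "sma" where the liberal one yields "smampi", and
"sSmi" scans as the two syllables 's' and 'Smi'.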

I've found that for low-level tokens like words and numbers, using a Regex
really outperforms "Combine(startWithThis + (somethingElse|anotherThing) +
Optional(stillAnotherThing))"; while keeping the re's localized to just a
simple building block pretty much keeps them from getting too out-of-hand.
For instance, I've modified the fourFn.py example that ships with pyparsing
to show the old style commented out, and a still-fairly-easy-to-follow-regex
replacement:

#~ fnumber = Combine( Word( "+-"+nums, nums ) + 
                   #~ Optional( point + Optional( Word( nums ) ) ) +
                   #~ Optional( e + Word( "+-"+nums, nums ) ) )
fnumber = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
ident = Word(alphas, alphas+nums+"_$")
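
A quick way to sanity-check a float pattern of this shape is to run it
through 're' on its own.  This is only a sketch: the pattern is written here
with non-capturing groups "(?:...)", and the name fnumber_re and the sample
strings are made up for illustration:

```python
import re

# Float pattern of the same shape as fnumber, using (?:...) groups.
fnumber_re = re.compile(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")

# Made-up sample inputs, chosen only for illustration.
samples = ["42", "-3.14", "6.02e23", "1.", ".5", "abc"]

# fullmatch() requires the whole string to be a number.
results = {s: fnumber_re.fullmatch(s) is not None for s in samples}

for s, ok in results.items():
    print(repr(s), "->", ok)
```

Note that ".5" is rejected, since the pattern requires at least one digit
before the decimal point.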

If these syllabic constructs in Tibetan can be built up from single Unicode
characters, then I think all of these suggestions are still valid, even down
to the Regex idea.

I'd be very interested to see more of your Tibetan parser, as things
progress - good luck!

-- Paul