Re: [Pyparsing] parsing a simple Language


On Sun, Oct 10, 2010 at 10:56 AM, Paul McGuire <pt...@au...> wrote:
> Duncan, my friend, so good to hear from you again!  I'm glad pyparsing
> continues to be of some use to you.  I must admit, you are the first I have
> heard of to be parsing Tibetan with pyparsing.  I think I can propose a few
> alternative solutions for you.
>
> First of all, your immediate problem has to do with your use of 'max'.  'max
> = 1' means just that, 1 AND NO MORE!

Ah, I see. I had incorrectly interpreted that as "match only one
initial, and if another initial is found, start parsing that as a
new syllable."

> In your failing case, "sSmi", the
> leading 's' is followed by another 'S', which by definition of your init
> word is not allowed; you exceeded the maximum -> parser fail!  Fortunately,
> the simplest remedy is to use the 'exact' argument instead of 'max':
>
> init = Word('sSbB', exact=1).setName("initial")
> med = Word('mMpP').setName("medial")
> vow = Word('aeiou', exact=1).setName("vowel")
>
> 'exact' does not impose the same lookahead restriction that 'max' does.
>
> If your test case is close enough to your Tibetan application, you might try
> one of these other options.  You can merge your initial and medial
> expressions into a single word, since what you describe is exactly the same
> as the 2-argument constructor for Word.  Breaking out the definition of
> syllable as:
>
> syllable = Combine(
>    init + ZeroOrMore(med) + Optional(vow)
>    )
> syllables = Group(OneOrMore(syllable)).setResultsName("syllables")
>
> The first two bits of your syllable can be merged into a single Word
> expression:
>
> syllable = Combine(
>    Word('sSbB', 'mMpP') + Optional(vow)
>    )
> syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

Hrm, I tried that, but wasn't able to figure out how to get at the parsed
data for the medials. I need to be able to introspect the parsed data
in order to perform various conversion operations (at a later time). I
didn't complicate my minimal example with it, but I've got results
names set for initials, medials, and vowels.
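
A minimal sketch of one way to keep the initial, medials, and vowel
individually accessible: wrap each syllable in Group instead of Combine
and attach results names to the pieces. The names "initial", "medials",
and "vowel" and the test string are assumptions made for illustration,
not the names used in the actual Tibetan grammar.

```python
from pyparsing import Word, Group, Optional, OneOrMore

# Same toy alphabet as the minimal example in this thread.
init = Word("sSbB", exact=1).setName("initial")
med = Word("mMpP").setName("medial")
vow = Word("aeiou", exact=1).setName("vowel")

# Group keeps the pieces of each syllable as separate tokens, so the
# results names stay attached.  Caveat: unlike Combine, Group does not
# require the pieces to be adjacent, so whitespace between them is skipped.
syllable = Group(
    init("initial") + Optional(med("medials")) + Optional(vow("vowel"))
)
syllables = OneOrMore(syllable)

result = syllables.parseString("sSmibpa")

# Collect (initial, medials, vowel) per syllable; pieces that were
# absent fall back to an empty string.
parsed = [
    (syl["initial"], syl.get("medials", ""), syl.get("vowel", ""))
    for syl in result
]
print(parsed)
```

Each element of result is a nested ParseResults, so later conversion
code can look pieces up by name rather than re-splitting a combined
string.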

> Or if you can tolerate an even more liberal expression (which would match if
> vowels were mixed in with medials, and not just added to the end):
>
> syllable = Word('sSbB', 'mMpPaeiouAEIOU')
>
> This will parse fairly quickly as well, since it is able to internally
> convert this entire thing into the single regex "[sSbB][mMpPaeiouAEIOU]*".

Ah, this is a great example -- thanks! Sadly, I can't use it, since
the rules for vowels in Tibetan Unicode are strict about being at the
end.

> If you still need the more rigor of your original case (only a single
> potential vowel at the end of the syllable, not mixed in with medials), you
> might still try rolling your own Regex:
>
> syllable = Regex(r"[sSbB][mMpP]*[aeiou]?")

Oh, this is very nice. I'm going to play with this some more. Thanks!
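
A sketch of how the Regex approach could still expose the individual
pieces: pyparsing's Regex preserves named groups, written as
(?P<name>...), as results names. The group names and the test string
here are assumptions chosen for illustration.

```python
from pyparsing import Regex, Group, OneOrMore

# Same pattern as the suggested Regex, with each part in a named group.
# The medials and vowel groups match an empty string when absent, so
# every name is always present in the results.
syllable = Regex(
    r"(?P<initial>[sSbB])(?P<medials>[mMpP]*)(?P<vowel>[aeiou]?)"
).setName("syllable")

syllables = OneOrMore(Group(syllable))

result = syllables.parseString("sSmibpa")

# Each group holds the full matched text as its only token, plus the
# named pieces.
whole = [syl[0] for syl in result]
pieces = [(syl["initial"], syl["medials"], syl["vowel"]) for syl in result]
print(whole)
print(pieces)
```

This keeps the strict "vowel only at the end" rule inside one regex
while leaving the parts available for later inspection.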

> I've found that for low-level tokens like words and numbers, using a Regex
> really outperforms "Combine(startWithThis + (somethingElse|anotherThing) +
> Optional(stillAnotherThing))"; while keeping the re's localized to just a
> simple building block pretty much keeps them from getting too out-of-hand.
> For instance, I've modified the fourFn.py example that ships with pyparsing
> to show the old style commented out, and a still-fairly-easy-to-follow-regex
> replacement:
>
> #~ fnumber = Combine( Word( "+-"+nums, nums ) +
>                   #~ Optional( point + Optional( Word( nums ) ) ) +
>                   #~ Optional( e + Word( "+-"+nums, nums ) ) )
> fnumber = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
> ident = Word(alphas, alphas+nums+"_$")
>
> If these syllabic constructs in Tibetan can be built up from single Unicode
> characters, then I think all of these suggestions are still valid, even down
> to the Regex idea.
>
> I'd be very interested to see more of your Tibetan parser, as things
> progress - good luck!
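
A small sketch of a Regex-based number token in the spirit of the
fnumber example, written with non-capturing groups, (?:...), for the
optional fraction and exponent. The parse action that converts the
matched text to float and the sample inputs are additions for
illustration; they are not part of fourFn.py as quoted.

```python
from pyparsing import Regex

# Optional sign, digits, optional fraction, optional exponent.
fnumber = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?").setName("fnumber")

# Convert the matched text to a float at parse time (illustrative only).
fnumber.setParseAction(lambda tokens: float(tokens[0]))

samples = ("42", "-3.14", "6.02e23", "1.e-3")
values = [fnumber.parseString(text)[0] for text in samples]
print(values)
```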

Once I get it hammered out, I'll reply with a single-file example :-)
It's part of a library I'm creating to support advanced features in
Tibetan software, but the grammar itself should lend itself nicely to
an example.

Thanks again for your help and insights, Paul -- once again, pyparsing
shines in all of its glory :-)

d