Re: [Pyparsing] parsing a simple Language
From: Paul M. <pt...@au...> - 2010-10-10 16:57:00
Duncan, my friend, so good to hear from you again! I'm glad pyparsing continues to be of some use to you. I must admit, you are the first I have heard of to be parsing Tibetan with pyparsing.

I think I can propose a few alternative solutions for you. First of all, your immediate problem has to do with your use of 'max'. 'max = 1' means just that, 1 AND NO MORE! In your failing case, "sSmi", the leading 's' is followed by another 'S', which by the definition of your init word is not allowed; you exceeded the maximum -> parser fail!

Fortunately, the simplest remedy is to use the 'exact' argument instead of 'max':

    init = Word('sSbB', exact=1).setName("initial")
    med = Word('mMpP').setName("medial")
    vow = Word('aeiou', exact=1).setName("vowel")

'exact' does not impose the same lookahead restriction that 'max' does.

If your test case is close enough to your Tibetan application, you might try one of these other options. You can merge your initial and medial expressions into a single Word, since what you describe is exactly what the 2-argument constructor for Word does (first argument: allowed initial characters; second argument: allowed body characters). Starting from your definition of syllable, broken out as:

    syllable = Combine( init + ZeroOrMore(med) + Optional(vow) )
    syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

the first two bits of your syllable can be merged into a single Word expression:

    syllable = Combine( Word('sSbB', 'mMpP') + Optional(vow) )
    syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

Or, if you can tolerate an even more liberal expression (one that would also match if vowels were mixed in with the medials, and not just added at the end):

    syllable = Word('sSbB', 'mMpPaeiouAEIOU')

This will parse fairly quickly as well, since pyparsing is able to internally convert this entire thing into the single regex "[sSbB][mMpPaeiouAEIOU]*".
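A quick way to see what that liberal pattern accepts is to try the same regex with Python's standard `re` module. This is only a rough sketch: `split_syllables` is a made-up helper for illustration, not part of pyparsing, and unlike pyparsing it does not skip whitespace between tokens.

```python
import re

# The regex quoted above as the equivalent of Word('sSbB', 'mMpPaeiouAEIOU').
LIBERAL = re.compile(r"[sSbB][mMpPaeiouAEIOU]*")

def split_syllables(text):
    """Hypothetical helper: match syllables back to back from the start of
    text, stopping at the first position where the pattern does not match."""
    pos, found = 0, []
    while pos < len(text):
        m = LIBERAL.match(text, pos)
        if m is None:
            break
        found.append(m.group())
        pos = m.end()
    return found

print(split_syllables("sSmi"))   # ['s', 'Smi']
print(split_syllables("bampa"))  # ['bampa']
```

In the first call the 'S' is not a body character, so it ends the first syllable and starts a second one. The second call shows the cost of the liberal form: the vowel in the middle of "bampa" is accepted as part of a single syllable.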
If you still need the extra rigor of your original case (only a single potential vowel at the end of the syllable, not mixed in with the medials), you might still try rolling your own Regex:

    syllable = Regex(r"[sSbB][mMpP]*[aeiou]?")

I've found that for low-level tokens like words and numbers, using a Regex really outperforms "Combine(startWithThis + (somethingElse|anotherThing) + Optional(stillAnotherThing))", while keeping the re's localized to just a simple building block pretty much keeps them from getting too out-of-hand. For instance, I've modified the fourFn.py example that ships with pyparsing to show the old style commented out, and a still-fairly-easy-to-follow regex replacement (the optional groups are written as non-capturing "(?:...)" groups):

    #~ fnumber = Combine( Word( "+-"+nums, nums ) +
    #~                    Optional( point + Optional( Word( nums ) ) ) +
    #~                    Optional( e + Word( "+-"+nums, nums ) ) )
    fnumber = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
    ident = Word(alphas, alphas+nums+"_$")

If these syllabic constructs in Tibetan can be built up from single Unicode characters, then I think all of these suggestions are still valid, even down to the Regex idea.

I'd be very interested to see more of your Tibetan parser, as things progress - good luck!

-- Paul