Re: [Pyparsing] parsing a simple Language
From: Duncan M. <dun...@gm...> - 2010-10-10 21:50:06
On Sun, Oct 10, 2010 at 10:56 AM, Paul McGuire <pt...@au...> wrote:
> Duncan, my friend, so good to hear from you again! I'm glad pyparsing
> continues to be of some use to you. I must admit, you are the first I have
> heard of to be parsing Tibetan with pyparsing. I think I can propose a few
> alternative solutions for you.
>
> First of all, your immediate problem has to do with your use of 'max'. 'max
> = 1' means just that, 1 AND NO MORE!

Ah, I see. I had incorrectly interpreted that as "match only one initial, and
if another initial is found, start parsing that as a new syllable."

> In your failing case, "sSmi", the
> leading 's' is followed by another 'S', which by definition of your init
> word is not allowed; you exceeded the maximum -> parser fail! Fortunately,
> the simplest remedy is to use the 'exact' argument instead of 'max':
>
>     init = Word('sSbB', exact=1).setName("initial")
>     med = Word('mMpP').setName("medial")
>     vow = Word('aeiou', exact=1).setName("vowel")
>
> 'exact' does not impose the same lookahead restriction that 'max' does.
>
> If your test case is close enough to your Tibetan application, you might try
> one of these other options. You can merge your initial and medial
> expressions into a single word, since what you describe is exactly the same
> as the 2-argument constructor for Word. Breaking out the definition of
> syllable as:
>
>     syllable = Combine(
>         init + ZeroOrMore(med) + Optional(vow)
>     )
>     syllables = Group(OneOrMore(syllable)).setResultsName("syllables")
>
> The first two bits of your syllable can be merged into a single Word
> expression:
>
>     syllable = Combine(
>         Word('sSbB', 'mMpP') + Optional(vow)
>     )
>     syllables = Group(OneOrMore(syllable)).setResultsName("syllables")

Hrm, I tried that, but wasn't able to figure out how to get at the parsed
data for the medials. I need to be able to introspect the parsed data in
order to perform various conversion operations (at a later time). I didn't
complicate my minimal example with it, but I've got results names set for
initials, medials, and vowels.

> Or if you can tolerate an even more liberal expression (which would match if
> vowels were mixed in with medials, and not just added to the end):
>
>     syllable = Word('sSbB', 'mMpPaeiouAEIOU')
>
> This will parse fairly quickly as well, since it is able to internally
> convert this entire thing into the single regex "[sSbB][mMpPaeiouAEIOU]*".

Ah, this is a great example -- thanks! Sadly, I can't use it, since the
rules for vowels in Tibetan Unicode are strict about being at the end.

> If you still need the more rigor of your original case (only a single
> potential vowel at the end of the syllable, not mixed in with medials), you
> might still try rolling your own Regex:
>
>     syllable = Regex(r"[sSbB][mMpP]*[aeiou]?")

Oh, this is very nice. I'm going to play with this some more. Thanks!

> I've found that for low-level tokens like words and numbers, using a Regex
> really outperforms "Combine(startWithThis + (somethingElse|anotherThing) +
> Optional(stillAnotherThing))"; while keeping the re's localized to just a
> simple building block pretty much keeps them from getting too out-of-hand.
> For instance, I've modified the fourFn.py example that ships with pyparsing
> to show the old style commented out, and a still-fairly-easy-to-follow-regex
> replacement:
>
>     #~ fnumber = Combine( Word( "+-"+nums, nums ) +
>     #~                    Optional( point + Optional( Word( nums ) ) ) +
>     #~                    Optional( e + Word( "+-"+nums, nums ) ) )
>     fnumber = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
>     ident = Word(alphas, alphas+nums+"_$")
>
> If these syllabic constructs in Tibetan can be built up from single Unicode
> characters, then I think all of these suggestions are still valid, even down
> to the Regex idea.
>
> I'd be very interested to see more of your Tibetan parser, as things
> progress - good luck!

Once I get it hammered out, I'll reply with a single-file example :-) It's
part of a library I'm creating to support advanced features in Tibetan
software, but the grammar itself should lend itself nicely to an example.

Thanks again for your help and insights, Paul -- once again, pyparsing
shines in all of its glory :-)

d
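
For reference, here is a minimal sketch of how the single-regex suggestion quoted above could still expose the initial, medials, and vowel separately. It is not code from the thread: it uses Python's standard `re` module rather than pyparsing so that it runs without extra dependencies, and the group names (`initial`, `medials`, `vowel`) and the helper `parse_syllables` are hypothetical names chosen for illustration. As far as I know, pyparsing's `Regex` also exposes named groups as results names, so the same pattern should carry over, but treat that as an assumption to check against the pyparsing docs. Unlike pyparsing, this sketch does not skip whitespace.

```python
import re

# Toy alphabet from the thread: one initial, zero or more medials,
# and at most one vowel at the end of the syllable.
# The group names are illustrative, not taken from the original grammar.
SYLLABLE = re.compile(
    r"(?P<initial>[sSbB])(?P<medials>[mMpP]*)(?P<vowel>[aeiou]?)"
)


def parse_syllables(text):
    """Split text into syllables; return one dict of named parts per syllable."""
    parts = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE.match(text, pos)
        if match is None:
            raise ValueError(f"no syllable at position {pos}: {text[pos:]!r}")
        parts.append(match.groupdict())
        pos = match.end()
    return parts


# The failing case from the thread: a second initial starts a new syllable.
for syllable in parse_syllables("sSmi"):
    print(syllable)
```

Because the pattern always consumes exactly one initial, a second initial simply begins the next syllable, which is the behaviour originally expected from 'max=1'.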