LaTeX parsing

  • Tim Arnold

    Tim Arnold - 2005-12-06

    Hi, I'm starting a project to translate one set of LaTeX tags to a different set. I don't need a fully defined parser, so I'm using scanString to get the bits I need.

    What I don't understand is how to handle the recursion. For example:

    \subidx{stuff that may be \textbf{bold} or $math$}

    I want the complete argument of \subidx and I don't care what's in it, but I'm not getting anywhere. I'm very new to parsing (mainly started this morning), but from viewing the docs and this forum, I'm pretty sure pyparsing is exactly what I need.

    I've tried the Forward specification and I kind of understand what's going on with it, but if there's any advice you can give me I'd sure appreciate it.
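
For the narrower goal of grabbing the whole argument without interpreting its contents, one possible shortcut is pyparsing's nestedExpr helper wrapped in originalTextFor. This is only a minimal sketch, assuming a pyparsing release that provides both helpers; the sample sentence and the names subidx_cmd, braced, and sample are made up for illustration. Note that nestedExpr treats quoted strings specially by default, which may or may not suit LaTeX text.

```python
from pyparsing import Keyword, nestedExpr, originalTextFor

subidx_cmd = Keyword(r'\subidx')

# nestedExpr matches a balanced {...} group, however deeply it nests;
# originalTextFor hands back the matched source text instead of parsed tokens.
braced = originalTextFor(nestedExpr('{', '}'))
subidx = subidx_cmd.suppress() + braced

sample = r'See \subidx{stuff that may be \textbf{bold} or $math$} here.'

# Each match is one string that still has its outer braces; strip them.
args = [tokens[0][1:-1] for tokens, start, end in subidx.scanString(sample)]
print(args)
```

Because the argument comes back as one untouched string, nothing inside it has to be described by the grammar.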


    • Tim Arnold

      Tim Arnold - 2005-12-07

      following up my own post after playing a little.
      Here's my code, and it *seems* ok, but if it looks weird or wrong somehow, please let me know.
      from pyparsing import *

      sbji   = Keyword('\\sbji')
      bgroup = Literal('{')
      egroup = Literal('}')
      arg    = Forward()
      arg    << bgroup + OneOrMore(Word(alphas+'\\')) + egroup
      subidx = (sbji +
                bgroup.suppress() +
                OneOrMore(Word(alphas+'\\')|arg) +
                egroup.suppress())
      print([(t,s,e) for t,s,e in subidx.scanString(r'\sbji{hey {\bf you}\textit{me} there}')])

      • Paul McGuire

        Paul McGuire - 2005-12-07

        Tim -

        (Sorry for the delayed response - I'm traveling on business at the moment.)

        Thanks for posting your first attempts.  In general, LaTeX parsing is tricky - see John Hunter's work with matplotlib, whose mathtext module parses LaTeX using pyparsing.  Also, here is a link to an archived thread on the comp.compilers USENET group discussing this problem (

        I have inserted some modifications and associated comments into your first-cut program - hope they clear up some of pyparsing's idiosyncrasies!

        And thanks for using pyparsing!
        -- Paul

        from pyparsing import *

        sbji   = Keyword(r'\sbji')
        bgroup = Literal('{')
        egroup = Literal('}')
        #1. Create an expression for a LaTeX element - it appears in two places, and is likely to
        # expand as you find more types of  text in your input data
        elem = Word(alphas+'\\')

        #1a. For example, what about punctuation? numbers?
        elem = Word(alphas+'\\') | oneOf('. , ? - ( )') | Word(nums)

        #1b. I would also suggest handling those pesky LaTeX '\' commands with their own
        # expression - and I think they also permit numbers.  The Word constructor has a
        # two-argument form: when 2 args are supplied, the first arg gives the set of allowed
        # initial characters for the word, and the second arg gives the set of allowed body
        # characters for the word.
        cmd = Word('\\',alphanums)
        punc = oneOf('. , ? ! - ( ) : ; * ^ % $ @ ~ = + [ ] / < > " \' ')
        elem = cmd | Word(alphanums+'_') | punc

        arg    = Forward()

        #2. Right idea with arg being a Forward, but there is one step missing.  When
        # using a Forward, the purpose is to "forward declare" the expression so that you
        # can use it in other expressions before defining its own contents, usually using
        # one of the expressions it is defined *in*, thereby giving the recursion support.
        # In this case, we will define an elem to be made up of an arg, and then use
        # elem to define the expression for an arg.  This will support unlimited recursion
        # of {}'s.
        elem = cmd | Word(alphanums+'_') | punc | arg

        arg    << bgroup + OneOrMore(elem) + egroup
        subidx = (sbji +
                  bgroup.suppress() +
                  OneOrMore(elem | arg) +
                  egroup.suppress())

        #2a. Only one more thing to resolve, and I don't know the answer.  Is the \sbji
        # command capable of recursion?  That is, could you have a subidx nested in another
        # subidx?  As it turns out, I think this grammar will "work", but only the outermost
        # subidx will be recognized as such - any embedded subidx will just get included in the
        # arg expression of the outer subidx.

        # slightly more challenging test string, with deeper nesting of {}'s
        teststring = r'\sbji{hey {\bf you \it{and I mean you!}}\textit{me} there}'
        print([(t,s,e) for t,s,e in subidx.scanString(teststring)])

        for t,s,e in subidx.scanString(teststring):
            print(t,s,e)

        #3. Optional - I don't know if you care about any {} nesting any deeper within the arg,
        # other than the grouping that the outermost arg does for you.  If you *do* want to
        # interpret the sub- and sub-sub-, and sub-sub-sub-etc. groups, here is a slightly modified
        # definition of arg.  By suppressing the {}'s and Group'ing the expression (just like you
        # did above with subidx), the {}'s disappear altogether, and we get a nicely nested
        # ParseResults to work with (or just treat it like a list if you prefer).
        arg    << Group( bgroup.suppress() + OneOrMore(elem) + egroup.suppress() )

        for t,s,e in subidx.scanString(teststring):
            print(t,s,e)
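
Pulling the suggestions in this reply together, a Python 3 version of the grouped grammar might look like the sketch below. It is not Paul's exact program: it assumes a pyparsing release that supports `<<=` for filling in a Forward, the punctuation set is trimmed for brevity, and the name matches is introduced here only so the result can be inspected.

```python
from pyparsing import (Forward, Group, Keyword, Literal, OneOrMore, Word,
                       alphanums, oneOf)

sbji = Keyword(r'\sbji')
bgroup = Literal('{').suppress()
egroup = Literal('}').suppress()

cmd = Word('\\', alphanums)          # a backslash, then letters or digits
punc = oneOf('. , ? ! - ( ) : ;')    # trimmed punctuation set for this sketch

# Forward-declare arg so elem can refer to it before arg has any contents.
arg = Forward()
elem = cmd | Word(alphanums + '_') | punc | arg

# Fill in arg in terms of elem; Group turns each {...} into a nested list.
arg <<= Group(bgroup + OneOrMore(elem) + egroup)
subidx = sbji + bgroup + OneOrMore(elem) + egroup

teststring = r'\sbji{hey {\bf you \it{and I mean you!}}\textit{me} there}'
matches = [(tokens.asList(), start, end)
           for tokens, start, end in subidx.scanString(teststring)]
for tokens, start, end in matches:
    print(tokens, start, end)
```

With the braces suppressed and each arg grouped, the nesting of the token list mirrors the nesting of the braces in the input.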

        • Tim Arnold

          Tim Arnold - 2005-12-09

          hey Paul,
          well, you rock! Thanks for your answer. I played around with it all day yesterday and I understand things much better.
          I'm now moving to the next step using transformString, so I'll probably pop back up here later. 

          thanks again for such a great response.
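
As a rough idea of what the transformString step could look like for translating one set of tags to another, here is a minimal sketch. The tag mapping is hypothetical and chosen only for illustration (the thread never says which target tags are wanted), and the names tag_map, tag, retag, and source are likewise assumptions.

```python
from pyparsing import Keyword, MatchFirst

# Hypothetical old-tag -> new-tag mapping, for illustration only.
tag_map = {
    r'\sbji': r'\index',
    r'\bf': r'\bfseries',
}

# Keyword only matches whole command names, so a tag such as \bf is not
# matched inside a longer name that merely starts with the same letters.
tag = MatchFirst([Keyword(old) for old in tag_map])

def retag(tokens):
    # The returned string replaces the matched tag in the output text.
    return tag_map[tokens[0]]

tag.setParseAction(retag)

source = r'Intro \sbji{hey {\bf you} there} outro \textit{kept}.'
converted = tag.transformString(source)
print(converted)
```

Everything the grammar does not match, including braces and unlisted commands, passes through transformString unchanged.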

