Thread: [Pyparsing] Parsing LaTeX and regression
From: <cmi...@kd...> - 2007-06-23 23:20:08

Hello,

I'm a beginner in Python and pyparsing and I'm trying to write a script to parse LaTeX code to search & replace and suppress specific tags.

I've got this code:

    # coding=latin1
    from pyparsing import *

    tag = Literal ("\\")
    tagname= Word( alphas )
    openingbracket = Literal("{")
    text = Word( alphas + "éèà" + " ")
    closingbracket = Literal("}")

    paragraph = Forward()
    paragraphitem = Optional(text) + Optional(paragraph) + Optional(text)
    paragraph << tag +tagname+ openingbracket + Group(paragraphitem) + closingbracket

    test = " Starting text \\emph{This sentence is in \\textit{italics} in Bembo} \\emph{This sentence is in \\textit{italics} in bembo and in \\emph{Italian}} Middle filling \\emph{This second sentence is in Emphasis} End"

    for foundparagraph in paragraph.scanString(test) :
        print test, "-->", foundparagraph

I would like pyparsing to return:

1) \emph{This sentence is in \textit{italics} in Bembo}
2) \emph{This sentence is in \textit{italics} in bembo and in \emph{Italian}}
3) \emph{This second sentence is in Emphasis}

My script does not parse 2) correctly and I'm puzzled.

Cheers,
Charles

From: Paul M. <pa...@al...> - 2007-06-24 00:01:27

Charles -

Nice first attempt at a LaTeX parser, and you really are quite close. Well done in using Forward in defining a recursive grammar.

Here is your source code, with some comments - your original code is commented out with #~ characters. The final grammar version produces the text as you expect.

-- Paul

    # coding=latin1
    from pyparsing import *

    tag = Literal ("\\")
    tagname= Word( alphas )
    openingbracket = Literal("{")

    # Avoid including spaces as part of any Word definitions.
    # With spaces allowed as part of a Word, it is easy for the
    # parser to read farther than you want. Your grammar for
    # this application works without this change, but I encourage
    # you to avoid this practice.
    #~ text = Word( alphas + "éèà" + " ")
    text = Word( alphas + "éèà")
    closingbracket = Literal("}")

    paragraph = Forward()

    # On the right track, but making paragraphitem this text/paragraph/text
    # structure is too rigid, and doesn't allow for enough repetition
    #~ paragraphitem = Optional(text) + Optional(paragraph) + Optional(text)
    #~ paragraph << tag +tagname+ openingbracket + Group(paragraphitem) + closingbracket

    # simplify paragraphitem to a list of alternatives of what you would find
    # in a paragraph
    paragraphitem = text | paragraph

    # move the repetition of paragraphitems into the definition of paragraph
    # using ZeroOrMore; now the mixing of text and paragraphs is unlimited
    paragraph << tag +tagname+ openingbracket + ZeroOrMore(paragraphitem) + closingbracket

    test = r""" Starting text \emph{This sentence is in \textit{italics} in
    Bembo} \emph{This sentence is in \textit{italics} in bembo and in
    \emph{Italian}} Middle filling \emph{This second sentence is in Emphasis} End"""
    print test
    print

    # First cut at the parser, using searchString
    # scanString returns a tuple of (tokens, startLocation, endLocation)
    # If all you want are the tokens, use searchString instead.
    #~ for foundparagraph in paragraph.scanString(test) :
    for foundparagraph in paragraph.searchString(test) :
        print "-->", foundparagraph.asList()
    print

    # same parser, but just add a parse action to paragraph that retains
    # the whitespace of the original text - keepOriginalText is included
    # with pyparsing
    paragraph.setParseAction( keepOriginalText )
    for foundparagraph in paragraph.searchString(test) :
        print "-->", foundparagraph.asList()
    print

    # same parser again, but with another parse action to strip out newlines
    paragraph.addParseAction( lambda toks: toks[0].replace("\n","") )
    for foundparagraph in paragraph.searchString(test) :
        print "-->", foundparagraph.asList()
    print

Prints out:

     Starting text \emph{This sentence is in \textit{italics} in
    Bembo} \emph{This sentence is in \textit{italics} in bembo and in
    \emph{Italian}} Middle filling \emph{This second sentence is in Emphasis} End

    --> ['\\', 'emph', '{', 'This sentence is in ', '\\', 'textit', '{', 'italics', '}', 'in ', 'Bembo', '}']
    --> ['\\', 'emph', '{', 'This sentence is in ', '\\', 'textit', '{', 'italics', '}', 'in bembo and in ', '\\', 'emph', '{', 'Italian', '}', '}']
    --> ['\\', 'emph', '{', 'This second sentence is in Emphasis', '}']

    --> ['\\emph{This sentence is in \\textit{italics} in \nBembo}']
    --> ['\\emph{This sentence is in \\textit{italics} in bembo and in \n\\emph{Italian}}']
    --> ['\\emph{This second sentence is in Emphasis}']

    --> ['\\emph{This sentence is in \\textit{italics} in Bembo}']
    --> ['\\emph{This sentence is in \\textit{italics} in bembo and in \\emph{Italian}}']
    --> ['\\emph{This second sentence is in Emphasis}']

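For readers on Python 3, here is a minimal sketch of the same corrected grammar. It is not the code posted in the thread: it assumes a pyparsing release that provides originalTextFor (used here in place of keepOriginalText, which I believe it superseded) and the <<= operator for Forward. The names backslash, command and found are illustrative only, and the test string is put on one line so no newline stripping is needed.

```python
from pyparsing import Forward, Literal, Word, ZeroOrMore, alphas, originalTextFor

backslash = Literal("\\")
tagname = Word(alphas)
# No space inside the Word definition; pyparsing skips whitespace between tokens.
text = Word(alphas + "éèà")

# A paragraph is \name{ ... } whose body freely mixes words and nested paragraphs.
paragraph = Forward()
paragraph <<= backslash + tagname + Literal("{") + ZeroOrMore(text | paragraph) + Literal("}")

# originalTextFor returns the matched slice of the input, whitespace included.
command = originalTextFor(paragraph)

test = (
    r"Starting text \emph{This sentence is in \textit{italics} in Bembo} "
    r"\emph{This sentence is in \textit{italics} in bembo and in \emph{Italian}} "
    r"Middle filling \emph{This second sentence is in Emphasis} End"
)

# searchString yields one ParseResults per top-level match; keep the single string in each.
found = [tokens[0] for tokens in command.searchString(test)]
for number, item in enumerate(found, start=1):
    print("%d) %s" % (number, item))
# 1) \emph{This sentence is in \textit{italics} in Bembo}
# 2) \emph{This sentence is in \textit{italics} in bembo and in \emph{Italian}}
# 3) \emph{This second sentence is in Emphasis}
```

The nested \textit and inner \emph commands are consumed as part of their enclosing match, so only the three top-level commands are reported.
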
From: <cmi...@kd...> - 2007-06-24 22:00:14

Thanks very much Paul for your quick and comprehensive answer. It works indeed very well and it is impressive how much you can do with pyparsing in so few lines of code.

There is one line that is more or less black magic (but I'm a Python newbie). I don't understand the [0]. toks is a list of strings and .replace works on strings. Why is it not necessary to iterate the replace method over every item of the toks list? You apply the method only to the first item.

I don't understand the logic.

> # same parser again, but with another parse action to strip out newlines
> paragraph.addParseAction( lambda toks: toks[0].replace("\n","") )

Cheers,
Charles

From: Paul M. <pa...@al...> - 2007-06-25 00:38:31

> Thanks very much Paul for your quick and comprehensive answer. It works
> indeed very well and it is impressive how much you can do with pyparsing in
> so few lines of code.

De rien - :)

> There is one line that is more or less black magic (but I'm a Python
> newbie). I don't understand the [0]. Toks is a list of strings and .replace
> works on strings. Why it is not necessary to iterate the replace method on
> every item of the toks list ? You apply the method only on the first item.
>
> I don't understand the logic.
>
> > # same parser again, but with another parse action to strip out newlines
> > paragraph.addParseAction( lambda toks: toks[0].replace("\n","") )

First, let me apologize for using a coding style for these parse actions that is unfriendly to Python newbies. This lambda is the same as this function:

    def newlineRemover( toks ):
        return toks[0].replace("\n","")

and would be set as the parse action with this statement:

    paragraph.addParseAction( newlineRemover )

A parse action is called when a given element within your grammar is matched within the input string. Parse actions can have one of the following signatures:

    parseAction(inputString,locn,parsedTokens)
    parseAction(locn,parsedTokens)
    parseAction(parsedTokens)
    parseAction()

Where:

    inputString is the complete string being parsed
    locn is the location within the input string where the parse element was found
    parsedTokens is a ParseResults object containing the matched tokens

In this particular case, I know that a parse action attached to paragraph will be sent a ParseResults object containing a single string, so I don't bother iterating over the whole sequence of strings. This will vary by expression - some expressions will send multiple strings. Look at these two definitions for a floating point number:

    real1 = Word(nums) + "." + Word(nums)
    real2 = Combine( Word(nums) + "." + Word(nums) )

If you want to parse "3.14159", real1 will return ["3",".","14159"], and real2 will return ["3.14159"].

Parse expressions with implicit repetition, such as OneOrMore, may send a sequence of unforeseeable length, and these would likely require a parse action that processes the sequence using some form of iteration or map.

So the short answer is "it depends." For this particular parse element, I *know* that the parse action will *always* get a ParseResults containing a single element, so I don't bother iterating over it, I just use element 0.

While you are developing a grammar, you can use a decorator that is provided with pyparsing, traceParseAction, like this:

    @traceParseAction
    def newlineRemover( toks ):
        return toks[0].replace("\n","")

This will print out the arguments being passed to the parse action, and the value returned from it.

-- Paul

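The points above about token counts and parse action signatures can be tried out with a short Python 3 sketch. It assumes a pyparsing release that still accepts the camelCase method names used in this thread. The helper names upcase_all and with_location, and the sample inputs, are made up for illustration.

```python
from pyparsing import Combine, OneOrMore, Word, alphas, nums

# Without Combine, each matched piece arrives as a separate token.
real1 = Word(nums) + "." + Word(nums)
# With Combine, the pieces are joined into a single token.
real2 = Combine(Word(nums) + "." + Word(nums))

split_tokens = real1.parseString("3.14159").asList()
combined = real2.parseString("3.14159").asList()
print(split_tokens)  # ['3', '.', '14159']
print(combined)      # ['3.14159']

# A repetition such as OneOrMore sends a variable number of tokens,
# so this parse action iterates instead of reading toks[0] only.
def upcase_all(toks):
    return [tok.upper() for tok in toks]

words = OneOrMore(Word(alphas)).setParseAction(upcase_all)
shouted = words.parseString("in bembo and in").asList()
print(shouted)  # ['IN', 'BEMBO', 'AND', 'IN']

# The three-argument signature also receives the input string and match location.
def with_location(input_string, locn, toks):
    return "%s@%d" % (toks[0], locn)

number = Word(nums).setParseAction(with_location)
located = number.parseString("42").asList()
print(located)  # ['42@0']
```

Reading only toks[0] is safe for real2 and for number because each always yields exactly one token. For words it would silently drop everything after the first word.
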
From: Paul M. <pa...@al...> - 2007-06-25 00:58:38

Oh! I just reread your original e-mail more closely, in that you are trying to "search&replace and suppress specific tags". Please look into the transformString method as an alternative to scanString or searchString.

With transformString, you compose parse actions that perform the desired text modification, and return the modified string. transformString takes these mods and inserts them back into the original string in place of the parsed tokens. Suppressing is even easier, just wrap the expression in a pyparsing Suppress object.

There are some examples in the scanExamples.py file in the pyparsing examples directory, and also look at the transformString usage in htmlStripper.py.

Cheers,
-- Paul
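
As a rough illustration of the transformString and Suppress suggestion, here is a small Python 3 sketch. It is not taken from scanExamples.py or htmlStripper.py. The choice of tags (renaming \textit to \emph, dropping \index{...}) and the names rename_textit, drop_index and transformer are assumptions made up for this example.

```python
from pyparsing import Literal, Suppress, Word, alphas

# Replace: the parse action's return value is substituted for the matched text.
rename_textit = Literal("\\textit").setParseAction(lambda toks: "\\emph")

# Suppress: the whole \index{word} command matches but contributes no tokens,
# so transformString removes it from the output.
drop_index = Suppress(Literal("\\index") + "{" + Word(alphas) + "}")

transformer = rename_textit | drop_index

source = r"Set in \textit{italics} in Bembo\index{Bembo} type"
result = transformer.transformString(source)
print(result)  # Set in \emph{italics} in Bembo type
```

Text that matches neither expression is copied through unchanged, which is what makes transformString convenient for search-and-replace over a whole document.
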