Thread: [Pyparsing] Strategies for use with ParseFile
From: <dav...@l-...> - 2008-01-22 17:43:43
All,

Been using pyparsing for a long time, and I feel like I'm using it in a poor fashion, as it seems to be quite cumbersome to use.

Some background: We need to parse text files that are routinely hundreds of thousands of lines long. The grammar is rather complicated (guesstimate of 300 rules). The grammar is stored in a class, with each rule a static class variable. I have another class (a parser) that subscribes to rule subsets through the usage of "setParseAction" for the interesting rules. When an interesting rule is encountered, my parser class is called. It then pulls out the interesting tokens, constructs a Python object, and then fires a callback function, where an interested user of this data can act upon it.

Now, an "interesting" rule may be composed of, say, 10 subrules. I don't need their info individually, but I can get it through the composite object.

So, two questions:
1.) Is there any easy way to retrieve the original text for an entire EDT (see below)?
2.) Any suggestions for better organization of the data? I've thought about some inheritance usage because the file has header data, and oneOrMore() of 6 different "things" (one of which is an EDT, illustrated below), but it seems like a bit of a shoehorn.

Thanks

------------------------------------------------------------------------
class Grammar:
    <snip>
    EnumeratedDataType = \
        Keyword("(EnumeratedDataType") + \
        EDT_Name + \
        Optional(EDT_Description) + \
        Optional(EDT_MomEnumeratedDataType) + \
        EDT_AutoSequence + \
        Optional(EDT_Description) + \
        EDT_StartValue + \
        OneOrMore(EDT_Enumeration) + \
        ")"
------------------------------------------------------------------------
class Parser:
    <snip>
    def __EDT_setParseActions__(self):
        """Set the parse actions for the EDT elements"""
        Grammar.EnumeratedDataType.setParseAction(self.__EDT__)

        # These can all be handled identically. One of each only.
        Grammar.EDT_Name.setParseAction(self.__EDT_Element__)
        Grammar.EDT_MomEnumeratedDataType.setParseAction(self.__EDT_Element__)
        Grammar.EDT_AutoSequence.setParseAction(self.__EDT_Element__)
        Grammar.EDT_Description.setParseAction(self.__EDT_Element__)
        Grammar.EDT_StartValue.setParseAction(self.__EDT_Element__)

        # You can have one or more of these
        Grammar.EDT_Enumeration.setParseAction(self.__EDT_Enumeration__)
        Grammar.EDT_Enumerator.setParseAction(self.__EDT_Enum_Element__)
        Grammar.EDT_Representation.setParseAction(self.__EDT_Enum_Element__)

    def __EDT__(self, s, l, toks):
        # Fire the EDT callback and reset the parent. We've already stored the
        # data we care about
        self.__fireCallback__(OMDParser.EDT_TOKEN, self.__ParentElement__)
        self.__ResetParent__()

    def __EDT_Element__(self, s, l, toks):
        """
        This method is called whenever we encounter an EDT element.
        We add the element to the __ParentElement__ dictionary
        """
        # Init the parent, and add the parsed item
        self.__InitParent__(self.EDT_TOKEN)
        self.__ParentElement__.addKey(toks[0], toks[1])

    def __EDT_Enumeration__(self, s, l, toks):
        """
        This method is called whenever an enumeration is fully parsed.
        We must now add it to the parent element and reset the child
        """
        self.__ParentElement__.appendKey("Enumerations", self.__ChildElement__)
        self.__ResetChild__()

    def __EDT_Enum_Element__(self, s, l, toks):
        """
        This method is called whenever we encounter an Enumeration element
        We add the element to the CurrEnumeration dictionary
        """
        # Initialize the child element, and set the current element
        self.__InitChild__("Enumeration")
        self.__ChildElement__.addKey(toks[0], toks[1])
------------------------------------------------------------------------
USAGE!!!:
def gotEDT(EDT):
    print EDT

# Start of "Main" function
if __name__ == "__main__":
    op = Parser(<fileName>)
    op.registerCallback(OMDParser.EDT_TOKEN, gotEDT)
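The subscription scheme described in this message (a parse action builds an object and then notifies registered listeners) can be sketched in a few lines. This is only an illustration under stated assumptions, not the poster's actual code: `MiniParser`, `register_callback`, `_on_enum` and the toy `(Enum Name 1 2 3)` syntax are all hypothetical names invented here, and the camelCase pyparsing calls follow the style used in the thread.

```python
from collections import defaultdict

from pyparsing import Group, OneOrMore, Suppress, Word, alphas, nums


class MiniParser(object):
    """Hypothetical stand-in for the Parser class described in the thread."""

    EDT_TOKEN = "EDT"

    def __init__(self):
        # Maps a token type to the list of functions subscribed to it.
        self._callbacks = defaultdict(list)

        # Toy grammar (an assumption, far simpler than the real EDT rule):
        #   (Enum <name> <value> <value> ...)
        name = Word(alphas)
        value = Word(nums)
        self._rule = (Suppress("(Enum") + name("name")
                      + Group(OneOrMore(value))("values") + Suppress(")"))

        # One parse action on the composite rule; results names give access
        # to the sub-matches without a parse action per subrule.
        self._rule.setParseAction(self._on_enum)

    def register_callback(self, token_type, func):
        self._callbacks[token_type].append(func)

    def _on_enum(self, s, loc, toks):
        # Build a plain Python object from the named results, then notify
        # every subscriber for this token type.
        record = {"name": toks["name"], "values": list(toks["values"])}
        for func in self._callbacks[self.EDT_TOKEN]:
            func(record)

    def parse(self, text):
        OneOrMore(self._rule).parseString(text)


# Usage: collect every record that the parser reports.
seen = []
parser = MiniParser()
parser.register_callback(MiniParser.EDT_TOKEN, seen.append)
parser.parse("(Enum Color 1 2 3) (Enum Size 10 20)")
print(seen)
```

Building the grammar inside `__init__` is a deliberate choice in this sketch: each parser instance gets its own expressions, so attaching parse actions in one instance does not overwrite those attached by another, which can happen when the rules are shared class variables.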
From: Paul M. <pt...@au...> - 2008-01-23 04:51:47
David -

This does seem fairly complicated, but I think your approach in using parse actions as parse-time callbacks to build a data structure is actually pretty typical.

To answer your specific questions:

1. There is a parse action keepOriginalText which may do the trick for you. Maybe this example would help:

    from pyparsing import *

    a_s = Word("a")
    b_s = Word("b")
    c_s = Word("c")

    allwords = a_s + b_s + c_s
    def showTokens(tokens):
        print "Showing tokens:", tokens.asList()

    allwords.setParseAction(showTokens, keepOriginalText, showTokens)
    allwords.parseString("aaaaa bbbb cccc")

Prints:
    Showing tokens: ['aaaaa', 'bbbb', 'cccc']
    Showing tokens: ['aaaaa bbbb cccc']

When allwords is parsed, the 3 parse actions are called in turn. First showTokens is called with the individual tokens returned from matching a_s, b_s, and c_s. Then keepOriginalText is called, which changes the matched tokens back to the original text. Then showTokens is called again to show the effect of calling keepOriginalText. Does this help?

2. I don't really have much to go on to answer your second question. It *is* possible that you don't need multiple callbacks to create Python objects and return them. Instead, you can just have the related class define __init__ to accept the tokens that are passed to a parse action, and just name the class as the parse action. This will cause the __init__ method to be called with the matched tokens, and the constructed object will be returned to the parser. There are examples of this in the PyCon presentation that ships with pyparsing, describing the interactive adventure game; there is also an example in the pyparsing O'Reilly short cut, in which a query string gets converted to a sequence of classes. For example:

    class XClass(object):
        def __init__(self,tokens):
            self.matchedText = tokens[0]
        def __repr__(self):
            return "%s:(%s)" % (self.__class__.__name__,self.matchedText)
    class AClass(XClass): pass
    class BClass(XClass): pass
    class CClass(XClass): pass
    a_s.setParseAction(AClass)
    b_s.setParseAction(BClass)
    c_s.setParseAction(CClass)

    allwords = a_s + b_s + c_s

    print allwords.parseString("aaaaa bbbb cccc").asList()

Prints:
    [AClass:(aaaaa), BClass:(bbbb), CClass:(cccc)]

Also, your naming convention is a little distracting: leading and trailing double-underscores are usually reserved for "magic" functions, such as __str__, __call__, etc. So when you use them on your own class and method names, it looks confusing to me.

Also, I don't know if you are gaining anything by burying different pyparsing expressions/rules inside class variables. This sounds vaguely Java-esque to me. In Python, things *can* exist outside of a class...

I don't feel that I've really addressed all of your questions/concerns. Can you distill this architecture down to some small examples, and repost? Otherwise, I'd say this is pretty much in line with how you would parse this data and use it to construct an overall data structure with it.

-- Paul
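For readers on a newer pyparsing release, the two suggestions in this reply can be combined in one short sketch. It assumes the `originalTextFor` helper, which to my knowledge superseded `keepOriginalText` in later releases (the latter no longer appears to be available). The `Composite` class is a hypothetical name introduced only for this illustration. Note the use of `addParseAction` rather than `setParseAction`: `originalTextFor` installs its own parse action to do the text extraction, and `setParseAction` would replace it.

```python
from pyparsing import Word, originalTextFor

a_s = Word("a")
b_s = Word("b")
c_s = Word("c")

# Wrap the composite expression so that its result is the slice of the
# input string it matched, instead of the individual sub-tokens.
allwords = originalTextFor(a_s + b_s + c_s)


class Composite(object):
    """Hypothetical container, constructed directly from the matched tokens."""

    def __init__(self, tokens):
        # After originalTextFor has run, tokens[0] is the original text.
        self.text = tokens[0]

    def __repr__(self):
        return "Composite(%r)" % self.text


# Name the class itself as an additional parse action; its __init__ is
# called with the tokens and the new object replaces them in the results.
allwords.addParseAction(Composite)

result = allwords.parseString("aaaaa bbbb   cccc")
obj = result[0]
print(obj)
```

Applied to the grammar in the question, the same wrapping would presumably be placed around the composite `EnumeratedDataType` expression, so that one object per EDT carries the raw text of the whole rule.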
From: <dav...@l-...> - 2008-01-23 14:55:00
I think you answered my main concerns with using the framework. Some of the more "strategic" questions don't seem to be answered well in the documentation, and getting a third-party perspective is certainly useful.

I've heard of several things with the __somefunc__ naming. One main thing that I've heard is that it's more like the equivalent of "private" functions than "magic" functions. It's a way to make it glaringly obvious what a class user should keep their sticky little hands out of :). While not shown in my example, the Parser class does have "public" functions for subscription to the resultant objects.

The whole framework comes about from the fact that this is really meant to be a generic parser. There can be quite a few different utility "things" that can be done with the parsed data, and it is really quite useful to have generic callbacks that different classes can use. Like I mentioned previously, the grammar and file size make the whole parsing of the file quite an ordeal, but pyparsing is much easier to use than the corresponding (ugly as hell) Perl framework that we used to use.

P.S. I would like to personally thank you for one of the most well-structured, thoughtful (as in you put a lot of thought into it :-P ), and useful responses to anything I've ever posted on *any* mailing list.

Thanks!
From: Paul M. <pt...@au...> - 2008-01-23 15:41:28
David -

The naming schemes go like this (cf. http://www.python.org/dev/peps/pep-0008/, under "Naming Conventions"):

__xxx__ : "magic" methods used by convention by Python internals. Examples include __init__, __call__, __add__, __del__, __dict__ (pyparsing uses methods like __add__, __or__, __xor__, etc. to do the operator overloading)

_xxx : quasi-private names; these do not get imported when using "from module import *"

xxx_ : convention for naming variables that conflict with Python keywords (class_, for_, etc.)

__xxx : class attributes with a leading double-underscore are name-mangled by the Python interpreter to "hide" them externally. This is a form of private, but it can be worked around if you really, really, really need to (reference __xxx in class Y as _Y__xxx, but to my mind, this is an even worse red flag than using an attribute with a leading underscore).

As you've probably noticed, pyparsing doesn't fully comply with PEP8, mostly with respect to using camel case names instead of names_with_underscores. I think it is just my own personal history - I used to use names with underscores back in my C and PL/I days, and then "graduated" to mixed case when I moved to Smalltalk, C++ and Java.

And I'm glad you were able to make some sense of my ramblings. :)

-- Paul
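The four naming forms listed in this message can be checked directly in the interpreter. The following is a small sketch in plain Python; the class `Y` and its attribute names are made up for the illustration. It also shows a point relevant to the earlier discussion: a name with both leading and trailing double underscores is stored as written and is not mangled, so it gets no extra "privacy" from the interpreter.

```python
class Y(object):
    def __init__(self):
        self._quasi = 1      # _xxx: private by convention only
        self.class_ = 2      # xxx_: sidesteps the keyword "class"
        self.__hidden = 3    # __xxx: stored under the mangled name _Y__hidden
        self.__both__ = 4    # __xxx__: not mangled, stored as written

    def __repr__(self):      # a real "magic" method, used by repr() and print()
        # Inside the class body, self.__hidden is mangled consistently,
        # so this access works.
        return "Y(hidden=%d)" % self.__hidden


y = Y()
attribute_names = sorted(vars(y))
print(attribute_names)          # the names actually stored on the instance
print(hasattr(y, "__hidden"))   # the unmangled name is not visible from outside
print(y._Y__hidden)             # the workaround mentioned above
print(repr(y))
```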