Thread: [Pyparsing] A newbie w/nested structures
From: Tim C. <tim...@gm...> - 2007-09-09 07:07:53
Hi All,

Okay, this problem has driven me to the point of asking for help! :-)

I have spent several days reading all I can find on Pyparsing, trying out the examples and stepping through them with a debugger, but for some reason I just can't make my parser work. I have tried so many different approaches to defining the grammar that it is pitiful. ;-) My results varied but were never anywhere close. Usually I just got an empty list back.

I know that my main problem is that I don't know how to approach defining the nested structures. So, somehow I just don't understand something very fundamental about Pyparsing. I intuitively *know* it will be a great help to my project, "once I get it". I really do not want to hand code a complete parser when it seems that Pyparsing has all these great features. So if someone could point me in some direction other than the one I'm going in now I would appreciate it. :-)

The scripts I need to parse (see below for an example) are nested structures that represent an object hierarchy based on a reference model. The archetype definition language (ADL) was designed specifically for this purpose in order to maintain the semantic relationship of the objects.

If anyone has time to discuss the basic starting strategy for this project maybe that will trigger whatever I'm missing.

Thanks.
Tim -- Timothy Cook, MSc Health Informatics Research & Development Services http://timothywayne.cook.googlepages.com/home 01-904-322-8582 ================================================================================================== archetype (adl_version=1.4) openEHR-EHR-COMPOSITION.report.v1 concept [at0000] -- Report language original_language = <[ISO_639-1::en]> description original_author = < ["name"] = <"XXXXXXXXX"> ["organisation"] = <"YYYYYYYYYYYYYYY"> ["date"] = <"28/06/2006"> ["email"] = <"xx...@co..."> > details = < ["en"] = < language = <[ISO_639-1::en]> purpose = <"A generic report which provides the features to allow recording of who requested the report and to whom it has been sent. Participation classes allow further information to be recorded"> use = <""> misuse = <""> > > lifecycle_state = <"AuthorDraft"> other_contributors = <> definition COMPOSITION[at0000] matches { -- Report category matches { DV_CODED_TEXT matches { defining_code matches {[openehr::433]} } } context matches { EVENT_CONTEXT matches { other_context matches { ITEM_TREE[at0001] matches { -- Tree items cardinality matches {0..1; ordered} matches { CLUSTER[at0002] occurrences matches {0..1} matches { -- Request details items cardinality matches {0..1; ordered} matches { ELEMENT[at0003] occurrences matches {0..*} matches { -- Request identifier value matches { DV_TEXT matches {*} } } ELEMENT[at0004] occurrences matches {0..1} matches { -- Requesting clinician value matches { DV_TEXT matches {*} } } ELEMENT[at0012] occurrences matches {0..1} matches { -- Contact details of requesting clinician value matches { DV_TEXT matches {*} } } ELEMENT[at0005] occurrences matches {0..1} matches { -- Date of request value matches { DV_DATE_TIME matches { value matches {yyyy-??-??T??:??:??} } } } } } CLUSTER[at0006] occurrences matches {0..1} matches { -- Report details items cardinality matches {0..1; ordered} matches { ELEMENT[at0007] occurrences matches {0..*} matches { -- Report identifier 
value matches { DV_TEXT matches {*} } } ELEMENT[at0014] occurrences matches {0..1} matches { -- Status value matches { DV_CODED_TEXT matches { defining_code matches { [local:: at0015, -- Final at0016, -- Interim at0017, -- Supplementary at0018] -- Corrected/amended } } } } ELEMENT[at0013] occurrences matches {0..1} matches { -- Date/time report issued value matches { DV_DATE_TIME matches { value matches {yyyy-??-??T??:??:??} } } } CLUSTER[at0008] occurrences matches {0..1} matches { -- Copies to items cardinality matches {0..*; unordered} matches { ELEMENT[at0009] occurrences matches {0..*} matches { -- Copied party details value matches { DV_TEXT matches {*} } } } } CLUSTER[at0010] occurrences matches {0..1} matches { -- Referrals items cardinality matches {0..*; unordered} matches { ELEMENT[at0011] occurrences matches {0..*} matches { -- Referred party details value matches { DV_TEXT matches {*} } } } } } } } } } } } } ontology term_definitions = < ["en"] = < items = < ["at0000"] = < description = <"Generic reporting composition in response to a request for information or testing"> text = <"Report"> > ["at0001"] = < description = <"@ internal @"> text = <"Tree"> > ["at0002"] = < description = <"Information about the request"> text = <"Request details"> > ["at0003"] = < description = <"Identification of the request"> text = <"Request identifier"> > ["at0004"] = < description = <"Information about the requesting clinician"> text = <"Requesting clinician"> > ["at0005"] = < description = <"The date of the request"> text = <"Date of request"> > ["at0006"] = < description = <"Details of the report"> text = <"Report details"> > ["at0007"] = < description = <"Identification information about the report"> text = <"Report identifier"> > ["at0008"] = < description = <"Collection of parties who have been copied the report"> text = <"Copies to"> > ["at0009"] = < description = <"Details of the parties to whom the copies have been copied"> text = <"Copied party details"> > 
["at0010"] = < description = <"Collection of parties who have been referred to generate the report"> text = <"Referrals"> > ["at0011"] = < description = <"Details of the parties to whom the specimen or findings have been referred for special testing or elaboration"> text = <"Referred party details"> > ["at0012"] = < description = <"Details for contacting requesting clinician"> text = <"Contact details of requesting clinician"> > ["at0013"] = < description = <"The date and time the report was officially issued"> text = <"Date/time report issued"> > ["at0014"] = < description = <"The status of the report"> text = <"Status"> > ["at0015"] = < description = <"This report is the final report"> text = <"Final"> > ["at0016"] = < description = <"This report is an interim report and a final or further interim report is to be expected"> text = <"Interim"> > ["at0017"] = < description = <"This report is supplementary to a previous report"> text = <"Supplementary"> > ["at0018"] = < description = <"This report is a correction or amendment of a previous report"> text = <"Corrected/amended"> > > > > |
From: Paul M. <pa...@al...> - 2007-09-09 09:30:58
Tim -

First of all, this is a very ambitious parser to start with, so don't be discouraged. It is a recursive grammar, which is also a more advanced parser to start with. Here are some suggestions on getting started:

- pick a part of the sample ADL (I would suggest working section by section)
- develop a simple BNF for this grammar

Here is a sample parser for the ontology section. It is a recursive example, defining a valueDef that is defined in terms of component valueDefs. It also shows the comment format, and the mechanism for skipping comments. I hope this sample gives you a jump start on a more complete ADL parser.

-- Paul

from pyparsing import *

# Punctuation is matched but left out of the parsed results.
LT,GT,EQ,LPAR,RPAR,LBRK,RBRK,BAR,QUOT,SEMI = map(Suppress,"<>=()[]|';")

upper = srange("[A-Z]")
lower = upper.lower()
attrName = Word(lower,alphanums+"_")

# A key is either a bare attribute name or a bracketed, quoted string.
key = attrName | (LBRK+quotedString+RBRK)
quotedString.setParseAction(removeQuotes)

# valueDef refers to itself, so declare it with Forward() and define it with <<.
valueDef = Forward()
valueDef << ( key + EQ + LT + ZeroOrMore( Group(valueDef | quotedString )) + GT )

ontologySection = "ontology" + valueDef

# ADL comments run from "--" to the end of the line; skip them while parsing.
comment = "--" + restOfLine
ontologySection.ignore(comment)

sample = """
ontology
    term_definitions = <
        ["en"] = <
            items = <
                ["at0000"] = <
                    description = <"Generic reporting composition in response to a request for information or testing">
                    text = <"Report">
                >
                ["at0001"] = <
                    description = <"@ internal @">
                    text = <"Tree">
                >
                ["at0002"] = <
                    description = <"Information about the request">
                    text = <"Request details">
                >
                ["at0003"] = <
                    description = <"Identification of the request">
                    text = <"Request identifier">
                >
                ["at0004"] = <
                    description = <"Information about the requesting clinician">
                    text = <"Requesting clinician">
                >
                ["at0005"] = <
                    description = <"The date of the request">
                    text = <"Date of request">
                >
                ["at0006"] = <
                    description = <"Details of the report">
                    text = <"Report details">
                >
                ["at0007"] = <
                    description = <"Identification information about the report">
                    text = <"Report identifier">
                >
                ["at0008"] = <
                    description = <"Collection of parties who have been copied the report">
                    text = <"Copies to">
                >
                ["at0009"] = <
                    description = <"Details of the parties to whom the copies have been copied">
                    text = <"Copied party details">
                >
                ["at0010"] = <
                    description = <"Collection of parties who have been referred to generate the report">
                    text = <"Referrals">
                >
                ["at0011"] = <
                    description = <"Details of the parties to whom the specimen or findings have been referred for special testing or elaboration">
                    text = <"Referred party details">
                >
                ["at0012"] = <
                    description = <"Details for contacting requesting clinician">
                    text = <"Contact details of requesting clinician">
                >
                ["at0013"] = <
                    description = <"The date and time the report was officially issued">
                    text = <"Date/time report issued">
                >
                ["at0014"] = <
                    description = <"The status of the report">
                    text = <"Status">
                >
                ["at0015"] = <
                    description = <"This report is the final report">
                    text = <"Final">
                >
                ["at0016"] = <
                    description = <"This report is an interim report and a final or further interim report is to be expected">
                    text = <"Interim">
                >
                ["at0017"] = <
                    description = <"This report is supplementary to a previous report">
                    text = <"Supplementary">
                >
                ["at0018"] = <
                    description = <"This report is a correction or amendment of a previous report">
                    text = <"Corrected/amended">
                >
            >
        >
    >
"""

res = ontologySection.parseString(sample)

from pprint import pprint
pprint( res.asList() )
From: Tim C. <tim...@gm...> - 2007-09-09 10:48:01
Hi Paul,

On Sun, 2007-09-09 at 04:30 -0500, Paul McGuire wrote:
> Here are some suggestions on getting started:
> - pick a part of the sample ADL (I would suggest working section by section)
> - develop a simple BNF for this grammar
>
> Here is a sample parser for the ontology section.

This is exactly what I started with. Well, it's what I dropped back to when I realized how difficult it would be. :-)

> It is a recursive
> example, defining a valueDef that is defined in terms of component
> valueDefs. It also shows the comment format, and the mechanism for skipping
> comments. I hope this sample gives you a jump start on a more complete ADL
> parser.

Thank you so very much for your prompt and informative reply.

> valueDef = Forward()
> valueDef << ( key + EQ + LT + ZeroOrMore( Group(valueDef | quotedString )) +
> GT )

I am certain that my problem is that I still do not have a good grasp of Forward(). Is this the meaning of the valueDef assignment?

"valueDefs are composed of a key followed by an = followed by < and then zero or more embedded valueDefs or quoted strings. The key is composed of an attribute name or a bracketed and quoted string"

BTW: The example you sent raises ParseException;

pyparsing.ParseException: Expected ">" (at char 55), (line:4, col:17)

Thanks.
Tim
From: Tim C. <tim...@gm...> - 2007-09-09 13:24:06
Paul,

As I started making a mess out of your code trying to find out why I was getting the exception, I discovered (**I think**) that this is not recursive. As I defined the grammar on a more atomic level I came up with a different solution. It seems to me to work, and I even simulated an additional language section and it parsed just fine as well.

Do you see any pitfalls that I may have missed? I have far to go and this was one of the simplest examples. But it is encouraging to have a bit of success.

Thanks for a great tool. Hand coding this would (as you know) have taken *a lot* more lines of error-prone code.

Cheers,
Tim

=========================code===========================
from pyparsing import *

LT,GT,EQ,LPAR,RPAR,LBRK,RBRK,BAR,QUOT,SEMI = map(Suppress,"<>=()[]|';")

upper = srange("[A-Z]")
lower = upper.lower()
attrName = Word(lower,alphanums+"_")
quotedString.setParseAction(removeQuotes)

# Description and text values may span several lines.
text = QuotedString('"',multiline=True)

# Each level of the ontology section is spelled out explicitly, innermost first.
code_descr = ("description"+EQ+LT+text+GT).setResultsName("code_descr", listAllMatches=True)
code_text = ("text"+EQ+LT+text+GT).setResultsName("code_text", listAllMatches=True)
o_atcode = (LBRK+quotedString+RBRK+EQ+LT+code_descr+code_text+GT
            ).setResultsName("o_atCode", listAllMatches=True)
o_items = ("items"+EQ+LT+OneOrMore(o_atcode)+GT
           ).setResultsName("o_items", listAllMatches=True)
o_langsection = (LBRK+quotedString+RBRK+EQ+LT+o_items+GT
                 ).setResultsName("o_langsection", listAllMatches=True)
termDefs = ("term_definitions"+EQ+LT+OneOrMore(o_langsection)+GT
            ).setResultsName("termDefs", listAllMatches=True)
ontologySection = ("ontology" + termDefs).setResultsName("ontologySection", listAllMatches=True)

comment = "--" + restOfLine
ontologySection.ignore(comment)

sample = """
<string was deleted for brevity in email>
"""

res = ontologySection.parseString(sample)

from pprint import pprint
pprint( res.asDict() )
========================================================
From: Paul M. <pa...@al...> - 2007-09-09 18:29:30
Tim -

Sorry, I must have mis-pasted the sample, here it is again. It parses okay on my system (and I tested versions back to 1.4.2).

from pyparsing import *

LT,GT,EQ,LPAR,RPAR,LBRK,RBRK,BAR,QUOT,SEMI = map(Suppress,"<>=()[]|';")

upper = srange("[A-Z]")
lower = upper.lower()
attrName = Word(lower,alphanums+"_")
key = attrName | (LBRK+quotedString+RBRK)
quotedString.setParseAction(removeQuotes)

valueDef = Forward()
valueDef << ( key + EQ + LT + ZeroOrMore( Group(valueDef | quotedString )) + GT )

ontologySection = "ontology" + valueDef

comment = "--" + restOfLine
ontologySection.ignore(comment)

sample = """
ontology
    term_definitions = <
        ["en"] = <
            items = <
                ["at0000"] = <
                    description = <"Generic reporting composition in response to a request for information or testing">
                    text = <"Report">
                >
                ["at0001"] = <
                    description = <"@ internal @">
                    text = <"Tree">
                >
                ["at0002"] = <
                    description = <"Information about the request">
                    text = <"Request details">
                >
                ["at0003"] = <
                    description = <"Identification of the request">
                    text = <"Request identifier">
                >
            >
        >
    >
"""

res = ontologySection.parseString(sample)

from pprint import pprint
pprint( res.asList() )

This prints:

['ontology',
 'term_definitions',
 ['en',
  ['items',
   ['at0000',
    ['description',
     ['Generic reporting composition in response to a request for information or testing']],
    ['text', ['Report']]],
   ['at0001', ['description', ['@ internal @']], ['text', ['Tree']]],
   ['at0002',
    ['description', ['Information about the request']],
    ['text', ['Request details']]],
   ['at0003',
    ['description', ['Identification of the request']],
    ['text', ['Request identifier']]]]]]

Your description of key and valueDef are completely correct. If you want another example of Forward, I just posted this on comp.lang.python - http://groups.google.com/group/comp.lang.python/browse_frm/thread/4f128e9df54d6962/# - it is a basic nested example with words enclosed in nested braces.

-- Paul
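The comp.lang.python post Paul links to is not reproduced in this thread. As a stand-in, here is a minimal sketch of the kind of grammar he describes, words enclosed in nested braces. The names nested and item and the sample input are illustrative assumptions, not taken from that post. It uses the <<= operator, which later pyparsing releases offer as an alternative to the parenthesized << form used in this thread.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphas

# Braces are matched but kept out of the results.
LBRACE, RBRACE = map(Suppress, "{}")

# Placeholder: 'nested' must exist before 'item' can mention it.
nested = Forward()

# An item is a plain word, or a whole nested brace group kept as a sub-list.
item = Word(alphas) | Group(nested)

# Now fill in the placeholder; the definition refers back to itself via 'item'.
nested <<= LBRACE + ZeroOrMore(item) + RBRACE

result = nested.parseString("{ a { b c } { d { e } } }").asList()
print(result)
```

Each Group(nested) becomes one sub-list, so the shape of the output mirrors the nesting of the braces, in the same way the valueDef output above mirrors the nesting of the angle brackets.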
From: Tim C. <tim...@gm...> - 2007-09-10 05:50:34
Paul,

On Sun, 2007-09-09 at 13:29 -0500, Paul McGuire wrote:
> Tim -
>
> Sorry, I must have mis-pasted the sample, here it is again. It parses okay
> on my system (and I tested versions back to 1.4.2).

I'm using 1.4.7 on Linux. The problem was actually that there are newlines embedded in the sample text and quotedString doesn't allow that. I defined

quotedText = QuotedString('"',multiline=True)

and replaced the quotedString in the valueDef definition with quotedText. It works now.

> Your description of key and valueDef are completely correct.

Great! Maybe I'm getting somewhere. :-)

> If you want
> another example of Forward, I just posted this on comp.lang.python -
> http://groups.google.com/group/comp.lang.python/browse_frm/thread/4f128e9df54d6962/#
> - it is a basic nested example with words enclosed in nested
> braces.

Thanks for another great example.

Cheers,
Tim
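Tim's diagnosis can be reproduced in isolation. The following is a minimal sketch, assuming a pyparsing release that provides both the built-in quotedString and the QuotedString class with its multiline argument (both appear earlier in this thread). The wrapped sample string and the helper parses() are made up for illustration.

```python
from pyparsing import ParseException, QuotedString, quotedString

# A description that wraps onto a second line, as a mail client might leave it.
wrapped = '"Generic reporting composition\nin response to a request"'


def parses(expr, text):
    """Hypothetical helper: True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


# Copy the built-in so this sketch does not modify the shared quotedString.
single_line = quotedString.copy()

# Explicit opt-in to quoted strings that contain newlines.
multi_line = QuotedString('"', multiline=True)

single_ok = parses(single_line, wrapped)
multi_ok = parses(multi_line, wrapped)
multi_value = multi_line.parseString(wrapped)[0]

print("quotedString matched:", single_ok)
print("QuotedString(multiline=True) matched:", multi_ok)
```

This would also explain why the same grammar parsed on Paul's system: if the strings in his local copy were never wrapped, quotedString had no newline to stop at.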
From: Tim C. <tim...@gm...> - 2007-09-10 05:30:11
Hi Paul,

On Sun, 2007-09-09 at 17:46 -0500, Paul McGuire wrote:
> Tim -
>
> By all means, if you have a simplified set of data, then skip the recursion
> part. But if you find yourself starting to hand code several "optional"
> levels of nesting, then you might revisit the recursive example. When I got
> your first e-mail, I just assumed that a recursive approach would be
> required. I also downloaded the latest ADL spec, and it has quite a bit
> more data types to it than just quoted strings, and these appear to support
> nested structure too.
>

Thanks for taking the time to investigate this further. I will certainly need a more complicated parser then. I have to support the full ADL specs, both 1.4 and 2.0.
http://svn.openehr.org/specification/TAGS/Release-1.0.1/publishing/roadmap.html

This is an open source project so if anyone wants to help out with a health care application framework, the help would be very welcome. :-)
http://sourceforge.net/projects/oship

If anyone would like further info about the project then please email me directly or join the project list at SF.

> But if your sample data files that you are working with are not nested, or
> not nested too deeply, then I agree with your "keep it simple" approach.

There is a growing list of archetypes.
http://www.openehr.org/wsvn/knowledge/archetypes/dev/adl/?rev=0&sc=0

Therefore my simple example is just that, simple. I have already run into coding several Optional sections. I must figure out today how to properly exploit Forward(). :->

Cheers,
Tim
From: Ralph C. <ra...@in...> - 2007-09-10 10:04:20
Hi Tim,

> I must figure out today how to properly exploit Forward(). :->

I haven't seen anyone try and explain this, so I'll have a go. I read up on PyParsing ages ago, but I think this is right...

I assume you're happy with normal recursion, e.g.

    def add(a, b):
        'Add two non-negative numbers.'
        if b == 0:
            return a
        return add(a + 1, b - 1)

Notice how we can write a call to add() inside add() despite the definition of add() not being finished yet. That's something Python allows.

Now consider a data structure where each node has a `next' pointer to another node of the same type. If we want a circular definition where we've just a single node whose `next' pointer points to itself, we can't do

    >>> a = { 'next': a }
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    NameError: name 'a' is not defined
    >>>

This is because `a' only exists once the definition of `a' is finished; notice how this is different to add() above.

A solution is a placeholder. I'll use None.

    >>> a = { 'next': None }
    >>> a['next'] = a
    >>> import pprint
    >>> pprint.pprint(a)
    {'next': <Recursion on dict with id=-1210075780>}
    >>>

PyParsing has the same issue when trying to set up the data structures representing the grammar. Instead of using `None' it has Forward(). The idea is the same; it's a placeholder that is later replaced by something else. Something that exists later that didn't exist at the time the Forward() was required.

Once this concept clicks, you'll see it's really quite simple. :-)

Cheers,
Ralph.
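Ralph's placeholder idea can be shown without pyparsing at all. The sketch below is a toy parser written for illustration, not pyparsing's implementation; Placeholder, word, either and braces are hypothetical names. A parser here is any callable taking (tokens, pos) and returning (value, new_pos) on a match or None on failure.

```python
class Placeholder:
    """Toy stand-in for the role Forward() plays: usable now, defined later."""

    def __init__(self):
        self.target = None          # the empty slot, like Ralph's None

    def fill(self, parser):
        self.target = parser        # the later "a['next'] = a" step

    def __call__(self, tokens, pos):
        return self.target(tokens, pos)


def word(tokens, pos):
    """Match a single alphabetic token."""
    if pos < len(tokens) and tokens[pos].isalpha():
        return tokens[pos], pos + 1
    return None


def either(first, second):
    """Try first; if it fails, try second."""
    def parse(tokens, pos):
        hit = first(tokens, pos)
        return hit if hit is not None else second(tokens, pos)
    return parse


def braces(inner):
    """Match '{', then zero or more inner matches, then '}'."""
    def parse(tokens, pos):
        if pos >= len(tokens) or tokens[pos] != "{":
            return None
        pos += 1
        items = []
        while True:
            hit = inner(tokens, pos)
            if hit is None:
                break
            value, pos = hit
            items.append(value)
        if pos < len(tokens) and tokens[pos] == "}":
            return items, pos + 1
        return None
    return parse


group = Placeholder()               # exists, but nothing inside yet
item = either(word, group)          # item may already refer to group
group.fill(braces(item))            # close the loop: group contains items

tree, end = group("{ a { b c } d }".split(), 0)
print(tree)
```

Writing group = braces(either(word, group)) in one step would fail with the same NameError as Ralph's dict example, because group does not exist until the right-hand side has been evaluated. The placeholder gives the name something to point at in the meantime.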
From: Tim C. <tim...@gm...> - 2007-09-11 06:31:41
On Mon, 2007-09-10 at 11:04 +0100, Ralph Corderoy wrote:
> PyParsing has the same issue when trying to set up the data structures
> representing the grammar. Instead of using `None' it has Forward().
> The idea is the same; it's a placeholder that is later replaced by
> something else. Something that exists later that didn't exist at the
> time the Forward() was required.
>
> Once this concept clicks, you'll see it's really quite simple. :-)

Thanks Ralph.

Tim <........patiently waiting for the click> :-)