Thread: [Pyparsing] RE: python indentation grammar
From: Michel P. <mi...@di...> - 2005-08-18 17:30:42
On Thu, 2005-08-18 at 06:35 -0500, Paul McGuire wrote:
> Michel -
>
> Glad the sparql parsing is proceeding well.
>
> I'm not sure pyparsing is going to go much better than your current parser,
> given the warts that you cite:
> - pyparsing is not very good in multi-thread code, for the same reasons you
> mention, mostly use of globals.

I don't see any vars declared global in pyparsing unless you've added them recently. I don't see any of the other usual thread-killing warts either, like mutable default arguments or module-level vars. I've no experience with any other kind of global state in Python; to my eyes pyparsing should be pretty thread-safe, but hey, you're the author. ;) I'd be more than willing to try to fix and/or verify pyparsing with multiple threads.

Really the thread issue is just a minor concern. I don't think there'll be as much concurrent parsing as generation, so I can get away with locks for now if it's totally necessary.

> - pyparsing's asXML() output for parsed results is somewhat hit-or-miss. I
> really should remove that code for now, or at least label it as "shaky".

I'm just using it for visual verification for now, so its shakiness is OK for me.

>
> To do indentation-based parsing, you will need a parse action to do the
> indentation work, and a stack to keep track of the current indentation
> levels, so that you can unwind to previous indent levels. Here's one
> suggestion if you haven't thought of it already: use pyparsing's
> col(loc,strg) built-in inside the parse action, to compute the column of the
> starting text.

Great, that's what I imagined, but the col() trick will be useful, thanks Paul!

-Michel
From: Michel P. <mi...@di...> - 2005-08-18 20:58:24
On Thu, 2005-08-18 at 13:57 -0500, Paul McGuire wrote:
> Michel -
>
> Not so much global data, as it is parsing state preserved inside the
> pyparsing class instances (namely the caching of exception instances). I
> am fairly certain that calling parseString is not thread-safe, and you
> should interlock calls to it if you have multiple threads calling it.

Oh I'm sorry, what I meant to say was that different threads will be calling different instances, not the same instance. I.e., every thread will have its own SPARQLGrammar.Query instance. sliplib used module vars and declared global vars, and thus the _whole module_, and all of its features, cannot be used from different threads, but different instances of pyparsing classes should be fine. I think. ;)

> I have made a few attempts at indentation-based parsing in the past, but I
> looked at them last night, and they are really not so good. I think the key
> will be in a) using a parse action with col() to detect the indentation
> level of the current line, and b) keeping a global stack of indentation
> levels seen thus far, so that you can tell if your current line is part of
> the current indent level, a deeper level or a higher level.

Sounds good. Something to think about would be encapsulating the indentation level in something other than a global var so that it is thread-safe. Maybe the parse action can be a callable instance that keeps this level internal?

    class IndentationAction(object):

        level = 0

        def __call__(self, *args):
            # ... indentation tracking logic
            pass

    indent = White().setParseAction(IndentationAction())

or something like that.

> When creating your test cases, be sure to add unfriendly tests, such as
> nested levels that unwind to a higher nesting than just the immediate
> parent. That is:
>
> A
>     A1
>     A2
>         A2a
>             A2aa
>             A2ab
>         A2b
>             A2ba
>     A3
>
> Since there is no A2c entry (to be a peer of A2a and A2b), your parser will
> end up doing a double pop from the indentation stack.
>
> Also, what would this data signify?
>
> A
>     A1
>     A2
>         A2a
>             A2aa
>             A2ab
>         A2b
>             A2ba
>       A2.5
>     A3
>
> Note that A2.5 is more indented than A2 and A3, but less indented than A2a
> and A2b. I'm guessing this case should probably be an error (and if you
> detect it in a parse action, you should raise ParseFatalException instead of
> simple ParseException, to halt parsing immediately).

Right, obviously we want a good structured representation, but not necessarily the exact semantics of Python, unless desired.

I'll work some more on this over the weekend and let you know what my results are.

-Michel
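The callable-instance idea in this message could be fleshed out roughly as below. This is only a sketch under assumptions: it is plain Python with no pyparsing import, the `(strg, loc, toks)` argument order mirrors how pyparsing calls parse actions, the column arithmetic stands in for pyparsing's col(), and the `columns` attribute and the returned depth are illustrative choices rather than anything from the thread. It does not reject inconsistent dedents.

```python
class IndentationAction(object):
    """Callable that keeps its indentation stack on the instance."""

    def __init__(self):
        # Created per instance. A list defined at class level would be
        # shared by every instance, and therefore by every thread.
        self.columns = [1]  # columns are 1-based; 1 is the outermost level

    def __call__(self, strg, loc, toks):
        # 1-based column of loc, counted from the newline before it.
        column = loc - strg.rfind("\n", 0, loc)
        if column > self.columns[-1]:
            self.columns.append(column)      # deeper level
        else:
            while column < self.columns[-1]:
                self.columns.pop()           # unwind, possibly several levels
        return len(self.columns) - 1         # current nesting depth


text = "A\n  A1\n    A1a\nB\n"
first = IndentationAction()
second = IndentationAction()

for name in ("A", "A1", "A1a"):
    print(name, first(text, text.index(name), []))

# The second instance has seen nothing, so its stack is untouched.
print(second.columns)
```

Giving each thread its own grammar, and with it its own action instance, keeps the stacks separate without any locking around the indentation state itself.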
From: Paul M. <pa...@al...> - 2005-08-18 18:57:18
Michel -

Not so much global data, as it is parsing state preserved inside the pyparsing class instances (namely the caching of exception instances). I am fairly certain that calling parseString is not thread-safe, and you should interlock calls to it if you have multiple threads calling it.

I have made a few attempts at indentation-based parsing in the past, but I looked at them last night, and they are really not so good. I think the key will be in a) using a parse action with col() to detect the indentation level of the current line, and b) keeping a global stack of indentation levels seen thus far, so that you can tell if your current line is part of the current indent level, a deeper level or a higher level.

When creating your test cases, be sure to add unfriendly tests, such as nested levels that unwind to a higher nesting than just the immediate parent. That is:

A
    A1
    A2
        A2a
            A2aa
            A2ab
        A2b
            A2ba
    A3

Since there is no A2c entry (to be a peer of A2a and A2b), your parser will end up doing a double pop from the indentation stack.

Also, what would this data signify?

A
    A1
    A2
        A2a
            A2aa
            A2ab
        A2b
            A2ba
      A2.5
    A3

Note that A2.5 is more indented than A2 and A3, but less indented than A2a and A2b. I'm guessing this case should probably be an error (and if you detect it in a parse action, you should raise ParseFatalException instead of simple ParseException, to halt parsing immediately).

-- Paul
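Both unfriendly cases can be tried against a bare indentation stack before any grammar is involved. The sketch below is library-independent and makes several assumptions: `indent_events`, `IndentError`, `GOOD` and `BAD` are hypothetical names, `IndentError` merely stands in for ParseFatalException, and the exact indentation widths of the two trees are a guess, since only their relative nesting is described.

```python
class IndentError(Exception):
    """Stand-in for pyparsing's ParseFatalException in this sketch."""


def indent_events(text):
    """Turn indented lines into a flat list of INDENT/DEDENT markers and names."""
    stack = [0]          # indentation widths of the currently open levels
    events = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue     # blank lines carry no indentation information
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            events.append("INDENT")
        else:
            while width < stack[-1]:
                stack.pop()              # may pop more than one level
                events.append("DEDENT")
            if width != stack[-1]:
                # Landed between two open levels, like A2.5.
                raise IndentError(
                    "line %d: indentation %d matches no open level"
                    % (lineno, width))
        events.append(line.strip())
    return events


GOOD = """\
A
    A1
    A2
        A2a
            A2aa
            A2ab
        A2b
            A2ba
    A3
"""

# Same tree with A2.5 indented between the A2 and A2a levels.
BAD = GOOD.replace("    A3", "      A2.5\n    A3")

print(indent_events(GOOD))
try:
    indent_events(BAD)
except IndentError as exc:
    print("rejected:", exc)
```

In the first tree the step from A2ba back to A3 produces two DEDENT markers in a row, which is the double pop described above.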
From: Michel P. <mi...@di...> - 2005-08-31 00:38:17
On Thu, 2005-08-18 at 13:57 -0500, Paul McGuire wrote:
> I have made a few attempts at indentation-based parsing in the past, but I
> looked at them last night, and they are really not so good. I think the key
> will be in a) using a parse action with col() to detect the indentation
> level of the current line, and b) keeping a global stack of indentation
> levels seen thus far, so that you can tell if your current line is part of
> the current indent level, a deeper level or a higher level.

Well, I have made a bit more progress on this, as well as some great progress on the sparql parser with pyparsing.

On the indentation problem, I have the following module. The relevant pyparsing code is down near the end:

https://svn.cignex.com/public/slipr/slipr/slipr.py

It's pretty self-contained. When this module is run, it tries to parse the test file:

https://svn.cignex.com/public/slipr/data/pyinrdf.slpr

and I've got everything matching fine, except the whitespace. ;) For some reason I can't get the whitespace action to work right; it only matches about every other whitespace in the doc. Here's some of the output. Notice at the end how some of the whitespace is not matched between tags:

<tag>
  <name>
    <identifier>RDF</identifier>
  </name>
  <attrs>
    <name>
      <identifier>python</identifier>
    </name>
    <string>"http://namespaces.zemantic.org/python#"</string>
  </attrs>
</tag>
[' ']
[' ']
<tag>
  <name>
    <identifier>Ontology</identifier>
  </name>
  <attrs>
    <name>
      <identifier>python</identifier>
    </name>
    <value>
      <identifier>bob</identifier>
    </value>
  </attrs>
</tag>
[' ']
[' ']
<tag>
  <name>
    <identifier>Class</identifier>
  </name>
  <attrs>
    <name>
      <identifier>Object</identifier>
    </name>
  </attrs>
</tag>
[' ']
<tag>
  <name>
    <identifier>issubclass</identifier>
  </name>
  <attrs>
    <name>
      <identifier>Object</identifier>
    </name>
  </attrs>
</tag>
<tag>
  <name>
    <identifier>isinstance</identifier>
  </name>
  <attrs>
    <name>
      <identifier>Object</identifier>
    </name>
  </attrs>
</tag>

I'm not sure what's wrong. Can anyone spot a simple error or suggest another way to handle the indentation issue?

Thanks,

-Michel
From: Paul M. <pa...@al...> - 2005-08-31 01:54:53
Michel -

This part of pyparsing is not well-documented at all, since I typically discourage people from writing whitespace-sensitive parsers. Very often, people come from writing regexp's and try to figure out how to explicitly handle whitespace between tokens, and I have to explain that pyparsing doesn't require explicit whitespace handling, that whitespace is assumed to be a token delimiter, but that the whitespace itself is skipped/ignored by default.

However, your grammar is *by its nature* whitespace-sensitive. So you probably need to call the leaveWhitespace() method on your root parse object, self.block, as in:

    self.block.leaveWhitespace()

Do this right at the end of your initGrammar method, after assigning self.block:

    self.block = ZeroOrMore( ...etc, etc. )
    self.block.leaveWhitespace()

This will recursively set whitespace handling through the whole bnf, not just for the root node, and your whitespace handling should be more predictable.

After adding this line, I reran your test, and I think I got all the between-tag whitespace you were looking for.

I'm also glad asXML() seems to be working adequately for you. As I mentioned before, this method is a bit iffy, so it is fortunate that you are getting such good results.

Congratulations on such a sophisticated parsing application!

-- Paul
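The effect of leaveWhitespace() can be seen on a toy grammar much smaller than Michel's. The sketch below assumes pyparsing is installed; the `build` helper, the two-line sample text and the word/gap grammar are invented for illustration and are not taken from slipr.py. Only the run with leaveWhitespace() is expected to report the indentation between the two words as its own token; what the default run does with the trailing newline may differ between pyparsing versions, so no output is shown for it.

```python
from pyparsing import White, Word, ZeroOrMore, alphas


def build(keep_whitespace):
    """Toy grammar: a sequence of words and whitespace runs."""
    word = Word(alphas)
    gap = White()                 # matches a run of spaces, tabs or newlines
    block = ZeroOrMore(word | gap)
    if keep_whitespace:
        # Applied to the root, this turns off whitespace skipping for
        # every expression contained in it.
        block.leaveWhitespace()
    return block


text = "alpha\n    beta\n"

# Default: Word skips leading whitespace, so the indentation before
# "beta" is consumed silently and never reaches the White() alternative.
skipped = build(False).parseString(text).asList()

# With leaveWhitespace(): nothing is skipped, so each gap becomes a token.
kept = build(True).parseString(text).asList()

print(skipped)
print(kept)
```

Because the first alternative in the default grammar swallows whitespace on its own, the whitespace expression only gets a chance where no word follows, which is consistent with the "every other whitespace" symptom reported earlier in the thread.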