Thread: [Pyparsing] Grouping when using asList()
Brought to you by:
ptmcg
From: Joshua J. K. <jk...@sk...> - 2008-03-05 21:30:34
|
Warning: long First: thank for the great package! It greatly simplifies my life. :) I'm working on parsing some SQL create statements from MS SQL (sigh, I know, but it's what we're stuck with at the moment). Here is the create: CREATE TABLE [dbo].[auth_login] ( [login_id] [int] IDENTITY (1, 1) NOT NULL , [login] [varchar] (255) NULL , [password_hash] [varchar] (255) NULL , [login_name] [varchar] (255) NULL , PRIMARY KEY CLUSTERED ( [login_id] ) ON [PRIMARY] , CONSTRAINT [IX_auth_login] UNIQUE NONCLUSTERED ( [login] ) ON [PRIMARY] ) ON [PRIMARY] You'll notice that the column definitions as well as the primary key and constraints are all within one enclosing pair of parentheses. My code is as such: create_table = (p.CaselessKeyword('create').suppress() + p.CaselessKeyword('table').suppress() + bracket_quoted.setResultsName('schema') + "." + bracket_quoted.setResultsName('table_name') + p.nestedExpr(content=p.delimitedList(p.Or( [p.Group(column_def.setResultsName('columns')), p.Group(primary_key), p.Group(constraint)]))) + p.CaselessKeyword('on').suppress() + bracket_quoted.suppress() ) p denotes the pyparsing module (import pyparsing as p) bracket_quoted is QuotedString using '[' and ']' column_def, primary_key, and constraint are all pyparsing expressions. So, at any rate, running that code and outputting via asList() gives me: ['dbo', '.', 'auth_login', [['login_id', 'int', 'IDENTITY', '1, 1', 'not', 'null'], ['login', 'varchar', '255', 'null'], ['password_hash', 'varchar', '255', 'null'], ['login_name', 'varchar', '255', 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', '10', 'null'], ['primary', 'key', 'clustered', ['login_id']], ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]]] The column defs are in the same list element as the primary key and constraint. Apparently I'm not understanding the Group class. What can I do to put each of those three things (column defs, primary key defs, and constraint defs) in their own list elements? I want to know that the elements of l[3] are columns, l[4] are primary keys, or similar. Thanks! j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |
From: Paul M. <pt...@au...> - 2008-03-06 02:05:17
|
Hah! You call that a long e-mail? You should see some of my responses on the pyparsing wiki (see the Discussion tab on the Home page)! :) I tried to run this code, but you left out some important elements, so I couldn't recreate your problem. On the face of it, your Group calls look ok. Could you paste your code to http://pyparsing.pastebin.com, and I'll give it a look? Welcome to pyparsing! -- Paul -----Original Message----- From: pyp...@li... [mailto:pyp...@li...] On Behalf Of Joshua J. Kugler Sent: Wednesday, March 05, 2008 3:31 PM To: pyp...@li... Subject: [Pyparsing] Grouping when using asList() Warning: long First: thank for the great package! It greatly simplifies my life. :) I'm working on parsing some SQL create statements from MS SQL (sigh, I know, but it's what we're stuck with at the moment). Here is the create: CREATE TABLE [dbo].[auth_login] ( [login_id] [int] IDENTITY (1, 1) NOT NULL , [login] [varchar] (255) NULL , [password_hash] [varchar] (255) NULL , [login_name] [varchar] (255) NULL , PRIMARY KEY CLUSTERED ( [login_id] ) ON [PRIMARY] , CONSTRAINT [IX_auth_login] UNIQUE NONCLUSTERED ( [login] ) ON [PRIMARY] ) ON [PRIMARY] You'll notice that the column definitions as well as the primary key and constraints are all within one enclosing pair of parentheses. My code is as such: create_table = (p.CaselessKeyword('create').suppress() + p.CaselessKeyword('table').suppress() + bracket_quoted.setResultsName('schema') + "." + bracket_quoted.setResultsName('table_name') + p.nestedExpr(content=p.delimitedList(p.Or( [p.Group(column_def.setResultsName('columns')), p.Group(primary_key), p.Group(constraint)]))) + p.CaselessKeyword('on').suppress() + bracket_quoted.suppress() ) p denotes the pyparsing module (import pyparsing as p) bracket_quoted is QuotedString using '[' and ']' column_def, primary_key, and constraint are all pyparsing expressions. So, at any rate, running that code and outputting via asList() gives me: ['dbo', '.', 'auth_login', [['login_id', 'int', 'IDENTITY', '1, 1', 'not', 'null'], ['login', 'varchar', '255', 'null'], ['password_hash', 'varchar', '255', 'null'], ['login_name', 'varchar', '255', 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', '10', 'null'], ['primary', 'key', 'clustered', ['login_id']], ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]]] The column defs are in the same list element as the primary key and constraint. Apparently I'm not understanding the Group class. What can I do to put each of those three things (column defs, primary key defs, and constraint defs) in their own list elements? I want to know that the elements of l[3] are columns, l[4] are primary keys, or similar. Thanks! j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Pyparsing-users mailing list Pyp...@li... https://lists.sourceforge.net/lists/listinfo/pyparsing-users |
From: Joshua J. K. <jk...@sk...> - 2008-03-06 02:13:19
|
On Wed, 2008-03-05 at 20:05 -0600, Paul McGuire wrote: > Hah! You call that a long e-mail? You should see some of my responses on > the pyparsing wiki (see the Discussion tab on the Home page)! :) I'll take a look at those. > I tried to run this code, but you left out some important elements, so I > couldn't recreate your problem. On the face of it, your Group calls look > ok. Could you paste your code to http://pyparsing.pastebin.com, and I'll > give it a look? Sure...I didn't think about pastebin, and didn't want to flood the group with my code. See my code at: http://pyparsing.pastebin.com/m47c772c9 You'll notice that column defs, primary keys, and constraints are all in element [3] of the nested list returned from the asList() call. > Welcome to pyparsing! Thanks! j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |
From: Paul M. <pt...@au...> - 2008-03-06 03:26:46
|
Joshua - <lightbulb>Ah, now I see why you are confused!</lightbulb> The Group class does *not* do grouping like you see in SQL GROUP BY or the itertools groupby. Group is a way for a grammar developer to impart structure to the parsed tokens. The default behavior is for all the tokens to be returned in one flat list. Here is an example: testData = "123 abc 234 def 789 456 xyz" entry = Word(nums) + Optional(Word(alphas)) grammar = OneOrMore(entry) print grammar.parseString(testData).dump() Prints: ['123', 'abc', '234', 'def', '789', '456', 'xyz'] Now we will use Group to "group" the tokens for each entry: entry = Group( Word(nums) + Optional(Word(alphas)) ) Prints: [['123', 'abc'], ['234', 'def'], ['789'], ['456', 'xyz']] You are looking for something a little different. Pyparsing has a feature to implement it, but I've not seen it get much use. It is a qualifier on setResultsName. Before I describe the qualifier itself, let me again show you the default behavior. entry = Word(nums).setResultsName("int") + Optional(Word(alphas).setResultsName("word")) prints: ['123', 'abc', '234', 'def', '789', '456', 'xyz'] - int: 456 - word: xyz The last matching string is the one that is saved for the given results name. But sometimes, you want to keep *all* the matching strings. This is done using the listAllMatches qualifier: entry = Word(nums).setResultsName("int",listAllMatches=True) + Optional(Word(alphas).setResultsName("word",listAllMatches=True)) prints: ['123', 'abc', '234', 'def', '789', '456', 'xyz'] - int: ['123', '234', '789', '456'] - word: ['abc', 'def', 'xyz'] That's it for now, *this* is a long e-mail! :) I'll post the mods to your code in the next e-mail. -- Paul |
From: Paul M. <pa...@al...> - 2008-03-06 03:48:20
|
Ok, here are the changes I made to your code: I changed column_def to: col_width_def = (p.Suppress('(') + p.delimitedList(p.Word(p.nums)).setParseAction( lambda toks: [int(tok) for tok in toks]) + p.Suppress(')') ) column_def = (bracket_quoted.setResultsName('column_name') + bracket_quoted.setResultsName('column_type') + p.Optional(p.CaselessKeyword('IDENTITY')).setResultsName('identity') + #~ p.Optional(paren_quoted(p.nums + p.Optional(',' + p.nums))).setResultsName('column_width') + p.Optional(col_width_def).setResultsName('column_width') + nullable ) paren_quoted was just not the expression, and the "(p.nums + p.Optional(',' + p.nums)" argument *actually* created a results name for the column_width field. I changed create_table to: create_table = (p.CaselessKeyword('create').suppress() + p.CaselessKeyword('table').suppress() + bracket_quoted.setResultsName('schema') + "." + bracket_quoted.setResultsName('table_name') #~ + p.nestedExpr(content=p.delimitedList(p.Or( #~ [p.Group(column_def.setResultsName('columns')), #~ p.Group(primary_key), #~ p.Group(constraint)]))) + p.Group(p.Suppress('(') + p.delimitedList(p.Or( [p.Group(column_def).setResultsName('columns', listAllMatches=True), p.Group(primary_key).setResultsName('pkeys', listAllMatches=True), p.Group(constraint).setResultsName('constraints', listAllMatches=True)] )) + p.Suppress(')') )("defs") + p.CaselessKeyword('on').suppress() + bracket_quoted.suppress() ) There was no need to use nestedExpr for the list of column/key/constraint items. Note the use of listAllMatches in the setResultsName calls. Here is the code to print out the parsed results: dbData = create_table.parseString(sql) pprint.pprint(dbData.asList()) print print dbData.dump() print print "Columns" for c in dbData.defs.columns: print " Column: " + c.column_name print c.dump(indent=" ") print print "Primary Keys" for pk in dbData.defs.pkeys: print pk.asList() print print "Constraints" for c in dbData.defs.constraints: print "Constraint: " + c.constraint_name print c.asList() Note the use of dump() for quick listing of tokens, and any named fields. The output from this code is: ['dbo', '.', 'auth_login', [['login_id', 'int', 'IDENTITY', 1, 1, 'not null'], ['login', 'varchar', 255, 'null'], ['password_hash', 'varchar', 255, 'null'], ['login_name', 'varchar', 255, 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', 10, 'null'], ['primary key', 'clustered', ['login_id']], ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]]] ['dbo', '.', 'auth_login', [['login_id', 'int', 'IDENTITY', 1, 1, 'not null'], ['login', 'varchar', 255, 'null'], ['password_hash', 'varchar', 255, 'null'], ['login_name', 'varchar', 255, 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', 10, 'null'], ['primary key', 'clustered', ['login_id']], ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]]] - defs: [['login_id', 'int', 'IDENTITY', 1, 1, 'not null'], ['login', 'varchar', 255, 'null'], ['password_hash', 'varchar', 255, 'null'], ['login_name', 'varchar', 255, 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', 10, 'null'], ['primary key', 'clustered', ['login_id']], ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]] - columns: [['login_id', 'int', 'IDENTITY', 1, 1, 'not null'], ['login', 'varchar', 255, 'null'], ['password_hash', 'varchar', 255, 'null'], ['login_name', 'varchar', 255, 'null'], ['paid_us_good', 'money', 'null'], ['asdfsdafsadf', 'nchar', 10, 'null']] - constraints: [['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']]] - pkeys: [['primary key', 'clustered', ['login_id']]] - schema: dbo - table_name: auth_login Columns Column: login_id ['login_id', 'int', 'IDENTITY', 1, 1, 'not null'] - column_name: login_id - column_type: int - column_width: [1, 1] - identity: IDENTITY Column: login ['login', 'varchar', 255, 'null'] - column_name: login - column_type: varchar - column_width: [255] Column: password_hash ['password_hash', 'varchar', 255, 'null'] - column_name: password_hash - column_type: varchar - column_width: [255] Column: login_name ['login_name', 'varchar', 255, 'null'] - column_name: login_name - column_type: varchar - column_width: [255] Column: paid_us_good ['paid_us_good', 'money', 'null'] - column_name: paid_us_good - column_type: money Column: asdfsdafsadf ['asdfsdafsadf', 'nchar', 10, 'null'] - column_name: asdfsdafsadf - column_type: nchar - column_width: [10] Primary Keys ['primary key', 'clustered', ['login_id']] Constraints Constraint: IX_auth_login ['constraint', 'IX_auth_login', 'unique', 'nonclustered', ['login']] -- Paul |
From: Joshua J. K. <jk...@sk...> - 2008-03-06 20:47:20
|
On Wed, 2008-03-05 at 21:48 -0600, Paul McGuire wrote: > Ok, here are the changes I made to your code: Wow...just...wow. Really, I wasn't asking anyone to write my parser for me. I just wanted to know how to do it. Wow. Thanks you!! > I changed column_def to: > > col_width_def = (p.Suppress('(') + > p.delimitedList(p.Word(p.nums)).setParseAction( lambda toks: > [int(tok) for tok in toks]) + > p.Suppress(')') > ) That casting to int didn't matter much since we'll be just doing string output, but it does validate the input. :) > paren_quoted was just not the expression, and the "(p.nums + p.Optional(',' > + p.nums)" argument *actually* created a results name for the column_width > field. Gotcha. > I changed create_table to: > > create_table = (p.CaselessKeyword('create').suppress() > + p.CaselessKeyword('table').suppress() > + bracket_quoted.setResultsName('schema') > + "." + bracket_quoted.setResultsName('table_name') > #~ + p.nestedExpr(content=p.delimitedList(p.Or( > #~ [p.Group(column_def.setResultsName('columns')), > #~ p.Group(primary_key), > #~ p.Group(constraint)]))) > + p.Group(p.Suppress('(') + > p.delimitedList(p.Or( > [p.Group(column_def).setResultsName('columns', > listAllMatches=True), > p.Group(primary_key).setResultsName('pkeys', > listAllMatches=True), > p.Group(constraint).setResultsName('constraints', > listAllMatches=True)] > )) > + p.Suppress(')') > )("defs") > + p.CaselessKeyword('on').suppress() > + bracket_quoted.suppress() > ) > > There was no need to use nestedExpr for the list of column/key/constraint > items. Note the use of listAllMatches in the setResultsName calls. Saw that. That makes sense. So, it doesn't change the asList() output any, but I have attributes I can access by name now. Cool. One question: ("defs") is a call to what Groups() returns, but I'm not following. Can you point me to something in the docs that explains what is done when you call a pyparsing expression? Again, thanks so much! I'm much further along to where I need to be. j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |
From: Paul M. <pa...@al...> - 2008-03-06 22:08:02
|
Joshua - You already had most of the pieces there, just a couple of gaps around a little-used feature. >>>One question: ("defs") is a call to what Groups() returns, but I'm not following. Can you point me to something in the docs that explains what is done when you call a pyparsing expression?<<< Let's say you have this form for a date string: twoDigitInt = Word(nums,exact=2) date = twoDigitInt + '/' + twoDigitInt + '/' + twoDigitInt The old way to assign results names was (I'm using US date format mm/dd/yy convention): date = twoDigitInt.setResultsName("month") + '/' + twoDigitInt.setResultsName("day") + '/' + twoDigitInt.setResultsName("year") I *really* wanted to encourage people to use results names, so I made ParserElements "callable", and having the __call__ function call setResultsName, thus supporting the following shortened syntax: date = twoDigitInt("month") + '/' + twoDigitInt("day") + '/' + twoDigitInt("year") Try to look past the "it's a function call" to "this is a lot easier to specify a results name than that old way!". (However, if you need to add the listAllMatches attribute, this still requires the explicit call to setResultsName.) I hope that's not too confusing. -- Paul |
From: Joshua J. K. <jk...@sk...> - 2008-03-06 22:19:04
|
On Thu, 2008-03-06 at 16:08 -0600, Paul McGuire wrote: > Try to look past the "it's a function call" to "this is a lot easier to > specify a results name than that old way!". (However, if you need to add > the listAllMatches attribute, this still requires the explicit call to > setResultsName.) > > I hope that's not too confusing. Makes perfect sense. Thanks for the explanation. j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |
From: Joshua J. K. <jk...@sk...> - 2008-03-07 01:50:36
|
On Thu, 2008-03-06 at 16:08 -0600, Paul McGuire wrote: > Try to look past the "it's a function call" to "this is a lot easier to > specify a results name than that old way!". (However, if you need to add > the listAllMatches attribute, this still requires the explicit call to > setResultsName.) Just had a thought: why not allow the listAllMatches=True to be added to the calls to pyparsing elements? Such as: p.delimitedList(bracket_quoted('key_columns', listAllMatches=True)) In fact, here you go. Starting on line 1170 of pyparsing.py (doc strings deleted for brevity) def __call__(self, name, **kw): return self.setResultsName(name, **kw) And just advertise pp_element() as a shortcut for pp_element.setResultsName(). j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |
From: Ralph C. <ra...@in...> - 2008-03-07 10:06:39
|
Hi Paul, > date = twoDigitInt("month") + '/' + twoDigitInt("day") + '/' + > twoDigitInt("year") > > Try to look past the "it's a function call" to "this is a lot easier > to specify a results name than that old way!". (However, if you need > to add the listAllMatches attribute, this still requires the explicit > call to setResultsName.) How about signifying in the string that listAllMatches should be true, e.g. 'month*'? Cheers, Ralph. |
From: Ralph C. <ra...@in...> - 2008-03-08 13:14:10
|
Hi Paul, > Oooh, I'm *tempted*, so tempted. > > What do you think of Joshua's proposal of passing through the **kw dict? > It's a little more explicit without requiring the full ".setResultsName" > syntax. Side-by-side, the options are: > > A) date = twoDigitInt("month*") + '/' + twoDigitInt("day") > B) date = twoDigitInt("month", listAllMatches=True) + '/' + > twoDigitInt("day") I think that 'listAllMatches = True' is just noisy and interrupts the flow. > Mostly I'm concerned about accidental listAllMatches behavior when a > results name is set from within the parsing of a Dict, which is a bit > out of the developer's control - the results names are taken from the > input text stream, and could reasonably contain any of these options. OK. What about a second optional argument that's not a long keyword? Or guarantee that listAllMatches is it. Any truth value should do. dd('month', 1) + '/' + dd('day') Cheers, Ralph. |
From: Paul M. <pt...@au...> - 2008-03-07 14:06:50
|
Oooh, I'm *tempted*, so tempted. What do you think of Joshua's proposal of passing through the **kw dict? It's a little more explicit without requiring the full ".setResultsName" syntax. Side-by-side, the options are: A) date = twoDigitInt("month*") + '/' + twoDigitInt("day") B) date = twoDigitInt("month", listAllMatches=True) + '/' + twoDigitInt("day") I could go either way. Just for the sake of alternatives, I tried some other string-based indicators: C) date = twoDigitInt("month!") + '/' + twoDigitInt("day") D) date = twoDigitInt("month+") + '/' + twoDigitInt("day") E) date = twoDigitInt("[month]") + '/' + twoDigitInt("day") F) date = twoDigitInt("{month}") + '/' + twoDigitInt("day") but I'm not compelled to use any of these over '*'. Mostly I'm concerned about accidental listAllMatches behavior when a results name is set from within the parsing of a Dict, which is a bit out of the developer's control - the results names are taken from the input text stream, and could reasonably contain any of these options. Any other comments from others on the list? -- Paul -----Original Message----- From: pyp...@li... [mailto:pyp...@li...] On Behalf Of Ralph Corderoy Sent: Friday, March 07, 2008 4:07 AM To: pyp...@li... Subject: Re: [Pyparsing] Grouping when using asList() Hi Paul, > date = twoDigitInt("month") + '/' + twoDigitInt("day") + '/' + > twoDigitInt("year") > > Try to look past the "it's a function call" to "this is a lot easier > to specify a results name than that old way!". (However, if you need > to add the listAllMatches attribute, this still requires the explicit > call to setResultsName.) How about signifying in the string that listAllMatches should be true, e.g. 'month*'? Cheers, Ralph. ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Pyparsing-users mailing list Pyp...@li... https://lists.sourceforge.net/lists/listinfo/pyparsing-users |
From: Joshua J. K. <jk...@sk...> - 2008-03-07 17:53:08
|
On Fri, 2008-03-07 at 08:06 -0600, Paul McGuire wrote: > Oooh, I'm *tempted*, so tempted. Syntactical sugar is *always* tempting, I've found, even when it's something that could come back to bite you later. :) > Mostly I'm concerned about accidental listAllMatches behavior when a results > name is set from within the parsing of a Dict, which is a bit out of the > developer's control - the results names are taken from the input text > stream, and could reasonably contain any of these options. I would (personally) prefer the explicit listAllMatches=True simply because you climb on to very shaky ground when you start introducing special characters...they are kind of like "magic numbers." And then there is the concept of "surprising behavior." If I was reading over someone's code (and was unfamiliar with pyparsing), and I saw 'month*', I might be expecting it to do something like a wildcard match, or some such. The other alternatives seem a bit ambiguous, even if [] does imply putting all the results in an array. Actually, listAllMatches is a bit ambiguous too...keepAllMatches might be a little bit clearer as far as that goes, but that is neither here nor there. So, my vote is to keep the explicitness, both for practical reasons and readability/understanding-the-code reasons. j -- Joshua Kugler VOC/SigNet Provider (aka Web App Programmer) S&K Aerospace Alaska |