Thread: [Pyparsing] How to distinguish a variable from a integer
From: Gustavo N. <me...@gu...> - 2009-05-14 17:02:03
Hello, everybody.

First of all, I wanted to thank you for this awesome package. I'm having fun with it. :)

I've read O'Reilly's shortcut on Pyparsing, but I still can't find an answer to this:

One of the components of the grammar I'm defining is an operand. Operands can be a number or a variable. A variable is a string made up of word characters (in any language), numbers (in any language/culture) and/or a spacing character (underscores by default).

I'm using the following:
"""
import re
from pyparsing import *

# Defining the numbers:
decimal_sep = Literal(".")
decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
number = Combine(Word(nums) + decimals)

# Defining the variables:
variable = Regex("[\w\d_]+", re.UNICODE)

# Finally, let's define the operand:
operand = number | variable
"""

The operand above works perfectly with the following expressions:
hello -> variable
23 -> number
hello_world -> variable

But it doesn't support variables which begin with a number (e.g., "1st_variable"). I get the following exception all the time:
>>>> from varnums import *
>>>> operand.parseString("1st_variable")
>(['1'], {})
>>>> operand.parseString("1st_variable", parseAll=True)
>Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/gustavo/System/pyenvs/booleano/lib/python2.6/site-packages/pyparsing-1.5.2-py2.6.egg/pyparsing.py", line 1076, in parseString
> raise exc
>pyparsing.ParseException: Expected end of text (at char 1), (line:1, col:2)

I know I can invert the definition of the operand (i.e., "operand = variable | number"), but then strings like "22" will be matched as variables (not numbers).

How can I fix this?

Thanks in advance.
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
From: spir <den...@fr...> - 2009-05-14 17:22:08
On Thu, 14 May 2009 19:01:42 +0200, Gustavo Narea <me...@gu...> wrote:

> Hello, everybody.
>
> First of all, I wanted to thank you for this awesome package. I'm having
> fun with it. :)
>
> I've read O'reilly's shortcut on Pyparsing, but still I can't find an
> answer to this:
>
> One of the components of the grammar I'm defining is an operand. Operands
> can be a number or a variable. A variable is a string made up of word
> characters (in any language), numbers (in any language/culture) and/or a
> spacing character (underscores by default).
>
> I'm using the following:
> """
> import re
> from pyparsing import *
>
> # Defining the numbers:
> decimal_sep = Literal(".")
> decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
> number = Combine(Word(nums) + decimals)
>
> # Defining the variables:
> variable = Regex("[\w\d_]+", re.UNICODE)
>
> # Finally, let's define the operand:
> operand = number | variable
> """
>
> The operand above works perfectly with the following expressions:
> hello -> variable
> 23 -> number
> hello_world -> variable
>
> But it doesn't support variables which begin with a number (e.g.,
> "1st_variable"). I get the following exception all the time:
> >>>> from varnums import *
> >>>> operand.parseString("1st_variable")
> >(['1'], {})
> >>>> operand.parseString("1st_variable", parseAll=True)
> >Traceback (most recent call last):
> > File "<stdin>", line 1, in <module>
> > File
> > "/home/gustavo/System/pyenvs/booleano/lib/python2.6/site-packages/pyparsing
> >-1.5.2-py2.6.egg/pyparsing.py", line 1076, in parseString raise exc
> >pyparsing.ParseException: Expected end of text (at char 1), (line:1, col:2)
>
>
> I know I can invert the definition of the operand (i.e., "operand =
> variable | number"), but then strings like "22" will be matched as
> variables (not numbers).
>
> How can I fix this?
>
> Thanks in advance.
The issue is that your variables can start like a number (which is why, in most PLs, var names cannot start with a digit). So:
* using (number | variable), number masks variable
* using the opposite order, number is eaten by variable

You should use the common pattern for a variable, requiring a letter or '_' at the start:
    variable = Regex("[a-zA-Z_]\w*", re.UNICODE)
(untested with unicode; also beware that \w includes digits and '_')
so that number and variable are mutually exclusive.

You could also add a lookahead for !(letter | '_') trailing after the definition of number, but then your definition of variable is unclear. It should be required that a variable has at least one non-digit char. Which is not easy ;-)

Denis
------
la vita e estrany
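The "mutually exclusive" idea above can be checked with the standard `re` module alone. This is only a sketch, assuming Python 3 (where `\w` and `\d` are Unicode-aware for `str` patterns without passing `re.UNICODE`). The pattern `[^\W\d]\w*` is a substitution for `[a-zA-Z_]\w*`, not the pattern from the message: it also accepts non-ASCII letters as the first character. The names `VARIABLE`, `NUMBER` and `classify` are made up for illustration.

```python
import re

# [^\W\d] means "a word character that is not a digit", i.e. a letter or '_'.
# Unlike [a-zA-Z_], it also accepts non-ASCII letters such as 'á' or 'ñ'.
VARIABLE = re.compile(r"[^\W\d]\w*")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def classify(text):
    """Return which of the two mutually exclusive patterns matches the whole text."""
    if NUMBER.fullmatch(text):
        return "number"
    if VARIABLE.fullmatch(text):
        return "variable"
    return "neither"


for sample in ["hello", "23", "3.14", "hello_world", "año", "1st_variable"]:
    print(sample, "->", classify(sample))
```

Because a variable must start with a non-digit and a number must start with a digit, no input can match both patterns, so the order of the alternatives stops mattering. The price is that "1st_variable" is rejected outright.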
From: Gustavo N. <me...@gu...> - 2009-05-14 18:04:16
Bonjour, Denis.

spir said:
> The issue is that your variables can start like a number (the reason why in
> most PLs var names cannot start with a digit). So that: * using (number |
> variable) number masks variable
> * using the opposite number is eaten by variable
>
> You should use the common pattern for a variable, requiring letter or '_'
> at start: variable = Regex("[a-zA-Z_]\w*", re.UNICODE) (untested with
> unicode)

Yes, I was aware of that limitation with regular expressions, but I thought there was a way to work around that with Pyparsing.

>(also beware that \w includes digits and '_')

Oops, you're right. Thanks!

> so that number and variable are mutually exclusive.
>
> You could also add a lookahead for !(letter | '_') trailing after the
> definition of number, but then your definition of variable is unclear. It
> should be required that a variable has at least one non-digit char. Which
> is uneasy ;-)

Yep, that's not a good solution either.

Well, I'll have to make sure variables don't start with a number. :/

Merci beaucoup!
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
From: Paul M. <pt...@au...> - 2009-05-14 17:47:33
> -----Original Message-----
> From: Gustavo Narea [mailto:me...@gu...]
> Sent: Thursday, May 14, 2009 12:02 PM
> To: pyp...@li...
> Subject: [Pyparsing] How to distinguish a variable from a integer
>
> Hello, everybody.
>
> First of all, I wanted to thank you for this awesome package. I'm having
> fun with it. :)

Well, well, my friend, so we meet again! I'm pleased to see you have been bitten by the pyparsing bug. :)

> How can I fix this?
>

In general, I think this is why variable names in most computing languages I know do *not* permit the name to begin with a number. But you are the language designer, so I will show you how to do this in pyparsing.

Two suggestions, not sure if I have a preference:

1. Use "operand = number ^ variable" instead of "operand = number | variable". '|' returns MatchFirst expressions, which return, well, the first matching expression. '^' returns Or expressions, which return the *longest* match of all the alternative expressions. Think of the '^' as a little set of dividers, measuring the returned values of all the expressions, and picking the longest. '^' is not a cure-all, though, and can cause infinite run-time recursion in self-referencing grammars (those that include operatorPrecedence or Forward expressions).

2. As you say, invert operand to "operand = variable | number", and then attach a parse action to variable that first tries to evaluate the result as a number. In your current parser, you may eventually attach a parse action to number, something like this:

    number.setParseAction(lambda tokens: float(tokens[0]))

so that at post-parse time, the returned string has already been converted to a float. So instead, attach something like this to variable (untested):

    def numOrVar(tokens):
        try:
            return float(tokens[0])
        except ValueError:
            pass
    variable.setParseAction(numOrVar)

Now you don't even need the alternation, since as you observed, variable will also match "22", so just define "operand = variable".
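The two suggestions above can be tried side by side. What follows is a minimal sketch, not Paul's tested code: it assumes Python 3 and a pyparsing release that still accepts the camelCase names used in this thread (`parseString`, `setParseAction`). The helper names `make_number`, `make_variable` and `num_or_var` are hypothetical.

```python
from pyparsing import Combine, Literal, Optional, Regex, Word, nums


def make_number():
    """Integer or decimal literal, converted to float when it matches."""
    expr = Combine(Word(nums) + Optional(Literal(".") + Word(nums)))
    expr.setParseAction(lambda tokens: float(tokens[0]))
    return expr


def make_variable():
    """Any run of word characters, so a variable may start with a digit."""
    return Regex(r"\w+")


# Suggestion 1: '|' builds a MatchFirst, '^' builds an Or (longest match wins).
first_match = make_number() | make_variable()
longest_match = make_number() ^ make_variable()

first_result = first_match.parseString("1st_variable").asList()
longest_result = longest_match.parseString("1st_variable", parseAll=True).asList()
# Both alternatives match all of "22"; on a tie the first one listed is used.
tie_result = longest_match.parseString("22", parseAll=True).asList()


# Suggestion 2: match everything as a variable, then try to convert it.
def num_or_var(tokens):
    try:
        return float(tokens[0])
    except ValueError:
        return None  # returning None leaves the matched token unchanged


operand = make_variable().setParseAction(num_or_var)
converted = operand.parseString("22", parseAll=True).asList()
kept = operand.parseString("1st_variable", parseAll=True).asList()
surprise = operand.parseString("1e5", parseAll=True).asList()

print(first_result, longest_result, tie_result)
print(converted, kept, surprise)
```

Two caveats on suggestion 2 that follow from how `float()` behaves rather than from anything in the thread: names such as "1e5", "inf" or "nan" are valid `float()` input and would silently turn into numbers, and a variable pattern without "." cannot match a decimal like "3.14" in one piece, so decimals would still need the separate number expression.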
You could also try this for defining variable:

    variable = Word(unicode(alphanums+'_'))

or

    variable = Word(unicode(alphanums+alphas8bit+'_'))

or to absolutely cover all bases (for 2-byte Unicode, anyway):

    allUnicodeAlphas = u''.join(c for c in map(unichr,range(65536)) if c.isalpha())
    allUnicodeNums = u''.join(c for c in map(unichr,range(65536)) if c.isdigit())
    variable = Word(allUnicodeAlphas + allUnicodeNums + u'_')

(It's surprising how many Unicode digits there are besides '0'-'9'.)

BTW, this definition of decimals:

    decimals = Optional(decimal_sep + OneOrMore(Word(nums)))

includes some unnecessary repetition. It should be sufficient to write:

    decimals = Optional(decimal_sep + Word(nums))

Unless I misunderstood your intent here.

So, will we see some pyparsing sneak into a repoze package one of these days, perhaps some sort of authorization rights syntax, hmmm?

Buena suerte, y mucho gusto!
-- Paul
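The remark about Unicode digits is easy to check with the standard library. The sketch below assumes Python 3, where `chr()` replaces `unichr()` and every `str` is already Unicode. The comparison with `str.isdecimal()` is an addition, not something from the message, and `converts_to_int` is a hypothetical helper.

```python
# Every character in the Basic Multilingual Plane (the "2-byte" range above).
bmp = [chr(code_point) for code_point in range(0x10000)]

# str.isdigit() is the test used in the message above.
unicode_digits = "".join(c for c in bmp if c.isdigit())
# str.isdecimal() is stricter: it keeps only characters that form base-10 numbers.
unicode_decimals = "".join(c for c in bmp if c.isdecimal())


def converts_to_int(text):
    """Report whether int() accepts the text."""
    try:
        int(text)
    except ValueError:
        return False
    return True


print(len(unicode_digits), len(unicode_decimals))
print(converts_to_int("\u0663"))  # ARABIC-INDIC DIGIT THREE
print(converts_to_int("\u00b2"))  # SUPERSCRIPT TWO
```

The distinction matters if the matched text is later passed to `int()` or `float()`: characters such as superscript two pass `isdigit()` but are rejected by the numeric constructors, so `isdecimal()` may be the safer filter for a number pattern. Either string could be passed to `Word(...)` as in the message above.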
From: Paul M. <pt...@au...> - 2009-05-14 17:57:16
> -----Original Message-----
> From: spir [mailto:den...@fr...]
> You could also add a lookahead for !(letter | '_') trailing after the
> definition of number, but then your definition of variable is unclear. It
> should be required that a variable has at least one non-digit char. Which
> is uneasy ;-)
>

Yes, Denis has another approach. You can use parse actions as a way to add validation logic like Denis suggests. Here is a validating parse action that you could attach to variable to ensure that it contains at least one non-digit.

    def mustHaveAtLeastOneNonDigit(tokens):
        if all(c.isdigit() for c in tokens[0]):
            raise ParseException("variable must have at least one non-digit")
    variable.setParseAction(mustHaveAtLeastOneNonDigit)

Parse actions can serve a number of uses. They can also be chained so that multiple actions or validations can be invoked:

    ip_part = Word(nums)
    convertToInt = lambda tokens: int(tokens[0])
    def validateRange(tokens):
        if not 0 <= tokens[0] < 256:
            raise ParseException("value must be in range 0-255")
    ip_part.setParseAction(convertToInt, validateRange)
    # or:
    # ip_part.setParseAction(convertToInt)
    # ip_part.addParseAction(validateRange)
    ip_addr = Combine(ip_part + ('.'+ip_part)*3 )

    print ip_addr.parseString("192.168.0.255")
    print ip_addr.parseString("123.456.789.000")

-- Paul
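Putting the validating parse action together with the inverted alternation from earlier in the thread gives an operand that accepts digit-leading variables and still treats "22" as a number. This is a sketch under assumptions: Python 3, a pyparsing release that still accepts the camelCase names, and the three-argument parse action form `(s, loc, tokens)` so the exception carries the input and location. The name `must_have_non_digit` is hypothetical.

```python
from pyparsing import Combine, Literal, Optional, ParseException, Regex, Word, nums

number = Combine(Word(nums) + Optional(Literal(".") + Word(nums)))
number.setParseAction(lambda tokens: float(tokens[0]))

variable = Regex(r"\w+")


def must_have_non_digit(s, loc, tokens):
    """Reject matches made only of digits so that they fall through to number."""
    if all(c.isdigit() for c in tokens[0]):
        raise ParseException(s, loc, "variable must have at least one non-digit")


variable.setParseAction(must_have_non_digit)

# variable is tried first; when its parse action raises, MatchFirst tries number.
operand = variable | number

for text in ["hello", "22", "3.14", "1st_variable"]:
    print(text, "->", operand.parseString(text, parseAll=True).asList())
```

A ParseException raised inside a parse action is treated as an ordinary failed match, which is why the alternation can recover and try the next alternative.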
From: Gustavo N. <me...@gu...> - 2009-05-14 23:02:32
Yes, it worked that way!

Thank you very much once again, Denis and Paul! :)

- Gustavo.

Paul said:
> Yes, Denis has another approach. You can use parse actions as a way to add
> validation logic like Denis suggests. Here is a validating parse action
> that you could attach to variable to ensure that it contains at least one
> non-digit.
>
> def mustHaveAtLeastOneNonDigit(tokens):
>     if all(c.isdigit() for c in tokens[0]):
>         raise ParseException("variable must have at least one non-digit")
> variable.setParseAction(mustHaveAtLeastOneNonDigit)
>
>
> Parse actions can serve a number of uses. They can also be chained so that
> multiple actions or validations can be invoked:
>
>
> ip_part = Word(nums)
> convertToInt = lambda tokens: int(tokens[0])
> def validateRange(tokens):
>     if not 0 <= tokens[0] < 256:
>         raise ParseException("value must be in range 0-255")
> ip_part.setParseAction(convertToInt, validateRange)
> # or:
> # ip_part.setParseAction(convertToInt)
> # ip_part.addParseAction(validateRange)
> ip_addr = Combine(ip_part + ('.'+ip_part)*3 )
>
> print ip_addr.parseString("192.168.0.255")
> print ip_addr.parseString("123.456.789.000")

--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
From: Gustavo N. <me...@gu...> - 2009-05-14 19:54:00
Paul said:
> > Hello, everybody.
> >
> > First of all, I wanted to thank you for this awesome package. I'm having
> > fun with it. :)
>
> Well, well, my friend, so we meet again! I'm pleased to see you have been
> bitten by the pyparsing bug. :)

Hello, Paul! Good to see you here :)

> > How can I fix this?
>
> In general, I think this is why variable names in most computing languages
> I know do *not* permit the name to begin with a number. But you are the
> language designer, so I will show you how to do this in pyparsing.
>
> Two suggestions, not sure if I have a preference:
> 1. use "operand = number ^ variable" instead of "operand = number |
> variable". '|' returns MatchFirst expressions, which return, well, the
> first matching expression. '^' returns Or expressions, which return
> *longest* match of all the alternative expressions. Think of the '^' as a
> little set of dividers, measuring the returned values of all the
> expressions, and picking the longest. '^' is not a cure-all, though, and
> can cause infinite run-time recursion in self-referencing grammars (those
> that include operatorPrecedence or Forward expressions).

I use both operatorPrecedence and Forward :/

> 2. As you say, invert operand to "operand = variable | number", and then
> attach a parse action to variable that first tries to evaluate the result
> as a number. In your current parser, you may eventually attach a parse
> action to number, something like this:
> number.setParseAction(lambda tokens: float(tokens[0]))
> so that at post-parse time, the returned string has already been converted
> to a float. So instead, attaching something like this to variable
> (untested):
>
> def numOrVar(tokens):
>     try:
>         return float(tokens[0])
>     except ValueError:
>         pass
> variable.setParseAction(numOrVar)
>
> Now you don't even need the alternation, since as you observed, variable
> will also match "22", so just define "operand = variable".
> You could also try this for defining variable:
>
> variable = Word(unicode(alphanums+'_'))
>
> or
>
> variable = Word(unicode(alphanums+alphas8bit+'_'))
>
> or to absolutely cover all bases (for 2-byte Unicode, anyway):
>
> allUnicodeAlphas = u''.join(c for c in map(unichr,range(65536)) if
> c.isalpha())
> allUnicodeNums = u''.join(c for c in map(unichr,range(65536)) if
> c.isdigit())
> variable = Word(allUnicodeAlphas + allUnicodeNums + u'_')
>
> (It's surprising how many Unicode digits there are besides '0'-'9'.)

I said that an operand could be a variable or a number to simplify things, given that the problem was between numbers and variables. But it's actually more complex than that: it could also be a quoted string or a set (in the form "{element1, element2, ...}", where each element can be a number, variable, quoted string or even another set):

    operand = number | string | variable | set

Therefore setting a parse action for the whole operand wouldn't be desirable; I'd rather set it on the types individually, especially to be able to test them separately too.

Sorry for not pointing this out.

> BTW, this definition of decimals:
>
> decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
>
> includes some unnecessary repetition. It should be sufficient to write:
>
> decimals = Optional(decimal_sep + Word(nums))
>
> Unless I misunderstood your intent here.

Thank you so much! I thought I had to set the quantifier explicitly.

> So, will we see some pyparsing sneak into a repoze package one of these
> days, perhaps some sort of authorization rights syntax, hmmm?

You guessed right! :)

I'm working on a package called PyACL, which as the name implies, implements Access Control Lists in Python (and repoze.what 2 will use it a lot).
But one of the things that I was missing was a way to allow system administrators to filter the access rules easily, so I started working on this generic Pyparsing-based library, which I'll announce here as soon as it's usable:
https://launchpad.net/booleano

> Buena suerte, y mucho gusto!

¡Lo mismo digo! ;-)

Thank you! =)
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
From: Paul M. <pt...@au...> - 2009-05-14 20:18:11
> I said that an operand could be a variable or a number to simplify things
> given that the problem was between numbers and variables. But it's
> actually
> more complex than that: It could be a quoted string or a set (in the form
> "{element1, element2, ...}" where each element can be a number, variable,
> quoted string or even another set) too:
>
> operand = number | string | variable | set
>

Ah, with that said, let me then suggest this as a starting point for implementing operand (assuming that variables can *not* start with a digit):

    LBRACE = Suppress('{')
    RBRACE = Suppress('}')
    operand = Forward()
    number = #...(as you have defined it in your original code)
    string_ = quotedString.setParseAction(removeQuotes)
    variable = #...(use your Unicode definition of choice)
    set_ = Group(LBRACE + delimitedList(operand) + RBRACE)
    operand << (number | string_ | variable | set_)

delimitedList takes care of the repetition with intervening comma delimiters. Group packages the result in its own list, so that recursive set definitions will maintain their nesting properly. The set-enclosing braces are suppressed from the output: they are useful during parsing, but unnecessary once the tokens have been grouped.

This is the canonical form for defining a recursive expression like your operand, mas o menos. Now you are free to include operand in other expressions, or even operatorPrecedence.

-- Paul
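Filling in the two placeholders makes the skeleton above runnable. This is one possible completion, not the definitive one: it assumes Python 3 and a pyparsing release that still provides the camelCase names `quotedString`, `removeQuotes` and `delimitedList`. The number definition is the simplified one from earlier in the thread, the variable pattern `[^\W\d]\w*` (letter or underscore first) is an assumption, and `quotedString.copy()` is used so the shared `quotedString` object is not modified.

```python
from pyparsing import (
    Combine, Forward, Group, Literal, Optional, Regex, Suppress, Word,
    delimitedList, nums, quotedString, removeQuotes,
)

LBRACE = Suppress("{")
RBRACE = Suppress("}")

operand = Forward()  # placeholder so that set_ can refer to operand recursively

number = Combine(Word(nums) + Optional(Literal(".") + Word(nums)))
# copy() keeps the parse action off the module-level quotedString object.
string_ = quotedString.copy().setParseAction(removeQuotes)
# Assumed variable pattern: a letter or '_' first, then any word characters.
variable = Regex(r"[^\W\d]\w*")
set_ = Group(LBRACE + delimitedList(operand) + RBRACE)

operand << (number | string_ | variable | set_)

parsed = operand.parseString('{1, "two", x, {3.5, y}}', parseAll=True).asList()
print(parsed)
```

The nested list in the result mirrors the nesting of the braces, which is the effect of Group described above.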
From: Gustavo N. <me...@gu...> - 2009-05-14 22:30:50
Paul said:
> Ah, with that said, let me then suggest this as a starting point for
> implementing operand (assuming that variables can not start with a digit):
>
> LBRACE = Suppress('{')
> RBRACE = Suppress('}')
> operand = Forward()
> number = #...(as you have defined it in your original code)
> string_ = quotedString.setParseAction(removeQuotes)
> variable = #...(use your Unicode definition of choice)
> set_ = Group(LBRACE + delimitedList(operand) + RBRACE)
> operand << (number | string_ | variable | set_)
>
> delimitedList takes care of the repetition with intervening comma
> delimited. Group packages the result in its own list, so that recursive set
> definitions will maintain their nesting properly. The set-enclosing braces
> are suppressed from the output - they are useful during parsing, but
> unnecessary once the tokens have been grouped.
>
> This is the canonical form for defining a recursive expression like your
> operand, mas o menos. Now you are free to include operand in other
> expressions, or even operatorPrecedence.

Thank you very much for that!

I was going to ask about that, because the way I implemented it was more complex and operand.validate() raised an exception. But this fixes the problem.

Thanks once again!
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
From: spir <den...@fr...> - 2009-05-15 12:49:41
On Thu, 14 May 2009 21:53:40 +0200, Gustavo Narea <me...@gu...> wrote:

> I'm working on a package called PyACL, which as the name implies,
> implements Access Control Lists in Python (and repoze.what 2 will use it a
> lot). But one of the things that I was missing was the way to allow system
> administrators to filter the access rules easily, so I started working on
> this generic Pyparsing-based library which I'll announce here as soon as
> it's usable: https://launchpad.net/booleano

Had a look and found it really interesting (booleano).
It reminds me of a project about customizing computer languages (PL, wiki, etc), including allowing various natural languages. This should be a kind of layer (possibly implemented in an editor) between the user and the standard computer language. The main issue was that key words (not necessarily reserved words) may well be free words for another user/natural language.

E.g. in your example
- Castilian: autor == "David TMX" y álbum.año >= 2008
- English: author == "David TMX" and album.year >= 2008
- French: auteur == «David TMX» et album.année >= 2008
what happens if logical ('and' '==' '>=', probably 'not' 'or'), or maybe even key ('author' 'album' 'year'), tokens are used with another sense or context in another user's dialect? Do you need to protect all possible variants of words having a special meaning in your language? (Even if that were possible, user-level choices would then be impossible.)

> > Buena suerte, y mucho gusto!
>
> ¡Lo mismo digo! ;-)
>
> Thank you! =)

Bona sort, i molt plaer! (català)

Denis
------
la vita e estrany
From: Gustavo N. <me...@gu...> - 2009-05-15 14:37:45
Bonjour, Denis !

spir said:
> Had a look and find it really interesting (booleano).
> Reminds me of a project about customizing computer languages (PL, wiki,
> etc), including allowing various natural languages. This should be a kind
> of layer (possibly implemented in an editor) between the user and the
> standard computer language. The main issue was that key words (not
> necessarily reserved words) may well be free words for another user/natural
> language.
>
> E.g. in your example
> - Castilian: autor == "David TMX" y álbum.año >= 2008
> - English: author == "David TMX" and album.year >= 2008
> - French: auteur == «David TMX» et album.année >= 2008
> what happens if logical ('and' '==' '>=', probably 'not' 'or'), or maybe
> even key ('author' 'album' 'year'), tokens are used with another sense or
> context in another user's dialect? Do you need to protect all possible
> variants of words having a special meaning in your language? (Even if it
> was possible, then user-level choices are impossible).

Developers using Booleano should define the variable and function names beforehand; users cannot define variables or functions, just re-use those provided by the application.

So, when the developer passes the variables and functions valid in the expressions, Booleano checks that their names aren't reserved words in the grammar (and variable, function and operator names are all case-insensitive).

Also, there won't be just one grammar to parse all the expressions. There will be one grammar per localization, so you could only use the French grammar to parse French expressions (not English or Spanish expressions).

This way, name collisions are avoided.

> > > Buena suerte, y mucho gusto!
> >
> > ¡Lo mismo digo! ;-)
> >
> > Thank you! =)
>
> Bona sort, i molt plaer! (català)

OK, now I ran out of ideas 'cause I don't know how to say so in another language ;-)

Salut !
--
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |
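The check described above (developer-supplied names validated against the reserved words of one localized grammar, case-insensitively) could look roughly like the sketch below. It is purely illustrative and is not Booleano's actual API: the `RESERVED_WORDS` table, the function `check_names` and the word lists are all hypothetical, with only "and", "y" and "et" taken from the examples in this thread.

```python
# Hypothetical reserved words per localization; only 'and', 'y' and 'et'
# appear in the thread, the rest are assumed for illustration.
RESERVED_WORDS = {
    "en": {"and", "or", "not"},
    "es": {"y", "o", "no"},
    "fr": {"et", "ou", "non"},
}


def check_names(locale, names):
    """Return the case-folded names, or raise ValueError on a reserved word."""
    reserved = {word.casefold() for word in RESERVED_WORDS[locale]}
    clashes = sorted(name for name in names if name.casefold() in reserved)
    if clashes:
        raise ValueError("reserved in %r grammar: %s" % (locale, ", ".join(clashes)))
    return {name.casefold() for name in names}


# 'y' is reserved only in the Spanish grammar, so it is a legal name elsewhere.
print(check_names("en", ["author", "album", "y"]))
try:
    check_names("es", ["autor", "Y"])
except ValueError as error:
    print(error)
```

Keeping one reserved-word set per localization is what lets a word be an operator in one grammar and an ordinary name in another.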