Thread: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-14 17:02:03

Hello, everybody.

First of all, I wanted to thank you for this awesome package. I'm having fun 
with it. :)

I've read O'Reilly's shortcut on Pyparsing, but I still can't find an answer 
to this:

One of the components of the grammar I'm defining is an operand. Operands can 
be a number or a variable. A variable is a string made up of word characters 
(in any language), numbers (in any language/culture) and/or a spacing 
character (underscores by default).

I'm using the following:
"""
import re
from pyparsing import *

# Defining the numbers:
decimal_sep = Literal(".")
decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
number = Combine(Word(nums) + decimals)

# Defining the variables:
variable = Regex("[\w\d_]+", re.UNICODE)

# Finally, let's define the operand:
operand = number | variable
"""

The operand above works perfectly with the following expressions:
hello -> variable
23 -> number
hello_world -> variable

But it doesn't support variables which begin with a number (e.g., 
"1st_variable"). I get the following exception all the time:
>>> from varnums import *
>>> operand.parseString("1st_variable")
(['1'], {})
>>> operand.parseString("1st_variable", parseAll=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gustavo/System/pyenvs/booleano/lib/python2.6/site-packages/pyparsing-1.5.2-py2.6.egg/pyparsing.py", line 1076, in parseString
    raise exc
pyparsing.ParseException: Expected end of text (at char 1), (line:1, col:2)


I know I can invert the definition of the operand (i.e., "operand = variable | 
number"), but then strings like "22" will be matched as variables (not 
numbers).

How can I fix this?

Thanks in advance.
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: [Pyparsing] How to distinguish a variable from a integer

From: spir <den...@fr...> - 2009-05-14 17:22:08

On Thu, 14 May 2009 19:01:42 +0200,
Gustavo Narea <me...@gu...> wrote:

> Hello, everybody.
> 
> First of all, I wanted to thank you for this awesome package. I'm having
> fun with it. :)
> 
> I've read O'reilly's shortcut on Pyparsing, but still I can't find an
> answer to this:
> 
> One of the components of the grammar I'm defining is an operand. Operands
> can be a number or a variable. A variable is a string made up of word
> characters (in any language), numbers (in any language/culture) and/or a
> spacing character (underscores by default).
> 
> I'm using the following:
> """
> import re
> from pyparsing import *
> 
> # Defining the numbers:
> decimal_sep = Literal(".")
> decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
> number = Combine(Word(nums) + decimals)
> 
> # Defining the variables:
> variable = Regex("[\w\d_]+", re.UNICODE)
> 
> # Finally, let's define the operand:
> operand = number | variable
> """
> 
> The operand above works perfectly with the following expressions:
> hello -> variable
> 23 -> number
> hello_world -> variable
> 
> But it doesn't support variables which begin with a number (e.g., 
> "1st_variable"). I get the following exception all the time:
> >>>> from varnums import *
> >>>> operand.parseString("1st_variable")
> >(['1'], {})
> >>>> operand.parseString("1st_variable", parseAll=True)
> >Traceback (most recent call last):
> >  File "<stdin>", line 1, in <module>
> >  File
> > "/home/gustavo/System/pyenvs/booleano/lib/python2.6/site-packages/pyparsing
> >-1.5.2-py2.6.egg/pyparsing.py", line 1076, in parseString raise exc
> >pyparsing.ParseException: Expected end of text (at char 1), (line:1, col:2)
> 
> 
> I know I can invert the definition of the operand (i.e., "operand =
> variable | number"), but then strings like "22" will be matched as
> variables (not numbers).
> 
> How can I fix this?
> 
> Thanks in advance.

The issue is that your variables can start like a number, which is why most programming languages do not allow variable names to start with a digit. As a result:
* with (number | variable), number masks variable
* with (variable | number), variable swallows every number

You should use the common pattern for a variable, requiring a letter or '_' at the start:
  variable = Regex(r"[a-zA-Z_]\w*", re.UNICODE)       (untested with Unicode; [a-zA-Z_] only allows an ASCII letter or '_' as the first character)
(also beware that \w already includes digits and '_')
so that number and variable are mutually exclusive.

You could also add a negative lookahead for (letter | '_') right after the definition of number, but then your definition of variable becomes unclear: it would have to require that a variable contains at least one non-digit character, which is not easy ;-)
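
The lookahead idea can be sketched as follows. This is only an illustration, not code from the thread: it assumes Python 3 (where \w is Unicode-aware by default) and expresses the lookahead inside a pyparsing Regex rather than with a separate pyparsing element. The float conversion is added only to make the two kinds of result easy to tell apart.

```python
from pyparsing import Regex

# Assumption: a number is a run of digits with an optional fractional part,
# and it must not be followed directly by a word character.
number = Regex(r"\d+(\.\d+)?(?!\w)")
number.setParseAction(lambda tokens: float(tokens[0]))

# Assumption: a variable is any run of word characters (letters, digits, "_").
variable = Regex(r"\w+")

operand = number | variable

for text in ("22", "3.14", "1st_variable", "hello_world"):
    print(text, "->", repr(operand.parseString(text, parseAll=True)[0]))
```

Because number is tried first, an all-digit string such as "22" never reaches variable, so in practice a variable ends up needing at least one non-digit character.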

Denis
------
la vita e estrany

Re: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-14 18:04:16

Bonjour, Denis.

spir said:
> The issue is that your variables can start like a number (the reason why in
> most PLs var names cannot start with a digit). So that: * using (number |
> variable) number masks variable
> * using the opposite number is eaten by variable
>
> You should use the common pattern for a variable, requiring letter or '_'
> at start: variable = Regex("[a-zA-Z_]\w*", re.UNICODE)       (untested with
> unicode) 

Yes, I was aware of that limitation with regular expressions, but I thought 
there was a way to work around that with Pyparsing.

>(also beware that \w includes digits and '_')

Oops, you're right. Thanks!

> so that number and variable are mutually exclusive.
>
> You could also add a lookahead for !(letter | '_') trailing after the
> definition of number, but then your definition of variable is unclear. It
> should be required that a variable has at least one non-digit char. Which
> is uneasy ;-)

Yep, that's not a good solution either.

Well, I'll have to make sure variables don't start with a number. :/

Merci beaucoup!
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: [Pyparsing] How to distinguish a variable from a integer

From: Paul M. <pt...@au...> - 2009-05-14 17:47:33

> -----Original Message-----
> From: Gustavo Narea [mailto:me...@gu...]
> Sent: Thursday, May 14, 2009 12:02 PM
> To: pyp...@li...
> Subject: [Pyparsing] How to distinguish a variable from a integer
> 
> Hello, everybody.
> 
> First of all, I wanted to thank you for this awesome package. I'm having
> fun with it. :)

Well, well, my friend, so we meet again!  I'm pleased to see you have been
bitten by the pyparsing bug. :)

> How can I fix this?
> 
In general, I think this is why variable names in most computing languages I
know do *not* permit the name to begin with a number.  But you are the
language designer, so I will show you how to do this in pyparsing.

Two suggestions, not sure if I have a preference:
1. use "operand = number ^ variable" instead of "operand = number |
variable".  '|' returns MatchFirst expressions, which return, well, the
first matching expression.  '^' returns Or expressions, which return
*longest* match of all the alternative expressions.  Think of the '^' as a
little set of dividers, measuring the returned values of all the
expressions, and picking the longest.  '^' is not a cure-all, though, and
can cause infinite run-time recursion in self-referencing grammars (those
that include operatorPrecedence or Forward expressions).

2. As you say, invert operand to "operand = variable | number", and then
attach a parse action to variable that first tries to evaluate the result as
a number.  In your current parser, you may eventually attach a parse action
to number, something like this:
number.setParseAction(lambda tokens: float(tokens[0]))
so that at post-parse time, the returned string has already been converted
to a float.  So instead, attaching something like this to variable
(untested):

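def numOrVar(tokens):
    # Return a float when the token looks like a number. On ValueError the
    # function returns None, which leaves the matched token unchanged.
    try:
        return float(tokens[0])
    except ValueError:
        pass
variable.setParseAction(numOrVar)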

Now you don't even need the alternation, since as you observed, variable
will also match "22", so just define "operand = variable".
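
The difference between '|' and '^' from suggestion 1 can be seen side by side in a small sketch. It assumes Python 3 and a simplified variable pattern (Regex(r"\w+")); the float conversion is only there to make numbers and variables distinguishable in the output.

```python
from pyparsing import Combine, Optional, Regex, Word, nums

number = Combine(Word(nums) + Optional("." + Word(nums)))
number.setParseAction(lambda tokens: float(tokens[0]))

# Assumption: a variable is any run of word characters.
variable = Regex(r"\w+")

first_match = number | variable    # MatchFirst: first alternative that matches
longest_match = number ^ variable  # Or: alternative with the longest match

print(repr(first_match.parseString("1st_variable")[0]))    # 1.0
print(repr(longest_match.parseString("1st_variable")[0]))  # '1st_variable'
print(repr(longest_match.parseString("22")[0]))
```

For "22" both alternatives match the same two characters; in the pyparsing versions checked for this sketch the first-listed alternative (number) is used in that case, so the order of the alternatives still matters.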


You could also try this for defining variable:

variable = Word(unicode(alphanums+'_'))

or 

variable = Word(unicode(alphanums+alphas8bit+'_'))

or to absolutely cover all bases (for 2-byte Unicode, anyway):

allUnicodeAlphas = u''.join(c for c in map(unichr, range(65536)) if c.isalpha())
allUnicodeNums = u''.join(c for c in map(unichr, range(65536)) if c.isdigit())
variable = Word(allUnicodeAlphas + allUnicodeNums + u'_')

(It's surprising how many Unicode digits there are besides '0'-'9'.)
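
For readers on Python 3, where unichr() and unicode() no longer exist, the same idea might look like the sketch below. The lowercase names are illustrative, and the range is still limited to the Basic Multilingual Plane, as in the code above.

```python
from pyparsing import Word

# Python 3: chr() replaces unichr(), and str is already Unicode.
BMP = range(0x10000)  # code points U+0000..U+FFFF only
all_unicode_alphas = "".join(c for c in map(chr, BMP) if c.isalpha())
all_unicode_nums = "".join(c for c in map(chr, BMP) if c.isdigit())

variable = Word(all_unicode_alphas + all_unicode_nums + "_")

print(variable.parseString("año_2008")[0])  # año_2008
print(len(all_unicode_nums), "characters in the BMP count as digits")
```

The exact digit count depends on the Unicode database shipped with the Python build, so no figure is given here.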


BTW, this definition of decimals:

decimals = Optional(decimal_sep + OneOrMore(Word(nums)))

includes some unnecessary repetition.  It should be sufficient to write:

decimals = Optional(decimal_sep + Word(nums))

Unless I misunderstood your intent here.
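
A quick check of the simplified definition, as a sketch: Word(nums) already matches a run of one or more digits, so no explicit OneOrMore is needed.

```python
from pyparsing import Combine, Literal, Optional, Word, nums

decimal_sep = Literal(".")
# Word(nums) consumes as many consecutive digits as it can find.
decimals = Optional(decimal_sep + Word(nums))
number = Combine(Word(nums) + decimals)

print(number.parseString("3.14159")[0])  # 3.14159
print(number.parseString("42")[0])       # 42
```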


So, will we see some pyparsing sneak into a repoze package one of these
days, perhaps some sort of authorization rights syntax, hmmm?

Buena suerte, y mucho gusto!
-- Paul

Re: [Pyparsing] How to distinguish a variable from a integer

From: Paul M. <pt...@au...> - 2009-05-14 17:57:16

> -----Original Message-----
> From: spir [mailto:den...@fr...]
> You could also add a lookahead for !(letter | '_') trailing after the
> definition of number, but then your definition of variable is unclear. It
> should be required that a variable has at least one non-digit char. Which
> is uneasy ;-)
> 

Yes, Denis has another approach.  You can use parse actions as a way to add
validation logic like Denis suggests.  Here is a validating parse action
that you could attach to variable to ensure that it contains at least one
non-digit.

def mustHaveAtLeastOneNonDigit(tokens):
    if all(c.isdigit() for c in tokens[0]):
        raise ParseException("variable must have at least one non-digit")
variable.setParseAction(mustHaveAtLeastOneNonDigit)
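
Put together with the inverted alternation, the validation might be used as in this sketch. It assumes Python 3, a simplified variable pattern (Regex(r"\w+")) and a float conversion on number; the snake_case function name is just a restyling of the one above.

```python
from pyparsing import Combine, Optional, ParseException, Regex, Word, nums

number = Combine(Word(nums) + Optional("." + Word(nums)))
number.setParseAction(lambda tokens: float(tokens[0]))

# Assumption: a variable is any run of word characters.
variable = Regex(r"\w+")


def must_have_at_least_one_non_digit(tokens):
    # Raising ParseException makes this alternative fail, so the
    # enclosing '|' moves on to the next one.
    if all(c.isdigit() for c in tokens[0]):
        raise ParseException("variable must have at least one non-digit")


variable.setParseAction(must_have_at_least_one_non_digit)

# variable is tried first; all-digit matches are rejected and fall
# through to number.
operand = variable | number

for text in ("22", "3.14", "1st_variable"):
    print(text, "->", repr(operand.parseString(text, parseAll=True)[0]))
```

Here the parse action lives on variable alone, so number, variable and any other operand types can still be tested separately.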


Parse actions can serve a number of uses.  They can also be chained so that
multiple actions or validations can be invoked:


ip_part = Word(nums)
convertToInt = lambda tokens: int(tokens[0])
def validateRange(tokens):
    if not 0 <= tokens[0] < 256:
        raise ParseException("value must be in range 0-255")
ip_part.setParseAction(convertToInt, validateRange)
# or:
# ip_part.setParseAction(convertToInt)
# ip_part.addParseAction(validateRange)
ip_addr = Combine(ip_part + ('.'+ip_part)*3 )

print(ip_addr.parseString("192.168.0.255"))    # parses successfully
print(ip_addr.parseString("123.456.789.000"))  # raises ParseException: 456 is out of range


-- Paul

Re: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-14 23:02:32

Yes, it worked that way!

Thank you very much once again, Denis and Paul! :)

  - Gustavo.

Paul said:
> Yes, Denis has another approach.  You can use parse actions as a way to add
> validation logic like Denis suggests.  Here is a validating parse action
> that you could attach to variable to ensure that it contains at least one
> non-digit.
>
> def mustHaveAtLeastOneNonDigit(tokens):
>     if all(c.isdigit() for c in tokens[0]):
>         raise ParseException("variable must have at least one non-digit")
> variable.setParseAction(mustHaveAtLeastOneNonDigit)
>
>
> Parse actions can serve a number of uses.  They can also be chained so that
> multiple actions or validations can be invoked:
>
>
> ip_part = Word(nums)
> convertToInt = lambda tokens: int(tokens[0])
> def validateRange(tokens):
>     if not 0 <= tokens[0] < 256:
>         raise ParseException("value must be in range 0-255")
> ip_part.setParseAction(convertToInt, validateRange)
> # or:
> # ip_part.setParseAction(convertToInt)
> # ip_part.addParseAction(validateRange)
> ip_addr = Combine(ip_part + ('.'+ip_part)*3 )
>
> print ip_addr.parseString("192.168.0.255")
> print ip_addr.parseString("123.456.789.000")
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-14 19:54:00

Paul said:
> > Hello, everybody.
> >
> > First of all, I wanted to thank you for this awesome package. I'm having
> > fun with it. :)
>
> Well, well, my friend, so we meet again!  I'm pleased to see you have been
> bitten by the pyparsing bug. :)

Hello, Paul!

Good to see you here :)

> > How can I fix this?
>
> In general, I think this is why variable names in most computing languages
> I know do *not* permit the name to begin with a number.  But you are the
> language designer, so I will show you how to do this in pyparsing.
>
> Two suggestions, not sure if I have a preference:
> 1. use "operand = number ^ variable" instead of "operand = number |
> variable".  '|' returns MatchFirst expressions, which return, well, the
> first matching expression.  '^' returns Or expressions, which return
> *longest* match of all the alternative expressions.  Think of the '^' as a
> little set of dividers, measuring the returned values of all the
> expressions, and picking the longest.  '^' is not a cure-all, though, and
> can cause infinite run-time recursion in self-referencing grammars (those
> that include operatorPrecedence or Forward expressions).

I use both operatorPrecedence and Forward :/


> 2. As you say, invert operand to "operand = variable | number", and then
> attach a parse action to variable that first tries to evaluate the result
> as a number.  In your current parser, you may eventually attach a parse
> action to number, something like this:
> number.setParseAction(lambda tokens: float(tokens[0]))
> so that at post-parse time, the returned string has already been converted
> to a float.  So instead, attaching something like this to variable
> (untested):
>
> def numOrVar(tokens):
>     try:
>         return float(tokens[0])
>     except ValueError:
>         pass
> variable.setParseAction(numOrVar)
>
> Now you don't even need the alternation, since as you observed, variable
> will also match "22", so just define "operand = variable".
> You could also try this for defining variable:
>
> variable = Word(unicode(alphanums+'_'))
>
> or
>
> variable = Word(unicode(alphanums+alphas8bit+'_'))
>
> or to absolutely cover all bases (for 2-byte Unicode, anyway):
>
> allUnicodeAlphas = u''.join(c for c in map(unichr,range(65536)) if
> c.isalpha())
> allUnicodeNums = = u''.join(c for c in map(unichr,range(65536)) if
> c.isdigit())
> variable = Word(allUnicodeAlphas + allUnicodeNums + u'_')
>
> (It's surprising how many Unicode digits there are besides '0'-'9'.)

I said that an operand could be a variable or a number to simplify things 
given that the problem was between numbers and variables. But it's actually 
more complex than that: It could be a quoted string or a set (in the form 
"{element1, element2, ...}" where each element can be a number, variable, 
quoted string or even another set) too:

operand = number | string | variable | set

Therefore setting a parse action for the whole operand wouldn't be desirable; 
I'd rather set it on the types individually, especially to be able to test 
them separately too.

Sorry for not pointing this out.


>
>
> BTW, this definition of decimals:
>
> decimals = Optional(decimal_sep + OneOrMore(Word(nums)))
>
> includes some unnecessary repetition.  It should be sufficient to write:
>
> decimals = Optional(decimal_sep + Word(nums))
>
> Unless I misunderstood your intent here.

Thank you so much! I thought I had to set the quantifier explicitly.


> So, will we see some pyparsing sneak into a repoze package one of these
> days, perhaps some sort of authorization rights syntax, hmmm?

You guessed right! :)

I'm working on a package called PyACL, which as the name implies, implements 
Access Control Lists in Python (and repoze.what 2 will use it a lot). But one 
of the things that I was missing was the way to allow system administrators to 
filter the access rules easily, so I started working on this generic 
Pyparsing-based library which I'll announce here as soon as it's usable:
https://launchpad.net/booleano


> Buena suerte, y mucho gusto!

¡Lo mismo digo! ;-)

Thank you! =)
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: [Pyparsing] How to distinguish a variable from a integer

From: Paul M. <pt...@au...> - 2009-05-14 20:18:11

> I said that an operand could be a variable or a number to simplify things
> given that the problem was between numbers and variables. But it's
> actually
> more complex than that: It could be a quoted string or a set (in the form
> "{element1, element2, ...}" where each element can be a number, variable,
> quoted string or even another set) too:
> 
> operand = number | string | variable | set
> 
Ah, with that said, let me then suggest this as a starting point for
implementing operand (assuming that variables can *not* start with a digit):

LBRACE = Suppress('{')
RBRACE = Suppress('}')
operand = Forward()
number = #...(as you have defined it in your original code)
string_ = quotedString.setParseAction(removeQuotes)
variable = #...(use your Unicode definition of choice)
set_ = Group(LBRACE + delimitedList(operand) + RBRACE)
operand << (number | string_ | variable | set_)

delimitedList takes care of the repetition with intervening comma delimiters.
Group packages the result in its own list, so that recursive set definitions
will maintain their nesting properly.  The set-enclosing braces are
suppressed from the output - they are useful during parsing, but unnecessary
once the tokens have been grouped.

This is the canonical form for defining a recursive expression like your
operand, mas o menos.  Now you are free to include operand in other
expressions, or even operatorPrecedence.
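
A runnable version of the outline above could look like the sketch below. The definitions of number and variable are filled in purely as assumptions for illustration (variable requires a leading letter or underscore), and quotedString is copied first so that the parse action is not attached to pyparsing's shared quotedString object.

```python
from pyparsing import (Combine, Forward, Group, Optional, Regex, Suppress,
                       Word, delimitedList, nums, quotedString, removeQuotes)

LBRACE = Suppress("{")
RBRACE = Suppress("}")

operand = Forward()
# Assumption: numbers as in the original post, with the simplified decimals.
number = Combine(Word(nums) + Optional("." + Word(nums)))
# copy() keeps removeQuotes off the module-level quotedString expression.
string_ = quotedString.copy().setParseAction(removeQuotes)
# Assumption: a variable starts with a letter or "_", then any word characters.
variable = Regex(r"[^\W\d]\w*")
set_ = Group(LBRACE + delimitedList(operand) + RBRACE)
operand <<= number | string_ | variable | set_

result = operand.parseString('{1, "two", three, {4.5, five}}', parseAll=True)
print(result.asList())  # [['1', 'two', 'three', ['4.5', 'five']]]
```

The nested list in the output mirrors the nesting of the sets in the input, which is what Group is there for.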
    
-- Paul

Re: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-14 22:30:50

Paul said:
> Ah, with that said, let me then suggest this as a starting point for
> implementing operand (assuming that variables can not start with a digit):
>
> LBRACE = Suppress('{')
> RBRACE = Suppress('}')
> operand = Forward()
> number = #...(as you have defined it in your original code)
> string_ = quotedString.setParseAction(removeQuotes)
> variable = #...(use your Unicode definition of choice)
> set_ = Group(LBRACE + delimitedList(operand) + RBRACE)
> operand << (number | string_ | variable | set_)
>
> delimitedList takes care of the repetition with intervening comma
> delimited. Group packages the result in its own list, so that recursive set
> definitions will maintain their nesting properly.  The set-enclosing braces
> are suppressed from the output - they are useful during parsing, but
> unnecessary once the tokens have been grouped.
>
> This is the canonical form for defining a recursive expression like your
> operand, mas o menos.  Now you are free to include operand in other
> expressions, or even operatorPrecedence.

Thank you very much for that! I was going to talk about that, because the way 
I implemented it was more complex and operand.validate() raised an exception.

But this fixes the problem. Thanks once again!
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |

Re: [Pyparsing] How to distinguish a variable from a integer

From: spir <den...@fr...> - 2009-05-15 12:49:41

On Thu, 14 May 2009 21:53:40 +0200,
Gustavo Narea <me...@gu...> wrote:

> I'm working on a package called PyACL, which as the name implies,
> implements Access Control Lists in Python (and repoze.what 2 will use it a
> lot). But one of the things that I was missing was the way to allow system
> administrators to filter the access rules easily, so I started working on
> this generic Pyparsing-based library which I'll announce here as soon as
> it's usable: https://launchpad.net/booleano

I had a look at Booleano and find it really interesting.
It reminds me of a project about customizing computer languages (programming languages, wiki markup, etc.), including support for various natural languages. It would be a kind of layer (possibly implemented in an editor) between the user and the standard computer language.
The main issue was that keywords (not necessarily reserved words) in one user's natural language may well be ordinary, freely usable words in another's.

E.g. in your example
 - Castilian: autor == "David TMX" y álbum.año >= 2008
 - English: author == "David TMX" and album.year >= 2008
 - French: auteur == «David TMX» et album.année >= 2008
what happens if the logical tokens ('and', '==', '>=', probably 'not' and 'or'), or maybe even the key tokens ('author', 'album', 'year'), are used with another meaning or in another context in another user's dialect?
Do you need to protect all possible variants of words that have a special meaning in your language? (Even if that were possible, it would rule out user-level choices.)

> > Buena suerte, y mucho gusto!  
> 
> ¡Lo mismo digo! ;-)
> 
> Thank you! =)

Bona sort, i molt plaer! (català)

Denis
------
la vita e estrany

Re: [Pyparsing] How to distinguish a variable from a integer

From: Gustavo N. <me...@gu...> - 2009-05-15 14:37:45

Bonjour, Denis !

spir said:
> Had a look and find it really interesting (booleano).
> Reminds me of a project about customizing computer languages (PL, wiki,
> etc), including allowing various natural languages. This should be a kind
> of layer (possibly implemented in an editor) between the user and the
> standard computer language. The main issue was that key words (not
> necessarily reserved words) may well be free words for another user/natural
> language.
>
> E.g. in your example
>  - Castilian: autor == "David TMX" y álbum.año >= 2008
>  - English: author == "David TMX" and album.year >= 2008
>  - French: auteur == «David TMX» et album.année >= 2008
> what happens if logical ('and' '==' '>=', probably 'not' 'or'), or maybe
> even key ('author' 'album' 'year'), tokens are used with another sense or
> context in another user's dialect? Do you need to protect all possible
> variants of words having a special meaning in your language? (Even if it
> was possible, then user-level choices are impossible).

Developers using Booleano should define the variable and function names 
beforehand; users cannot define variables or functions, just re-use those 
provided by the application. 

So, when the developer passes the variables and functions valid in the 
expressions, Booleano checks that their names aren't reserved words in the 
grammar (and variable, function and operator names are all case-insensitive).

Also, there won't be just one grammar to parse all the expressions. There will 
be one grammar per localization, so you could only use the French grammar to 
parse French expressions (not English or Spanish expressions).

This way, name collisions are avoided.
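
The check described above can be pictured with a small sketch. Everything in it is hypothetical: the reserved-word tables and the function name are made up for illustration and are not taken from Booleano.

```python
# Hypothetical reserved words per localization (illustration only).
RESERVED_WORDS = {
    "en": {"and", "or", "not"},
    "es": {"y", "o", "no"},
    "fr": {"et", "ou", "non"},
}


def find_name_collisions(locale, names):
    """Return the names that clash with the reserved words of one locale.

    The comparison is case-insensitive, matching the description above.
    """
    reserved = RESERVED_WORDS[locale]
    return sorted(name for name in names if name.lower() in reserved)


print(find_name_collisions("en", ["author", "album", "Not"]))  # ['Not']
print(find_name_collisions("fr", ["auteur", "album", "Not"]))  # []
```

Since each localization has its own grammar, a name only has to be checked against the reserved words of the grammar it will be used with.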


> > > Buena suerte, y mucho gusto!  
> >
> > ¡Lo mismo digo! ;-)
> >
> > Thank you! =)
>
> Bona sort, i molt plaer! (català)

OK, now I ran out of ideas 'cause I don't know how to say so in another 
language ;-)

Salut !
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |