Re: [Pyparsing] PyParsing and unicode


> -----Original Message-----
> From: pyp...@li... 
> [mailto:pyp...@li...] On 
> Behalf Of Jean-Paul Calderone
> Sent: Sunday, June 25, 2006 8:04 PM
> To: pyp...@li...
> Subject: [Pyparsing] PyParsing and unicode
> 
> Hey,
> 
> I'm wondering how to match any sequence of whitespace-separated
> characters, including non-ascii.  For ASCII, I've just been using
> pyparsing.Word(alphanums) but this approach doesn't seem to work for
> unicode.
> 
Well, there *are* quite a few other printable characters besides just
letters and numbers.  Pyparsing defines the constant pyparsing.printables as
all non-whitespace 7-bit ASCII characters, that is
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
(Contrast this with string.printable, which also includes whitespace
characters: just different interpretations of what "printable" means, I
guess.  Note also that the string module defines printable, but the str
class does not.)
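
To make the difference concrete, here is a small sketch using only the
standard string module.  It rebuilds a constant matching the description
above (non-whitespace 7-bit ASCII) rather than importing pyparsing, so the
name ascii_printables is just an illustration, not a pyparsing identifier.
It also shows why a Word built from such a set cannot match non-ASCII
letters: they simply are not in the set.

```python
import string

# Assumption: rebuilt from the description above (all non-whitespace
# 7-bit ASCII characters); this is not imported from pyparsing itself.
ascii_printables = "".join(
    c for c in string.printable if c not in string.whitespace
)

print(len(string.printable))    # 100: includes the 6 whitespace characters
print(len(ascii_printables))    # 94: whitespace removed
print(" " in string.printable)  # True
print(" " in ascii_printables)  # False
print("\u00e9" in ascii_printables)  # False: e-acute is outside 7-bit ASCII
```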

> Also, while trying to figure this out, I tried this:
> 
>   pyparsing.OneOrMore(pyparsing.NotAny(pyparsing.White())).parseString("hello")
> 
> Running this goes into an infinite loop consuming all CPU resources.
> Not sure if this is a bug worth fixing in PyParsing but I thought I'd
> point it out.
> 
> Jean-Paul
> 
NotAny is merely a negative lookahead, the opposite of FollowedBy.  It does
*not* advance the parse position, so OneOrMore(NotAny(whatever)) will just
loop forever.  I think what you are looking for is the opposite of Word, the
pyparsing class CharsNotIn.  Here's your example, entered at the Python
prompt:

>>> print pyparsing.OneOrMore(pyparsing.CharsNotIn(" \t\n\r\f")).parseString("hello")
['hello']
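
If it helps to see why the lookahead version never finishes, here is a toy
model in plain Python.  It is only a sketch of the idea, not how pyparsing
is implemented, and every function name in it is made up for illustration.
The repetition loop is capped at max_steps so that the stall shows up as a
number instead of hanging the interpreter.

```python
WHITESPACE = " \t\n\r\f"

def not_any_whitespace(text, pos):
    """Toy negative lookahead: succeeds without consuming anything."""
    if pos < len(text) and text[pos] not in WHITESPACE:
        return pos   # success, but the position is unchanged
    return None      # failure

def chars_not_in_whitespace(text, pos):
    """Toy consuming matcher: eats a run of non-whitespace characters."""
    end = pos
    while end < len(text) and text[end] not in WHITESPACE:
        end += 1
    return end if end > pos else None

def one_or_more(matcher, text, max_steps=1000):
    """Repeat matcher until it fails, or until max_steps is reached."""
    pos, steps = 0, 0
    while steps < max_steps:
        new_pos = matcher(text, pos)
        if new_pos is None:
            break
        pos, steps = new_pos, steps + 1
    return pos, steps

lookahead_result = one_or_more(not_any_whitespace, "hello")
consuming_result = one_or_more(chars_not_in_whitespace, "hello")
print(lookahead_result)  # (0, 1000): never advanced, stopped only by the cap
print(consuming_result)  # (5, 1): consumed "hello" in a single step
```

Without the cap, the first call would spin at position 0 forever, which
is the behavior described in the original report.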

I've not tested this with unicode characters though.
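
For what it's worth, the underlying idea (match runs of anything that is
not in a given whitespace set) does not depend on the matched characters
being ASCII.  The sketch below shows that with the standard re module
under Python 3, where str is Unicode.  It deliberately does not use
pyparsing, so it says nothing about how CharsNotIn itself behaves on such
input; the sample text is just an assumption for illustration.

```python
import re

# Same exclusion set as the CharsNotIn call above:
# space, tab, newline, carriage return, form feed.
NON_WHITESPACE_RUN = re.compile(r"[^ \t\n\r\f]+")

# Three whitespace-separated words containing non-ASCII characters.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"

words = NON_WHITESPACE_RUN.findall(text)
print(words)
```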

-- Paul