> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On
> Behalf Of Jean-Paul Calderone
> Sent: Sunday, June 25, 2006 8:04 PM
> To: pyp...@li...
> Subject: [Pyparsing] PyParsing and unicode
>
> Hey,
>
> I'm wondering how to match any sequence of whitespace-separated
> characters, including non-ascii. For ASCII, I've just been using
> pyparsing.Word(alphanums) but this approach doesn't seem to work for
> unicode.
>
Well, there *are* quite a few other printable characters besides just
letters and numbers. Pyparsing defines the constant pyparsing.printables as
all non-whitespace 7-bit ASCII characters, that is
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
(Contrast this with string.printable, which includes
whitespace characters... just different interpretations of what "printable"
means, I guess. Note also that the string module defines printable, but the
str class does not.)
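
As a quick check of that definition, here is a small sketch that uses only the standard string module. The variable name non_ws_printables is my own, not something pyparsing defines. It rebuilds the same 94-character set and shows that a non-ASCII letter is not in it, which is why an ASCII-only character set will stop at the first accented character:

```python
import string

# Printable 7-bit ASCII with the whitespace characters removed,
# i.e. the set described above for pyparsing.printables.
non_ws_printables = "".join(
    c for c in string.printable if c not in string.whitespace
)

# string.printable itself still contains whitespace.
has_space_in_string_printable = " " in string.printable

# Non-ASCII letters are not part of the ASCII-only set.
e_acute_is_covered = "\u00e9" in non_ws_printables
```
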
> Also, while trying to figure this out, I tried this:
>
>
> pyparsing.OneOrMore(pyparsing.NotAny(pyparsing.White())).parseString("hello")
>
> Running this goes into an infinite loop consuming all CPU resources. Not
> sure if this is a bug worth fixing in PyParsing but I thought I'd point it
> out.
>
> Jean-Paul
>
NotAny is merely a negative lookahead, the opposite of FollowedBy. It does
*not* advance the parse position, so OneOrMore(NotAny(whatever)) keeps
succeeding at the same position and loops forever. I think what you are
looking for is the opposite of Word, the
pyparsing class CharsNotIn. Here's your example, entered at the Python
prompt:
>>> print pyparsing.OneOrMore(pyparsing.CharsNotIn(" \t\n\r\f")).parseString("hello")
['hello']
I've not tested this with unicode characters though.
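
To make the lookahead-versus-consume distinction concrete, here is a rough analogy using Python's standard re module rather than pyparsing. This is only a sketch: the regex patterns and variable names are mine, and it says nothing definitive about how pyparsing handles unicode input. The negative lookahead matches without consuming anything, while the negated character class consumes every non-whitespace character, non-ASCII ones included:

```python
import re

text = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

# A negative lookahead is zero-width: it succeeds at position 0
# but consumes nothing, so repeating it would never make progress.
lookahead = re.match(r"(?!\s)", text)
consumed_by_lookahead = lookahead.end() - lookahead.start()

# A negated character class consumes each run of non-whitespace
# characters, so repetition advances through the input.
words = re.findall(r"[^ \t\n\r\f]+", text)
```
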
-- Paul