Re: [Pyparsing] PyParsing and unicode
From: Paul M. <pa...@al...> - 2006-06-26 15:06:28
> -----Original Message-----
> From: pyp...@li...
> [mailto:pyp...@li...] On Behalf Of Jean-Paul Calderone
> Sent: Sunday, June 25, 2006 8:04 PM
> To: pyp...@li...
> Subject: [Pyparsing] PyParsing and unicode
>
> Hey,
>
> I'm wondering how to match any sequence of whitespace-separated characters,
> including non-ascii. For ASCII, I've just been using
> pyparsing.Word(alphanums) but this approach doesn't seem to work for
> unicode.
>

Well, there *are* quite a few other printable characters besides just letters and numbers. Pyparsing defines the constant pyparsing.printables as all non-whitespace 7-bit ASCII characters, that is

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

(Contrast this with string.printable, which includes whitespace characters... just different interpretations of what "printable" means, I guess. Note also that the string module defines printable, but the str class does not.)

> Also, while trying to figure this out, I tried this:
>
> pyparsing.OneOrMore(pyparsing.NotAny(pyparsing.White())).parseString("hello")
>
> Running this goes into an infinite loop consuming all CPU resources. Not
> sure if this is a bug worth fixing in PyParsing but I thought I'd point it
> out.
>
> Jean-Paul
>

NotAny is merely a negative lookahead, the opposite of FollowedBy. It does *not* advance the parse position, so OneOrMore(NotAny(whatever)) will just loop forever.

I think what you are looking for is the opposite of Word, the pyparsing class CharsNotIn. Here's your example, entered at the Python prompt:

>>> print pyparsing.OneOrMore(pyparsing.CharsNotIn(" \t\n\r\f")).parseString("hello")
['hello']

I've not tested this with unicode characters though.

-- Paul
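
A minimal sketch of the points in the reply, for checking them without pyparsing installed. It uses the standard string and re modules as stand-ins, which is an assumption of this sketch and not pyparsing's own implementation. The first part rebuilds the non-whitespace ASCII printables from string.printable, the second shows that a negative lookahead consumes no input, and the third shows a "characters not in this set" pattern picking up non-ASCII words. The names ascii_printables, lookahead, and not_whitespace are illustrative only.

```python
import re
import string

# Non-whitespace 7-bit ASCII printables, rebuilt from the standard library.
# Assumption: this mirrors the constant described in the reply; it is not
# copied from pyparsing's source.
ascii_printables = "".join(
    c for c in string.printable if c not in string.whitespace
)

# A negative lookahead is zero-width: it can succeed without consuming any
# input, so repeating it on its own never advances through the string.
lookahead = re.compile(r"(?!\s)")
m = lookahead.match("hello")
consumed = m.end() - m.start()  # number of characters consumed by the match

# A "characters not in this set" pattern does consume input, and it does
# not care whether the characters it consumes are ASCII.
not_whitespace = re.compile(r"[^ \t\n\r\f]+")
tokens = not_whitespace.findall("hello w\u00f6rld \u00f1and\u00fa")

print(len(ascii_printables), consumed, tokens)
```

CharsNotIn takes the same kind of exclusion set, so it may well behave like the last pattern on unicode input, but as the reply says, that is untested here.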