Thread: [Pyparsing] PyParsing and unicode
From: Jean-Paul C. <ex...@di...> - 2006-05-02 14:26:47
Hey All,

I've been using PyParsing to handle commands in Imaginary (formerly Pottery). So far it's done most of the things I've asked of it, and I think I have some ideas to work around the rest, but the behavior with respect to unicode is a bit confusing.

In 1.2 (Ubuntu Breezy packaged version), I could parse a unicode string and get back a unicode string:

    exarkun@boson:~$ python
    Python 2.4.2 (#2, Sep 30 2005, 21:19:01)
    [GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu8)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyparsing
    >>> pyparsing.__version__
    '1.2'
    >>> pyparsing.quotedString.parseString(u"'foo'")
    ([u"'foo'"], {})
    >>>
    exarkun@boson:~$

However, on upgrading to 1.3 (Ubuntu Dapper packaged version), this no longer appears to be the case:

    exarkun@kunai:~$ python
    Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
    [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyparsing
    >>> pyparsing.__version__
    '1.3.3'
    >>> pyparsing.quotedString.parseString(u"'foo'")
    (["'foo'"], {})
    >>>
    exarkun@kunai:~$

More confusing, this behavior seems to depend on the exact expression you use to parse a string: sometimes the result will come out as unicode, sometimes not. The exact expression I am using (created by the targetString function here <http://divmod.org/trac/browser/trunk/Imaginary/imaginary/commands.py#L19>) allows either quoted or unquoted strings and, frustratingly, if the quotes are supplied the result is a str, but if they are omitted the result is unicode.

I have considered wrapping my usage of PyParsing in an extra layer that does type-checking and decodes when appropriate, but this seems like a hackish work-around for a mis-feature of PyParsing, rather than the correct solution.

Is this a bug, am I mis-using PyParsing, or does PyParsing really just not differentiate between these two types?

Thanks in advance,

Jean-Paul
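[Editor's sketch] The "extra layer that does type-checking and decodes when appropriate" mentioned above could look roughly like the following. This is only an illustration under stated assumptions: it is written for Python 3, where bytes and str stand in for the Python 2 str and unicode types discussed in the thread; the helper name decode_tokens and the default "ascii" encoding are hypothetical; and it works on a plain nested list such as the one returned by asList(), not on pyparsing objects directly.

```python
def decode_tokens(tokens, encoding="ascii"):
    """Return a copy of a (possibly nested) token list with bytes decoded.

    Hypothetical helper: bytes here play the role of Python 2 ``str``,
    and str plays the role of Python 2 ``unicode``.
    """
    decoded = []
    for tok in tokens:
        if isinstance(tok, bytes):
            # Byte string: decode it so every text token has one type.
            decoded.append(tok.decode(encoding))
        elif isinstance(tok, (list, tuple)):
            # Nested group: normalize recursively (tuples come back as lists).
            decoded.append(decode_tokens(tok, encoding))
        else:
            # Already text, or a non-string token: leave untouched.
            decoded.append(tok)
    return decoded


mixed = [b"'foo'", "bar", [b"baz", 42]]
normalized = decode_tokens(mixed)
print(normalized)
```

A wrapper like this hides the inconsistency rather than fixing it, which is the objection raised in the message above.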
From: Jean-Paul C. <ex...@di...> - 2006-06-26 01:03:40
Hey,

I'm wondering how to match any sequence of whitespace-separated characters, including non-ascii. For ASCII, I've just been using pyparsing.Word(alphanums) but this approach doesn't seem to work for unicode.

Also, while trying to figure this out, I tried this:

    pyparsing.OneOrMore(pyparsing.NotAny(pyparsing.White())).parseString("hello")

Running this goes into an infinite loop consuming all CPU resources. Not sure if this is a bug worth fixing in PyParsing but I thought I'd point it out.

Jean-Paul
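[Editor's sketch] A likely explanation for the loop reported above is that NotAny is a negative lookahead: it succeeds without consuming any input, so OneOrMore can keep matching it at the same position. The goal itself, runs of non-whitespace characters including non-ASCII ones, can be illustrated outside pyparsing with the standard re module. This swaps in a regular expression for the pyparsing grammar, so it is not a pyparsing answer; the name split_words is hypothetical, and the code assumes Python 3, where \S is Unicode-aware for str patterns. Within pyparsing, the CharsNotIn element appears to be the closest analogue, though that is not verified here against the 1.x releases discussed in this thread.

```python
import re

# One or more non-whitespace characters; unlike a bare lookahead,
# this pattern always consumes at least one character per match.
WORD = re.compile(r"\S+")


def split_words(text):
    """Return the whitespace-separated words in text, ASCII or not."""
    return WORD.findall(text)


print(split_words("héllo wörld  日本語\tabc123"))
```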
From: Paul M. <pa...@al...> - 2006-05-03 05:23:29
Jean-Paul -

My first thought is that this is a bug in pyparsing. I'll look into what changed around the 1.3 time frame to see what may have caused this.

There is also a more recent version of pyparsing than 1.3.3; you might download it from SF and give it a try. I don't expect it to be different in this respect, tho.

-- Paul
From: Paul M. <pa...@al...> - 2006-05-04 01:51:22
Jean-Paul,

Here are my tests with the latest version:

    from pyparsing import quotedString,__version__

    print __version__

    def stripper(s, loc, toks):
        toks = toks.asList()
        toks[0] = toks[0][1:-1]
        return toks

    input = u"'foo'"
    print quotedString.parseString(input)
    quotedString.setParseAction(stripper)
    print quotedString.parseString(input)

Prints:

    1.4.3
    [u"'foo'"]
    [u'foo']

Can you send me an example of how the behavior changes depending on usage? So far, it looks like the current release does the right thing.

-- Paul
From: Paul M. <pa...@al...> - 2006-05-04 02:01:17
Oh, here are the results under 1.3.3:

    1.3.3
    ["'foo'"]
    ['foo']

And here are the results under 1.4.2 (the current released version):

    1.4.2
    [u"'foo'"]
    [u'foo']

So I definitely see this problem exists under 1.3.3, but I'd rather not go through a release on this old version track if I can help it. Is it a problem for you to upgrade to 1.4.2?

-- Paul