Menu

#2 Unicode Tokens in Python 2.7

2.0
closed
None
2016-10-09
2016-10-05
No

I am trying to use a Unicode Token under Python 2.7, but it's not working out for me. basically, i can only match chars up to \x7f because of a lambda named key() which uses Python3 str(). This breaks 2.7 which wants unicode() instead.

here's a simple example to demonstrate the problem:

#!/usr/bin/env python

from lrparsing import Grammar, Token


patched = True


class G(Grammar):
    if patched:
        START = Token(re=ur'[\x00-\U0010ffff]+')
    else:
        START = Token(re=ur'[\x00-\U0000007f]+')


def test(n):
    def code_units(cp):
        return u''.join([unichr(x) for x in range(cp, cp + n)])
    for cp in range(0, 0x110000, n):
        print G().parse(code_units(cp))


test(8)

with the following patch (and changing the unit tests to run python rather than python3) everything except the check for 100% code coverage appears to work okay, i think...

--- lrparsing.py    2016-10-04 15:44:59.000000000 +0000
+++ lrparsing.py.unicode27  2016-10-03 22:05:03.000000000 +0000
@@ -2495,7 +2495,7 @@
             token
             for token in self.registry.values()
             if isinstance(token, Token))

-        key = lambda t: (str(t.literal), str(t.re))
+        key = lambda t: (unicode(t.literal), unicode(t.re))
         all_tokens = sorted(all_tokens, key=key, reverse=True)
         patterns = [token for token in all_tokens if token.re is not None]
         self.literals = {}

Any chance of getting something like this applied to keep 2.7 working with Unicode?

Discussion

  • Russell Stuart

    Russell Stuart - 2016-10-09
    • status: open --> closed
     
  • Russell Stuart

    Russell Stuart - 2016-10-09

    Done in 1.0.13. Thank you for your suggestion.

     

Log in to post a comment.