#2 Unicode Tokens in Python 2.7

Milestone: 2.0

Status: closed

Owner: Russell Stuart

Labels: None

Updated: 2016-10-09

Created: 2016-10-05

Creator: Ray Lehtiniemi

Private: No

I am trying to use a Unicode Token under Python 2.7, but it is not working. Basically, I can only match characters up to \x7f, because a lambda named key() calls str(). That is fine on Python 3, but it breaks 2.7, which wants unicode() instead.
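My guess at the mechanism, offered as a sketch rather than a trace through lrparsing: on 2.7, str() applied to a unicode object encodes it with the default ASCII codec, so any character above \x7f raises UnicodeEncodeError. The explicit encode below mimics that implicit step and behaves the same on Python 2 and 3. The names ascii_only, narrow and wide are illustrative only and are not part of lrparsing.

```python
def ascii_only(text):
    """Return True if text survives an ASCII encode, the step that
    Python 2's str() applies to unicode objects by default."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Character class limited to \x00-\x7f: encodes cleanly.
narrow = u"[\x00-\x7f]+"
# Character class reaching U+10FFFF: contains a non-ASCII character.
wide = u"[\x00-\U0010ffff]+"

print(ascii_only(narrow))
print(ascii_only(wide))
```
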

Here's a simple example to demonstrate the problem:

#!/usr/bin/env python

from lrparsing import Grammar, Token


patched = True


class G(Grammar):
    if patched:
        START = Token(re=ur'[\x00-\U0010ffff]+')
    else:
        START = Token(re=ur'[\x00-\U0000007f]+')


def test(n):
    def code_units(cp):
        return u''.join([unichr(x) for x in range(cp, cp + n)])
    for cp in range(0, 0x110000, n):
        print G().parse(code_units(cp))


test(8)

With the following patch (and with the unit tests changed to run python rather than python3), everything except the check for 100% code coverage appears to work okay, I think.

--- lrparsing.py    2016-10-04 15:44:59.000000000 +0000
+++ lrparsing.py.unicode27  2016-10-03 22:05:03.000000000 +0000
@@ -2495,7 +2495,7 @@
             token
             for token in self.registry.values()
             if isinstance(token, Token))

-        key = lambda t: (str(t.literal), str(t.re))
+        key = lambda t: (unicode(t.literal), unicode(t.re))
         all_tokens = sorted(all_tokens, key=key, reverse=True)
         patterns = [token for token in all_tokens if token.re is not None]
         self.literals = {}

Any chance of getting something like this applied to keep 2.7 working with Unicode?
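If hard-coding unicode() is a problem for Python 3, a version-neutral key might look something like the sketch below. This is only an illustration, not necessarily how it should be done in lrparsing: text_type, FakeToken and token_sort_key are hypothetical names, and FakeToken is a stand-in that carries just the two attributes the key reads.

```python
import sys
from collections import namedtuple

# Pick the text type for the running interpreter:
# unicode on Python 2, str on Python 3.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only defined on Python 2)

# Hypothetical stand-in for lrparsing's Token.
FakeToken = namedtuple("FakeToken", ["literal", "re"])


def token_sort_key(t):
    # Same shape as the key lambda in the patch, without naming
    # str() or unicode() directly. None becomes the text "None",
    # as it does in the original lambda.
    return (text_type(t.literal), text_type(t.re))


tokens = [
    FakeToken(literal=None, re=u"[a-z]+"),
    FakeToken(literal=u"\u00e9", re=None),
    FakeToken(literal=u"if", re=None),
]
ordered = sorted(tokens, key=token_sort_key, reverse=True)
```

The point of the indirection is that the same source runs on both interpreters, so the unit tests would not need a separate code path for 2.7.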

Discussion

Russell Stuart - 2016-10-09

status: open --> closed

Russell Stuart - 2016-10-09

Done in 1.0.13. Thank you for your suggestion.
