I am trying to use a Unicode Token under Python 2.7, but it's not working out for me. basically, i can only match chars up to \x7f because of a lambda named key() which uses Python3 str(). This breaks 2.7 which wants unicode() instead.
here's a simple example to demonstrate the problem:
#!/usr/bin/env python
from lrparsing import Grammar, Token
patched = True
class G(Grammar):
if patched:
START = Token(re=ur'[\x00-\U0010ffff]+')
else:
START = Token(re=ur'[\x00-\U0000007f]+')
def test(n):
def code_units(cp):
return u''.join([unichr(x) for x in range(cp, cp + n)])
for cp in range(0, 0x110000, n):
print G().parse(code_units(cp))
test(8)
with the following patch (and changing the unit tests to run python rather than python3) everything except the check for 100% code coverage appears to work okay, i think...
--- lrparsing.py 2016-10-04 15:44:59.000000000 +0000
+++ lrparsing.py.unicode27 2016-10-03 22:05:03.000000000 +0000
@@ -2495,7 +2495,7 @@
token
for token in self.registry.values()
if isinstance(token, Token))
- key = lambda t: (str(t.literal), str(t.re))
+ key = lambda t: (unicode(t.literal), unicode(t.re))
all_tokens = sorted(all_tokens, key=key, reverse=True)
patterns = [token for token in all_tokens if token.re is not None]
self.literals = {}
Any chance of getting something like this applied to keep 2.7 working with Unicode?
Done in 1.0.13. Thank you for your suggestion.