Re: [Pyparsing] Word and Regex matching more than they should
From: Ralph C. <ra...@in...> - 2018-01-22 12:19:19
Hi Stuart,

> > unicode_printables = ''.join(filterfalse(str.isspace, \
> >     (chr(i) for i in range(33, sys.maxunicode))))
> > Now that's a handy little generator snippet…

It's buggy; it should be `sys.maxunicode + 1'.  :-)

Running it on Arch Linux with python 3.6.4-1, from 0 rather than 33,
and condensing the list to inclusive ranges, I get

    0000 0008
    000e 001b
    0021 0084
    0086 009f
    00a1 167f
    1681 1fff
    200b 2027
    202a 202e
    2030 205e
    2060 2fff
    3001 10ffff

That looks like more than I'd expect.  If the language you're parsing
doesn't specify what's valid then you might want to look at
https://en.wikipedia.org/wiki/Unicode_character_properties#General_Category
and pick the values you're interested in, and then filter for those,
e.g. using Python's unicodedata module.

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
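
A minimal sketch of the General Category filtering suggested above, not a
definitive implementation.  The names `WANTED_MAJOR_CLASSES',
`chars_in_categories' and `to_ranges' are made up for illustration, and
keeping the Letter, Mark, Number, Punctuation and Symbol classes is only
an assumption; pick whatever suits the language being parsed.

```python
import sys
import unicodedata

# Assumption: keep Letters, Marks, Numbers, Punctuation and Symbols.
# This drops separators (Z*) and the "other" classes (C*: controls,
# format characters, surrogates, private use, unassigned).
WANTED_MAJOR_CLASSES = frozenset("LMNPS")


def chars_in_categories(wanted=WANTED_MAJOR_CLASSES,
                        start=0, stop=sys.maxunicode + 1):
    """Yield each character whose General Category begins with a wanted letter."""
    for i in range(start, stop):  # stop is exclusive, hence the + 1
        c = chr(i)
        if unicodedata.category(c)[0] in wanted:
            yield c


def to_ranges(codepoints):
    """Condense ascending code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


unicode_printables = ''.join(chars_in_categories())
```

Something like `to_ranges(map(ord, unicode_printables))' then gives a
condensed listing in the same spirit as the one above, which makes it
easier to eyeball what a given choice of categories lets through.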