Re: [Pyparsing] Word and Regex matching more than they should
From: Ralph C. <ra...@in...> - 2018-01-22 12:19:19
Hi Stuart,
> > unicode_printables = ''.join(filterfalse(str.isspace, \
> > (chr(i) for i in range(33, sys.maxunicode))))
>
> Now that's a handy little generator snippet…
It's buggy; it should be `sys.maxunicode + 1', because range() excludes
its stop value and so U+10FFFF is never generated. :-)
Running it on Arch Linux with python 3.6.4-1, from 0 rather than 33, and
condensing the list to inclusive ranges, I get
0000 0008
000e 001b
0021 0084
0086 009f
00a1 167f
1681 1fff
200b 2027
202a 202e
2030 205e
2060 2fff
3001 10ffff
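
For anyone wanting to reproduce a list like the one above, here is a minimal sketch of the condensing step. The helper name inclusive_ranges and the variable names are made up for illustration, not taken from the thread, and the ranges printed depend on the Unicode database shipped with your Python, so they may not match mine exactly.

```python
import sys
from itertools import filterfalse


def inclusive_ranges(codepoints):
    """Condense ascending integers into (first, last) inclusive pairs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            # cp directly follows the current run, so extend that run.
            ranges[-1][1] = cp
        else:
            # Gap (or first item): start a new run.
            ranges.append([cp, cp])
    return [(first, last) for first, last in ranges]


# Every code point from 0 up to and including sys.maxunicode
# for which str.isspace() is false.
non_space = filterfalse(str.isspace,
                        (chr(i) for i in range(sys.maxunicode + 1)))
printable_ranges = inclusive_ranges(map(ord, non_space))

for first, last in printable_ranges:
    print('%04x %04x' % (first, last))
```

Each gap between consecutive printed ranges is a run of code points that str.isspace() accepts.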
Those ranges cover more than I'd expect. If the language you're parsing
doesn't specify what's valid then you might want to look at
https://en.wikipedia.org/wiki/Unicode_character_properties#General_Category
and pick the values you're interested in, and then filter for those,
e.g. using Python's unicodedata module.
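
A rough sketch of that filtering, using unicodedata.category(). Which categories to keep is purely an assumption here (letters, marks, numbers, punctuation and symbols, i.e. major classes L, M, N, P and S); the names WANTED_MAJOR_CLASSES and chars_in_categories are hypothetical. Adjust the set to whatever your language actually allows.

```python
import sys
import unicodedata

# Assumption: keep Letters, Marks, Numbers, Punctuation and Symbols.
# This drops the C* classes (control, format, surrogate, private use,
# unassigned) and the Z* classes (space, line and paragraph separators).
WANTED_MAJOR_CLASSES = frozenset('LMNPS')


def chars_in_categories(major_classes=WANTED_MAJOR_CLASSES):
    """Return a string of every code point whose General Category
    starts with one of the given major-class letters."""
    return ''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        # category() returns a two-letter code such as 'Lu' or 'Zs';
        # the first letter is the major class.
        if unicodedata.category(c)[0] in major_classes
    )


unicode_printables = chars_in_categories()
```

The resulting string could then be passed to pyparsing's Word in place of the earlier unicode_printables; passing a narrower set, e.g. chars_in_categories('LN'), would keep only letters and numbers.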
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy