Re: [Pyparsing] Word and Regex matching more than they should

Hi Stuart,

> >     unicode_printables = ''.join(filterfalse(str.isspace, \
> >         (chr(i) for i in range(33, sys.maxunicode))))
>
> Now that's a handy little generator snippet…

It's buggy; it should be `sys.maxunicode + 1', since range() excludes
its upper bound and so U+10FFFF is never tested.  :-)

Running it on Arch Linux with python 3.6.4-1, from 0 rather than 33, and
condensing the list to inclusive ranges, I get

    0000  0008
    000e  001b
    0021  0084
    0086  009f
    00a1  167f
    1681  1fff
    200b  2027
    202a  202e
    2030  205e
    2060  2fff
    3001  10ffff
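
For reference, a rough sketch of how a list like the one above can be
produced.  The helper name inclusive_ranges() is just something made up
for illustration, and the exact ranges printed will depend on the
Unicode version bundled with your Python, so don't expect them to match
mine line for line.

```python
import sys

def inclusive_ranges(codepoints):
    """Collapse sorted ints into (first, last) inclusive pairs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extends the current run
        else:
            ranges.append([cp, cp])     # starts a new run
    return [tuple(r) for r in ranges]

# Every code point, 0 to sys.maxunicode inclusive, that str.isspace rejects.
non_space = (i for i in range(sys.maxunicode + 1) if not chr(i).isspace())
ranges = inclusive_ranges(non_space)

for first, last in ranges:
    print('    %04x  %04x' % (first, last))
```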

That list covers more than I'd expect: it still includes control
characters, surrogates, and unassigned code points.  If the language
you're parsing doesn't specify what's valid then you might want to look at
https://en.wikipedia.org/wiki/Unicode_character_properties#General_Category
and pick the values you're interested in, and then filter for those,
e.g. using Python's unicodedata module.
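
Something along these lines, as a minimal sketch.  The choice of major
classes in WANTED (letters, marks, numbers, punctuation, symbols) is
only my assumption about what "printable" might mean for you, and
chars_in_categories() is a hypothetical helper, so adjust both to suit
the language being parsed.

```python
import sys
import unicodedata

# Assumed selection: first letter of the two-letter General Category.
# L=letter, M=mark, N=number, P=punctuation, S=symbol.  This leaves out
# Z (separators) and C (control, format, surrogate, unassigned).
WANTED = {'L', 'M', 'N', 'P', 'S'}

def chars_in_categories(major_classes):
    """Return a string of every character whose major class is wanted."""
    return ''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(c)[0] in major_classes
    )

unicode_printables = chars_in_categories(WANTED)
```

The resulting string could then be handed to Word in place of the
original unicode_printables.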

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy