Thread: [Pyparsing] Word and Regex matching more than they should

pyparsing-users

[Pyparsing] Word and Regex matching more than they should

From: Stuart L. <st...@vr...> - 2018-01-19 06:36:15

Hi all,

I've got a funny issue with trying to get pyparsing to parse a grammar
for Project Haystack data.

The data format I'm trying to parse is described here:
https://www.project-haystack.org/doc/Zinc

I'm slowly working my way up the parsing tree, but I'm finding pyparsing
is tripping up on my grammar definitions.  For the purpose of
discussion, I've posted my grammar here:

https://github.com/vrtsystems/hszinc/blob/feature/WC-1173-add-list-support/hszinc/grammar.py

I'll admit up front I am new to pyparsing.  Previously I used
Parsimonious, but couldn't quite get it to handle the recursive nature of
Project Haystack data types; in particular, I had trouble making it
parse a filter string.

A proof of concept for pyparsing worked, so I'm trying to get a more
complete grammar working so that I can parse the data coming back from
Project Haystack.  I'm finding though that some of my patterns are
capturing more than I anticipated.

If, for instance, I try to parse a quantity… a quantity is defined as a
decimal number, followed by a unit string.  The unit string may consist
of letters, the symbols %, _, $ and /, or Unicode points 128 or above.
Crucially, it may not match a space.

I'm finding if I pass one in, it does:
> stuartl@vk4msl-ws ~/vrt/projects/widesky/sdk/hszinc $ ipython2 
> Python 2.7.14 (default, Jan 17 2018, 17:36:45) 
> Type "copyright", "credits" or "license" for more information.
> 
> IPython 5.4.1 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> 
> In [1]: from hszinc import grammar
> 
> In [2]: grammar.hs_quantity.parseString('123.45 notpartofquantity')
> Out[2]: ([BasicQuantity(123.45, 'notpartofquantity')], {})

That has taken ' notpartofquantity', and included it in the raw data for
the Quantity.  It should ignore that because of the space separation.

This breaks hs_meta, which is supposed to parse metadata pairs and
markers, e.g.

	aString:"testing" aNumber:123.45 aMarker

Any ideas where I might be going wrong?
Thanks in advance.
Regards,
-- 
     _ ___             Stuart Longland - Systems Engineer
\  /|_) |                           T: +61 7 3535 9619
 \/ | \ |     38b Douglas Street    F: +61 7 3535 9699
   SYSTEMS    Milton QLD 4064       http://www.vrt.com.au

Re: [Pyparsing] Word and Regex matching more than they should

From: Stuart L. <st...@vr...> - 2018-01-22 00:06:38

On 19/01/18 16:16, Stuart Longland wrote:
> I'm finding if I pass one in, it does:
>> stuartl@vk4msl-ws ~/vrt/projects/widesky/sdk/hszinc $ ipython2 
>> Python 2.7.14 (default, Jan 17 2018, 17:36:45) 
>> Type "copyright", "credits" or "license" for more information.
>>
>> IPython 5.4.1 -- An enhanced Interactive Python.
>> ?         -> Introduction and overview of IPython's features.
>> %quickref -> Quick reference.
>> help      -> Python's own help system.
>> object?   -> Details about 'object', use 'object??' for extra details.
>>
>> In [1]: from hszinc import grammar
>>
>> In [2]: grammar.hs_quantity.parseString('123.45 notpartofquantity')
>> Out[2]: ([BasicQuantity(123.45, 'notpartofquantity')], {})

Okay, something is *definitely* buggy:
> stuartl@vk4msl-ws ~/vrt/projects/widesky/sdk/hszinc $ ipython2
> Python 2.7.14 (default, Jan 17 2018, 17:36:45) 
> Type "copyright", "credits" or "license" for more information.
> 
> IPython 5.4.1 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> 
> In [1]: import pyparsing as pp
> 
> In [2]: class Quantity(object):
>    ...:     def __init__(self, value, unit):
>    ...:         self.value = value
>    ...:         self.unit = unit
>    ...:     def __repr__(self):
>    ...:         return 'Q(%r, %r)' % (self.value, self.unit)
>    ...: 
> 
> In [3]: hs_unit         = pp.Regex(ur"[a-zA-Z%_/$\x80-\xffffffff]+")
>    ...: hs_decimal      = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
>    ...:                 lambda toks : [float(toks[0].replace('_',''))])
>    ...: hs_quantity     = (hs_decimal + hs_unit).setParseAction(
>    ...:         lambda toks : [Quantity(toks[0], unit=toks[1])])
>    ...: 
> 
> In [4]: hs_quantity.parseString('123.123 abc')
> Out[4]: ([Q(123.123, 'abc')], {})
> 
> In [5]: hs_quantity.parseString('123.123 abc', parseAll=True)
> Out[5]: ([Q(123.123, 'abc')], {})

*Nowhere*, in those patterns, is a space allowed.  Yet, it passes it
through.

Re: [Pyparsing] Word and Regex matching more than they should

From: Stuart L. <st...@vr...> - 2018-01-22 05:28:47

On 22/01/18 10:06, Stuart Longland wrote:
> Okay, something is *definitely* buggy:
>> stuartl@vk4msl-ws ~/vrt/projects/widesky/sdk/hszinc $ ipython2
>> Python 2.7.14 (default, Jan 17 2018, 17:36:45) 
>> Type "copyright", "credits" or "license" for more information.
>>
>> IPython 5.4.1 -- An enhanced Interactive Python.
>> ?         -> Introduction and overview of IPython's features.
>> %quickref -> Quick reference.
>> help      -> Python's own help system.
>> object?   -> Details about 'object', use 'object??' for extra details.
>>
>> In [1]: import pyparsing as pp
>>
>> In [2]: class Quantity(object):
>>    ...:     def __init__(self, value, unit):
>>    ...:         self.value = value
>>    ...:         self.unit = unit
>>    ...:     def __repr__(self):
>>    ...:         return 'Q(%r, %r)' % (self.value, self.unit)
>>    ...: 
>>
>> In [3]: hs_unit         = pp.Regex(ur"[a-zA-Z%_/$\x80-\xffffffff]+")
>>    ...: hs_decimal      = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
>>    ...:                 lambda toks : [float(toks[0].replace('_',''))])
>>    ...: hs_quantity     = (hs_decimal + hs_unit).setParseAction(
>>    ...:         lambda toks : [Quantity(toks[0], unit=toks[1])])
>>    ...: 
>>
>> In [4]: hs_quantity.parseString('123.123 abc')
>> Out[4]: ([Q(123.123, 'abc')], {})
>>
>> In [5]: hs_quantity.parseString('123.123 abc', parseAll=True)
>> Out[5]: ([Q(123.123, 'abc')], {})
> *Nowhere*, in those patterns, is a space allowed.  Yet, it passes it
> through.

Okay, so the magic was `leaveWhitespace`… without that, it'll silently
discard whitespace around tokens in the parser.  Working around it is
a tad ugly, but doable:

https://github.com/vrtsystems/hszinc/commit/4b517d679dc40766340eba87660a7bdf858a68fc
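
A minimal sketch of the behaviour described above, using simplified stand-in patterns (`number`, `unit`, `loose` and `tight` are illustrative names, not taken from hszinc). By default pyparsing skips whitespace before each expression; calling `leaveWhitespace()` on the unit expression should turn that off for that expression only.

```python
import pyparsing as pp

# Simplified stand-ins for the hszinc patterns (illustrative only).
number = pp.Regex(r"-?\d+(\.\d+)?")
unit = pp.Regex(r"[a-zA-Z%_/$]+")

# Default behaviour: whitespace before each expression is skipped,
# so "123.45 abc" parses even though neither regex allows a space.
loose = number + unit
loose_tokens = loose.parseString("123.45 abc").asList()

# leaveWhitespace() disables that skipping for the unit expression,
# so the unit has to start exactly where the number ended.
tight = number + unit.copy().leaveWhitespace()
tight_tokens = tight.parseString("123.45abc").asList()

# With a space between number and unit, the tight version should fail.
try:
    tight.parseString("123.45 abc")
    rejected = False
except pp.ParseException:
    rejected = True

print(loose_tokens, tight_tokens, rejected)
```

The `copy()` call keeps the original `unit` expression untouched, in case it is reused elsewhere with the default whitespace handling.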

Regards,

Re: [Pyparsing] Word and Regex matching more than they should

From: Paul M. <pt...@au...> - 2018-01-22 09:31:02

Stuart -

Yes, leaveWhitespace is what you need to use to suppress pyparsing's default behavior of skipping whitespace between expressions in your parser. IIRC, units was to be a trailing set of characters, with no intervening whitespace:

    # -*- coding: latin-1 -*-

    import pyparsing as pp
    import sys
    from itertools import filterfalse

    unicode_printables = ''.join(filterfalse(str.isspace, (chr(i) for i in range(33, sys.maxunicode))))
    unit_chars = unicode_printables
    units = pp.Word(unit_chars)
    numeric_value = pp.pyparsing_common.number("value") + pp.Optional(units.leaveWhitespace()("units"))

    numeric_value.runTests("""\
       12345.6
       12345.6mph
       12345.6ft²
       12345.7 mph
    """)

Prints:

    12345.6
    [12345.6]
    - value: 12345.6


    12345.6mph
    [12345.6, 'mph']
    - units: 'mph'
    - value: 12345.6


    12345.6ft²
    [12345.6, 'ft²']
    - units: 'ft²'
    - value: 12345.6


    12345.7 mph
            ^
    FAIL: Expected end of text (at char 8), (line:1, col:9)

Sorry to not have gotten back to you sooner, but it looks like you have worked this out for yourself. I had a look at your first efforts at a pyparsing parser for ZINC when you first sent this out, but when I went to look for it again, it was no longer on Github. If you can repost a working link I may be able to help you tune up your parser a bit.

-- Paul McGuire




Re: [Pyparsing] Word and Regex matching more than they should

From: Stuart L. <st...@vr...> - 2018-01-22 09:34:44

Hi Paul,

On 22/01/18 19:17, Paul McGuire wrote:
> Stuart -
> 
> Yes, leaveWhitespace is what you need to use to suppress pyparsing's default behavior of skipping whitespace between expressions in your parser. IIRC, units was to be a trailing set of characters, with no intervening whitespace:
> 
>     # -*- coding: latin-1 -*-
> 
>     import pyparsing as pp
>     import sys
>     from itertools import filterfalse
> 
>     unicode_printables = ''.join(filterfalse(str.isspace, (chr(i) for i in range(33, sys.maxunicode))))
>     unit_chars = unicode_printables

Now that's a handy little generator snippet… I've been doing various
ugly kludges to try and generate all the code points but that is nice
and simple.

>     units = pp.Word(unit_chars)
>     numeric_value = pp.pyparsing_common.number("value") + pp.Optional(units.leaveWhitespace()("units"))
> 
>     numeric_value.runTests("""\
>        12345.6
>        12345.6mph
>        12345.6ft²
>        12345.7 mph
>     """)
> 
> Prints:
> 
>     12345.6
>     [12345.6]
>     - value: 12345.6
> 
> 
>     12345.6mph
>     [12345.6, 'mph']
>     - units: 'mph'
>     - value: 12345.6
> 
> 
>     12345.6ft²
>     [12345.6, 'ft²']
>     - units: 'ft²'
>     - value: 12345.6
> 
> 
>     12345.7 mph
>             ^
>     FAIL: Expected end of text (at char 8), (line:1, col:9)
> 
> Sorry to not have gotten back to you sooner, but it looks like you have worked this out for yourself. I had a look at your first efforts at a pyparsing parser for ZINC when you first sent this out, but when I went to look for it again, it was no longer on Github. If you can repost a working link I may be able to help you tune up your parser a bit.

No problems… while I'm on a deadline, I can understand that on this
forum, we're all more or less volunteers, hence I just kept working at
the problem.  Either someone would reply or I'd figure it out; either
way no harm is done. :-)

Prior to using `pyparsing`, that file just stored the grammar
definitions.  `pyparsing`, with the `.setParseAction` method, more or
less does nearly all of the parsing as well, so it no longer made sense
to call it "grammar", as it was more than that.  The file got renamed to
"zincparser.py".

https://github.com/vrtsystems/hszinc/blob/feature/WC-1173-add-list-support/hszinc/zincparser.py

Hopefully things are a little cleaner than my first attempt, but there's
still lots to be learned.  `pyparsing` is quite a powerful little
library; I wish I had stumbled on it sooner.

I've managed to get tests to pass once again, so that's a plus.  Test
coverage fell, but that's because a lot of code was able to be thrown
out thanks to pyparsing.

https://travis-ci.org/vrtsystems/hszinc/builds/331703708

Regards,

Re: [Pyparsing] Word and Regex matching more than they should

From: Paul M. <pt...@au...> - 2018-01-22 09:42:13

Your sample code was right in front of me!

import pyparsing as pp
class Quantity(object):
     def __init__(self, value, unit):
         self.value = value
         self.unit = unit
     def __repr__(self):
         return 'Q(%r, %r)' % (self.value, self.unit)

# hs_unit         = pp.Regex(r"[a-zA-Z%_/$\x80-\x{:x}]+".format(sys.maxunicode))
hs_unit         = pp.Regex(r"[a-zA-Z%_/$\x80-\xffffff]+").setName("unit-string")
hs_decimal      = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
                lambda toks : [float(toks[0].replace('_',''))]).setName("decimal-numeric")
hs_quantity     = (hs_decimal("value") + hs_unit.leaveWhitespace()("unit")).setParseAction(
                lambda toks: Quantity(**toks))

hs_quantity.runTests("""\
123.123abc
123.123 abc
""")

Oddly enough, I could not specify the unicode range that you did, nor does sys.maxunicode work. This actually looks like a Python bug. I also see that your units is not quite as liberal as the unicode_printables one that I wrote, accepting only '%_/$' punctuation characters. I also see that your decimal expression accepts '_' spacers - the pyparsing_common.number expression that I used in the previous reply does not do this.
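
A possible explanation for the range problem, sketched for Python 3 only: in the `re` module a `\x` escape takes exactly two hex digits, so `\xffffff` is read as `\xff` followed by literal `f` characters, and `\x10ffff` is read as `\x10` followed by `ffff`. The names `narrow` and `wide` are illustrative, and using the eight-digit `\U` escape is a suggested alternative, not something confirmed in this thread.

```python
import re
import sys

# "\xff" consumes two hex digits; the trailing "ffff" are literal "f"s,
# so this class effectively stops at U+00FF.
narrow = re.compile(r"[a-zA-Z%_/$\x80-\xffffff]+")

# Python 3's re accepts \UXXXXXXXX (eight hex digits) in str patterns.
wide = re.compile(r"[a-zA-Z%_/$\x80-\U0010ffff]+")

# U+03A9 (Greek capital omega) lies above U+00FF.
omega_narrow = narrow.fullmatch("k\u03a9")
omega_wide = wide.fullmatch("k\u03a9")
spaced = wide.fullmatch("k W")

# Formatting sys.maxunicode after "\x" yields "\x10ffff"; "\x10" sorts
# below "\x80", so the range is reversed and re refuses to compile it.
try:
    re.compile("[\\x80-\\x{:x}]+".format(sys.maxunicode))
    reversed_range_rejected = False
except re.error:
    reversed_range_rejected = True

print(omega_narrow, omega_wide, spaced, reversed_range_rejected)
```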

I made a few other tweaks to your parser:
- added setName() calls, so that exceptions are a bit clearer looking ("expected unit-string" instead of "expected Re:('[a-zA-Z%_/$\\x80-\\xffffff]+')")
- used results names in hs_quantity so that the name-to-expression mapping was clearer (note that setName() sets the name of the expression itself, while setting results names sets the name to be used for the respective parsed results)

Out of curiosity, why Python2? I would only use Py2 for legacy work at this point, not for new projects.

-- Paul




Re: [Pyparsing] Word and Regex matching more than they should

From: Stuart L. <st...@vr...> - 2018-01-22 09:51:03

Hi Paul,
On 22/01/18 19:42, Paul McGuire wrote:
> Oddly enough, I could not specify the unicode range that you did, nor does sys.maxunicode work. This actually looks like a Python bug. I also see that your units is not quite as liberal as the unicode_printables one that I wrote, accepting only '%_/$' punctuation characters. I also see that your decimal expression accepts '_' spacers - the pyparsing_common.number expression that I used in the previous reply does not do this.
> 
> I made a few other tweaks to your parser:
> - added setName() calls, so that exceptions are a bit clearer looking ("expected unit-string" instead of "expected Re:('[a-zA-Z%_/$\\x80-\\xffffff]+')")
> - used results names in hs_quantity so that the name-to-expression mapping was clearer (note that setName() sets the name of the expression itself, while setting results names sets the name to be used for the respective parsed results)

Yeah, I've slowly been figuring those things out, latest code actually
does make use of .setName quite a bit.

> Out of curiosity, why Python2? I would only use Py2 for legacy work at this point, not for new projects.

At the moment, we still have a legacy code base that uses Python 2.7… it
is hoped (maybe this year, but who knows) that I can make the jump to 3.4+.

We recently (late last year) dropped support for Debian Wheezy, which
was the primary road block to adopting Python 3.x.  Naturally though, I
have to try and justify to the powers-that-be why we need to address the
remaining technical debt. :-)

For what it's worth, this particular library is written for both.  While
we use it in production on Python 2.7, others use it regularly on 3.4
and up.  The unit tests cover 2.7, 3.4 and 3.5.  I should add 3.6 in
there too.

Re: [Pyparsing] Word and Regex matching more than they should

From: Ralph C. <ra...@in...> - 2018-01-22 12:19:19

Hi Stuart,

> >     unicode_printables = ''.join(filterfalse(str.isspace, \
> >         (chr(i) for i in range(33, sys.maxunicode))))
>
> Now that's a handy little generator snippet…

It's buggy; it should be `sys.maxunicode + 1'.  :-)

Running it on Arch Linux with python 3.6.4-1, from 0 rather than 33, and
condensing the list to inclusive ranges, I get

    0000  0008
    000e  001b
    0021  0084
    0086  009f
    00a1  167f
    1681  1fff
    200b  2027
    202a  202e
    2030  205e
    2060  2fff
    3001  10ffff

That looks like more than I'd expect.  If the language you're parsing
doesn't specify what's valid then you might want to look at
https://en.wikipedia.org/wiki/Unicode_character_properties#General_Category
and pick the values you're interested in, and then filter for those,
e.g. using Python's unicodedata module.
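
A rough sketch of that idea for Python 3. The helper name `chars_in_categories` and the choice of letters, numbers and symbols are assumptions for illustration; they are not what Project Haystack specifies for units.

```python
import sys
import unicodedata


def chars_in_categories(prefixes):
    """Return every code point whose General Category starts with one of prefixes."""
    return "".join(
        ch
        for ch in map(chr, range(sys.maxunicode + 1))  # + 1 so U+10FFFF is included
        if unicodedata.category(ch).startswith(prefixes)
    )


# Letters (L*), numbers (N*) and symbols (S*): an illustrative choice only.
unit_chars = chars_in_categories(("L", "N", "S"))

print(len(unit_chars))
```

Control characters such as U+0000 fall in category `Cc` and spaces in `Zs`, so neither ends up in `unit_chars`. Punctuation such as `%`, `/` and `_` is excluded as well with this particular choice, so it would have to be added back explicitly if wanted.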

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Re: [Pyparsing] Word and Regex matching more than they should

From: Ralph C. <ra...@in...> - 2018-01-22 12:24:19

Hi,

> hs_decimal      = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")

I think this matches

    _
    ___
    -_
    _._
    _E_

and so on.
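
This can be checked directly with `re.fullmatch`. The tightened pattern below is only one possible fix, on the assumption that every run of digits should start with an actual digit; `STRICT_DECIMAL` is a hypothetical name and the pattern has not been checked against the Zinc specification.

```python
import re

# The pattern quoted above, unchanged.
ORIGINAL_DECIMAL = re.compile(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")

# Assumed fix: each digit run must begin with a digit, then digits or "_".
STRICT_DECIMAL = re.compile(r"-?\d[\d_]*(\.\d[\d_]*)?([eE][+\-]?\d[\d_]*)?")

suspects = ["_", "___", "-_", "_._", "_E_"]
valid = ["123.45", "-1_000", "1_000.5e-3"]

original_accepts = [s for s in suspects if ORIGINAL_DECIMAL.fullmatch(s)]
strict_accepts = [s for s in suspects if STRICT_DECIMAL.fullmatch(s)]
strict_valid = [s for s in valid if STRICT_DECIMAL.fullmatch(s)]

print(original_accepts, strict_accepts, strict_valid)
```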
