Thread: [Pyparsing] Word and Regex matching more than they should
From: Stuart L. <st...@vr...> - 2018-01-19 06:36:15
Hi all,

I've got a funny issue with trying to get pyparsing to parse a grammar for Project Haystack data. The data format I'm trying to parse is described here:
https://www.project-haystack.org/doc/Zinc

I'm slowly working my way up the parsing tree, but I'm finding pyparsing is tripping up on my grammar definitions. For the purpose of discussion, I've posted my grammar here:
https://github.com/vrtsystems/hszinc/blob/feature/WC-1173-add-list-support/hszinc/grammar.py

I'll admit up front I am new to pyparsing. Previously I used Parsimonious, but couldn't quite get it to handle the recursive nature of Project Haystack data types; in particular, I had trouble making it parse a filter string. A proof of concept for pyparsing worked, so I'm trying to get a more complete grammar working so that I can parse the data coming back from Project Haystack.

I'm finding though that some of my patterns are capturing more than I anticipated. If, for instance, I try to parse a quantity… a quantity is defined as a decimal number, followed by a unit string. The unit string may consist of letters, the symbols %, _, $ and /, or Unicode points 128 or above. Crucially, it may not match a space. I'm finding if I pass one in, it does:

> stuartl@vk4msl-ws ~/vrt/projects/widesky/sdk/hszinc $ ipython2
> Python 2.7.14 (default, Jan 17 2018, 17:36:45)
> Type "copyright", "credits" or "license" for more information.
>
> IPython 5.4.1 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
>
> In [1]: from hszinc import grammar
>
> In [2]: grammar.hs_quantity.parseString('123.45 notpartofquantity')
> Out[2]: ([BasicQuantity(123.45, 'notpartofquantity')], {})

That has taken ' notpartofquantity', and included it in the raw data for the Quantity. It should ignore that because of the space separation.

This breaks hs_meta, which is supposed to parse metadata pairs and markers, e.g.

    aString:"testing" aNumber:123.45 aMarker

Any ideas where I might be going wrong?

Thanks in advance.

Regards,
-- 
     _ ___             Stuart Longland - Systems Engineer
\  /|_) |                           T: +61 7 3535 9619
 \/ | \ |     38b Douglas Street    F: +61 7 3535 9699
   SYSTEMS    Milton QLD 4064       http://www.vrt.com.au
From: Stuart L. <st...@vr...> - 2018-01-22 00:06:38
On 19/01/18 16:16, Stuart Longland wrote:
> I'm finding if I pass one in, it does:
>> In [1]: from hszinc import grammar
>>
>> In [2]: grammar.hs_quantity.parseString('123.45 notpartofquantity')
>> Out[2]: ([BasicQuantity(123.45, 'notpartofquantity')], {})

Okay, something is *definitely* buggy:

> In [1]: import pyparsing as pp
>
> In [2]: class Quantity(object):
>    ...:     def __init__(self, value, unit):
>    ...:         self.value = value
>    ...:         self.unit = unit
>    ...:     def __repr__(self):
>    ...:         return 'Q(%r, %r)' % (self.value, self.unit)
>    ...:
>
> In [3]: hs_unit = pp.Regex(ur"[a-zA-Z%_/$\x80-\xffffffff]+")
>    ...: hs_decimal = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
>    ...:     lambda toks : [float(toks[0].replace('_',''))])
>    ...: hs_quantity = (hs_decimal + hs_unit).setParseAction(
>    ...:     lambda toks : [Quantity(toks[0], unit=toks[1])])
>    ...:
>
> In [4]: hs_quantity.parseString('123.123 abc')
> Out[4]: ([Q(123.123, 'abc')], {})
>
> In [5]: hs_quantity.parseString('123.123 abc', parseAll=True)
> Out[5]: ([Q(123.123, 'abc')], {})

*Nowhere*, in those patterns, is a space allowed. Yet, it passes it through.
From: Stuart L. <st...@vr...> - 2018-01-22 05:28:47
On 22/01/18 10:06, Stuart Longland wrote:
> Okay, something is *definitely* buggy:
>> In [4]: hs_quantity.parseString('123.123 abc')
>> Out[4]: ([Q(123.123, 'abc')], {})
>>
>> In [5]: hs_quantity.parseString('123.123 abc', parseAll=True)
>> Out[5]: ([Q(123.123, 'abc')], {})
> *Nowhere*, in those patterns, is a space allowed. Yet, it passes it
> through.

Okay, so the magic was `leaveWhitespace`… without that, it'll silently discard whitespace around tokens in the parser. Working around it is a tad ugly, but doable:
https://github.com/vrtsystems/hszinc/commit/4b517d679dc40766340eba87660a7bdf858a68fc

Regards,
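For readers who want to see the effect in isolation, here is a minimal sketch of the behaviour described above. It is not the code from the linked commit: the expressions `number`, `unit`, `loose` and `strict` are simplified stand-ins invented for illustration, and it assumes a pyparsing release that still accepts the camelCase method names used in this thread.

```python
import pyparsing as pp

# Simplified stand-ins for the decimal and unit patterns discussed above.
number = pp.Regex(r"-?\d+(\.\d+)?")
unit = pp.Regex(r"[a-zA-Z%_/$]+")

# Default behaviour: leading whitespace is skipped before each element,
# so the space between the number and the unit is silently consumed.
loose = number + unit

# copy() leaves the original `unit` untouched; leaveWhitespace() turns off
# whitespace skipping for the copied element only.
strict = number + unit.copy().leaveWhitespace()

print(loose.parseString("123.45 abc").asList())
print(strict.parseString("123.45abc").asList())

try:
    strict.parseString("123.45 abc")
except pp.ParseException as exc:
    print("rejected:", exc)
```

Applying `leaveWhitespace()` to a copy rather than to the shared expression is a design choice made here so that other parts of a grammar reusing `unit` keep the default behaviour.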
From: Paul M. <pt...@au...> - 2018-01-22 09:31:02
Stuart -

Yes, leaveWhitespace is what you need to use to suppress pyparsing's default behavior of skipping whitespace between expressions in your parser. IIRC, units was to be a trailing set of characters, with no intervening whitespace:

    # -*- coding: latin-1 -*-

    import pyparsing as pp
    import sys
    from itertools import filterfalse

    unicode_printables = ''.join(filterfalse(str.isspace, (chr(i) for i in range(33, sys.maxunicode))))
    unit_chars = unicode_printables

    units = pp.Word(unit_chars)
    numeric_value = pp.pyparsing_common.number("value") + pp.Optional(units.leaveWhitespace()("units"))

    numeric_value.runTests("""\
        12345.6
        12345.6mph
        12345.6ft²
        12345.7 mph
        """)

Prints:

    12345.6
    [12345.6]
    - value: 12345.6


    12345.6mph
    [12345.6, 'mph']
    - units: 'mph'
    - value: 12345.6


    12345.6ft²
    [12345.6, 'ft²']
    - units: 'ft²'
    - value: 12345.6


    12345.7 mph
            ^
    FAIL: Expected end of text (at char 8), (line:1, col:9)

Sorry to not have gotten back to you sooner, but it looks like you have worked this out for yourself. I had a look at your first efforts at a pyparsing parser for ZINC when you first sent this out, but when I went to look for it again, it was no longer on Github. If you can repost a working link I may be able to help you tune up your parser a bit.

-- Paul McGuire
From: Stuart L. <st...@vr...> - 2018-01-22 09:34:44
Hi Paul,

On 22/01/18 19:17, Paul McGuire wrote:
> Stuart -
>
> Yes, leaveWhitespace is what you need to use to suppress pyparsing's default behavior of skipping whitespace between expressions in your parser. IIRC, units was to be a trailing set of characters, with no intervening whitespace:
>
> # -*- coding: latin-1 -*-
>
> import pyparsing as pp
> import sys
> from itertools import filterfalse
>
> unicode_printables = ''.join(filterfalse(str.isspace, (chr(i) for i in range(33, sys.maxunicode))))
> unit_chars = unicode_printables

Now that's a handy little generator snippet… I've been doing various ugly kludges to try and generate all the code points but that is nice and simple.

> Sorry to not have gotten back to you sooner, but it looks like you have worked this out for yourself. I had a look at your first efforts at a pyparsing parser for ZINC when you first sent this out, but when I went to look for it again, it was no longer on Github. If you can repost a working link I may be able to help you tune up your parser a bit.

No problems… while I'm on a deadline, I can understand that on this forum, we're all more or less volunteers, hence I just kept working at the problem. Either someone would reply or I'd figure it out; either way no harm is done. :-)

Prior to using `pyparsing`, that file just stored the grammar definitions. `pyparsing`, with the `.setParseAction` method, more or less does nearly all of the parsing as well, so it no longer made sense to call it "grammar", as it was more than that. The file got renamed to "zincparser.py".

https://github.com/vrtsystems/hszinc/blob/feature/WC-1173-add-list-support/hszinc/zincparser.py

Hopefully things are a little cleaner than my first attempt, but there's still lots to be learned. `pyparsing` is quite a powerful little library; I wish I had stumbled on it sooner.

I've managed to get tests to pass once again, so that's a plus. Test coverage fell, but that's because a lot of code was able to be thrown out thanks to pyparsing.

https://travis-ci.org/vrtsystems/hszinc/builds/331703708

Regards,
From: Paul M. <pt...@au...> - 2018-01-22 09:42:13
Your sample code was right in front of me!

    import pyparsing as pp

    class Quantity(object):
        def __init__(self, value, unit):
            self.value = value
            self.unit = unit
        def __repr__(self):
            return 'Q(%r, %r)' % (self.value, self.unit)

    # hs_unit = pp.Regex(r"[a-zA-Z%_/$\x80-\x{:x}]+".format(sys.maxunicode))
    hs_unit = pp.Regex(r"[a-zA-Z%_/$\x80-\xffffff]+").setName("unit-string")
    hs_decimal = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
        lambda toks : [float(toks[0].replace('_',''))]).setName("decimal-numeric")
    hs_quantity = (hs_decimal("value") + hs_unit.leaveWhitespace()("unit")).setParseAction(
        lambda toks: Quantity(**toks))

    hs_quantity.runTests("""\
        123.123abc
        123.123 abc
        """)

Oddly enough, I could not specify the unicode range that you did, nor does sys.maxunicode work. This actually looks like a Python bug. I also see that your units is not quite as liberal as the unicode_printables one that I wrote, accepting only '%_/$' punctuation characters. I also see that your decimal expression accepts '_' spacers - the pyparsing_common.number expression that I used in the previous reply does not do this.

I made a few other tweaks to your parser:
- added setName() calls, so that exceptions are a bit clearer looking ("expected unit-string" instead of "expected Re:('[a-zA-Z%_/$\\x80-\\xffffff]+')")
- used results names in hs_quantity so that the name-to-expression mapping was clearer (note that setName() sets the name of the expression itself, while setting results names sets the name to be used for the respective parsed results)

Out of curiosity, why Python2? I would only use Py2 for legacy work at this point, not for new projects.

-- Paul
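One possible explanation for the range problem mentioned above, offered as an assumption rather than something confirmed in this thread: Python's `re` module reads exactly two hex digits after `\x`, so `\x80-\xffffff` is taken as the range `\x80-\xff` followed by literal `f` characters. The sketch below (Python 3, standard library only; the names `short_range`, `full_range` and `built_range` are made up for illustration) compares that pattern with two ways of spelling a range that reaches the top of Unicode.

```python
import re
import sys

# Two-digit \x escape: parsed as the range \x80-\xff plus literal 'f's.
short_range = re.compile(r"[a-zA-Z%_/$\x80-\xffffff]+")

# \U takes exactly eight hex digits, so this range ends at U+10FFFF.
full_range = re.compile(r"[a-zA-Z%_/$\x80-\U0010ffff]+")

# The same range built from sys.maxunicode instead of a hard-coded escape.
# This is a normal (non-raw) string, so "\x80" is already the character U+0080.
built_range = re.compile("[a-zA-Z%_/$\x80-" + chr(sys.maxunicode) + "]+")

# "ft\u00b2" is ft², "k\u03a9" is k followed by a Greek capital omega.
for unit in ("ft\u00b2", "k\u03a9", "mph"):
    print(
        unit,
        bool(short_range.fullmatch(unit)),
        bool(full_range.fullmatch(unit)),
        bool(built_range.fullmatch(unit)),
    )
```

The superscript two sits below U+0100, so all three patterns accept it; the omega is above U+00FF, so only the two wider ranges do.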
From: Stuart L. <st...@vr...> - 2018-01-22 09:51:03
Hi Paul,

On 22/01/18 19:42, Paul McGuire wrote:
> Oddly enough, I could not specify the unicode range that you did, nor does sys.maxunicode work. This actually looks like a Python bug. I also see that your units is not quite as liberal as the unicode_printables one that I wrote, accepting only '%_/$' punctuation characters. I also see that your decimal expression accepts '_' spacers - the pyparsing_common.number expression that I used in the previous reply does not do this.
>
> I made a few other tweaks to your parser:
> - added setName() calls, so that exceptions are a bit clearer looking ("expected unit-string" instead of "expected Re:('[a-zA-Z%_/$\\x80-\\xffffff]+')")
> - used results names in hs_quantity so that the name-to-expression mapping was clearer (note that setName() sets the name of the expression itself, while setting results names sets the name to be used for the respective parsed results)

Yeah, I've slowly been figuring those things out; the latest code actually does make use of .setName quite a bit.

> Out of curiosity, why Python2? I would only use Py2 for legacy work at this point, not for new projects.

At the moment, we still have a legacy code base that uses Python 2.7… it is hoped (maybe this year, but who knows) that I can make the jump to 3.4+. We recently (late last year) dropped support for Debian Wheezy, which was the primary road block to adopting Python 3.x.

Naturally though, I have to try and justify to the powers that be why we need to address the remaining technical debt. :-)

For what it's worth, this particular library is written for both. While we use it in production on Python 2.7, others use it regularly on 3.4 and up. The unit tests cover 2.7, 3.4 and 3.5. I should add 3.6 in there too.
From: Ralph C. <ra...@in...> - 2018-01-22 12:19:19
Hi Stuart,

> > unicode_printables = ''.join(filterfalse(str.isspace, \
> >     (chr(i) for i in range(33, sys.maxunicode))))
>
> Now that's a handy little generator snippet…

It's buggy; it should be `sys.maxunicode + 1'. :-)

Running it on Arch Linux with python 3.6.4-1, from 0 rather than 33, and condensing the list to inclusive ranges, I get

    0000 0008
    000e 001b
    0021 0084
    0086 009f
    00a1 167f
    1681 1fff
    200b 2027
    202a 202e
    2030 205e
    2060 2fff
    3001 10ffff

That looks like more than I'd expect. If the language you're parsing doesn't specify what's valid then you might want to look at https://en.wikipedia.org/wiki/Unicode_character_properties#General_Category and pick the values you're interested in, and then filter for those, e.g. using Python's unicodedata module.

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
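The suggestion above could be sketched roughly as follows, using only the standard library on Python 3. The helper names `code_points_in` and `to_ranges` are invented for this example, and the choice of categories (letters, numbers and symbols) is purely hypothetical; the right set depends on what the grammar being parsed actually allows.

```python
import sys
import unicodedata


def code_points_in(categories, start=0x80):
    """Return code points >= start whose General Category starts with a given prefix."""
    prefixes = tuple(categories)
    return [
        cp
        for cp in range(start, sys.maxunicode + 1)  # + 1: range() excludes its end
        if unicodedata.category(chr(cp)).startswith(prefixes)
    ]


def to_ranges(code_points):
    """Condense sorted code points into inclusive (first, last) pairs."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(pair) for pair in ranges]


# Hypothetical choice: letters (L*), numbers (N*) and symbols (S*) only.
unit_code_points = code_points_in(("L", "N", "S"))
unit_chars = "".join(map(chr, unit_code_points))
print(len(unit_code_points), "code points in", len(to_ranges(unit_code_points)), "ranges")
```

A string such as `unit_chars` can then be handed to `pp.Word(...)` in place of the broader `unicode_printables`.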
From: Ralph C. <ra...@in...> - 2018-01-22 12:24:19
Hi,

> hs_decimal = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")

I think this matches

    _
    ___
    -_
    _._
    _E_

and so on.

-- 
Cheers, Ralph.
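A quick way to check this with the standard `re` module, plus one possible tightening. The `strict` pattern is only a sketch: it assumes every run of digits must start with a real digit and may then contain `_` separators, which should be checked against the Zinc grammar before relying on it.

```python
import re

# The pattern from the thread: any mix of digits and underscores is accepted.
loose = re.compile(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")

# Assumed tightening: each digit run must begin with a digit.
strict = re.compile(r"-?\d[\d_]*(\.\d[\d_]*)?([eE][+\-]?\d[\d_]*)?")

for text in ("_", "___", "-_", "_._", "_E_", "1_000.5", "-3e-2"):
    print(repr(text), bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```

With the stricter pattern, the `float(toks[0].replace('_', ''))` parse action used earlier in the thread is never handed an empty or sign-only string.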