#105 + matches invalid token

Milestone: v1.0 (example)
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2018-01-22
Created: 2018-01-22
Private: No

Hi,

I'm trying to get a parser working for parsing Project Haystack ZINC format grids, but am having a problem with the parser incorrectly matching invalid strings.

In the example below, I'm trying to match a quantity (a numeric value with a unit). This is a basic test case with pyparsing 2.2.0 and Python 2.7; I can also reproduce it with pyparsing 2.0.3 as shipped in Debian Jessie.

import pyparsing as pp
class Quantity(object):
     def __init__(self, value, unit):
         self.value = value
         self.unit = unit
     def __repr__(self):
         return 'Q(%r, %r)' % (self.value, self.unit)

hs_unit         = pp.Regex(ur"[a-zA-Z%_/$\x80-\xffffffff]+")
hs_decimal      = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?").setParseAction(
                lambda toks : [float(toks[0].replace('_',''))])
hs_quantity     = (hs_decimal + hs_unit).setParseAction(
        lambda toks : [Quantity(toks[0], unit=toks[1])])

hs_quantity.parseString('123.123 abc')
  # -> ([Q(123.123, 'abc')], {})

hs_quantity.parseString('123.123 abc', parseAll=True)
  # -> ([Q(123.123, 'abc')], {})

Note that nowhere do those regexes permit spaces. hs_decimal allows a leading hyphen, digits, underscores, a decimal point and an exponent introduced by the letter e (either case), but not spaces. Likewise, hs_unit allows letters, some punctuation, and non-ASCII code points (0x80 and above), but not spaces (0x20).
hs_quantity is literally an hs_decimal followed immediately (no space) by an hs_unit… there should be no space. 123.123 abc should not match this pattern, ever.

Discussion

  • Stuart Longland

    Stuart Longland - 2018-01-22

    Okay, I've now determined this is down to how pyparsing behaves… by default, it matches (and discards) any whitespace preceding each token.

    The fix for this was somewhat ugly, but in essence, I had to wrap each pyparsing object with a wrapper that would call leaveWhitespace before returning it. Since it's nearly impossible to know where whitespace skipping is being applied, I've resorted to doing it on all pyparsing objects I use so I don't get caught out. The result is visually messy, but works.

    I think this "feature" could be better described; having a switch that can turn it on or off on an instance-wide basis would be useful too.
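
The leaveWhitespace workaround described in this comment can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual wrapper: the unit regex is simplified to ASCII only, the parse actions are omitted, and the `matches` helper is a hypothetical name introduced here for the demonstration.

```python
import pyparsing as pp

# Simplified stand-ins for the expressions in the report
# (assumption: ASCII-only unit regex, no parse actions).
hs_unit = pp.Regex(r"[a-zA-Z%_/$]+")
hs_decimal = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")

# Default behaviour: whitespace in front of hs_unit is skipped.
loose = hs_decimal + hs_unit

# leaveWhitespace() on the combined expression is also applied to
# copies of its sub-expressions, so no whitespace is skipped anywhere.
strict = (hs_decimal + hs_unit).leaveWhitespace()


def matches(expr, text):
    """Hypothetical helper: True if expr parses all of text."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


print(matches(loose, "123.123 abc"))   # True
print(matches(strict, "123.123 abc"))  # False
print(matches(strict, "123.123abc"))   # True
```

Calling leaveWhitespace() once on the outermost expression may be less noisy than wrapping every element individually, although it has to be repeated for each top-level expression.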

     
  • Paul McGuire

    Paul McGuire - 2018-01-22

    If you want to completely disable whitespace skipping, use ParserElement.setDefaultWhitespaceChars("") right after importing pyparsing. But whitespace skipping is a basic feature (no air quotes necessary) of pyparsing, and helps avoid having to sprinkle \s* all over your regexes.

    Closing this as "works as designed".
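
The setDefaultWhitespaceChars("") suggestion can be sketched as below. As before, the unit regex is simplified to ASCII only, the parse actions are omitted, and `matches` is a hypothetical helper used only for this demonstration. The setting is process-wide, and elements created after the call pick up the new default, so the call belongs directly after the import.

```python
import pyparsing as pp

# Disable whitespace skipping for every element created from here on.
# This is global state, so it affects all pyparsing users in the process.
pp.ParserElement.setDefaultWhitespaceChars("")

# Simplified stand-ins for the expressions in the report
# (assumption: ASCII-only unit regex, no parse actions).
hs_unit = pp.Regex(r"[a-zA-Z%_/$]+")
hs_decimal = pp.Regex(r"-?[\d_]+(\.[\d_]+)?([eE][+\-]?[\d_]+)?")
hs_quantity = hs_decimal + hs_unit


def matches(expr, text):
    """Hypothetical helper: True if expr parses all of text."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


print(matches(hs_quantity, "123.123 abc"))  # False
print(matches(hs_quantity, "123.123abc"))   # True
```

With skipping disabled globally, any grammar that does want to tolerate whitespace has to match it explicitly, which is the trade-off the reply above alludes to.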

     
  • Paul McGuire

    Paul McGuire - 2018-01-22
    • status: open --> closed
     
