confounded by '+' operator

2005-09-18
2013-05-14
  • Frithiof Andreas Jensen

    I am having real trouble parsing some syslog entries.

    Basically, I have an expression 'A' for one part and a second expression 'B' for the second part. Each expression on its own works fine - but combining them with '+' means 'A' *immediately followed by* 'B'; what I want is 'A' *skip the crap in between* 'B'.

    The getNTPServers example does that, so why can't I?

    Example:

    from pyparsing import *

    log="""Aug 18 11:38:31 localhost snmpd[8569]: NET-SNMP version 5.2.1.2
    Aug 18 11:39:01 localhost /USR/SBIN/CRON[8579]: (root) CMD (  [ -d /var/lib/php4 ] && find /var/lib/php4/ -type f -cmin +$(/usr/lib/php4/maxlifetime) -print0 | xargs -r -0 rm)
    Aug 18 11:39:02 localhost snmpd[8569]: Connection from UDP/IPv6: [::1]:32932
    Aug 18 11:39:42 localhost last message repeated 2 times
    Aug 18 11:40:01 localhost /USR/SBIN/CRON[8644]: (www-data) CMD (/usr/share/cacti/site/poller.php >/dev/null 2>&1)
    Aug 18 11:40:01 localhost snmpd[8569]: Connection from UDP/IPv6: [::1]:32932
    Aug 18 11:40:25 localhost last message repeated 6 times
    Aug 18 11:42:15 localhost snmptrapd[8571]: 2005-08-18 11:42:15 UDP/IPv6: [3ffe:100:3:282::3]:32772 [UDP/IPv6: [3ffe:100:3:282::3]:32772]: RFC1213-MIB::sysUpTime.0 = Timeticks: (53512) 0:08:55.12^ISNMPv2-MIB::snmpTrapOID.0 = OID: ADHOC-MIB::adhocTopologyEvent
    Aug 18 11:42:20 localhost snmptrapd[8571]: 2005-08-18 11:42:20 UDP/IPv6: [3ffe:100:3:282::3]:32772 [UDP/IPv6: [3ffe:100:3:282::3]:32772]: RFC1213-MIB::sysUpTime.0 = Timeticks: (54012) 0:09:00.12^ISNMPv2-MIB::snmpTrapOID.0 = OID: ADHOC-MIB::adhocTopologyEvent
    Aug 18 11:42:30 localhost snmptrapd[8571]: 2005-08-18 11:42:30 UDP/IPv6: [3ffe:100:3:282::3]:32772 [UDP/IPv6: [3ffe:100:3:282::3]:32772]: RFC1213-MIB::sysUpTime.0 = Timeticks: (55012) 0:09:10.12^ISNMPv2-MIB::snmpTrapOID.0 = OID: ADHOC-MIB::adhocTopologyEvent
    Aug 18 11:42:35 localhost snmptrapd[8571]: 2005-08-18 11:42:35 UDP/IPv6: [3ffe:100:3:282::3]:32772 [UDP/IPv6: [3ffe:100:3:282::3]:32772]: RFC1213-MIB::sysUpTime.0 = Timeticks: (55512) 0:09:15.12^ISNMPv2-MIB::snmpTrapOID.0 = OID: ADHOC-MIB::adhocTopologyEvent
    Aug 18 11:42:57 localhost snmpd[8569]: Connection from UDP/IPv6: [::1]:32932
    Aug 18 11:43:22 localhost last message repeated 5 times
    Aug 18 11:43:30 localhost snmptrapd[8571]: 2005-08-18 11:43:30 UDP/IPv6: [3ffe:100:3:282::3]:32772 [UDP/IPv6: [3ffe:100:3:282::3]:32772]: RFC1213-MIB::sysUpTime.0 = Timeticks: (61012) 0:10:10.12^ISNMPv2-MIB::snmpTrapOID.0 = OID: ADHOC-MIB::adhoctraps.7^IADHOC-MIB::adhocSecurityReceiveUnverfied = INTEGER: 6^IADHOC-MIB::adhocSecurityReceiveAddress = STRING:
    """

    integer = Word(nums)
    logday  = Word(alphas, max=3)

    hexno   = Word(hexnums)
    ip6sep  = oneOf(': ::')

    logtime = integer + ZeroOrMore(':' + integer)
    logdate = Combine(logday +' '+ integer +' '+ logtime)

    ip6addr = Combine(hexno + ip6sep + hexno + ip6sep + hexno + ip6sep + hexno + ip6sep + hexno)

    trapinfo = logdate.setResultsName('logdate') + ip6addr.setResultsName('hostaddr')

    print trapinfo

    for a in logdate.scanString(log):
        print a

    for b in ip6addr.scanString(log):
        print b
       
    for stuff in trapinfo.scanString(log):
        print stuff
       
    • Paul McGuire

      Paul McGuire - 2005-09-19

      Dear fajensen -

      You're really not very far off at all!  Just a couple of misconceptions, and a missing class.

      1. The Combine class does not quite work as you are thinking.  It serves two roles:
      .  a. its default behavior is to require that all embedded expressions be adjacent (although this can be disabled)
      .  b. it returns the results as a single string instead of as a list of matched tokens

      You can see an appropriate use of Combine in fourFn.py, where we define a floating point number as:
      .   fnumber = Combine( Word( "+-"+nums, nums ) +
      .                      Optional( point + Optional( Word( nums ) ) ) +
      .                      Optional( e + Word( "+-"+nums, nums ) ) )

      We certainly don't want "6. 02 E 23" to be matched as a valid floating point number.  On the other hand we also don't want "6.02E23" to be returned to us as ['6', '.', '02', 'E', '23'].  So by wrapping this fnumber with a Combine, we ensure that only adjacent characters are used for matching, and we get a nice clean '6.02E23' string, all ready to convert to a real number.
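
      As a quick sanity check, here's what I'd expect from that fnumber (a small, self-contained sketch - point and e are defined in fourFn.py; here I'm assuming Literal('.') and CaselessLiteral('E')):
      .  from pyparsing import Word, Combine, Optional, Literal, CaselessLiteral, nums
      .  point = Literal('.')
      .  e = CaselessLiteral('E')
      .  fnumber = Combine( Word( "+-"+nums, nums ) +
      .                     Optional( point + Optional( Word( nums ) ) ) +
      .                     Optional( e + Word( "+-"+nums, nums ) ) )
      .  # Combine requires the pieces to be adjacent and joins them into one string
      .  print fnumber.parseString("6.02E23")    # -> ['6.02E23']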

      In your program, you correctly used Combine for ip6addr, but it tripped you up on logdate.  In this case, I think the class you are looking for is Group, but you could also just leave the expressions without any enclosing structure.  And *definitely* leave out any explicit matching for whitespace - pyparsing will recognize whitespace as an input token delimiter, but by default it throws it away.  (Again, if you are really determined, whitespace can be included in your expressions, but in this particular application, I'd go without.)  So logdate changes from
      .   logdate = Combine(logday +' '+ integer +' '+ logtime)
      to simply:
      .   logdate = logday + integer + logtime

      Also, the expression for logtime is an excellent candidate for Combine - the characters are expected to be adjacent, and you really just want the final concatenated string, not the individual pieces.
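
      Putting those two changes together gives something along these lines (a sketch - the sample result in the comment matches the scanString output shown below):
      .  logtime = Combine(integer + ZeroOrMore(':' + integer))
      .  logdate = logday + integer + logtime
      .  # a match now comes back as e.g. ['Aug', '18', '11:38:31'] -
      .  # three tokens, with the time as a single combined string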

      2. If you look a little closer at the getNTPservers.py example, the expression used to call scanString does not use '+' to join its two subexpressions, but '|'.  If you define trapinfo using '|', you get these scanString return values:

      ((['Aug', '18', '11:38:31'], {'logdate': [((['Aug', '18', '11:38:31'], {}), -1)]}), 0, 15)
      ((['Aug', '18', '11:39:01'], {'logdate': [((['Aug', '18', '11:39:01'], {}), -1)]}), 64, 79)
      ((['Aug', '18', '11:39:02'], {'logdate': [((['Aug', '18', '11:39:02'], {}), -1)]}), 240, 255)
      ((['Aug', '18', '11:39:42'], {'logdate': [((['Aug', '18', '11:39:42'], {}), -1)]}), 317, 332)
      ((['Aug', '18', '11:40:01'], {'logdate': [((['Aug', '18', '11:40:01'], {}), -1)]}), 373, 388)
      ((['Aug', '18', '11:40:01'], {'logdate': [((['Aug', '18', '11:40:01'], {}), -1)]}), 487, 502)
      ((['Aug', '18', '11:40:25'], {'logdate': [((['Aug', '18', '11:40:25'], {}), -1)]}), 564, 579)
      ((['Aug', '18', '11:42:15'], {'logdate': [((['Aug', '18', '11:42:15'], {}), -1)]}), 620, 635)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 694, 711)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 731, 748)
      ((['Aug', '18', '11:42:20'], {'logdate': [((['Aug', '18', '11:42:20'], {}), -1)]}), 879, 894)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 953, 970)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 990, 1007)
      ((['Aug', '18', '11:42:30'], {'logdate': [((['Aug', '18', '11:42:30'], {}), -1)]}), 1138, 1153)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1212, 1229)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1249, 1266)
      ((['Aug', '18', '11:42:35'], {'logdate': [((['Aug', '18', '11:42:35'], {}), -1)]}), 1397, 1412)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1471, 1488)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1508, 1525)
      ((['Aug', '18', '11:42:57'], {'logdate': [((['Aug', '18', '11:42:57'], {}), -1)]}), 1656, 1671)
      ((['Aug', '18', '11:43:22'], {'logdate': [((['Aug', '18', '11:43:22'], {}), -1)]}), 1733, 1748)
      ((['Aug', '18', '11:43:30'], {'logdate': [((['Aug', '18', '11:43:30'], {}), -1)]}), 1789, 1804)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1863, 1880)
      ((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {'hostaddr': [((['3ffe', ':', '100', ':', '3', ':', '282', '::', '3'], {}), -1)]}), 1900, 1917)

      What we are essentially saying here is "scan for any occurrence of A or B", and so scanString returns us the various and separate matches, if *either* of the expressions is found.
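
      Spelled out, the alternation form is just your trapinfo with the '+' swapped for '|' (a one-line sketch):
      .  trapinfo = logdate.setResultsName('logdate') | ip6addr.setResultsName('hostaddr')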

      Unfortunately, there is no connection made between the returned values, unless you do some work of your own to preserve values across iterations of the scanString generator, something like this:
      .  curdate = ""
      .  for stuff in trapinfo.scanString(log):
      .      toks,start,end = stuff
      .      if toks.getName() == "logdate":
      .          curdate = " ".join(toks)
      .      elif toks.getName() == "hostaddr":
      .          print curdate, toks.hostaddr
         
      Since you were so good about setting up the results names, I used them to distinguish which subexpression actually matched any given scanString result.

      3. Also, note that scanString returns a tuple, not just a single value.  For each match, scanString returns (toks,start,end), where:
      .  toks = the matching tokens (returned as a ParseResults)
      .  start = the starting location of the match within the source string
      .  end = the ending location of the match within the source string
      You can unpack them as I did in the example above, or simply write:
      .  for toks,start,end in trapinfo.scanString(log):
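
      For instance, start and end let you echo the exact slice of the log that matched (a small sketch, using the logdate expression from above):
      .  for toks, start, end in logdate.scanString(log):
      .      print log[start:end], "->", toks.asList()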

      4. Lastly, if you really want to "just skip over some stuff", there is a pyparsing class called SkipTo.  I originally modified your trapinfo expression to use SkipTo, giving:
      .  trapinfo = logdate.setResultsName('logdate') + SkipTo(ip6addr)  + ip6addr.setResultsName('hostaddr')
      Unfortunately, SkipTo is a fairly dumb matcher.  In your example, not every logdate has a corresponding  ip6addr, as in
      .  time1
      .  time2
      .  time3  ipaddr1
      For data in this order, SkipTo will erroneously match ipaddr1 with time1, not time3.  Even marking the trailing address as Optional, as in:
      .  trapinfo = logdate.setResultsName('logdate') + Optional(SkipTo(ip6addr)  + ip6addr.setResultsName('hostaddr'))
      won't work - the pyparsing scanner will still read ahead from time1 and find ipaddr1.  In fact, as a general rule of thumb, if your grammar has SkipTo embedded inside Optional, that is a warning sign that your program may be skipping over too much.

      What is needed is a way to inspect the text that is matched by SkipTo, and to abort the match if any logdates are found inside.  This can be done by adding a parse action.
      def dontSkipOverLogdates(s,l,t):
      .   try:
      .       logdate.scanString(t[0]).next()
      .   except StopIteration:
      .       return  # success, keep on parsing!
      .   else:
      .       raise ParseException(s,l,"skipped over a logdate")

      trapinfo = logdate.setResultsName('logdate') + \
      .    SkipTo(ip6addr).setParseAction(dontSkipOverLogdates) + \
      .    ip6addr.setResultsName('hostaddr')
      for toks,start,end in trapinfo.scanString(log):
      .   print toks.logdate, toks.hostaddr

      What happens in this parse action is that the matched text is provided as the 0'th element of the ParseResults t.  We call logdate.scanString using t[0], which sets up a generator for any matching logdates.  If no logdate exists in this text, the call to next() will raise StopIteration - indicating to us that our original SkipTo did not skip over any intervening logdates, and this is a valid case.  If any logdates do exist in t[0], then the call to next() succeeds, we branch to the try-block's else clause, and raise a custom ParseException, signalling to pyparsing that this seeming match was not a desirable match after all.  (You can use this technique for various semantic qualifiers or filters, when a syntax-only match may accept too many input values.)

      Well, I guess this may be more than you bargained for when you asked for help.  Just correcting your Combine call on logdate will get you most of the way there, and you can ignore the rest of this stuff for now.  Good luck with using pyparsing!

      -- Paul

       
      • Frithiof Andreas Jensen

        THANKS A LOT - that was a very comprehensive answer. I think I finally got it to work; it needs more testing though.

        It is always bad to parse syslog because so much stuff is dumped there and one can never be quite certain that the parser will not choke on some very rare event - OTOH syslog is the only intelligent place to dredge out management events.

         
