What exactly does lineno represent?

2006-04-28
2013-05-14
  • Tim Edwards

    Tim Edwards - 2006-04-28

    I am parsing a named.conf and the respective zone files.  Two of the zone files are particularly large: file1=1627 lines and file2=12850 lines.  The larger of the two takes about 30 minutes to parse, which is far too long.  So I am trying to optimize my grammar.

    In the parseAction that handles each resource record of a zone file, I started printing the lineno argument every time it increases by a factor of 10000.

    For the small zone files, this never prints.  But for the aforementioned larger files it does.  File1's lineno exceeds 90000, and file2's lineno exceeds 770000!  How do these values relate to my grammar?  Is it correct to assume that a vague grammar, that is, a grammar that requires deeper parse trees, produces higher lineno values?

    Below are excerpts from my code concerning the resource records, which constitute the majority of the parsing time.

    --code--
    rrClass=Optional(Literal('IN')).setResultsName('class').setName('recordClass')
    rrType=Literal('A')|Literal('PTR')|Literal('CNAME')|Literal('HWINFO')|Literal('TXT')|Literal('NS')|Literal('MX')
    rrType=rrType.setResultsName('type').setName('recordType')
    rrName=Group(fqDomainName|domain|ipAddr|Literal('@')).setResultsName('name').setName('nodeName')
    rrTTL=Optional(Word(nums).setResultsName('ttl')).setName('recordTTL')
    rrData=Optional(Word(nums))+(fqDomainName|ipAddr|quotedString)
    rrData=rrData.setResultsName('data').setName('recordData')
    rrMain=Group(OneOrMore(Group(rrTTL+rrClass+rrType+rrData))).setResultsName('rr').setName('rrMain')
    rr=Group(rrName+rrMain).setResultsName('node').setName('resultsRecord')

    --end code--

    Thanks for any answers or optimization suggestions,
    Tim

     
    • Paul McGuire

      Paul McGuire - 2006-04-28

      Tim -

      lineno is supposed to report the line in the input string, by counting newlines from the start of the string (the first line is line 1).  If you are just reporting lineno for debugging purposes, try turning it off, since calculating it may be expensive, especially for a large file, and *especially* if you are calling it often!

      Here is a short example, showing the use of line, col, and lineno:

      =====================
      from pyparsing import *

      data = """Now is the time
      for all good men
      to come to the aid
      of their country."""

      def reportLongWords(s,l,t):
          word = t[0]
          if len(word) > 3:
              print "Found '%s' on line %d at column %d" % (word, lineno(l,s), col(l,s))
              print "The full line of text was:"
              print "'%s'" % line(l,s)
              print (" "*col(l,s))+"^"
              print
             
      wd = Word(alphas).setParseAction( reportLongWords )

      OneOrMore(wd).parseString(data)

      This prints:
      Found 'time' on line 1 at column 12
      The full line of text was:
      'Now is the time'
                  ^

      Found 'good' on line 2 at column 9
      The full line of text was:
      'for all good men'
               ^

      Found 'come' on line 3 at column 4
      The full line of text was:
      'to come to the aid'
          ^

      Found 'their' on line 4 at column 4
      The full line of text was:
      'of their country.'
          ^

      Found 'country' on line 4 at column 10
      The full line of text was:
      'of their country.'
                ^

      Nothing in your grammar looks terrible (I'm in transit at the moment, so I can't look at your problem in much depth just now).  You *might* get a minor speedup from changing

      rrType=Literal('A')|Literal('PTR')|Literal('CNAME')|Literal('HWINFO')|Literal('TXT')|Literal('NS')|Literal('MX')

      to:

      rrType=oneOf('A PTR CNAME HWINFO TXT NS MX')

      Also, try importing psyco - this can give you a 30-50% speedup.  I'm afraid I'd need to see your full grammar and a sample file to see just where the performance hitch is (which I am *very* interested in finding, by the way - performance is the one thing I continually get slammed on, and each performance question usually highlights some opportunity for improvement).  Also, try calling enablePackrat(), which was included in the last release.

      I think there is a named.conf parser in the examples directory, or try googling for "pyparsing named.conf" - I know that at least one other person has tackled this file previously.

      -- Paul

       
    • Tim Edwards

      Tim Edwards - 2006-04-28

      Paul,

      I am emailing you my code, it is a bit much to post here in the forum.

      I looked at the other named parsers, but they don't seem to parse the zone files.  I know my grammar isn't as eloquent as the examples on the web, but I felt I needed more specific definitions for my project.

      I made the change you suggested, thanks.

      I guess I used a bad variable name, and it has caused some confusion. =/  I used 'lineNo' for the variable you referenced in your example as 'l'.  I am not calculating the line number, I am only printing the value of 'l'.

      I have an optimization suggestion that you may or may not be open to.  If so I would be willing to help implement it.
      I noticed that Regex is very efficient.  What do you think of generating Python regular expressions for the parseExpressions?  If the regexes are created properly, then each token could still be named, and popped out of the string into a list of tokens.

      --
      Tim

       
    • Paul McGuire

      Paul McGuire - 2006-04-29

      Ah!  The variable I reference as 'l' is the location into the input string.

      I'm sorry, I've gotten fairly lazy in specing out the args to my parse actions.  Parse actions are passed 3 args:
      1. the original, entire string being parsed
      2. the location in the string at which the current expression (to which the current parse action is attached) was found
      3. a ParseResults object representing the tokens matched in the string

      So I often shortcut these args as:
      def blahBlahAction(s,l,t):
          print "BlahBlah matched at posn",l," :",t[0]

      But in more readable form, it would be:
      def blahBlahAction(strg,locn,tokens):
          print "BlahBlah matched at posn",locn," :",tokens[0]

      So this is why your value, which you called 'lineNo', is getting up into the 100's of thousands; it's because you have that many characters in your input file.
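
To make the difference between a character location and a line number concrete, here is a minimal plain-Python sketch of the newline counting described earlier in the thread.  The helper names line_number and column are hypothetical and use only built-in string methods; this is an illustration of the idea, not pyparsing's actual source.

```python
def line_number(loc, text):
    """Return the 1-based line number that contains character offset loc."""
    # Count the newlines that occur before loc; the first line is line 1.
    return text.count("\n", 0, loc) + 1


def column(loc, text):
    """Return the 1-based column of character offset loc within its line."""
    # rfind returns -1 when no newline precedes loc, so columns on the
    # first line also come out 1-based.
    return loc - text.rfind("\n", 0, loc)


data = "Now is the time\nfor all good men\nto come to the aid\nof their country."

# loc is a character offset into the whole string, like the second
# argument passed to a parse action.
loc = data.index("good")
print(loc, line_number(loc, data), column(loc, data))  # prints: 24 2 9
```

The offset grows with every character consumed, so for a file of several hundred thousand characters it will reach values far larger than the number of lines.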

      I got your e-mail, and am back home from my business trip, so I'll be able to look at this in a little more detail.

      -- Paul

       
    • Paul McGuire

      Paul McGuire - 2006-04-30

      Tim -

      I looked over your code, and nothing really leaps out at me that is a significant issue.  Here are some minor tweaks, but except for using enablePackrat, I don't expect any significant performance speedup.
         
      1. min=1 is pretty much assumed on Word definitions, you can leave it off.
      2. Add a call to ParserElement.enablePackrat() after importing from pyparsing.  This may not work out, since you add parse actions in the calling modules, but the payoff could be significant.
      3. Some of your variables collide with standard Python built-ins, including type, range, and file.  This probably doesn't affect performance, but could mess things up with strange error messages.
      4. (very minor!) You have a misspelling of sQuote as sQoute (similar for sDblQuote):
      sQoute=quote.suppress()
      sDblQoute=dblQuote.suppress()
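
As a small illustration of point 3, the sketch below shows the kind of confusing error that can appear when a variable reuses the name of a built-in.  The function describe_record and its contents are hypothetical and are not taken from Tim's code.

```python
def describe_record(value):
    # Hypothetical variable name: inside this function, "type" now refers
    # to a string and no longer to the built-in type().
    type = "A"
    try:
        # The author presumably meant the built-in here, but the local
        # string is called instead.
        return type(value).__name__
    except TypeError as exc:
        return "TypeError: %s" % exc


print(describe_record(42))
```

The error message talks about a str object not being callable, which gives no hint that the real cause is a shadowed built-in; renaming the variable (for example to rrType) avoids the problem.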

      As for your optimization suggestion, I think I'm not only open to it, but I hope I'm already doing it.  Several of the classes have been reimplemented using internal regexps, look at Word, QuotedString, even the oneOf helper, and many of the comment built-ins.
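
For readers curious what a single regular expression with named tokens might look like, here is a rough sketch using only Python's standard re module.  The record layout, the group names (ttl, cls, type, data), and the helper parse_record are simplifying assumptions made for illustration; they are neither Tim's grammar nor how pyparsing builds its internal regexps.

```python
import re

RECORD_TYPES = ["A", "PTR", "CNAME", "HWINFO", "TXT", "NS", "MX"]

# Try longer alternatives first so that a short literal can never win
# over a longer one that starts with the same characters.
type_pattern = "|".join(
    sorted((re.escape(t) for t in RECORD_TYPES), key=len, reverse=True)
)

# Assumed, simplified record layout: [ttl] [class] type data
record_re = re.compile(
    r"(?:(?P<ttl>\d+)\s+)?"      # optional TTL
    r"(?:(?P<cls>IN)\s+)?"       # optional class
    r"(?P<type>%s)\s+"           # record type
    r"(?P<data>\S+)"             # first data field only
    % type_pattern
)


def parse_record(line):
    """Return a dict of named tokens, or None if the line does not match."""
    match = record_re.match(line)
    return match.groupdict() if match else None


print(parse_record("3600 IN A 192.0.2.1"))
```

Each named group plays the role of a results name, and groups that did not take part in the match come back as None.  A real zone-file grammar has many more cases (multi-field data such as MX preferences, quoted strings, continuation lines), so this is only a starting point.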

      -- Paul

       
