Another Newbie question: parsing across lines

Julian
2005-02-05
2013-05-14
  • Julian

    Julian - 2005-02-05

    Hi there-
    I need to parse a log file, and am wondering if pyparsing could be used.

    The log looks something like:

    Date Time Title: <Title>
    Result Message

    Repeated many, many times :-)

    Is there a way to make a grammar that accepts (i.e. performs an action/returns the tokens and values) both lines as one unit?

    I.e. sometimes things get corrupted; I don't want it to accept just the first or the second line independently, I need it to only accept things if both lines are present.

    Thanks in advance for your help!

    Regards,

    Julian

     
    • Tom Lynch

      Tom Lynch - 2005-02-05

      Sure.  But first some disclaimers:
      - I'm new at this too
      - I don't know 90% of what pyparsing can do, so I am sure there is a much better way
      - I wrote this off the top of my head so it probably won't run
      - I don't know the format of your tokens so I am guessing

      # this script is called logParser.py

      # first load the appropriate modules(?)
      import sys
      from pyparsing import alphas, alphanums, Word, Literal, nums

      # a message is a bunch of letters and numbers
      message = Word(alphas, alphanums)

      # a title is a bunch of letters and numbers too,
      # so this is redundant, but clearer
      title = Word(alphas, alphanums)

      # an integer is a run of digits
      integer = Word(nums)

      # a date looks like 2004/12/1
      date = integer + "/" + integer + "/" + integer

      # a time looks like 21:01:01
      time = integer + ":" + integer + ":" + integer

      # put it all together
      correctMessage = date + time + Literal("Title:") + title + message

      # now that I think about it, this doesn't handle the
      # newline... hmmmm, you'll have to figure that out!

      #now use it
      # read the file
      args = sys.argv[1:]
      if len(args) != 1:
          print('usage: python logParser.py <datafile.txt>')
          sys.exit(-1)
      infilename = sys.argv[1]
      infile = open(infilename, 'r')

      # parse the file
      for line in infile:
          anEntry = correctMessage.parseString(line)
          print(anEntry)
        
      # end of this questionable code

      Note to Paul:

      Paul, it would be great if you could correct this code or post the approved solutions AND THEN post it as the first in a collection of useful pyparsing 'recipes'.  Well, great for Julian and me anyway.

         

       
      • Julian

        Julian - 2005-02-08

        Paul, Tom-
        Wow!  Tom, you did a great job in giving a starting point (and well explained too!), and Paul, you rounded it off with an example I can basically cut 'n paste.

        Thanks very much for both of your time and effort.

        Sorry I was so vague about the logging format; it's the output of an internal test tool used where I work.  There are a couple more tokens in the second line (e.g. result code), but that's fine- good practice for me to ensure I understand what's going on :-)

        Again, many thanks- this is a far more intelligible way to deal with this problem than lots of regular expressions.

        Regards

        Julian

         
    • Paul McGuire

      Paul McGuire - 2005-02-06

      Tom -

      A very good first stab at this problem.   Here are some comments:
      - The main Achilles' heel of your program is that, although the grammar already works across lines pretty well, the loop at the bottom ("for line in infile") only parses single lines individually.  You should instead use file.read() to read the entire file contents into a single input string, or define your grammar as something like:
      messages = ZeroOrMore(Group(correctMessage))
      and then use parseFile():
      listOfMessages = messages.parseFile( infilename )

      - Date and time are given as a sequence of expressions, so each will be returned as a series of individual tokens (i.e. "2004/01/01" will be returned as the list ['2004', '/', '01', '/', '01']).  Wrap each one in Combine to get the matched pieces back concatenated as a single token.

      - Your definition of title is ok, as long as the title consists of only a single word.  You'll have to ask Julian if that is sufficient.  If title is really a filename, then you could expand it to something like:
      Word(alphas,alphanums+"$_") + "." + Word(alphas)
      Or if it is truly a title consisting of one or more words, or numbers or other unpredictable things, then you can read up to the end of the line using restOfLine.  The only problem with restOfLine is that you may have to explicitly read past the end of line using a LineEnd() expression.
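
A minimal sketch of the Combine point above, assuming dates in the 2004/12/1 form that Tom guessed at. The names loose_date, loose_tokens and combined_tokens are only for illustration, and the camelCase method names follow the ones used in this thread:

```python
from pyparsing import Word, nums, Combine

integer = Word(nums)

# Without Combine, every matched piece is returned as a separate token.
loose_date = integer + "/" + integer + "/" + integer
loose_tokens = loose_date.parseString("2004/01/01").asList()

# With Combine, the adjacent pieces are joined into a single string token.
date = Combine(integer + "/" + integer + "/" + integer)
combined_tokens = date.parseString("2004/01/01").asList()

print(loose_tokens)     # ['2004', '/', '01', '/', '01']
print(combined_tokens)  # ['2004/01/01']
```

Combine also insists that the pieces be adjacent, so a date with stray spaces inside it would not match.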

      So the first line of each log message becomes:
      line1 = date + time + "Title:" + restOfLine + LineEnd().suppress()
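
To see why the explicit LineEnd() can matter, here is a small sketch with a made-up two-line string (two_lines and both result names are hypothetical). restOfLine stops just before the newline and does not skip leading whitespace, so a second restOfLine placed directly after it should not reach the next line unless the newline is consumed first:

```python
from pyparsing import restOfLine, LineEnd

two_lines = "first line\nsecond line"

# Two restOfLine expressions back to back: the second one starts at the
# newline and therefore never sees the text of the second line.
without_line_end = (restOfLine + restOfLine).parseString(two_lines).asList()

# Consuming the newline explicitly lets the second restOfLine read line two.
with_line_end = (
    restOfLine + LineEnd().suppress() + restOfLine
).parseString(two_lines).asList()

print(without_line_end)
print(with_line_end)     # ['first line', 'second line']
```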

      Julian doesn't give us much to go on for line2, so we'll have to assume that this line contains pretty much anything.  He does tell us that sometimes the second line is missing, and that he doesn't really want to accept those entries.  (We'll come back to that in a minute.)  So let's assume that if line2 starts with a date, then it really is another line1, and the real line2 is missing.  We can then use restOfLine together with NotAny (created using the '~' operator) to prevent a match if the line starts with a date.

      line2 = ~date + restOfLine

      Now we have:

      line1 = date + time + "Title:" + restOfLine + LineEnd().suppress()
      line2 = ~date + restOfLine
      correctMessage = line1 + line2
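
As a rough check that this grammar only accepts the two lines as a unit, the sketch below feeds it one complete entry and one entry whose second line is missing. The sample strings and the names complete, missing_line2 and rejected are made up for illustration:

```python
from pyparsing import Word, nums, Combine, restOfLine, LineEnd, ParseException

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)
time = Combine(integer + ":" + integer + ":" + integer)

line1 = date + time + "Title:" + restOfLine + LineEnd().suppress()
line2 = ~date + restOfLine
correctMessage = line1 + line2

complete = (
    "2004/01/02 12:34:56 Title: Raiders of the Lost Ark\n"
    "A great movie to go see.\n"
)
# Here the line after the title line is itself the start of a new entry.
missing_line2 = (
    "2005/02/12 13:57:09 Title: Star Wars Episode I\n"
    "2003/08/23 10:11:23 Title: How to Succeed in Business\n"
)

accepted = correctMessage.parseString(complete).asList()
print(accepted)

try:
    correctMessage.parseString(missing_line2)
    rejected = False
except ParseException:
    # ~date refuses a "second line" that begins with a date
    rejected = True
print(rejected)  # True
```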

      To make it easier to access the different fields of the log message, let's give them results names:
      line1 = date.setResultsName("date") + time.setResultsName("time") + "Title:" + restOfLine.setResultsName("title") + LineEnd().suppress()

      line2 = ~date + restOfLine.setResultsName("results")

      correctMessage = line1 + line2
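
A short sketch of what the results names buy you, using one entry in the style of the sample data from this thread (the variable name entry is just for illustration). Named fields can be read either as attributes or with dictionary-style keys:

```python
from pyparsing import Word, nums, Combine, restOfLine, LineEnd

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)
time = Combine(integer + ":" + integer + ":" + integer)

line1 = (
    date.setResultsName("date")
    + time.setResultsName("time")
    + "Title:"
    + restOfLine.setResultsName("title")
    + LineEnd().suppress()
)
line2 = ~date + restOfLine.setResultsName("results")
correctMessage = line1 + line2

entry = correctMessage.parseString(
    "2004/01/01 00:00:01 Title: Rise and Fall of the Roman Empire\n"
    "A terrific book, but a trifle long\n"
)

print(entry.date)           # attribute-style access
print(entry["time"])        # dictionary-style access
print(entry.title.strip())  # restOfLine keeps the space after "Title:"
print(entry.results)
```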

      Since the input file may contain some text that doesn't match the correctMessage grammar, it may be better to use scanString than parseString: scanString scans through the input string and extracts the matches wherever they occur, whereas parseString must match the grammar starting at the beginning of the string.

      scanString is a generator that yields, for each match, the parsed tokens as a ParseResults object (which can be treated as a list, dictionary, or object), the start position of the matched text, and the end position of the matched text.  Here's how it would look:

      infile = open( infilename )
      for (tokens,start,end) in correctMessage.scanString( infile.read() ):
          print(tokens.date, tokens.time, '"' + tokens.title + '"', "Results:", tokens.results)
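
For a smaller picture of the difference, this sketch runs just the time expression over a made-up string with junk around the matches (text, matches and parse_string_ok are hypothetical names). Each item from scanString unpacks as tokens, start, end:

```python
from pyparsing import Word, nums, Combine, ParseException

integer = Word(nums)
time = Combine(integer + ":" + integer + ":" + integer)

text = "garbage 21:01:01 more garbage 09:30:00 trailing junk"

# scanString skips over anything that does not match and reports
# where in the input each match was found.
matches = []
for tokens, start, end in time.scanString(text):
    matches.append((tokens[0], text[start:end]))
print(matches)  # [('21:01:01', '21:01:01'), ('09:30:00', '09:30:00')]

# parseString has to match from the start of the input, so it fails here.
try:
    time.parseString(text)
    parse_string_ok = True
except ParseException:
    parse_string_ok = False
print(parse_string_ok)  # False
```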
         
      So here is your whole program, with some test data (including one or two bad entries), too.  I'll add it to the examples directory in the next release (version 1.3, which should be ready soon).

      -- Paul

      ==========================================
      from pyparsing import alphas, alphanums, Word, Literal, nums, restOfLine, LineEnd, Combine

      integer = Word(nums)

      # a date looks like 2004/12/1
      date = Combine( integer + "/" + integer + "/" + integer )

      # a time looks like 21:01:01
      time = Combine( integer + ":" + integer + ":" + integer )

      line1 = date.setResultsName("date") + time.setResultsName("time") + "Title:" + restOfLine.setResultsName("title") + LineEnd().suppress()

      line2 = ~date + restOfLine.setResultsName("results")

      correctMessage = line1 + line2

      testdata = """
      2004/01/01 00:00:01 Title: Rise and Fall of the Roman Empire
      A terrific book, but a trifle long
      2004/01/02 12:34:56 Title: Raiders of the Lost Ark
      A great movie to go see.
      2005/02/12 13:57:09 Title: Star Wars Episode I - The Phantom Menace
      2003/08/23 10:11:23 Title: How to Succeed in Business Without Really Trying
      A goofy play made into a goofy movie.
      """

      for (tokens,start,end) in correctMessage.scanString( testdata ):
          print(tokens.date, tokens.time, '"' + tokens.title.strip() + '"', "Results:", tokens.results)
         

       
