Python parsing module / Discussion / Help/Open Discussion: Another Newbie question: parsing across lines

Julian - 2005-02-05

Hi there-
I need to parse a log file, and am wondering if pyparsing could be used.

The log looks something like:

Date Time Title: <Title>
Result Message

Repeated many, many times :-)

Is there a way to make a grammar that accepts (i.e. performs an action/returns the tokens and values) both lines as one unit?

I.e. sometimes things get corrupted; I don't want it to accept just the first or the second line independently. I need it to accept an entry only if both lines are present.

Thanks in advance for your help!

Regards,

Julian

- Tom Lynch - 2005-02-05
  
  Sure. But first some disclaimers:
  - I'm new at this too
  - I don't know 90% of what pyparsing can do, so I am sure there is a much better way
  - I wrote this off the top of my head so it probably won't run
  - I don't know the format of your tokens so I am guessing
  
  # this script is called logParser.py
  
  # first load the appropriate modules
  import sys
  from pyparsing import alphas, alphanums, Word, Literal, nums
  
  # an integer is a run of digits (used by date and time below)
  integer = Word(nums)
  
  # a message is a bunch of letters and numbers
  message = Word(alphas, alphanums)
  
  # a title is a bunch of letters and numbers too
  # so this is redundant but clearer
  title = Word(alphas, alphanums)
  
  # a date looks like 2004/12/1
  date = integer + "/" + integer + "/" + integer
  
  # a time looks like 21:01:01
  time = integer + ":" + integer + ":" + integer
  
  # put it all together
  correctMessage = date + time + Literal("Title:") + title + message
  
  # now that i think about it, this doesn't handle the
  #new line..hmmmm, you'll have to figure that out!
  
  #now use it
  # read the file
  args = sys.argv[1:]
  if len(args) != 1:
      print('usage: python logParser.py <datafile.txt>')
      sys.exit(-1)
  infilename = sys.argv[1]
  infile = open(infilename, 'r')
  
  # parse the file
  for line in infile:
      anEntry = correctMessage.parseString(line)
      print(anEntry)
  
  # end of this questionable code
  
  Note to Paul:
  
  Paul, it would be great if you could correct this code or post the approved solutions AND THEN post it as the first in a collection of useful pyparsing 'recipes'. Well, great for Julian and me anyway.
  
  
  
  - Julian - 2005-02-08
    
    Paul, Tom-
    Wow! Tom, you did a great job in giving a starting point (and well explained too!), and Paul, you rounded it off with an example I can basically cut 'n paste.
    
    Thanks very much for both of your time and effort.
    
    Sorry I was so vague about the logging format; it's the output of an internal test tool used where I work. There are a couple more tokens in the second line (e.g. result code), but that's fine- good practice for me to ensure I understand what's going on :-)
    
    Again, many thanks- this is a far more intelligible way to deal with this problem than lots of regular expressions.
    
    Regards
    
    Julian
    
- Paul McGuire - 2005-02-06
  
  Tom -
  
  A very good first stab at this problem. Here are some comments:
  - The main Achilles' heel of your program is that, although the grammar already works across lines pretty well, the loop at the bottom ("for line in infile") parses each line individually. You should instead use file.read() to read the entire file contents into a single input string, or define your grammar as something like:
  messages = ZeroOrMore(Group(correctMessage))
  and then use parseFile():
  listOfMessages = messages.parseFile( infilename )
  
  - Date and time are given as a sequence of expressions, so they will be returned as individual tokens (i.e. "2004/01/01" will be returned as the list ['2004', '/', '01', '/', '01']). Use Combine to return the concatenated tokens as a single token.
  
  - Your definition of title is ok, as long as the title consists of only a single word. You'll have to ask Julian if that is sufficient. If title is really a filename, then you could expand it to something like:
  Word(alphas,alphanums+"$_") + "." + Word(alphas)
  Or if it is truly a title consisting of one or more words, numbers, or other unpredictable things, then you can read up to the end of the line using restOfLine. The only problem with restOfLine is that it stops before the newline, so you may have to explicitly read past the end of the line using a LineEnd() expression.
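
  As a small illustration of the Combine point above, here is a sketch that parses the same date with and without Combine. The name loose_date is made up for this example; the camelCase method names follow the thread (newer pyparsing releases also offer snake_case equivalents).

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)

# Without Combine, each piece of the date comes back as its own token.
loose_date = integer + "/" + integer + "/" + integer

# With Combine, the matched pieces are joined into a single string token.
date = Combine(integer + "/" + integer + "/" + integer)

loose_tokens = loose_date.parseString("2004/01/01").asList()
combined_tokens = date.parseString("2004/01/01").asList()

print(loose_tokens)     # ['2004', '/', '01', '/', '01']
print(combined_tokens)  # ['2004/01/01']
```

  Combine also requires the pieces to be adjacent, so a stray space inside the date would make the match fail rather than be skipped.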
  
  So the first line of each log message becomes:
  line1 = date + time + "Title:" + restOfLine + LineEnd().suppress()
  
  Julian doesn't give us much to go on for line2, so we'll have to assume that this line contains pretty much anything. He does tell us that sometimes the second line is omitted, but that he doesn't really want to accept those. (We'll come back to that in a minute.) So let's assume that if line2 starts with a date, then it really is another line1, and that there is a missing line2. We can then use restOfLine, and the NotAny (created using the '~' operator) to prevent matches if a line starts with a date.
  
  line2 = ~date + restOfLine
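
  To see what the '~' lookahead does in isolation, here is a minimal sketch. The two sample strings and the names accepted and rejected are invented for the example.

```python
from pyparsing import Combine, ParseException, Word, nums, restOfLine

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)

# ~date is a negative lookahead: it consumes no input and fails
# whenever a date can be parsed at the current position.
line2 = ~date + restOfLine

# An ordinary result line is accepted.
accepted = line2.parseString("A great movie to go see.").asList()

# A line that starts with a date is refused.
try:
    line2.parseString("2005/02/12 13:57:09 Title: Star Wars")
    rejected = False
except ParseException:
    rejected = True

print(accepted)
print(rejected)
```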
  
  Now we have:
  
  line1 = date + time + "Title:" + restOfLine + LineEnd()
  line2 = ~date + restOfLine
  correctMessage = line1 + line2
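
  With the two-line grammar in hand, the ZeroOrMore(Group(...)) idea from the first comment can be sketched as follows. The log_text sample is made up, and the text is parsed from a string here so the example stands alone; parseFile would be used the same way with a file name.

```python
from pyparsing import (Combine, Group, LineEnd, Word, ZeroOrMore,
                       nums, restOfLine)

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)
time = Combine(integer + ":" + integer + ":" + integer)

line1 = date + time + "Title:" + restOfLine + LineEnd().suppress()
line2 = ~date + restOfLine
correctMessage = line1 + line2

# Group keeps the tokens of each two-line entry in their own sublist.
messages = ZeroOrMore(Group(correctMessage))

# Hypothetical sample data: two well-formed entries.
log_text = (
    "2004/01/01 00:00:01 Title: First\n"
    "All good\n"
    "2004/01/02 12:34:56 Title: Second\n"
    "Also good\n"
)

entries = messages.parseString(log_text).asList()
for entry in entries:
    print(entry)
```

  Note that ZeroOrMore stops at the first entry that does not match, so this form suits clean input; corrupted logs are better handled by the scanning approach discussed further down.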
  
  To make it easier to access the different fields of the log message, let's give them results names:
  line1 = date.setResultsName("date") + time.setResultsName("time") + "Title:" + restOfLine.setResultsName("title") + LineEnd()
  
  line2 = ~date + restOfLine.setResultsName("results")
  
  correctMessage = line1 + line2
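
  A short sketch of how the results names can then be read back, either as attributes or with dictionary-style lookups. The entry string is sample data invented for the example.

```python
from pyparsing import Combine, LineEnd, Word, nums, restOfLine

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)
time = Combine(integer + ":" + integer + ":" + integer)

line1 = (date.setResultsName("date") + time.setResultsName("time")
         + "Title:" + restOfLine.setResultsName("title") + LineEnd())
line2 = ~date + restOfLine.setResultsName("results")
correctMessage = line1 + line2

# Hypothetical two-line log entry.
entry = "2004/01/02 12:34:56 Title: Raiders of the Lost Ark\nA great movie to go see."
tokens = correctMessage.parseString(entry)

print(tokens.date)            # attribute-style access
print(tokens["time"])         # dictionary-style access
print(tokens.title.strip())   # restOfLine keeps leading whitespace, so strip it
print(tokens.results)
```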
  
  Since the input file may contain text that doesn't match the correctMessage grammar, it may be better to use scanString than parseString: scanString scans through the input string and extracts every match it finds, skipping whatever lies in between, whereas parseString requires the grammar to match starting at the very beginning of the string.
  
  scanString is a generator that yields, for each match, the parsed tokens as a ParseResults object (which can be treated as a list, dictionary, or object), followed by the start position and the end position of the matched text. Here's how it would look:
  
  infile = open( infilename )
  for (tokens,start,end) in correctMessage.scanString( infile.read() ):
      print(tokens.date, tokens.time, '"' + tokens.title + '"', "Results:", tokens.results)
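
  The shape of what scanString produces is easy to check on a toy input. This sketch scans for dates only; the text string and the matches list are invented for the example.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
date = Combine(integer + "/" + integer + "/" + integer)

# Hypothetical input with non-matching text around two dates.
text = "junk 2004/01/01 more junk 2005/02/12"

# Each item yielded by scanString unpacks as (tokens, start, end);
# text that does not match the grammar is skipped silently.
matches = [(tokens[0], start, end)
           for tokens, start, end in date.scanString(text)]

print(matches)  # [('2004/01/01', 5, 15), ('2005/02/12', 26, 36)]
```

  The start and end values are offsets into the scanned string, so text[start:end] recovers the matched slice.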
  
  So here is your whole program, with some test data (including one or two bad entries), too. I'll add it to the examples directory in the next release (version 1.3, which should be ready soon).
  
  -- Paul
  
  ==========================================
  from pyparsing import Word, nums, restOfLine, LineEnd, Combine
  
  integer = Word(nums)
  
  # a date looks like 2004/12/1
  date = Combine( integer + "/" + integer + "/" + integer )
  
  # a time looks like 21:01:01
  time = Combine( integer + ":" + integer + ":" + integer )
  
  line1 = date.setResultsName("date") + time.setResultsName("time") + "Title:" + restOfLine.setResultsName("title") + LineEnd()
  
  line2 = ~date + restOfLine.setResultsName("results")
  
  correctMessage = line1 + line2
  
  testdata = """
  2004/01/01 00:00:01 Title: Rise and Fall of the Roman Empire
  A terrific book, but a trifle long
  2004/01/02 12:34:56 Title: Raiders of the Lost Ark
  A great movie to go see.
  2005/02/12 13:57:09 Title: Star Wars Episode I - The Phantom Menace
  2003/08/23 10:11:23 Title: How to Succeed in Business Without Really Trying
  A goofy play made into a goofy movie.
  """
  
  for (tokens,start,end) in correctMessage.scanString( testdata ):
      print(tokens.date, tokens.time, '"' + tokens.title.strip() + '"', "Results:", tokens.results)
  
  