Bug or bad regex?

Help
2004-01-26
2004-01-27
  • Nobody/Anonymous

    I want to parse log files.  The log files have a nasty problem: the end of a log entry may contain text in any format, and that text may span multiple lines.  Each new log entry always begins with a level keyword (DEBUG, INFO, etc.), so every entry is terminated by the start of the next one, except for the last entry in the file.  Consider the sample input:

    // begin sample --

    DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
    that may be
    over multiple lines
    DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
    that may be over
    multiple lines

    // end sample --

    So I tried this regex...

    (\w*) (.*)(?=DEBUG)

    This worked OK, but could not match the last entry, because it sits at the end of the file and is not followed by another "DEBUG".  So then I tried this:

    (?(?=.*DEBUG) ( ((\w*) (.*)(?=DEBUG)) ) | ((\w*) (.*)) )

    This was closer, but still didn't work.  The output was:

    //-- begin output --

    Groups:
    0: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
    that may be
    over multiple lines
    >
    1: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
    that may be
    over multiple lines
    >
    2: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
    that may be
    over multiple lines
    >
    3: <DEBUG>
    4: < 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
    that may be
    over multiple lines
    >
    5: -
    6: -
    7: -

    Groups:
    0: <>
    1: <>
    2: <>
    3: <>
    4: <>
    5: -
    6: -
    7: -

    Groups:
    0: <EBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
    that may be over
    multiple lines
    >
    1: -
    2: -
    3: -
    4: -
    5: <EBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
    that may be over
    multiple lines
    >
    6: <EBUG>
    7: < 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
    that may be over
    multiple lines
    >

    Groups:
    0: <>
    1: -
    2: -
    3: -
    4: -
    5: <>
    6: <>
    7: <>

    // end output --

    This was obtained after hitting "apply" four times.  I expected the first "apply" to match all of log entry 1 and the second "apply" to match all of log entry 2.  As you can see, that didn't happen.  Also, on the third "apply" I don't get "DEBUG" as expected but "EBUG" instead.

    Is my regex bad (probably), or is this a bug in jregex?  Once I have this working I will build upon it to parse out the important information in each log entry.

    Thanks for the help
    Later
    Rob

     
    • Sergey A. Samokhodkin

      Hello, nobody! I'll keep the details short.

      1. The 'EBUG..' was matched by the second branch of the expression. After the empty match (which was just before the last DEBUG), jregex deliberately moves its pointer one char ahead (guess why), so the 'D' was skipped. There seems to be no bug here.

      2. The expression seems to be bad. For example, try it against 3+ lines.

      3. My suggestion:
      (?m)^DEBUG.*(?:\s+^(?!DEBUG).*)*
      with no extra flags. It reads as "a line starting with DEBUG, followed by lines not starting with DEBUG". The DEBUG can be replaced by (?:DEBUG|INFO|ETC). It seems to work in the demo applet.
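
For anyone who wants to try the suggested expression outside the demo applet, here is a minimal sketch. It uses java.util.regex rather than jregex, so it only shows that the pattern itself behaves as described; the class name LogEntrySplitter, the method split, and the list of level names are assumptions made up for illustration, not anything from jregex or from the original logs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jregex.
final class LogEntrySplitter {
    // Assumed set of level keywords; the thread only names DEBUG and INFO.
    private static final String LEVEL = "(?:DEBUG|INFO|WARN|ERROR)";

    // "A line starting with a level, followed by lines not starting with a level."
    // (?m) makes ^ match at the start of every line, not only at the start of input.
    private static final Pattern ENTRY = Pattern.compile(
            "(?m)^" + LEVEL + ".*(?:\\s+^(?!" + LEVEL + ").*)*");

    // Returns one string per log entry, continuation lines included.
    static List<String> split(String log) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            entries.add(m.group());
        }
        return entries;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text",
                "that may be",
                "over multiple lines",
                "DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text",
                "that may be over",
                "multiple lines");
        for (String entry : split(log)) {
            System.out.println("--- entry ---");
            System.out.println(entry);
        }
    }
}
```

Because the pattern has no lookahead for a following entry, the last entry needs no special case: the repetition simply stops at the end of the input.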

      Hope this helps. Questions are welcome.
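
The 'EBUG' effect from point 1 can be reproduced in a small sketch. It uses java.util.regex instead of jregex, and a deliberately simplified pattern instead of the one from the question, because java.util.regex has no (?(cond)...) conditional syntax. The class name EmptyMatchDemo and the method allMatches are made up for illustration. The idea is the same: once a match is empty, the next search has to start one char later, otherwise the same empty match would be found forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of jregex.
final class EmptyMatchDemo {
    // Collects every match as "startIndex:matchedText".
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // First branch: an empty match just before DEBUG.
        // Second branch: a run of word characters.
        // The empty match at index 0 forces the next search to start at index 1,
        // so the second branch sees "EBUG" and the 'D' is never consumed.
        System.out.println(allMatches("(?=DEBUG)|\\w+", "DEBUG"));
        // prints [0:, 1:EBUG]
    }
}
```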

      Regards,
      Sergey

       
