I want to parse log files. The nasty problem is that a log entry may end with free-form text in any format, and that text may span multiple lines. Every new entry always begins with a level keyword (DEBUG, INFO, etc.), so the only entry that is not followed by such a keyword is the last one in the file. Consider the sample input:
// begin sample --
DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
that may be
over multiple lines
DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
that may be over
multiple lines
// end sample --
So I tried this regex...
(\w*) (.*)(?=DEBUG)
This worked OK, but it could not match the last entry, because that entry runs to the end of the file and no "DEBUG" follows it. So then I tried this:
(?(?=.*DEBUG) ( ((\w*) (.*)(?=DEBUG)) ) | ((\w*) (.*)) )
This was closer, but still didn't work. The output was:
//-- begin output --
Groups: 0: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
that may be
over multiple lines
> 1: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
that may be
over multiple lines
> 2: <DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
that may be
over multiple lines
> 3: <DEBUG> 4: < 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text
that may be
over multiple lines
> 5: - 6: - 7: -
Groups: 0: <> 1: <> 2: <> 3: <> 4: <> 5: - 6: - 7: -
Groups: 0: <EBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
that may be over
multiple lines
> 1: - 2: - 3: - 4: - 5: <EBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
that may be over
multiple lines
> 6: <EBUG> 7: < 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text
that may be over
multiple lines
>
Groups: 0: <> 1: - 2: - 3: - 4: - 5: <> 6: <> 7: <>
// end output --
This was obtained after hitting "apply" four times. I expected the first "apply" to return all of log entry 1 and the second "apply" to return all of log entry 2. As you can see, that didn't happen. Also, on the third "apply" I get "EBUG" instead of the expected "DEBUG".
Is my regex bad (probably), or is this a bug in jregex? Once I have this working I will build upon it to parse out the important information in each log entry.
Thanks for the help
Later
Rob
Hello, nobody! I'll keep the details short.
1. The 'EBUG..' was matched by the second part of the expression. After the empty match (which was just before the last DEBUG) jregex deliberately moves its pointer one char ahead, presumably so that the next search does not find the same empty match at the same position forever. That is why the 'D' was skipped. Seems to be no bug here.
2. The expression seems to be bad. For example, try it against 3+ lines.
3. My suggestion:
(?m)^DEBUG.*(?:\s+^(?!DEBUG).*)*
Use it with no extra flags. It reads as "a line starting with DEBUG, followed by lines not starting with DEBUG". The DEBUG can be replaced by (?:DEBUG|INFO|ETC). It seems to work in the demo applet.
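For anyone who wants to try the suggestion outside the demo applet, here is a minimal sketch. It rests on a few assumptions: it uses java.util.regex instead of jregex (the pattern only uses multiline mode, a non-capturing group and a negative lookahead, but it has not been checked against jregex here), the class and method names are made up for illustration, and WARN and ERROR are guessed level names standing in for the "ETC" above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jregex or of the original post.
class LogEntrySplitter {
    // Assumed level names; replace with whatever the real log uses.
    private static final String LEVELS = "(?:DEBUG|INFO|WARN|ERROR)";

    // "A line starting with a level, followed by lines not starting with a level."
    private static final Pattern ENTRY = Pattern.compile(
        "(?m)^" + LEVELS + ".*(?:\\s+^(?!" + LEVELS + ").*)*");

    // Returns each complete log entry, continuation lines included.
    static List<String> splitEntries(String log) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            entries.add(m.group());
        }
        return entries;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] lots of text",
            "that may be",
            "over multiple lines",
            "DEBUG 2004-01-23 11:29:58,818 {anonymous} [Servlet.Engine.Transports:9] more random text",
            "that may be over",
            "multiple lines");
        // One bracketed block per entry; the last entry needs no trailing DEBUG.
        for (String entry : splitEntries(log)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

Because the pattern describes what an entry contains rather than looking ahead for the next DEBUG, the last entry in the file is matched the same way as all the others.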
Hope this helps. Questions are welcome.
Regards,
Sergey
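P.S. The skipped 'D' from point 1 can be reproduced in a few lines. This is only a sketch under assumptions: it uses java.util.regex, whose find() also steps one character forward after an empty match, as a stand-in for jregex, and the tiny pattern (?=D)|\w+ is a made-up reduction of the original expression, not the expression itself. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of jregex.
class EmptyMatchDemo {
    // Collects the text of every successive match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // First branch matches the empty string just before 'D'.
        // The next search starts one char later, so the second branch sees "EBUG".
        System.out.println(allMatches("(?=D)|\\w+", "DEBUG")); // prints [, EBUG]
    }
}
```

The first match is empty and sits right before the 'D'; the next search begins one character further on, so the word branch can only pick up "EBUG".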