Request Regex help here

Loreia2
2013-11-22
2013-11-24
  • Loreia2
    2013-11-22

    Lately I see lots of questions where users try to get help regarding regular expressions (regex).

    Perhaps it would be smart to open one thread dedicated to this purpose. With a dedicated thread, more people will read your question, and hopefully a speedy answer will follow.

    Let's start with the first question :-)

    I need to split a log file into smaller chunks (one per test case).
    The general format of the log file is:

    TEST SUITE SETUP PART
    Test case start string
    Test case 1 body
    Test case end string
    .
    .
    many test cases follow one after another
    .
    .
    .
    Test case start string
    Test case n body
    Test case end string
    TEST SUITE TEAR DOWN PART
    

    Each line starts with a date and time string of variable length and format.

    In the first version of my Python script, it was easy to extract each test case into a separate file because I had well-defined START and END strings. So I was able to write a simple one-liner in Python.

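    import re

    # [^\n]* is needed to capture the entire start line, including the date prefix
    START = r"[^\n]*Test case start string"
    END = r"Test case end string"

    # file_content holds the whole log as one string; re.DOTALL lets .*? span
    # the lines of the body; re.findall() returns a list of non-overlapping strings
    test_cases = re.findall(START + r".*?" + END, file_content, re.DOTALL)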
    

    That one was easy, but my requirements have changed, and the job is more complex this time.

    The general format of the log file is now:

    TEST SUITE SETUP PART
    line before start string (always different, no fixed elements in this line)
    Test case start string
    Test case 1 body
    .
    .
    many test cases follow one after another
    .
    .
    .
    line before start string (always different, no fixed elements in this line)
    Test case start string
    Test case n body
    TEST SUITE TEAR DOWN PART
    

    The START string is now two lines long. The first line is completely random (date and file path), and the second one is a random part (date) plus a fixed part ("Test case start string").
    The END string is no longer defined. A test case starts at the START string and ends when the next test case starts. I also need to add TEST SUITE SETUP PART to the first test case and TEST SUITE TEAR DOWN PART to the last one.

    This is what I did:

    import re

    # START spans two lines: any line, then a line containing "Test case start string"
    START = r"\n.*?\n.*?Test case start string"

    # re.split() returns a list of strings that do not include the START string,
    # for example: re.split("x", "AxBxCxD") returns ["A", "B", "C", "D"]
    test_cases = re.split(START, file_content)
    # re.findall() returns the matched START strings themselves, in the same order
    split_strings = re.findall(START, file_content)

    # now I am forced to merge TEST SUITE SETUP PART (test_cases[0]) with the
    # first split_string and the first test case;
    # this is the actual code from my script:
    testcases = []
    testcases.append(test_cases[0] + split_strings[0] + test_cases[1])
    # and then I merge all the other split_strings with their corresponding test cases
    for tag, tc in zip(split_strings[1:], test_cases[2:]):
        testcases.append(tag + tc)
    

    While this code is relatively simple and easy to explain and understand, I still have a feeling that colleagues with zero experience of Python syntax might find it complicated. Instead, I would like to write a single regex that does the same thing. Any ideas are appreciated. The requirements are:

    General structure of the file:

    (date) TEST SUITE SETUP PART
    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    (date) Test case 1 body
    (date) .
    (date) .
    (date) many test cases follow one after another
    (date) .
    (date) .
    (date) .
    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    (date) Test case n body
    (date) TEST SUITE TEAR DOWN PART
    

    The only known part is: test cases start with a line:
    (date) Test case start string
    (date) part is configurable, and should be treated as completely random.
    Line before start line belongs to the test case too. Its content is also completely random.
    Test case ends when next test case starts.

    (date) TEST SUITE SETUP PART belongs to the first test case
    (date) TEST SUITE TEAR DOWN PART belongs to the last test case

    So the end result should be something like:

    (date) TEST SUITE SETUP PART
    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    (date) Test case 1 body
    

    This part is repeated n times:

    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    (date) Test case n body
    

    .

    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    (date) Test case n body
    (date) TEST SUITE TEAR DOWN PART
    

    In my opinion, the logic of the regex should be:
    start: FROM THE START OF FILE or FROM THE TEST CASE START LINE
    end: AT THE NEXT TEST CASE START LINE or AT THE END OF FILE

    Is it possible to write such a regex?
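
    As a rough sketch only, the start/end logic described above could be written as one Python pattern. It assumes the whole log fits in memory as a single string with "\n" line endings; the names START_MARKER, START, CHUNK and the sample text are made up for illustration, and the pattern is not tuned for speed on very large files.

```python
import re

# Hypothetical marker; replace it with the fixed text from the real log.
START_MARKER = "Test case start string"

# "Line before" plus the start line up to the marker. [^\n] is used instead
# of '.' so that re.DOTALL can be switched on for the lazy parts.
START = r"[^\n]*\n[^\n]*" + re.escape(START_MARKER)

CHUNK = re.compile(
    r".*?" + START                  # anything, then one test case start
    + r".*?"                        # test case body, as short as possible
    + r"(?=\n" + START + r"|\Z)",   # stop before the next start, or at end of text
    re.DOTALL,
)

# Made-up sample in the same shape as the structure described above.
sample = (
    "(date) TEST SUITE SETUP PART\n"
    "(date) some random line\n"
    "(date) Test case start string\n"
    "(date) Test case 1 body\n"
    "(date) another random line\n"
    "(date) Test case start string\n"
    "(date) Test case 2 body\n"
    "(date) TEST SUITE TEAR DOWN PART\n"
)

chunks = CHUNK.findall(sample)
for number, chunk in enumerate(chunks, start=1):
    print("--- test case", number, "---")
    print(chunk.strip("\n"))
```

    Because each match begins exactly where the previous one ended, the first chunk absorbs the setup part and the last chunk runs to the end of the text, so it absorbs the tear-down part. Every chunk after the first begins with the newline that closed the previous chunk, which is why the usage loop strips it.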

    Thanks to anyone willing to invest his time and knowledge into this.

    BR,
    Loreia

     
  • THEVENOT Guy
    2013-11-23

    Hello Loreia2,

    Here's my first attempt to match your needs (regex-wise :-) ). It is only a first attempt, because I don't think I understood everything you said :-(

    I also assume that the "Test case n body" may span several lines and may contain blank lines too!

    So, in a new tab of Notepad++, copy the example text below:

    NOTE: In that example text, I inserted dates in various formats and lengths, and I wrote random text inside the different body parts.


    (23/11/2013) TEST SUITE SETUP PART
    (23/11/13) line before start string (always different, no fixed elements in this line)
    (13-11-23) Test case start string
    (2013.11.23) Test case 1 body
    bla bla
    bla
    (11/23/2013) line before start string (always different, no fixed elements in this line)
    (11/23/13) Test case start string
    (11-23/2013) Test case 2 body foo foo
    Test
    foo
    23/11/2013 line before start string (always different, no fixed elements in this line)
    23/11/13 Test case start string
    13-11-23 Test case 3 body

    Wow

    Wow

    Wow
    2013.11.23 line before start string (always different, no fixed elements in this line)
    11/23/2013 Test case start string
    11/23/13 Test case 4 body
    Dummy Text
    for a text

    11-23/2013 TEST SUITE TEAR DOWN PART


    Now copy (CTRL-C / CTRL-V) the long regex below into the search dialog of N++:

    (?-is).*\R.*Test case start string\R\K(?s).*?\R(?-s)(?=(.*\R.*Test case start string|.*TEST SUITE TEAR DOWN PART))

    Then, each time you hit the Find Next button, it will select ALL the lines, with their EOLs, from just BELOW a "(date) Test case start string" line up to, but not including, the line that precedes the next "(date) Test case start string" line. For the last test case, the selection stops just above the TEST SUITE TEAR DOWN PART line. (Not very easy to explain!)

    Are the smaller strings found with that regex exactly the ones you would like to identify?


    This regex can be split into four main parts:

    (?-is), at the beginning, is a modifier that makes the search case-sensitive (no ignore case) and makes the dot match any single character except EOL characters (no single-line mode).

    .*\R.*Test case start string\R\K plays the role of a lookbehind. It matches two lines: the (date) Test case start string line with its EOL, AND the line ABOVE it, with its EOL too. The final \K then discards everything matched so far, so those two lines are not part of the selection. Remember that a true lookbehind is not allowed in our case, because that sub-pattern does NOT have a fixed length!

    (?s).*?\R is the part that is actually selected: the shortest run of any characters (including EOLs) that ends with an EOL. This works because the (?s) modifier, this time, lets the dot match EOL characters too, so the text is treated as a single line!

    (?-s)(?=(.*\R.*Test case start string|.*TEST SUITE TEAR DOWN PART)) is a traditional positive lookahead, built as an alternation between two patterns:

    • .*\R.*Test case start string : two lines: the line ABOVE the (date) Test case start string line, with its EOL, AND the (date) Test case start string line itself, up to the fixed text.

    • .*TEST SUITE TEAR DOWN PART : the last line, TEST SUITE TEAR DOWN PART, possibly preceded by some characters.

    At the beginning of that last part, the (?-s) modifier again makes the dot match any character except EOL characters!

    Moreover, remember that the \R form represents ANY kind of EOL ( \r\n or \n or \r ), in the SEARCH part of a regex.
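
    For anyone who wants to try the same idea from a Python script: Python's built-in re module supports neither \R nor \K, so the sketch below is an approximation, not a literal port. It assumes "\n" line endings, uses a capture group in place of \K, and uses the scoped flag (?s:...) in place of the (?s) / (?-s) switches. The names PATTERN, text and bodies, and the shortened sample text, are made up for illustration.

```python
import re

PATTERN = re.compile(
    r".*\n.*Test case start string\n"      # line above + start line: matched, not kept
    r"((?s:.*?)\n)"                        # group 1: the body, ending with an EOL
    r"(?=.*\n.*Test case start string"     # next "line above" + start line ...
    r"|.*TEST SUITE TEAR DOWN PART)"       # ... or the tear-down line
)

# Shortened version of the example text from this post.
text = (
    "(23/11/2013) TEST SUITE SETUP PART\n"
    "(23/11/13) line before start string\n"
    "(13-11-23) Test case start string\n"
    "(2013.11.23) Test case 1 body\n"
    "bla bla\n"
    "(11/23/2013) line before start string\n"
    "(11/23/13) Test case start string\n"
    "(11-23/2013) Test case 2 body\n"
    "\n"
    "foo\n"
    "11-23/2013 TEST SUITE TEAR DOWN PART\n"
)

# Keep only group 1, which plays the role of what \K leaves selected in N++.
bodies = [match.group(1) for match in PATTERN.finditer(text)]
for body in bodies:
    print(repr(body))
```

    Like the N++ regex, this returns only the body lines of each test case, without the two start lines and without the setup and tear-down parts.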

    See you soon, for your comments !

    Best Regards,

    guy038

     
    Last edit: THEVENOT Guy 2013-11-23
  • Loreia2
    2013-11-24

    Hi Guy,

    thanks for this great answer, it gave me an idea of how to solve this. I will test and report back here early next week, after I have played a bit with a few RE combinations.

    I also see that I wasn't clear about a few details. The (date) part is really a mess: this log file is processed several times, and each line may contain zero, one or two dates in a row, in pretty much any format. This happens because several loggers process each line, and each logger can add stuff of its own. (date) should really be treated as a random string of variable length.

    Here are several examples from the actual file:

    14:33:44 2013-11-24 14:33:44,146 Configuration
    14:38:43 Total tests run:
    14:41:26 [sonar:sonar] 14:41:26.175 DEBUG
    14:41:26    at org.
    14:22:47 [WARNING] Assembly file
    

    Also, TEST SUITE SETUP PART and TEST SUITE TEAR DOWN PART are multiline strings too; in fact, they represent megabytes of random text. The only thing fixed in my log file is this:

    14:24:43 2013-11-24 14:24:38,671 New file: /root/very/long/path/to/file_03_01_01.log.html main se.xxx.ng.logging.writers.HtmlLogWriterNG.createAppender(HtmlLogWriterNG.java:506)
    14:24:43 2013-11-24 14:24:38,681 Default XXX testcase setUp main se.xxx.ng.logging.writers.HtmlLogWriterNG.setTestInfo(HtmlLogWriterNG.java:81)
    

    Of this, only the string Default XXX testcase setUp is known and fixed; everything else in the quote above is random.

    So, these two lines represent the start of a test case. In my original question I represented these two lines as:

    (date) line before start string (always different, no fixed elements in this line)
    (date) Test case start string
    

    The test case body is also very, very long: many lines of text.
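
    One way this could be approached in Python, as a sketch under assumptions rather than a tested solution for the real log: split at a zero-width boundary placed at the start of every line whose NEXT line contains the fixed text, then glue the first piece (the setup part) onto the first test case. It assumes "\n" line endings and Python 3.7 or later, since older versions do not split on empty matches. The names MARKER, BOUNDARY and split_test_cases, the file paths and the miniature log are invented for illustration.

```python
import re

# Fixed text quoted above; everything else on the line is treated as random.
MARKER = "Default XXX testcase setUp"

# Zero-width boundary: the start of any line whose NEXT line contains MARKER.
BOUNDARY = re.compile(
    r"^(?=[^\n]*\n[^\n]*" + re.escape(MARKER) + r")",
    re.MULTILINE,
)

def split_test_cases(content):
    """Return one string per test case; the setup text joins the first one."""
    pieces = BOUNDARY.split(content)  # splitting on empty matches: Python 3.7+
    if len(pieces) < 2:
        return pieces                 # no test case start was found
    return [pieces[0] + pieces[1]] + pieces[2:]

# Invented miniature log in the style of the examples above.
log = (
    "14:22:47 [WARNING] Assembly file\n"
    "14:24:43 2013-11-24 14:24:38,671 New file: /tmp/a.log.html main\n"
    "14:24:43 2013-11-24 14:24:38,681 Default XXX testcase setUp main\n"
    "14:33:44 2013-11-24 14:33:44,146 Configuration\n"
    "14:38:40 New file: /tmp/b.log.html main\n"
    "14:38:41 Default XXX testcase setUp main\n"
    "14:38:43 Total tests run:\n"
)

cases = split_test_cases(log)
for number, case in enumerate(cases, start=1):
    print("--- test case", number, "---")
    print(case, end="")
```

    Nothing special is needed for the tear-down part: there is no boundary after the last test case start, so the last piece simply runs to the end of the text.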

    But don't worry about it for now, you gave me a few ideas to play with next week. Let's see if that is enough to solve the puzzle.

    BR,
    Loreia