How do I strip all but needed text in documents

Faze
2013-11-08
2013-11-12
  • Faze

    Faze - 2013-11-08

    I have documents of thousands of lines each, and many of those lines contain an MD5 entry (md5 + 32 characters).

    I need to strip all text except each (md5 + 32 characters) entry, leaving every entry on its own line.

    After that I can delete all the empty lines and save the documents for use with a text compare tool. If you can, please also suggest a text compare tool that can compare these edited, saved documents.

    Other notes:

    I did try comparing the documents as they are with the Compare plugin, but it hung and never completed, maybe because it uses only one thread or because of the number of lines to compare.

    I also tried Find in Extended mode. As there are no examples of it in the Notepad++ offline help, it was doomed to fail, and so it did. The Find input I tried was:
    md5 \u################################
    Some examples of how to use Extended mode in the help files would be useful.

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-11-09

    Hello Faze,

    From reading your post, I deduce that your file contains three types of lines :

    • Lines that contain the string md5, followed by a space, followed by 32 digits, with possible characters either before or after the block md5 ####....#####

    • Non-empty lines that do not contain the string md5 ####.......#### at all

    • Blank lines

    And that, after the search/replacement, you only want the different strings md5 ####...###, one per line, like this :

    md5 #####......######
    md5 #####......######
    md5 #####......######
    ...

    Then :

    • Open the search/replacement dialog, with the CTRL + H shortcut

    • Set, of course, the Regular expression Search mode

    • Unset, if necessary, the option . matches newline ( IMPORTANT )

    • Fill in the search and replacement fields as below :

    SEARCH : .*(md5 \d{32})|.*\R?

    REPLACE : (?1\1\r\n)

    • Click on the Replace All button

    Et voilà !

    Generally speaking, if you wish to keep ONLY a specific string XXX that can be found inside some lines of a file, use the S/R below :

    SEARCH : .*(XXX)|.*\R?

    REPLACE : (?1\1\r\n)

    Of course, the string XXX may, also, be a regular expression itself !


    Explanations :

    At any position of the file :

    • The search string tries to match a string XXX, preceded by any characters, even none

    • If there is no match, the search string tries to match all the remaining characters of the line, followed by the End of Line character(s), if any

    ( Remember that, for the usual line endings, the regex \R? is equivalent to the regex (\r\n|\n|\r){0,1} )

    The replacement regex uses a conditional form, whose general syntax is :

    (?n....:....). It means :

    • If the group n exists, the replacement string is all the characters between ?n and the colon (:)

    • If the group n doesn't exist, the replacement string is all the characters between the colon and the closing round bracket

    So, in our case, it means that :

    • If the group 1 is found ( the string md5 ####....##### ), just rewrite it, followed by an EOL

    • Otherwise, don't write anything ( because the else part of the conditional replacement is not present ! )
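
    For readers who want to try the same logic outside Notepad++, here is a minimal Python sketch of the generic S/R above. It is an emulation under stated assumptions, not the Notepad++ engine: Python's re module has neither \R nor the (?1...) conditional replacement, so the line breaks are spelled out and the conditional becomes a small function. The name keep_only and the sample text are hypothetical.

```python
import re


def keep_only(text: str, xxx: str) -> str:
    """Keep only the matches of the regex `xxx`, one per line."""
    # [^\r\n]* stands in for .* with ". matches newline" unset;
    # (?:\r\n|\n|\r)? stands in for \R? (Python's re has no \R).
    pattern = re.compile(rf"[^\r\n]*({xxx})|[^\r\n]*(?:\r\n|\n|\r)?")

    def replace(match: re.Match) -> str:
        # Emulates (?1\1\r\n): if group 1 took part in the match,
        # rewrite it followed by an EOL; otherwise write nothing.
        if match.group(1) is not None:
            return match.group(1) + "\r\n"
        return ""

    return pattern.sub(replace, text)


# Hypothetical sample: two lines carry a key=<number> string, two do not.
sample = "abc key=1 def\r\nnothing here\r\n\r\nkey=22 tail\r\n"
print(repr(keep_only(sample, r"key=\d+")))  # 'key=1\r\nkey=22\r\n'
```

    As in the Notepad++ version, the leading greedy part means that only the last XXX on a line is kept when a line contains several.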

    I hope I have understood your needs correctly ! If not, feel free to give us further information

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2013-11-09
  • Faze

    Faze - 2013-11-09

    Hi, I have tried what you said, and yes, regex (POSIX) goes way over my head; there are too many variables.

    I tried what you wrote, but it doesn't work with Replace, Find, or Mark. The best it can do is clear everything. Normally I need to do this with Windows .txt and .rtf documents, and occasionally some other document types, if this makes a difference.

    Here is something you can test with. Save it to a simple .txt document and use the latest Windows Notepad++, v6.5.1:

    dsafdsfdasf
    asdfaslfpoagewrtg
    awgeokgpoakgpoawerg

    {

    [

    (
    greasogarekger md5 f8a464f460cbac47cff6a2e68ad65fa4 fadsjflkjjawiefiouewhfuiawhefiupa(jfdskja )efajfoijoiew
    )

    ]
    }

    faspfkopkpowkfai09239jkf0j09awfmu90823ruprj rvr9[}]

     
  • THEVENOT Guy

    THEVENOT Guy - 2013-11-09

    Hello Faze,

    OK, I understand the problem :-) We just forgot that an MD5 string is a sequence of hexadecimal characters !

    So the regex must obviously match any digit from 0 to 9 as well as the upper case letters A to F and the lower case letters a to f !

    Then, the Search/Replacement becomes :

    SEARCH : .*(md5 [0-9A-Fa-f]{32})|.*\R?
    OR
    SEARCH : .*(md5 [\dA-Fa-f]{32})|.*\R?

    REPLACE : (?1\1\r\n)

    You can also use the POSIX class [:xdigit:] instead, which must be enclosed in square brackets :

    SEARCH : .*(md5 [[:xdigit:]]{32})|.*\R?

    REPLACE : (?1\1\r\n)
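
    The difference between the digit-only and the hexadecimal versions can be checked quickly against the test line from the sample above. This is a small Python sketch, not Notepad++ itself; it uses the explicit class [0-9A-Fa-f] because Python's re module does not support POSIX classes such as [[:xdigit:]], and it extracts with findall rather than a replacement, so it would keep every hash on a line instead of only the last one.

```python
import re

# The test line from the sample above.
line = ("greasogarekger md5 f8a464f460cbac47cff6a2e68ad65fa4 "
        "fadsjflkjjawiefiouewhfuiawhefiupa(jfdskja )efajfoijoiew")

# \d only covers 0-9, so a hash containing the letters a-f is never matched.
digits_only = re.search(r"md5 \d{32}", line)

# The hexadecimal class accepts digits and the letters A-F in either case.
hex_digits = re.search(r"md5 [0-9A-Fa-f]{32}", line)

print(digits_only)          # None
print(hex_digits.group(0))  # md5 f8a464f460cbac47cff6a2e68ad65fa4

# Extract every "md5 <hash>" string from a multi-line text, one per list item.
text = "dsafdsfdasf\r\n{\r\n" + line + "\r\n}\r\n"
found = re.findall(r"md5 [0-9A-Fa-f]{32}", text)
print("\r\n".join(found))
```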

    Best regards,

    guy038

     
    Last edit: THEVENOT Guy 2013-11-09
  • Fool4UAnyway

    Fool4UAnyway - 2013-11-10

    Smiling.

    You could probably perform this MD5 check right away with ExamDiff Pro.

    If all the MD5s are at the end of the line, you might simply ignore anything before them.
    If your regular expression is specific enough, it may even skip the lines without an MD5 as well.

    Perhaps the simplest way for you to do this is:
    - add a nonsense MD5 to each line, say 32 A characters
    - search for double MD5s and remove the nonsense MD5
    - compare all lines by ignoring anything in front of the last 32 characters (1-[33]).
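
    The idea of comparing on the hashes alone, ignoring the rest of each line, can also be sketched in a few lines of Python. This is only an illustration, unrelated to how ExamDiff Pro or the Compare plugin work; the function names and the sample texts are hypothetical, and hashes are compared case-insensitively as an assumption.

```python
import re

MD5_RE = re.compile(r"md5 ([0-9A-Fa-f]{32})")


def md5_set(text: str) -> set:
    """Collect every hash that follows 'md5 ', lower-cased for comparison."""
    return {h.lower() for h in MD5_RE.findall(text)}


def compare(text_a: str, text_b: str):
    """Return (hashes only in A, hashes only in B), each sorted."""
    a, b = md5_set(text_a), md5_set(text_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical documents built from placeholder hashes.
doc_a = "x md5 " + "a" * 32 + " y\nmd5 " + "b" * 32 + "\n"
doc_b = "md5 " + "B" * 32 + "\nzz md5 " + "c" * 32

only_a, only_b = compare(doc_a, doc_b)
print("only in A:", only_a)
print("only in B:", only_b)
```

    Because sets ignore order and duplicates, this reports which hashes are missing from either document but not where they occur.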

     
  • Faze

    Faze - 2013-11-12

    Thanks, THEVENOT Guy. This has saved me many days or weeks of doing the same thing manually. Now that I have the few examples you gave, I can do the same for other documents. I have needed something similar for different text documents many times before, and with the solutions you gave I will be able to handle those in future too. Don't give up, Notepad++ needs you on their forums :)

     
