Extract all links, all URLs of a text file?

Dirk
2014-02-26
2014-03-07
  • Dirk

    Dirk - 2014-02-26

    How can you extract all links, all URLs of a text file?

     
  • cchris

    cchris - 2014-03-01

    Use the LineFilter2 plugin with a suitable regular expression to match URLs.
    The one N++ uses for clickable links is

    "[A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+"

    CChris

     
  • THEVENOT Guy

    THEVENOT Guy - 2014-03-01

    Hi Dirk, CChris and All,

    After the two slashes of the regex, the list of characters between the two square brackets can be simplified!

    Indeed, if we just follow the natural order of the ASCII characters, from code-point 032 to code-point 127, the form [....]+ can be rewritten as [!#%&(-;=?-\]_a-~]+

    So the regex to use with the LineFilter2 plugin becomes:

    [A-Za-z]+://[!#%&(-;=?-\]_a-~]+
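The claim that both character classes accept the same characters can be checked mechanically. Below is a minimal sketch in Python, assuming Python's re module parses these two classes the same way the Notepad++ engine does; the names ORIGINAL, SIMPLIFIED and accepted are made up for this illustration.

```python
import re

# Character class from CChris's regex (the part after "://").
ORIGINAL = re.compile(r"[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]")
# Simplified class, written as ranges in ASCII order.
SIMPLIFIED = re.compile(r"[!#%&(-;=?-\]_a-~]")

def accepted(char_class):
    """Return the set of ASCII characters that the class matches."""
    return {chr(c) for c in range(128) if char_class.fullmatch(chr(c))}

# Compare the two sets and show any character accepted by only one class.
print(accepted(ORIGINAL) == accepted(SIMPLIFIED))
print(sorted(accepted(ORIGINAL) ^ accepted(SIMPLIFIED)))
```

An empty symmetric difference means the shorter class is a drop-in replacement for the longer one.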

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-03-01
  • biffons

    biffons - 2014-03-03

    Thank you very much guy038 and CChris.

    I am not quite sure how to do it. I open this window - http://i.imgur.com/wNIdwxL.png - and add the regex
    [A-Za-z]+://[!#%&(-;=?-\]_a-~]+
    or
    "[A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+"
    and click Search. The content is then marked red, but no link or URL is extracted; the surrounding text is still there.

    What am I missing?

    Many thanks again.

     
  • THEVENOT Guy

    THEVENOT Guy - 2014-03-03

    Hello Biffons, Dirk and All,

    I understood, from your link, why the regexes didn't work at all. Indeed, you just need to select the search mode "Regular expression", as the search contains special regex characters.

    You may keep the show info header square box checked if you prefer, but I advise you to uncheck the ignore case option, especially when using ranges of characters.

    For example, if you search for the regex [Y-b] in N++:

    • with the option Match case checked, this regex does match one of the characters Y, Z, [, \, ], ^, _, `, a, or b ( from decimal code-point 89 to code-point 98, i.e. U+0059 to U+0062 )

    • with the option Match case unchecked, this regex doesn't match anything, because the regex [Y-b] is then treated as the regex [Y-B], which is an invalid regular expression ( a reversed range, as Y comes after B )!
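The range arithmetic behind these two bullets can be illustrated with a small Python sketch. It only shows how a case-sensitive range and a reversed range behave in Python's re module, which is not the engine Notepad++ uses, so the case-insensitive behaviour described for N++ is not reproduced here. The helper names chars_matching and is_valid are hypothetical.

```python
import re

def chars_matching(pattern):
    """List the ASCII characters matched by a one-character pattern."""
    regex = re.compile(pattern)
    return [chr(c) for c in range(128) if regex.fullmatch(chr(c))]

def is_valid(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Case-sensitive range from Y (89) to b (98): ten characters.
print(chars_matching("[Y-b]"))
# A range whose start comes after its end is rejected.
print(is_valid("[Y-B]"))
```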

    Happily, CChris's regex, and my regex too, don't care about that option! So, when you click on the Search button, all the links and URLs of the current file are re-written in a new tab :-) Nice!

    Just see the attached picture Biffons.png, below, which sums up the steps to perform!

    Cheers,

    Guy038

     
    Last edit: THEVENOT Guy 2014-03-03
  • biffons

    biffons - 2014-03-03

    Hello Guy038,

    Thank you very much, very easily understandable (I do not have any idea of regex).

    Ah, sorry, Biffons and Dirk are the same person. I never get notification e-mails, so I tried to change that by registering again a few times, but without success.

    When I use this
    [A-Za-z]+://[!#%&(-;=?-\]_a-~]+
    with the settings - http://i.imgur.com/yCO5vVw.png - I get this (completely unchanged from the original):
    http://i.imgur.com/AO5SSrb.png

    Thanks again.

     
  • THEVENOT Guy

    THEVENOT Guy - 2014-03-05

    Hi Dirk and All,

    OK, I got the trick about your two user-names! And I also understood why you get unchanged text after the search with the plugin LineFilter2. It's just because the plugin LineFilter2 copies every entire line which contains a match of the regex search!
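The difference between copying whole matching lines and copying only the matches can be sketched in Python. This is an illustration of the two behaviours, not the plugin's actual code; the function names filter_lines and extract_urls and the sample text are assumptions made for the example.

```python
import re

# URL regex from this thread.
URL = re.compile(r"[A-Za-z]+://[!#%&(-;=?-\]_a-~]+")

def filter_lines(text):
    """Keep every whole line that contains at least one match."""
    return [line for line in text.splitlines() if URL.search(line)]

def extract_urls(text):
    """Keep only the matched addresses, without the surrounding text."""
    return URL.findall(text)

sample = "intro text\nvisit http://a.example/x now\nno link here\n"
print(filter_lines(sample))
print(extract_urls(sample))
```

If every line of a file contains a link, the line-filtering approach returns the file unchanged, which matches the result reported above.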

    Thus, if you strictly want to extract all Internet addresses from a file, one per line, without any text before and/or after the address, in Notepad++ and without the help of any plugin, follow the method described below:

    • Copy the file to scan for Internet addresses into a new tab

    • Move the cursor to the very beginning of the file with the shortcut CTRL + Home

    • Perform the Notepad++ Search/Replacement below :

    SEARCH : (?s).*?([A-Za-z]+://[!#%&(-;=?-\]_a-~]+)|.*\z

    REPLACE : (?1\1\r\n)

    Notes :

    As shown in an example of an HTML file, on the attached pictures Dirk_Before.png and Dirk_After.png, below:

    • The options Regular expression and Match case must be checked

    • The options Wrap around and . matches newline must be unchecked

    Explanations :

    • The Search part is a simple alternative ( | symbol ) between two regexes. It tries to match the regex .*?([A-Za-z]+://[!#%&(-;=?-\]_a-~]+) OR the regex .*\z, if the previous one isn't found.

    • The Replacement part is a conditional replacement, whose general form is:

    (?nTHEN part:ELSE part) which means : If the group n exists, then re-write the THEN part else re-write the ELSE part.

    • Due to the modifier (?s), the part .*?, followed by [A-Za-z]+://[!#%&(-;=?-\]_a-~]+ ( the definition of an Internet address ), matches ANY characters, including EOL character(s), up to the FIRST Internet address found after the current position in the file. As the matched Internet address is enclosed in round brackets, it forms group 1 and is written back as \1, followed by \r\n ( the Windows EOL characters ).

    • After the LAST Internet address found, the remaining text, till the very end of the file, is matched by the regex .*\z, as the modifier (?s) is still active.

    • This time, as group 1 ( the Internet address ) is NOT matched, the ELSE part of the conditional replacement is executed. But, as NO colon exists in the replacement, the ELSE part is empty, so the remaining text is replaced with nothing. So, it is deleted.
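For readers who want the same effect outside Notepad++, here is a rough Python sketch of this search/replacement. Python's re module has no (?1...) conditional replacement, so a replacement function stands in for it, and \Z is used in place of \z. The function name extract_urls_one_per_line and the sample text are made up for illustration.

```python
import re

# Same alternation as the SEARCH field: skip text up to the next address
# (captured as group 1), OR match whatever remains up to the end of the text.
PATTERN = re.compile(r"(?s).*?([A-Za-z]+://[!#%&(-;=?-\]_a-~]+)|.*\Z")

def extract_urls_one_per_line(text):
    """Return the addresses found in text, each followed by a Windows EOL."""
    def replace(match):
        url = match.group(1)
        # THEN part: group 1 matched, so write it back followed by \r\n.
        # Empty ELSE part: the trailing text is replaced with nothing.
        return url + "\r\n" if url is not None else ""
    return PATTERN.sub(replace, text)

sample = (
    "see http://a.example and https://b.example/x?y=1 end\n"
    "more ftp://c.example/file.txt\n"
    "tail"
)
print(repr(extract_urls_one_per_line(sample)))
```

Two addresses on the same line end up on separate lines, because each match of the first alternative consumes only the text up to and including one address.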

    Hope that results of this S/R, on your file, will be OK, this time !

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-03-05
  • biffons

    biffons - 2014-03-07

    Ah, whatever I do, I do not get any e-mail notification.

    Hello guy038.

    Thank you very much for the new regexe(s) and for the explanations.

    Great, it works perfectly. I tested some files and all links of each file were extracted perfectly, including links on the same line: two different links / URLs written on the same line are each extracted to a line of their own.

    Thank you a lot for your great work, I appreciate it very much.

     
