#172 Make all URIs clickable - not just http://

Next_release
closed
Don HO
None
6
2010-02-26
2010-01-27
kirsche40
No

Hi,

I have a (half-baked) workaround for the https:// issue. The solution is half-baked because SciTE does not use POSIX ERE (with alternation) but only POSIX BRE (without alternation). Unfortunately, Notepad++ uses SciTE's RegEx parser (https://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators). So I could not use alternation to separate the different URI schemes. Here is a description of what I have done:

Notepad_plus.cpp;5.6.6

35c35
< const char *urlHttpRegExpr = "http://[a-z0-9_\\-\\+.:?&@=/%#]*";
---
> const char *urlHttpRegExpr = "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#]+";

For the URI scheme, the RegEx (\w+) or (\D+) is not usable because "http1://" or "htt p://" would be matched. So I used a set of characters: "[A-Za-z]+". I do not know if there are URI schemes (http://tools.ietf.org/html/rfc3986) with uppercase letters. If they do not exist, remove them from the set! If the URI scheme length is limited in some way or has a minimum size, we can add these limits to the RegEx string.

I also changed the character range for the authority and path. Now uppercase letters are matched. With the old implementation, the match of the URL
http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators
will stop at
http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=
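
The truncation can be reproduced outside Notepad++. This is only a sketch: it uses C++ `std::regex` (ECMAScript grammar) as a stand-in for the Scintilla engine, which is an assumption, and the names `firstMatch`, `oldUrlRegExpr` and `newUrlRegExpr` are made up for illustration.

```cpp
#include <regex>
#include <string>

// Old and new patterns, copied from the diff above.
const char *oldUrlRegExpr = "http://[a-z0-9_\\-\\+.:?&@=/%#]*";
const char *newUrlRegExpr = "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#]+";

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string if nothing matches.
// ASSUMPTION: std::regex stands in for the Scintilla regex engine that
// Notepad++ actually uses; the two engines may differ in details.
std::string firstMatch(const char *pattern, const std::string &text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With the wiki URL above as input, `firstMatch(oldUrlRegExpr, url)` should end at `title=` because the uppercase `U` is not in the old character set, while `firstMatch(newUrlRegExpr, url)` should return the whole URL.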

Finally, I changed the quantifier for the authority and path from '*' to '+'. Without this change, a bare URI scheme without authority and path would also be underlined and clickable as a link in Notepad++ (IMHO wrong behaviour).

For a fast check I used the Notepad++ integrated RegEx-Search on the following lines:
abc htt p://abc
abc ://abc
abc_://abc
abc https://abc
abc http://abc
abc ftp://abc
abc httpabc
abc ABC://abc
abc AbC://abc
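
The same check can be scripted. This is a sketch under the assumption that C++ `std::regex` (ECMAScript grammar) behaves like the Notepad++ search for this pattern; `checkLines` and `firstMatches` are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <vector>

// The nine check lines listed above.
const std::vector<std::string> checkLines = {
    "abc htt p://abc", "abc ://abc",    "abc_://abc",
    "abc https://abc", "abc http://abc", "abc ftp://abc",
    "abc httpabc",     "abc ABC://abc",  "abc AbC://abc",
};

// Hypothetical helper: for every line, collect the first match of the
// proposed pattern, or an empty string if the line contains no match.
// ASSUMPTION: std::regex is only a stand-in for the Scintilla engine.
std::vector<std::string> firstMatches(const std::vector<std::string> &lines)
{
    static const std::regex re("[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#]+");
    std::vector<std::string> result;
    for (const std::string &line : lines) {
        std::smatch m;
        result.push_back(std::regex_search(line, m, re) ? m.str(0)
                                                        : std::string());
    }
    return result;
}
```

Under this stand-in engine, the lines `abc ://abc`, `abc_://abc` and `abc httpabc` give no match, and the https, http, ftp, `ABC` and `AbC` lines match from the scheme to the end of the line. Note that `abc htt p://abc` still yields `p://abc`, because a single letter already satisfies `[A-Za-z]+`.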

Another solution could be using a new RegEx-Parser like the small http://www.pcre.org/. This can be attached to SciTE. Look at
http://www.scintilla.org/ScintillaDoc.html and search for "A different regular expression library can be". Maybe this could be used as plugin?
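
To illustrate what alternation would allow, here is a sketch using C++ `std::regex` as an example of an engine that supports it. The scheme list (`http`, `https`, `ftp`) and the function name `containsKnownSchemeUri` are assumptions for illustration only; which schemes to accept is not decided in this ticket.

```cpp
#include <regex>
#include <string>

// Hypothetical check: does `text` contain a URI whose scheme is one of an
// explicitly listed set? The alternation (https?|ftp) is exactly what a
// BRE-only engine cannot express.
// ASSUMPTION: the scheme list is made up for this example.
bool containsKnownSchemeUri(const std::string &text)
{
    static const std::regex re(
        "(https?|ftp)://[A-Za-z0-9_\\-\\+~.:?&@=/%#]+");
    return std::regex_search(text, re);
}
```

With such a pattern, `abc https://abc` and `abc ftp://abc` would be accepted, while `abc oops://abc` would not.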

At the end a warning:
With my solution the called application has to deal with the URI scheme, because oops:// is matched just like well-known protocols such as http://, ftp:// and so on. The behaviour "let the called application decide what to do with the data" was a much-discussed security issue called the "URI Handling Vulnerability" and was not limited to MS IE and MS Windows (see:
http://www.microsoft.com/technet/security/advisory/943521.mspx
http://tools.cisco.com/security/center/viewAlert.x?alertId=13688 and
http://www.gentoo.org/security/en/glsa/glsa-200405-19.xml).
But if you look into the discussions spread around the internet, everybody wants to blame the usual suspects. IMHO this problem is still unsolved because only Microsoft's Windows SHELL32.dll and some browsers are fixed. The current URL handling in Notepad++ could also be affected by this issue. Long story short: My solution does not introduce a new security issue but implements an extended detection of URIs, which was requested by some Notepad++ users.

Best regards,
Andreas Kirsch.

Discussion

  • kirsche40
    2010-01-27

    • assigned_to: nobody --> donho
     
  • THANK YOU, Andreas, for making a move to solve this defect which has been around forever!

     
  • kirsche40
    2010-01-28

    I checked my solution again and found an error which also exists in the current implementation. If a comma is part of the authority or path, the URI is not completely underlined. For example:
    http://www.spiegel.de/wirtschaft/unternehmen/0,1518,672480,00.html
    will stop at
    http://www.spiegel.de/wirtschaft/unternehmen/0
    So the source change has to be

    Notepad_plus.cpp;5.6.6
    35c35
    < const char *urlHttpRegExpr = "http://[a-z0-9_\\-\\+.:?&@=/%#]*";
    ---
    > const char *urlHttpRegExpr = "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,]+";

    Best regards,
    Andreas Kirsch.
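
The effect of the comma fix can be sketched as follows. C++ `std::regex` is used as a stand-in for the Scintilla engine (an assumption), and `underlinedPart`, `withoutComma` and `withComma` are hypothetical names.

```cpp
#include <regex>
#include <string>

// The first proposal and the comma fix, as C++ string literals.
const char *withoutComma = "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#]+";
const char *withComma    = "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,]+";

// Hypothetical helper: the part of `uri` that would be underlined, i.e. the
// match that begins at the very first character of the string.
// ASSUMPTION: std::regex is only a stand-in for the engine in Notepad++.
std::string underlinedPart(const char *pattern, const std::string &uri)
{
    const std::regex re(pattern);
    std::smatch m;
    // match_continuous forces the match to start at the beginning of `uri`.
    if (std::regex_search(uri, m, re,
                          std::regex_constants::match_continuous))
        return m.str(0);
    return std::string();
}
```

For the spiegel.de URL above, `withoutComma` should stop after `/unternehmen/0`, and `withComma` should cover the whole URL.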

     
  • kirsche40
    2010-01-28

    Another change. There are a lot of crazy people in the wild. I found some URI paths with unescaped characters: semicolons, exclamation marks, curly brackets, square brackets, parentheses and, last but not least, pipe symbols and backslashes. o_O The RFC given in my first post describes in detail, in chapter "3. Syntax Components", what a URI can consist of. The RegEx URI syntax used inside Notepad++
    <schema>://<authority>/<path>
    will match only a subset of possible URIs.

    --> We have to discuss if all possible URIs should be marked as link. <--

    For the URI path, the following RegEx would match the symbols mentioned at the start of my comment (add the escape character '\' before each character where mandatory):
    [A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+
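
As a C++ source string, the pattern needs every backslash doubled; a raw string literal avoids that. The sketch below uses `std::regex` as a stand-in engine (an assumption), and the sample URI as well as the names `extendedUrlRegExpr` and `matchedText` are made up for illustration.

```cpp
#include <regex>
#include <string>

// The extended pattern from above, written as a raw string literal so the
// backslashes do not have to be doubled.
const char *extendedUrlRegExpr =
    R"re([A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+)re";

// Hypothetical sample URI containing the unescaped symbols listed above.
const char *sampleUri = R"u(http://example.org/a;b!c{d}(e)[f]|g\h*i)u";

// Hypothetical helper: first substring of `text` matched by `pattern`,
// or an empty string if nothing matches.
// ASSUMPTION: std::regex is only a stand-in for the engine in Notepad++.
std::string matchedText(const char *pattern, const std::string &text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Under these assumptions the extended pattern should cover the whole sample URI, whereas the previous pattern (with the comma added) would stop at the first semicolon.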

    Best regards,
    Andreas Kirsch.

     
  • Don HO
    2010-02-20

    • priority: 5 --> 6
     
  • Don HO
    2010-02-26

    • status: open --> closed