This is a mail that I started to write around the 15th December, but never sent to the list.

I’ve checked in the code changes described below.

The basic summary of which is that although we’ve tried several different methods for parsing URLs out of text strings, the original method in 0.18.2 seems to be the most reliable (and fastest).

 

Paul

 

 

 

Since version 0.18.2, we’ve varied how we parse URLs, to the extent that we’ve released versions of Mantis with a number of obvious issues in the URL parsing code.

 

I’ve spent a while comparing the various parsing routines myself, and have gradually concluded that the original regular expression is probably the best and most reliable version.

 

The original version I’m referring to is:

 

      $p_string = preg_replace( '/(([[:alpha:]][-+.[:alnum:]]*):\/\/(%[[:digit:]A-Fa-f]{2}|[-_.!~*\';\/?:@&=+$#\(\),\[\][:alnum:]])+)/s',
                                '<a href="\1">\1</a> [<a href="\1" target="blank">^</a>]',
                                $p_string );
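
For reference, on a sample string of my own this produces output along these lines:

      # Hedged example of the 0.18.2 expression above; the input
      # string and URL are my own sample, not from the tracker.
      $p_string = 'See http://localhost/view.php?id=1 for details.';
      $p_string = preg_replace( '/(([[:alpha:]][-+.[:alnum:]]*):\/\/(%[[:digit:]A-Fa-f]{2}|[-_.!~*\';\/?:@&=+$#\(\),\[\][:alnum:]])+)/s',
                                '<a href="\1">\1</a> [<a href="\1" target="blank">^</a>]',
                                $p_string );
      # $p_string is now:
      #   See <a href="http://localhost/view.php?id=1">http://localhost/view.php?id=1</a>
      #   [<a href="http://localhost/view.php?id=1" target="blank">^</a>] for details.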

 

 

That then moved to a mantislink class for parsing URLs. This had issues, e.g. with parsing a body of text containing multiple URLs from the same domain, e.g. http://localhost http://localhost/foo/bar.php http://localhost/moo.php. The problem here was that you’d often end up with http://localhost being parsed first, so the 2nd or 3rd URL would become <a href="http://localhost">http://localhost</a>/moo.php.
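
For illustration, here is a contrived sketch of that failure mode; this is not the actual mantislink code, just the shape of the bug:

      # Contrived sketch: replacing each matched URL in turn lets an
      # earlier, shorter match mangle a later, longer one.
      $t_text = 'http://localhost http://localhost/moo.php';
      $t_urls = array( 'http://localhost', 'http://localhost/moo.php' );
      foreach ( $t_urls as $t_url ) {
            $t_text = str_replace( $t_url, '<a href="'. $t_url .'">'. $t_url .'</a>', $t_text );
      }
      # The first pass wraps *both* occurrences of http://localhost in
      # anchors, so the second, longer URL is never found intact and
      # ends up as:
      #   <a href="http://localhost">http://localhost</a>/moo.php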

 

After which, I committed a different URL parser, which went out in 0.19.2 and has an (obvious) bug in it: it strips off query parameters.

 

Bpfennigschmidt committed a new URL parser 5 days ago, which fixes the issue in 0.19.2 but is still not as good as the original 0.18.2 code.

 

About 2 weeks ago, I started taking a serious look at the URL parsing code, trying to determine the ‘best’ URL parser to use, and have pretty much concluded that I would like to go back to the 0.18.2 code with a couple of modifications:

1. Add the use of rtrim() to trim trailing full stops (and commas?) from a URL; a short rtrim() sketch follows this list. Note: this requires magic_quotes_sybase to be set to false. I discovered that issue when I hit the problem of the latest URL parsing code failing with a fatal parser error. If we don’t go back to 0.18.2, we should add a check to string_insert_hrefs for the state of magic_quotes_sybase.

2. The addition of ^ \ % { | and } as characters that can appear in a URL query string.
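
As a quick illustration of point 1 (the URLs below are just examples of mine):

      # rtrim() with a character list (PHP 4.1.0+) strips trailing
      # punctuation so a URL at the end of a sentence stays usable:
      echo rtrim( 'http://example.com/foo.', '.' );   # http://example.com/foo
      echo rtrim( 'http://example.com/bar,.', '.,' ); # http://example.com/bar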

 

This basically gives the following patch to be applied to string_api.php (as attached). Note: rtrim()’s character-list parameter is PHP 4.1.0+, so we need to work on compatibility before applying; one possible fallback is sketched below.
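
Something along these lines might do for the compatibility work; this is only a sketch, and the helper name is mine rather than existing Mantis API:

      # Hypothetical fallback that avoids rtrim()'s character-list
      # parameter (PHP 4.1.0+) by stripping trailing full stops manually.
      function url_strip_trailing_dots( $p_url ) {
            while ( strlen( $p_url ) > 0 && substr( $p_url, -1 ) == '.' ) {
                  $p_url = substr( $p_url, 0, strlen( $p_url ) - 1 );
            }
            return $p_url;
      }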

 

diff -u -r1.63 string_api.php
--- string_api.php      15 Dec 2004 23:07:34 -0000    1.63
+++ string_api.php      20 Dec 2004 17:05:44 -0000
@@ -284,38 +284,27 @@
      # Tag Processing
      #===================================
 
-     function parseurl($proto='', $url='', $xtra='')
-     {
-           # find wired characters at the end of the string
-           preg_match("/(.+)(|[^\w]+|&quot;|&amp;|&#039;|&lt;|&gt;)$/Uis", $url, $match);
-           # Add protocol if needed
-           $url = ($proto == '') ? 'http://'. $match[1] : $match[1];
-           # add wired characters at the end
-           $rest = ( isset($match[2]) ) ? $match[2] : '';
-           # return
-           return($xtra .'<a href="'. $url .'" target="_blank">'. $match[1] .'</a>'. $rest);
-     }
-
      # --------------------
      # Detect URLs and email addresses in the string and replace them with href anchors
      function string_insert_hrefs( $p_string ) {
            if ( !config_get( 'html_make_links' ) ) {
                  return $p_string;
            }
-           # Find any URL in a string and replace it by a clickable link
-
-           $expression = array(
-                 "/([^]\w\-=\"\'\/])?((https?|ftps?|edk|gopher|news|telnet|ssh):\/\/[^\s\W]|www\.[^\s\W])([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/esi",
-                 "/^((https?|ftps?|edk|gopher|news|telnet|ssh):\/\/[^\s\W]|www\.[^\s\W])([^ \r\n\(\)\^\$!`\"'\|\[\]\{\}<>]*)/esi"
-           );
-
-           $replacement = array(
-                 "parseurl('\\3','\\2\\4','\\1')",
-                 "parseurl('\\2','\\1\\3')"
-           );
+
+           $t_change_quotes = false;
+           if( ini_get_bool( 'magic_quotes_sybase' ) ) {
+                 $t_change_quotes = true;
+                 ini_set( 'magic_quotes_sybase', false );
+           }
 
-           $p_string = preg_replace($expression, $replacement, $p_string);
-
+           # Find any URL in a string and replace it by a clickable link
+           $p_string = preg_replace( '/(([[:alpha:]][-+.[:alnum:]]*):\/\/(%[[:digit:]A-Fa-f]{2}|[-_.!~*\';\/?%^\\\\:@&={\|}+$#\(\),\[\][:alnum:]])+)/se',
+                                     "'<a href=\"'.rtrim('\\1','.').'\">\\1</a> [<a href=\"'.rtrim('\\1','.').'\" target=\"blank\">^</a>]'",
+                                     $p_string);
+           if( $t_change_quotes ) {
+                 ini_set( 'magic_quotes_sybase', true );
+           }
+
            # Set up a simple subset of RFC 822 email address parsing
            #  We don't allow domain literals or quoted strings
            #  We also don't allow the & character in domains even though the RFC

 

 

 

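As a hedged sanity check of the modified expression, here is what it does with a URL containing a %-escape and a trailing full stop; the sample input is my own:

      # Quick check of the patched expression; note that the trailing
      # full stop is trimmed from the href but, with this patch, still
      # appears in the visible link text (the raw \1 match).
      $p_string = 'see http://localhost/view.php?id=1%7Cfoo.';
      $p_string = preg_replace( '/(([[:alpha:]][-+.[:alnum:]]*):\/\/(%[[:digit:]A-Fa-f]{2}|[-_.!~*\';\/?%^\\\\:@&={\|}+$#\(\),\[\][:alnum:]])+)/se',
                                "'<a href=\"'.rtrim('\\1','.').'\">\\1</a> [<a href=\"'.rtrim('\\1','.').'\" target=\"blank\">^</a>]'",
                                $p_string );
      # href is http://localhost/view.php?id=1%7Cfoo (full stop trimmed)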

 

 

 

A comparison of the above code to the current CVS code, run over a large list of URLs (taken from my proxy server logs), gives the following match statistics:

 

 

 

Time Taken: 2,494.2329
Correct Matches: 22,044,776
Failed Matches: 99,382

 

 

 

The current CVS code, run on the same list of URLs, produces the following output:

 

 

 

Time Taken: 3,493.4399
Correct Matches: 20,541,333
Failed Matches: 1,602,825
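
For reference, the numbers above came from a throwaway harness roughly like the sketch below. This is a reconstruction from memory; the urls.txt file name, the timing units and the pass/fail criterion (the whole URL ending up inside the href) are all my own, and it assumes string_api.php has been included.

      # Hypothetical reconstruction of the comparison harness.
      # PHP 4-era microtime() returns "usec sec" as a string.
      function microtime_float() {
            list( $t_usec, $t_sec ) = explode( ' ', microtime() );
            return (float)$t_usec + (float)$t_sec;
      }

      $t_correct = 0;
      $t_failed  = 0;
      $t_start   = microtime_float();
      foreach ( file( 'urls.txt' ) as $t_url ) {    # one URL per line
            $t_url = trim( $t_url );
            # count a URL as correct if the whole of it ends up in the href
            if ( strpos( string_insert_hrefs( $t_url ), '<a href="' . $t_url . '"' ) !== false ) {
                  $t_correct++;
            } else {
                  $t_failed++;
            }
      }
      echo 'Time Taken: ', microtime_float() - $t_start, "\n";
      echo 'Correct Matches: ', $t_correct, "\n";
      echo 'Failed Matches: ', $t_failed, "\n";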