Re: [Tcllib-devel] Asking for comments on a ncgi patch (regexp -> string last/...)

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Jeff is entirely right that the performance of regexp is somewhat
unpredictable.  One of these years, I'm going to say, "very well, to
perdition with my sanity!" and plunge into Henry's code.

For CGI parsing, I believe it's the first = that's supposed to govern.
The field name can easily be restricted to have no = in it, since
it's in the form presented to the client.  The field value, on the other
hand, is not controllable.

Paradoxically, the regexp to recognize the leftmost = is much faster:

% set x f[string repeat o 100]=ba[string repeat r 99]
% time {regexp {^([^=]*)=(.*)$} $x -> name value} 10000
27.8222 microseconds per iteration
% time {regexp {^(.*)=([^=]*)} $x -> name value} 10000
171.1232 microseconds per iteration

The time in the slower regexp appears to be spent, by the way, in dealing 
with the consequences of the first pair of capturing brackets:

% time {regexp {(?:.*)=(.*)} $x -> value} 10000
20.6424 microseconds per iteration
% time {regexp {(.*)=(?:.*)} $x -> name} 10000
162.0929 microseconds per iteration

My guess is that this common usage, in Henry's algorithm, leads to a
state space explosion.  Much of the art of designing recognizers is
the art of controlling such things in the common cases; regexp matching
in general is PSPACE-complete, and our extended regexps are even worse.
(If I recall correctly, the time for recognizing arbitrary regexps when
counted repetition is allowed is not bounded by any tower of exponentials
2**(2**(2**...(2**n)...)))

Kevin

--
73 de ke9tv/2, Kevin KENNY   GE Corporate Research & Development
ke...@cr...           P. O. Box 8, Bldg. K-1, Rm. 5B36A
                             Schenectady, New York 12301-0008 USA