From: <ke...@cr...> - 2006-08-09 14:45:29
|
Jeff is entirely right that the performance of regexp is somewhat unpredictable. One of these years, I'm going to say, "very well, to perdition with my sanity!" and plunge into Henry's code. For CGI parsing, I believe it's the first = that's supposed to govern. The field name can easily be restricted to have no = in it, since it's in the form presented to the client. The field value, on the other hand, is not controllable. Paradoxically, the regexp to recognize the leftmost = is much faster: % set x f[string repeat o 100]=ba[string repeat r 99] % time {regexp {^([^=]*)=(.*)$} $x -> name value} 10000 27.8222 microseconds per iteration % time {regexp {^(.*)=([^=]*)} $x -> name value} 10000 171.1232 microseconds per iteration The time in the slower regexp appears to be spent, by the way, in dealing with the consequences of the first pair of capturing brackets: % time {regexp {(?:.*)=(.*)} $x -> value} 10000 20.6424 microseconds per iteration % time {regexp {(.*)=(?:.*)} $x -> name} 10000 162.0929 microseconds per iteration My guess is that this common usage, in Henry's algorithm, leads to a state space explosion. Much of the art of designing recognizers is the art of controlling such things in the common cases; regexp matching in general is PSPACE-complete, and our extended regexps are even worse. (If I recall correctly, the time for recognizing arbitrary regexps when counted repetition is allowed is not bounded by any tower of exponentials 2**(2**(2**...(2**n)...))) Kevin -- 73 de ke9tv/2, Kevin KENNY GE Corporate Research & Development ke...@cr... P. O. Box 8, Bldg. K-1, Rm. 5B36A Schenectady, New York 12301-0008 USA |