Re: [Sqlgrey-users] A smarter smart

Michel Bouissou wrote the following on 15.02.2005 16:48 :

>On Tuesday 15 February 2005 16:31, Lionel Bouton wrote:
>
>>Thanks, I'm worried about the size of the regexp though. There are two
>>things on my mind :
>>- is it maintainable ?
>
>I don't think it will need much maintenance. It's based on a (yet more
>complex ;-)
>

Even more !

> regexp I have built over years, and that very seldom needs
>changes -- and the changes are improvements that are not strictly speaking
>necessary nor urgent.
>
>Maintaining such a regexp is not that complex if you are careful ;-)
>especially about line breaks if you split it into several lines (it seems
>that an escaped line break should NOT be put after a ) or } or ? or the
>regexp won't work). I limit myself to splitting after "regular characters" and
>before a "|".
>

I see.

>
>>- how much processing time is needed for these regexp ?
>
>Given that we just process a short hostname and not a long file, and given
>that Perl will compile the regexp only once except for the one that contains
>part of the IP as a variable, I believe the processing time should be
>negligible (compared to the database accesses etc.)
>

Regexps can be either really quick or really slow. I don't yet have enough
experience with Perl regexps to tell from a quick look at one whether Perl
would handle hundreds of thousands of matches per second or just hundreds
per second.
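
To make the "compiled only once" point concrete, here is a minimal sketch. The pattern and the looks_dynamic name are hypothetical stand-ins, not the regexp discussed in this thread. A fixed pattern built with qr// is compiled once and reused, while a pattern that embeds part of the client IP has to be rebuilt for each new address.

```perl
use strict;
use warnings;

# Hypothetical pattern, NOT the regexp discussed in this thread:
# matches hostnames that look like dynamic/end-user connections.
my $dyn_pattern = qr/(?:^|[.-])(?:dsl|dial|dyn|ppp|pool)[0-9.-]/i;

# Returns 1 if the hostname looks dynamic, or if it embeds the last
# two bytes of the client IP; returns 0 otherwise.
sub looks_dynamic {
    my ($fqdn, $ip) = @_;

    # Precompiled with qr//: not recompiled on each call.
    return 1 if $fqdn =~ $dyn_pattern;

    # This pattern depends on the IP, so it is rebuilt whenever the
    # address changes. \Q...\E quotes any metacharacters.
    my (undef, undef, $c, $d) = split /\./, $ip;
    return 0 unless defined $d;
    return 1 if $fqdn =~ /(?<!\d)\Q$c\E\D+\Q$d\E(?!\d)/;

    return 0;
}
```

Whether the per-IP pattern costs anything noticeable is exactly the kind of thing that would need measuring rather than guessing.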

>
>>I'd like to add this as a separate algorithm and put the regexp in
>>external files that can be reloaded
>
>I would hardcode this. I expect very little changes to this, if any. Loading
>the regexps from external files would make this still more complex and
>subject to errors...
>

I'd prefer to have

if ($fqdn =~ $known_server_pattern) ...

and so on, rather than the full regexp in the code! An accidental keypress
in the middle of the regexp could have unforeseen consequences and would be
hard to spot without a cvs diff, but a keypress in the middle of a variable
name is an instant blocker with an obvious error message leading to a
painless resolution.

Editing the regexp file would be less error-prone in my opinion.

Loading regexps from file isn't really so complex.
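
As a rough sketch of what I mean (the load_pattern name and the file format are assumptions for illustration, not existing SQLgrey code): one alternative per line, blank lines and '#' comments ignored, each line checked on its own so a broken one is reported with its line number.

```perl
use strict;
use warnings;

# Load a regexp from a file and compile it once.
# Assumed file format: one alternative per line; blank lines and
# lines starting with '#' are skipped.
sub load_pattern {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @alternatives;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*#/ or $line !~ /\S/;
        # Compile each line separately so that a syntax error is
        # reported with the file name and line number.
        eval { my $re = qr/$line/; 1 }
            or die "Bad regexp at $path line $.: $@";
        push @alternatives, "(?:$line)";
    }
    close $fh;
    die "No regexp found in $path" unless @alternatives;
    my $joined = join '|', @alternatives;
    return qr/$joined/i;
}

# Intended use (the path is hypothetical):
#   my $known_server_pattern = load_pattern('/path/to/known_servers.regexp');
#   if ($fqdn =~ $known_server_pattern) { ... }
```

Reloading would then just mean calling load_pattern again and replacing the variable, keeping the old pattern if the call dies.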

> [...]
>
>>I'll probably start the 1.5.x branch for this new algorithm.
>
>Meanwhile, you can test it on your own system, I don't think you'll notice any
>performance impact, but it will probably be more accurate than the basic IP
>address test (see my last post with some examples...)
>

I won't notice any perf difference. Installations handling more than a
million mails per day worry me though.

I'll bench the code to see how many lines per second these regexps can
handle on my systems; hard numbers are usually more convincing to me
for things as complex as regexps.
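
Something along these lines would do for a first measurement. It uses the core Benchmark module; the pattern and the hostnames are made-up placeholders, to be replaced by the real regexp and a sample of real reverse DNS names.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Hypothetical stand-in for the real regexp, and made-up hostnames.
my $pattern = qr/(?:^|[.-])(?:dsl|dial|dyn|ppp|pool)[0-9.-]/i;
my @hosts = qw(
    mail.example.org
    ppp-12-34.dialup.example.net
    smtp-out3.example.com
);

# Run every hostname against the pattern and count the matches.
sub match_all {
    my $hits = 0;
    for my $host (@hosts) {
        $hits++ if $host =~ $pattern;
    }
    return $hits;
}

# Time a fixed number of rounds and print Benchmark's summary.
my $rounds = 100_000;
my $t = timeit($rounds, \&match_all);
printf "%d match attempts: %s\n", $rounds * @hosts, timestr($t);
```

The numbers only mean something with a realistic mix of hostnames, since a regexp can behave very differently on strings that almost match.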

Lionel.