Re: [Sqlgrey-users] Re: sqlgrey easier data mining

Michel Bouissou wrote the following on 02/07/2005 02:31 PM :

>On Sunday 06 February 2005 16:13, Lionel Bouton wrote:
>
>>I'm not inclined to add stuff just because it isn't a big deal, especially
>>in the database schema which is the kind of thing I learned to change with
>>caution.
>>
>
>Sure, but the database schema will have to change anyway (to include
>first_seen and rename an IP address field). So it would be a good moment to
>add one more field that costs little. Changing important fields in a database
>schema must be done with caution, I agree, but adding a purely informative
>field (that won't be used as a key or calculation base or whatever) has no
>consequences...
>

This makes sense to me. But there are so many purely informative fields.
For example it just occurred to me that you *may* want to have a
"previously_seen" field in order to do queries like this:
SELECT sender_domain, host_ip, last_seen - previously_seen FROM
domain_awl ORDER BY (last_seen - previously_seen) LIMIT 50;
I'd even argue that this will be more useful than a counter field... but
still less useful than a log parsing tool.
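
As a rough illustration of the query above, here is a minimal Python sketch
run against an in-memory SQLite database. The "previously_seen" column is
hypothetical (it is only being discussed here), the table is reduced to the
columns the query names, and timestamps are stored as plain epoch seconds,
which is an assumption made so that the subtraction behaves the same on any
backend.

```python
import sqlite3


def shortest_gaps(conn, limit=50):
    """Return (sender_domain, host_ip, gap) rows, smallest gap first."""
    return conn.execute(
        "SELECT sender_domain, host_ip, last_seen - previously_seen AS gap "
        "FROM domain_awl ORDER BY gap LIMIT ?",
        (limit,),
    ).fetchall()


# Toy schema: only the columns used by the query; previously_seen is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domain_awl ("
    " sender_domain TEXT, host_ip TEXT,"
    " previously_seen INTEGER, last_seen INTEGER)"
)
# Made-up sample rows (documentation domains and addresses).
conn.executemany(
    "INSERT INTO domain_awl VALUES (?, ?, ?, ?)",
    [
        ("example.org", "192.0.2.10", 1000, 1060),
        ("example.net", "192.0.2.20", 1000, 90000),
        ("example.com", "192.0.2.30", 5000, 5600),
    ],
)

rows = shortest_gaps(conn)
for row in rows:
    print(row)
```

Entries with the smallest gap between their last two sightings come first,
which is the kind of ranking the query is meant to give.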

>>Look at the TODO, there are already several things with a clear need...
>>
>
>Yes. About the todo, a couple of remarks :
>
>1/ I object to integrating SPF in any way in SQLgrey. SPF and greylisting
>are completely different systems, with different goals and approaches. SPF is
>implemented in separate patches (I use a Postfix patch) or policy servers. I
>don't see the interest of integrating a goat and a cow together ;-) and using
>SPF to determine whether or not greylisting should be applied would surely
>be an easy way for spammers to defeat greylisting...
>

It may be, this entry is only a reminder for me. I know for sure that
blindly trusting SPF is a no-no, the "experiment" only means that I'm
wondering if SQLgrey rejecting SPF-invalid senders instead of
greylisting them may be useful (the question is merely to find out if
there's a point combining both pieces of information outside Postfix in
the policy server or not) or if relying on a separate policy server is the
way to go (and document this in the HOWTO). Don't pay too much attention
to this TODO entry.

>2/ I still would love to get sender and recipient based whitelisting in
>SQLgrey. Using Postfix tables for this purpose is not a satisfactory
>solution, for one can have a whole series of tests in Postfix, and different
>exceptions for each kind of test. One may want to skip greylisting for some
>sender (e.g. somebody@somedomain), but for example still want to perform SPF
>tests on somedomain. Using a Postfix table with "somebody@somedomain => OK"
>would cause *all* subsequent tests to be skipped for this message, not only
>greylisting. And it makes it a headache in ordering tests if using different
>Postfix tables for this...
>

I don't find test ordering in Postfix the most intuitive thing either :-)

>It would sound logical and easier to me that each "policy server" carries its
>own independent whitelisting for conditions under which this given test
>should be performed or not...
>

For recipients I'm more than OK with it (this is the opt-in and opt-out
TODO entry).
For senders, as I already said, I see it as a big hole in the
greylisting process.

>
>>In my opinion a separate log parsing tool would bring far more useful
>>stats.
>>
>
>Sure, a log parsing tool is most useful, and probably most mail admins have
>something like this. But a counter gives *different* information that can be
>seen in the database at a glance, i.e. "is this sender a usual, frequent
>correspondent, or did he send only once?" (as some spammers or viruses do,
>and yes, sometimes, they can pass through greylisting)...
>
>The counter would allow, for example, to easily extract the ratio of senders
>that have been seen only once compared to the ratio of "repeating" senders
>present in the database. For analyzing the database, this is useful (and easy
>to get), and a log parsing tool won't give this information.
>
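
To make the ratio described above concrete, here is a small Python sketch
against an in-memory SQLite database. The "hit_count" column name is an
assumption (no name for the counter has been settled in this thread), and
the table is reduced to the two columns needed for the illustration.

```python
import sqlite3


def single_hit_ratio(conn):
    """Fraction of domain_awl entries whose hypothetical hit_count is 1."""
    total, once = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(CASE WHEN hit_count = 1 THEN 1 ELSE 0 END), 0) "
        "FROM domain_awl"
    ).fetchone()
    # An empty table has no senders at all, so report 0.0 instead of dividing by zero.
    return once / total if total else 0.0


# Toy schema: hit_count is a made-up name for the proposed counter field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain_awl (sender_domain TEXT, hit_count INTEGER)")
# Made-up sample rows: two one-off senders and two repeating ones.
conn.executemany(
    "INSERT INTO domain_awl VALUES (?, ?)",
    [
        ("example.org", 1),
        ("example.net", 1),
        ("example.com", 7),
        ("example.edu", 12),
    ],
)

ratio = single_hit_ratio(conn)
print(f"seen only once: {ratio:.0%} of entries")
```

The repeating senders are simply the remainder, so one number is enough to
compare the two groups.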

Now, that's more an argument I can understand for storing this
information. But won't someone prefer a "previously_seen" (which by the
way is slightly more complex to implement)? If the entry can't be found
more than once in the logs covering the awl ttl period, you'll have
nearly the same information...

Lionel.