Re: gtube.txt (aka sample-spam.txt) with headers isn't recognized


> I have setup an postfix + amavis + spamassassin + pyzor all from the
> debian lenny packages [1]. Everything works fine. But I discovered that
> the sample spam [2] send through the whole system is not scored with
> pyzor (but is with razor). Pyzor returns an exit code of 1.

The SA debugging output you included shows that Pyzor was checked:

> /usr/sbin/amavisd-new[21828]: (21828-01) SA dbg: pyzor: got response:
> public.pyzor.org:24441 (200, 'OK') 0 0

So the check did run, and the response was 0 hits and a whitelist
count of 0.

> Indeed :
>   # pyzor check < /tmp/gtube.txt
>     public.pyzor.org:24441  (200, 'OK')     151     0
>   # echo $?
>     0
> and
>   # pyzor check /tmp/.spamassassin21828A1Yhoatmp
>     public.pyzor.org:24441  (200, 'OK')     0       0
>   # echo $?
>     1
>
> So what's the differences ?

Your diff shows that it's not additional headers being added - they are
replacement headers.  The key is the last line of the diff, which is blank.  The
/tmp/.spamassassin21828A1Yhoatmp message has only the headers that are
shown in the diff, not those in the original GTUBE sample message.
Then the body of the /tmp/.spamassassin21828A1Yhoatmp message is all
of the gtube.txt file (i.e. the headers in that file are part of the
body as well).  That means that the messages are substantially
different, so there are different pyzor digests, and therefore
different responses.
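
To make that concrete, here is a rough Python sketch.  The header names
and addresses are made up for illustration, and the SHA-1 of the body is
only a stand-in for a content digest - it is NOT the algorithm Pyzor
actually uses.  It just shows that once the original file sits below a
new header block and a blank line, the parser treats the old headers as
body text, so any body-based digest changes.

```python
import hashlib
from email import message_from_string


def body_fingerprint(raw):
    """Stand-in digest: SHA-1 of the parsed body (not Pyzor's real digest)."""
    body = message_from_string(raw).get_payload()
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


# A small message standing in for gtube.txt (headers, blank line, body).
original = (
    "Subject: Test spam mail (GTUBE)\n"
    "From: sender@example.net\n"
    "\n"
    "This is the sample body.\n"
)

# Hypothetical replacement headers, a blank line, then the WHOLE original file.
wrapped = (
    "Return-Path: <test@example.org>\n"
    "X-Envelope-To: <user@example.org>\n"
    "\n"
) + original

# The old headers are now part of the body of the wrapped message.
old_headers_in_body = message_from_string(wrapped).get_payload() == original
same_fingerprint = body_fingerprint(original) == body_fingerprint(wrapped)

print("old headers ended up in the body:", old_headers_in_body)
print("fingerprints match:", same_fingerprint)
```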

The exit codes simply reflect the results - 0 means "found hits and no
whitelist count" and 1 means "found no hits, or a positive whitelist
count".
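
Put as a small Python sketch (the function name is mine, purely for
illustration - it is not part of the pyzor code), the rule is:

```python
def check_exit_code(hit_count, whitelist_count):
    """Exit code as described above: 0 only for hits with no whitelist count."""
    if hit_count > 0 and whitelist_count == 0:
        return 0
    return 1


# The two responses from your output: (151, 0) and (0, 0).
print(check_exit_code(151, 0))
print(check_exit_code(0, 0))
```

That gives 0 for the original gtube.txt and 1 for the temporary file,
which matches what you saw.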

If you're asking the larger question about whether GTUBE should always
trigger a pyzor hit, I'm not certain, but I lean towards "no".  Pyzor
is about creating unique hashes for essentially identical messages,
and checking how often those have been seen by others.  My feeling is
that GTUBE checking is therefore not appropriate here (because it's
part of a larger message).  GTUBE isn't meant to be detected by every
anti-spam solution (e.g. DNSBL systems generally provide a 127.0.0.2
checking address for the same purpose), and it is simple to add GTUBE
checking to any system that also uses Pyzor (the dominant system being
SA, which, of course, already does a GTUBE check).
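
As a rough idea of how little is needed, a GTUBE check is a plain
substring test, something like the following sketch.  The function name
is hypothetical, and the test string is the one published on the
SpamAssassin GTUBE page as far as I recall it, so please verify it
against that page before relying on it.

```python
# GTUBE test string (68 characters); verify against the SpamAssassin GTUBE page.
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"


def contains_gtube(raw_message):
    """Return True if the raw message text contains the GTUBE test string."""
    return GTUBE in raw_message


sample = "Subject: test\n\nSome text.\n" + GTUBE + "\nMore text.\n"
print(contains_gtube(sample))
print(contains_gtube("Subject: hello\n\nAn ordinary message.\n"))
```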

The purpose of GTUBE is to check that the filter is working correctly.
It does seem reasonable to provide a similar function in pyzor - but
I think this would be best done by providing (on the Pyzor wiki) a
couple of complete emails that can be checked - one that is known to
have a high hit count, one that is known to have a high whitelist
count (and ensure that these results stay constant).  If anyone would
find that useful, then please open a ticket on the issue tracker, and
I'll happily add such functionality (but if no-one needs it, then it's
not really worth doing).

We could use the sample message provided by SA
(http://spamassassin.apache.org/gtube/gtube.txt) as the 'high hit
count' example - that would somewhat address both concerns (and in
fact it is the case now, since there have been 151 reports of the
sample message).  Any other message containing the GTUBE string
wouldn't (necessarily) have a high hit count, but the example message
would.  OTOH, maybe that would just be confusing, as here.

Cheers,
Tony