DSPAM / Feature Requests / #55 Create an unique but determined signature

Stevan Bajic - 2010-02-19

Hello Enrico,

What issue do you expect to solve with one unique signature per message? The current database schema cannot attach multiple UIDs to one signature.

--
Kind Regards from Switzerland,

Stevan Bajić


Enrico Scholz - 2010-02-19

I want users to be able to force dspam to relearn a message. Relearning requires knowing the signature, but because the signature is different for every recipient, it cannot be added to the e-mail headers. Hence, there is no way for users to relearn a message.


Stevan Bajic - 2010-02-19

Why can a DSPAM signature (as it is today) not be added into the headers? In DSPAM, each user normally has his own storage, and training/retraining with a signature switches tokens for that user.

If you want the signature to stay persistent per message, then just use one DSPAM user to classify/process the message and then deliver from your Milter to each user, adding the same DSPAM signature to the header. Then set your training alias to be executed under the DSPAM user you used when classifying/processing the message. Or use something like shared groups in DSPAM.

If I understand you right then your goal is to have just one signature per mail and you don't care if inside the DSPAM database the data is saved multiple times (for each user once) as long as the signature stays the same. Right? Adding something like that could be possible, but features such as the UID in the signature would then not work.
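
To make the idea of "one signature per mail, stored once per user" concrete, here is a minimal sketch. It is not how DSPAM generates or stores signatures: the hash-based `message_signature`, the in-memory `signature_store`, and the numeric user IDs are all hypothetical names chosen for illustration.

```python
import hashlib

def message_signature(raw_message: bytes) -> str:
    """Hypothetical deterministic signature: a hash of the raw message bytes."""
    return hashlib.sha256(raw_message).hexdigest()[:32]

# Hypothetical store keyed by (uid, signature). The same signature may appear
# once per user, so the data is saved multiple times but the signature that
# ends up in the header is identical for every recipient.
signature_store = {}

def remember(uid: int, raw_message: bytes) -> str:
    sig = message_signature(raw_message)
    signature_store[(uid, sig)] = raw_message
    return sig

msg = b"Subject: hello\r\n\r\nSome message\r\n"
sig_foo = remember(1001, msg)  # user foo (assumed uid)
sig_bar = remember(1002, msg)  # user bar (assumed uid)
```

Because the UID is part of the lookup key rather than the signature itself, a retraining request would have to supply the user separately, which matches the caveat that a UID embedded in the signature would no longer work.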


Enrico Scholz - 2010-02-19

> Why can a DSPAM signature (as it is today) not be added into the headers?

All local recipients which were given as RCPT: will get exactly the
same e-mail (header + body). Per-user signatures violate this.

> then deliver from your Milter to each user

Milters do not deliver mail; they process mail while the MTA receives
it (e.g. 'dspam' within the milter classifies mail before the MTA gives
the final response to DATA ('220 OK' or reject due to spamminess)).

> set your training alias to be executed under the DSPAM user you
> used when classifying/processing the message. Or use something like
> shared groups in DSPAM.

afair, one of 'dspam' basic ideas is that spam filtering should be
applied per user.

> If I understand you right then your goal is to have just one signature
> per mail and you don't care if inside the DSPAM database the data is
> saved multiple times (for each user once) as long as the signature
> stays the same. Right?

afaik, 'dspam' stores the set of tokens within an e-mail at a place
which is associated with the signature. This set of tokens depends
only on the e-mail but not the recipients, doesn't it?

E.g. the set of tokens for the e-mail sent as

| MAIL FROM <postmaster@example.com>
| 220 OK
| RCPT TO: <foo@example.com>
| 220 OK
| RCPT TO: <bar@example.com>
| 220 OK
| DATA
| Subject: ...
|
| Some message
| .
| 220 OK

will be the same for 'foo@example.com' and for 'bar@example.com'.

Each element of this set of tokens will be inserted into a user
specific database and spam/innocent counters be incremented.

For retraining, the signature is used to lookup the set of tokens and
the counters in the user database will be reverted/corrected.

Hence, there are two datasets: the tokens which are common for all
recipients and the classification of the tokens which is user specific.
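
The two-dataset model described in this comment can be sketched as follows. This is only an illustration of the poster's mental model, not DSPAM's actual storage layout; the names `tokens_by_signature`, `user_counters`, `learn`, and `relearn`, and the whitespace tokenizer, are assumptions.

```python
from collections import defaultdict

SPAM, INNOCENT = 0, 1

# Dataset 1 (assumed): token set per signature, common to all recipients.
tokens_by_signature = {}
# Dataset 2 (assumed): per-user counters, token -> [spam_hits, innocent_hits].
user_counters = defaultdict(lambda: defaultdict(lambda: [0, 0]))

def learn(user, signature, body, label):
    """Store the token set once per signature and bump the user's counters."""
    tokens = tokens_by_signature.setdefault(signature, sorted(set(body.split())))
    for token in tokens:
        user_counters[user][token][label] += 1

def relearn(user, signature, old_label, new_label):
    """Look up the tokens by signature and correct only this user's counters."""
    for token in tokens_by_signature[signature]:
        counts = user_counters[user][token]
        counts[old_label] -= 1
        counts[new_label] += 1

learn("foo", "sig-1", "Some message", INNOCENT)
learn("bar", "sig-1", "Some message", INNOCENT)
relearn("bar", "sig-1", INNOCENT, SPAM)  # only bar's counters change
```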


Stevan Bajic - 2010-02-20

> All local recipients which were given as RCPT: will get exactly the
> same e-mail (header + body). Per-user signatures violate this.
>
That is not always the case. For example, I always have a "Delivered-To" header in all mail that I get, and it is not the same for every recipient of a message.

> Milters do not deliver mail; they process mail while the MTA receives
> it (e.g. 'dspam' within the milter classifies mail before the MTA gives
> the final response to DATA ('220 OK' or reject due to spamminess)).
>
Okay

> afair, one of 'dspam' basic ideas is that spam filtering should be
> applied per user.
>
Not many things in DSPAM are a must. You can filter per user, but you don't have to.

> afaik, 'dspam' stores the set of tokens within an e-mail at a place
> which is associated with the signature.
>
AND a DSPAM user ID.

> This set of tokens depends
> only on the e-mail but not the recipients, doesn't it?
>
What do you mean by that? I don't understand. Can you rephrase this?

> E.g. the set of tokens for the e-mail sent as
> | MAIL FROM <postmaster@example.com>
> | 220 OK
> | RCPT TO: <foo@example.com>
> | 220 OK
> | RCPT TO: <bar@example.com>
> | 220 OK
> | DATA
> | Subject: ...
> |
> | Some message
> | .
> | 220 OK
>
> will be the same for 'foo@example.com' and for 'bar@example.com'.
>
No. It will not be the same. The reason it might differ is the whitelisting feature of DSPAM. Most of the tokens will be the same, but whitelisting can result in a bunch of tokens being different for foo than for bar.

> Each element of this set of tokens will be inserted into a user
> specific database and spam/innocent counters be incremented.
>
Definitely not. Assume the mail is HAM and that you run something other than TEFT. Assume further that the mail is correctly classified as HAM for foo but classified as SPAM for bar. If foo does not retrain the message as SPAM and bar retrains it as HAM, then only the tokens for bar will be modified. For foo, nothing changes in his token set.
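
The scenario above can be sketched with a toy model of two training modes. This is a simplification under stated assumptions, not DSPAM's code: the functions `process` and `retrain` and the counter layout are hypothetical, "TOE" stands in for "something other than TEFT", and only the rule "TEFT records every classification, other modes record only corrections" is modeled.

```python
from collections import defaultdict

def new_counters():
    # token -> {"spam": n, "innocent": n}; entries appear only when touched
    return defaultdict(lambda: {"spam": 0, "innocent": 0})

def process(counters, tokens, result, mode):
    """Classification step: in this toy model only TEFT records the result."""
    if mode == "TEFT":
        for token in tokens:
            counters[token][result] += 1

def retrain(counters, tokens, corrected, mode):
    """Correction step: always records the corrected class."""
    other = "spam" if corrected == "innocent" else "innocent"
    for token in tokens:
        counters[token][corrected] += 1
        if mode == "TEFT":
            counters[token][other] -= 1  # undo the count recorded earlier

tokens = ["Some", "message"]
foo, bar = new_counters(), new_counters()

process(foo, tokens, "innocent", mode="TOE")  # correct for foo, nothing stored
process(bar, tokens, "spam", mode="TOE")      # wrong for bar, nothing stored yet
retrain(bar, tokens, "innocent", mode="TOE")  # bar corrects the error
```

After this run, foo's counters are still empty while bar's tokens each carry one innocent hit, which is the asymmetry the comment describes.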

> For retraining, the signature is used to lookup the set of tokens and
> the counters in the user database will be reverted/corrected.
>
This is not true.

1) One could run DSPAM in pristine mode; then the tokens are not saved in dspam_signature_data (assuming you use an SQL-based backend in DSPAM).

2) Assume you don't run pristine mode; then the degenerated mail can be found in dspam_signature_data. This is not necessarily the whole mail. It could easily be that you have set your database to allow only 4MB of data in dspam_signature_data. If the whole mail was 8MB, then when you retrain, DSPAM reads the degenerated mail from dspam_signature_data (but only the first 4MB), TOKENIZES that data, and then switches/adds those tokens in dspam_token_data.
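
The effect of the size limit in point 2 can be illustrated with a small sketch. It follows the description above rather than DSPAM's source: `MAX_SIGNATURE_BYTES` is a tiny stand-in for a limit such as 4MB, and `store_signature` and the whitespace `tokenize` are hypothetical helpers.

```python
MAX_SIGNATURE_BYTES = 16  # stand-in for a storage limit such as 4MB

def store_signature(message: bytes) -> bytes:
    """Keep only the first MAX_SIGNATURE_BYTES bytes, as a capped column would."""
    return message[:MAX_SIGNATURE_BYTES]

def tokenize(data: bytes):
    """Assumed tokenizer: split on whitespace."""
    return data.decode("ascii", errors="ignore").split()

message = b"cheap pills buy now limited offer"
stored = store_signature(message)

full_tokens = tokenize(message)    # what classification saw
retrain_tokens = tokenize(stored)  # what retraining can recover
```

Here retraining sees only the first three tokens, so the tokens corrected on retrain are a subset of those seen at classification time.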

> Hence, there are two datasets: the tokens which are common for all
> recipients and the classification of the tokens which is user specific.
>
This is not 100% true. You forget pristine mode. And since this is an option you can turn on/off on a per-user basis (if you use the preference extension), you can't say with 100% certainty (from outside DSPAM) that users foo AND bar will have their (common) dataset in dspam_signature_data.

