|
From: Alpana W. <al...@ut...> - 2009-09-04 21:09:45
|
Hello,

I am writing my Master's dissertation on Pyzor and would be grateful for an answer to the following.

What types of emails should be whitelisted? I assume that there is no value in me whitelisting for example a personal email from my sister of which I am the only recipient because no one else will receive that same email? If that is correct, do we only whitelist hams that have many recipients e.g. the weekly emails sent from Mar...@mo... where recipients have elected to subscribe to the emails?

I suppose my broader question is: how should spam and ham be defined for the use of Pyzor?

Many thanks,
Alpana Weaver |
|
From: Matus U. - f. <uh...@fa...> - 2009-09-07 07:46:08
|
On 04.09.09 21:57, Alpana Weaver wrote:
> I am writing my Master's dissertation on Pyzor and would be grateful for
> an answer to the following.
>
> What types of emails should be whitelisted? I assume that there is no
> value in me whitelisting for example a personal email from my sister of
> which I am the only recipient because no one else will receive that same
> email? If that is correct, do we only whitelist hams that have many
> recipients e.g. the weekly emails sent from
> Mar...@mo... where recipients have elected to
> subscribe to the emails?

I think that any mail that hits Pyzor, and that you believe is not spam, should be whitelisted. That means not every email from your sister, but any email from your sister that hits Pyzor.

> I suppose my broader question is: how should spam and ham be defined for
> the use of Pyzor?

--
Matus UHLAR - fantomas, uh...@fa... ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
10 GOTO 10 : REM (C) Bill Gates 1998, All Rights Reserved! |
|
From: Tony M. <to...@sp...> - 2009-09-07 22:40:39
|
Hi Alpana,

> What types of emails should be whitelisted?

IMO:

* 'Bugs': ham where the hash of the message is identical to the hash of another (unrelated) message, either because of a flaw in the code (e.g. the one mentioned in ticket #57 and in this list at various times), or because the digest algorithm isn't selecting sufficiently unique text to hash. This is the only case where single-recipient messages would be worth reporting.

* 'False positives': ham that has already been processed with Pyzor, which wasn't sent to a single recipient, and which had a non-zero hit count. IOW someone has incorrectly reported the message as spam. It's unfortunately common for people to hit the 'spam' button for automated messages that they subscribed to but are no longer interested in; this could also occur if someone is automatically feeding messages to Pyzor; it could also be a true mistake. If you are the only recipient, then there isn't really any value in whitelisting, but if there is even one other, then it's possible that you'll whitelist before their message is checked, and you'll save them from the false positive.

An average user is not going to be able to distinguish between these two, but they are distinct to me as (a) a Pyzor developer, and (b) someone using Pyzor to filter many users' mail.

* 'My mail': this is controversial, and no-one is doing this on public.pyzor.org at the moment (but they might be on their own servers). If you're sending out messages to multiple recipients, then you could whitelist the message before you actually send it (this could be done automatically by your local SMTP server). Since headers aren't used in the hash, the hash would be the same pre-sending as on arrival. You'd then be ensuring that Pyzor didn't flag your messages. Of course, if the spammers did this, it would not be good! (Whitelisting typically requires an account on the Pyzor server, to avoid that.)

* 'True negatives': i.e. you have some other method(s) of determining whether a multiple-recipient message is ham or spam, and if it comes out as ham, then you report it to Pyzor, either without checking Pyzor, or even when Pyzor does not consider it spam. I'm not certain this is a good idea: it increases the resources required for the Pyzor server, and you're saying that you really trust these other classification methods (because they'll override any Pyzor result). However, if you assume that the other classification never has false negatives, and you're willing to commit the additional server resources, then this would avoid false positives (from bugs or user error) for anyone classifying their mail after you do.

Something that you could look at (if you get suitable corpora of email) is the time distribution of identically-hashed messages (ignoring any caused by bugs). If identical-hash messages generally arrive within a short period of time (e.g. several hours), then the value of manual whitelisting based on user feedback is diminished, because it is likely that by the time it is done all the messages have been classified already. In that case, only automated whitelisting would have much value. However, if they arrive over a long period of time (e.g. batches occurring over a week), then there is value in a manual whitelist operation.

Cheers,
Tony |
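The time-distribution measurement suggested above could be sketched roughly as follows. This is only an illustration and not part of Pyzor: it assumes you have already extracted (digest, arrival time) pairs from your own corpus by whatever means, and the function names `arrival_spans` and `fraction_arriving_after` are invented for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def _group_by_digest(records):
    """Collect arrival times per digest, sorted oldest first."""
    by_digest = defaultdict(list)
    for digest, arrived in records:
        by_digest[digest].append(arrived)
    return {digest: sorted(times) for digest, times in by_digest.items()}


def arrival_spans(records):
    """Time between first and last arrival, for digests seen more than once."""
    return {
        digest: times[-1] - times[0]
        for digest, times in _group_by_digest(records).items()
        if len(times) > 1
    }


def fraction_arriving_after(records, delay):
    """Share of repeat copies that arrive more than `delay` after the first copy.

    A high value suggests a manual whitelist done `delay` after the first
    arrival could still help later recipients; a low value suggests it is
    mostly too late by then.
    """
    late = total = 0
    for times in _group_by_digest(records).values():
        first = times[0]
        for arrived in times[1:]:  # every copy after the first one
            total += 1
            if arrived - first > delay:
                late += 1
    return late / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample data: digests are placeholders, not real Pyzor digests.
    t0 = datetime(2009, 9, 7, 8, 0)
    sample = [
        ("aaa", t0), ("aaa", t0 + timedelta(hours=1)), ("aaa", t0 + timedelta(hours=2)),
        ("bbb", t0), ("bbb", t0 + timedelta(days=3)),
        ("ccc", t0),  # seen once, so it has no span
    ]
    print(arrival_spans(sample))
    print(fraction_arriving_after(sample, timedelta(hours=6)))
```

Choosing `delay` to match how long users typically take to give feedback would indicate how many later copies a manual whitelist could still protect.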
|
From: Guido <lis...@gu...> - 2009-09-07 10:21:32
|
On (09-09-04 21:57), Alpana Weaver wrote:
> What types of emails should be whitelisted? I assume that there is no
> value in me whitelisting for example a personal email from my sister of
> which I am the only recipient because no one else will receive that same
> email? If that is correct, do we only whitelist hams that have many
> recipients e.g. the weekly emails sent from
> Mar...@mo... where recipients have elected to
> subscribe to the emails?
> I suppose my broader question is: how should spam and ham be defined for
> the use of Pyzor?

From my point of view there are two reasons why someone would whitelist a message.

The first (and maybe the more important one) arises when a spam mail has been correctly reported as spam. Unfortunately it may happen that the digest of the spam mail is not meaningful and therefore matches the digest of ham mails. Check this thread [1] for an example. Here I assume that a low false positive rate is more important than a high true positive rate.

The second case is that someone reported (by accident or not) a ham message as spam. As it is not easy to remove reported messages from the server, someone has to whitelist this message so that it does not hit Pyzor any more. This may be the case for bulk mails, for example. Unfortunately it is very likely that these messages to your users have already hit Pyzor by the time you whitelist them.

As the messages from your sister should be unique, there is no need to whitelist them. That's true. This would just generate futile entries in pyzord's database.

From my point of view, whitelisting messages from bulk senders that have not been reported yet does not make sense either. By the time you find such a message in your mailbox, it is very likely that it has already been delivered to all recipients. So a whitelisting entry in the database would be useless too. Hopefully next month's message will differ from the current message ;-)

If you do get recurring messages with the same digest, it may be a good idea to whitelist those, just to guard against accidental reporting (reporting would not have any effect after this). Anyhow, I can't see any use case for this.

There may be another reason for whitelisting messages. Assuming that Pyzor checks are done _before_ the mail goes into SpamAssassin, it could save a lot of system resources if a whitelisted message bypasses SpamAssassin. But I am not sure whether this actually makes sense in the real world.

Good luck with your paper. I would like to read it after it's finished.

[1] http://sourceforge.net/mailarchive/message.php?msg_name=1248706984.24454.17.camel%40werner |
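The whitelisting criteria discussed in this thread can be condensed into a small decision rule. The sketch below is one possible reading of the replies, not an agreed policy and not Pyzor code; the `Message` fields and the `should_whitelist` function are hypothetical names, and how you would obtain the report count or detect a digest collision is left open.

```python
from dataclasses import dataclass


@dataclass
class Message:
    is_ham: bool                    # your own (or another classifier's) judgement
    report_count: int               # hypothetical: spam reports seen for this digest
    recipients: int                 # hypothetical: people who received this exact message
    digest_collision: bool = False  # digest is known to match unrelated mail


def should_whitelist(msg):
    """Return True if whitelisting looks worthwhile under the thread's reasoning."""
    if not msg.is_ham:
        return False  # never whitelist spam
    if msg.report_count == 0:
        return False  # nothing has been reported, so there is nothing to counteract
    if msg.digest_collision:
        return True   # a colliding digest is worth whitelisting even for one recipient
    return msg.recipients > 1  # a misreported ham only matters if others receive it


if __name__ == "__main__":
    newsletter = Message(is_ham=True, report_count=3, recipients=500)
    from_sister = Message(is_ham=True, report_count=0, recipients=1)
    print(should_whitelist(newsletter), should_whitelist(from_sister))
```

The rule deliberately leaves out the more contested options raised in the thread, such as whitelisting your own outgoing mail in advance or whitelisting every ham regardless of hit count.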