It would be nice if bogofilter could track messages that
have already been classified and added to its database
(perhaps by Message-Id), so that a given message is never
added twice and a misclassified message can be
reclassified without explicitly specifying the -Ns or -Sn
options.
For example, suppose bogofilter is run on a set of folders
or mailboxes that are known to be non-spam, and one of the
folders contains a message that was previously
misclassified as spam. Bogofilter would automatically apply
-Sn to this message, placing the tokens in the proper
database.
If bogofilter encounters a Message-Id that is not in its
database, it would register the message with -s or -n,
according to the option given.
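The behaviour described above can be sketched as a small decision function. This is only an illustration, not something bogofilter provides: registration_flags and the seen mapping are hypothetical names, while the returned strings are bogofilter's existing -s, -n, -Sn and -Ns options.

```python
def registration_flags(seen, message_id, wanted):
    """Return the bogofilter flags needed for one message, or None.

    seen       -- hypothetical mapping of Message-Id -> "spam" or "ham"
    message_id -- the Message-Id of the message being processed
    wanted     -- "spam" or "ham", i.e. whether the run was given -s or -n
    """
    previous = seen.get(message_id)
    if previous == wanted:
        return None  # already registered correctly; do not count it twice
    if previous is None:
        # Never seen before: plain registration.
        flags = "-s" if wanted == "spam" else "-n"
    else:
        # Registered in the wrong list: undo that and register correctly.
        flags = "-Ns" if wanted == "spam" else "-Sn"
    seen[message_id] = wanted
    return flags
```

Called for a message that was earlier recorded as spam but now sits in a non-spam folder, the function returns "-Sn"; a second call for the same message returns None, so nothing is added twice.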
If this feature were available, it would be easy to set up
a cron job for those cases where the -u option cannot be
used (say, on a set of folders that have been sorted after
a day's work), without having to worry about duplicated or
incorrectly sorted messages.
A recurse folders option would also be handy in
conjunction with this feature!
Logged In: YES
user_id=30510
A tracking system such as you propose could be implemented
with a script external to bogofilter. If you would care to
create and support this script, we can add it to the
bogofilter/contrib directory.
Script to build bogofilter databases automatically...(perl)
Logged In: YES
user_id=787521
I've created a perl script which does the above
automatically. It uses the perl dbm commands to maintain a
list of email message IDs inside the .bogofilter directory.
Only messages in the "Spam" folder will be considered spam.
Everything in any folder other than Spam, Sent and Drafts
is considered non-spam. Sent and Drafts are ignored for
purposes of this script.
This script is written such that it can be called at any
time to update your bogofilter database. It can be used to
build the initial database, and can be called from a cron
job daily. It expects you to have a mail folder hierarchy
in ~/Mail.
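A rough sketch of the folder rules just described, in Python rather than perl. It is not the attached script: it assumes each folder is a single mailbox file named after the folder, and folder_class and classify_tree are hypothetical names.

```python
import os

IGNORED = {"Sent", "Drafts"}   # folders that are skipped entirely
SPAM_FOLDER = "Spam"           # the only folder treated as spam


def folder_class(name):
    """Map a folder name to "spam", "ham", or None (ignored)."""
    if name in IGNORED:
        return None
    return "spam" if name == SPAM_FOLDER else "ham"


def classify_tree(root):
    """Yield (path, class) for every mailbox file below root, recursively."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            kind = folder_class(filename)
            if kind is not None:
                yield os.path.join(dirpath, filename), kind
```

Calling classify_tree(os.path.expanduser("~/Mail")) would list every mailbox under ~/Mail together with the class its messages should be registered under.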
If you find that you've sorted an email into the wrong
folder, put it in the correct folder and re-run the script.
It will take care of updating bogofilter for you
(bogofilter -Ns or -Sn).
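For illustration, the bookkeeping behind such a re-run might look like the following Python sketch. It is not the attached perl script: update_record is a hypothetical name, and the location of the dbm file is left to the caller.

```python
import dbm


def update_record(db_path, message_id, wanted):
    """Store the class for message_id; return the previous class or None.

    wanted is "spam" or "ham". The caller compares the return value with
    wanted to decide whether bogofilter needs -s/-n, -Ns/-Sn, or nothing.
    """
    with dbm.open(db_path, "c") as db:   # "c": create the file if missing
        key = message_id.encode("utf-8")
        previous = db[key].decode("ascii") if key in db else None
        db[key] = wanted.encode("ascii")
    return previous
```

If the returned value is None the message is new; if it differs from wanted the message was moved to another folder and needs correcting; if it equals wanted there is nothing to do.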
I'm not a perl wiz. The script just works...you're welcome
to make it better. (I'm open to any suggestions.)
Initially, I had thought it would be nice to have this
functionality built into bogofilter, so that it uses the
same db format, and could be activated with just a couple
switches...
Updated version of above script with notes explaining how it works...
Logged In: YES
user_id=2788
A word of warning:
The Message-ID alone is not sufficient to distinguish one mail from
another. Mail originating in badly configured multidrop (aka
domain-in-a-mailbox) systems that has been reinjected for instance will
have the same Message-ID, but a different envelope sender. Mail that
has floated through a list exploder will again have the same
Message-ID, but a different envelope and some headers added. You
may want to distinguish these mails, so the tuple (Message-ID,
Envelope-Sender, Envelope-Recipient) will be more effective. The
Envelope-Sender can be read from the Return-Path: header (see
RFC-2821), the Envelope Recipient is not standardized. Common
headers to look in are X-Original-To:, Delivered-To:, X-Envelope-To:,
but it depends on the receiving MTA.
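A minimal sketch of building such a tuple with Python's standard email package. tracking_key is a hypothetical name, and the recipient headers are tried in the order listed above; whether any of them is present depends on the receiving MTA.

```python
from email import message_from_string

# Headers that commonly carry the envelope recipient (not standardized).
RECIPIENT_HEADERS = ("X-Original-To", "Delivered-To", "X-Envelope-To")


def tracking_key(raw_message):
    """Return (Message-ID, envelope sender, envelope recipient)."""
    msg = message_from_string(raw_message)
    message_id = (msg.get("Message-ID") or "").strip()
    sender = (msg.get("Return-Path") or "").strip()
    recipient = ""
    for header in RECIPIENT_HEADERS:
        value = msg.get(header)
        if value:
            recipient = value.strip()
            break
    return (message_id, sender, recipient)
```

Two copies of a list posting that reached different recipients then produce different keys even though their Message-ID is identical.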
Logged In: YES
user_id=41225
SpamProbe uses an MD5 hash of the complete mail (or something
similar) to avoid registering the same mail twice.
I'd love to have this feature in bogofilter too :)
Balu
Logged In: YES
user_id=787521
An md5 hash would be nice, except that it might involve a
fair amount of computation and/or disk access, unless the
hash is computed and stored when the message is created and
simply retrieved when the mail is scanned.
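To give an idea of the cost, hashing a whole message is a single pass over its bytes. The sketch below uses Python's hashlib; it is not SpamProbe's code, and message_digest and is_new are hypothetical names.

```python
import hashlib


def message_digest(raw_bytes):
    """Return the hex MD5 digest of a complete raw message."""
    return hashlib.md5(raw_bytes).hexdigest()


def is_new(digests, raw_bytes):
    """Record the message's digest; return False if it was seen before."""
    digest = message_digest(raw_bytes)
    if digest in digests:
        return False
    digests.add(digest)
    return True
```

Here digests is an in-memory set; a real tool would keep it on disk, which is where the extra disk access mentioned above comes in.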
IMAP has a unique ID for each message, but that ID is lost
if you change mailbox formats.
It would also be nice if bogofilter understood different
mailbox formats and retrieval methods, in case you access
the mail from a separate process via IMAP, or even on a
separate server...