#17 Easy message sorting and classification

Russel Ross

It would be nice to have the capability to track
messages that have already been classified and added to
the bogofilter database (maybe by Message-Id), so
that bogofilter will not add a given message twice, or
so that a misclassified message can be properly
classified without explicitly specifying the -Ns or -sN

For example, suppose bogofilter is run on a set of folders or
mailboxes that are known to be non-spam. One of the
folders contains a message that was previously
misclassified as spam. Bogofilter automatically runs
-Sn on this message, placing the tokens in the proper

If bogofilter encounters a Message-Id not in its
database, it will classify the message via -s or -n
based on the given option.

If this feature were available, it would be easy to
set up cron-job automation for those applications
where it is not possible to use the -u option (say, on
a set of folders that have been sorted after a day's
work), without having to worry about duplicated or
incorrectly sorted messages.

An option to recurse into folders would also be handy in
conjunction with this feature!
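
The tracking behaviour requested above could be sketched roughly as follows. This is only an illustration, not bogofilter code: the function name `plan_action` and the in-memory `seen` dictionary are hypothetical, and the option strings (-s, -n, -Ns, -Sn) are the ones mentioned in this request.

```python
def plan_action(seen, message_id, folder_class):
    """Return the bogofilter option to run for a message, or None to skip it.

    seen         -- dict mapping Message-Id to "spam" or "ham" (hypothetical store)
    message_id   -- the Message-Id of the message being scanned
    folder_class -- "spam" or "ham", as implied by the folder being scanned
    """
    previous = seen.get(message_id)
    if previous == folder_class:
        # Already registered under this class: do not add it twice.
        return None
    if previous is None:
        # Unknown Message-Id: plain registration.
        option = "-s" if folder_class == "spam" else "-n"
    else:
        # Known under the other class: move the tokens to the proper list.
        option = "-Ns" if folder_class == "spam" else "-Sn"
    seen[message_id] = folder_class
    return option
```

A caller would run bogofilter with the returned option on the message and persist `seen` between runs; how that store is kept is left open here.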


  • David Relson

    David Relson - 2003-06-06

    A tracking system such as you propose could be implemented
    with a script external to bogofilter. If you would care to
    create and support this script, we can add it to the
    bogofilter/contrib directory.

  • Russel Ross

    Russel Ross - 2003-06-06

    Script to build bogofilter databases automatically...(perl)

  • Russel Ross

    Russel Ross - 2003-06-06

    I've created a Perl script which does the above
    automatically. It uses the Perl dbm functions to maintain a
    list of email message IDs inside the .bogofilter directory.

    Only messages in the "Spam" folder will be considered spam.
    Everything in any other folder except Spam, Sent and Drafts
    is considered non-spam. Spam, Sent and Drafts are ignored
    for purposes of this script.
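
The folder rule can be illustrated with a short sketch. It is written in Python rather than Perl and is not taken from the attached script. It assumes the intended reading is: "Spam" means spam, "Sent" and "Drafts" are skipped, and every other folder is non-spam. The name `classify_folder` is hypothetical.

```python
# Folders that are skipped entirely (assumed reading of the rule above).
IGNORED_FOLDERS = {"Sent", "Drafts"}

def classify_folder(folder_name):
    """Map a mail folder name to "spam", "ham", or None (skip the folder)."""
    if folder_name == "Spam":
        return "spam"
    if folder_name in IGNORED_FOLDERS:
        return None
    return "ham"
```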

    This script is written such that it can be called at any
    time to update your bogofilter database. It can be used to
    build the initial database, and can be called from a cron
    job daily. It expects you to have a mail folder hierarchy
    in ~/Mail.

    If you find that you've sorted an email into the wrong
    folder, put it in the correct folder and re-run the script.
    It will take care of updating bogofilter for you
    (bogofilter -Ns or -sN).

    I'm not a Perl wiz. The script just works...you're welcome
    to make it better. (I'm open to any suggestions.)

    Initially, I had thought it would be nice to have this
    functionality built into bogofilter, so that it uses the
    same db format, and could be activated with just a couple

  • Russel Ross

    Russel Ross - 2003-06-06

    Updated version of above script with notes explaining how it works...

  • Matthias Andree

    Matthias Andree - 2003-11-02

    A word of warning:
    The Message-ID alone is not sufficient to distinguish one mail from
    another. Mail originating in badly configured multidrop (aka
    domain-in-a-mailbox) systems that has been reinjected for instance will
    have the same Message-ID, but a different envelope sender. Mail that
    has floated through a list exploder will again have the same
    Message-ID, but a different envelope and some headers added. You
    may want to distinguish these mails, so the tuple (Message-ID,
    Envelope-Sender, Envelope-Recipient) will be more effective. The
    Envelope-Sender can be read from the Return-Path: header (see
    RFC-2821), the Envelope Recipient is not standardized. Common
    headers to look in are X-Original-To:, Delivered-To:, X-Envelope-To:,
    but it depends on the receiving MTA.
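
A minimal sketch of building such a tuple with Python's standard `email` package is shown below. The function name `dedup_key` is hypothetical, and the recipient headers are simply tried in the order listed above; which one is actually present depends on the receiving MTA.

```python
from email import message_from_string

# Candidate headers for the envelope recipient, tried in this order.
RECIPIENT_HEADERS = ("X-Original-To", "Delivered-To", "X-Envelope-To")

def dedup_key(raw_message):
    """Return (Message-ID, envelope sender, envelope recipient) for a message.

    Missing headers are represented by empty strings.
    """
    msg = message_from_string(raw_message)
    message_id = (msg.get("Message-ID") or "").strip()
    sender = (msg.get("Return-Path") or "").strip()
    recipient = ""
    for name in RECIPIENT_HEADERS:
        value = msg.get(name)
        if value:
            recipient = value.strip()
            break
    return (message_id, sender, recipient)
```

Two copies of a list mail delivered to different recipients then yield different keys even though their Message-ID is identical.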

  • Anonymous - 2003-11-19

    SpamProbe uses an MD5 hash of the complete mail to avoid
    processing the same mail twice, or something similar.

    I'd love to have this feature in bogofilter too :)
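
For illustration only, hashing a complete message with Python's `hashlib` might look like the sketch below. This is not SpamProbe's implementation; the names `message_digest` and `is_duplicate` and the in-memory set are hypothetical.

```python
import hashlib

def message_digest(raw_bytes):
    """Return the hex MD5 digest of the complete raw message."""
    return hashlib.md5(raw_bytes).hexdigest()

def is_duplicate(seen_digests, raw_bytes):
    """Record the message digest; return True if it was already recorded."""
    digest = message_digest(raw_bytes)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

Because the digest covers every byte, two copies of a mail that differ in any header count as different messages.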


  • Russel Ross

    Russel Ross - 2003-11-19

    An MD5 hash would be nice, except that it might involve a
    fair amount of computation and/or disk access, unless maybe
    the hash is computed and stored at the time the message is
    created (so that it is just retrieved when the mail is scanned).
    IMAP has a unique id for each message, but you don't have
    this id if you change mailbox formats.

    It would also be nice to have the ability to understand
    different mailbox formats/retrieval methods, in case you
    were to access the mail as a separate process via IMAP, or
    even on a separate server...
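
As a rough illustration of handling more than one mailbox format, Python's standard `mailbox` module can read both mbox files and Maildir directories. The helper names below are hypothetical, the format detection is a simple heuristic, and IMAP access is not covered.

```python
import mailbox
import os

def open_mailbox(path):
    """Open path as a Maildir (directory with cur/) or as an mbox file."""
    if os.path.isdir(os.path.join(path, "cur")):
        return mailbox.Maildir(path, create=False)
    if os.path.isfile(path):
        return mailbox.mbox(path, create=False)
    raise ValueError("unrecognised mailbox: %s" % path)

def message_ids(path):
    """Return the Message-ID header of every message in the mailbox."""
    box = open_mailbox(path)
    try:
        return [msg.get("Message-ID") for msg in box]
    finally:
        box.close()
```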

