#896 Training via web interface (sometimes) doesn't work?

Milestone: 1.1.x
Status: open
Owner: nobody
Labels: pop3proxy (138)
Priority: 5
Updated: 2010-10-25
Created: 2010-10-25
Creator: Carl Colijn
Private: No

Hi all,

This is a summary of my post to the SpamBayes user discussion list, where it went unanswered (October 14, http://blog.gmane.org/gmane.mail.spam.spambayes.general ).

I have 2 Thunderbird spam training folders (one ham, one spam). With SpamBayes 1.0.4 I used these to quickly re-train the filter after a re-install and such. These folder files have no extension, but they worked perfectly when uploading them for training via the web interface.

I have now set up a SpamBayes 1.1a6 installation and let it train on the training folders, but it didn't work. The web interface showed no errors and training seemed to go OK (uploaded ok, Training... Saving... Done!), but the statistics on the main page ("Total emails trained") didn't reflect the newly trained mails (neither ham nor spam).

After that I uninstalled SpamBayes and tried the ThunderBayes++ Thunderbird plugin (which also includes version 1.1a6), but it too wouldn't train via the web interface - training seems to go OK but the trained-on mails don't arrive in the database.

Maybe it's some silly configuration issue on my part, but I've already tweaked it for quite a few hours and can't get it right. Even if the cause is my configuration history, that would mean certain configuration changes can break a SpamBayes installation.

The attached zip file contains the configuration file set after test training on 1 ham message, and it also includes 2 sample Thunderbird mail files (spam & ham) containing one email each - I used these for the test training.

Some observations:
- I run Windows XP SP3 en-us with the SpamBayes 1.1a6 version shipped with ThunderBayes++ - the databases are of the pickle type
- My training folders (2 Thunderbird files - spam and ham) contain approximately 250 ham and 6000 spam messages
- When I start clean (close Thunderbird/SpamBayes, delete the cache & training databases) it re-creates them OK on restart
- After a restart it claims there are 0 trained messages (of course)
- When I upload the Thunderbird ham training folder file it seems to process it correctly but after it's done the counter still remains at "0 trained messages"
- hammie.db doesn't grow either (56 bytes after a clean database recreation, still 56 bytes after training)
- There are no errors in the log
- I've enabled caching messages (ThunderBayes by default has it off I think), and the uploaded messages do get extracted as separate messages in the cache - messageinfo.db indeed also grows
- "Review messages" sometimes shows the uploaded messages, but not consistently - they did appear a few times after I tweaked and restarted and such
- Copy/pasting a separate mail with headers and training on that has the same effect
- When I let it train on my Spam folder (with 6000+ mails in it) it is seriously busy - CPU at 100% for more than 10 minutes - so it must be doing something (apart from extracting to the cache)?
- Subsequently letting it train on the small Ham folder (250 messages) now takes far more time - the 6000+ spam messages it processed earlier must have influenced something
- When I look at the "More statistics" page the uploaded messages _do_ get reflected in the "Unsures trained as good" and "Unsures trained as spam" statistics
- Training via the ThunderBayes plugin buttons in Thunderbird _does_ raise the "trained on" counters - what does it do that I cannot do?
- There are no SMTP proxy details specified in the settings - I assume ThunderBayes++ passes the ham/spam training via the web interface as well?
- Starting from scratch again (delete db's, clear email cache) and selecting "bsddb" as db type didn't change a thing
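
For anyone who wants to repeat the file-size check from the observations above, here is a minimal sketch in Python 3. The helper names `snapshot_sizes` and `changed_files` are made up for illustration and are not part of SpamBayes; the paths in the usage comment are simply the ones from my configuration.

```python
import os

def snapshot_sizes(paths):
    """Return {path: size in bytes}, with None for files that do not exist."""
    return {p: (os.path.getsize(p) if os.path.exists(p) else None) for p in paths}

def changed_files(before, after):
    """Return the sorted paths whose size differs between two snapshots."""
    return sorted(p for p in after if before.get(p) != after[p])

# Usage (paths are assumptions, relative to the SpamBayes data directory):
#   db_files = ["databases/hammie.db", "databases/messageinfo.db"]
#   before = snapshot_sizes(db_files)
#   ... upload a folder file for training via the web interface ...
#   after = snapshot_sizes(db_files)
#   print(changed_files(before, after))
```

In my case this would list only messageinfo.db, since hammie.db stays at 56 bytes. A size comparison is only a rough indicator: it cannot tell whether a file was rewritten with content of the same length.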

Here's the spambayes.ini file I use:

[Headers]
include_score:True
notate_subject:
[Storage]
persistent_use_database:pickle
persistent_storage_file:databases/hammie.db
cache_expiry_days:2
cache_messages:True
no_cache_bulk_ham:False
messageinfo_storage_file:databases/messageinfo.db
ham_cache:cache/ham
spam_cache:cache/spam
unknown_cache:cache/unsure
[html_ui]
default_spam_action:defer
display_score:True
[pop3proxy]
use_ssl:automatic
listen_ports:53100,53101,53102
remote_servers:xxx.xxx.com:110,xxx.xxx.com:995,xxx.xxx.nl:110
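
To double-check which database type and files these settings point at, the file can be read with Python's standard `configparser` module. This is only a sketch: it does not use SpamBayes' own options loader (which may apply defaults and path handling that this ignores), and the function name `storage_settings` is hypothetical. The sample text is a subset of the [Storage] section above.

```python
import configparser

# A subset of the [Storage] section from the spambayes.ini above.
SAMPLE_INI = """\
[Storage]
persistent_use_database:pickle
persistent_storage_file:databases/hammie.db
messageinfo_storage_file:databases/messageinfo.db
"""

def storage_settings(ini_text):
    """Parse ini text and return the [Storage] section as a plain dict."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return dict(parser["Storage"])

settings = storage_settings(SAMPLE_INI)
print(settings["persistent_use_database"], settings["persistent_storage_file"])
```

`configparser` accepts both `key:value` and `key=value` lines, so the file parses without changes.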

If anyone wants a copy of my installation or wants me to test some things, please feel free to ask - I'm willing to have my installation dissected for the good cause.

Discussion

  • Carl Colijn
    2010-10-25

    Test training files and resulting config

     