
#383 Database corruption after multiple trainings

Source code 1.0a6
closed-fixed
imapfilter (36)
5
2003-11-18
2003-10-20
Jacob
No

I've made a few posts about this to the Spambayes
mailing list and I'm going to paste those messages for
reference. However, in summary: regardless of what
data format I use, after several trainings using the -t
flag, my database becomes corrupted. I've been able to
reproduce this using both a Pickle and Bsddb[3]. Each
time, if I remove the DB and retrain from scratch, there
isn't a problem. Also, if I just classify, I don't have any
corruption problems (that is, if I just train once and
after that never train again). The training always
completes, and when it moves on to classifying, I get an
assertion error.

The messages from the mailing list follow. Below the
messages, I've included a sample sb_imapfilter session
transcript.

----------------------------------------
Messages from mailing list:
----------------------------------------
From "Tony Meyer" <tameyer@xxxx.xx.xx>
Subject RE: [Spambayes] Serious Database Corruption Problems
Date Wed, October 15, 2003 6:18 pm
To leotune@xxxxxxxxxxxxxxxxxx.xxx,spambayes@xxxxxx.xxx

--------------------------------------------------------------------------------

> I'm having a lot of trouble with what I think is database corruption.
> I've included the output I get from the program before, but
> from what I've read, an assertion error usually means the database is
> dead.

Yes - what this is saying is that you have a token that has appeared in more
spam than you have trained, which is obviously impossible.

> As the FAQ suggests, I've tried both Bsddb[3] and Pickle formats, but
> after a few trainings, I always get this error. If I delete
> my databases and start over, then I'm fine for a few additional trainings,
> but the same thing happens.

It's very strange that this happens with a pickle. To me, that sounds like
this is an imapfilter bug, although not one I've seen reported before.

> I'm getting a little frustrated with this. Is there
> something I can do to keep this from happening?

Do you do all your training with "sb_imapfilter.py -t"? Up until the
assertion error, does the training always successfully complete? (i.e. it
doesn't crash halfway through?)

If you run db_expimp.py on your database to convert it to text
("db_expimp.py -e -d hammie.db -f hammie.txt" if it's a pickle) and open it
up, what are the ham and spam counts at the top? (I suspect 0 for both).

=Tony Meyer
--------------------------------------------------
From jacob-spambayes-list@xxxxxxxxxxxxxxxxxx.xxx
Subject RE: [Spambayes] Serious Database Corruption Problems
Date Wed, October 15, 2003 10:56 pm
To spambayes@xxxxxx.xxx

--------------------------------------------------------------------------------

>> I'm getting a little frustrated with this. Is there
>> something I can do to keep this from happening?
>
> Do you do all your training with "sb_imapfilter.py -t"? Up until the
> assertion error, does the training always successfully complete? (i.e.
> it doesn't crash halfway through?)

Yes, I do all of my training that way. The training always completes, and
then the program fails during classification. I've included a typical
transcript below. Something worth making note of: it seems like, many
times during training, it'll report that messages are trained when there
are no new messages in that particular folder.

>
> If you run db_expimp.py on your database to convert it to text
> ("db_expimp.py -e -d hammie.db -f hammie.txt" if it's a pickle) and open
> it up, what are the ham and spam counts at the top? (I suspect 0 for both).

suslik% more hammie.txt
311,431,

I can send you the whole file if it'd be useful.

Thanks,
Jacob
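
One way to follow up on the exported file is to scan it for tokens whose counts exceed the totals on the first line, since that is the condition Tony describes as impossible. The following is a minimal sketch in Python 3 and is not part of SpamBayes. The function name find_impossible_tokens is hypothetical, and the file layout is an assumption based only on the "311,431," line above: a first row of ham and spam totals, then one row per token holding the token, its ham count, and its spam count.

```python
import csv
import io


def find_impossible_tokens(export_text):
    """Return (nham, nspam, offenders) for a text export of the database.

    ASSUMED layout (not confirmed by this report): the first row is
    "nham,nspam," and every later row is "token,hamcount,spamcount,".
    """
    rows = csv.reader(io.StringIO(export_text))

    def strip_trailing_empties(row):
        # A trailing comma produces an empty last field; drop it.
        row = list(row)
        while row and row[-1] == "":
            row.pop()
        return row

    header = strip_trailing_empties(next(rows))
    nham, nspam = int(header[0]), int(header[1])

    offenders = []
    for raw in rows:
        row = strip_trailing_empties(raw)
        if len(row) < 3:
            continue  # skip blank or malformed rows
        token, hamcount, spamcount = row[0], int(row[1]), int(row[2])
        # A token cannot appear in more messages than were trained.
        if hamcount > nham or spamcount > nspam:
            offenders.append((token, hamcount, spamcount))
    return nham, nspam, offenders


# Made-up sample data, for illustration only.
sample = "311,431,\nviagra,0,432,\nhello,12,3,\n"
nham, nspam, offenders = find_impossible_tokens(sample)
```

If the list of offenders is non-empty, the token counts and the message totals have drifted apart, which would match the assertion failure. Reading the real file would look like find_impossible_tokens(open("hammie.txt").read()).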

----------------------------------------
A sample sb_imapfilter transcript:
----------------------------------------
Something worth noting about the following transcript:
for the lines that look like these, the contents of those
two folders never changed, so I don't understand why
they indicated messages were trained. It doesn't do
this with the Inbox.

Training ham folder INBOX
.*............ 1 trained.
Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
....................................................
1 trained.
----------------------------------------

suslik% ./sb_imapfilter.py -l 5 -c -t -v -d hammie.db
SpamBayes IMAP Filter Beta1, version 0.1 (September 2003),
using SpamBayes IMAP Filter Web Interface Alpha2, version 0.02
and engine SpamBayes Beta2, version 0.2 (July 2003).

Loading state from hammie.db pickle
hammie.db is an existing pickle, with 310 ham and 417 spam
Loading database hammie.db... Done.
Training
Training ham folder INBOX.-Wanted
............................................................................
............................................................................
............................................................................
.....................................................................
0 trained.
Training ham folder INBOX
.*............ 1 trained.
Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
......................................**************
15 trained.
Persisting hammie.db as a pickle
Training took 35.0596210957 seconds, 16 messages were trained
Classifying
...................
Classified 0 ham, 0 spam, and 0 unsure.
Classifying took 0.656105995178 seconds.
Training
Training ham folder INBOX.-Wanted
............................................................................
............................................................................
............................................................................
.....................................................................
0 trained.
Training ham folder INBOX
.*............ 1 trained.
Training spam folder INBOX.-Spam
*..........................................................................
............................................................................
............................................................................
............................................................................
............................................................................
....................................................
1 trained.
Persisting hammie.db as a pickle
Training took 29.7854119539 seconds, 2 messages were trained
Classifying
..................*.Traceback (most recent call last):
  File "./sb_imapfilter.py", line 824, in ?
    run()
  File "./sb_imapfilter.py", line 814, in run
    imap_filter.Filter()
  File "./sb_imapfilter.py", line 675, in Filter
    self.unsure_folder)
  File "./sb_imapfilter.py", line 594, in Filter
    evidence=True)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "/u/jpfarmer/lib/python2.3/site-packages/spambayes/classifier.py", line 245, in probability
    assert spamcount <= nspam
AssertionError
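
To illustrate what the failing assertion guards, here is a toy model of the bookkeeping in Python 3. It is not the SpamBayes classifier: the class ToyCounts and its methods are hypothetical, and the way the counts are knocked out of sync is artificial. Nothing in this report establishes what actually caused the mismatch.

```python
class ToyCounts:
    """Toy token bookkeeping; NOT the SpamBayes implementation."""

    def __init__(self):
        self.nspam = 0        # number of spam messages trained
        self.spamcount = {}   # token -> number of spam messages containing it

    def train_spam(self, tokens):
        self.nspam += 1
        for token in set(tokens):  # count each token once per message
            self.spamcount[token] = self.spamcount.get(token, 0) + 1

    def check(self, token):
        # Same condition as the assertion in the traceback.
        return self.spamcount.get(token, 0) <= self.nspam


counts = ToyCounts()
counts.train_spam(["cheap", "pills"])
counts.train_spam(["cheap", "offer"])
consistent = counts.check("cheap")      # 2 <= 2 holds

# Artificial desync: bump a token count without bumping the message total.
counts.spamcount["cheap"] += 1
corrupted = not counts.check("cheap")   # 3 <= 2 fails
```

As long as every training step updates the message total together with the token counts, the condition holds. Any path that changes one without the other leaves the database in the state the assertion rejects.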

Discussion

  • Jacob

    Jacob - 2003-11-07

    The behavior seems to have disappeared in the a7 release.

     
  • Tony Meyer

    Tony Meyer - 2003-11-18
    • status: open --> closed-fixed
     
  • Tony Meyer

    Tony Meyer - 2003-11-18

    Hopefully it was caused by one of the bugs fixed for 1.0a7,
    then. Please reopen if it recurs with 1.0a7.

     
