Thread: [bugs]database corruption problem
Fast Bayesian spam filter along lines suggested by Paul Graham
From: David A. <arn...@pa...> - 2004-06-19 02:28:59
|
I am using bogofilter 0.91.2 as installed by my ISP panix.com (they are competent).

Recently, I have been getting my database file ~/.bogofilter/wordlist.db corrupted. Specifically, if I execute
    bogoutil -d ~/.bogofilter/wordlist.db
then I get a listing that, after several thousand unique lines, repeats a block of lines forever.

This problem now occurs about once per day, on average. It has been happening since version 0.17.5, at least. Each time, I created a new database file from text using "bogoutil -l".

The fact that this happens repeatedly, even when I create the database file "from scratch", suggests that something is amiss.

As a work-around, is there a database utility that can repair my wordlist.db file? Thanks for any suggestions.
--
David Arnstein
arn...@po... |
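The symptom described above (a dump that starts repeating a block of lines) can be detected without reading an endless listing. The following is a minimal sketch, not part of bogofilter: the function name `first_repeat` is hypothetical, and it assumes that each line of the `bogoutil -d` dump starts with the token as its first whitespace-separated field. It stops at the first token seen twice, which also ends the pipeline feeding it.

```shell
#!/bin/sh
# first_repeat (hypothetical helper): read a wordlist dump on stdin and
# report the first token that shows up a second time.
# Assumption: the token is the first whitespace-separated field of each line.
# Exit status: 0 if every token is unique, 1 if a repeat was found.
first_repeat() {
    awk '
        seen[$1]++ {
            # This token was already seen once: report it and stop reading.
            printf "token \"%s\" repeats at line %d\n", $1, NR
            found = 1
            exit
        }
        END { exit found ? 1 : 0 }
    '
}

# Possible usage (path taken from the message above):
#   bogoutil -d "$HOME/.bogofilter/wordlist.db" | first_repeat
```

A healthy database should produce no output and exit status 0, since a dump is expected to list each token once. The array of seen tokens is held in memory, which should be acceptable for a wordlist of a few hundred thousand entries.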
From: Matthias A. <mat...@gm...> - 2004-06-19 03:13:07
|
"David Arnstein" <arn...@pa...> writes:

> I am using bogofilter 0.91.2 as installed by my ISP panix.com (they
> are competent).
>
> Recently, I have been getting my database file
> ~/.bogofilter/wordlist.db corrupted. Specifically, if I execute
>     bogoutil -d ~/.bogofilter/wordlist.db
> then I get a listing that, after several thousand unique lines,
> repeats a block of lines forever.

Which BerkeleyDB version does your bogofilter use? bogofilter -V will tell you.

What operating system and version does bogofilter run on?

What kind of file system are you using? Local (UFS, EXT3, ReiserFS, XFS, JFS, VxFS) or networked (NFS, AFS, Coda)?

> This problem now occurs about once per day, on average. It has been
> happening since version 0.17.5, at least. Each time I created a new
> database file from text using "bogoutil -l".

Are you running bogofilter with the -u flag?

Does bogofilter bump into file size limits? In bash, you can try ulimit -aH and ulimit -aS to check the limits, and ls -l .bogofilter to check if the size is close.

> As a work-around, is there a database utility that can repair my
> wordlist.db file? Thanks for any suggestions.

No, that is for later, when the transactional Berkeley DB data store is merged into the baseline.

--
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 |
From: David R. <re...@os...> - 2004-06-19 03:38:16
|
On Sat, 19 Jun 2004 05:12:57 +0200 Matthias Andree wrote:

> "David Arnstein" <arn...@pa...> writes:
>
> > I am using bogofilter 0.91.2 as installed by my ISP panix.com (they
> > are competent).
> >
> > Recently, I have been getting my database file
> > ~/.bogofilter/wordlist.db corrupted. Specifically, if I execute
> >     bogoutil -d ~/.bogofilter/wordlist.db
> > then I get a listing that, after several thousand unique lines,
> > repeats a block of lines forever.
>
> Which BerkeleyDB version does your bogofilter use?
> bogofilter -V will tell you.
>
> What operating system and version does bogofilter run on?
>
> What kind of file system are you using? local (UFS, EXT3, ReiserFS,
> XFS, JFS, VxFS) or networked (NFS, AFS, Coda)?
>
> > This problem now occurs about once per day, on average. It has been
> > happening since version 0.17.5, at least. Each time I created a new
> > database file from text using "bogoutil -l".
>
> Are you running bogofilter with -u flag?
>
> Does bogofilter bump into file size limits? In bash, you can try
> ulimit -aH and ulimit -aS to check the limits, and ls -l
> .bogofilter to check if the size is close.
>
> > As a work-around, is there a database utility that can repair my
> > wordlist.db file? Thanks for any suggestions.
>
> No, that is for later when the transactional Berkeley DB data store
> will be merged into the baseline.

Matthias,

An excellent set of questions! FWIW, I've been running "-u" for 18 months or so, receive 800-100 messages daily, use procmail's locking when running bogofilter, and have not seen database corruption in many months. This is on a Mandrake 10.0 system with BerkeleyDB 4.1.25.

David |
From: David A. <arn...@po...> - 2004-06-20 00:27:18
|
A big "thank you" to Mr. Andree and Mr. Relson for responding to my post. I attempt to supply all requested information about my execution environment for bogofilter.

The response to "bogofilter --version" is:

--------------------------------------------------------------------------
bogofilter version 0.91.2
    Database: BerkeleyDB (3.3.11)
Copyright (C) 2002-2004 Eric S. Raymond,
David Relson, Matthias Andree, Greg Louis

bogofilter comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under the General Public License. See the
COPYING file with the source distribution for details.
--------------------------------------------------------------------------

The response to "uname -a" is:

--------------------------------------------------------------------------
NetBSD panix3.panix.com 1.5.4_ALPHA NetBSD 1.5.4_ALPHA (PANIX-USER) #0: Thu Feb 26 14:11:15 EST 2004 ro...@ju...:/devel/NO-BACKUPS/release-1.5-20 020917/src/sys/arch/i386/compile/PANIX-USER i386
--------------------------------------------------------------------------

The filesystem seems to be some sort of networked commercial file server, I don't know the details. If this information is truly important, I'll post a request for help to the ISP. Let me know please.

The way I run bogofilter is that it is featured in one of my procmail recipes. In particular, my procmail recipe features the following:

--------------------------------------------------------------------------
#########################
# Bogofilter processing #
#########################

# Examine each incoming e-mail with bogofilter, and add a header line
# to it.
:0fw
| $BOGOFILTER -u -e -p

# If bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later.
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h.
:0e
{
    EXITCODE=75
    HOST
}

# file the mail to appropriate folder if it's spam.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
$BOGOFILE
--------------------------------------------------------------------------

Procmail itself is executed by the incoming mail daemon. This is accomplished by the existence of my ~/.forward file, which says

--------------------------------------------------------------------------
"|IFS=' ' && exec /usr/local/bin/procmail -f- || exit 75 #arnstein"
--------------------------------------------------------------------------

The "ulimit" info is as follows:

--------------------------------------------------------------------------
panix3 114> ulimit -aH
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) 1048576
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) 509184
max memory size       (kbytes, -m) 509184
open files                    (-n) 9932
pipe size          (512 bytes, -p) 1
stack size            (kbytes, -s) 32768
cpu time             (seconds, -t) unlimited
max user processes            (-u) 3092
virtual memory        (kbytes, -v) 1081344
panix3 115>
panix3 115> ulimit -aS
core file size        (blocks, -c) 0
data seg size         (kbytes, -d) 131072
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) 169728
max memory size       (kbytes, -m) 509184
open files                    (-n) 64
pipe size          (512 bytes, -p) 1
stack size            (kbytes, -s) 2048
cpu time             (seconds, -t) unlimited
max user processes            (-u) 80
virtual memory        (kbytes, -v) 133120
--------------------------------------------------------------------------

My file .bogofilter/wordlist.db is currently 9,912,320 bytes long. It contains 231,603 entries. These numbers increase constantly.

Gentlemen, thank you once again for your kind attention.

Best regards,
--
David Arnstein
arn...@po... |
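The comparison asked for earlier in the thread (is the wordlist close to the file size limit?) can be done as plain arithmetic. This is a minimal sketch under stated assumptions: the function name `limit_usage_pct` is hypothetical, and the block size behind "ulimit -f" varies by shell (512 bytes per POSIX, 1024 in bash outside POSIX mode), so it is passed as an optional third argument rather than guessed.

```shell
#!/bin/sh
# limit_usage_pct SIZE_BYTES LIMIT_BLOCKS [BLOCK_BYTES]   (hypothetical helper)
# Print how many percent of the file size limit a file of SIZE_BYTES uses.
# LIMIT_BLOCKS is the value reported by "ulimit -f"; BLOCK_BYTES defaults to
# 512, which is an assumption that may not match every shell.
limit_usage_pct() {
    size_bytes=$1
    limit_blocks=$2
    block_bytes=${3:-512}
    if [ "$limit_blocks" = "unlimited" ]; then
        # No limit is set, so there is nothing to compare against.
        echo "unlimited"
        return 0
    fi
    # Integer percentage, rounded down.
    echo $(( size_bytes * 100 / (limit_blocks * block_bytes) ))
}

# Possible usage:
#   limit_usage_pct "$(wc -c < "$HOME/.bogofilter/wordlist.db")" "$(ulimit -f)"
```

With the figures posted above (9,912,320 bytes and a file size limit of "unlimited"), the function prints "unlimited", which agrees with the later assessment in this thread that the limits are not the problem.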
From: David R. <re...@os...> - 2004-06-20 01:13:11
|
On Sat, 19 Jun 2004 17:26:20 -0700 David Arnstein wrote:

> A big "thank you" to Mr. Andree and Mr. Relson for responding to my
> post. I attempt to supply all requested information about my
> execution environment for bogofilter.

Matthias and I are the major authors. Matthias is our expert in matters of autoconf, automake, BerkeleyDB, locking, and portability. I'm responsible for most of the rest.

...[snip]...

> bogofilter version 0.91.2
>     Database: BerkeleyDB (3.3.11)
> Copyright (C) 2002-2004 Eric S. Raymond,
> David Relson, Matthias Andree, Greg Louis

...[snip]...

You're on a recent version of bogofilter, which is good, but an old version of BerkeleyDB, which is not good. If you have a 4.1 or newer version available, use it.

> --------------------------------------------------------------------------
>
> The response to "uname -a" is:
>
> --------------------------------------------------------------------------
> NetBSD panix3.panix.com 1.5.4_ALPHA NetBSD 1.5.4_ALPHA (PANIX-USER) #0: Thu Feb
> 26 14:11:15 EST 2004
> ro...@ju...:/devel/NO-BACKUPS/release-1.5-20
> 020917/src/sys/arch/i386/compile/PANIX-USER i386
> --------------------------------------------------------------------------
>
> The filesystem seems to be some sort of networked commercial file
> server, I don't know the details. If this information is truly
> important, I'll post a request for help to the ISP. Let me know
> please.

Networked filesystems can have file locking problems, especially if the database (wordlist.db) is accessible through the network. There's some info on this in the FAQ.

You can improve your stability by using procmail's locking facilities. Change the ":0fw" in your procmail recipe to ":0fw:".

Matthias is much more familiar with such issues than I am. I'm sure he'll respond tomorrow.

> The way I run bogofilter is that it is featured in one of my procmail
> recipes. In particular, my procmail recipe features the following:
>
> --------------------------------------------------------------------------
> #########################
> # Bogofilter processing #
> #########################
>
> # Examine each incoming e-mail with bogofilter, and add a header line
> # to it.
> :0fw
> | $BOGOFILTER -u -e -p

...[snip]...

> My file .bogofilter/wordlist.db is currently 9,912,320 bytes long. It
> contains 231,603 entries. These numbers increase constantly.

Since "-u" autoupdates the database, the wordlist size _will_ increase with every message scored by bogofilter. The size may be limited by your MTA. For example, if you're running postfix, check the values of mailbox_size_limit and message_size_limit. The command "postconf | grep size_limit" will show those values.

HTH,

David |
From: Matthias A. <mat...@gm...> - 2004-06-20 08:14:25
|
On Sat, 19 Jun 2004, David Arnstein wrote:

> --------------------------------------------------------------------------
> bogofilter version 0.91.2
>     Database: BerkeleyDB (3.3.11)

I haven't used BerkeleyDB 3.3 in production for a long time, so I cannot say if that alone is it.

> --------------------------------------------------------------------------
>
> The response to "uname -a" is:
>
> --------------------------------------------------------------------------
> NetBSD panix3.panix.com 1.5.4_ALPHA NetBSD 1.5.4_ALPHA (PANIX-USER)
> #0: Thu Feb
> 26 14:11:15 EST 2004
> ro...@ju...:/devel/NO-BACKUPS/release-1.5-20
> 020917/src/sys/arch/i386/compile/PANIX-USER i386

So it is an oldish non-stable NetBSD release on i386. This alone needn't be the cause either.

> --------------------------------------------------------------------------
> The filesystem seems to be some sort of networked commercial file
> server, I don't know the details. If this information is truly
> important, I'll post a request for help to the ISP. Let me know please.

Well, that might be it, but we cannot be sure. Are you using the BOGOFILTER_CONCURRENT_DATA_STORE mode? If so, that mode cannot work on NFS, as the __db.NNN files must reside on a local file system (ufs is one of those on NetBSD).

> --------------------------------------------------------------------------
> #########################
> # Bogofilter processing #
> #########################
>
> # Examine each incoming e-mail with bogofilter, and add a header line
> # to it.
> :0fw
> | $BOGOFILTER -u -e -p

Bogofilter thus updates its file every time a mail comes in. Try changing the recipe to

:0fw:bogofilter.lock
| $BOGOFILTER -u -e -p

and see if that helps. Alternatively, remove the -u and train bogofilter explicitly.

> The "ulimit" info is as follows:
...

The limits look ample, they are not the problem.

> My file .bogofilter/wordlist.db is currently 9,912,320 bytes long. It
> contains 231,603 entries. These numbers increase constantly.

That's fine.

--
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 |
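The local lockfile suggested above makes procmail run only one bogofilter update at a time. The same idea can be sketched in plain shell for cases where bogofilter is started outside procmail. This is an illustration only, not what procmail does internally: the helper name `with_lock` and the lock path are hypothetical, and it assumes that mkdir is atomic on the file system in question. A stale lock left behind by a killed process is not handled.

```shell
#!/bin/sh
# with_lock LOCKDIR COMMAND [ARGS...]   (hypothetical helper)
# Run COMMAND while holding an exclusive lock represented by the directory
# LOCKDIR, so that concurrent callers take turns.
with_lock() {
    lockdir=$1
    shift
    # mkdir either creates the directory or fails, so only one caller wins;
    # the others retry once per second.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    "$@"
    status=$?
    # Release the lock and hand back the command's exit status.
    rmdir "$lockdir"
    return $status
}

# Possible usage (options taken from the recipe quoted above):
#   with_lock "$HOME/.bogofilter/update.lock" bogofilter -u -e -p < message
```

Inside a procmail recipe the built-in ":0fw:bogofilter.lock" form is the simpler choice; the sketch only shows what serializing the updates amounts to.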
From: David R. <re...@os...> - 2004-06-19 03:15:36
|
On Fri, 18 Jun 2004 22:28:57 -0400 (EDT) David Arnstein wrote:

> I am using bogofilter 0.91.2 as installed by my ISP panix.com (they
> are competent).
>
> Recently, I have been getting my database file
> ~/.bogofilter/wordlist.db corrupted. Specifically, if I execute
>     bogoutil -d ~/.bogofilter/wordlist.db
> then I get a listing that, after several thousand unique lines,
> repeats a block of lines forever.
>
> This problem now occurs about once per day, on average. It has been
> happening since version 0.17.5, at least. Each time I created a new
> database file from text using "bogoutil -l".
>
> The fact that this happens repeatedly, even when I create the database
> file "from scratch" suggests that something is amiss.
>
> As a work-around, is there a database utility that can repair my
> wordlist.db file? Thanks for any suggestions.
> --
> David Arnstein
> arn...@po...

Hello David,

Sounds like you've got problems!

Database problems have (historically) been few and far between. You should be encountering problems rarely, if ever. Something's different (wrong) in how you're running bogofilter and accessing the database.

The usual usage of bogofilter is in scoring messages. That opens the database read-only, which can't cause corruption. I use the autoupdate ('-u') option, which adds tokens to the database when a message is scored as ham or spam (but not as unsure). With a procmail recipe (including locking) to run bogofilter, I've not seen database corruption in a long, long time.

What environment are you running in, i.e. operating system, architecture, MTA, etc.? How large a message load is bogofilter dealing with? What flags are you running it with?

One workaround would be to create a copy of the database periodically and confirm its integrity using db_verify. That would give you a fallback if/when you next encounter trouble.

Also, Matthias has code for using BerkeleyDB's transaction capabilities to ensure that the database remains correct. That code is available via CVS if you want to try it.

Looking forward to your answer.

Regards,

David |
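The workaround described above (copy the database periodically and check the copy with db_verify) can be sketched as a small function suitable for a cron job. This is a sketch under assumptions, not a tested backup procedure: the function name `snapshot_db` and the VERIFY_CMD variable are hypothetical, db_verify is the Berkeley DB utility and must match the library version that wrote the file, and copying while bogofilter is writing may itself yield an inconsistent copy, so running it under the same lock as the updates seems advisable.

```shell
#!/bin/sh
# VERIFY_CMD is a hypothetical knob so the checker can be swapped out;
# it defaults to Berkeley DB's db_verify.
VERIFY_CMD=${VERIFY_CMD:-db_verify}

# snapshot_db SRC DEST   (hypothetical helper)
# Copy SRC to a temporary file, verify the copy, and replace DEST only if
# verification succeeds, so DEST always holds the last known-good copy.
snapshot_db() {
    src=$1
    dest=$2
    tmp="$dest.tmp.$$"
    cp "$src" "$tmp" || return 1
    if $VERIFY_CMD "$tmp" >/dev/null 2>&1; then
        # The copy passed the check: promote it to the fallback file.
        mv "$tmp" "$dest"
    else
        # The copy failed the check: discard it and keep the old fallback.
        rm -f "$tmp"
        return 1
    fi
}

# Possible usage:
#   snapshot_db "$HOME/.bogofilter/wordlist.db" "$HOME/.bogofilter/wordlist.good"
```

Because the fallback is only replaced after a successful check, a corrupted wordlist.db never overwrites the last good snapshot.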
From: Matthias A. <mat...@gm...> - 2004-06-19 09:15:47
|
On Fri, 18 Jun 2004, David Relson wrote:

> An excellent set of questions! FWIW, I've been running "-u" for 18

I'd considered whether we should make such a questionnaire a part of the FAQ or make it a downloadable .txt file, so that posting the URL is sufficient.

> months or so, receive 800-100 messages daily, use procmail's locking
> when running bogofilter, and have not seen database corruption in many
> months. This is on a Mandrake 10.0 system with BerkeleyDB 4.1.25.

I regret having to inform you that this bears no practical relevance. Bogofilter is supposed to work without external locking, so if you are interested in getting relevant results, remove the locking from your procmail recipe and see if the database remains intact or becomes corrupt.

The TXN version does well without external locking even when run in parallel, but hasn't seen the 0.91 interface cleanup code yet, which I suspect to be the problem.

I wonder whether, and how, we can provoke a collision from the test suite, with bogofilter opening the same file twice at the same time, to see if bogofilter copes with that.

OTOH, I'm not 100% convinced that Berkeley DB is correct at all times. I've heard from subversion users that they usually recommend DB 4.2.52. I've seen the fetchmail subversion repository (see http://developer.berlios.de/projects/fetchmail/) becoming corrupt for no apparent reason, too. Berlios use svn 1.0.4 with db-4.1, but they use the TXN interface, so svnadmin recover will usually fix it.

--
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 |
From: David R. <re...@os...> - 2004-06-19 12:27:50
|
On Sat, 19 Jun 2004 11:15:40 +0200 Matthias Andree wrote:

> On Fri, 18 Jun 2004, David Relson wrote:

...[snip]...

> The TXN version does well without external locking even when run in
> parallel but hasn't seen the 0.91 interface cleanup code yet which I
> suspect to be the problem.

Matthias,

I assume you're referring to the version 0.91.1 problem. It had changes to the interface between the datastore and database levels to better handle the differences between opening an existing database and creating a new database. The changes were made so that .WORDLIST_VERSION could be added when bogofilter created a new wordlist. The problem in 0.91.1 was fixed in 0.91.2.

Below is a small test script, t.lock, and the output of running it against versions 0.90.0, 0.91.0, 0.91.1, 0.91.2, and 0.91.3. It shows quite clearly that 0.91.1 had a problem which is particular to that one release.

David

### Test script t.lock ###

#!/bin/sh

# run from bogofilter/src directory

OPTS="-C -d . -M"

if [ ! -f wordlist.db ] ; then
    bogofilter $OPTS -v -n -I tests/inputs/good.mbx
    bogofilter $OPTS -v -s -I tests/inputs/spam.mbx
fi

for N in `seq 1 5` ; do
    bogofilter $OPTS -u -I tests/inputs/spam.mbx &
done

### Output of t.lock for 0.9?.? ###

[relson@osage bogofilter]$ for N in 09?? ; do ( cd $N/src ; pwd ; t.lock ; sleep 3 ; echo "" ) ; done

/home/relson/bogofilter/0900/src

/home/relson/bogofilter/0910/src

/home/relson/bogofilter/0911/src
bogofilter: (db) DB->open(./wordlist.db) - actually ./wordlist.db bogohome: . -, err: 17, File exists
Can't open file 'wordlist.db' in directory '.'.
error #17 - File exists.
bogofilter: (db) DB->open(./wordlist.db) - actually ./wordlist.db bogohome: . -, err: 17, File exists
Can't open file 'wordlist.db' in directory '.'.
error #17 - File exists.
bogofilter: (db) DB->open(./wordlist.db) - actually ./wordlist.db bogohome: . -, err: 17, File exists
Can't open file 'wordlist.db' in directory '.'.
error #17 - File exists.
bogofilter: (db) DB->open(./wordlist.db) - actually ./wordlist.db bogohome: . -, err: 17, File exists
Can't open file 'wordlist.db' in directory '.'.
error #17 - File exists.

/home/relson/bogofilter/0912/src

/home/relson/bogofilter/0913/src |
From: Matthias A. <mat...@gm...> - 2004-06-20 11:37:35
|
> You're on a recent version of bogofilter, which is good, but an old
> version of BerkeleyDB, not good. If you have a 4.1 or newer version
> available, use it.

He wrote that Panix installed the system for him, so he may not have the power to choose the version, although I'd suggest that he use 4.2 if he has the choice.

> Since "-u" autoupdates the database, the wordlist size _will_ increase
> with every message scored by bogofilter. The size may be limited by
> your MTA. For example, if you're running postfix check the values of
> mailbox_size_limit and message_size_limit. Command "postconf | grep
> size_limit" will show those values.

I don't expect to see problems here, the sizes are well below usual limits.

--
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 |
From: David R. <re...@os...> - 2004-06-20 11:46:41
|
On Sun, 20 Jun 2004 13:37:30 +0200 Matthias Andree wrote:

> > You're on a recent version of bogofilter, which is good, but an old
> > version of BerkeleyDB, not good. If you have a 4.1 or newer version
> > available, use it.
>
> He wrote Panix installed the system for him so he may not have the
> power to choose the version, although I'd suggest that he use 4.2 if
> he has the choice.

If he's allowed to build executables, he could create $HOME/bin and run his own copies of BerkeleyDB and bogofilter.

> > Since "-u" autoupdates the database, the wordlist size _will_
> > increase with every message scored by bogofilter. The size may be
> > limited by your MTA. For example, if you're running postfix check
> > the values of mailbox_size_limit and message_size_limit. Command
> > "postconf | grep size_limit" will show those values.
>
> I don't expect to see problems here, the sizes are well below usual
> limits.

True. |