Re: [bugs] Rare bug that causes bogofilter to segfault
From: Garen E. <sc...@tr...> - 2009-08-23 23:57:08
Hi David;

Thanks for responding so quickly. You must have no life. :)

On Sun, 23 Aug 2009, David Relson wrote:

> "bogoutil -i" drops .MSG_COUNT up through v1.2.0. This problem has
> been fixed in v1.2.1.

The -i switch on bogoutil seems to be missing documentation in the man page.

> Your pruning algorithm is "interesting". I'd be concerned about its
> effect on scoring tokens.

In practice it has about zero effect on scoring. Once the counts for a token reach zero for both ham and spam, the token is just taking up space in the database as a "most likely" random string of characters. If the string is encountered again in a new message, it gets re-added to the database and has the same effect as if the counts were incremented again. Shrug.

Deleting unused or rarely used tokens frees up database space, which leaves room for tokens from new messages. Because counts are decremented for both spam and ham, once a month and only on tokens that are 6 months old or older, rarely seen tokens eventually age out.

The procmail script I use has bogofilter auto learn only if the wordlist.db is less than 30 meg in size. After that, messages have to be manually retrained. This keeps the database down to a reasonable size, yet after pruning, when the database size has dropped to 15 meg or so, it allows new variations of spam to be "learned" again without my having to interfere with the process too much. The overall effect is that bogofilter auto learns up to about the 15th or 20th of the month, then just uses the database as is until it's pruned again.

If you are interested, I can share the script that I'm using to accomplish this. It would be easier if message counts could be decremented on tokens with bogoutil, but there isn't any option I've seen to decrement the token counts in bogoutil.

Anyway, I probably would not have seen this .MSG_COUNT bug show up if I hadn't recently rebuilt the wordlist.db from scratch with the fc10 upgrade. The old wordlist.db was a few years old. Am glad it's fixed in the newer version.

Garen

On Sun, 23 Aug 2009, David Relson wrote:

> Date: Sun, 23 Aug 2009 15:03:34 -0400
> From: David Relson <re...@os...>
> To: Garen Erdoisa <sc...@tr...>
> Cc: bog...@li...
> Subject: Re: [bugs] Rare bug that causes bogofilter to segfault
>
> Hello Garen,
>
> "bogoutil -i" drops .MSG_COUNT up through v1.2.0. This problem has
> been fixed in v1.2.1.
>
> Using "bogofilter -N" or "bogofilter -S" enough to set .MSG_COUNT to
> zero is a misuse of the capability. It's allowed but ill advised.
>
> Your pruning algorithm is "interesting". I'd be concerned about its
> effect on scoring tokens.
>
> Regards,
>
> David
>
> On Sun, 23 Aug 2009 11:23:35 -0600 (MDT)
> Garen Erdoisa wrote:
>
>>
>> Bug report: bogofilter-1.2.0-1.fc10.i386.rpm
>>
>> Synopsis: Bogofilter starts to segfault if the database .MSG_COUNT
>> token drops to zero on either spam or ham counts.
>>
>> This is reproducible in current fc10 versions of BOGOFILTER and
>> bogofilter-sqlite3, possibly others. I haven't looked at the source
>> code yet, but I suspect it's a divide by zero situation.
>>
>> When retraining messages using
>>
>>     cat message |bogofilter -Ns
>>
>> or
>>
>>     cat message |bogofilter -Sn
>>
>> it decrements the database .MSG_COUNT token counts when messages are
>> unregistered. If one of these counts reaches zero, and subsequent
>> messages are then processed using, for example:
>>
>>     cat message |bogofilter -Ns
>>     cat message |bogofilter -Ns
>>     cat message |bogofilter -vvvp
>>
>> bogofilter will begin to segfault.
>>
>> If
>>
>>     cat message |bogofilter -TT
>>
>> is entered, the result will always be 0.000000000000,
>> even if the message would otherwise score as 1.000000000000, causing
>> all messages to score as ham.
>>
>> This bug became apparent on my system because of some other
>> semi-automated scripts I use to retrain messages as either spam or
>> ham. The script uses -Ns and -Sn to train the messages to exhaustion.
>> In the case of -Ns the script retrains repeatedly on the same message
>> until the score reaches .99 or higher, and in the case of -Sn the
>> script trains until the score is less than .30. I'm using bogofilter
>> in tristate mode.
>>
>> For most situations this scenario will probably go unnoticed unless
>> people retrain messages a lot using -Ns and -Sn.
>>
>> I've been aware of bogofilter segfaulting for about a month now, and
>> have spent the last couple of days tracking down the source of this
>> bug.
>>
>> I've verified the bug by manually manipulating the .MSG_COUNT token
>> in wordlist.db, then rebuilding the database.
>>
>> ie:
>>     bogoutil -d wordlist.db >wordlistdump.txt
>> then edit wordlistdump.txt to change the value of the .MSG_COUNT
>> token to 0 or 1
>>
>> then rebuild the database with
>>     cat wordlistdump.txt |bogoutil -l wordlist.new.db
>>     mv wordlist.new.db wordlist.db
>>
>> Additional info not necessarily related to the bug:
>>
>> My wordlist.db file is currently over 30 meg in size and has been
>> training for several months, so it has plenty of tokens in it.
>>
>> I stop auto training the database at around 30 meg and prune it once a
>> month via a cron job that decrements message counts by one count on
>> tokens older than 6 months and again on tokens older than 1 year. If
>> the counts on a given token reach zero for both ham and spam, the
>> token is not included in a database rebuild, so it is effectively
>> deleted. This tends to delete random strings of characters that
>> otherwise bloat the database and trims it down to about 15 meg in
>> size.
>>
>> Also, I've been using earlier versions of bogofilter for several
>> years now and have had no other issues. This segfault issue started
>> with my recent upgrade to FC10 from FC9, when I also decided to
>> upgrade to the most recent version of bogofilter. In that upgrade I
>> had decided to rebuild the wordlist.db from scratch using spam and
>> ham from the last 90 days.
>>
>> --
>> Garen Erdoisa
>> sc...@tr...
>>
>> _______________________________________________
>> Bogofilter-bugs mailing list
>> Bog...@li...
>> https://lists.sourceforge.net/lists/listinfo/bogofilter-bugs
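For readers following along: Garen's actual pruning script was not posted to the list, so the following is only a rough sketch of the scheme as described in the thread (one decrement for tokens older than 6 months, a second for tokens older than 1 year, tokens dropped once both counts hit zero). It works on the text produced by "bogoutil -d" and assumes each dump line has the form "token spam_count ham_count yyyymmdd"; that field order, the helper names, the 182-day approximation of 6 months, and the choice to leave .MSG_COUNT untouched are all assumptions of this sketch. The second helper checks for the condition reported in the bug, a .MSG_COUNT of zero on either side, before a dump is loaded back with "bogoutil -l".

```python
from datetime import date, timedelta

# Bookkeeping token that must never be aged out (assumption of this sketch).
PROTECTED = {".MSG_COUNT"}


def parse_line(line):
    """Split a dump line into (token, spam, ham, yyyymmdd).

    Assumes the layout 'token spam ham date'; verify against your own
    'bogoutil -d' output before relying on it.
    """
    token, spam, ham, stamp = line.rsplit(None, 3)
    return token, int(spam), int(ham), stamp


def parse_stamp(stamp):
    """Turn 'yyyymmdd' into a datetime.date."""
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))


def prune(lines, today):
    """Return the dump lines that survive one monthly pruning pass."""
    six_months_ago = today - timedelta(days=182)
    one_year_ago = today - timedelta(days=365)
    kept = []
    for line in lines:
        if not line.strip():
            continue
        token, spam, ham, stamp = parse_line(line)
        if token not in PROTECTED:
            seen = parse_stamp(stamp)
            # 0, 1 or 2 decrements depending on the age of the token.
            hits = int(seen <= six_months_ago) + int(seen <= one_year_ago)
            spam = max(0, spam - hits)
            ham = max(0, ham - hits)
            if spam == 0 and ham == 0:
                continue  # effectively deleted: left out of the rebuild
        kept.append(f"{token} {spam} {ham} {stamp}")
    return kept


def msg_count_ok(lines):
    """True only if .MSG_COUNT exists and is non-zero for spam and ham."""
    for line in lines:
        if not line.strip():
            continue
        token, spam, ham, _ = parse_line(line)
        if token == ".MSG_COUNT":
            return spam > 0 and ham > 0
    return False
```

A possible workflow would be to dump with "bogoutil -d wordlist.db", run the lines through prune(), confirm msg_count_ok() on the result, and only then rebuild with "bogoutil -l" as shown in the original report. Whether this matches the scoring behaviour of the real script is untested.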