Re: [bugs] Rare bug that causes bogofilter to segfault
From: David R. <re...@os...> - 2009-08-23 23:42:04
On Sun, 23 Aug 2009 16:53:33 -0600 (MDT)
Garen Erdoisa wrote:

>
> Hi David;
>
> Thanks for responding so quickly. You must have no life. :)

It's all in the timing. Your message happened to be present when I was
checking email ...

> On Sun, 23 Aug 2009, David Relson wrote:
> > "bogoutil -i" drops .MSG_COUNT up through v1.2.0. This problem has
> > been fixed in v1.2.1.
>
> The -i switch on bogoutil seems to be missing documentation in the
> man page.

My mistake. I was thinking "import" when I should have been thinking
"load". The command is "bogoutil -l", not "bogoutil -i"

> > Your pruning algorithm is "interesting". I'd be concerned about its
> > effect on scoring tokens.
>
> In practice it has about zero effect on scoring. Once the token
> counts reach zero counts for both ham/spam the token is just taking
> up space in the database as a "most likely" random string of
> characters.
>
> If the string is encountered again in a new message it gets re-added
> to the database and has the same effect as if the counts were
> incremented again.

Shrug.

>
> Deleting unused or rarely used tokens frees up database
> space, thus reducing the overall size of the database for
> new messages.

Token dates reflect training information, i.e. when the token's count
was last updated. Thus a token's date can be really old even though
it's used regularly to score messages.

> Because counts are decremented for both spam and ham, and only on
> tokens that were 6 months old or older once a month, over time rarely
> seen tokens will eventually age out.
>
> The procmail script I use has bogofilter auto learn only if the
> wordlist.db is less than 30meg in size. After that messages have to
> be manually retrained. This keeps the database size down to a
> reasonable size, yet after pruning, when the database size is dropped
> to 15 meg or so, it allows for new variations of spam to be "learned"
> again without my having to interfere with the process too much.
>
> The overall effect is that bogofilter auto learns up to about the
> 15th or 20th of the month, then just uses the database as is, until
> it's pruned again.
>
> If you are interested, I can share the script that I'm using to
> accomplish this. It would be easier if message counts could be
> decremented on tokens with bogoutil, but there isn't any option I've
> seen to decrement the token counts in bogoutil.

AFAICT, you're the first person to use decrementing counts as a means
of removing tokens.

Bogoutil can filter tokens based on age ("-a") and count ("-c"). A
"--decrement" flag could be added. For example

  "bogoutil -a 30 ... "
      deletes tokens older than 30 days
  "bogoutil -a 30 --decrement"
      decrements tokens older than 30 days
  "bogoutil --decrement -c 3"
      decrements all counts and deletes if cnt < 3
  "bogoutil -a 30 --decrement -c 3"
      decrements if older than 30 and deletes if cnt < 3

Feel free to send the script. I'll take a look at it. Alternatively,
subscribe to bogofilter's users mailing list, offer your script, and
see what the response is.

> Anyway, I probably would not have seen this .MSG_COUNT bug show up if
> I hadn't recently rebuilt the wordlist.db from scratch with the fc10
> upgrade. The old wordlist.db was a few years old. Am glad it's fixed
> in the newer version.

The .MSG_COUNT bug has been present since the .MSG_COUNT token was
added. I don't recall how long ago that was, but the NEWS.0 file
mentions .MSG_COUNT back in early 2004. Undoubtedly its existence goes
back even further.

Regards,

David
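
For readers following the thread, the four "--decrement" combinations
listed in the message can be modelled in a few lines of Python. This is
only a sketch of the proposal as worded above, not bogoutil code: the
"--decrement" flag is a suggestion in this message, and the Token class
and prune() function are hypothetical names. It also assumes that "cnt"
means spam count plus ham count, that the "-c" test applies to every
token after decrementing, and that decrementing leaves the token's date
untouched (otherwise an aged token would stop looking old).

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Token:
    """Hypothetical in-memory stand-in for one wordlist entry."""
    text: str
    spam: int
    ham: int
    last_updated: date  # when training last changed the counts


def prune(tokens: Iterable[Token], today: date,
          max_age_days: Optional[int] = None,
          decrement: bool = False,
          min_count: Optional[int] = None) -> List[Token]:
    """Model of the proposed -a / --decrement / -c combinations."""
    kept = []
    for tok in tokens:
        # Without an age limit, every token is selected.
        selected = (max_age_days is None
                    or (today - tok.last_updated).days > max_age_days)
        spam, ham = tok.spam, tok.ham
        if decrement:
            if selected:
                # Decrement both counts, never below zero.
                spam, ham = max(spam - 1, 0), max(ham - 1, 0)
        elif max_age_days is not None and selected:
            continue  # plain "-a N": delete tokens older than N days
        # Assumption: cnt is spam + ham, checked after decrementing.
        if min_count is not None and spam + ham < min_count:
            continue
        # Assumption: the date is preserved when counts are decremented.
        kept.append(Token(tok.text, spam, ham, tok.last_updated))
    return kept
```

Under these assumptions, calling prune(tokens, today, max_age_days=180,
decrement=True, min_count=1) once a month would approximate the pruning
scheme described in the quoted text: counts on tokens untouched for six
months shrink by one each run, and tokens whose counts have both reached
zero are dropped.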