Thread: [cvs] bogofilter/tuning README.bogotune,1.3,1.4 bogol,1.1,1.2 bogol.1,1.1,1.2 bogotune,1.3,1.4 bogot
From: <re...@us...> - 2003-06-24 20:13:40
Update of /cvsroot/bogofilter/bogofilter/tuning
In directory sc8-pr-cvs1:/tmp/cvs-serv32753

Modified Files:
	README.bogotune bogol bogol.1 bogotune bogotune.1
Log Message:
Update bogotune to 0.2.4

Index: README.bogotune
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/tuning/README.bogotune,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -d -r1.3 -r1.4
--- README.bogotune 20 Jun 2003 13:02:54 -0000 1.3
+++ README.bogotune 24 Jun 2003 20:13:36 -0000 1.4
@@ -1,4 +1,4 @@
-README for bogotune version 0.2.2
+README for bogotune version 0.2.4
 (How to tune bogofilter with minimum effort)
 
 This document describes a script called bogotune that will completely
@@ -71,10 +71,12 @@
 values of s, min_dev, x, the spam cutoff and the counts of false
 positives and false negatives, respectively.
 
-The message files may be in either mbox or msg-count format, but they
-must all be one or the other; don't mix them. If they're in mbox
+The message files may be in MH, mbox or msg-count format, but they must
+all be of the same type; don't mix them. If they're not in msg-count
 format, you'll need enough free disk space so bogotune can create
-msg-count files to use in the scans.
+msg-count files to use in the scans. (I _think_ bogotune should work
+with maildir format as well, but this has not been tested; feedback
+would be much appreciated.)
 
 If bogotune aborts, there may be leftover files named Rxxx in the
 directory from which bogotune ran; the xxx stands for some number of

Index: bogol
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/tuning/bogol,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -d -r1.1 -r1.2
--- bogol 20 Jun 2003 03:28:07 -0000 1.1
+++ bogol 24 Jun 2003 20:13:36 -0000 1.2
@@ -5,7 +5,7 @@
 NAME
 bogol - lexical analysis and database lookup for an email message
 SYNOPSIS
-bogol [-h | path/to/bogofilter/directory]
+bogol [-h | path/to/bogofilter/directory [bogolexer options]]
 DESCRIPTION
 bogol creates a message digest consisting of a .MSG_COUNT line, followed
 by one line per token. The format is
@@ -24,14 +24,16 @@
 -h The option -h displays help.
 BOGOFILTER DIRECTORY
 The bogofilter directory is where the wordlists are kept. It defaults
-to ~/.bogofilter if no path is provided on the command line.
+to ~/.bogofilter if no path is provided on the command line. The
+path must be provided if further options are given; any further
+options are passed through to bogolexer.
 SEE ALSO
 bogofilter(1), bogolexer(1), bogoutil(1), apclass(1).
 EOT
 exit 0
 fi
 db=~/.bogofilter
-test "x$1" = "x" || db=$1
-( echo .MSG_COUNT; bogolexer -p | sort -u) | \
+if [ "x$1" != "x" ]; then db=$1; shift; fi
+( echo .MSG_COUNT; bogolexer -p $* | sort -u ) | \
 bogoutil -w $db | \
 awk 'NF == 3 {printf("\"%s\" %s %s\n", $1, $2, $3)}'

Index: bogol.1
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/tuning/bogol.1,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -d -r1.1 -r1.2
--- bogol.1 20 Jun 2003 15:12:47 -0000 1.1
+++ bogol.1 24 Jun 2003 20:13:36 -0000 1.2
@@ -1,5 +1,5 @@
 ." Text automatically generated by txt2man-1.4.7
-.TH bogol 1 "June 10, 2003" "" ""
+.TH bogol 1 "June 21, 2003" "" ""
 .SH NAME
 \fBbogol \fP- lexical analysis and database lookup for an email message
 .SH SYNOPSIS

Index: bogotune
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/tuning/bogotune,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -d -r1.3 -r1.4
--- bogotune 20 Jun 2003 13:02:54 -0000 1.3
+++ bogotune 24 Jun 2003 20:13:36 -0000 1.4
@@ -1,6 +1,6 @@
 #! /usr/bin/perl
 # bogotune - a bogofilter tuning tool
-# 20030619, version 0.2.2
+# 20030621, version 0.2.4
 
 # Copyright (c) 2003 Gregory Louis; distributed wiithout warranty of
 # any kind under the GNU General Public License (GPL).
@@ -34,8 +34,8 @@
 " nonspam messages, and the ratio of spam to nonspam must be in\n",
 " the range 0.2 to 5. There must be at least 500 spam messages\n",
 " and 500 nonspam in the message files. Message files must all\n",
-" be in either mbox or message-count format; don't mix files of\n",
-" both types.\n",
+" be in MH, mbox or message-count format; don't mix files of\n",
+" different types.\n",
 "COMMAND LINE PARAMETERS\n",
 " bogodir The directory where the training database is stored.\n",
 " bogodir defaults to \$BOGOFILTER_DIR, if defined, or\n",
@@ -127,38 +127,80 @@
 ### 2. Validate test inputs; create msg-count files if need be
 
 # Check that at least 500 each of spam and nonspam are available for
-# testing, in either mbox or msg-count format.
+# testing, in MH, mbox or msg-count format.
 print("Verifying test files...\n");
-$msgcountfiles = 0;
-$cmd = join(" ", "cat", @spfiles, "| grep -c '^From ' |");
-open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
-$scount = <COUNTS>; chop $scount; close COUNTS;
-$cmd = join(" ", "cat", @nsfiles, "| grep -c '^From ' |");
-open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
-$ncount = <COUNTS>; chop $ncount; close COUNTS;
-if($scount < 500 || $ncount < 500) {
-  $msgcountfiles = 1;
-  $cmd = join(" ", "cat", @spfiles, "| grep -c '^.\.MSG_COUNT' |");
+$msgformat = "mbox";
+if(-d $spfiles[0]) {
+  $msgformat = "MH"; $scount = 0;
+  foreach $dir (@spfiles) {
+    $n = `ls $dir/[0-9]* 2>/dev/null | wc -l`;
+    $scount += $n;
+  }
+  $ncount = 0;
+  foreach $dir (@nsfiles) {
+    $n = `ls $dir/[0-9]* 2>/dev/null | wc -l`;
+    $ncount += $n;
+  }
+} else {
+  $cmd = join(" ", "cat", @spfiles, "| grep -c '^From ' |");
   open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
   $scount = <COUNTS>; chop $scount; close COUNTS;
-  $cmd = join(" ", "cat", @nsfiles, "| grep -c '^.\.MSG_COUNT' |");
+  $cmd = join(" ", "cat", @nsfiles, "| grep -c '^From ' |");
   open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
   $ncount = <COUNTS>; chop $ncount; close COUNTS;
-  if($scount < 500 || $ncount < 500) { yuk(4,
-    "At least 500 spam and 500 nonspam required for testing\n",
-    $scount, " and ", $ncount, " were found."); }
+  if($scount < 500 || $ncount < 500) {
+    $msgformat = "msgcount";
+    $cmd = join(" ", "cat", @spfiles, "| grep -c '^.\.MSG_COUNT' |");
+    open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
+    $scount = <COUNTS>; chop $scount; close COUNTS;
+    $cmd = join(" ", "cat", @nsfiles, "| grep -c '^.\.MSG_COUNT' |");
+    open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
+    $ncount = <COUNTS>; chop $ncount; close COUNTS;
+  }
 }
+if($scount < 500 || $ncount < 500) { yuk(4,
+  "At least 500 spam and 500 nonspam required for testing\n",
+  $scount, " and ", $ncount, " were found."); }
 print("Verification completed successfully.\n");
 
-if(! $msgcountfiles) {
-  print("Test files are in mbox format, creating message-count files...\n");
+if($msgformat ne "msgcount") {
+  print("Creating message-count files...\n");
   $spwork = $workfn . ".sp"; $nswork = $workfn . ".ns";
-  $cmd = join(" ", "cat", @spfiles, "| formail -s bogol", $bogodir,
-    ">", $spwork);
-  system($cmd) == 0 or yuk(7, "Problem processing spam files");
-  $cmd = join(" ", "cat", @nsfiles, "| formail -s bogol", $bogodir,
-    ">", $nswork);
-  system($cmd) == 0 or yuk(7, "Problem processing nonspam files");
+  if($msgformat eq "mbox") {
+    $cmd = join(" ", "cat", @spfiles, "| formail -s bogol", $bogodir,
+      $cf, ">", $spwork);
+    system($cmd) == 0 or yuk(7, "Problem processing spam files");
+    $cmd = join(" ", "cat", @nsfiles, "| formail -s bogol", $bogodir,
+      $cf, ">", $nswork);
+    system($cmd) == 0 or yuk(7, "Problem processing nonspam files");
+  } else {
+    unlink($spwork);
+    foreach $dir (@spfiles) {
+      opendir(DH, $dir) or yuk(7, "Problem processing spam files");
+      @msgs = readdir(DH); closedir(DH);
+      foreach $msg(@msgs) {
+        if($msg =~ /^[0-9]/) {
+          $cmd = join(" ", "cat $dir/$msg | bogol", $bogodir, $cf,
+            ">>$spwork");
+          system($cmd) == 0 or
+            yuk(7, "Problem writing spam msg-count file");
+        }
+      }
+    }
+    unlink($nswork);
+    foreach $dir (@nsfiles) {
+      opendir(DH, $dir) or yuk(7, "Problem processing nonspam files");
+      @msgs = readdir(DH); closedir(DH);
+      foreach $msg(@msgs) {
+        if($msg =~ /^[0-9]/) {
+          $cmd = join(" ", "cat $dir/$msg | bogol", $bogodir, $cf,
+            ">>$nswork");
+          system($cmd) == 0 or
+            yuk(7, "Problem writing nonspam msg-count file");
+        }
+      }
+    }
+  }
   @spfiles = $spwork; @nsfiles = $nswork;
   print("Message-count files ", $workfn, ".{sp,ns} created\n");
 }
@@ -174,7 +216,7 @@
   $cmd = "ls -l " . $bogodir . "/spamlist.db |";
   open(COUNTS, $cmd) or yuk(5, "$cmd pipe failed");
   $counts = <COUNTS>; chop $counts; close COUNTS;
-  ($junk1, $junk2, $junk3, $junk4, $dbsize, $junk6) = split(/\s+/, $counts);
+  ($junk1, $junk2, $junk3, $junk4, $dbs2, $junk6) = split(/\s+/, $counts);
   if($dbs2 > $dbsize) { $dbsize = $dbs2; }
 }
 $cachesize = POSIX::ceil($dbsize / (1024*1024*3));
@@ -351,7 +393,7 @@
 }
 $med = $parms[int(scalar @parms / 2)][5];
 if($verbose) { print("Median fn count was ", $med, "\n"); }
-$n = 0;
+$n = $o = 0;
 foreach $i (0 .. $#parms) {
   $rsi = $parms[$i][0]; $mdi = $parms[$i][1]; $rxi = $parms[$i][2];
   if( ($rsi == 0 || gfn($rsi-1, $mdi, $rxi) < $med)
@@ -362,11 +404,16 @@
       && ($rxi == $#rxval || gfn($rsi, $mdi, $rxi+1) < $med)) {
     $n = 1; last;
   }
+  $o = $i;
+}
+if($o > 0) {
+  print($o, " outlier", $o > 1 ? "s" : "", " encountered.",
+    " \n");
 }
 if($n == 0) {
   $rsi = $parms[0][0]; $mdi = $parms[0][1]; $rxi = $parms[0][2];
   print("No smooth minimum encountered, using lowest fn count",
-    " \n");
+    " (an outlier). \n");
 }
 $robs = $rsval[$rsi]; $md = $mdval[$mdi]; $robx = $rxval[$rxi];
 printf("Minimum found at s %0.4f, md %0.3f, x %0.3f %s\n",

Index: bogotune.1
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/tuning/bogotune.1,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -d -r1.1 -r1.2
--- bogotune.1 20 Jun 2003 15:12:47 -0000 1.1
+++ bogotune.1 24 Jun 2003 20:13:36 -0000 1.2
@@ -1,5 +1,5 @@
 ." Text automatically generated by txt2man-1.4.7
-.TH bogotune 1 "June 19, 2003" "" ""
+.TH bogotune 1 "June 22, 2003" "" ""
 .SH NAME
 \fBbogotune \fP- find optimum parameter settings for bogofilter
 .SH SYNOPSIS
@@ -17,8 +17,8 @@
 nonspam messages, and the ratio of spam to nonspam must be in
 the range 0.2 to 5. There must be at least 500 spam messages
 and 500 nonspam in the message files. Message files must all
-be in either mbox or message-count format; don't mix files of
-both types.
+be in MH, mbox or message-count format; don't mix files of
+different types.
 .SH COMMAND LINE PARAMETERS
 .TP
 .B
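
For illustration only, the message-counting rules that the revised bogotune check applies can be sketched in shell. This is not code from the commit: count_msgs is a hypothetical helper name, and the sketch covers just the MH and mbox cases. It counts an MH folder by files whose names start with a digit and an mbox file by its "From " separator lines, following the `ls $dir/[0-9]*` and `grep -c '^From '` commands in the bogotune diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of bogofilter): count the messages in one
# message source, using the same two rules as the revised check.
count_msgs() {
    if [ -d "$1" ]; then
        # MH folder: one message per file whose name starts with a digit.
        ls "$1"/[0-9]* 2>/dev/null | wc -l | tr -d ' '
    else
        # mbox file: every message starts with a "From " separator line.
        grep -c '^From ' "$1"
    fi
}

# Small self-contained demonstration on throwaway data.
tmp=$(mktemp -d)
mkdir "$tmp/mh"
for n in 1 2 3; do echo "Subject: test $n" > "$tmp/mh/$n"; done
echo "unseen: 1-3" > "$tmp/mh/.mh_sequences"   # skipped: name has no leading digit
printf 'From a@example.com Tue Jun 24 20:13:40 2003\nSubject: one\n\nFrom b@example.com Tue Jun 24 20:14:00 2003\nSubject: two\n\n' > "$tmp/mbox"

echo "MH folder: $(count_msgs "$tmp/mh") messages"
echo "mbox file: $(count_msgs "$tmp/mbox") messages"
rm -rf "$tmp"
```

For plain files where either set has fewer than 500 "From " lines, the committed code goes on to count .MSG_COUNT lines instead; the sketch leaves that fallback out.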