[cvs] bogofilter/doc README.db,1.2,1.3 bogoutil.xml,1.17,1.18
Fast Bayesian spam filter along lines suggested by Paul Graham
Brought to you by: m-a
From: <m-...@us...> - 2004-10-29 01:12:02
Update of /cvsroot/bogofilter/bogofilter/doc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv29084/doc

Modified Files:
	README.db bogoutil.xml
Log Message:
Merge Transactional Store from branch.

Index: README.db
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/doc/README.db,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -d -r1.2 -r1.3
--- README.db	8 Mar 2004 15:52:28 -0000	1.2
+++ README.db	29 Oct 2004 01:11:52 -0000	1.3
@@ -16,91 +16,308 @@
 1. Overview ------------------------------------------------------------
 
-This bogofilter version now contains code to use BerkeleyDB environments
-for locking and caching data base access instead of the proven
-fcntl()/F_SETLK locks bogofilter used to use.
+This bogofilter version has been upgraded to use the Berkeley DB
+Transactional Data Store, to be able to recover a data base after an
+application or system crash.
 
-The new code must be enabled by setting an environment variable.
+2. Prerequisites and Caveats -------------------------------------------
 
-The ultimate goal is to migrate to the Berkeley DB Transactional Data
-Store, in order to improve data base robustness.
+2.1 Compatibility, Berkeley DB versions
 
-The current code is the first step in that direction, the code uses the
-Berkeley DB Concurrent Data Store model, it has the potential of using
-finer grained locks, for improved concurrency - depending on BerkeleyDB
-version, and it uses shared buffer pools to improve performance.
+Berkeley DB version 4.2 has been tested successfully in September 2004,
+but versions 3.0, 3.1, 3.2, 3.3, 4.0 and 4.1 should also work.
 
-The former bogofilter locking code allowed either one registration
-process at the same time or an unlimited amount of readers at the same
-time. A high amount of bogofilter processes that were scoring could
-effectively lock out registration altogether, and registration
-interrupted scoring.
-The new model allows for concurrent use of many
-readers and one writer at the same time.
+Berkeley DB versions 4.1 and 4.2 are recommended over the previous
+versions, because these can detect data corruption more reliably
+(through the use of checksums that detect partially written data base
+pages); Berkeley DB 4.2 appears faster than 4.1 under load (i.e. with
+multiple running copies of bogofilter/bogoutil operating on the same
+data base with at least one spam registration in progress).
 
-2. Use, restrictions, caveats -----------------------------------------
+2.2 Recoverability
 
-The new code is enabled iff you define the environment variable
-BOGOFILTER_CONCURRENT_DATA_STORE, irrespective of its content.
+The ability to recover the data base after a crash (power failure!)
+depends on data being written to the disk (or a battery-backed write
+cache) _immediately_ rather than delayed to be written later.
 
-If, for some reason, it does not work for you, stop all bogofilter
-processes and try to recover the environment with the db_recover
-utility. Give it your data base directory as argument to the -h option,
-for instance:
+Common disk drives in current PCs and Macs are of the ATA or SATA kind
+and usually ship with their write cache enabled. They write fast, but
+can lose or corrupt up to a few MB of data when the power fails.
+Note: This problem is not specific to bogofilter.
 
-	db_recover -h ~/.bogofilter
+It is possible to sacrifice a bit of the write speed and get
+reliability in turn, by switching off the disk's write cache (see
+appendix A for instructions).
 
-If that does not cure your problems, please report to the mailing list
-1. what you did, 2. what you got, 3. what you expected instead of #2.
+Switching the write cache off may, however, degrade performance below
+acceptable levels, particularly for large writes such as recording
+live audio or video data to hard disk.
+If performance is degraded too much, consider getting a second disk
+drive and using one for fast writes (with the write cache on) and one
+for reliable writes (with the write cache off, for bogofilter, mail
+servers and other applications that need to survive a power loss
+without data loss).
 
-As an example, you could use this as a boilerplate to start using the
-new code:
+2.3 Choosing a file system
 
-env BOGOFILTER_CONCURRENT_DATA_STORE=1 bogofilter [options]
+If your computer saves the data on its own disk drive (a "local file
+system"), BerkeleyDB should work fine. Such file systems are ext2, ext3,
+ffs, jfs, hfs, hfs+, reiserfs, ufs, xfs.
 
-where [options] is a placeholder for the options you regularly run
-bogofilter with, it can be empty.
+Berkeley DB does not work reliably with a networked file system. AFS,
+CIFS, Coda, NFS, SMBFS fall into this category.
 
-Note that this env BOGOFILTER... is necessary for all bogofilter related
-commands that access the data base, for instance, bogoutil and bogotune
-in particular.
+Strictly speaking, with BerkeleyDB 4.0 and older versions, the data base
+block size must be written atomically. The bogofilter maintainers are
+not currently aware of a file system that meets this requirement and is
+production quality at the same time.
 
-BerkeleyDB keeps some additional statistics about locking, caching and
-their efficiency. These can be obtained by running the db_stat utility
-with the -e or -c option, examples:
+2.4 Do make backups!
+
+The transactional data store is no good if the disk drive has become
+inaccessible (which happens after some months or years with every
+drive), so you _must_ back up your data base regularly (see the
+db_archive utility for additional documentation of a "hot" backup);
+bogofilter cannot, of course, guess data that got lost through a hard
+drive fault.
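A file-level backup copy can be sketched in shell. This is a hedged example, not part of the committed README: the temporary files stand in for ~/.bogofilter/wordlist.db and the backup target so the sketch is runnable anywhere, and bs=4096 is only an assumed page size - use the "database page size" figure that db_stat reports for your file (see section 4.1's WARNING, which recommends dd over cp for data base files).

```shell
# Hedged sketch: copy a data base file with dd, as section 4.1's
# WARNING advises (not cp).  The temp files below are stand-ins for
# ~/.bogofilter/wordlist.db and the backup target; bs=4096 is an
# assumed page size - substitute the "database page size" figure
# reported by db_stat -d for your own wordlist.db.
src=$(mktemp); dst=$(mktemp)
printf 'stand-in for data base pages' > "$src"
dd if="$src" of="$dst" bs=4096 2>/dev/null
cmp -s "$src" "$dst" && echo "copy ok"
```

Remember that a copy taken while bogofilter is writing may be inconsistent; the db_archive documentation describes a proper "hot" backup procedure.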
+
+Although backup strategies are beyond the scope of this document, be
+sure to store fresh backups of important data outside of your house
+regularly.
+
+3. Use and troubleshooting ---------------------------------------------
+
+3.1 LOG FILE HANDLING
+
+The Berkeley DB Transactional Data Store uses log files to store data
+base changes so that they can be rolled back or restored after an
+application crash.
+
+The logs of the transactional data store, log.NNNNNNNNNN files of up to
+10 MB in size (in the default configuration), can clog up considerable
+amounts of disk space, and many users wish to purge or compress these
+log files. They can safely be handled with BerkeleyDB's db_checkpoint
+and db_archive utilities, which you run in this order:
+
+- db_checkpoint migrates written-ahead data from the log files into the
+  data base, and places a checkpoint which will speed up data base
+  recovery, for instance after a premature bogofilter abort.
+
+- db_archive allows you to identify log files that are no longer in use
+  so that you can compress or remove them.
+
+Before choosing to remove log files, be sure to read the db_archive
+documentation that ships with BerkeleyDB, because removing logs has an
+impact on recoverability and can render your data base unrecoverable.
+
+The db_archive documentation also contains suggestions for several
+backup strategies.
+
+3.2 LOCK TABLE EXHAUSTION
+
+One common failure case is known:
+
+Problem: Operations that affect large parts of the data base or the
+         data base as a whole (bogoutil usually) may require many locks
+         and exhaust the maximum number of locks or the maximum number
+         of locked objects that the Berkeley DB environment supports.
+
+Symptom: Operations abort with "out of memory" although the machine has
+         plenty of RAM and/or swap.
+
+         Operations report lock or object table exhaustion and abort.
+
+Cause:   Natural data base growth.
+
+Fix:     Resize the lock tables.
+         It is easy and requires these two steps (in the next steps,
+         adjust the ~/.bogofilter path if you don't have the data base
+         in its default location):
+
+a. Create a ~/.bogofilter/DB_CONFIG file (in the same directory as your
+   wordlist.db file) that looks like this:
+
+set_lk_max_objects 32768
+set_lk_max_locks 32768
+
+You may need to adjust these figures. You will need up to one lock per
+data base page and a bit of headroom for future training -- see section
+4.2 below to determine the size of the data base and data base page.
+
+b. Run bogoutil -f ~/.bogofilter (use the path from step a, omitting
+   the /DB_CONFIG part).
+
+4. Other Information of Interest ---------------------------------------
+
+4.1 GENERAL INFORMATION
+
+Berkeley DB keeps some additional statistics about locking, caching and
+their efficiency. These can be obtained by running the db_stat utility:
+
+db_stat -h ~/.bogofilter -d wordlist.db # data base statistics
 db_stat -h ~/.bogofilter -e             # environment statistics
-db_stat -h ~/.bogofilter -c             # lock statistics
+db_stat -h ~/.bogofilter -c             # lock statistics - needed for lock resizing
 db_stat -h ~/.bogofilter -m             # buffer pool statistics
+db_stat -h ~/.bogofilter -l             # log statistics
+db_stat -h ~/.bogofilter -t             # transaction statistics
 
-db_stat ~/.bogofilter/wordlist.db       # data base statistics
-	(this has also been available with the traditional bogofilter code)
+Note that statistics may disappear when the data base is recovered. They
+will reappear after running bogofilter, and become more reliable the
+more often bogofilter has been used since the last recovery.
 
-The new code will store files named __db.NNN - where NNN are numbers -
-in the ~/.bogofilter directory. These MUST NOT be removed manually as
-they can contain update data for the data base that must still be
-written back to the wordlist.db file - this happens when there are many
-concurrent processes alongside a registration process.
-The db_recover utility may remove these files, but it knows about
-BerkeleyDB internals.
+You MUST NOT remove files named __db.NNN and log.NNNNNNNNNN - where
+NNN are numbers - in the ~/.bogofilter directory.
+REMOVING THESE FILES CAUSES DATA BASE CORRUPTION
+(there is one exception, see below).
 
-This code is deadlock-free, so if bogofilter hangs with this
-experimental code enabled as documented above, either the data base or
-the environment should be checked for corruptions.
+These can contain update data for the data base that must still be
+written back to the wordlist.db file - this happens when there are many
+concurrent processes alongside a registration process.
 
-3. Open issues and troubleshooting ------------------------------------
+Exception: after reading the Berkeley DB documentation for the
+db_archive utility and using that utility, you may be able to remove
+some of the log.NNNNNNNNNN files. This may be necessary to reclaim disk
+space, but you must strictly adhere to the Berkeley DB documentation
+lest your data base become unrecoverable in case of trouble.
 
-a. The DB_ENV based code appears to be more sensitive (not to say
-   fragile) with respect to premature abortion, hangs of bogofilter
-   processes after an ungraceful bogofilter shutdown have been observed.
-   These can usually be resolved by killing all hanging bogofilter
-   processes, then running
+WARNING: If you need to copy data base files,
+         DO NOT USE cp, BUT DO USE dd instead and give it a block size
+         that matches the data base's block size, which can be found by
+         running db_stat with the -d option as shown above.
 
-	db_recover -h ~/.bogofilter
+4.2 SPECIFIC INFORMATION ON RESIZING THE LOCK TABLES
 
-b. TODO: bogofilter should catch common interrupt signals, SIGHUP,
-   SIGINT, SIGTERM, and ensure a graceful shutdown of the data base.
+In all the commands shown below, replace the ~/.bogofilter path by the
+name of the directory holding your wordlist.db file.
 
-c.
-   TODO: The bogofilter utilities need to be taught about the
-   environment, too, to avoid avoidable corruptions.
+a. Determine the data base size:
 
-d. TODO: Make sure that the token updates and the .MSG_COUNT are
-   bundled and the whole bundle is written atomically.
+ls -l ~/.bogofilter/wordlist.db
+
+b. Determine the data base page size:
+
+db_stat -h ~/.bogofilter/ -d wordlist.db
+The relevant line reads "database page size".
+
+c. The number of locks and lock objects needed is the data base size
+divided by the data base page size, rounded up generously, for example:
+
+(output from step a)
+-rw-r--r-- 1 joe users 15360000 2004-05-11 12:25 wordlist.db
+
+(output from step b)
+53162   Btree magic number.
+8       Btree version number.
+Flags:
+2       Minimum keys per-page.
+4096    Underlying database page size.
+3       Number of levels in the tree.
+...
+
+Hence: 15360000 / 4096 = 3750
+
+d. Round up; 4096 may be adequate if you train seldom, use higher
+values if you train often or in preparation of training on a large
+mailbox. Higher values make your lock region, usually __db.004,
+larger, but allow for larger data bases. Lower values save disk space
+but may require you to resize the lock region more often.
+
+e. Use this rounded-up figure for both of the two DB_CONFIG file lines
+   mentioned in section 3.2.
+
+f. Run bogoutil -f ~/.bogofilter
+
+Bogofilter will re-create the lock tables automatically the next time
+it is run.
+
+A. Switching the disk drive's write cache off and on -------------------
+
+A.1 Introduction
+
+You need to determine the name of the disk device and its type.
+Type "mount"; you'll usually get output that contains lines like these:
+find the "on /home" line if you have it; if you don't, check for the
+"on /usr" line if you have it, or finally, resort to looking at the
+"on /" line.
+
+From this line, look at the left hand column, usually starting with
+/dev.
+
+If you have FreeBSD, skip to section A.3 now.
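The sizing arithmetic in steps a-c of section 4.2 can be checked with a short shell calculation. This is an editorial sketch using the example figures above (15360000-byte data base, 4096-byte pages); substitute the real output of your own ls -l and db_stat runs:

```shell
# Sketch of the lock-table sizing arithmetic from section 4.2,
# using the worked example's figures; both values are assumptions
# to be replaced with your own data base's numbers.
db_size=15360000     # step a: size from ls -l ~/.bogofilter/wordlist.db
page_size=4096       # step b: "Underlying database page size" from db_stat
# step c: one lock per data base page, rounded up
locks=$(( (db_size + page_size - 1) / page_size ))
echo "locks needed: $locks"    # prints "locks needed: 3750"
```

Step d then rounds this figure up generously (to 4096 in the example) before it goes into the DB_CONFIG lines.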
+
+A.2 Switching the write cache off or on in Linux
+
+In the line you've found (see section A.1 above), you'll usually find
+something that starts with /dev/hda, /dev/hde or /dev/sda in the left
+hand column; you can ignore the trailing number. /dev/hd* means ATA,
+/dev/sd* means SCSI.
+
+If the drive name starts with /dev/hd, type the following line, but
+replace hda by hde or whatever else you may have found:
+
+/sbin/hdparm -W0 /dev/hda
+    (replace -W0 by -W1 to reenable the write cache)
+
+If your drive name starts with /dev/sd, use the graphical scsi-config
+utility and add a blank and the device name on the command line; for
+example:
+
+scsi-config /dev/sda
+
+You can "try changes" (they will be forgotten the next time the
+computer is switched off) or "save changes" (settings will be saved
+permanently); you can use the same utility to restore the previous
+setting or load manufacturer defaults. Skip to section 2.4.
+
+What is this scsi-config?
+
+The scsi-config command is a Tk script, delivered with the scsiinfo
+package. At the time of writing, scsiinfo can be found at
+ftp://tsx-11.mit.edu/pub/linux/ALPHA/scsi/scsiinfo-1.7.tar.gz .
+
+For users who don't run X on their mail servers, there is also a
+command-line utility, scsiinfo, in the package. Setting parameters
+with scsiinfo is a bit hairy, but the following sequence worked for two
+of us who tried it (back up your drive first):
+
+# get current disk settings and turn off the write cache
+# (substitute the appropriate device for /dev/sda in all these commands)
+parms=`scsiinfo -cX /dev/sda | sed 's/^./0/'`
+
+# write the parameters back to the hard drive's current settings
+# this needs to be put in a boot script
+scsiinfo -cXR /dev/sda $parms
+
+# if you don't want to put this in a boot script, you can alternatively
+# save the parameters to the hard drive's settings area:
+scsiinfo -cXRS /dev/sda $parms
+
+You did back up your drive before trying that, right?
+:)
+
+
+A.3 Switching the write cache off in FreeBSD
+
+Have you read section A.1 already? You should have.
+
+The line you've found (see section A.1) will usually start with
+/dev/ad0, /dev/wd0 (either means you have ATA) or /dev/da0 (which
+means you have SCSI).
+
+If you have ATA, add the line
+
+  hw.ata.wc="0"
+
+to /boot/loader.conf.local, shut down all applications and reboot. (To
+revert the change, remove the line, shut down all applications and
+reboot.)
+
+If you have SCSI, you'll need to decide if you want the setting until
+the next reboot only, or permanently (the permanent setting can be
+changed back, don't worry). In either case, omit the leading /dev and
+trailing s<NUMBER><LETTER> parts (/dev/da0s1a -> da0; /dev/da4s3f ->
+da4). Replace da0 by your device name in these examples, and leave out
+the part in parentheses:
+
+  camcontrol modepage da0 -m8 -e -P0  (effective until computer is switched off)
+  camcontrol modepage da0 -m8 -e -P3  (save parameters permanently)
+
+camcontrol will open a temporary file with a WCE: line on top. Edit the
+figure to read 0 (cache disabled) or 1 (cache enabled), then save the
+file and exit the editor.

Index: bogoutil.xml
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/doc/bogoutil.xml,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -d -r1.17 -r1.18
--- bogoutil.xml	9 Aug 2004 19:27:56 -0000	1.17
+++ bogoutil.xml	29 Oct 2004 01:11:52 -0000	1.18
@@ -28,6 +28,8 @@
     <cmdsynopsis>
       <command>bogoutil</command>
       <group choice="req">
+	<arg choice="plain">-f</arg>
+	<arg choice="plain">-F</arg>
 	<arg choice="plain">-r</arg>
 	<arg choice="plain">-R</arg>
       </group>
@@ -114,6 +116,16 @@
       Option <option>-p</option> takes the same arguments
       as option <option>-w</option> .
     </para>
+    <para>The <option>-f <replaceable>dir</replaceable></option>
+    option runs a regular data base recovery in the data base
+    directory dir.
+    If that fails, it will retry with a (usually slower)
+    catastrophic data base recovery. If that fails, too, your
+    data base cannot be repaired and must be rebuilt from
+    scratch.</para>
+    <para>The <option>-F <replaceable>dir</replaceable></option>
+    option runs a catastrophic data base recovery in the data base
+    directory dir. If that fails, your data base cannot be repaired
+    and must be rebuilt from scratch.</para>
     <para>The <option>-r</option> option tells
     <application>bogoutil</application> to recalculate the ROBX value
     and print it as a six-digit fraction.