[cvs] bogofilter/doc README.db,1.2,1.3 bogoutil.xml,1.17,1.18
Fast Bayesian spam filter along lines suggested by Paul Graham
Brought to you by: m-a
From: <m-...@us...> - 2004-10-29 01:12:02
Update of /cvsroot/bogofilter/bogofilter/doc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv29084/doc

Modified Files:
	README.db bogoutil.xml
Log Message:
Merge Transactional Store from branch.

Index: README.db
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/doc/README.db,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -d -r1.2 -r1.3
--- README.db	8 Mar 2004 15:52:28 -0000	1.2
+++ README.db	29 Oct 2004 01:11:52 -0000	1.3
@@ -16,91 +16,308 @@
 1. Overview ------------------------------------------------------------
 
-This bogofilter version now contains code to use BerkeleyDB environments
-for locking and caching data base access instead of the proven
-fcntl()/F_SETLK locks bogofilter used to use.
+This bogofilter version has been upgraded to use the Berkeley DB
+Transactional Data Store, to be able to recover a data base after an
+application or system crash.
 
-The new code must be enabled by setting an environment variable.
+2. Prerequisites and Caveats -------------------------------------------
 
-The ultimate goal is to migrate to the Berkeley DB Transactional Data
-Store, in order to improve data base robustness.
+2.1 Compatibility, Berkeley DB versions
 
-The current code is the first step in that direction, the code uses the
-Berkeley DB Concurrent Data Store model, it has the potential of using
-finer grained locks, for improved concurrency - depending on BerkeleyDB
-version, and it uses shared buffer pools to improve performance.
+Berkeley DB version 4.2 has been tested successfully in September 2004,
+but versions 3.0, 3.1, 3.2, 3.3, 4.0 and 4.1 should also work.
 
-The former bogofilter locking code allowed either one registration
-process at the same time or an unlimited amount of readers at the same
-time. A high amount of bogofilter processes that were scoring could
-effectively lock out registration altogether, and registration
-interrupted scoring.
-The new model allows for concurrent use of many
-readers and one writer at the same time.
+Berkeley DB versions 4.1 and 4.2 are recommended over the previous
+versions, because these can detect data corruption more reliably
+(through the use of checksums that detect partially written data base
+pages); Berkeley DB 4.2 appears faster than 4.1 under load (i.e. with
+multiple running copies of bogofilter/bogoutil operating on the same
+data base with at least one spam registration in progress).
 
-2. Use, restrictions, caveats -----------------------------------------
+2.2 Recoverability
 
-The new code is enabled iff you define the environment variable
-BOGOFILTER_CONCURRENT_DATA_STORE, irrespective of its content.
+The ability to recover the data base after a crash (power failure!)
+depends on data being written to the disk (or a battery-backed write
+cache) _immediately_ rather than delayed to be written later.
 
-If, for some reason, it does not work for you, stop all bogofilter
-processes and try to recover the environment with the db_recover
-utility. Give it your data base directory as argument to the -h option,
-for instance:
+Common disk drives in current PCs and Macs are of the ATA or SATA kind
+and usually ship with their write cache enabled. They write fast, but
+can lose or corrupt up to a few MB of data when the power fails.
+Note: This problem is not specific to bogofilter.
 
-	db_recover -h ~/.bogofilter
+It is possible to sacrifice a bit of the write speed and get
+reliability in turn, by switching off the disk's write cache (see
+appendix A for instructions).
 
-If that does not cure your problems, please report to the mailing list
-1. what you did, 2. what you got, 3. what you expected instead of #2.
+Switching the write cache off may, however, degrade performance below
+acceptable levels, particularly for large writes such as recording
+live audio or video data to hard disk.
+If performance is degraded too much, consider getting a second disk
+drive and using one for fast writes (with the write cache on) and one
+for reliable writes (with the write cache off, for bogofilter, mail
+servers and other applications that need to survive a power loss
+without data loss).
 
-As an example, you could use this as a boilerplate to start using the
-new code:
+2.3 Choosing a file system
 
-env BOGOFILTER_CONCURRENT_DATA_STORE=1 bogofilter [options]
+If your computer saves the data on its own disk drive (a "local file
+system"), BerkeleyDB should work fine. Such file systems are ext2, ext3,
+ffs, jfs, hfs, hfs+, reiserfs, ufs, xfs.
 
-where [options] is a placeholder for the options you regularly run
-bogofilter with, it can be empty.
+Berkeley DB does not work reliably with a networked file system. AFS,
+CIFS, Coda, NFS, SMBFS fall into this category.
 
-Note that this env BOGOFILTER... is necessary for all bogofilter related
-commands that access the data base, for instance, bogoutil and bogotune
-in particular.
+Strictly speaking, with BerkeleyDB 4.0 and older versions, the data base
+block size must be written atomically. The bogofilter maintainers are
+not currently aware of a file system that meets this requirement and is
+production quality at the same time.
 
-BerkeleyDB keeps some additional statistics about locking, caching and
-their efficiency. These can be obtained by running the db_stat utility
-with the -e or -c option, examples:
+2.4 Do make backups!
+
+The transactional data store is no good if the disk drive has become
+inaccessible (which happens after some months or years with every
+drive), so you _must_ back up your data base regularly (see the
+db_archive utility for additional documentation of a "hot" backup);
+bogofilter cannot, of course, guess data that got lost through a hard
+drive fault.
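A file-level backup copy can be sketched in shell. This is a hedged example, not part of the committed README: the temporary files stand in for ~/.bogofilter/wordlist.db and the backup target so the sketch is runnable anywhere, and bs=4096 is only an assumed page size - use the "database page size" figure that db_stat reports for your file (see section 4.1's WARNING, which recommends dd over cp for data base files).

```shell
# Hedged sketch: copy a data base file with dd, as section 4.1's
# WARNING advises (not cp).  The temp files below are stand-ins for
# ~/.bogofilter/wordlist.db and the backup target; bs=4096 is an
# assumed page size - substitute the "database page size" figure
# reported by db_stat -d for your own wordlist.db.
src=$(mktemp); dst=$(mktemp)
printf 'stand-in for data base pages' > "$src"
dd if="$src" of="$dst" bs=4096 2>/dev/null
cmp -s "$src" "$dst" && echo "copy ok"
```

Remember that a copy taken while bogofilter is writing may be inconsistent; the db_archive documentation describes a proper "hot" backup procedure.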
+
+Although backup strategies are beyond the scope of this document, be
+sure to store fresh backups of important data outside of your house
+regularly.
+
+3. Use and troubleshooting ---------------------------------------------
+
+3.1 LOG FILE HANDLING
+
+The Berkeley DB Transactional Data Store uses log files to store data
+base changes so that they can be rolled back or restored after an
+application crash.
+
+The logs of the transactional data store, log.NNNNNNNNNN files of up to
+10 MB in size (in the default configuration), can clog up considerable
+amounts of disk space, and many users wish to purge or compress these
+log files. They can safely be handled with BerkeleyDB's db_checkpoint
+and db_archive utilities, which you run in this order:
+
+- db_checkpoint migrates written-ahead data from the log files into the
+  data base, and places a checkpoint which will speed up data base
+  recovery, for instance after a premature bogofilter abort.
+
+- db_archive allows you to identify log files that are no longer in use
+  so that you can compress or remove them.
+
+Before choosing to remove log files, be sure to read the db_archive
+documentation that ships with BerkeleyDB, because removing logs has an
+impact on recoverability and can render your data base unrecoverable.
+
+The db_archive documentation also contains suggestions for several
+backup strategies.
+
+3.2 LOCK TABLE EXHAUSTION
+
+One common failure case is known:
+
+Problem: Operations that affect large parts of the data base or the
+         data base as a whole (bogoutil usually) may require many locks
+         and exhaust the maximum number of locks or the maximum number
+         of locked objects that the Berkeley DB environment supports.
+
+Symptom: Operations abort with "out of memory" although the machine has
+         plenty of RAM and/or swap.
+
+         Operations report lock or object table exhaustion and abort.
+
+Cause:   Natural data base growth.
+
+Fix:     Resize the lock tables.
+         It is easy and requires these two steps (in the next steps,
+         adjust the ~/.bogofilter path if you don't have the data base
+         in its default location):
+
+a. Create a ~/.bogofilter/DB_CONFIG file (in the same directory as your
+   wordlist.db file) that looks like this:
+
+set_lk_max_objects 32768
+set_lk_max_locks 32768
+
+You may need to adjust these figures. You will need up to one lock per
+data base page and a bit of headroom for future training -- see section
+4.2 below to determine the size of the data base and data base page.
+
+b. Run bogoutil -f ~/.bogofilter (use the path from step a, omitting
+   the /DB_CONFIG part).
+
+4. Other Information of Interest ---------------------------------------
+
+4.1 GENERAL INFORMATION
+
+Berkeley DB keeps some additional statistics about locking, caching and
+their efficiency. These can be obtained by running the db_stat utility:
+
+db_stat -h ~/.bogofilter -d wordlist.db # data base statistics
 db_stat -h ~/.bogofilter -e             # environment statistics
-db_stat -h ~/.bogofilter -c             # lock statistics
+db_stat -h ~/.bogofilter -c             # lock statistics - needed for lock resizing
 db_stat -h ~/.bogofilter -m             # buffer pool statistics
+db_stat -h ~/.bogofilter -l             # log statistics
+db_stat -h ~/.bogofilter -t             # transaction statistics
 
-db_stat ~/.bogofilter/wordlist.db       # data base statistics
-	(this has also been available with the traditional bogofilter code)
+Note that statistics may disappear when the data base is recovered. They
+will reappear after running bogofilter, and become more reliable the
+more often bogofilter has been used since the last recovery.
 
-The new code will store files named __db.NNN - where NNN are numbers -
-in the ~/.bogofilter directory. These MUST NOT be removed manually as
-they can contain update data for the data base that must still be
-written back to the wordlist.db file - this happens when there are many
-concurrent processes alongside a registration process.
-The db_recover utility may remove these files, but it knows about
-BerkeleyDB internals.
+You MUST NOT remove files named __db.NNN and log.NNNNNNNNNN - where
+NNN are numbers - in the ~/.bogofilter directory.
+REMOVING THESE FILES CAUSES DATA BASE CORRUPTION
+(there is one exception, see below).
 
-This code is deadlock-free, so if bogofilter hangs with this
-experimental code enabled as documented above, either the data base or
-the environment should be checked for corruptions.
+These can contain update data for the data base that must still be
+written back to the wordlist.db file - this happens when there are many
+concurrent processes alongside a registration process.
 
-3. Open issues and troubleshooting ------------------------------------
+Exception: after reading the Berkeley DB documentation for the
+db_archive utility and using that utility, you may be able to remove
+some of the log.NNNNNNNNNN files. This may be necessary to reclaim disk
+space, but you must strictly adhere to the Berkeley DB documentation
+lest your data base become unrecoverable in case of trouble.
 
-a. The DB_ENV based code appears to be more sensitive (not to say
-   fragile) with respect to premature abortion, hangs of bogofilter
-   processes after an ungraceful bogofilter shutdown have been observed.
-   These can usually be resolved by killing all hanging bogofilter
-   processes, then running
+WARNING: If you need to copy data base files,
+         DO NOT USE cp, BUT DO USE dd instead and give it a block size
+         that matches the data base's block size, which can be found by
+         running db_stat with the -d option as shown above.
 
-	db_recover -h ~/.bogofilter
+4.2 SPECIFIC INFORMATION ON RESIZING THE LOCK TABLES
 
-b. TODO: bogofilter should catch common interrupt signals, SIGHUP,
-   SIGINT, SIGTERM, and ensure a graceful shutdown of the data base.
+In all the commands shown below, replace the ~/.bogofilter path by the
+name of the directory holding your wordlist.db file.
 
-c.
-   TODO: The bogofilter utilities need to be taught about the
-   environment, too, to avoid avoidable corruptions.
+a. Determine the data base size:
 
-d. TODO: Make sure that the token updates and the .MSG_COUNT are
-   bundled and the whole bundle is written atomically.
+ls -l ~/.bogofilter/wordlist.db
+
+b. Determine the data base page size:
+
+db_stat -h ~/.bogofilter/ -d wordlist.db
+The relevant line reads "database page size".
+
+c. The number of locks and lock objects needed is the data base size
+divided by the data base page size, rounded up generously, for example:
+
+(output from step a)
+-rw-r--r-- 1 joe users 15360000 2004-05-11 12:25 wordlist.db
+
+(output from step b)
+53162   Btree magic number.
+8       Btree version number.
+Flags:
+2       Minimum keys per-page.
+4096    Underlying database page size.
+3       Number of levels in the tree.
+...
+
+Hence: 15360000 / 4096 = 3750
+
+d. Round up; 4096 may be adequate if you train seldom, use higher
+values if you train often or in preparation of training on a large
+mailbox. Higher values make your lock region, usually __db.004,
+larger, but allow for larger data bases. Lower values save disk space
+but may require you to resize the lock region more often.
+
+e. Use this rounded-up figure for both of the two DB_CONFIG file lines
+   mentioned in section 3.2.
+
+f. Run bogoutil -f ~/.bogofilter
+
+Bogofilter will re-create the lock tables automatically the next time
+it is run.
+
+A. Switching the disk drive's write cache off and on -------------------
+
+A.1 Introduction
+
+You need to determine the name of the disk device and its type.
+Type "mount"; you'll usually get output that contains lines like these:
+find the "on /home" line if you have it; if you don't, check for the
+"on /usr" line if you have it, or finally, resort to looking at the
+"on /" line.
+
+From this line, look at the left hand column, usually starting with
+/dev.
+
+If you have FreeBSD, skip to section A.3 now.
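The sizing arithmetic in steps a-c of section 4.2 can be checked with a short shell calculation. This is an editorial sketch using the example figures above (15360000-byte data base, 4096-byte pages); substitute the real output of your own ls -l and db_stat runs:

```shell
# Sketch of the lock-table sizing arithmetic from section 4.2,
# using the worked example's figures; both values are assumptions
# to be replaced with your own data base's numbers.
db_size=15360000     # step a: size from ls -l ~/.bogofilter/wordlist.db
page_size=4096       # step b: "Underlying database page size" from db_stat
# step c: one lock per data base page, rounded up
locks=$(( (db_size + page_size - 1) / page_size ))
echo "locks needed: $locks"    # prints "locks needed: 3750"
```

Step d then rounds this figure up generously (to 4096 in the example) before it goes into the DB_CONFIG lines.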
+
+A.2 Switching the write cache off or on in Linux
+
+In the line you've found (see section A.1 above), you'll usually find
+something that starts with /dev/hda, /dev/hde or /dev/sda in the left
+hand column; you can ignore the trailing number. /dev/hd* means ATA,
+/dev/sd* means SCSI.
+
+If the drive name starts with /dev/hd, type the following line, but
+replace hda by hde or whatever else you may have found:
+
+/sbin/hdparm -W0 /dev/hda
+    (replace -W0 by -W1 to reenable the write cache)
+
+If your drive name starts with /dev/sd, use the graphical scsi-config
+utility and add a blank and the device name on the command line; for
+example:
+
+scsi-config /dev/sda
+
+You can "try changes" (they will be forgotten the next time the
+computer is switched off) or "save changes" (settings will be saved
+permanently); you can use the same utility to restore the previous
+setting or load manufacturer defaults. Skip to section 2.4.
+
+What is this scsi-config?
+
+The scsi-config command is a Tk script, delivered with the scsiinfo
+package. At the time of writing, scsiinfo can be found at
+ftp://tsx-11.mit.edu/pub/linux/ALPHA/scsi/scsiinfo-1.7.tar.gz .
+
+For users who don't run X on their mail servers, there is also a
+command-line utility, scsiinfo, in the package. Setting parameters
+with scsiinfo is a bit hairy, but the following sequence worked for two
+of us who tried it (back up your drive first):
+
+# get current disk settings and turn off the write cache
+# (substitute the appropriate device for /dev/sda in all these commands)
+parms=`scsiinfo -cX /dev/sda | sed 's/^./0/'`
+
+# write the parameters back to the hard drive's current settings
+# this needs to be put in a boot script
+scsiinfo -cXR /dev/sda $parms
+
+# if you don't want to put this in a boot script, you can alternatively
+# save the parameters to the hard drive's settings area:
+scsiinfo -cXRS /dev/sda $parms
+
+You did back up your drive before trying that, right?
+:)
+
+
+A.3 Switching the write cache off in FreeBSD
+
+Have you read section A.1 already? You should have.
+
+The line you've found (see section A.1) will usually start with
+/dev/ad0, /dev/wd0 (either means you have ATA) or /dev/da0 (which
+means you have SCSI).
+
+If you have ATA, add the line
+
+  hw.ata.wc="0"
+
+to /boot/loader.conf.local, shut down all applications and reboot. (To
+revert the change, remove the line, shut down all applications and
+reboot.)
+
+If you have SCSI, you'll need to decide if you want the setting until
+the next reboot only, or permanently (the permanent setting can be
+changed back, don't worry). In either case, omit the leading /dev and
+trailing s<NUMBER><LETTER> parts (/dev/da0s1a -> da0; /dev/da4s3f ->
+da4). Replace da0 by your device name in these examples, and leave out
+the part in parentheses:
+
+  camcontrol modepage da0 -m8 -e -P0  (effective until computer is switched off)
+  camcontrol modepage da0 -m8 -e -P3  (save parameters permanently)
+
+camcontrol will open a temporary file with a WCE: line on top. Edit the
+figure to read 0 (cache disabled) or 1 (cache enabled), then save the
+file and exit the editor.

Index: bogoutil.xml
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/doc/bogoutil.xml,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -d -r1.17 -r1.18
--- bogoutil.xml	9 Aug 2004 19:27:56 -0000	1.17
+++ bogoutil.xml	29 Oct 2004 01:11:52 -0000	1.18
@@ -28,6 +28,8 @@
     <cmdsynopsis>
       <command>bogoutil</command>
       <group choice="req">
+	<arg choice="plain">-f</arg>
+	<arg choice="plain">-F</arg>
 	<arg choice="plain">-r</arg>
 	<arg choice="plain">-R</arg>
       </group>
@@ -114,6 +116,16 @@
       Option <option>-p</option> takes the same arguments
       as option <option>-w</option> .
     </para>
+    <para>The <option>-f <replaceable>dir</replaceable></option>
+    option runs a regular data base recovery in the data base
+    directory dir.
+    If that fails, it will retry with a (usually slower)
+    catastrophic data base recovery. If that fails, too, your
+    data base cannot be repaired and must be rebuilt from
+    scratch.</para>
+    <para>The <option>-F <replaceable>dir</replaceable></option>
+    option runs a catastrophic data base recovery in the data base
+    directory dir. If that fails, your data base cannot be repaired
+    and must be rebuilt from scratch.</para>
     <para>The <option>-r</option> option tells
     <application>bogoutil</application> to recalculate the ROBX value
     and print it as a six-digit fraction.