[cvs] bogofilter RELEASE.NOTES,1.6,1.7 Makefile.am,1.161,1.162 RELEASE.NOTES-0.92,1.2,NONE RELEASE.N
Fast Bayesian spam filter along lines suggested by Paul Graham
From: <m-...@us...> - 2004-11-09 12:57:30
Update of /cvsroot/bogofilter/bogofilter
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28434

Modified Files:
    RELEASE.NOTES Makefile.am
Removed Files:
    RELEASE.NOTES-0.92 RELEASE.NOTES-0.93
Log Message:
Clean up the release notes mess. Reformat document for clarity and merge
the 0.17 and 0.16 sections.

Index: RELEASE.NOTES
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/RELEASE.NOTES,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -d -r1.6 -r1.7
--- RELEASE.NOTES 26 Oct 2004 00:04:41 -0000 1.6
+++ RELEASE.NOTES 9 Nov 2004 12:57:11 -0000 1.7
@@ -1,374 +1,254 @@
- ####### RELEASE.NOTES-0.92 #######
-
- Caution: If upgrading from an old version and skipping several
- intervening versions of bogofilter, be smart and verify that the
- new version is working for you before putting it into production.
- Bogofilter's developers recommend that you run a few manual
- command line tests. Use options, config files, and wordlists from
- your production environment. This will help ensure that the new
- version of bogofilter is working properly.
-
- 0.92.2 2004-07-11
- NOTE: the formatting parameters have changed,
- '%A' is now the message's IP address.
- '%I' is now the Message-ID.
- '%Q' is now the Queue-ID.
-
-
- ####### RELEASE.NOTES-0.17 #######
-
- Code Clean-Up - Phase 2
- -----------------------
-
- Update
- ========
-
- Since bogofilter 0.17.3, the BerkeleyDB data store model (the default
- data base) has been switched to the Berkeley DB Concurrent Data Store.
- This changes some aspects of how concurrent bogofilter processes are
- handles. Please see the file doc/README.db for details.
-
- The text below applies to all versions since 0.17.0 inclusively, and is
- still true for 0.17.3:
-
- Introduction
- ------------
-
- Bogofilter was released over a year ago and has continually been
- extended, corrected, enhanced, and refined.
Over this time it has - evolved from a simple Bayesian filter to a sophisticated filter that - understands email, decodes text parts of multi-part MIME messages, - processes html, etc. - - During this evolution, old functions have remained in the code and - command-line options have been added to provide compatibility with - older versions. Many of these functions and options have started - collecting dust - some are not commonly used and others are not - well-tested. - - Bogofilter is suffering from creeping featuritis and optionitis. - - It is time to clean house! - - In bogofilter 0.16.0 much of the cruft was bracketed with #ifdef - ENABLE_DEPRECATED_CODE and #endif statements. The later 0.16.x - releases tagged additional code that had been overlooked in 0.16.0. - - Now, with bogofilter 0.17.0, all the tagged code has been removed. - The documentation and sample config file have been updated to reflect - the deleted options and features. - - The following list is supposed to be complete. Let us know if we've - omitted anything. We shall try to provide workarounds and migration - paths whenever possible. - - Feature List - ------------ - - 1) Scoring algorithms: - - Bogofilter will support only the Robinson-Fisher algorithm, - commonly called the "Fisher algorithm". The Graham algorithm and - Robinson geometric-mean algorithm, a.k.a. Robinson algorithm, have - been deprecated. - - 2) Wordlist support. - - Bogofilter will now support only the combined wordlist, i.e. - wordlist.db, which contains both the ham and spam counts for each - token. The older, separate wordlists (spamlist.db and goodlist.db) - are no longer supported. - - The bogoupgrade program can still be used to merge the separate - databases for you. Type "bogoupgrade -d /you/wordlist/directory/" - to do the job. - - Ignore lists, i.e. ignorelist.db, are also being deprecated. The - ignore list feature has never been thoroughly tested and is not - used (as far as we know). 
- - 3) BerkeleyDB support - - Binary RPM packages are now being built with BerkeleyDB-4.1 (or - newer). +This file documents incompatible (that require further action upon +update) and major (noteworthy but compatible) changes since bogofilter +0.11, separated by line feeds. - For convenience, use whatever BerkeleyDB version came with your - system. We have tested BerkeleyDB 3.2 and newer, but our testing - focus is with the recent 4.X releases. We developers are no longer - using BerkeleyDB-3.3, but will leave the code in bogofilter to - allow its continued use. +Caution: If upgrading from an old version and skipping several +intervening versions of bogofilter, be smart and check all the +"incompatible changes" sections of the versions you skipped! - 4) Command line switches: +Also, if you upgrade to a new minor version (that is, the first or +second number in bogofilter's version have been changed), you MUST +verify that the new version is working for you BEFORE putting it into +production. - Bogofilter will no longer support the switches listed in this - section. If used, bogofilter will print an error message and exit. +Bogofilter's developers recommend that you run a few manual command line +tests. Use options, config files, and wordlists from your production +environment. This will help ensure that the new version of bogofilter +is working properly. - Scoring related switches: +NOTE: the NEWS and CHANGES documents have greater detail on some of +these changes. You should consult them. - -g - select Graham algorithm - -r - select Robinson Geometric-Mean algorithm - -f - select Robinson-Fisher algorithm - -2 - set binary classification mode - -3 - set ternary classification mode +######################################################################## - Note: The Robinson-Fisher algorithm is bogofilter's one and - only algorithm. The classification mode switches are - unnecessary. 
Bogofilter will use binary mode if ham_cutoff is - zero and will use ternary mode (Yes, No, Unsure) if ham_cutoff - in non-zero and less than spam_cutoff. +INCOMPATIBLE CHANGES IN BOGOFILTER 0.93 +======================================= - Wordlist switches: +Summary for the hasty +--------------------- - -W - use combined wordlist for spam and ham tokens - -WW - use separate wordlists for spam and ham tokens +YOU MUST ADJUST YOUR SCRIPTS IF YOU ARE EVALUATING X-Bogosity HEADERS! - Note: Combined mode is now the only supported mode. +YOU MUST READ doc/README.db AND POSSIBLY CONFIGURE THE DATABASE! - Backwards compatible token generation switches: - -Pi and -PI - ignore_case - -Pt and -PT - tokenize_html_tags - -Pc and -PC - strict_check - -Pd and -PD - degen_enabled - -Pf and -PF - first_match +Defaults changed +---------------- - Note: Since last May, the default values for these switches - have been: +Bogofilter's defaults have been changed. It now operates in tri-state +mode and will classify messages as Spam, Ham, or Unsure. - ignore_case disabled - tokenize_html_tags enabled - strict_check disabled - degen_enabled disabled - first_match disabled +If you're checking messages for "X-Bogosity: Yes" or "X-Bogosity: No", +you _need_ to change your checks. Use "X-Bogosity: Spam" and +"X-Bogosity: Ham" instead of the old forms. Also, checking for +"X-Bogosity: Unsure" and putting those messages in a separate folder (or +mailbox) will give you an excellent set of messages for training, as +"Unsure" messages are messages that bogofilter has too little +information to classify (with certainty) as spam or ham. - There will be no change in the default values. 
- 5) Configuration options: +Berkeley DB switched to Transactional Data Store +------------------------------------------------ - The following configuration options (for the above switches) are - deprecated: +Bogofilter will now use the Berkeley DB Transactional Data Store when +compiled with Berkeley DB as the data base engine (the default). - algorithm +When using BerkeleyDB 4.1 or 4.2, it is recommended that you dump and +load the data bases to add checksums, for enhanced reliablity. See +section 2.2 in doc/README.db for details. - wordlist - wordlist_mode +This means that bogofilter programs now exhibit the A C I D traits: +changes are atomic (all-or-nothing); the data base is always consistent; +changes are always isolated from each other; and all changes that are +acknowledged are durable. - ignore_case - tokenize_html_tags - tokenize_html_script - header_degen - degen_enabled - first_match +Bogofilter can support multiple writers at the same time, mixed freely +with simultaneous readers, and the data base will not be corrupted by +application or system crashes, except when the disk drive gets damaged. - The following configuration options (which don't correspond to - switches) are deprecated: +Note that this requires that the operating system and disk drive +maintain proper write order on the disk, and that both be honest about +synchronous I/O completion. - thresh_stats - thresh_rtable +Note also that this causes bogofilter to write additional "log" files +to its ~/.bogofilter (or other) home directory. The log files need to +be archived or deleted periodically. - Note: Bogofilter will print a warning message if it sees any of - these options, but will run fine anyhow. +For detailed instructions, be sure to _read_ doc/README.db and check the +BerkeleyDB documentation. - 6) Miscellany: +These benefits are not available when bogofilter is compiled to use the +TDB or QDBM data bases. 
- The user formatted SPAM_HEADER will no longer support format - specification "%a" (for algorithm) since bogofilter now has only - one algorithm. - Operational Note - ---------------- +QDBM database format changed to B+ trees +---------------------------------------- - With the 0.16.0 release, a number of features have been deprecated. - The relevant code is bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and - "#endif" statements. The default build will not include the - deprecated features. For those who still need these features, - configure option "--enable-deprecated-code" exists to allow them to be - turned on. +The QDBM database format has been changed from hash tables to B+ +trees, i.e. from the Depot API to the Villa API. This results in +significantly better performance, i.e. faster speed. Unfortunately, +the two modes are incompatible, so upgrading to 0.93 requires running +a special command to convert the database once: - Plan - ---- +bogoQDBMupgrade wordlist.qdbm wordlist.tmp wordlist.qdbm.old - Bogofilter 0.16.0 will be the "Code Clean-Up - Phase 1" release. The - "deprecated" state will exist until 0.16.X is promoted to "stable" - status, or for a month, whichever is longer. +If this command didn't print anything, everything has gone well and it +has left your old data base in wordlist.qdbm.old. - Bogofilter 0.17.0 will be the "Code Clean-Up - Phase 2" release. All the - deprecated code will be removed. +######################################################################## - ####### RELEASE.NOTES-0.16 ####### - - Code Clean-Up - Phase 1 - ----------------------- - - Introduction - ------------ +INCOMPATIBLE CHANGES IN BOGOFILTER 0.92 +======================================= - Bogofilter was released over a year ago and has continually been - extended, corrected, enhanced, and refined. 
Over this time it has - evolved from a simple Bayesian filter to a sophisticated filter that - understands email, decodes text parts of multi-part MIME messages, - processes html, etc. +NOTE: the formatting parameters have changed, + '%A' is now the message's IP address. + '%I' is now the Message-ID. + '%Q' is now the Queue-ID. - During this evolution, old functions have remained in the code and - command-line options have been added to provide compatibility with - older versions. Many of these functions and options have started - collecting dust - some are not commonly used and others are not - well-tested. +######################################################################## + +INCOMPATIBLE CHANGES IN BOGOFILTER 0.17 +======================================= - Bogofilter is suffering from creeping featuritis and optionitis. +Support for --enable-deprecated-code (see the 0.16 release notes) +has been removed. If you've run 0.16.X without that switch, nothing +changes for you. - It is time to clean house! +######################################################################## + +INCOMPATIBLE CHANGES IN BOGOFILTER 0.16 +======================================= - The goal of the bogofilter 0.16 series is to clean out this excess - code and create a core of high quality code. This will necessarily cut - some ties with previous versions, and you may need to adjust your - wrapper scripts to make up for features we have dropped. +With the 0.16.0 release, a number of features have been deprecated. The +relevant code is bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and +"#endif" statements. The default build will not include the deprecated +features. For those who still need these features, configure option +"--enable-deprecated-code" exists to allow them to be turned on. - The following list is supposed to be complete. Let us know if we've - omitted anything. We shall try to provide workarounds and migration - paths whenever possible. 
+THIS MAY REQUIRE MAJOR CHANGES TO YOUR CONFIGURATION OR SCRIPTS! - Feature List - ------------ +The following list is supposed to be complete. Let us know if we've +omitted anything. We shall try to provide workarounds and migration +paths whenever possible. - 1) Scoring algorithms: - Bogofilter will support only the Robinson-Fisher algorithm, - commonly called the "Fisher algorithm". The Graham algorithm and - Robinson geometric-mean algorithm, a.k.a. Robinson algorithm, have - been deprecated. +1) Scoring algorithms +--------------------- - 2) Wordlist support. +Bogofilter will support only the Robinson-Fisher algorithm, commonly +called the "Fisher algorithm". The Graham algorithm and Robinson +geometric-mean algorithm, a.k.a. Robinson algorithm, have been +deprecated. - Bogofilter will now support only the combined wordlist, i.e. - wordlist.db, which contains both the ham and spam counts for each - token. The older, separate wordlists (spamlist.db and goodlist.db) - are no longer supported. - The bogoupgrade program can still be used to merge the separate - databases for you. Type "bogoupgrade -d /you/wordlist/directory/" - to do the job. +2) Wordlist support +------------------- - Ignore lists, i.e. ignorelist.db, are also being deprecated. The - ignore list feature has never been thoroughly tested and is not - used (as far as we know). +Bogofilter will now support only the combined wordlist, i.e. +wordlist.db, which contains both the ham and spam counts for each token. +The older, separate wordlists (spamlist.db and goodlist.db) are no +longer supported. - 3) BerkeleyDB support +The bogoupgrade program can still be used to merge the separate +databases for you. Type "bogoupgrade -d /you/wordlist/directory/". - Binary RPM packages are now being built with BerkeleyDB-4.1 (or - newer). +Ignore lists, i.e. ignorelist.db, are also being deprecated. The ignore +list feature has never been thoroughly tested and is not used (as far as +we know). 
- For convenience, use whatever BerkeleyDB version came with your - system. We have tested BerkeleyDB 3.2 and newer, but our testing - focus is with the recent 4.X releases. We developers are no longer - using BerkeleyDB-3.3, but will leave the code in bogofilter to - allow its continued use. - 4) Command line switches: +3) Command line switches +------------------------ - Bogofilter will no longer support the switches listed in this - section. If used, bogofilter will print an error message and exit. +Bogofilter will no longer support the switches listed in this section. +If used, bogofilter will print an error message and exit. - Scoring related switches: + Scoring related switches: - -g - select Graham algorithm - -r - select Robinson Geometric-Mean algorithm - -f - select Robinson-Fisher algorithm - -2 - set binary classification mode - -3 - set ternary classification mode + -g - select Graham algorithm + -r - select Robinson Geometric-Mean algorithm + -f - select Robinson-Fisher algorithm - Note: The Robinson-Fisher algorithm is bogofilter's one and - only algorithm. The classification mode switches are - unnecessary. Bogofilter will use binary mode if ham_cutoff is - zero and will use ternary mode (Yes, No, Unsure) if ham_cutoff - in non-zero and less than spam_cutoff. + see section 1 above - Wordlist switches: + -2 - set binary classification mode + -3 - set ternary classification mode - -W - use combined wordlist for spam and ham tokens - -WW - use separate wordlists for spam and ham tokens + Bogofilter will use binary mode if ham_cutoff is zero and will use + ternary mode (Yes, No, Unsure) if ham_cutoff in non-zero and less + than spam_cutoff. - Note: Combined mode is now the only supported mode. 
+ Wordlist modes: - Backwards compatible token generation switches: + -W - use combined wordlist for spam and ham tokens + -WW - use separate wordlists for spam and ham tokens - -Pi and -PI - ignore_case - -Pt and -PT - tokenize_html_tags - -Pc and -PC - strict_check - -Pd and -PD - degen_enabled - -Pf and -PF - first_match + Bogofilter will always operate in combined mode now. - Note: Since last May, the default values for these switches - have been: + Backwards compatible token generation switches: - ignore_case disabled - tokenize_html_tags enabled - strict_check disabled - degen_enabled disabled - first_match disabled + -Pi and -PI - ignore_case + -Pt and -PT - tokenize_html_tags + -Pc and -PC - strict_check + -Pd and -PD - degen_enabled + -Pf and -PF - first_match - There will be no change in the default values. + Note: Since last May, the default values for these switches + have been: - 5) Configuration options: + ignore_case disabled + tokenize_html_tags enabled + strict_check disabled + degen_enabled disabled + first_match disabled - The following configuration options (for the above switches) are - deprecated: + There will be no change in the default values. - algorithm - wordlist - wordlist_mode +4) Configuration options +------------------------ - ignore_case - tokenize_html_tags - tokenize_html_script - header_degen - degen_enabled - first_match +The following configuration options (for the above switches) are +deprecated: - The following configuration options (which don't correspond to - switches) are deprecated: + algorithm - thresh_stats - thresh_rtable + wordlist + wordlist_mode - Note: Bogofilter will print a warning message if it sees any of - these options, but will run fine anyhow. 
+ ignore_case + tokenize_html_tags + tokenize_html_script + header_degen + degen_enabled + first_match - 6) Miscellany: +The following configuration options (which don't correspond to +switches) are deprecated: - The user formatted SPAM_HEADER will no longer support format - specification "%a" (for algorithm) since bogofilter now has only - one algorithm. + thresh_stats + thresh_rtable - Operational Note - ---------------- +Note: Bogofilter will print a warning message if it sees any of +these options, but will run fine anyhow. - With the 0.16.0 release, a number of features have been deprecated. - The relevant code is bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and - "#endif" statements. The default build will not include the - deprecated features. For those who still need these features, - configure option "--enable-deprecated-code" exists to allow them to be - turned on. - Plan - ---- +5) Miscellany +------------- - Bogofilter 0.16.0 will be the "Code Clean-Up - Phase 1" release. The - "deprecated" state will exist until 0.16.X is promoted to "stable" - status, or for a month, whichever is longer. +The user formatted SPAM_HEADER will no longer support format +specification "%a" (for algorithm) since bogofilter now has only one +algorithm. - Bogofilter 0.17.0 will be the "Code Clean-Up - Phase 2" release. All the - deprecated code will be removed. +######################################################################## - ####### RELEASE.NOTES-0.15 ####### - - *** GOOD NEWS ... BAD NEWS *** +INCOMPATIBLE CHANGES IN BOGOFILTER 0.15 +======================================= - Since release 0.15.9, bogofilter no longer allows to disable algorithms, - which has never been supported well. +Since release 0.15.9, bogofilter no longer allows to disable algorithms, +which has never been supported well. 
- With release 0.15.4, all header line tokens are now tagged as: +Since release 0.15.4, all header line tokens are now tagged as: Subject: subj: To: to: @@ -377,155 +257,160 @@ Received: rcvd: ***new*** any other: head: ***new*** - Since existing wordlists don't have "head:???" tokens, the new tokens - won't be found in the wordlist and bogofilter's accuracy will go down. - To correct this you can do one of the following things: - - 1 - Use the new "-H" (for header-degen) option when scoring messages. - This option tells bogofilter to check the wordlist twice for each - header token - once for "head:xyz" and a second time for "xyz". The - ham and spam counts are added together to give a cumulative result. +Because existing wordlists don't have "head:???" tokens, the new tokens +won't be found in the wordlist and bogofilter's accuracy will go down. - Note that, with bogofilter 0.15.4 and later, during message - registration, "head:xyz" tokens are added to the wordlist (for the - header lines). The "-H" option is only applied during scoring. +To correct this you can do one of the following things: - The "-H" option is meant for temporary usage to cover the period while - bogofilter goes from having no "head:xyz" tokens in the wordlist to - the time when there are enough such tokens to score messages - effectively. After a few weeks, or perhaps months, of registering - messages with the new bogofilter, use of the "-H" option can end and - bogofilter will use the newly added "head:xyz" tokens. +1 - Use the new "-H" (for header-degen) option when scoring messages. +This option tells bogofilter to check the wordlist twice for each header +token - once for "head:xyz" and a second time for "xyz". The ham and +spam counts are added together to give a cumulative result. - 2 - Retrain bogofilter with whatever ham and spam you have available. - This will create "header:xyz" tokens and allow the new, more effective - header tagging to be used to fullest advantage. 
+Note that, with bogofilter 0.15.4 and later, during message +registration, "head:xyz" tokens are added to the wordlist (for the +header lines). The "-H" option is only applied during scoring. - *** A MAJOR ENHANCEMENT *** +The "-H" option is meant for temporary usage to cover the period while +bogofilter goes from having no "head:xyz" tokens in the wordlist to the +time when there are enough such tokens to score messages effectively. +After a few weeks, or perhaps months, of registering messages with the +new bogofilter, use of the "-H" option can end and bogofilter will use +the newly added "head:xyz" tokens. - With release 0.15, bogofilter's code for processing multiple messages - has been rewritten. In addition to understanding mbox format files, - bogofilter now understands maildirs and MH folders. - - ####### RELEASE.NOTES-0.14 ####### +2 - Retrain bogofilter with whatever ham and spam you have available. +This will create "header:xyz" tokens and allow the new, more effective +header tagging to be used to fullest advantage. - With release 0.14, bogofilter's use of BerkelyDB has changed. First, - TrivialDB (tdb) can be used. Second, instead of separate wordlists - for spam and ham tokens, bogofilter can now use a single combined, - wordlist that stores both all tokens. However, this change broke the - early versions (up to and including 0.14.2) of bogofilter. You should - use at least bogofilter 0.14.3. - In the combined wordlist each token contains two counts - for spam and - ham. The name of the new file is wordlist.db. - Bogofilter will check in $BOGOFILTER_DIR and use the wordlist(s) that - are there. If wordlist.db is present, bogofilter will use the - combined mode. If wordlist.db is not present, but both spamlist.db - and goodlist.db are present, bogofilter will use the separate wordlist - mode. If no wordlists are present, bogofilter will create wordlist.db - and use it. 
+MAJOR CHANGES IN BOGOFILTER 0.15 +================================ - Command line switches '-W' and '-WW' can be used to tell bogofilter - the mode you want. Also config file options "wordlist_mode=combined" - and "wordlist_mode=separate" can be used. +With release 0.15, bogofilter's code for processing multiple messages +has been rewritten. In addition to understanding mbox format files, +bogofilter now understands maildirs and MH folders. - Upgrading from an old bogofilter environment with its two wordlists - (spamlist.db and goodlist.db) to the new 0.14.x environment with its - single, combined wordlist.db involves 3 main steps - dumping the - current spamlist.db and goodlist.db files, formatting that output, and - then loading the data into a new file wordlist.db. Script bogoupgrade - is included with bogofilter and performs the task. Use command - "bogoupgrade -d /path/to/your/wordlists" to do the upgrade. After - running it, your BOGOFILTER_DIR will contain all 3 database files. - When started, bogofilter checks for wordlist.db and will use it. +######################################################################## + +INCOMPATIBLE CHANGES IN BOGOFILTER 0.14 +======================================= - Also, exit codes returned by bogofilter have been expanded. They are: +The exit codes returned by bogofilter have been expanded. They are: Spam = 0 -- unchanged Ham = 1 -- unchanged Unsure = 2 -- *NEW* Error = 3 -- *CHANGED* - NOTE: See the CHANGES-0.14 document for a list of all the changes. - - ####### RELEASE.NOTES-0.13 ####### - NOTE: Please also see the CHANGES-0.13 document for a detailed summary. +MAJOR CHANGES IN BOGOFILTER 0.14 +================================ - With release 0.13, bogofilter's parsing has changed. As background, - Paul Graham has done work to improve the results of his bayesian - filter and has published them in "Better Bayesian Filtering" at - http://www.paulgraham.com/better.html. 
He found the following - definition of a token to be beneficial: +Bogofilter 0.14 now supports TDB (Trivial Data base). + +Instead of separate wordlists for spam and ham tokens, bogofilter can +now use a single combined, wordlist that stores both all tokens. +In the combined wordlist each token contains two counts - for spam and +ham. The name of the new file is wordlist.db. + +However, this change broke the early versions (up to and including +0.14.2) of bogofilter. You should use at least bogofilter 0.14.3. + +Bogofilter will check in $BOGOFILTER_DIR and use the wordlist(s) that +are there. If wordlist.db is present, bogofilter will use the combined +mode. If wordlist.db is not present, but both spamlist.db and +goodlist.db are present, bogofilter will use the separate wordlist mode. +If no wordlists are present, bogofilter will create wordlist.db and use +it. + +Command line switches '-W' and '-WW' can be used to tell bogofilter the +mode you want. Also config file options "wordlist_mode=combined" and +"wordlist_mode=separate" can be used. + +Upgrading from an old bogofilter environment with its two wordlists +(spamlist.db and goodlist.db) to the new 0.14.x environment with its +single, combined wordlist.db involves 3 main steps - dumping the current +spamlist.db and goodlist.db files, formatting that output, and then +loading the data into a new file wordlist.db. The script "bogoupgrade" is +included with bogofilter and performs the task. Use command +"bogoupgrade -d /path/to/your/wordlists" to do the upgrade. After +running it, your BOGOFILTER_DIR will contain all 3 database files. When +started, bogofilter checks for wordlist.db and will use it. + +######################################################################## + +INCOMPATIBLE CHANGES IN BOGOFILTER 0.13 +======================================= + +With release 0.13, bogofilter's parsing has changed. 
As background, +Paul Graham has done work to improve the results of his bayesian filter +and has published them in "Better Bayesian Filtering" at +http://www.paulgraham.com/better.html. He found the following +definition of a token to be beneficial: 1. Case is preserved. 2. Exclamation points are constituent characters. 3. Periods and commas are constituents if they occur between two - digits. This lets me get ip addresses and prices intact. + digits. This lets me get ip addresses and prices intact. 4. A price range like $20-25 yields two tokens, $20 and $25. 5. Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. - Bogofilter has always done #3 and has tagged for Subject lines for a - while. Its parser now does all of these things. Several command line - switches and config file options have been added to allow enabling or - disabling them. Here are the new switches and options: +Bogofilter has always done #3 and has tagged for Subject lines for a +while. Its parser now does all of these things. Several command line +switches and config file options have been added to allow enabling or +disabling them. Here are the new switches and options: -Pi/-PI ignore_case default - disabled -Ph/-PH header_line_markup default - enabled -Pt/-PT tokenize_html_tags default - enabled - The options can be enabled using the lower case switch or disabled - using the upper case switch. - - When header_line_markup_is enabled, tokens in To:, From:, Subject:, - and Return-Path: lines are prefixed by "to:", "from:", "subj:", and - "rtrn:" respectively. +The options can be enabled using the lower case switch or disabled using +the upper case switch. - When tokenize_html_tags_is enabled, tokens in A, IMG, and FONT tags - are scored while classifying the message. +When header_line_markup_is enabled, tokens in To:, From:, Subject:, and +Return-Path: lines are prefixed by "to:", "from:", "subj:", and "rtrn:" +respectively. 
- NOTE: +When tokenize_html_tags_is enabled, tokens in A, IMG, and FONT tags are +scored while classifying the message. - To take full advantage of these changes, additional training of - bogofilter is necessary. +NOTE: To take full advantage of these changes, additional training of +bogofilter is necessary. Here's why: - Here's why: +With bogofilter's use of upper and lower case, the wordlists won't match +as many words as before. For example, "From" and "from" both used to +match "from", but this is no longer the case. As additional training is +done, words like these will be added to the wordlists and bogofilter +will have a larger number of distinct tokens to use when classifying +messages. This will improve its classification accuracy. - With bogofilter's use of upper and lower case, the wordlists won't - match as many words as before. For example, "From" and "from" both - used to match "from", but this is no longer the case. As additional - training is done, words like these will be added to the wordlists and - bogofilter will have a larger number of distinct tokens to use when - classifying messages. This will improve its classification accuracy. +Similarly, the use of header_line_markup will tokenize "Subject: great +p0rn site" as "subj:great", "subj:p0rn", and "subj:site". At first +these tokens won't be recognized, so bogofilter won't use them to score +the message. After being trained, bogofilter will have these additional +tokens to aid in the classification process. - Similarly, the use of header_line_markup will tokenize "Subject: great - p0rn site" as "subj:great", "subj:p0rn", and "subj:site". At first - these tokens won't be recognized, so bogofilter won't use them to - score the message. After being trained, bogofilter will have these - additional tokens to aid in the classification process. 
+########################################################################
- ####### RELEASE.NOTES-0.12 #######
-
- Bogofilter 0.12.0 includes a new file, bogofilter-tuning.HOWTO.
- It's in the bogofilter/doc directory and replaces README.Robinson.
+MAJOR CHANGES IN BOGOFILTER 0.12
+================================
- Directory bogofilter/tuning has been added and contains scripts for
- running tuning experiments as described in the new HOWTO. See file
- bogofilter/tuning/README for more information.
+Directory bogofilter/tuning has been added and contains scripts for
+running tuning experiments as described in the new HOWTO. See file
+bogofilter/tuning/README for more information.
- Bogofilter's man page and help message describe the many command line
- switches. They have been divided into groups (help, classification,
- registration, general, algorithm, parameter, and info) in both
- places.
+Bogofilter's man page and help message describe the many command line
+switches. They have been divided into groups (help, classification,
+registration, general, algorithm, parameter, and info) in both places.
- Bogofilter 0.12.0 has three new command line switches for rapidly
- scoring large numbers of messages. These "bulk mode" switches are
- especially useful for the tuning process. The new switches are:
+Bogofilter 0.12.0 has three new command line switches for rapidly
+scoring large numbers of messages. These "bulk mode" switches are
+especially useful for the tuning process. The new switches are:
 -M - allows scoring all the messages in a mbox formatted file. If
 used with "-v", an X-Bogosity line is printed as each message is
@@ -544,46 +429,57 @@
 "ls Maildir/* | bogofilter -b ..."
 If used with "-v", the file name is included in each printed line.
 Using "-t" is recommended.
- New script bogolex.sh converts an email to a special file format that
- contains the information needed by bogofilter to score the email.
- Its use speeds up the message scoring done by the tuning scripts. The
- script is described in more detail in bogofilter/tuning/README.
+New script bogolex.sh converts an email to a special file format that
+contains the information needed by bogofilter to score the email. Its
+use speeds up the message scoring done by the tuning scripts. The
+script is described in more detail in bogofilter/tuning/README.
+
+########################################################################
- ####### RELEASE.NOTES-0.11 #######
+INCOMPATIBLE CHANGES IN BOGOFILTER 0.11
+=======================================
- Command line flags
-
- The meaning of command line flags '-S' and '-N' was changed in
- version 0.11.0. Previously '-S' meant to unregister a message
- from the spam wordlist and register the message in the
- non-spam wordlist and '-N' meant to unregister from non-spam
- and register as spam.
+Command line flags
+------------------
- Each of the flags now performs a single action.
- '-S' unregisters a message from the spam wordlist and
+The meaning of command line flags '-S' and '-N' was changed in version
+0.11.0. Previously '-S' meant to unregister a message from the spam
+wordlist and register the message in the non-spam wordlist and '-N'
+meant to unregister from non-spam and register as spam.
+
+Each of the flags now performs a single action.
+
+ '-S' unregisters a message from the spam wordlist and
 '-N' unregisters a message from the non-spam wordlist.
- To duplicate the old (compound) actions, it is necessary to
- use two options - an unregister option ('-S' or '-N') and a
- register option ('-s' or '-n').
+To duplicate the old (compound) actions, it is necessary to use two
+options - an unregister option ('-S' or '-N') and a register option
+('-s' or '-n').
- To duplicate the effect of the old '-S' option, use '-N -s'.
- To duplicate the effect of the old '-N' option, use '-S -n'.
- The order of the options doesn't matter and they can be
- concatenated, as in '-Sn' and '-sN'.
+To duplicate the effect of the old '-S' option, use '-N -s'. To
+duplicate the effect of the old '-N' option, use '-S -n'. The order of
+the options doesn't matter and they can be concatenated, as in '-Sn' and
+'-sN'.
- Config file processing
- The code to process config files now checks numeric values
- for validity. It complains when it detects something
- wrong. In particular, double precision values are no longer
- allowed to have a terminal 'f'. For example
- "spam_cutoff=0.95f" will generate a messages.
+Config file processing
+----------------------
- New parameter query option
+The code to process config files now checks numeric values for validity.
+It complains when it detects something wrong. In particular, double
+precision values are no longer allowed to have a terminal 'f'. For
+example "spam_cutoff=0.95f" will generate a messages.
- Using options "-q -v" in a bogofilter command line will run
- the query_config() function and will display bogofilter's
- various parameter values. This can be very useful in finding
- the reason for an unexpected message classification.
+MAJOR CHANGES IN BOGOFILTER 0.11
+================================
+
+New parameter query option
+--------------------------
+
+Using options "-q -v" in a bogofilter command line will run the
+query_config() function and will display bogofilter's various parameter
+values. This can be very useful in finding the reason for an unexpected
+message classification.
+
+########################################################################

Index: Makefile.am
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/Makefile.am,v
retrieving revision 1.161
retrieving revision 1.162
diff -u -d -r1.161 -r1.162
--- Makefile.am 29 Oct 2004 01:18:16 -0000 1.161
+++ Makefile.am 9 Nov 2004 12:57:11 -0000 1.162
@@ -20,7 +20,7 @@
 GETTING.STARTED \
 README.cvs \
 RELEASE.NOTES \
- CHANGES-0.9x RELEASE.NOTES-0.93
+ CHANGES-0.9x
 .PHONY: check

--- RELEASE.NOTES-0.92 DELETED ---

--- RELEASE.NOTES-0.93 DELETED ---
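
Illustration of the header_line_markup behavior described in the RELEASE.NOTES text of this diff: a minimal Python sketch, not bogofilter's actual lexer. The function name mark_header_tokens, the prefix table layout, and the plain whitespace splitting are all assumptions made for the example; only the four prefixes and the "Subject: great p0rn site" case come from the release notes.

```python
# Hypothetical sketch of header_line_markup; bogofilter's real parser
# is not shown in this message and will differ in detail.

# Prefixes named in the release notes for the four marked headers.
HEADER_PREFIXES = {
    "to": "to:",
    "from": "from:",
    "subject": "subj:",
    "return-path": "rtrn:",
}


def mark_header_tokens(header_line, header_line_markup=True):
    """Split one header line into tokens, prefixing marked headers.

    Assumption: tokens are separated by whitespace. Token case is left
    untouched, matching "Case is preserved" in the notes.
    """
    name, sep, value = header_line.partition(":")
    if not sep:
        # No colon found: treat the whole line as the value.
        name, value = "", header_line
    tokens = value.split()
    prefix = HEADER_PREFIXES.get(name.strip().lower())
    if prefix is None or not header_line_markup:
        # Unmarked header, or markup disabled (the -PH case).
        return tokens
    return [prefix + token for token in tokens]


print(mark_header_tokens("Subject: great p0rn site"))
```

With markup enabled, the Subject line from the notes yields "subj:great", "subj:p0rn", and "subj:site"; with header_line_markup=False the same line yields the bare tokens, which is why a wordlist trained before the change will not match the prefixed tokens until it is retrained.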