Annoyance Filter

File Release Notes and Changelog

Release Name: 0.1-RC6

Notes:
Release 0.1-RC6 adds the ability to classify messages based on
multi-word phrases as well as individual words.  Memory-mapped
I/O can be used (on platforms which support it) to permit shared
access to the dictionary for high volume applications.  Error checking
for inconsistent option settings is improved, and an error in computing
the number of messages in a folder when some messages contained
MIME attachments has been corrected.

Changes:
\date{2002 October 19} Added a check in |classifyMessages| to verify that a dictionary has been loaded before attempting to classify a message. If no dictionary is present, a warning is written to standard error and the junk probability is returned as 0.5. Added a warning if command line arguments are specified after a \.{--classify} command. Since this command always exits with an exit code indicating the classification, specifying subsequent arguments is always an error. Added a bunch of consistency checking for combinations of options which don't make any sense and suggest the user doesn't understand in which order they should be specified. To facilitate this, I modified the code for the \.{--classify} option to set a new |lastOption| flag to bail out of the option processing loop and set |exitStatus| to the classification, rather than exiting directly before the option consistency checks are performed. This cleans up the control structure in any case. In the process of adding the above code, I discovered that the |any()| method of |bitset| seems to be broken in the \.{glibc} which accompanies \.{gcc} 2.96. I tested |count()| against zero and that seems to work OK. Implemented phrase tokens. You can consider phrases of consecutive tokens as primitive tokens by specifying the minimum and maximum number of words composing a phrase with the \.{--phrasemin} and \.{--phrasemax} options. These default to 1 and 1, which suppresses all phrase-related flailing around. If set otherwise, tokens are assembled into a queue and all phrases within the length bounds are emitted as tokens. How well this works is a research question we may now address with the requisite tool in hand.
\date{2002 October 20} Added code to import a binary dictionary file with the \.{--read} option using memory-mapped I/O if \.{./configure} detects that facility and defines \.{HAVE\_MMAP}.
This isn't a big win on individual runs of the program, but if you're installing it on a high volume server, multiple read-only references to the dictionary file (be sure to make the file read-only, by the way) can simply bring the file into memory where it is re-used by multiple instances of the program. (Of course, if the system has an efficient file system cache, that may work just as well, but there's no harm in memory mapping in any case.) Thanks to the \CPP/ theologians who deprecated the incredibly useful |strstream| facility, which is precisely what you need to efficiently access a block of memory mapped data as a stream, I included a copy of the definition of this facility in \.{mystrstream.h} so we don't have to depend on the \CPP/ library providing it. I was a little worried about writing phrases in CSV format without quoting the fields, but I did an experiment with Excel and discovered it doesn't quote such fields either---it only uses quotes if the cell contains a comma or a quote (in which case it forces the quote by doubling it). Since our token definition doesn't permit either a comma or a quote within a token, we're still safe.
\date{2002 October 21} Added a \.{--phraselimit} option to discard phrases longer than the specified limit on the fly. This prevents dictionary bloat due to ``phrases'' generated by concatenation of gibberish from headers and strings decoded from binary attachments. These will usually be eliminated by a \.{--prune}, but that doesn't help if the swap file's already filled up with garbage phrases before reaching the end of the mail folder. The default \.{--phraselimit} is 0, which imposes no limit on the length of phrases.
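The phrase-token scheme described above (a queue of recent tokens, with every phrase inside the word-count bounds emitted as a token, and over-long phrases discarded on the fly) might be sketched roughly as follows. This is only an illustration, not the code in annoyance-filter: the class name |PhraseAssembler| and its members are hypothetical, and it assumes that words in a phrase are joined by a single space and that the \.{--phraselimit} value counts characters in the assembled phrase.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical helper; names and details are assumptions, not the
// annoyance-filter implementation.
class PhraseAssembler {
public:
    // minWords/maxWords mirror --phrasemin/--phrasemax; limit mirrors
    // --phraselimit, where 0 means "no limit" (assumed to count characters).
    PhraseAssembler(std::size_t minWords, std::size_t maxWords,
                    std::size_t limit = 0)
        : minWords_(minWords), maxWords_(maxWords), limit_(limit) {
        assert(minWords >= 1 && minWords <= maxWords);
    }

    // Add one token and return every phrase that ends with it and
    // falls within the configured bounds.
    std::vector<std::string> push(const std::string &token) {
        window_.push_back(token);
        if (window_.size() > maxWords_) {
            window_.pop_front();        // keep at most maxWords tokens queued
        }

        std::vector<std::string> phrases;
        std::string phrase;
        std::size_t words = 0;
        // Grow the phrase backwards from the newest token.
        for (auto it = window_.rbegin(); it != window_.rend(); ++it) {
            phrase = (words == 0) ? *it : *it + " " + phrase;
            ++words;
            if (limit_ != 0 && phrase.size() > limit_) {
                break;                  // longer phrases would only be longer still
            }
            if (words >= minWords_) {
                phrases.push_back(phrase);
            }
        }
        return phrases;
    }

private:
    std::size_t minWords_;
    std::size_t maxWords_;
    std::size_t limit_;
    std::deque<std::string> window_;    // most recent tokens, oldest first
};
```

With both bounds left at 1, each call to |push| simply returns the token itself, which matches the default behaviour of no phrase processing. Checking the limit while the phrase is being assembled, rather than afterwards, is what keeps garbage phrases from ever reaching the dictionary in this sketch.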
\date{2002 October 22} When the default |getNextEncodedLine| of a |MIMEdecoder| encountered the ``\.{From\ }'' line of the next message in a mail folder, it failed to store the line as the part boundary, which in turn caused |mailFolder| to mis-count the number of messages in a folder being parsed when training. I fixed this, and in the process re-wrote an archaic \CEE/ string test used in |@<Check for start of new message in folder@>| to use a proper \CPP/ |string| comparison. Corrected some ancient URLs in \.{README}, and added information on the SourceForge project there and in \.{annoyance-filter.manm}.