#7 word detection improvements

David Saez

The ability to define other characters that act as word boundaries, like -, _ or '.
For example, interest_rate and interest-loans are each detected as one word, but they are
two words; likewise internet's or d'acord should be detected as internet or acord (all
examples are from real bogofilter databases). Care should be taken, as these are valid
characters in URLs and filenames, which should not be split on these characters.
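
A rough sketch of the splitting rule described above, written in Python for brevity rather than taken from bogofilter's own code. Everything named here (EXTRA_BOUNDARIES, MIN_FRAGMENT_LEN, split_token, and the two regular expressions) is hypothetical, and the URL and filename checks are deliberately crude heuristics, not a real parser.

```python
import re

# Hypothetical set of extra boundary characters; not an existing bogofilter option.
EXTRA_BOUNDARIES = "-_'"

# Assumption: fragments shorter than this are dropped, so that
# "internet's" yields "internet" and "d'acord" yields "acord".
MIN_FRAGMENT_LEN = 2

# Crude heuristics (assumptions) for tokens that must stay whole.
URL_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)
FILENAME_RE = re.compile(r"\.[A-Za-z0-9]{1,4}$")


def split_token(token, boundaries=EXTRA_BOUNDARIES, min_len=MIN_FRAGMENT_LEN):
    """Split a token on the extra boundary characters.

    Tokens that look like a URL or a filename are returned unchanged.
    """
    if URL_RE.match(token) or FILENAME_RE.search(token):
        return [token]
    parts = re.split("[" + re.escape(boundaries) + "]+", token)
    return [part for part in parts if len(part) >= min_len]
```

With these assumptions, split_token("interest_rate") gives two words, while a token starting with a scheme such as http:// or ending in a short extension such as .pdf is left alone. A token made only of boundary characters yields no words at all under this rule.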

We also have lots of entries in both databases that consist of repetitions of the -
character (like -------, ---------------, etc.) that could be treated as the same type
of word. These 'words' also fool bogofilter's HTML tag detection; we have some
words like this in our database:


also, other words like this MIME boundary are detected:


and are filling the database with lots of boundary tags with random data to the
right of the =, with a low (almost always 1) word count.

The same goes for the random stuff spammers put at the end of the message to fool
Razor: a threshold could be defined so that words longer than x characters are held
in the database as (e.g.) VERY_LONG_WORD, so all this random stuff does not fill
the database and can be counted as the same, generic, word.
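
The threshold idea, together with the earlier point about runs of the - character, could look roughly like this. It is an illustrative Python sketch, not bogofilter code: the threshold of 30, the DASH_RUN placeholder and the function name normalize_token are all assumptions; only VERY_LONG_WORD comes from the proposal above.

```python
import re

MAX_WORD_LEN = 30                    # hypothetical value for "x" in the proposal
LONG_WORD_TOKEN = "VERY_LONG_WORD"   # placeholder name taken from the proposal
DASH_RUN_TOKEN = "DASH_RUN"          # hypothetical placeholder for runs of '-'

DASH_RUN_RE = re.compile(r"-{2,}")


def normalize_token(token, max_len=MAX_WORD_LEN):
    """Map dash runs and over-long words to shared generic tokens."""
    if DASH_RUN_RE.fullmatch(token):
        # "-------" and "---------------" become the same word.
        return DASH_RUN_TOKEN
    if len(token) > max_len:
        # Random filler longer than the threshold is counted as one word.
        return LONG_WORD_TOKEN
    return token
```

The dash check runs first so that a very long run of dashes is still counted as a dash run rather than as a long word. Ordinary tokens pass through unchanged.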


    David Relson - 2003-01-20
    • status: open --> closed
    Bogofilter, as of version 0.10.0, has a number of these
    items fixed. Boundary tags no longer go into the wordlist.
    The new database maintenance capability allows discarding
    words that are short or long, have low counts, or are too old
    (using values you provide for the decisions).

    Having user-specified characters for word boundaries is not
    included. If you want it, feel free to implement it and
    contribute the code ;-)

    You mention having tokens like <br> in your wordlists.
    Since '<' and '>' aren't allowed in tokens, they shouldn't
    (can't) exist. If you can make that happen with a current
    release, I'd like to know.



