#36 About lowercasing

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2003-02-23
Created: 2003-02-23
Creator: Ache
Private: No

Hi. Why do you lowercase words before adding them to the
databases? It can lower the spam detection probability,
because spam very often capitalizes (i.e., emphasizes) words
that normal mail does not.

Discussion

  • David Relson

    David Relson - 2003-02-23

    Logged In: YES
    user_id=30510

    The decision to lowercase words was made long ago, before I
    got involved with bogofilter. I view it as a trade-off of
    accuracy vs. speed and wordlist size.

    If bogofilter were case-sensitive, then the wordlist would
    likely contain "The", "the", and "THE", which is (perhaps) a
    bit much. Evaluating a message containing all three
    capitalizations would require three database accesses
    instead of one.

    Along similar efficiency lines, bogofilter ignores
    repetitions of a word in a message. One could argue that a
    message that says "sex, sex, sex" is spammier than one that
    simply says "sex".

    Anyhow, bogofilter is case-insensitive and is likely to stay
    that way. If you are seriously interested, modify your copy
    to preserve case and run some tests to see if it does better
    than the released version.

    Also, I suggest you subscribe to the mailing list by sending
    a message to "bogofilter-subscribe@aotto.com".

    Enjoy!

    David

     
  • David Relson

    David Relson - 2003-02-23
    • status: open --> closed
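
For anyone who wants to try the experiment suggested above, here is a minimal sketch of the trade-off, written in Python rather than bogofilter's C. The `tokenize` helper, its `fold_case` and `keep_repeats` parameters, and the sample message are all hypothetical and do not reflect bogofilter's actual lexer. It only counts how many distinct tokens (and so, roughly, how many wordlist lookups) a message produces with and without case folding.

```python
import re


def tokenize(text, fold_case=True, keep_repeats=False):
    """Split text into word tokens (illustrative only, not bogofilter's lexer)."""
    words = re.findall(r"[A-Za-z']+", text)
    if fold_case:
        # Case folding: "The", "the" and "THE" all become "the".
        words = [w.lower() for w in words]
    if keep_repeats:
        return words
    # dict.fromkeys drops repeated tokens while keeping first-seen order.
    return list(dict.fromkeys(words))


message = "The offer: THE best price, the lowest. Sex, sex, SEX"

folded = tokenize(message)                  # case-insensitive, repeats ignored
cased = tokenize(message, fold_case=False)  # case preserved, repeats ignored

print(len(folded), folded)
print(len(cased), cased)
```

On this sample message the folded tokenizer yields 6 distinct tokens and the case-preserving one yields 10, which is the wordlist growth and extra lookup cost described in the reply. Whether the extra tokens actually improve spam detection is something only a test on real mail can show.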
     
