effective size factor implementation

2006-06-02
2013-04-11
  • Tearfang Wolf
    2006-06-02

    As I read the source code, you use one effective size factor for both spam and ham. As I understand Gary Robinson’s paper, which introduced the idea, he used separate factors for ham and spam. Is there research since then showing that the values are usually the same, or am I missing something?

     


    • 2006-08-14

      My most sincere apologies for the extreme delay in response.  I am still here... just ridiculously busy...

      Your reading is correct.  The ESF (Effective Size Factor) is a method of compensating for token redundancy in the email.  As you have pointed out, this may differ (on average) between HAM and SPAM. 
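
      To make the question concrete, here is a rough sketch of how separate factors could enter a Robinson-style chi-square combination.  It is not the jASEN source: it assumes the ESF scales both the chi-square statistic and its degrees of freedom, and the names robinson_score, spam_esf and ham_esf are made up for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable.

    Works for non-integer df, which is needed once df is scaled by an ESF.
    Uses the power series for the regularized lower incomplete gamma
    function P(a, z) with a = df/2 and z = x/2, and returns 1 - P.
    Intended for moderate x only (a sketch, not a numerics library).
    """
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:
        term *= z / (a + k)
        total += term
        k += 1
    log_p = a * math.log(z) - z - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))


def robinson_score(probs, spam_esf=1.0, ham_esf=1.0):
    """Combine per-token spam probabilities into one score in [0, 1].

    probs must lie strictly between 0 and 1.  Assumption: each ESF
    multiplies both the statistic and the degrees of freedom of its own
    chi-square test, so an ESF below 1 treats n tokens as fewer
    independent pieces of evidence.
    """
    n = len(probs)
    ln_ham_product = sum(math.log(p) for p in probs)
    ln_spam_product = sum(math.log(1.0 - p) for p in probs)
    # Spam indicator: large when the tokens are collectively spammy.
    s = 1.0 - chi2_sf(-2.0 * spam_esf * ln_spam_product, 2.0 * n * spam_esf)
    # Ham indicator: large when the tokens are collectively hammy.
    h = 1.0 - chi2_sf(-2.0 * ham_esf * ln_ham_product, 2.0 * n * ham_esf)
    return (s - h + 1.0) / 2.0
```

      With both factors left at 1.0 this reduces to the plain chi-square combination.  Lowering spam_esf alone makes a run of spammy tokens count as weaker evidence, which is the kind of per-class tuning the question is about.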

      In the tests we ran, changing the ESF during engine training did not appear to have a significant impact on the overall effectiveness of the engine.  In fact, the presence of token redundancy did not appear to have a significant impact itself.

      I believe this is due to a few reasons.  First, in order to get accurate results, it is best to train the engine with a roughly equal number of ham and spam emails (the engine shipped with the 0.9 release was trained with about 10,000 of each).  The more emails used to train, the less significant the impact of token redundancy on the overall result.

      Secondly, one of the optional parameters when training sets the maximum number of characters to use from the email when tokenizing.  We found that most spam tries to convey its message in the first few lines (excluding noise words etc).  This often meant that the most spammy words were at the beginning of the email.  In fact, simply looking at the first few lines is often enough to determine the score.  In this case, the token redundancy is minimised.  Of course, there is a balance between the number of tokens being too large, which creates redundancy, and too small, which does not give enough information.
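
      As a rough illustration of the character limit, the sketch below truncates a message before tokenizing and measures how many tokens are repeats.  The tokenizer, the max_chars parameter name and the redundancy measure are all assumptions made for the example; they are not taken from jASEN.

```python
import re


def tokenize(text, max_chars=None):
    """Split text into lowercase word tokens.

    max_chars is a hypothetical stand-in for the "maximum number of
    characters" training option: only the first max_chars characters
    of the message are tokenized when it is set.
    """
    if max_chars is not None:
        text = text[:max_chars]
    return re.findall(r"[a-z0-9$']+", text.lower())


def redundancy(tokens):
    """Fraction of tokens that repeat an earlier token (0.0 = all unique)."""
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)


# A made-up message: the pitch is in the first 20 characters,
# the rest is repetition.
body = "buy cheap pills now " + "cheap pills " * 20

head_tokens = tokenize(body, max_chars=20)
all_tokens = tokenize(body)
```

      Here the truncated token list has no repeats at all, while the full message is mostly repeated tokens, so the limit removes most of the redundancy an ESF would otherwise have to compensate for.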

      The reality is that there are a whole bunch of other things which affect the ultimate score of the email more than ESF.  Just accurate tokenizing of an email is the hardest part (bypassing all the spammer tricks etc).

      We have implemented jASEN in a commercial product; however, the RobinsonScanner is only one of many plugins which act upon the email.  It alone is accurate in about 90% of cases, but in order to push it up to 99%+ you need to look at additional techniques (RBLs, whitelisting etc).

      Your point is valid, however, and the user really ought to be given the choice themselves.

      I will add this to the next release (whenever that is).

      P.S.

      Several improvements to the initial release have been made, but are not yet tested thoroughly.  I am really hoping to deploy a new release soon... Just as soon as I get some sleep!