Date: Tue, 6 Oct 2009 15:26:53 +0200
From: Thomas Michael Hagen <thomas.michael.hagen@...>
How many errors must usually be corrected before crm reaches 80%, 90%,
95% and 99% accuracy?
Are we talking hundreds, thousands or tens of thousands?
Hundreds, or fewer, for 90%. On the SpamAssassin corpus, about 500
documents end up being trained, and we get about 98 to 99% accuracy on
the "tenfold validate torture test" with most of the classifiers, just
with judicious choices of learning method and threshold thickness.
Past that, I have fewer than 2000 documents *total* in
my Reavercache of known examples, have probably purged away
twice that many, and am pushing 99.99% accuracy on my real email.
Does the hyperspace take the order of features in a text into account,
and would it be useful to arrange the features in a text into
five-grams, or perhaps sparse bigrams, like OSBF does?
Somewhat. Default hyperspace uses sparse bigrams, like OSB and
OSBF do. Beyond that, it doesn't care about the order of features
(in fact, to optimize classification, it hashes them and then
sorts the hashes to get O(n) classification time).
- Bill Yerazunis
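
[Illustrative note, not part of the original message.] The hash-then-sort
idea in the reply above can be sketched roughly as follows. This is a
minimal sketch under stated assumptions, not CRM114's actual code: the
5-token window, the use of CRC32 as the hash, and the names
sparse_bigrams, hashed_sorted and shared_count are all hypothetical
choices made for the example. It shows that word order matters only
inside each sparse bigram, and that two sorted hash lists can be
compared in one linear pass.

```python
import zlib

WINDOW = 5  # assumed window length, chosen for illustration only


def sparse_bigrams(tokens, window=WINDOW):
    """Pair each token with each of the up to window-1 tokens before it.

    The gap between the two tokens is kept in the feature, so local
    order and distance survive even though the feature list as a whole
    is later treated as an unordered bag.
    """
    feats = []
    for i, tok in enumerate(tokens):
        for gap in range(1, window):
            j = i - gap
            if j < 0:
                break
            feats.append((tokens[j], gap, tok))
    return feats


def hashed_sorted(tokens):
    """Hash every sparse bigram (CRC32 is a stand-in) and sort the hashes."""
    hashes = [
        zlib.crc32(f"{a}|{gap}|{b}".encode("utf-8"))
        for a, gap, b in sparse_bigrams(tokens)
    ]
    hashes.sort()
    return hashes


def shared_count(xs, ys):
    """Count matching values in two sorted lists with one linear merge pass."""
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared


if __name__ == "__main__":
    known = hashed_sorted("buy cheap pills online now".split())
    unknown = hashed_sorted("buy cheap pills online today".split())
    print(shared_count(known, unknown), "shared features of", len(unknown))
```

Sorting is paid once per document; after that, each comparison against a
stored example walks both lists exactly once, which is where the linear
classification time comes from in this sketch.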