You mention in another post that dbacl compares quite well with bogofilter. Could you share your experiences?
How many classifications per second have you found bogofilter to achieve in comparison to dbacl with only 2 categories?
If the number of categories were to increase (thus invoking bogofilter multiple times to classify 1 document against said categories), what advantages does dbacl show from its ability to classify in one go (accuracy/speed)?
:o)
Thanks for your thoughts!
It is hard to evaluate some of the classifiers when there don't seem to be any benchmarks I can find.
Jamie.
Sure, just ask the right questions :)
<p>
Comparing bogofilter and dbacl and other filters is tricky, because
they don't all do the same things in the same ways. Also, most people are
interested in accuracy comparisons, which need different kinds of benchmarks.
<p>
To answer your question, you can't compare dbacl and bogofilter fully
for both learning and classifying. bogofilter can learn incrementally, while
dbacl can't, for several reasons.
<p>
You can compare the two programs on tasks they can both do: learning a
large set of documents from scratch in one go, and classifying a large
set of documents individually. On those two tasks, I found both
programs to be roughly similar in speed, but I haven't kept track of
bogofilter for a long time.
<p>
For dbacl, on a PIII/500 test machine, here are some numbers:
<pre>
% ls -lh mail/spam.1
-rw-r--r-- 1 laird laird 150M Dec 4 15:29 mail/spam.1
% time dbacl -l dummy -T email -H 20 mail/spam.1
real 2m12.703s
user 1m46.433s
sys 0m1.581s
% time dbacl -c dummy mail/spam.1
real 1m20.546s
user 1m4.873s
sys 0m1.412s
</pre>
<p>
The spam collection has about 19000 spams in it, with 428877 unique
tokens and 13950929 total tokens. On this test, dbacl parsed about
230000 tokens per second, or about 300 spam messages per second, since
there was no serious I/O to be done. Of course this is the best case: no
repeated opening of messages, no loading of categories, etc. If you need
to do all those things, you can handle at best about
100 emails per second with the right switches, but that would take
much longer to explain here.
<p>
One more thing to note: the learning time consists of parsing time plus
an optimization step. The parsing time is roughly the same for both classifying
and learning, but learning needs an extra 40 seconds to compute various
quantities.
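<p>
As a rough sanity check, the throughput and the learning overhead can be
recomputed from the timings in the listing. This is only a sketch: it assumes
the user CPU time is the right figure to divide by, takes the message count
as the rounded 19000, and the helper name to_seconds is made up for the
example. The results land in the same range as the figures quoted above.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '1m4.873s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


TOTAL_TOKENS = 13950929   # total tokens in mail/spam.1, from the post
MESSAGES = 19000          # "about 19000 spams", a rounded figure

# Assumption: user CPU time is used, since it excludes waiting on the disk.
classify_user = to_seconds("1m4.873s")    # dbacl -c run
learn_user = to_seconds("1m46.433s")      # dbacl -l run

tokens_per_second = TOTAL_TOKENS / classify_user
messages_per_second = MESSAGES / classify_user
learning_overhead = learn_user - classify_user  # the extra optimization step

print(f"tokens/s:   {tokens_per_second:,.0f}")
print(f"messages/s: {messages_per_second:.0f}")
print(f"extra learning time: {learning_overhead:.1f} s")
```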
<p>
The theoretical cost for classification is O(kn), where k is the number of
categories and n is the number of tokens parsed. You would need to add to
that loading time for the categories, but it's negligible if the categories are
static.
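<p>
To make the O(kn) cost concrete, here is a toy sketch of classifying one
message against k categories in a single pass. It is not dbacl's actual
model: it assumes a plain smoothed unigram score per category, and the
function names, categories and training data are all made up for the example.
The point is only that the message is parsed once and each token costs one
lookup per category.

```python
import math
from collections import Counter


def train(docs):
    """Build a smoothed unigram log-probability table from token lists."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    table = {tok: math.log((c + 1) / (total + vocab)) for tok, c in counts.items()}
    unseen = math.log(1 / (total + vocab))
    return table, unseen


def classify(tokens, categories):
    """Score tokens against every category in one pass.

    Returns the best category name and the number of table lookups done.
    """
    scores = {name: 0.0 for name in categories}
    lookups = 0
    for tok in tokens:                                    # n tokens, parsed once
        for name, (table, unseen) in categories.items():  # k categories
            scores[name] += table.get(tok, unseen)
            lookups += 1
    best = max(scores, key=scores.get)
    return best, lookups


# Hypothetical categories, already "learned" and held in memory.
categories = {
    "spam": train([["buy", "cheap", "pills"], ["cheap", "loans", "now"]]),
    "work": train([["meeting", "at", "noon"], ["project", "status", "report"]]),
    "personal": train([["dinner", "on", "friday"], ["happy", "birthday"]]),
}

message = ["cheap", "pills", "now"]
best, lookups = classify(message, categories)
print(best, lookups)  # lookups equals k * n
```

Running a two-category filter once per category would reach a comparable
number of lookups, but it would parse the message k times instead of once.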
<p>
Finally, these are best-case scenarios, such as occur when doing
a pure laboratory classification test. If you intend to constantly learn
and classify every possible message, then the learning overhead kills the
performance completely, and you can do at best only a couple of learning/classification cycles per second.