
Performance

2005-03-29
2013-04-15
  • asdsadsadsa

    asdsadsadsa - 2005-03-29

    Hi

    You mention in another post that dbacl compares quite well with bogofilter. Could you share your experiences?

    How many calculations per second have you found bogofilter to achieve in comparison to dbacl with only 2 categories?

    If the number of categories were to increase (thus invoking bogofilter multiple times to classify 1 document against said categories), what advantages does dbacl show from its ability to classify in one go (accuracy/speed)?

    :o)

    Thanks for your thoughts!

    It is hard to evaluate some of the classifiers when there don't seem to be any benchmarks I can find.

    Jamie.

     
    • Laird Breyer

      Laird Breyer - 2005-03-30

      Sure, just ask the right questions :)
      <p>
      Comparing bogofilter and dbacl and other filters is tricky, because
      they don't all do the same things in the same ways. Also, most people are
      interested in accuracy comparisons, which need different kinds of benchmarks.
      <p>
      To answer your question, you can't compare dbacl and bogofilter fully
      for both learning and classifying. bogofilter can learn incrementally, while
      dbacl can't, for several reasons.
      <p>
      You can compare the two programs on tasks they can both do: learning a
      large set of documents from scratch in one go, and classifying a large
      set of documents individually. On those two tasks, I found both
      programs to be roughly similar in speed, but I haven't kept track of
      bogofilter for a long time.
      <p>
      For dbacl, on a PIII/500 test machine, here are some numbers:
      <pre>
      % ls -lh mail/spam.1
      -rw-r--r--    1 laird    laird        150M Dec  4 15:29 mail/spam.1
      % time dbacl -l dummy -T email -H 20 mail/spam.1

      real    2m12.703s
      user    1m46.433s
      sys     0m1.581s
      % time dbacl -c dummy mail/spam.1

      real    1m20.546s
      user    1m4.873s
      sys     0m1.412s
      </pre>
      <p>
      The spam collection has about 19000 spams in it, with 428877 unique
      tokens and 13950929 total tokens. On this test, dbacl parsed about
      230000 tokens per second, since there was no serious I/O to be done,
      or about 300 spam messages per second. Of course this is the best case: no
      repeated opening of messages, no loading of categories, etc. If you need
      to do all those things, then in a best case scenario you can handle about
      100 emails per second with the right switches, but those would take
      much longer to explain here.
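As a rough cross-check of the figures above, the sketch below converts the `time` output of the `dbacl -c` run into seconds and divides the quoted token and message counts by it. The helper names `parse_time` and `rates` are made up for illustration, the message count is only "about 19000", and the post does not say which timer the quoted rates are based on, so both user and wall-clock times are shown.

```python
import re

def parse_time(text):
    """Convert a shell `time` value such as '1m4.873s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

TOTAL_TOKENS = 13950929   # total tokens quoted in the post
MESSAGES = 19000          # "about 19000" spams, so only approximate

# Timings of the classification run (dbacl -c dummy mail/spam.1).
classify_user = parse_time("1m4.873s")
classify_real = parse_time("1m20.546s")

def rates(elapsed):
    """Return (tokens per second, messages per second) for a run time."""
    return TOTAL_TOKENS / elapsed, MESSAGES / elapsed

for label, elapsed in (("user", classify_user), ("real", classify_real)):
    tokens_per_s, msgs_per_s = rates(elapsed)
    print(f"{label}: {tokens_per_s:,.0f} tokens/s, {msgs_per_s:.0f} messages/s")
```

Using user CPU time, this gives a little over 200000 tokens and just under 300 messages per second, in the same ballpark as the quoted figures. Wall-clock time gives somewhat lower rates.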
      <p>
      One more thing to note: the learning time consists of parsing time plus
      an optimization step. The parsing time is roughly the same for both classifying
      and learning, but learning needs an extra 40 seconds to compute various
      quantities.
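The "extra 40 seconds" can be recovered from the transcript earlier in this post. This is a small arithmetic sketch, assuming the user CPU time of the classification run is a fair stand-in for pure parsing time; the variable names are illustrative only.

```python
# User CPU times taken from the `time` output quoted earlier in this post.
learn_user = 1 * 60 + 46.433      # dbacl -l dummy -T email -H 20 mail/spam.1
classify_user = 1 * 60 + 4.873    # dbacl -c dummy mail/spam.1

# Assumption: classifying costs about the same as parsing alone, so the
# difference approximates the optimization step done only while learning.
optimization_overhead = learn_user - classify_user
share = optimization_overhead / learn_user

print(f"extra learning time: {optimization_overhead:.1f} s")
print(f"share of learning run: {share:.0%}")
```

The difference comes out at a bit over 41 seconds, or a little under 40% of the learning run, which is consistent with the figure quoted above.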
      <p>
      The theoretical cost for classification is O(kn), where k is the number of
      categories and n is the number of tokens parsed. You would need to add to
      that the loading time for the categories, but it's negligible if the categories are
      static.
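To illustrate where the O(kn) cost comes from, here is a minimal sketch of scoring one token stream against k categories in a single pass. It is a toy unigram model with add-one smoothing, not dbacl's actual model or code, and the names `train`, `classify` and the sample categories are all hypothetical.

```python
import math
from collections import Counter

def train(documents, smoothing=1.0):
    """Build a toy unigram model (token -> log-probability) from token lists."""
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    denom = total + smoothing * vocab
    model = {t: math.log((c + smoothing) / denom) for t, c in counts.items()}
    unseen = math.log(smoothing / denom)
    return model, unseen

def classify(tokens, categories):
    """Score the tokens against every category in one pass: O(k * n)."""
    scores = {name: 0.0 for name in categories}
    for token in tokens:                                   # n tokens
        for name, (model, unseen) in categories.items():   # k categories
            scores[name] += model.get(token, unseen)
    return max(scores, key=scores.get), scores

# Tiny made-up training sets, one per category.
categories = {
    "spam": train([["buy", "cheap", "pills"], ["cheap", "loans", "now"]]),
    "work": train([["meeting", "agenda", "notes"], ["project", "meeting", "now"]]),
    "family": train([["dinner", "sunday", "kids"], ["kids", "school", "photos"]]),
}

best, scores = classify(["cheap", "pills", "now"], categories)
print(best, scores)
```

In this sketch the document is tokenized once and each token is looked up in every category, so adding a category adds one dictionary lookup per token rather than another full parse of the message.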
      <p>
      Finally, these are best case scenarios, such as occur when doing
      a pure laboratory classification test. If you intend to constantly learn
      and classify every possible message, then the learning overhead kills the
      performance completely, and you can do at best only a couple of
      learn/classify cycles per second.

       
