Re: [Classifier4j-devel] Dev Plan

> A couple of points:
> - Is there a reason why you've used tabs instead of spaces? Generally spaces
> are preferred; it's more standard. Some people may have their tab size set to
> 4 while others have it set to 8, etc. If you always convert tabs to spaces,
> it's always the same.

Yes, I've reset Eclipse to substitute spaces. Files are being fixed as I
check them in.

> - Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in-memory
> / disk Java-based database with a JDBC interface. It would be interesting to
> compare performance between different database solutions, e.g.
> JDBMWordsDataSource vs. JDBCWordsDataSource -> hsqldb vs.
> HibernateWordsDatabase -> hsqldb / mysql etc.
>

If you look at the code for the examples in Classifier4J-Optional, you'll
see some commented out code to use a JDBCWordsDataSource with HSQLDB. If I
use the training example, I get about 50 words per second with HSQLDB, but
with JDBM it takes less than 1 second for all 3000 words.

I'm using HSQLDB persistent tables and I'm not sure how often that writes to
disk - I'm pretty sure it's not after every update, because the HSQLDB
documentation talks about needing to do a CHECKPOINT to make sure it is
written. With JDBM I only commit at the end of the training session, so
that's a big speed win.

In the Analyser example, JDBM completes in less than 1 second, and HSQLDB
runs at about 80 words per second.

My patches to NNTP://RSS
(http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt)
use the HSQLDB database integrated in NNTP://RSS.

I've looked at Axion
(http://www.mackmo.com/nick/blog/java/?permalink=axion2.txt) in the past,
too.

> I'll look into the following:
> - Fix the following in BayesianClassifier
>   * @todo need an option to only use the "X" most "important" words when
> calculating overall probability
>   * "important" is defined as being most distant from NEUTRAL_PROBABILITY

Cool.
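
To make the @todo concrete, here is a minimal sketch of picking the "X" most
"important" word probabilities before combining them. It is not
Classifier4J's code: the class ImportantWords, both method names, the neutral
value of 0.5, and the naive-Bayes style combination in combine() are all
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Classifier4J.
final class ImportantWords {
    // Assumed neutral value; the real constant may differ.
    static final double NEUTRAL_PROBABILITY = 0.5;

    // Returns up to x probabilities, those furthest from NEUTRAL_PROBABILITY first.
    static List<Double> mostImportant(List<Double> probabilities, int x) {
        List<Double> sorted = new ArrayList<>(probabilities);
        sorted.sort(Comparator
                .comparingDouble((Double p) -> Math.abs(p - NEUTRAL_PROBABILITY))
                .reversed());
        int count = Math.max(0, Math.min(x, sorted.size()));
        return new ArrayList<>(sorted.subList(0, count));
    }

    // Assumed combination rule: product(p) / (product(p) + product(1 - p)).
    // An empty list yields the neutral value 0.5.
    static double combine(List<Double> probabilities) {
        double product = 1.0;
        double inverse = 1.0;
        for (double p : probabilities) {
            product *= p;
            inverse *= (1.0 - p);
        }
        return product / (product + inverse);
    }
}
```

With this shape, the overall probability would be computed as
combine(mostImportant(wordProbabilities, x)), so words close to neutral no
longer dilute the result.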

> - Look into the current Tokenizer - For example, "1.4" currently gets split
> into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into
> "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up
> with a set of test cases.

Yes, that needs fixing. Also, I'm not sure about how to deal with URLs: at
the moment http://www.google.com/something gets split up, but I think it
probably shouldn't (?)
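
As a starting point for those test cases, here is a minimal regex-based
sketch that keeps "1.4", "peter's" and a URL as single tokens. It is not the
current Tokenizer: the class SketchTokenizer and its pattern are assumptions,
and the choice to drop trailing punctuation from a URL is just one possible
rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for illustration; not part of Classifier4J.
final class SketchTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S*[^\\s.,;:!?)]"   // URL, minus trailing punctuation
            + "|\\d+(?:\\.\\d+)*"          // numbers such as 1, 1.4, 1.4.2
            + "|\\p{L}+(?:'\\p{L}+)*");    // words such as peter, peter's

    // Returns the tokens in the order they appear; unmatched characters
    // (whitespace, stray punctuation) are skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, tokenize("Java 1.4 and peter's page at
http://www.google.com/something.") should give Java, 1.4, and, peter's, page,
at, http://www.google.com/something. Whether a URL should stay whole is still
an open question; removing the first alternative from the pattern would split
it up again.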

> - Implement an HTML Tokenizer (depending on how it is configured, HTML tags
> will be either included or ignored).

Very good idea.
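
A rough sketch of how the configurable behaviour might look, assuming
"included" means each tag becomes a token of its own. The class
SketchHtmlTokenizer and the includeTags flag are hypothetical, and matching
tags with a regex is a simplification that will not cope with comments,
scripts, or a '>' inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical HTML tokenizer for illustration; not part of Classifier4J.
final class SketchHtmlTokenizer {
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Tokenizes the text between tags; tags themselves are added as tokens
    // only when includeTags is true.
    static List<String> tokenize(String html, boolean includeTags) {
        List<String> tokens = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        int pos = 0;
        while (tag.find()) {
            addWords(html.substring(pos, tag.start()), tokens);
            if (includeTags) {
                tokens.add(tag.group());
            }
            pos = tag.end();
        }
        addWords(html.substring(pos), tokens); // text after the last tag
        return tokens;
    }

    // Appends every run of letters or digits in text to out.
    private static void addWords(String text, List<String> out) {
        Matcher word = WORD.matcher(text);
        while (word.find()) {
            out.add(word.group());
        }
    }
}
```

With the input "<p>Hello <b>world</b></p>", passing false gives Hello, world,
and passing true gives <p>, Hello, <b>, world, </b>, </p>.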

> - Implement HibernateWordsDataSource
> - Implement a project which uses Classifier4J.
>

That's a really good idea!! ;-)