RE: [Classifier4j-devel] First look at classifier4j...
From: Peter L. <pe...@le...> - 2003-07-02 06:40:12
Heya,

> you are talking about the Source zip?

Yep...

> I just zipped the source with WinZip, so it doesn't surprise me.

That explains it...

> 2) I'm reasonably familiar with Hibernate (in theory at least). I'm
> reluctant to replace the JDBC Data Source with a Hibernate one because
> I want to make Classifier4J very easy to drop into people's code
> without too many dependencies.

Fair enough...

> However, I'm not opposed to a HibernateWordsDataSource if you'd like
> to work on that.

Cool...

> Please be aware that there are performance problems at the moment when
> using a database backend, and I'm not convinced that a normal DB
> backend will ever be able to deliver sufficient performance on large
> documents.

I'd like to try and convince you otherwise! :) I think it'll be possible
with a schema change and by using Hibernate's caching mechanisms...

> I think I'll need to have a table that contains the precalculated word
> probability for each word to get rid of the two-queries-per-word issue

Having precalculated word probabilities in the database makes it
difficult to teach the classifier new sentences...

Currently the schema is something like:

TABLE matching_words
  - word varchar
  - word_count int

TABLE nonmatching_words
  - word varchar
  - word_count int

I would recommend using something like:

TABLE words
  - word varchar
  - nonmatching_count int
  - matching_count int

This will allow you to obtain the required information during
classification with one query per word instead of two. It also makes it
easier to teach the classifier: you just increment nonmatching_count or
matching_count by one.

> Any comments on the API - net.sf.classifier4J.IClassifier in
> particular?

Hmmm, there needs to be a way to teach the classifier with new input.
I'm not sure whether that belongs on BayesianClassifier, on IClassifier,
or on an ITeachableClassifier interface (which BayesianClassifier would
implement). It really depends on what classifiers you want to implement
in the future.
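To make the single-table idea concrete, here is a minimal in-memory
sketch of it in Java. The class and method names (WordCountTable,
getCounts, teachMatching, teachNonMatching) are hypothetical, not part
of Classifier4J; the point is just that one row per word carries both
counts, so classification needs one lookup per word and teaching is a
single increment:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the proposed "words" table: one entry per word
// holds both matching_count and nonmatching_count.
public class WordCountTable {
    // counts[0] = matching_count, counts[1] = nonmatching_count
    private final Map<String, int[]> rows = new HashMap<>();

    // One lookup returns both counts, replacing the two-queries-per-word
    // pattern of separate matching_words / nonmatching_words tables.
    public int[] getCounts(String word) {
        return rows.getOrDefault(word, new int[] {0, 0});
    }

    // Teaching a matching document just bumps one column; no
    // precalculated probability has to be rewritten.
    public void teachMatching(String word) {
        rows.computeIfAbsent(word, w -> new int[2])[0]++;
    }

    public void teachNonMatching(String word) {
        rows.computeIfAbsent(word, w -> new int[2])[1]++;
    }
}
```

With a real database the same shape becomes a single SELECT of both
columns at classification time and a single UPDATE ... + 1 at teaching
time.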
I don't really mind... Otherwise IClassifier looks OK to me...

I would recommend changing IWordsDataSource to return WordProbability
objects instead of doubles. That way BayesianClassifier doesn't have to
know how to create WordProbability objects; it just gets them from the
IWordsDataSource. It would also be nice if WordProbability knew how to
calculate its own probability, given the matching and nonmatching
counts. This would reduce duplication of code across new
IWordsDataSource implementations.

Pete
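A WordProbability that derives its own probability from the two counts
might look something like the sketch below. The class name comes from
the discussion above, but the constructor signature and the plain
matching/(matching+nonmatching) ratio are assumptions for illustration;
a real Bayesian classifier may weight by corpus sizes instead:

```java
// Sketch: WordProbability calculates its own probability from the
// counts, so IWordsDataSource implementations can return these objects
// directly instead of a raw double. Names and formula are assumptions.
public class WordProbability {
    private final String word;
    private final long matchingCount;
    private final long nonmatchingCount;

    public WordProbability(String word, long matchingCount, long nonmatchingCount) {
        this.word = word;
        this.matchingCount = matchingCount;
        this.nonmatchingCount = nonmatchingCount;
    }

    public String getWord() {
        return word;
    }

    // Probability that this word indicates a match. A plain ratio is
    // used here for simplicity.
    public double getProbability() {
        long total = matchingCount + nonmatchingCount;
        if (total == 0) {
            return 0.5; // no data for this word: assume neutral
        }
        return (double) matchingCount / total;
    }
}
```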