RE: [Classifier4j-devel] First look at classifier4j...


> 
> TABLE matching_words
> - word varchar
> - word_count int
> 
> TABLE nonmatching_words
> - word varchar
> - word_count int
> 
> I would recommend using something like:
> 
> TABLE words
> - word varchar
> - nonmatching_count int
> - matching_count int
> 
> This will allow you to obtain the required information
> during classification with one query per word instead
> of two. It also makes it easier to teach the classifier
> (you just increment the nonmatching_count or
> matching_count by one).
> 

Mmm... I actually experimented with that. I can't remember why I abandoned
it; I'll look at my old code.
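
For reference, here is a rough sketch of the single-table idea, using an
in-memory map as a stand-in for the proposed "words" table. The class and
method names (WordCounts, WordsTable, teach, lookup) are made up for
illustration and are not part of Classifier4J; a real backend would
presumably be a database table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical row of the proposed "words" table: both counts live together.
class WordCounts {
    int matchingCount;
    int nonmatchingCount;
}

// Hypothetical in-memory stand-in for the "words" table, keyed by word.
class WordsTable {
    private final Map<String, WordCounts> rows = new HashMap<>();

    // Teaching: increment exactly one of the two counters on the word's row.
    void teach(String word, boolean isMatch) {
        WordCounts row = rows.computeIfAbsent(word, w -> new WordCounts());
        if (isMatch) {
            row.matchingCount++;
        } else {
            row.nonmatchingCount++;
        }
    }

    // Classification: one lookup per word returns both counts.
    // Unknown words come back as a row with both counts at zero.
    WordCounts lookup(String word) {
        WordCounts row = rows.get(word);
        return row != null ? row : new WordCounts();
    }
}
```

With two separate tables, the lookup step would need one query against
each table for every word.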

> > Any comments on the API -
> > net.sf.classifier4J.IClassifier
> > in particular?
> Hmmm there needs to be a way to teach the Classifier
> with new input. Not sure if that would be under
> BayesianClassifier or IClassifier or a
> ITeachableClassifier (which BayesianClassifier would
> extend). Really depends on what classifiers you want to
> implement in the future. I don't really mind...
> Otherwise IClassifier looks ok to me...
> 

I think that the interface for training should be totally separate from the
IClassifier hierarchy. There are just so many ways of doing the training - it
probably depends on the backend as well.
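
To make the separation concrete, here is a minimal sketch in which training
and classification are two unrelated interfaces that only share a backend.
Everything here is an assumption for illustration: SketchClassifier is a
simplified stand-in, not the real net.sf.classifier4J.IClassifier signature,
and SketchTrainer, CountStore, CountingTrainer and RatioClassifier are
invented names. The scoring is a toy ratio, not the Bayesian calculation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a classifier interface.
interface SketchClassifier {
    double classify(String input);
}

// Hypothetical training interface, kept outside the classifier hierarchy.
interface SketchTrainer {
    void teachMatch(String input);
    void teachNonMatch(String input);
}

// Shared backend: word -> {matching count, nonmatching count}.
class CountStore {
    final Map<String, int[]> counts = new HashMap<>();
}

// One possible trainer; another backend could train in a different way.
class CountingTrainer implements SketchTrainer {
    private final CountStore store;

    CountingTrainer(CountStore store) {
        this.store = store;
    }

    public void teachMatch(String input) {
        add(input, 0);
    }

    public void teachNonMatch(String input) {
        add(input, 1);
    }

    // Split on whitespace and bump the chosen counter for every word.
    private void add(String input, int column) {
        for (String word : input.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                store.counts.computeIfAbsent(word, w -> new int[2])[column]++;
            }
        }
    }
}

// Toy classifier: share of known word occurrences that came from matches.
// Returns a neutral 0.5 when none of the words have been seen.
class RatioClassifier implements SketchClassifier {
    private final CountStore store;

    RatioClassifier(CountStore store) {
        this.store = store;
    }

    public double classify(String input) {
        int matching = 0;
        int total = 0;
        for (String word : input.toLowerCase().split("\\s+")) {
            int[] c = store.counts.get(word);
            if (c != null) {
                matching += c[0];
                total += c[0] + c[1];
            }
        }
        return total == 0 ? 0.5 : (double) matching / total;
    }
}
```

The classifier never sees the trainer, so a backend with a different
training scheme could replace CountingTrainer without touching the
classifier side.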

I do agree that there is a need for a training API, though. I'd like to
leave that for a bit until we understand the problem space better. In
particular, I want to try a Vector Space Search classifier (see
<http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt>).

> I would recommend changing the IWordsDataSource to
> return WordProbability objects instead of double. This
> would ensure that BayesianClassifier doesn't have to
> know how to create WordProbability Objects, it just
> gets them from the IWordsDataSource.
> 
> It would be nice if WordProbability knew how to
> calculate its own probability, given the number of
> nonmatching & matching counts. This will reduce
> duplication of code with new IWordsDataSource
> implementations.
> 

Sounds pretty reasonable. I'm not very happy with the use of the
WordProbability object at the moment, anyway.
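
As a starting point for discussion, here is a small sketch of a word
probability object that derives its value from the two counts, plus a data
source that hands such objects back. The names WordProbabilitySketch and
WordsDataSourceSketch are hypothetical and are not the current
WordProbability or IWordsDataSource. The formula (matching divided by total,
with a neutral 0.5 when there is no data) is only an assumption; the
suggestion above does not specify one.

```java
// Hypothetical value object that computes its own probability from counts.
class WordProbabilitySketch {
    private final String word;
    private final long matchingCount;
    private final long nonmatchingCount;

    WordProbabilitySketch(String word, long matchingCount, long nonmatchingCount) {
        this.word = word;
        this.matchingCount = matchingCount;
        this.nonmatchingCount = nonmatchingCount;
    }

    String getWord() {
        return word;
    }

    // Assumed formula: fraction of sightings that were in matching text.
    // Returns a neutral 0.5 when the word has never been seen.
    double getProbability() {
        long total = matchingCount + nonmatchingCount;
        if (total == 0) {
            return 0.5;
        }
        return (double) matchingCount / total;
    }
}

// Hypothetical data source that returns the object instead of a double,
// so each implementation only has to supply the two counts.
interface WordsDataSourceSketch {
    WordProbabilitySketch getWordProbability(String word);
}
```

A new data source implementation would then only need to fetch the counts;
the calculation stays in one place.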