RE: [Classifier4j-devel] First look at classifier4j...
From: Nick L. <nl...@es...> - 2003-07-02 08:41:57
>
> TABLE matching_words
> - word varchar
> - word_count int
>
> TABLE nonmatching_words
> - word varchar
> - word_count int
>
> I would recommend using something like:
>
> TABLE words
> - word varchar
> - nonmatching_count int
> - matching_count int
>
> This will allow you to obtain the required information
> during classification with one query per word instead
> of two. It also makes it easier to teach the classifier
> (you just increment the nonmatching_count or
> matching_count by one).
>

Mmm.. I actually experimented with that. I can't remember why I abandoned it - I'll look at my old code.

> > Any comments on the API -
> > net.sf.classifier4J.IClassifier
> > in particular?
> Hmmm there needs to be a way to teach the Classifier
> with new input. Not sure if that would be under
> BayesianClassifier or IClassifier or a
> ITeachableClassifier (which BayesianClassifier would
> extend). Really depends on what classifiers you want to
> implement in the future. I don't really mind...
> Otherwise IClassifier looks ok to me...
>

I think that the interface for training should be totally separate from the IClassifier hierarchy. There are just so many ways of doing the training - it probably depends on the backend as well.

I do agree that there is a need for a training API, though. I'd like to leave that for a bit until we understand the problem space better. In particular, I want to try a Vector Space Search classifier (see <http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt>).

> I would recommend changing the IWordsDataSource to
> return WordProbability objects instead of double. This
> would ensure that BayesianClassifier doesn't have to
> know how to create WordProbability Objects, it just
> gets them from the IWordsDataSource.
>
> It would be nice if WordProbability knew how to
> calculate its own probability, given the number of
> nonmatching & matching counts. This will reduce
> duplication of code with new IWordsDataSource
> implementations.
>

Sounds pretty reasonable. I'm not very happy with the use of the WordProbability object at the moment, anyway.
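
For illustration only, here is a rough sketch of how the two suggestions quoted above might fit together: one record per word holding both counts, and a probability object that works out its own value from those counts. The class and method names (WordProbabilitySketch, InMemoryWordsDataSourceSketch, addMatch, addNonMatch, getWordProbability) are hypothetical and are not the Classifier4J API. The plain matching/(matching + nonmatching) ratio and the neutral 0.5 for unseen words are assumptions too, not necessarily the formula the classifier uses. It is written in current Java for brevity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical value object: one row of the combined "words" table. */
class WordProbabilitySketch {
    private final String word;
    private final int matchingCount;
    private final int nonMatchingCount;

    WordProbabilitySketch(String word, int matchingCount, int nonMatchingCount) {
        this.word = word;
        this.matchingCount = matchingCount;
        this.nonMatchingCount = nonMatchingCount;
    }

    String getWord() {
        return word;
    }

    /**
     * Computes the probability from the counts, so data source
     * implementations do not each repeat the calculation.
     * Assumption: a simple ratio, with 0.5 for a word never seen.
     */
    double getProbability() {
        int total = matchingCount + nonMatchingCount;
        if (total == 0) {
            return 0.5; // assumed neutral value for unknown words
        }
        return (double) matchingCount / total;
    }
}

/** Hypothetical in-memory stand-in for a single-table data source. */
class InMemoryWordsDataSourceSketch {
    // word -> {matching_count, nonmatching_count}
    private final Map<String, int[]> counts = new HashMap<>();

    /** Teaching is a single increment on the word's record. */
    void addMatch(String word) {
        counts.computeIfAbsent(word, k -> new int[2])[0]++;
    }

    void addNonMatch(String word) {
        counts.computeIfAbsent(word, k -> new int[2])[1]++;
    }

    /** One lookup per word returns both counts, wrapped in the value object. */
    WordProbabilitySketch getWordProbability(String word) {
        int[] c = counts.getOrDefault(word, new int[2]);
        return new WordProbabilitySketch(word, c[0], c[1]);
    }
}
```

With this shape the classifier would only ever ask the data source for a WordProbabilitySketch and call getProbability() on it. A JDBC-backed data source could presumably do the same with one SELECT per word against the combined table.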