RE: [Classifier4j-devel] First look at classifier4j...
From: Peter L. <pe...@le...> - 2003-07-02 06:40:12
Heya,

> you are talking about the Source zip?

Yep...

> I just zipped the source with WinZip, so it doesn't surprise me.

That explains it...

> 2) I'm reasonably familiar with Hibernate (in theory at least). I'm
> reluctant to replace the JDBC Data Source with a Hibernate one because
> I want to make Classifier4J very easy to drop into people's code
> without too many dependencies.

Fair enough...

> However, I'm not opposed to a HibernateWordsDataSource if you'd like
> to work on that.

Cool...

> Please be aware that there are performance problems at the moment when
> using a database backend, and I'm not convinced that a normal DB
> backend will ever be able to deliver sufficient performance on large
> documents.

I'd like to try and convince you otherwise! :) I think it'll be possible
with a schema change and by using Hibernate's caching mechanisms...

> I think I'll need to have a table that contains the precalculated word
> probability for each word to get rid of the two-queries-per-word issue

Having precalculated word probabilities in the database makes it
difficult to teach the classifier new sentences...

Currently the schema is something like:

TABLE matching_words
  - word varchar
  - word_count int

TABLE nonmatching_words
  - word varchar
  - word_count int

I would recommend using something like:

TABLE words
  - word varchar
  - nonmatching_count int
  - matching_count int

This will allow you to obtain the required information during
classification with one query per word instead of two. It also makes it
easier to teach the classifier: you just increment nonmatching_count or
matching_count by one.

> Any comments on the API - net.sf.classifier4J.IClassifier in
> particular?

Hmmm, there needs to be a way to teach the classifier with new input.
I'm not sure whether that belongs on BayesianClassifier, on IClassifier,
or on an ITeachableClassifier interface (which BayesianClassifier would
implement). It really depends on what classifiers you want to implement
in the future.
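To make the single-table idea concrete, here is a minimal in-memory
sketch of it in Java. The class and method names (WordCountTable,
getCounts, teachMatching, teachNonMatching) are hypothetical, not part
of Classifier4J; the point is just that one row per word carries both
counts, so classification needs one lookup per word and teaching is a
single increment:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the proposed "words" table: one entry per word
// holds both matching_count and nonmatching_count.
public class WordCountTable {
    // counts[0] = matching_count, counts[1] = nonmatching_count
    private final Map<String, int[]> rows = new HashMap<>();

    // One lookup returns both counts, replacing the two-queries-per-word
    // pattern of separate matching_words / nonmatching_words tables.
    public int[] getCounts(String word) {
        return rows.getOrDefault(word, new int[] {0, 0});
    }

    // Teaching a matching document just bumps one column; no
    // precalculated probability has to be rewritten.
    public void teachMatching(String word) {
        rows.computeIfAbsent(word, w -> new int[2])[0]++;
    }

    public void teachNonMatching(String word) {
        rows.computeIfAbsent(word, w -> new int[2])[1]++;
    }
}
```

With a real database the same shape becomes a single SELECT of both
columns at classification time and a single UPDATE ... + 1 at teaching
time.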
I don't really mind... Otherwise IClassifier looks OK to me...

I would recommend changing IWordsDataSource to return WordProbability
objects instead of doubles. That way BayesianClassifier doesn't have to
know how to create WordProbability objects; it just gets them from the
IWordsDataSource. It would also be nice if WordProbability knew how to
calculate its own probability, given the matching and nonmatching
counts. This would reduce duplication of code across new
IWordsDataSource implementations.

Pete
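A WordProbability that derives its own probability from the two counts
might look something like the sketch below. The class name comes from
the discussion above, but the constructor signature and the plain
matching/(matching+nonmatching) ratio are assumptions for illustration;
a real Bayesian classifier may weight by corpus sizes instead:

```java
// Sketch: WordProbability calculates its own probability from the
// counts, so IWordsDataSource implementations can return these objects
// directly instead of a raw double. Names and formula are assumptions.
public class WordProbability {
    private final String word;
    private final long matchingCount;
    private final long nonmatchingCount;

    public WordProbability(String word, long matchingCount, long nonmatchingCount) {
        this.word = word;
        this.matchingCount = matchingCount;
        this.nonmatchingCount = nonmatchingCount;
    }

    public String getWord() {
        return word;
    }

    // Probability that this word indicates a match. A plain ratio is
    // used here for simplicity.
    public double getProbability() {
        long total = matchingCount + nonmatchingCount;
        if (total == 0) {
            return 0.5; // no data for this word: assume neutral
        }
        return (double) matchingCount / total;
    }
}
```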