Re: [Classifier4j-devel] Dev Plan
From: Nick L. <ni...@ma...> - 2003-08-10 03:15:20
> A couple of points:
> - Is there a reason why you've used tabs instead of spaces? Generally spaces
> are preferred, it's more standard. Some people may have their tab size set to
> 4 while others have it set to 8 etc... If you always convert tabs to spaces,
> it's always the same...

Yes, I've reset Eclipse to substitute spaces. Files are being fixed as I check them in.

> - Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in memory
> / disk java based database with a JDBC interface. It would be interesting to
> compare performance between different database solutions. eg.
> JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's
> HibernateWordsDatabase -> hsqldb / mysql etc.
>

If you look at the code for the examples in Classifier4J-Optional, you'll see some commented out code to use a JDBCWordsDataSource with HSQLDB.

If I use the training example, I get about 50 words per second with HSQLDB, but with JDBM it takes less than 1 second for all 3000 words. I'm using HSQLDB persistent tables and I'm not sure how often that writes to disk. I'm pretty sure it's not after every update, because the HSQLDB documentation talks about needing to do a CHECKPOINT to make sure it is written. With JDBM I only commit at the end of the training session, so that's a big speed win.

In the Analyser example, JDBM completes in less than 1 second, and HSQLDB runs at about 80 words per second.

My patches to NNTP://RSS (http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt) use the HSQLDB database integrated in NNTP://RSS. I've looked at Axion (http://www.mackmo.com/nick/blog/java/?permalink=axion2.txt) in the past, too.

> I'll look into the following:
> - Fix the following in BayesianClassifier
> * @todo need an option to only use the "X" most "important" words when
> calculating overall probability
> * "important" is defined as being most distant from NEUTAL_PROBABILITY

Cool.
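For what it's worth, here is a rough sketch of how the "X most important words" selection in that @todo could look. It is not the BayesianClassifier code: the class name ImportantWords, the method mostImportant and the neutral value of 0.5 are all assumptions made up for illustration. It just sorts word probabilities by their distance from the neutral value and keeps the first X.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Classifier4J.
public class ImportantWords {
    // Assumed neutral value; the real constant may differ.
    public static final double NEUTRAL_PROBABILITY = 0.5;

    // Returns up to 'limit' probabilities, farthest from neutral first.
    public static List<Double> mostImportant(List<Double> probabilities, int limit) {
        List<Double> sorted = new ArrayList<>(probabilities);
        // Sort descending by absolute distance from the neutral probability.
        sorted.sort(Comparator
                .comparingDouble((Double p) -> Math.abs(p - NEUTRAL_PROBABILITY))
                .reversed());
        // Copy the head of the list so the result does not depend on 'sorted'.
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<Double> probs = Arrays.asList(0.5, 0.99, 0.45, 0.02, 0.7);
        System.out.println(mostImportant(probs, 2));
    }
}
```

With the sample values in main, the two entries kept are 0.99 and 0.02, since those lie farthest from 0.5. The overall probability would then be calculated from only the returned values.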
> - Look into the current Tokenizer - For example, "1.4" currently gets split
> into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into
> "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up
> with a set of test cases.

Yes, that needs fixing. Also, I'm not sure about how to deal with URLs: at the moment http://www.google.com/something gets split up, but I think it probably shouldn't (?)

> - Implement an HTML Tokenizer (depending on how it is configured, html tags
> will be either included or ignored).

Very good idea.

> - Implement HibernateWordsDataSource
> - Implement a project which uses Classifier4J.
>

That's a really good idea!! ;-)
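
On the tokenizer test cases mentioned above, a minimal sketch of the behaviour they might pin down, assuming a regex-based approach. This is not the current Classifier4J Tokenizer: SketchTokenizer and tokenize are hypothetical names. The pattern tries URLs first, then dotted numbers such as "1.4", then words that may contain apostrophes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Classifier4J tokenizer.
public class SketchTokenizer {
    // Alternatives are tried in order: URL, dotted number, word with optional apostrophes.
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+|\\d+(?:\\.\\d+)+|\\w+(?:'\\w+)*");

    // Returns the tokens of 'input' in the order they appear.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("peter's JDK 1.4 docs at http://www.google.com/something"));
    }
}
```

One known limitation of this sketch: the URL alternative runs to the next whitespace, so punctuation directly after a URL (for example a full stop ending the sentence) would be kept as part of the token.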