Thread: [Classifier4j-devel] Dev Plan
Status: Beta
Brought to you by:
nicklothian
From: Peter L. <pe...@le...> - 2003-08-01 00:16:41
|
Hi Nick, what are your current plans for JClassifier? What are you planning on implementing in the near future? I just don't want to double up on what we do... Pete |
From: Nick L. <ni...@ma...> - 2003-08-03 07:11:17
|
Currently I'm focused on two things:

1) Refactoring category support. I've added ICategorisedClassifier and ICategorisedWordsDataSource interfaces, which have methods like ICategorisedClassifier.classify(String category, String input), so the categories can be used directly from the classifier without having to call "setCategory" on the datasource. I can't see why we need to keep that state, so I'm removing it. I've just added these changes to CVS.

2) A Classifier4J-Optional jar, which (currently) contains a couple of demos, a JDBMWordsDataSource (very fast and reliable) and a JispWordsDataSource (fast, but prone to data corruption, so I'll probably throw it out). Currently this is not in CVS. If you are still interested in the HibernateWordsDataSource, I would see it going in here.

As well as those changes I've done some work on Text Summary (http://www.mackmo.com/nick/blog/java/?permalink=TextSummaryApp.txt), which is also available.

I have some plans to do a 0.4 release sometime this week.

What are you interested in working on?

Nick |
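For readers following the category refactoring, here is a minimal sketch of what a stateless, category-aware classifier could look like. Only the classify(String category, String input) signature comes from the message above; the interface name CategorisedClassifierSketch, the KnownWordsClassifier class, the teach method and the toy scoring rule are all hypothetical and are not Classifier4J's code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and scoring are illustrative, not Classifier4J's code.
interface CategorisedClassifierSketch {
    // The category travels with each call, so no setCategory() state is kept.
    double classify(String category, String input);
}

final class KnownWordsClassifier implements CategorisedClassifierSketch {
    // category -> words seen while training that category
    private final Map<String, Set<String>> wordsByCategory = new HashMap<>();

    // Records every word of the input under the given category.
    void teach(String category, String input) {
        Set<String> words = wordsByCategory.computeIfAbsent(category, k -> new HashSet<>());
        for (String w : input.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
    }

    // Toy score: the fraction of input words already known for the category.
    @Override
    public double classify(String category, String input) {
        Set<String> known = wordsByCategory.getOrDefault(category, Collections.emptySet());
        int total = 0;
        int hits = 0;
        for (String w : input.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) {
                continue;
            }
            total++;
            if (known.contains(w)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

Because the category is an argument rather than a field on the datasource, two callers can classify against different categories without interfering with each other.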
From: Peter L. <pe...@le...> - 2003-08-08 14:21:06
|
Hi Nick,

I just did a cvs update and took a look at version 0.4. I like the IStopWordProvider concept...

A couple of points:
- Is there a reason why you've used tabs instead of spaces? Generally spaces are preferred; it's more standard. Some people may have their tab size set to 4 while others have it set to 8, etc. If you always convert tabs to spaces, it always looks the same.
- Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in-memory / on-disk Java-based database with a JDBC interface. It would be interesting to compare performance between different database solutions, e.g. JDBMWordsDataSource vs. JDBCWordsDataSource -> hsqldb vs. HibernateWordsDatabase -> hsqldb / mysql etc.

I'll look into the following:
- Fix the following in BayesianClassifier:
  * @todo need an option to only use the "X" most "important" words when calculating overall probability
  * "important" is defined as being most distant from NEUTAL_PROBABILITY
- Look into the current Tokenizer. For example, "1.4" currently gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up with a set of test cases.
- Implement an HTML Tokenizer (depending on how it is configured, HTML tags will be either included or ignored).
- Implement HibernateWordsDataSource.
- Implement a project which uses Classifier4J.

It's looking good!

Pete |
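The "X most important words" item could be sketched roughly as follows. This is an illustration under stated assumptions, not the BayesianClassifier code: the class and method names are hypothetical, the neutral value of 0.5 is assumed (the thread only names the constant), and the combining formula is one common naive-Bayes combination that the thread does not confirm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the actual BayesianClassifier implementation.
final class ImportantWords {
    // Assumed neutral value; the thread only mentions the constant by name.
    static final double NEUTRAL_PROBABILITY = 0.5;

    // Returns at most x probabilities, those furthest from NEUTRAL_PROBABILITY first.
    static List<Double> mostImportant(List<Double> wordProbabilities, int x) {
        List<Double> sorted = new ArrayList<>(wordProbabilities);
        sorted.sort(Comparator.comparingDouble(
                (Double p) -> Math.abs(p - NEUTRAL_PROBABILITY)).reversed());
        int limit = Math.min(Math.max(x, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }

    // One common way to combine word probabilities (an assumption here):
    // prod(p) / (prod(p) + prod(1 - p)). An empty list yields 0.5.
    static double combine(List<Double> probabilities) {
        double product = 1.0;
        double inverseProduct = 1.0;
        for (double p : probabilities) {
            product *= p;
            inverseProduct *= (1.0 - p);
        }
        return product / (product + inverseProduct);
    }
}
```

With word probabilities 0.9, 0.5, 0.2 and 0.55 and X = 2, only 0.9 and 0.2 would feed into the overall probability; the near-neutral words are ignored.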
From: Nick L. <ni...@ma...> - 2003-08-10 03:15:20
|
> A couple of points:
> - Is there a reason why you've used tabs instead of spaces? Generally spaces
> are preferred, it's more standard. Some people may have their tab size set to
> 4 while others have it set to 8 etc... If you always convert tabs to spaces,
> it's always the same...

Yes, I've reset Eclipse to substitute spaces. Files are being fixed as I check them in.

> - Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in memory
> / disk java based database with a JDBC interface. It would be interesting to
> compare performance between different database solutions. eg.
> JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's
> HibernateWordsDatabase -> hsqldb / mysql etc.

If you look at the code for the examples in Classifier4J-Optional, you'll see some commented-out code to use a JDBCWordsDataSource with HSQLDB.

If I use the training example, I get about 50 words per second with HSQLDB, but with JDBM it takes less than 1 second for all 3000 words. I'm using HSQLDB persistent tables and I'm not sure how often that writes to disk. I'm pretty sure it's not after every update, because the HSQLDB documentation talks about needing to do a CHECKPOINT to make sure it is written. With JDBM I only commit at the end of the training session, so that's a big speed win.

In the Analyser example, JDBM completes in less than 1 second, and HSQLDB runs at about 80 words per second.

My patches to NNTP://RSS (http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt) use the HSQLDB database integrated in NNTP://RSS. I've looked at Axion (http://www.mackmo.com/nick/blog/java/?permalink=axion2.txt) in the past, too.

> I'll look into the following:
> - Fix the following in BayesianClassifier
> * @todo need an option to only use the "X" most "important" words when
> calculating overall probability
> * "important" is defined as being most distant from NEUTAL_PROBABILITY

Cool.

> - Look into the current Tokenizer - For example, "1.4" currently gets split
> into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into
> "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up
> with a set of test cases.

Yes, that needs fixing. Also, I'm not sure how to deal with URLs: at the moment http://www.google.com/something gets split up, but I think it probably shouldn't (?)

> - Implement an HTML Tokenizer (depending on how it is configured, html tags
> will be either included or ignored).

Very good idea.

> - Implement HibernateWordsDataSource
> - Implement a project which uses Classifier4J.

That's a really good idea!! ;-) |
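The tokenizer cases raised in this exchange ("1.4", "peter's", and whole URLs) can be written down as a small regex-based sketch. This is not the Classifier4J tokenizer; the TokenizerSketch class and its pattern are hypothetical, and the URL rule simply takes everything up to the next whitespace, so punctuation directly after a URL would be kept as part of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the tokenizer shipped with Classifier4J.
final class TokenizerSketch {
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+"                           // a whole URL, up to whitespace
            + "|[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*");  // words, allowing an inner ' or .

    // Returns the tokens of the input in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line for a quick manual check.
        for (String t : tokenize("peter's 1.4 notes at http://www.google.com/something today.")) {
            System.out.println(t);
        }
    }
}
```

A sentence-ending full stop is dropped because the pattern only accepts "." or "'" when more letters or digits follow, which is what keeps "1.4" and "peter's" together.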
From: Peter L. <pe...@le...> - 2003-08-16 02:11:58
|
Heya,

> My patches to NNTP://RSS
> (http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt)
> use the HSQLDB database integrated in NNTP://RSS.

I took a look at NNTP://RSS & the Classifier4J patch. Damn useful. I wanted to write a job (as in employment) searching app which used Classifier4J to rate all the incoming jobs. All I have to do now is write an adapter for each job site which converts its listings to RSS feeds.

There needs to be a classifying category assigned to each RSS feed, though. For example, the words in a job description would be a lot different from those in a news article that I'm interested in... Having a category for each feed would be simple to implement, but not as effective as a few categories.

I've added a new patch (FastHashMapWordsDataSource), submitted via the patch manager, to improve the stats :)

Pete |
From: Nick L. <ni...@ma...> - 2003-08-24 06:10:15
|
> I've added a new patch (FastHashMapWordsDataSource) - submitted via the > patch manager to improve the stats :) > And I have _finally_ got around to applying some of Pete's patches and bug fixes. Nick |
From: Peter L. <pe...@le...> - 2003-08-25 13:32:52
|
Heya, > If you look at the code for the examples in Classifier4J-Optional, Is there a reason why you've kept the same package structure in Optional? I'm getting compilation errors complaining that there are two net.sf.classifier4J.bayesian.AllTests classes. Could we put all the optional classes in the net.sf.classifier4J.optional package? (eg net.sf.classifier4J.optional.bayesian etc). Pete |
From: Nick L. <ni...@ma...> - 2003-08-30 04:12:24
|
How are you compiling it? I intended the optional package to be a totally stand-alone project, so it shouldn't be compiling to the same target directory as the normal Classifier4J project (which would cause the errors you are seeing). I put it in the same package so that we can use package-level access to methods if we need to (in particular for the tests, since often I find that is a useful technique). |
From: Peter L. <pe...@le...> - 2003-08-30 08:22:39
|
Hi Nick,

I'm currently compiling both Classifier4J & Classifier4J-Optional at the same time. I agree with having the same package structure for the tests (of the same project), but I wouldn't recommend having the same package structure across two projects (Classifier4J & Classifier4J-Optional).

The situation we're seeing here is two different classes being generated (or attempted to be generated) with the same fully qualified class name, net.sf.classifier4J.bayesian.AllTests, in the following locations:

classifier4j\Classifier4J\src\test\net\sf\classifier4J\bayesian\AllTests.java
classifier4j\Classifier4J-Optional\src\test\net\sf\classifier4J\bayesian\AllTests.java

Even if the compilation is done separately, if we refer to net.sf.classifier4J.bayesian.AllTests, we're going to get the AllTests class from either Classifier4J or Classifier4J-Optional, depending on the order in which the two jars appear on the classpath (which is really bad)...

Regards, Peter Leschev |
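The classpath concern can be illustrated with a toy simulation. It only models the usual "first classpath entry containing the name wins" behaviour; it does not load real classes, and the jar file names and the ClasspathOrderSketch class are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of classpath lookup order; no real class loading happens here.
final class ClasspathOrderSketch {
    // Returns the first classpath entry (in insertion order) that contains
    // the class name, or null if no entry contains it.
    static String resolve(Map<String, Set<String>> classpath, String className) {
        for (Map.Entry<String, Set<String>> entry : classpath.entrySet()) {
            if (entry.getValue().contains(className)) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Convenience helper to build the set of class names inside one jar.
    static Set<String> classes(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        String name = "net.sf.classifier4J.bayesian.AllTests";
        Map<String, Set<String>> classpath = new LinkedHashMap<>();
        // Hypothetical jar names; both contain a class with the same name.
        classpath.put("classifier4j.jar", classes(name));
        classpath.put("classifier4j-optional.jar", classes(name));
        // Swapping the two put() calls changes which jar is reported.
        System.out.println(resolve(classpath, name));
    }
}
```

Giving the optional classes their own package, such as net.sf.classifier4J.optional.bayesian, makes the name unique, so the result no longer depends on jar order.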
From: Nick L. <ni...@ma...> - 2003-08-30 08:27:26
|
> > Even if the compilation is done separately, if we refer to > net.sf.classifier4J.bayesian.AllTests, we're going to get the AllTest > classes from either Classifier4J or Classifier4J-Optional depending on where > both jars are in the classpath (which is really bad)... > Yes, that is a good point. I'll move it into the *.optional.* structure you suggested |
From: Peter L. <pe...@le...> - 2003-08-27 13:05:13
|
Heya, > > It would be interesting to > > compare performance between different database solutions. eg. > > JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's > > HibernateWordsDatabase -> hsqldb / mysql etc. > > - Implement HibernateWordsDataSource I just submitted a patch which brings me closer to releasing a performance test class & the HibernateWordsDataSource class (should be ready by this weekend)... Pete |
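A performance test class of the kind mentioned here might be shaped roughly like the sketch below. It is not the submitted patch: WordStoreSketch, InMemoryWordStore and TrainingBenchmark are hypothetical names, and the in-memory store only counts commits where a real data source would write to disk. The commitEachWord flag mirrors the difference discussed earlier in the thread between committing after every update and committing once at the end of training.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a words data source; not the Classifier4J interface.
interface WordStoreSketch {
    void addMatch(String word);
    void commit();
}

// In-memory store that counts commits instead of writing to disk.
final class InMemoryWordStore implements WordStoreSketch {
    final Map<String, Integer> counts = new HashMap<>();
    int commits = 0;

    @Override
    public void addMatch(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    @Override
    public void commit() {
        commits++; // a real store would flush its changes here
    }
}

final class TrainingBenchmark {
    // Trains on all words and returns the throughput in words per second.
    static double wordsPerSecond(WordStoreSketch store, String[] words, boolean commitEachWord) {
        long start = System.nanoTime();
        for (String w : words) {
            store.addMatch(w);
            if (commitEachWord) {
                store.commit();
            }
        }
        if (!commitEachWord) {
            store.commit(); // single commit at the end of the session
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return words.length * 1e9 / elapsedNanos;
    }
}
```

Running the same word list through each data source with both settings would separate the cost of the store itself from the cost of its commit policy.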
From: Nick L. <ni...@ma...> - 2003-08-30 04:22:40
|
I've applied all outstanding patches. I had some trouble with the patch to JDBMWordsDataSource - it didn't apply cleanly for some reason. I had a look through it, and added the finalize method like you suggested. Were there any other changes to that class? Nick |
From: Peter L. <pe...@le...> - 2003-08-30 08:27:11
|
Hi! Thanks for applying the patches. I had a toString method in JDBMWordsDataSource, nothing major - I wouldn't worry about it... Regards, Peter Leschev |