classifier4j-devel Mailing List for Classifier4J (Page 11)
Status: Beta
Brought to you by:
nicklothian
From: Nick L. <nl...@es...> - 2003-11-12 05:16:08
|
Could you post a complete stack trace of the error? Also, I'm not clear whether your dbTest1 class has the same problem. Does it?

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Wednesday, 12 November 2003 8:14 AM
> To: Classifier4J
> Subject: [Classifier4j-devel] Update Word Probability Break Down |
From: Matt C. <MCo...@my...> - 2003-11-12 04:46:57
|
Here's that HSQLDB version of connect.java I left out of the last message. Matt Collier RemoteIT mco...@my... 877-4-NEW-LAN |
From: moedusa <mo...@in...> - 2003-11-12 04:38:13
|
Nick Lothian wrote:
> I had always thought it would make more sense to load the JDBC driver
> class in your own code. ... Do people disagree with this logic?

Well, no. But when using this from inside an IDE (NetBeans) you have to mount the driver manually, and that is not how we used to do it before (usually we supply the JDBC connection details via properties or something similar; see http://www.quartzscheduler.org/features.jsp for an example). But I think that is specific to my setup (I was playing with C4J on its own, so it was the only thing that required DB access), so there is no need to change anything there, except maybe to add a note about it in the javadoc. |
From: Matt C. <MCo...@my...> - 2003-11-12 03:22:57
|
I have duplicated this problem in a completely separate computing environment (my home). In this case, mySQL is running on localhost. Exact same problem, and exact same symptoms. I would also add that my clients in both environments are running Windows XP Pro.

----

More data on this issue: Switching to HSQLDB produces the exact same results. I have attached the revised connect.java for use with HSQLDB.

----

Another interesting discovery. If I attempt to run connect.java a second time immediately after the first run errors out, the following message is displayed immediately:

WordsDataSourceException Occurred : Problem creating table
java.lang.IllegalArgumentException: IWordsDataSource can't be null
    at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:141)
    at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:128)
    at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:118)
    at Connect.main(Connect.java:26)
Exception in thread "main"

However, if I wait about 60-90 seconds between executions, it will process the ~3900 records again and die.

----

I just discovered that the reply address on the list messages is not the list but the sender. Is it possible to alter this setting and would we want to?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: "Classifier4J" <cla...@li...>
Date: Tue, 11 Nov 2003 15:43:39 -0600
Subject: [Classifier4j-devel] Update Word Probability Break Down |
From: Nick L. <nl...@es...> - 2003-11-11 22:22:09
|
> 1. in JDBCWordsDataSource - constructor did not initialised
> IJDBCConnectionManager connectionManager provided to it ( I added
> connectionManager = cm; below commented string //this(cm,
> ICategorisedClassifier.DEFAULT_CATEGORY);).

Yes, that is fixed in CVS (note to self - must do a release and stop wasting people's time with this bug!)

> 2. in DriverMangerJDBCConnectionManager I added one more constructor and
> one private field dbDriver (String) and slightly changed getConnection()
> method to initialise driver class if dbDriver != null
> (Class.forName(dbDriver)

Hmm.. I had always thought it would make more sense to load the JDBC driver class in your own code. After all, I would have thought that if you are using a JDBC based database for Classifier4J then there is a good chance you are using it for other things as well, so it makes sense for you to control your own driver loading.

Do people disagree with this logic?

Nick |
From: Matt C. <MCo...@my...> - 2003-11-11 21:41:56
|
Hello All!

I have been working around the clock on various issues relating to my ignorance of Java and the nuances of Classifier4J.

Thanks to Nick, and using the latest CVS code, I have succeeded in implementing Classifier4J after only 60 hours!

I have now come upon an interesting problem.

My project involves categorizing a large volume of data. That data exists in a blob field in a mySQL (4.0.16) database. I am using this same database to store my word_probability table. I am using the mySQL connector/J 3.0.9. I am using Java SDK 1.4.2_02.

My project begins by teaching Classifier4J large amounts of already classified data. I am providing a category and a string taken from the mySQL blob field. All is well at this point.

The bayesian teachMatch function works great for about 4000 words (in my environment, results may vary), then:

---
SQL Exception in updateWordProbability : Unable to connect to any hosts due to exception: java.net.BindException: Address already in use: connect

WordsDataSourceException Occurred during teachMatch : Problem updating WordProbability
---

I have added System.out e.getMessage() to the Exception Handler in the updateWordProbability function to produce the above result. Otherwise, you simply see an SQL Exception.

Initially I thought this problem related to my ignorance and improper implementation of connection pooling. I wrote the attached test program to eliminate this possibility. I found that the error still existed and is 100% reproducible on my system.

This program effectively loops through x number of teachMatch functions. On my system, the program starts generating exceptions just before 4000, usually between 3800 and 4900 iterations.

Just to make sure I didn't have some environmental problem, I wrote another program that writes x records to mySQL, emulating the function of updateWordProbability. No problems here, at least up to 100,000 records.

I hope someone with more knowledge and experience will be able to figure this one out.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
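One possible explanation, which nobody in this thread confirms: if every word update opens and closes its own TCP connection, a Windows client can run out of ephemeral ports while the closed sockets wait in TIME_WAIT, and that is commonly reported as `BindException: Address already in use: connect`. Older Windows versions are usually documented with a default ephemeral range of roughly 1025-5000, which would be in the same ballpark as the ~3900 iterations reported. The sketch below shows the usual workaround, reusing one connection for the whole training run. `ReusingConnectionSource` and `Opener` are hypothetical names, not Classifier4J classes, and the demo uses a fake `Connection` so it runs without a database.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Classifier4J): hands out one shared connection. */
final class ReusingConnectionSource {
    /** Hypothetical callback that opens a physical connection. */
    interface Opener {
        Connection open() throws SQLException;
    }

    private final Opener opener;
    private Connection shared;

    ReusingConnectionSource(Opener opener) {
        this.opener = opener;
    }

    /** Opens a connection on first use (or after it was closed), otherwise reuses it. */
    synchronized Connection getConnection() throws SQLException {
        if (shared == null || shared.isClosed()) {
            shared = opener.open();
        }
        return shared;
    }

    /** Deliberately keeps the connection open so the next caller can reuse it. */
    synchronized void returnConnection(Connection con) {
        // no-op: closing here would bring back one socket per update
    }

    /** Closes the shared connection once, when training is finished. */
    synchronized void shutdown() throws SQLException {
        if (shared != null) {
            shared.close();
            shared = null;
        }
    }

    /** Runs the given number of get/return cycles against a fake connection and counts opens. */
    static int countOpensFor(int updates) {
        AtomicInteger opens = new AtomicInteger();
        Opener fake = () -> {
            opens.incrementAndGet();
            // Fake Connection: boolean methods answer false, everything else returns null.
            return (Connection) Proxy.newProxyInstance(
                    Connection.class.getClassLoader(),
                    new Class<?>[] {Connection.class},
                    (proxy, method, methodArgs) ->
                            method.getReturnType() == boolean.class ? Boolean.FALSE : null);
        };
        ReusingConnectionSource source = new ReusingConnectionSource(fake);
        try {
            for (int i = 0; i < updates; i++) {
                Connection con = source.getConnection();
                source.returnConnection(con);
            }
            source.shutdown();
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
        return opens.get();
    }

    public static void main(String[] args) {
        System.out.println("physical connections opened: " + countOpensFor(5000));
    }
}
```

Whether this applies here depends on how the data source obtains its connections, which the thread does not show. A real connection pool would achieve the same effect with better failure handling.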
From: moedusa <mo...@in...> - 2003-11-11 11:07:28
|
Nick, I have understood how it works by looking at the unit tests, but to make it work I've made some changes... I use custom categories, so I had to fix some inconsistencies:

1. In JDBCWordsDataSource, the constructor did not initialise the IJDBCConnectionManager connectionManager provided to it (I added connectionManager = cm; below the commented-out line //this(cm, ICategorisedClassifier.DEFAULT_CATEGORY);).
2. In DriverMangerJDBCConnectionManager I added one more constructor and one private field dbDriver (String), and slightly changed the getConnection() method to initialise the driver class if dbDriver != null (Class.forName(dbDriver)).

After this I could make things work as needed, so that's okay now :)

Philipp. |
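The second change described above might look roughly like the following. This is a sketch written from the description only, not the patched CVS code: the class name `DriverLoadingConnectionManager`, the constructor shape and the `driverAvailable()` helper are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical sketch of a connection manager that can load its own JDBC driver. */
final class DriverLoadingConnectionManager {
    private final String dbDriver; // may be null: the caller loads the driver itself
    private final String url;
    private final String user;
    private final String password;

    DriverLoadingConnectionManager(String dbDriver, String url, String user, String password) {
        this.dbDriver = dbDriver;
        this.url = url;
        this.user = user;
        this.password = password;
    }

    /** Loads the driver class if one was configured, then asks DriverManager for a connection. */
    Connection getConnection() throws SQLException {
        loadDriverIfConfigured();
        return DriverManager.getConnection(url, user, password);
    }

    /** Class.forName registers the driver as a side effect of loading the class. */
    void loadDriverIfConfigured() throws SQLException {
        if (dbDriver == null) {
            return;
        }
        try {
            Class.forName(dbDriver);
        } catch (ClassNotFoundException e) {
            throw new SQLException("JDBC driver class not found: " + dbDriver, e);
        }
    }

    /** Convenience check used by the demo below. */
    boolean driverAvailable() {
        try {
            loadDriverIfConfigured();
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The URL and credentials are placeholders; no connection is opened here.
        String url = "jdbc:example:placeholder";
        System.out.println(new DriverLoadingConnectionManager(null, url, "sa", "").driverAvailable());
        System.out.println(new DriverLoadingConnectionManager("no.such.Driver", url, "sa", "").driverAvailable());
    }
}
```

Leaving `dbDriver` as null keeps the behaviour where the application loads the driver itself, so both styles discussed in this thread remain possible.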
From: Nick L. <nl...@es...> - 2003-11-11 05:52:06
|
> -----Original Message-----
> From: moedusa [mailto:mo...@in...]
> Sent: Tuesday, 11 November 2003 7:46 AM
> To: cla...@li...
> Subject: [Classifier4j-devel] How do I train it?
>
> Hi! I am playing with classifier4j now, and one thing is confusing me:
> let's say I want to train C4J this way: feed it with some text (a fiew
> paragraphs) and category id, but I can not figure out how to implement
> this now... Looks like I must parse text to words, filter them and train
> it word-by-word now... Is it right, or there is another way and my
> understanding of training such things is completely wrong?
>
> Philipp.

You don't need to train word-by-word, you can just use the ITrainableClassifier.teachMatch(String) and teachNonMatch(String) methods (see <http://classifier4j.sourceforge.net/apidocs/net/sf/classifier4J/ITrainable.html#teachMatch(java.lang.String)>).

If you want to use non-default categories then there are versions of teachMatch & teachNonMatch that take a category, too.

Have you looked at the code for the demos, in particular net.sf.classifier4J.demo.Trainer (<http://cvs.sourceforge.net/viewcvs.py/classifier4j/Classifier4J-Optional/src/java/net/sf/classifier4J/demo/Trainer.java?view=markup>)?

Let me know if that doesn't make things clearer.

Nick |
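To illustrate the calling pattern in the reply above (whole paragraphs go in, the classifier splits them into words itself), here is a toy stand-in. It is not Classifier4J and does not reproduce its implementation: only the method names teachMatch, teachNonMatch and classify mirror the thread, and the word splitting, clamping values and combining formula are assumptions chosen to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in, NOT Classifier4J: shows training with whole strings rather than single words. */
final class ToyTrainableClassifier {
    // word -> {times seen in matching text, times seen in non-matching text}
    private final Map<String, int[]> counts = new HashMap<>();

    void teachMatch(String text) {
        teach(text, 0);
    }

    void teachNonMatch(String text) {
        teach(text, 1);
    }

    private void teach(String text, int slot) {
        for (String word : words(text)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.computeIfAbsent(word, w -> new int[2])[slot]++;
        }
    }

    /** Returns a value in (0, 1); 0.5 means no evidence either way. */
    double classify(String text) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (String word : words(text)) {
            int[] c = counts.get(word);
            if (c == null) {
                continue; // words never seen in training are ignored
            }
            double p = (double) c[0] / (c[0] + c[1]);
            p = Math.min(0.99, Math.max(0.01, p)); // avoid hard 0 or 1
            match *= p;
            nonMatch *= 1.0 - p;
        }
        return match / (match + nonMatch);
    }

    private static String[] words(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\W+");
    }

    public static void main(String[] args) {
        ToyTrainableClassifier classifier = new ToyTrainableClassifier();
        // Each call takes a full piece of text, not one word at a time.
        classifier.teachMatch("Cheap pills, special offer today");
        classifier.teachNonMatch("Notes from the project meeting on Tuesday");
        System.out.println(classifier.classify("special offer on pills"));
        System.out.println(classifier.classify("meeting notes"));
    }
}
```

With the real library the equivalent calls would go through the interface linked in the reply; the demo Trainer class mentioned there is the authoritative example.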
From: moedusa <mo...@in...> - 2003-11-10 21:14:25
|
Hi! I am playing with classifier4j now, and one thing is confusing me: let's say I want to train C4J this way: feed it with some text (a few paragraphs) and a category id, but I cannot figure out how to implement this now... It looks like I must parse the text into words, filter them and train it word-by-word... Is that right, or is there another way and my understanding of training such things is completely wrong?

Philipp. |
From: Nick L. <ni...@ma...> - 2003-08-30 08:27:26
|
> > Even if the compilation is done separately, if we refer to > net.sf.classifier4J.bayesian.AllTests, we're going to get the AllTest > classes from either Classifier4J or Classifier4J-Optional depending on where > both jars are in the classpath (which is really bad)... > Yes, that is a good point. I'll move it into the *.optional.* structure you suggested |
From: Peter L. <pe...@le...> - 2003-08-30 08:27:11
|
Hi! Thanks for applying the patches. I had a toString method in JDBMWordsDataSource, nothing major - I wouldn't worry about it...

Regards,
Peter Leschev

----- Original Message -----
From: "Nick Lothian" <ni...@ma...>
To: <cla...@li...>
Sent: Saturday, August 30, 2003 2:22 PM
Subject: Re: [Classifier4j-devel] Dev Plan

> I've applied all outstanding patches.
>
> I had some trouble with the patch to JDBMWordsDataSource - it didn't apply
> cleanly for some reason. I had a look through it, and added the finalize
> method like you suggested. Were there any other changes to that class?
>
> Nick |
From: Peter L. <pe...@le...> - 2003-08-30 08:22:39
|
Hi Nick,

I'm currently compiling both Classifier4J & Classifier4J-Optional at the same time. I agree with having the same package structure for the tests (of the same project), but I wouldn't recommend having the same package structure across two projects (Classifier4J & Classifier4J-Optional).

The situation we're seeing here is two different classes being generated (or attempted to be generated) with the same fully qualified class name net.sf.classifier4J.bayesian.AllTests in the following locations:

classifier4j\Classifier4J\src\test\net\sf\classifier4J\bayesian\AllTests.java
classifier4j\Classifier4J-Optional\src\test\net\sf\classifier4J\bayesian\AllTests.java

Even if the compilation is done separately, if we refer to net.sf.classifier4J.bayesian.AllTests, we're going to get the AllTests class from either Classifier4J or Classifier4J-Optional depending on where the two jars are in the classpath (which is really bad)...

Regards,
Peter Leschev

----- Original Message -----
From: "Nick Lothian" <ni...@ma...>
To: <cla...@li...>
Sent: Saturday, August 30, 2003 2:12 PM
Subject: Re: [Classifier4j-devel] Dev Plan

> How are you compiling it?
>
> I intended the optional package to be a totally stand-alone project, so it
> shouldn't be compiling to the same target directory as the normal
> Classifier4J project (which would cause the errors you are seeing).
>
> I put it in the same package so that we can use package-level access to
> methods if we need to (in particular for the tests, since often I find that
> is a useful technique). |
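When two jars contain a class with the same fully qualified name, as described above, it can help to ask the JVM which location a class was actually loaded from. The following is a small diagnostic sketch using only standard JDK calls; `ClassOrigin` is a hypothetical helper name, and the label returned for classes without a code source is an arbitrary choice.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: reports where a class was loaded from. */
final class ClassOrigin {
    static String describe(Class<?> type) {
        // Classes loaded by the bootstrap loader have no CodeSource.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "bootstrap or unknown";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // For an application class this prints the jar or directory that won on the classpath.
        System.out.println(ClassOrigin.class.getName() + " <- " + describe(ClassOrigin.class));
        System.out.println(String.class.getName() + " <- " + describe(String.class));
    }
}
```

Calling `ClassOrigin.describe(...)` on the duplicated test class in each build would show which of the two jars the classpath order selected.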
From: Nick L. <ni...@ma...> - 2003-08-30 04:22:40
|
I've applied all outstanding patches.

I had some trouble with the patch to JDBMWordsDataSource - it didn't apply cleanly for some reason. I had a look through it, and added the finalize method like you suggested. Were there any other changes to that class?

Nick

----- Original Message -----
From: "Peter Leschev" <pe...@le...>
To: "Nick Lothian" <ni...@ma...>; <cla...@li...>
Sent: Wednesday, August 27, 2003 10:35 PM
Subject: Re: [Classifier4j-devel] Dev Plan

> Heya,
>
> > > It would be interesting to
> > > compare performance between different database solutions. eg.
> > > JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's
> > > HibernateWordsDatabase -> hsqldb / mysql etc.
> >
> > - Implement HibernateWordsDataSource
>
> I just submitted a patch which brings me closer to releasing a
> performance test class & the HibernateWordsDataSource class (should be ready
> by this weekend)...
>
> Pete |
From: Nick L. <ni...@ma...> - 2003-08-30 04:12:24
|
How are you compiling it?

I intended the optional package to be a totally stand-alone project, so it shouldn't be compiling to the same target directory as the normal Classifier4J project (which would cause the errors you are seeing).

I put it in the same package so that we can use package-level access to methods if we need to (in particular for the tests, since often I find that is a useful technique).

----- Original Message -----
From: "Peter Leschev" <pe...@le...>
To: "Nick Lothian" <ni...@ma...>; <cla...@li...>
Sent: Monday, August 25, 2003 5:45 PM
Subject: Re: [Classifier4j-devel] Dev Plan

> Heya,
>
> > If you look at the code for the examples in Classifier4J-Optional,
>
> Is there a reason why you've kept the same package structure in Optional?
> I'm getting compilation errors complaining that there are two
> net.sf.classifier4J.bayesian.AllTests classes. Could we put all the optional
> classes in the net.sf.classifier4J.optional package? (eg
> net.sf.classifier4J.optional.bayesian etc).
>
> Pete |
From: Peter L. <pe...@le...> - 2003-08-27 13:05:13
|
Heya,

> > It would be interesting to
> > compare performance between different database solutions. eg.
> > JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's
> > HibernateWordsDatabase -> hsqldb / mysql etc.
>
> - Implement HibernateWordsDataSource

I just submitted a patch which brings me closer to releasing a performance test class & the HibernateWordsDataSource class (should be ready by this weekend)...

Pete |
From: Peter L. <pe...@le...> - 2003-08-25 13:32:52
|
Heya,

> If you look at the code for the examples in Classifier4J-Optional,

Is there a reason why you've kept the same package structure in Optional? I'm getting compilation errors complaining that there are two net.sf.classifier4J.bayesian.AllTests classes. Could we put all the optional classes in the net.sf.classifier4J.optional package? (eg net.sf.classifier4J.optional.bayesian etc).

Pete |
From: Nick L. <ni...@ma...> - 2003-08-24 06:10:15
|
> I've added a new patch (FastHashMapWordsDataSource) - submitted via the > patch manager to improve the stats :) > And I have _finally_ got around to applying some of Pete's patches and bug fixes. Nick |
From: Peter L. <pe...@le...> - 2003-08-16 02:11:58
|
Heya,

> My patches to NNTP://RSS
> (http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt)
> use the HSQLDB database integrated in NNTP://RSS.

I took a look at NNTP://RSS & the Classifier4J patch. Damn useful. I wanted to write a job (as in employment) searching app which used Classifier4J to rate all the incoming jobs. All I have to do now is write an adapter for each job site which converts it to an RSS feed.

There needs to be a classifying category assigned for each RSS feed though. For example, the words in a job description would be a lot different from those in a news article that I'm interested in... Having a category for each feed would be simple to implement but not as effective as a few categories.

I've added a new patch (FastHashMapWordsDataSource) - submitted via the patch manager to improve the stats :)

Pete |
From: Nick L. <ni...@ma...> - 2003-08-10 03:15:20
|
> A couple of points:
> - Is there a reason why you've used tabs instead of spaces? Generally spaces
> are prefered, it's more standard. Some people may have their tab size set to
> 4 while others have it set to 8 etc... If you always convert tabs to spaces,
> it's always the same...

Yes, I've reset Eclipse to substitute spaces. Files are being fixed as I check them in.

> - Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in memory
> / disk java based database with a JDBC interface. It would be interesting to
> compare performance between different database solutions. eg.
> JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's
> HibernateWordsDatabase -> hsqldb / mysql etc.

If you look at the code for the examples in Classifier4J-Optional, you'll see some commented out code to use a JDBCWordsDataSource with HSQLDB.

If I use the training example, I get about 50 words per second with HSQLDB, but with JDBM it takes less than 1 second for all 3000 words. I'm using HSQLDB persistent tables and I'm not sure how often that writes to disk - I'm pretty sure it's not after every update, because the HSQLDB documentation talks about needing to do a CHECKPOINT to make sure it is written. With JDBM I only commit at the end of the training session, so that's a big speed win.

In the Analyser example, JDBM completes in less than 1 second, and HSQLDB runs at about 80 words per second.

My patches to NNTP://RSS (http://www.mackmo.com/nick/blog/java/?permalink=nntprssc4javailable.txt) use the HSQLDB database integrated in NNTP://RSS.

I've looked at Axion (http://www.mackmo.com/nick/blog/java/?permalink=axion2.txt) in the past, too.

> I'll look into the following:
> - Fix the following in BayesianClassifier
> * @todo need an option to only use the "X" most "important" words when
> calculating overall probability
> * "important" is defined as being most distant from NEUTAL_PROBABILITY

Cool.

> - Look into the current Tokenizer - For example, "1.4" currently gets split
> into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into
> "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up
> with a set of test cases.

Yes, that needs fixing. Also, I'm not sure how to deal with URLs: at the moment http://www.google.com/something gets split up, but I think it probably shouldn't (?)

> - Implement an HTML Tokenizer (depending on how it is configured, html tags
> will be either included or ignored).

Very good idea.

> - Implement HibernateWordsDataSource
> - Implement a project which uses Classifier4J.

That's a really good idea!! ;-) |
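The tokenizer cases raised above ("1.4" and "peter's" should stay whole) can be expressed as a regular expression plus a couple of test inputs. This is a sketch of one possible rule, not the Classifier4J tokenizer: `SketchTokenizer` is a hypothetical name, and the rule "letters and digits, optionally joined by a single inner apostrophe or dot" is an assumption. URL handling, which the reply leaves open, is only partly covered: a host name such as www.google.com stays in one piece, but the scheme and path are still split off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tokenizer sketch: keeps inner dots and apostrophes inside a token. */
final class SketchTokenizer {
    // One or more letters/digits, optionally followed by ('.' or '\'') plus more letters/digits.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['.][\\p{L}\\p{N}]+)*");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Java 1.4 and peter's code."));
        System.out.println(tokenize("see http://www.google.com/something"));
    }
}
```

A trailing full stop is dropped because the dot only counts when more letters or digits follow it, which is the behaviour the suggested test cases would need to pin down.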
From: Peter L. <pe...@le...> - 2003-08-08 14:21:06
|
Hi Nick,

I just did a cvs update and took a look at version 0.4. I like the IStopWordProvider concept....

A couple of points:
- Is there a reason why you've used tabs instead of spaces? Generally spaces are preferred, it's more standard. Some people may have their tab size set to 4 while others have it set to 8 etc... If you always convert tabs to spaces, it's always the same...
- Have you seen hsqldb? http://hsqldb.sourceforge.net/ provides an in memory / disk java based database with a JDBC interface. It would be interesting to compare performance between different database solutions. eg. JDBMWordsDataSource v's JDBCWordsDataSource -> hsqldb v's HibernateWordsDatabase -> hsqldb / mysql etc.

I'll look into the following:
- Fix the following in BayesianClassifier
  * @todo need an option to only use the "X" most "important" words when calculating overall probability
  * "important" is defined as being most distant from NEUTAL_PROBABILITY
- Look into the current Tokenizer - For example, "1.4" currently gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up with a set of test cases.
- Implement an HTML Tokenizer (depending on how it is configured, html tags will be either included or ignored).
- Implement HibernateWordsDataSource
- Implement a project which uses Classifier4J.

It's looking good!

Pete

----- Original Message -----
From: "Nick Lothian" <ni...@ma...>
To: <cla...@li...>
Sent: Sunday, August 03, 2003 5:12 PM
Subject: Re: [Classifier4j-devel] Dev Plan |
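The @todo quoted in the message above (use only the "X" words whose probability is most distant from neutral) amounts to sorting by distance from the neutral value and keeping the first X. A minimal sketch follows, assuming the neutral probability is 0.5; the class and method names are hypothetical and this is not the BayesianClassifier code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: pick the X word probabilities furthest from neutral. */
final class SignificantWords {
    // Assumption for this sketch: 0.5 stands for "no evidence either way".
    static final double NEUTRAL = 0.5;

    static List<Double> mostSignificant(List<Double> probabilities, int x) {
        List<Double> sorted = new ArrayList<>(probabilities);
        // Largest distance from NEUTRAL first.
        sorted.sort(Comparator.comparingDouble((Double p) -> Math.abs(p - NEUTRAL)).reversed());
        int keep = Math.max(0, Math.min(x, sorted.size()));
        return new ArrayList<>(sorted.subList(0, keep));
    }

    public static void main(String[] args) {
        List<Double> wordProbabilities = Arrays.asList(0.5, 0.99, 0.4, 0.05, 0.6);
        System.out.println(mostSignificant(wordProbabilities, 2));
    }
}
```

Only the selected values would then be fed into the overall probability calculation, so that a long text full of near-neutral words does not drown out the few strongly indicative ones.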
From: Nick L. <ni...@ma...> - 2003-08-05 12:34:46
|
Classifier4J version 0.4 is now available for your downloading pleasure.

This release has an optional jar, which includes a JDBM (<http://jdbm.sourceforge.net/>) datasource and a couple of simple demos.

Classifier4J now also includes the code for summary extraction used by the text summary web application (<http://www.mackmo.com/summary/>).

Nick |
From: Nick L. <ni...@ma...> - 2003-08-03 07:11:17
|
Currently I'm focused on two things:

1) Refactoring category support.
-- I've added ICategorisedClassifier and ICategorisedWordsDataSource interfaces which have methods like ICategorisedClassifier.classify(String category, String input); etc, so the categories can be used directly from the classifier, without having to do "setCategory" on the datasource. I can't see why we need to keep that state, so I'm removing it. I've just added these changes to CVS.

2) A Classifier4J-Optional jar, which (currently) contains a couple of demos, a JDBMWordsDataSource (very fast and reliable) and a JispWordsDataSource (fast, but prone to data corruption, so I'll probably throw it out). Currently this is not in CVS.

If you are still interested in the HibernateWordsDataSource, I would see it going in here.

As well as those changes I've done some work on Text Summary (http://www.mackmo.com/nick/blog/java/?permalink=TextSummaryApp.txt), which is also available.

I have some plans to do a 0.4 release sometime this week.

What are you interested in working on?

Nick

----- Original Message -----
From: "Peter Leschev" <pe...@le...>
To: <cla...@li...>
Sent: Friday, August 01, 2003 9:46 AM
Subject: [Classifier4j-devel] Dev Plan

> Hi Nick,
>
> what are your current plans for JClassifier? What are you planning on implementing in the
> near future? I just don't want to double up on what we do...
>
> Pete |
From: Peter L. <pe...@le...> - 2003-08-01 00:16:41
|
Hi Nick, what are your current plans for JClassifier? What are you planning on implementing in the near future? I just don't want to double up on what we do... Pete |
From: Nick L. <ni...@ma...> - 2003-07-22 02:40:49
|
Classifier4J version 0.3 is now available.

Classifier4J is a Java library that provides an API for automatic classification of text, including Bayesian classification. Version 0.3 is the first version recommended for general use.

Classifier4J is available from http://classifier4j.sourceforge.net/

Regards
Nick Lothian |
From: Nick L. <ni...@ma...> - 2003-07-21 07:40:01
|
I've made some fairly significant commits today. These include:

-- Stop Words Support: Allows words not to be used for classification (see the IStopWordProvider interface).
-- Training support: Training of the classifier can now be done via the BayesianClassifier, and the datasource will be updated with the new word statistics - thanks to Pete Leschev for the initial code for this (see the ITrainable interface, which is implemented by BayesianClassifier).
-- A "createTable" method on JDBCWordsDataSource which will create the database table if it doesn't already exist.
-- BayesianClassifier is now case insensitive by default.

I plan on doing a 0.3 release tomorrow.

Nick |