classifier4j-devel Mailing List for Classifier4J (Page 10)
Status: Beta
Brought to you by:
nicklothian
From: Matt C. <MCo...@my...> - 2003-11-14 00:55:28
|
Below are the top 20 words for one of my classified categories.

1) "nbsp": this was obviously originally "&nbsp;", which will be addressed when we implement an HTML tokenizer. In the meantime, however, I believe this is skewing my personal results.

2) "we": I see several occurrences of useless pronouns in this list. This can be addressed by an improved "stop list". There is evidently an excellent paper on the topic of stop lists, aptly named "A stop list for general text" by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2, 1989, ISSN 0163-5840. If anyone has access to this paper, please advise.

3) The dreaded "s": a result, no doubt, of incorrectly tokenizing possessive nouns and pronouns, contractions, etc. Anybody have a good algorithm for handling this?

4) By the match_counts on these words, I can see that each occurrence of a word in a single document goes to the database. I don't see how this behavior is going to produce the desired result, at least in my case. I have run across several papers written about the effects of word frequency on text classification. Anybody have any experience in this area?

+-------------+-------------+-------------+
| word        | match_count | description |
+-------------+-------------+-------------+
| nbsp        |        4671 | CPA         |
| we          |         874 | CPA         |
| our         |         595 | CPA         |
| quickbooks  |         478 | CPA         |
| tax         |         417 | CPA         |
| accounting  |         413 | CPA         |
| business    |         346 | CPA         |
| cpa         |         337 | CPA         |
| line        |         320 | CPA         |
| by          |         293 | CPA         |
| olive       |         279 | CPA         |
| year        |         264 | CPA         |
| s           |         255 | CPA         |
| will        |         253 | CPA         |
| help        |         238 | CPA         |
| bookkeeping |         238 | CPA         |
| murphy      |         231 | CPA         |
| do          |         223 | CPA         |
| firm        |         218 | CPA         |
| she         |         216 | CPA         |
+-------------+-------------+-------------+

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
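Points 2) to 4) above can be illustrated with a small sketch. This is not the Classifier4J tokenizer; the class name `TokenizerSketch`, the tiny stop list, and the regular expression are all assumptions chosen for illustration. It keeps "peter's"-style possessives and "1.4"-style numbers together while matching, strips a trailing "'s", drops stop words, and offers a per-document set so that a word counts once per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {
    // Tiny illustrative stop list (an assumption); a real list would be far longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("we", "our", "by", "will", "do", "she", "the", "a"));

    // A token is a run of letters/digits, optionally continued by an
    // apostrophe or dot followed by more letters/digits ("peter's", "1.4").
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['.][\\p{L}\\p{N}]+)*");

    /** Returns lower-cased tokens with possessive "'s" stripped and stop words removed. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            String token = m.group();
            if (token.endsWith("'s")) {
                // "murphy's" becomes "murphy" instead of "murphy" + "s"
                token = token.substring(0, token.length() - 2);
            }
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Each word at most once, for counting presence per document rather than frequency. */
    public static Set<String> distinctTokens(String text) {
        return new LinkedHashSet<>(tokenize(text));
    }
}
```

Whether presence-only counting actually improves results for a given corpus is an open question here; the sketch only shows how the two counting schemes differ at the tokenizer level.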
From: moedusa <mo...@in...> - 2003-11-13 11:20:24
|
Nick Lothian wrote:
> Yeah, looks good to me - although maybe I could just declare it to throw the
> NamingException

Well, that's your style. I prefer to throw trivial exceptions, without requiring people to import stuff for them. (In this case, it seems the client will pass only a DSN String to the DAO, so there is no need to import javax.naming just to handle the exception. But, I repeat, tastes differ, you know.)
From: Nick L. <nl...@es...> - 2003-11-13 06:17:37
|
Yeah, looks good to me - although maybe I could just declare it to throw the NamingException.

> -----Original Message-----
> From: moedusa [mailto:mo...@in...]
> Sent: Thursday, 13 November 2003 3:41 PM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] Update Word Probability Break Down
>
> Nick Lothian wrote:
> > What do people think about this (untested - I don't even know if it
> > compiles!) datasource.
>
> imho, that's okay, but I am not sure about runtime exceptions thrown,
> perhaps it would be better to do this way?
>
>     public DataSourceJDBCConnectionManager(String datasourceContext)
>             throws java.lang.InstantiationException {
>         this.datasourceContext = datasourceContext;
>         try {
>             Context ctx = new InitialContext();
>             dataSource = (DataSource) ctx.lookup(datasourceContext);
>         } catch (javax.naming.NamingException e) {
>             throw new InstantiationException(
>                     "Failure instantiating datasource: " + e.getMessage());
>         }
>     }
>
> then, clients will always know, if things go wrong...
>
> -------------------------------------------------------
> This SF.Net email sponsored by: ApacheCon 2003,
> 16-19 November in Las Vegas. Learn firsthand the latest
> developments in Apache, PHP, Perl, XML, Java, MySQL,
> WebDAV, and more! http://www.apachecon.com/
> _______________________________________________
> Classifier4j-devel mailing list
> Cla...@li...
> https://lists.sourceforge.net/lists/listinfo/classifier4j-devel
From: moedusa <mo...@in...> - 2003-11-13 05:09:28
|
Nick Lothian wrote:
> What do people think about this (untested - I don't even know if it
> compiles!) datasource.

imho, that's okay, but I am not sure about the runtime exceptions thrown. Perhaps it would be better to do it this way?

    // Needs javax.naming.Context, javax.naming.InitialContext and javax.sql.DataSource.
    public DataSourceJDBCConnectionManager(String datasourceContext)
            throws java.lang.InstantiationException {
        this.datasourceContext = datasourceContext;
        try {
            // Look the DataSource up once, at construction time.
            Context ctx = new InitialContext();
            dataSource = (DataSource) ctx.lookup(datasourceContext);
        } catch (javax.naming.NamingException e) {
            // Re-throw as a java.lang exception so clients need not import javax.naming.
            throw new InstantiationException(
                    "Failure instantiating datasource: " + e.getMessage());
        }
    }

Then clients will always know if things go wrong...
From: Matt C. <MCo...@my...> - 2003-11-13 04:48:40
|
hehe, we're telling each other about the same documents. The only difference is, you evidently understand what they mean.

I should qualify my earlier statement about my classification results. I have about 40 categories. The same document will score a .99 in several different categories. How am I to determine which category is best? Is this expected, or is there some deficiency in my data?

My application is to classify web sites by business category: Insurance, Printing, Accounting, Attorney, etc. I have already manually classified a fairly large number of sites for my corpus. I am treating each entire web site as one document. I am then trying to classify an entire website in the same fashion.

Is it correct to say that the existence of a particular word only counts one time per document? This seems to be a key point in the POPFile documentation as I understand it. Word frequency within a single document counts for nothing. Is this correct?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Matt C. <MCo...@my...> - 2003-11-13 04:30:35
|
None of this may be new information, but maybe it could be useful.

Bayes applied to multiple "buckets":
http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137

Symptom Model:
http://sourceforge.net/docman/display_doc.php?docid=16368&group_id=63137

These are from the Technical section in the POPFile docs at:
http://sourceforge.net/docman/?group_id=63137

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Nick L. <nl...@es...> - 2003-11-13 04:05:49
|
> BayesianClassifier classifier = new BayesianClassifier(wds);
> double probability = classifier.classify("category", "text to be classified");

Yep, looks good.

> This is functioning fine but most of my probabilities are
> either 0.01 or 0.99.

Yes, that is pretty much the way it works. See
<http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
for why this is.

POPFile uses logarithms to get around this (which is actually quite a good idea). Classifier4J uses cut-offs to avoid underflow and overflow.

> I saw somewhere in the source that the algorithm is choosing X most
> significant words. Is there an easy way for me to determine what these
> words are on a category by category basis?

No, you can't do this on a category by category basis. I haven't found it makes a big difference anyway, so I'd test changing this setting on a single category before you spend a lot of time changing this.

> I think the fact that I still have html in my training data is causing me
> difficulty. Does this sound right?

It depends on the task. In your typical Spam classifier HTML is an important indicator. I use Classifier4J for classifying RSS feeds and I don't strip HTML (that's not to say it wouldn't work better if I did - I just haven't tried it).

In a lot of cases the HTML supplies a surprising amount of useful data which Classifier4J can use.

> As for the issues with the Bayesian tokenizer:
> ---- correspondence between Pete and Nick on 2003-08-09
> Pete> Look into the current Tokenizer - For example, "1.4" currently gets
> split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split
> into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming
> up with a set of test cases.
>
> Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with URLs:
> at the moment http://www.google.com/something gets split up, but I think it
> probably shouldn't (?)
> ----
>
> Is this still outstanding?

Yes, these points are outstanding.

> How sophisticated is this Bayesian classifier when compared with POPFile or
> SpamAssassin? There is some interesting reading about the POPFile engine at:
> http://sourceforge.net/docman/?group_id=63137

SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and filters mail like that (AFAIK?)

Classifier4J is an (almost) pure, naive Bayesian classifier (POPFile is also "naive" in the technical meaning of the word - it treats each word as being independent). It implements Bayes' theorem with very little variation (you can specify to only use the X most significant words, and it uses cut-offs to avoid arithmetic underflow).

I'm more interested in investigating a Vector-Space classifier than investing a huge amount of time modifying the Bayesian algorithm, especially since the modifications most people do are aimed at detecting Spam, which isn't my core goal with C4J. OTOH, if someone can suggest a change that improves performance or gives some other tangible gain then I'm interested.

> POPFile has been designed to classify emails into "buckets" or categories.
> Evidently, there are some mathematical shortcuts if you're trying to classify
> a message against several different categories.

Can you point me at them? I didn't see them in that doco in the POPFile project.

> One critical point made in POPFile is that words NOT in a document may be as
> important as the words that ARE in a document. Does c4J take this into
> account?

I'm not quite sure what you (or they?) mean here. Can you point me at what they say?

Classifier4J does (kind of) take words that are not in a document into account. If a particular word (say "Java") isn't in a document then the document won't get the score-boost of having that word in there.

Nick
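The remark about logarithms versus cut-offs can be made concrete with a short sketch. It is not Classifier4J's or POPFile's actual code; the class `LogSpaceBayes` and its method names are hypothetical, and the combining rule is assumed to be the usual naive form prod(p) / (prod(p) + prod(1 - p)) over per-word probabilities. Each probability is assumed to lie strictly between 0 and 1.

```java
public class LogSpaceBayes {

    /** Direct form: prod(p) / (prod(p) + prod(1 - p)). Underflows for long inputs. */
    public static double combineDirect(double[] wordProbs) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbs) {
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    /** Sum of log(p / (1 - p)); equal to log(prod(p) / prod(1 - p)) but without underflow. */
    public static double logOdds(double[] wordProbs) {
        double sum = 0.0;
        for (double p : wordProbs) {
            sum += Math.log(p) - Math.log(1.0 - p);
        }
        return sum;
    }

    /** Same probability as combineDirect, recovered from the log-odds. */
    public static double combineLog(double[] wordProbs) {
        return 1.0 / (1.0 + Math.exp(-logOdds(wordProbs)));
    }
}
```

For two words with probabilities 0.9 and 0.8 both forms give 36/37. For a few thousand words at 0.4 each, both products in the direct form underflow to zero and the result is NaN, while the log-odds stays finite. Since the final probability saturates near 0 or 1, comparing the raw log-odds across categories is one possible heuristic for ranking several categories that all report about 0.99; whether such scores are comparable across categories depends on how each category was trained.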
From: Nick L. <nl...@es...> - 2003-11-13 03:19:14
|
> Those of you who have been using c4J for a while, what word data source are
> you using?

Is there anyone out there in the "using c4j for a while" category?

> Nick, you said you're using a flat file, how do you do this?

I'm using JDBMWordsDataSource (in the classifier4j-optional jar). It's not a flat file, it's a non-relational database - see <http://jdbm.sourceforge.net/>

It is much, much faster (at least 10 times, from memory) than any JDBC based database I've tried. However, the current implementation doesn't support more than one category.

I'm also using an old version of Classifier4J with HSQLDB for another project.

> What performance data do we have for different data sources?

I do (or did?) have some hard numbers somewhere. I'll try and find them for you.
From: Matt C. <MCo...@my...> - 2003-11-13 03:13:04
|
First, to make sure I'm implementing this properly, here's what I'm doing:

    BayesianClassifier classifier = new BayesianClassifier(wds);
    double probability = classifier.classify("category", "text to be classified");

This is functioning fine but most of my probabilities are either 0.01 or 0.99.

I saw somewhere in the source that the algorithm is choosing X most significant words. Is there an easy way for me to determine what these words are on a category by category basis?

I think the fact that I still have html in my training data is causing me difficulty. Does this sound right?

As for the issues with the Bayesian tokenizer:

---- correspondence between Pete and Nick on 2003-08-09
Pete> Look into the current Tokenizer - For example, "1.4" currently gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up with a set of test cases.

Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with URLs: at the moment http://www.google.com/something gets split up, but I think it probably shouldn't (?)
----

Is this still outstanding?

How sophisticated is this Bayesian classifier when compared with POPFile or SpamAssassin? There is some interesting reading about the POPFile engine at:
http://sourceforge.net/docman/?group_id=63137

POPFile has been designed to classify emails into "buckets" or categories. Evidently, there are some mathematical shortcuts if you're trying to classify a message against several different categories.

One critical point made in POPFile is that words NOT in a document may be as important as the words that ARE in a document. Does c4J take this into account?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Matt C. <MCo...@my...> - 2003-11-13 02:40:51
|
Those of you who have been using c4J for a while, what word data source are you using? Nick, you said you're using a flat file, how do you do this? What performance data do we have for different data sources? Matt Collier RemoteIT mco...@my... 877-4-NEW-LAN |
From: Nick L. <nl...@es...> - 2003-11-13 00:54:06
|
What do people think about this (untested - I don't even know if it compiles!) datasource? It integrates with the javax.sql.DataSource stuff, so if the environment you are running in can do connection pooling, Classifier4J can utilise it.

You'd use it something like this:

    String jndiLookup = "java:comp/env/jdbc/TestDB";
    IJDBCConnectionManager cm = new DataSourceJDBCConnectionManager(jndiLookup);
    ICategorisedWordsDataSource wds = new JDBCWordsDataSource(cm);

Nick
From: Nick L. <nl...@es...> - 2003-11-13 00:25:24
|
> Bayesian tokenizer. It was reported that the tokenizer improperly handles a
> number of strings including possessive pronouns and others. Anybody working
> on this?

I don't remember this discussion. Could you post a reference?

> HTML tokenizer for Bayesian system. Idea was to be able to "ignore" xml in a
> classification string. This happens to be required for my current project.
> I've either got to remove HTML from my source documents or get C4J to ignore
> it.

Yes, this would be nice. If you want to do it in C4J then you need to implement the net.sf.classifier4J.ITokenizer interface.

> Connection pooling. What ARE we going to do about connection pooling?

Still looking at this.

> Documentation. We need some. I would like to help with this. How do we do
> it? What framework are we using for documentation?

Cool. I'm using Maven to build the website (which contains the docs, such as they are). The docs themselves are in CVS (see <http://cvs.sourceforge.net/viewcvs.py/classifier4j/Classifier4J/xdocs/>) in xdoc format. The xdoc format is (kind of) documented at <http://jakarta.apache.org/site/jakarta-site-tags.html>

Patches/New docs/Whatever are gratefully accepted.
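As a rough idea of what an HTML-ignoring tokenizer might look like, here is a minimal sketch. It deliberately does not implement net.sf.classifier4J.ITokenizer, so that it compiles on its own; the `String[] tokenize(String)` shape is an assumption about that interface, and the class name `HtmlStrippingTokenizerSketch` is made up. Tags and entities are removed with regular expressions, which is crude: script and style contents, comments, and malformed markup are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HtmlStrippingTokenizerSketch {
    // Anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Named or numeric character references such as &nbsp; or &#160;.
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");
    // Split on runs of non-word characters.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    /** Removes tags and entities, then splits the remaining text into words. */
    public static String[] tokenize(String input) {
        String text = TAG.matcher(input).replaceAll(" ");
        text = ENTITY.matcher(text).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        for (String t : NON_WORD.split(text)) {
            if (!t.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```

With this approach "&nbsp;" never reaches the word table, at the cost of also discarding any signal the markup itself might carry.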
From: moedusa <mo...@in...> - 2003-11-12 23:29:20
|
Matt Collier wrote:
> Anyone see any problems with the pooling code I used from:
>
> http://developer.java.sun.com/developer/onlineTraining/Programming/JDCBook/conpool.html#example

I am sure that it would be better to use some kind of container-managed connection pool. That is a more configurable and, well, proper solution. You can find how to configure Tomcat with the Apache DBCP connection pool here (with code sample):
http://jakarta.apache.org/tomcat/tomcat-4.1-doc/jndi-datasource-examples-howto.html
From: Matt C. <MCo...@my...> - 2003-11-12 23:29:06
|
Alright... now that we're over THAT hump. There was talk in the mailing list archive regarding the addition of a number of features. Can someone bring me up to speed on where we're at in the following areas?

Bayesian tokenizer. It was reported that the tokenizer improperly handles a number of strings including possessive pronouns and others. Anybody working on this?

HTML tokenizer for Bayesian system. Idea was to be able to "ignore" xml in a classification string. This happens to be required for my current project. I've either got to remove HTML from my source documents or get C4J to ignore it.

Connection pooling. What ARE we going to do about connection pooling?

Documentation. We need some. I would like to help with this. How do we do it? What framework are we using for documentation?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Matt C. <MCo...@my...> - 2003-11-12 23:20:26
|
Anyone see any problems with the pooling code I used from:

http://developer.java.sun.com/developer/onlineTraining/Programming/JDCBook/conpool.html#example

I could not believe how easy it was to implement. I created a package called "pool" under my class, changed about 5 lines of code, and away we went.

There is one irritating factor about this: Eclipse reports that the "class must implement the inherited abstract method" for about 20 functions in JDCConnections.java. Thankfully, I don't need those functions (at the moment). So, this is certainly one weakness in this solution.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Nick L. <nl...@es...> - 2003-11-12 23:09:32
|
> There are many open source implementations of JDBC connection pooling,
> there's no need to implement our own. eg:
>
> http://jakarta.apache.org/commons/dbcp/
> http://sourceforge.net/projects/c3p0/
>
> Nick? I know you like to keep your dependencies down :)

We could add an IJDBCConnectionManager implementation for one or both of these in the optional packages if that would help anyone.
From: Nick L. <nl...@es...> - 2003-11-12 23:08:11
|
> 3) if this problem is not isolated to my environment, how has it gone
> undetected? Seems doubtful that no one has attempted to classify teachMatch()
> a 4000+ word document, or maybe it is possible.

In an earlier version of C4J I did some testing with 11,000 words and MySQL. I didn't have any problem (apart from performance - which was why I switched to using a non-relational database). However, I agree that others may have the same problem you have.

> 4) if this problem is not limited to my configuration, what is to be done
> about it? It was suggested that I "might" want to implement connection
> pooling in my own code. It seems to me, in light of this issue, classifier4J
> needs to implement connection pooling internally? Is this possible?

I'm not keen to implement connection pooling in C4J because then it won't operate well in environments that already do their own connection pooling. On the other hand, I do want to fix this bug, and the connection management system in C4J was designed to make it easy to implement connection pooling if required.

> 5) Meanwhile, any hints on implementing connection pooling in conjunction with
> classifier4J would be greatly appreciated.

Write a class that implements the net.sf.classifier4J.bayesian.IJDBCConnectionManager interface, then use that instead of DriverManagerJDBCConnectionManager.

You can probably just rename the modified DriverManagerJDBCConnectionManager to PoolingJDBCConnectionManager or something. Could you submit this back to classifier4J? Just send the code to the list.

> I really wish I had some idea what I was talking about...

It makes sense to me!
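A pooling connection manager along the lines suggested above might look roughly like the following. This is a sketch under assumptions: it does not implement the real IJDBCConnectionManager (whose exact methods are not shown in this thread), the `getConnection`/`returnConnection` pair is a guess at its shape, and `ConnectionFactory` and the class name are hypothetical. It does no validation of stale connections, which a production pool such as DBCP or c3p0 would handle.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayDeque;
import java.util.Deque;

public class PoolingConnectionManagerSketch {

    /** Hypothetical factory; in practice it might wrap DriverManager.getConnection(...). */
    public interface ConnectionFactory {
        Connection create() throws SQLException;
    }

    private final ConnectionFactory factory;
    private final Deque<Connection> idle = new ArrayDeque<>();
    private final int maxIdle;

    public PoolingConnectionManagerSketch(ConnectionFactory factory, int maxIdle) {
        this.factory = factory;
        this.maxIdle = maxIdle;
    }

    /** Reuses an idle connection if one exists, otherwise opens a new one. */
    public synchronized Connection getConnection() throws SQLException {
        Connection c = idle.poll();
        return (c != null) ? c : factory.create();
    }

    /** Keeps the connection for reuse, or closes it if the pool is already full. */
    public synchronized void returnConnection(Connection c) throws SQLException {
        if (idle.size() < maxIdle) {
            idle.push(c);
        } else {
            c.close();
        }
    }
}
```

The point of the design is that a caller which asks for a connection once per word update no longer opens and tears down a physical connection each time; it only does so when the pool is empty.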
From: moedusa <mo...@in...> - 2003-11-12 21:40:56
|
Matt Collier wrote:
> At any rate, it involved making only a minor change to
> DriverManagerJDBCConnectionManager.java. With this pooling implementation you
> only need to pass a driver name to getConnection so I altered
>
> Public Connection getConnection() to accept a single string (dbDriver).

he-he-he
http://sourceforge.net/mailarchive/forum.php?thread_id=3442885&forum_id=34026

"hmm... upgrades?" ;)

Philipp.
From: Matt C. <MCo...@my...> - 2003-11-12 21:31:01
|
By implementing the connection pooling code provided on the following site:

http://developer.java.sun.com/developer/onlineTraining/Programming/JDCBook/conpool.html#example

I was able to eliminate the updateWordProbability issue I was having, which was actually a JDBC issue.

At any rate, it involved making only a minor change to DriverManagerJDBCConnectionManager.java. With this pooling implementation you only need to pass a driver name to getConnection, so I altered

public Connection getConnection()

to accept a single string (dbDriver).

I don't know if this was the best way to handle this, but it's working!

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
From: Peter L. <pe...@le...> - 2003-11-12 21:25:58
|
Hi Matt,

There are many open source implementations of JDBC connection pooling, there's no need to implement our own. eg:

http://jakarta.apache.org/commons/dbcp/
http://sourceforge.net/projects/c3p0/

Nick? I know you like to keep your dependencies down :)

Regards,
Peter |
From: Matt C. <MCo...@my...> - 2003-11-12 17:38:59
|
Oh! Happy Day!

After I reconstituted dbTest.java (attached) to open and close the database connection each iteration as updateWordProbability does, the exact same error occurs at exactly the same time (around 3900 iterations). This is without using any classifier4J code.

So, now the questions arise...

1) Is this problem still somehow isolated to my configuration? I would love it if someone could reproduce this problem.

2) Is this behavior somehow by design and, if so, is there a setting to be altered?

3) If this problem is not isolated to my environment, how has it gone undetected? It seems doubtful that no one has attempted to classify teachMatch() a 4000+ word document, or maybe it is possible.

4) If this problem is not limited to my configuration, what is to be done about it? It was suggested that I "might" want to implement connection pooling in my own code. It seems to me, in light of this issue, that classifier4J needs to implement connection pooling internally. Is this possible?

5) Meanwhile, any hints on implementing connection pooling in conjunction with classifier4J would be greatly appreciated.

I really wish I had some idea what I was talking about...

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-12 17:05:24
|
Is it correct to say that our database connection is getting set up and torn down each time updateWordProbability is called?

From what I gather, this is not good practice to begin with. Opening and closing a database connection 60-80 times per second has to be taxing. As I understand it, this is where connection pooling comes in.

I wonder if JDBC might have some protection mechanism built in for clients that go haywire. Perhaps it closes connections for processes that open and close connections too many times. Maybe it just fails.

AH HA! This is a difference between my dbTest.java and connect.java. I am not connecting and disconnecting on each record. I will rebuild this to test.

I don't know the first thing about how to implement connection pooling to begin with, much less in this context, but I guess that's what I'll start working on!

BTW, I've narrowed the error to the call to connectionManager.getConnection() in updateWordProbability. I have increased the exception handling to produce the following information:

SQLState: 08S01
VendorError: 0
NextException: null

SQLState 08S01 = MySQL error ER_BAD_HOST_ERROR according to:

http://mysql.mirror.trueserver.nl/doc/en/Error-returns.html

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
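The extra exception handling described in this message could look roughly like the sketch below. It uses only the standard java.sql.SQLException accessors (getSQLState, getErrorCode, getNextException); the SqlErrorReport class name and the sample exceptions are made up for illustration and are not taken from JDBCWordsDataSource.

```java
import java.sql.SQLException;

class SqlErrorReport {
    /** Formats every exception in the chain with the fields quoted in the thread. */
    static String describe(SQLException first) {
        StringBuilder out = new StringBuilder();
        for (SQLException e = first; e != null; e = e.getNextException()) {
            out.append("Message: ").append(e.getMessage()).append('\n');
            out.append("SQLState: ").append(e.getSQLState()).append('\n');
            out.append("VendorError: ").append(e.getErrorCode()).append('\n');
            // Shows whether the driver chained another exception after this one.
            out.append("NextException: ").append(e.getNextException()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample exceptions only; a real driver supplies its own message and codes.
        SQLException sample = new SQLException("Communication failure (sample)", "08S01", 0);
        sample.setNextException(new SQLException("Underlying cause (sample)", "HY000", 1234));
        System.out.print(describe(sample));
    }
}
```

Walking the chain matters because some drivers attach the more specific failure as a "next" exception rather than as the first one thrown.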
From: Matt C. <MCo...@my...> - 2003-11-12 15:32:35
|
Hi Nick, yes I am using the latest CVS code.

How did you determine that the problem resides in the createTable function? Have you been able to reproduce the problem?

I am not catching an exception there; I'm catching it in updateWordProbability.

I am including the stack trace and my JDBCWordsDataSource with the additional debug code in it.

I am still configured to use HSQLDB, which is reflected in the trace.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
From: Nick L. <nl...@es...> - 2003-11-12 05:51:31
|
> ----
> More data on this issue:
>
> Switching to HSQLDB produces the exact same results. I have attached the
> revised connect.java for use with HSQLDB.
> ----
> Another interesting discovery. If I attempt to run connect.java a second time
> immediately after running it the first time when it errors out, the following
> message is displayed immediately:
>
> WordsDataSourceException Occurred : Problem creating table
> java.lang.IllegalArgumentException: IWordsDataSource can't be null
>         at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:141)
>         at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:128)
>         at net.sf.classifier4J.bayesian.BayesianClassifier.<init>(BayesianClassifier.java:118)
>         at Connect.main(Connect.java:26)
> Exception in thread "main"
>
> However, if I wait about 60-90 seconds between executions, it will process the
> ~3900 records again and die.
> ----

You are getting the second exception trace (java.lang.IllegalArgumentException: IWordsDataSource can't be null) because you are ignoring the WordsDataSourceException, which means that the IWordsDataSource you are using is null. That makes sense.

Exactly why you are getting the original problem is escaping me at the moment.

The error comes from line 247 in the CVS version of JDBCWordsDataSource.java (you are using the CVS version, right?).

It occurs if an exception occurs somewhere in the following code:

224     con = connectionManager.getConnection();
225
226     // check if the word_probability table exists
227     DatabaseMetaData dbm = con.getMetaData();
228     ResultSet rs = dbm.getTables(null, null, "WORD_PROBABILITY", null);
229     if (!rs.next()) {
230         // the table does not exist
231         Statement stmt = con.createStatement();
232         // Under Axion 1.0M1, use
233         // stmt.executeUpdate( "CREATE TABLE word_probability ( "
234         //     + " word VARCHAR(255) NOT NULL,"
235         //     + " category VARCHAR(20) NOT NULL,"
236         //     + " match_count INTEGER NOT NULL,"
237         //     + " nonmatch_count INTEGER NOT NULL, "
238         //     + " PRIMARY KEY(word, category) ) ");
239         stmt.executeUpdate( "CREATE TABLE word_probability ( "
240             + " word VARCHAR(255) NOT NULL,"
241             + " category VARCHAR(20) NOT NULL,"
242             + " match_count INT DEFAULT 0 NOT NULL,"
243             + " nonmatch_count INT DEFAULT 0 NOT NULL, "
244             + " PRIMARY KEY(word, category) ) ");
245     }

There are three possibilities here:

1) connectionManager.getConnection(); is failing
2) DatabaseMetaData dbm = con.getMetaData(); or ResultSet rs = dbm.getTables(null, null, "WORD_PROBABILITY", null); is failing
3) The create table query is failing.

I suspect it is one of the first two. I found a reference to MySQL giving incorrect error messages when tables are missing <http://dbforums.com/arch/174/2003/10/952374>, and the error given is the error you were getting when you were using MySQL.

Could you put an e.printStackTrace() in where it catches the SQLException (i.e., just before line 247) and send the stack trace you get?

Nick |
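The point about the ignored WordsDataSourceException can be illustrated with a small sketch. All types here (DataSourceSetupException, WordStore, FailFastDemo) are hypothetical stand-ins rather than Classifier4J classes; the sketch only contrasts swallowing a setup exception, which lets a null reference escape to a later constructor, with rethrowing it so that the original cause stays visible.

```java
import java.sql.SQLException;

/** Hypothetical stand-in for a checked data source setup exception. */
class DataSourceSetupException extends Exception {
    DataSourceSetupException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical stand-in for the word data source interface. */
interface WordStore { }

class FailFastDemo {
    /** Simulates a data source whose table check fails when the database is unreachable. */
    static WordStore openStore(boolean databaseAvailable) throws DataSourceSetupException {
        if (!databaseAvailable) {
            throw new DataSourceSetupException("Problem creating table",
                    new SQLException("connection failed (sample)", "08S01"));
        }
        return new WordStore() { };
    }

    /** Anti-pattern: the exception is only logged, so the caller receives null. */
    static WordStore openAndIgnore(boolean databaseAvailable) {
        WordStore store = null;
        try {
            store = openStore(databaseAvailable);
        } catch (DataSourceSetupException e) {
            System.err.println("setup failed: " + e.getMessage());
        }
        return store; // null when setup failed
    }

    /** Alternative: stop immediately and keep the original cause attached. */
    static WordStore openOrFail(boolean databaseAvailable) {
        try {
            return openStore(databaseAvailable);
        } catch (DataSourceSetupException e) {
            throw new IllegalStateException("Cannot continue without a word store", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ignored result: " + openAndIgnore(false));
        try {
            openOrFail(false);
        } catch (IllegalStateException e) {
            e.printStackTrace(); // the trace includes the underlying SQLException
        }
    }
}
```

Failing at the point of the first error keeps the SQLException in the stack trace, which is the information being asked for here.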
From: Nick L. <nl...@es...> - 2003-11-12 05:21:28
|
> I just discovered that the reply address on the list messages is not the list
> but the sender. Is it possible to alter this setting and would we want to?

Yes, we would want to. I've changed it. |