RE: [Classifier4j-devel] New Stop Words Provider
From: Nick L. <nl...@es...> - 2003-11-18 22:20:00
I had a think about this last night. I'll probably do moedusa's idea about
automatic resource loading, with a default filename
("defaultstopwords.txt"?), and then have a constructor that takes a name, so
you can specify a non-default list. Does that sound reasonable?

I intend to start committing the changes/additions we discussed over the last
week tonight, starting with a simple HTML parser, a stop word list provider,
the datasource (pooling) word-datasource and hopefully getting to a stemmer.

It'll take a while to show up on the anonymous CVS, though (up to a couple of
days sometimes), so if anyone would like faster access, give me your
SourceForge username and I'll add you for read-only access to the
Classifier4J CVS.

Nick

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Wednesday, 19 November 2003 8:24 AM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] New Stop Words Provider
>
> Attached is GammaStopWordsProvide.java. I discovered and implemented the
> ArrayList class.
>
> Still need to devise a way for users to pass the path to their custom stop
> list, or implement moedusa's idea about automatic resource location.
>
> Also should throw an exception either in addition to or instead of printing
> an error message.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN
>
> -----Original Message-----
> From: "Matt Collier" <MCo...@my...>
> To: cla...@li...
> Date: Sat, 15 Nov 2003 13:23:54 -0600
> Subject: Re: [Classifier4j-devel] New Stop Words Provider
>
> > Attached, find BetaStopWordsProvider which EXTENDS
> > DefaultStopWordsProvider.
> >
> > I think I'm getting the hang of this.
> >
> > To use this, you need to do something like this in your code:
> >
> > ICategorisedWordsDataSource wds=null; //define wds how you like
> > IStopWordProvider swp=new BetaStopWordsProvider();
> > ITokenizer tok=new DefaultTokenizer();
> >
> > BayesianClassifier classifier = new BayesianClassifier(wds,tok,swp);
> >
> > Everything has become clear to me now!
> >
> > One question remains in my mind: is it correct to say that our HTML
> > stripper and stemmer will both have to work out of
> > ITokenizer/DefaultTokenizer?
> >
> > Place BetaStopWordsProvider.java in the same directory as your
> > DefaultStopWordsProvider.java, make sure you have a stop-list at
> > c:/stoplist/english.stop and you should be in business.
> >
> > Matt Collier
> > RemoteIT
> > mco...@my...
> > 877-4-NEW-LAN
> >
> > -----Original Message-----
> > From: "Matt Collier" <MCo...@my...>
> > To: "Classifier4J" <cla...@li...>
> > Date: Sat, 15 Nov 2003 12:37:03 -0600
> > Subject: [Classifier4j-devel] New Stop Words Provider
> >
> > > Attached is an alternate stop words provider for Classifier4J. I simply
> > > copied the whole of DefaultStopWordsProvide.java and renamed it to
> > > AlphaStopWordsProvider.java.
> > >
> > > I am pretty sure that this is not the correct way to do this since there
> > > is a comment about overriding the getStopWords method, but I'm not sure
> > > how to do this right now. I wanted to get this code out for review.
> > > Please advise.
> > >
> > > This reads the stop list from a file "c:/stoplist/english.stop". You
> > > will need to download the stop list or create your own. There is a link
> > > on the wiki site for the stop-list that Nick found:
> > >
> > > http://www.ishmaelswiki.org/wiki/index.php/TextClassification
> > >
> > > There should be a single word on each line of your stop list file.
> > >
> > > Matt Collier
> > > RemoteIT
> > > mco...@my...
> > > 877-4-NEW-LAN
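
For anyone following along, here is a rough sketch of the approach floated in this thread: load the stop list as a classpath resource under a default name ("defaultstopwords.txt"), offer a constructor that takes a different name, read one word per line, and throw an exception instead of printing an error when the list is missing. The class name StopWordList, the UTF-8 encoding, and the case-insensitive matching are all assumptions made for illustration; none of this is the code that was committed to Classifier4J.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; StopWordList is not a Classifier4J class.
final class StopWordList {
    // Default resource name suggested in the thread.
    static final String DEFAULT_RESOURCE = "defaultstopwords.txt";

    private final Set<String> words;

    // Loads the default list from the classpath.
    StopWordList() throws IOException {
        this(DEFAULT_RESOURCE);
    }

    // Loads a named, non-default list from the classpath.
    StopWordList(String resourceName) throws IOException {
        this.words = loadResource(resourceName);
    }

    // Reads a list from any Reader, e.g. a file or an in-memory string.
    StopWordList(Reader reader) throws IOException {
        this.words = readWords(reader);
    }

    private static Set<String> loadResource(String resourceName) throws IOException {
        InputStream in = StopWordList.class.getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            // Fail loudly rather than printing a message and carrying on.
            throw new IOException("Stop word list not found on classpath: " + resourceName);
        }
        // UTF-8 is an assumption; the thread does not specify an encoding.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return readWords(reader);
        }
    }

    // One word per line; blank lines are skipped, words are trimmed and lower-cased.
    private static Set<String> readWords(Reader reader) throws IOException {
        Set<String> result = new HashSet<>();
        BufferedReader lines = new BufferedReader(reader);
        String line;
        while ((line = lines.readLine()) != null) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    // Case-insensitive lookup (an assumption, not taken from the thread).
    boolean isStopWord(String word) {
        return word != null && words.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

In Classifier4J itself, a class like this would presumably implement IStopWordProvider so that it can be passed to BayesianClassifier as in Matt's snippet; that wiring is left out here to keep the sketch standalone.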