RE: [Classifier4j-devel] New Stop Words Provider
From: Nick L. <nl...@es...> - 2003-11-18 22:20:00
I had a think about this last night.
I'll probably do moedusa's idea about automatic resource loading, with a
default filename ("defaultstopwords.txt"?), and then have a constructor that
takes a name, so you can specify a non-default list. Does that sound
reasonable?
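A rough sketch of what I have in mind, in case it helps the discussion. The class name ResourceStopWordsProvider and the isStopWord method are placeholders, not code that exists in the project yet; the real class would presumably implement IStopWordProvider, which I have left out so the sketch compiles on its own. It assumes one word per line, as Matt's file format does, and it throws an exception rather than printing an error when the list cannot be found.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical class: loads a stop word list from the classpath.
class ResourceStopWordsProvider {
    // Default resource name, as floated in the message above (an assumption).
    static final String DEFAULT_RESOURCE = "defaultstopwords.txt";

    private final String[] words; // lower-cased and sorted for binary search

    // Uses the default list found on the classpath.
    ResourceStopWordsProvider() throws IOException {
        this(DEFAULT_RESOURCE);
    }

    // Uses a named, non-default list found on the classpath.
    ResourceStopWordsProvider(String resourceName) throws IOException {
        this(open(resourceName));
    }

    // Reads one word per line from any Reader, skipping blank lines.
    ResourceStopWordsProvider(Reader source) throws IOException {
        List<String> list = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    list.add(word);
                }
            }
        }
        words = list.toArray(new String[0]);
        Arrays.sort(words);
    }

    // Fails loudly instead of printing an error message.
    private static Reader open(String resourceName) throws IOException {
        InputStream in = ResourceStopWordsProvider.class
                .getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            throw new FileNotFoundException(
                    "Stop word list not found on classpath: " + resourceName);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    boolean isStopWord(String word) {
        return Arrays.binarySearch(words, word.toLowerCase(Locale.ROOT)) >= 0;
    }
}
```

With this shape, new ResourceStopWordsProvider() picks up the default list and new ResourceStopWordsProvider("mylist.txt") picks up a custom one, as long as the file is somewhere on the classpath.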
I intend to start committing the changes/additions we discussed over the
last week tonight, starting with a simple html parser, a stop word list
provider, the datasource (pooling) word-datasource and hopefully getting to
a stemmer. It'll take a while to show up on the anonymous CVS, though (up to
a couple of days sometimes), so if anyone would like faster access give me
your sourceforge username and I'll add you for read-only access to the
Classifier4J CVS.
Nick
> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Wednesday, 19 November 2003 8:24 AM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] New Stop Words Provider
>
>
> Attached is GammaStopWordsProvider.java. I discovered and implemented the
> ArrayList class.
>
> Still need to devise a way for users to pass the path to their custom stop
> list, or implement moedusa's idea about automatic resource location.
>
> Also should throw an exception either in addition to or instead of printing an
> error message.
>
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN
>
>
> -----Original Message-----
> From: "Matt Collier" <MCo...@my...>
> To: cla...@li...
> Date: Sat, 15 Nov 2003 13:23:54 -0600
> Subject: Re: [Classifier4j-devel] New Stop Words Provider
>
> > Attached, find BetaStopWordsProvider, which EXTENDS DefaultStopWordsProvider.
> > I think I'm getting the hang of this.
> >
> > To use this, you need to do something like this in your code:
> >
> > ICategorisedWordsDataSource wds=null; //define wds how you like
> > IStopWordProvider swp=new BetaStopWordsProvider();
> > ITokenizer tok=new DefaultTokenizer();
> >
> > BayesianClassifier classifier = new BayesianClassifier(wds,tok,swp);
> >
> > Everything has become clear to me now!
> >
> > One question remains in my mind: is it correct to say that our html stripper
> > and stemmer will both have to work out of ITokenizer/DefaultTokenizer?
> >
> > Place BetaStopWordsProvider.java in the same directory as your
> > DefaultStopWordsProvider.java, make sure you have a stop-list at
> > c:/stoplist/english.stop and you should be in business.
> >
> > Matt Collier
> > RemoteIT
> > mco...@my...
> > 877-4-NEW-LAN
> >
> >
> > -----Original Message-----
> > From: "Matt Collier" <MCo...@my...>
> > To: "Classifier4J" <cla...@li...>
> > Date: Sat, 15 Nov 2003 12:37:03 -0600
> > Subject: [Classifier4j-devel] New Stop Words Provider
> >
> > > Attached is an alternate stop words provider for classifier4J. I simply
> > > copied the whole of DefaultStopWordsProvider.java and renamed it to
> > > AlphaStopWordsProvider.java.
> > >
> > > I am pretty sure that this is not the correct way to do this since there is a
> > > comment about overriding the getStopWords method, but I'm not sure how to do
> > > this right now. I wanted to get this code out for review. Please advise.
> > >
> > > This reads the stop list from a file "c:/stoplist/english.stop". You will
> > > need to download the stop list or create your own. There is a link on the
> > > wiki site for the stop-list that Nick found:
> > >
> > > http://www.ishmaelswiki.org/wiki/index.php/TextClassification
> > >
> > > There should be a single word on each line of your stop list file.
> > >
> > > Matt Collier
> > > RemoteIT
> > > mco...@my...
> > > 877-4-NEW-LAN
>