classifier4j-devel Mailing List for Classifier4J (Page 9)
Status: Beta
Brought to you by:
nicklothian
From: Nick L. <nl...@es...> - 2003-11-17 04:15:14
|
I won't get a chance to respond in detail to all these emails until tomorrow night. Briefly, though: yes, I think passing an additional argument in the constructor would be the way to go.

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Monday, 17 November 2003 12:59 PM
> To: Classifier4J
> Subject: [Classifier4j-devel] Where to put the stemmer?
>
> Provisions have been made for a custom tokenizer and a custom stop list. The tokenizer executes prior to the stop list being applied. I initially thought that the stemmer would be part of the tokenizer; however, we know that we cannot stem before we apply the stop list. Do we need to expand BayesianClassifier.java to accept an additional argument, IStemmer?
>
> If so, at what point do we pass the code to the stemmer?
>
> Looking around, I found the transformWord() method in BayesianClassifier.java and I called the stemmer method from there. It works fine, but this is not a long-term solution.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-17 02:26:35
|
Provisions have been made for a custom tokenizer and a custom stop list. The tokenizer executes prior to the stop list being applied. I initially thought that the stemmer would be part of the tokenizer; however, we know that we cannot stem before we apply the stop list. Do we need to expand BayesianClassifier.java to accept an additional argument, IStemmer?

If so, at what point do we pass the code to the stemmer?

Looking around, I found the transformWord() method in BayesianClassifier.java and I called the stemmer method from there. It works fine, but this is not a long-term solution.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
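The ordering raised in this message (tokenize, apply the stop list, then stem) could be sketched roughly as follows. This is only an illustration, not Classifier4J code: the `Tokenizer`, `StopWords` and `Stemmer` interfaces are hypothetical stand-ins for ITokenizer, IStopWordProvider and the proposed IStemmer, and the toy stemmer merely strips a trailing "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins; these are NOT the real Classifier4J interfaces.
interface Tokenizer { String[] tokenize(String input); }
interface StopWords { boolean isStopWord(String word); }
interface Stemmer { String stem(String word); }

class StemmingPipeline {
    private final Tokenizer tokenizer;
    private final StopWords stopWords;
    private final Stemmer stemmer;

    // The stemmer is passed to the constructor alongside the tokenizer and stop list.
    StemmingPipeline(Tokenizer tokenizer, StopWords stopWords, Stemmer stemmer) {
        this.tokenizer = tokenizer;
        this.stopWords = stopWords;
        this.stemmer = stemmer;
    }

    // Tokenize, lower-case, drop stop words, and only then stem what is left.
    List<String> process(String input) {
        List<String> result = new ArrayList<String>();
        for (String token : tokenizer.tokenize(input)) {
            String word = token.toLowerCase();
            if (word.isEmpty() || stopWords.isStopWord(word)) {
                continue;
            }
            result.add(stemmer.stem(word));
        }
        return result;
    }

    // Toy wiring for experiments; a real setup would plug in a proper stemmer.
    static StemmingPipeline demo() {
        final Set<String> stops = new HashSet<String>(Arrays.asList("the", "are", "was"));
        Tokenizer tokenizer = input -> input.split("\\W+");
        StopWords stopWords = stops::contains;
        Stemmer stemmer = word -> word.length() > 3 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
        return new StemmingPipeline(tokenizer, stopWords, stemmer);
    }
}
```

Because the stop list holds whole, unstemmed words, filtering runs on the raw tokens and the stemmer only ever sees the words that survive.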
From: moedusa <mo...@in...> - 2003-11-16 10:49:30
|
Matt Collier wrote:
> I think I just figured out the primary difference between what POPFile is doing and what we are currently doing.
>
> POPFile is keeping track of how many messages have been trained in each category as well as an overall message count. I believe it is this additional information that is allowing them to calculate additional probabilities for multi-category sorting.
>
> Does this sound reasonable?

I think, yes. Though I have no idea how, even with this kind of metainformation, it could produce precise results... But I know that any metainformation is very useful :)

It is also interesting to me whether the classifier could somehow notify something when it is not sure what category to assign. Let's say when the probability is neutral, couldn't it raise an event saying that the text was missed, so an operator could train it? That seems useful to me, because if you use a classifier to classify, it must classify, and if it does not know what to do, it should just ask... Well, I suggest thinking about how to make it a little bit more self-learning or something...

Of course, it is possible to catch a neutral probability from external code, but I think it would be a nice option in the API as well. Just a thought, nothing more. |
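One way the "ask when unsure" idea might look is a small listener hook, sketched below. Everything here is hypothetical: `UnsureListener` and `UnsureNotifier` are invented names, and the sketch assumes the classifier returns a probability for which 0.5 means neutral.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback; nothing like this is confirmed to exist in Classifier4J.
interface UnsureListener {
    void onUnsure(String text, double probability);
}

class UnsureNotifier {
    // Assumed neutral score: neither a match nor a non-match.
    static final double NEUTRAL = 0.5;

    private final double tolerance;
    private final List<UnsureListener> listeners = new ArrayList<UnsureListener>();

    // tolerance is how far from NEUTRAL a score may be and still count as "unsure".
    UnsureNotifier(double tolerance) {
        this.tolerance = tolerance;
    }

    void addListener(UnsureListener listener) {
        listeners.add(listener);
    }

    // Notifies every listener and returns true when the score is close to neutral.
    boolean check(String text, double probability) {
        if (Math.abs(probability - NEUTRAL) > tolerance) {
            return false;
        }
        for (UnsureListener listener : listeners) {
            listener.onUnsure(text, probability);
        }
        return true;
    }
}
```

A caller would run `check(text, score)` after each classification, and a listener could queue the text for an operator to train later.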
From: moedusa <mo...@in...> - 2003-11-16 10:38:41
|
> Matt Collier wrote:
>
>> See attached, you will need Xerces and NekoHTML in your classpath.

Just to make a note: there is one more option for dealing with HTML soup (when you need to clean up MSWord HTML, for example). It seems that NekoHTML does the same thing, but there is another library called JTidy (http://lempinen.net/sami/jtidy/), based on code from the W3C Tidy (http://www.w3.org/People/Raggett/tidy/). Since I have not worked with NekoHTML, I cannot compare them, but I must say that JTidy is a pretty good library. It can be used like a JavaBean (http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153) and, finally, it has a very nice option: draconianWord2000Cleaning (http://www.w3.org/People/Raggett/tidy/#word2000). I have used it for this kind of thing. Also, it is not bound to a specific Xerces version. |
From: moedusa <mo...@in...> - 2003-11-16 10:21:24
|
Matt Collier wrote:
> Attached is an alternate stop words provider for classifier4J. I simply copied the whole of DefaultStopWordsProvider.java and renamed it to AlphaStopWordsProvider.java.

May I suggest one thing? I think it would be better not to hard-code the stop list, or pass a string or URL with the location of the stop-words file, but to find it automatically from the classpath. The idea came from an article at onjava.com (http://www.onjava.com/pub/a/onjava/excerpt/jebp_3/index1.html?page=3). Here is a small quotation explaining what should be done:

"Example 3-4[http://www.onjava.com/pub/a/onjava/excerpt/jebp_3/index1.html?page=3#ex3-4] demonstrates the search technique with a class called Resource. Given a resource name, the Resource constructor searches the class path and resource path attempting to locate the resource. When the resource is found, it makes available the resource contents as well as its directory location and last modified time (if those are available). The last modified time helps an application know, for example, when to reload the configuration data. The class uses special code to convert file: URL resources to File objects. This proves handy because URLs, even file: URLs, often don't expose special features such as a modified time. By searching both the class path and the resource path this class can find server-wide resources and per-application resources."

You can find the code for that class here: http://www.onjava.com/pub/a/onjava/excerpt/jebp_3/index1.html?page=3

It seems that it would be a better solution. Well, I hope.

Sorry for my poor English; it is not my native language. I am also sorry for just making suggestions and not doing any coding, but I have three deadlines right now and simply have no time for that. Still, I want to help this project somehow to become more useful. |
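A minimal sketch of the classpath idea, assuming a stop list with one word per line. It uses plain `ClassLoader.getResourceAsStream` rather than the Resource class from the article, and `ClasspathStopWords` is a hypothetical name; its `isStopWord` method is only assumed to resemble what an IStopWordProvider implementation would need.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class; a sketch of a stop-word provider fed from the classpath.
class ClasspathStopWords {
    private final Set<String> words = new HashSet<String>();

    // Reads one stop word per line; blank lines are skipped, case is ignored.
    ClasspathStopWords(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    // Looks the file up on the classpath; the resource name is the caller's choice.
    static ClasspathStopWords fromClasspath(String resourceName) throws IOException {
        InputStream in = ClasspathStopWords.class.getClassLoader()
                .getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Stop list not found on classpath: " + resourceName);
        }
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return new ClasspathStopWords(reader);
        }
    }

    boolean isStopWord(String word) {
        return words.contains(word.toLowerCase());
    }
}
```

With this shape, the stop list can ship inside the jar or sit in any classpath directory, so no absolute path has to be hard-coded.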
From: Matt C. <MCo...@my...> - 2003-11-15 19:51:53
|
I think I just figured out the primary difference between what POPFile is doing and what we are currently doing.

POPFile is keeping track of how many messages have been trained in each category as well as an overall message count. I believe it is this additional information that is allowing them to calculate additional probabilities for multi-category sorting.

Does this sound reasonable?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
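To illustrate how per-category message counts could feed a multi-category decision, here is a sketch of a standard multinomial naive Bayes with Laplace smoothing. It is not POPFile's actual code and is not confirmed to match it; `MultiCategoryBayes` and its methods are hypothetical names. The message counts supply the prior P(category) = messages in category / total messages.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of multi-category naive Bayes; not POPFile or Classifier4J code.
class MultiCategoryBayes {
    private final Map<String, Integer> messagesPerCategory = new HashMap<String, Integer>();
    private final Map<String, Map<String, Integer>> wordCounts =
            new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> wordsPerCategory = new HashMap<String, Integer>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int totalMessages = 0;

    // Records one trained message: bumps the message counts and the word counts.
    void train(String category, String[] words) {
        totalMessages++;
        messagesPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<String, Integer>());
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
            wordsPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // log P(category) + sum of log P(word | category), with add-one smoothing.
    private double logScore(String category, String[] words) {
        double score = Math.log((double) messagesPerCategory.get(category) / totalMessages);
        Map<String, Integer> counts = wordCounts.get(category);
        int total = wordsPerCategory.getOrDefault(category, 0);
        for (String word : words) {
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    // Returns the best-scoring category, or null if nothing has been trained.
    String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : messagesPerCategory.keySet()) {
            double score = logScore(category, words);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Scores are compared in log space so that long documents do not underflow to zero.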
From: Matt C. <MCo...@my...> - 2003-11-15 19:39:58
|
What's the word on numerical tokens in the word probability database? Do they stay or do they go? All I know is I've got a slew of them and I doubt they are serving much purpose.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
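The thread does not settle whether numeric tokens should stay. If the decision were to drop them, a filter along these lines could run after tokenizing; `NumericTokenFilter` is a hypothetical helper, and its definition of "numeric" (digits with optional "." or "," separators) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Classifier4J.
class NumericTokenFilter {
    // Digits, optionally grouped by '.' or ',' (for example 2003, 1,000, 3.14).
    private static final Pattern NUMERIC = Pattern.compile("\\d+(?:[.,]\\d+)*");

    static boolean isNumeric(String token) {
        return NUMERIC.matcher(token).matches();
    }

    // Keeps every token that is not purely numeric, preserving order.
    static List<String> dropNumeric(String[] tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Mixed tokens such as "w2" are kept, since they may still carry meaning for a category.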
From: Matt C. <MCo...@my...> - 2003-11-15 19:21:34
|
Attached, find BetaStopWordsProvider, which EXTENDS DefaultStopWordsProvider. I think I'm getting the hang of this. To use it, you need to do something like this in your code:

    ICategorisedWordsDataSource wds = null; // define wds how you like
    IStopWordProvider swp = new BetaStopWordsProvider();
    ITokenizer tok = new DefaultTokenizer();
    BayesianClassifier classifier = new BayesianClassifier(wds, tok, swp);

Everything is becoming clear to me now! One question remains in my mind: is it correct to say that our HTML stripper and stemmer will both have to work out of ITokenizer/DefaultTokenizer?

Place BetaStopWordsProvider.java in the same directory as your DefaultStopWordsProvider.java, make sure you have a stop list at c:/stoplist/english.stop, and you should be in business.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: "Classifier4J" <cla...@li...>
Date: Sat, 15 Nov 2003 12:37:03 -0600
Subject: [Classifier4j-devel] New Stop Words Provider

> Attached is an alternate stop words provider for classifier4J. I simply copied the whole of DefaultStopWordsProvider.java and renamed it to AlphaStopWordsProvider.java.
>
> I am pretty sure that this is not the correct way to do this since there is a comment about overriding the getStopWords method, but I'm not sure how to do this right now. I wanted to get this code out for review. Please advise.
>
> This reads the stop list from a file "c:/stoplist/english.stop". You will need to download the stop list or create your own. There is a link on the wiki site for the stop list that Nick found:
>
> http://www.ishmaelswiki.org/wiki/index.php/TextClassification
>
> There should be a single word on each line of your stop list file.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-15 18:34:59
|
Attached is an alternate stop words provider for classifier4J. I simply copied the whole of DefaultStopWordsProvider.java and renamed it to AlphaStopWordsProvider.java.

I am pretty sure that this is not the correct way to do this since there is a comment about overriding the getStopWords method, but I'm not sure how to do this right now. I wanted to get this code out for review. Please advise.

This reads the stop list from a file "c:/stoplist/english.stop". You will need to download the stop list or create your own. There is a link on the wiki site for the stop list that Nick found:

http://www.ishmaelswiki.org/wiki/index.php/TextClassification

There should be a single word on each line of your stop list file.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-15 17:04:33
|
I've established a spot on my wiki for keeping track of this kind of thing:

http://www.ishmaelswiki.org/wiki/index.php/TextClassification

Everyone, feel free to add whatever info you like to this page. Just hit the "Edit" button at the bottom of the page. I'll probably be putting a lot of information there myself for the documentation project.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: moedusa <mo...@in...>
To: cla...@li...
Date: Sat, 15 Nov 2003 21:41:40 +0500
Subject: Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review

> Matt Collier wrote:
> > See attached, you will need Xerces and NekoHTML in your classpath.
>
> Couldn't you provide a URL for such specific things as NekoHTML? It will save us a lot of time. |
From: moedusa <mo...@in...> - 2003-11-15 16:40:13
|
Matt Collier wrote:
> See attached, you will need Xerces and NekoHTML in your classpath.

Couldn't you provide a URL for such specific things as NekoHTML? It will save us a lot of time. |
From: Matt C. <MCo...@my...> - 2003-11-15 07:27:14
|
See attached; you will need Xerces and NekoHTML in your classpath. Run TestHTMLDOM and pass either a file name or an HTTP URL as an argument.

Although it took me a while to figure out how Xerces works, I think this is an excellent solution. Very flexible. As for implementation, you tell me. This particular code only leaves in the following items:

    content of meta tags
    alt text of images
    plain text

It's a cinch to configure alternative parameters.

The current output has carriage returns, line feeds and spaces a-plenty. Anybody have a good way of cleaning this mess up?

I'm thinking the thing to do would be to replace all the System.out.println calls with a call to some other method. Do we already have an appropriate method in place for this? Do we need a new one? How will this code integrate into c4J? How are we going to get this data into the stop-list-->stemmer?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
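For the whitespace clean-up and the System.out.println replacement asked about above, one possible shape is a small collector that normalises each extracted fragment before appending it. `TextCollector` is a hypothetical name, and how it would be wired into the HTML extraction code is an assumption.

```java
// Hypothetical helper; gathers extracted text fragments into one clean string.
class TextCollector {
    private final StringBuilder buffer = new StringBuilder();

    // Call this wherever the extraction code currently prints a fragment.
    void add(String fragment) {
        // Collapse runs of spaces, tabs, carriage returns and line feeds into one space.
        String cleaned = fragment.replaceAll("\\s+", " ").trim();
        if (cleaned.isEmpty()) {
            return;
        }
        if (buffer.length() > 0) {
            buffer.append(' ');
        }
        buffer.append(cleaned);
    }

    // The accumulated text, ready to hand to the tokenizer.
    String text() {
        return buffer.toString();
    }
}
```

The resulting string could then go through the tokenizer, stop list and stemmer like any other input.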
From: moedusa <mo...@in...> - 2003-11-14 05:39:16
|
Matt Collier wrote:
> Another little snowball stemming test. I suppose consistency is the key to the stemming process whatever the outcome.

I am afraid that any text should first be a) tokenised (markup, if there is any, and other symbols stripped to get the raw 'text' out), b) cleaned of stop words, and only after that stemmed... |
From: Matt C. <MCo...@my...> - 2003-11-14 05:29:06
|
Another little snowball stemming test. I suppose consistency is the key to the stemming process, whatever the outcome.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: cla...@li...
Date: Thu, 13 Nov 2003 23:25:19 -0600
Subject: RE: [Classifier4j-devel] Bayesian Case Study

> Attached are input and output files from the snowball stemmer. Clearly need to remove punctuation before stemming with this one. Does this look OK?
>
> Anybody know why these stemmers like using input strings and single character inputs? How do we quickly and easily send a string to this class?
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-14 05:23:42
|
Attached are input and output files from the snowball stemmer. Clearly need to remove punctuation before stemming with this one. Does this look OK?

Anybody know why these stemmers like using input strings and single character inputs? How do we quickly and easily send a string to this class?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: cla...@li...
Date: Thu, 13 Nov 2003 22:38:19 -0600
Subject: RE: [Classifier4j-devel] Bayesian Case Study

> Looking for java stemmers, I found these:
>
> Lovins Stemmer
> http://sourceforge.net/projects/stemmers/
>
> Snowball
> Source Code
> http://snowball.tartarus.org/snowball_java.tgz
> Home Page
> http://snowball.tartarus.org/
>
> I don't even know what this is:
> http://mailweb.udlap.mx/~hermes/javadoc/mx/udlap/ict/u_dl_a/irserver/qprocessors/EnglishStemmer.html
>
> This is evidently the OFFICIAL Porter stemmer
> http://www.tartarus.org/~martin/PorterStemmer/
>
> Lucene evidently uses snowball, as previously stated by Moedusa.
>
> One important piece of information I picked up from the vector-space information was to run stop-list BEFORE stemming.
>
> That's it for now, surely one of these will do the trick.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-14 04:36:23
|
Looking for java stemmers, I found these:

Lovins Stemmer
http://sourceforge.net/projects/stemmers/

Snowball
Source Code
http://snowball.tartarus.org/snowball_java.tgz
Home Page
http://snowball.tartarus.org/

I don't even know what this is:
http://mailweb.udlap.mx/~hermes/javadoc/mx/udlap/ict/u_dl_a/irserver/qprocessors/EnglishStemmer.html

This is evidently the OFFICIAL Porter stemmer
http://www.tartarus.org/~martin/PorterStemmer/

Lucene evidently uses snowball, as previously stated by Moedusa.

One important piece of information I picked up from the vector-space information was to run stop-list BEFORE stemming.

That's it for now, surely one of these will do the trick.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: Nick Lothian <nl...@es...>
To: "'cla...@li...'" <classifier4j- de...@li...>
Date: Fri, 14 Nov 2003 11:30:55 +1030
Subject: RE: [Classifier4j-devel] Bayesian Case Study

> > 3) the dreaded "s" a result no doubt of incorrectly tokenizing possessive nouns and pronouns, contractions etc. Anybody have a good algorithm for handling this?
>
> One way to handle it would be to run a Stemmer (search for "Porter Stemmer") on each word before classifying it. |
From: Nick L. <nl...@es...> - 2003-11-14 03:47:09
|
> > As a general point I'm not sure you are really going to find Bayesian classification a great match for deciding what kind of a document something is, simply because I don't think you can fairly compare the scores documents get in various categories and say if a score is higher in one than the other it is a better match.
> >
> > For instance, if you have two categories (say Tax and Investments), then you can't say that the word "Tax" in a document means that it is not about "Investments".
>
> If this is true, I would then ask you how and why POPFile is using a Bayesian algorithm to do exactly this? Have they deviated somehow from a true Bayesian calculation?

Hmm.. that is a fair point. I should really do some experimentation.

> The vector stuff sounds really cool too! Can you have that working by next week? :)

Yeah, if someone offers to pay :) |
From: Peter L. <pe...@le...> - 2003-11-14 03:21:32
|
On Thu, 13 Nov 2003 21:06:58 -0600, "Matt Collier" wrote:
> The vector stuff sounds really cool too! Can you have that working by next week? :)

Next week? How about by tomorrow? ;) |
From: Matt C. <MCo...@my...> - 2003-11-14 03:04:22
|
> As a general point I'm not sure you are really going to find Bayesian classification a great match for deciding what kind of a document something is, simply because I don't think you can fairly compare the scores documents get in various categories and say if a score is higher in one than the other it is a better match.
>
> For instance, if you have two categories (say Tax and Investments), then you can't say that the word "Tax" in a document means that it is not about "Investments".

If this is true, I would then ask you how and why POPFile is using a Bayesian algorithm to do exactly this? Have they deviated somehow from a true Bayesian calculation?

The vector stuff sounds really cool too! Can you have that working by next week? :)

Matt |
From: Nick L. <nl...@es...> - 2003-11-14 01:33:36
|
> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Friday, 14 November 2003 11:44 AM
> To: cla...@li...
> Subject: RE: [Classifier4j-devel] Bayesian Case Study
>
> Very nice. Should we keep these in a flat file? This would make a lot of sense in my opinion.

That makes sense to me.

> Do we want to modify the default tokenizer and stop list provider, or do we want to extend it?

Create a new implementation of the IStopWordProvider interface that reads from a resource. You might want to read a bit about Java interfaces if you haven't already.

> If we want to extend it, can you please shortcut me to doing this. I think I understand that we will create a class that "extends default tokenizer" etc, but how will this new class be used by the other classes and methods such as bayesian.classify? Surely we won't have to modify all this code, or perhaps we do. I don't know... which is why I'm asking... :)

Yes, it is a valid question. Fortunately, we thought of this when we coded it a while ago (pat myself on the back!). There is a constructor for BayesianClassifier that looks like:

    public BayesianClassifier(IWordsDataSource wd, ITokenizer tokenizer, IStopWordProvider swp)

which allows you to specify your own stop-word provider. As a general rule, most of Classifier4J is coded against interfaces to make this kind of change pretty easy. It means it is very flexible; it's just that we don't have many non-standard implementations... |
From: Nick L. <nl...@es...> - 2003-11-14 01:22:40
|
> 4) By the match_counts on these words, I can see that each occurrence of a word in a single document goes to the database. I don't see how this behavior is going to produce the desired result. At least in my case. I have run across several papers written about the effects of word frequency on text classification. Anybody have any experience in this area?

Are you saying that a document that contains the word "tax" twice adds it twice to the database? This is correct. Logically, a document that contains the same word multiple times is "more about" that word.

As a general point, I'm not sure you are really going to find Bayesian classification a great match for deciding what kind of document something is, simply because I don't think you can fairly compare the scores documents get in various categories and say that if a score is higher in one than in another, it is a better match.

For instance, if you have two categories (say Tax and Investments), then you can't say that the word "Tax" in a document means that it is not about "Investments".

However, most people use Bayesian classification for simple boolean Match/Not Match (e.g. Spam/Not Spam) matching. In that case there are certain words that you almost never want to see in matching records (e.g. that pill that starts with a V, which I won't name in order to avoid setting off everyone's spam filters).

Have you looked at Vector Space algorithms? See <http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt> and <http://www.perl.com/lpt/a/2003/02/19/engine.html>. I'd love to have enough time to implement one of these properly... |
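The difference between counting every occurrence (the behaviour confirmed above) and counting a word once per document can be made concrete with a small sketch. `MatchCounter` is a hypothetical helper, and the once-per-document variant is shown only for contrast; the thread does not propose adopting it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical helper contrasting two ways of turning a document into match counts.
class MatchCounter {
    // Every occurrence counts: "tax tax return" gives tax=2, return=1.
    static Map<String, Integer> countOccurrences(String[] words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Each distinct word counts once per document: "tax tax return" gives tax=1, return=1.
    static Map<String, Integer> countPresence(String[] words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : new HashSet<String>(java.util.Arrays.asList(words))) {
            counts.put(word, 1);
        }
        return counts;
    }
}
```

Which variant suits a given corpus is an empirical question; repeated words weigh more heavily under the first.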
From: Matt C. <MCo...@my...> - 2003-11-14 01:11:43
|
Very nice. Should we keep these in a flat file? This would make a lot of sense in my opinion.

Do we want to modify the default tokenizer and stop list provider, or do we want to extend it?

If we want to extend it, can you please shortcut me to doing this. I think I understand that we will create a class that "extends default tokenizer" etc, but how will this new class be used by the other classes and methods such as bayesian.classify? Surely we won't have to modify all this code, or perhaps we do. I don't know... which is why I'm asking... :)

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: Nick Lothian <nl...@es...>
To: "'cla...@li...'" <classifier4j- de...@li...>
Date: Fri, 14 Nov 2003 11:29:55 +1030
Subject: RE: [Classifier4j-devel] Bayesian Case Study

> > 2) "we" see several occurrences of useless pronouns in this list. This can be addressed by an improved "stop list". There is evidently an excellent paper written on the topic of stop lists, aptly named "A stop list for general text" by Christopher Fox, published in ACM SIGIR Forum Volume 24 Issue 2 1989 ISSN:0163-5840. If anyone has access to this paper, please advise.
>
> Here's a list of stop words I've been saving to add into classifier4J sometime (from <ftp://ftp.cs.cornell.edu/pub/smart/>).
> > a > a's > able > about > above > according > accordingly > across > actually > after > afterwards > again > against > ain't > all > allow > allows > almost > alone > along > already > also > although > always > am > among > amongst > an > and > another > any > anybody > anyhow > anyone > anything > anyway > anyways > anywhere > apart > appear > appreciate > appropriate > are > aren't > around > as > aside > ask > asking > associated > at > available > away > awfully > b > be > became > because > become > becomes > becoming > been > before > beforehand > behind > being > believe > below > beside > besides > best > better > between > beyond > both > brief > but > by > c > c'mon > c's > came > can > can't > cannot > cant > cause > causes > certain > certainly > changes > clearly > co > com > come > comes > concerning > consequently > consider > considering > contain > containing > contains > corresponding > could > couldn't > course > currently > d > definitely > described > despite > did > didn't > different > do > does > doesn't > doing > don't > done > down > downwards > during > e > each > edu > eg > eight > either > else > elsewhere > enough > entirely > especially > et > etc > even > ever > every > everybody > everyone > everything > everywhere > ex > exactly > example > except > f > far > few > fifth > first > five > followed > following > follows > for > former > formerly > forth > four > from > further > furthermore > g > get > gets > getting > given > gives > go > goes > going > gone > got > gotten > greetings > h > had > hadn't > happens > hardly > has > hasn't > have > haven't > having > he > he's > hello > help > hence > her > here > here's > hereafter > hereby > herein > hereupon > hers > herself > hi > him > himself > his > hither > hopefully > how > howbeit > however > i > i'd > i'll > i'm > i've > ie > if > ignored > immediate > in > inasmuch > inc > indeed > indicate > indicated > indicates > inner > insofar > instead > into > inward > is > isn't 
> it > it'd > it'll > it's > its > itself > j > just > k > keep > keeps > kept > know > knows > known > l > last > lately > later > latter > latterly > least > less > lest > let > let's > like > liked > likely > little > look > looking > looks > ltd > m > mainly > many > may > maybe > me > mean > meanwhile > merely > might > more > moreover > most > mostly > much > must > my > myself > n > name > namely > nd > near > nearly > necessary > need > needs > neither > never > nevertheless > new > next > nine > no > nobody > non > none > noone > nor > normally > not > nothing > novel > now > nowhere > o > obviously > of > off > often > oh > ok > okay > old > on > once > one > ones > only > onto > or > other > others > otherwise > ought > our > ours > ourselves > out > outside > over > overall > own > p > particular > particularly > per > perhaps > placed > please > plus > possible > presumably > probably > provides > q > que > quite > qv > r > rather > rd > re > really > reasonably > regarding > regardless > regards > relatively > respectively > right > s > said > same > saw > say > saying > says > second > secondly > see > seeing > seem > seemed > seeming > seems > seen > self > selves > sensible > sent > serious > seriously > seven > several > shall > she > should > shouldn't > since > six > so > some > somebody > somehow > someone > something > sometime > sometimes > somewhat > somewhere > soon > sorry > specified > specify > specifying > still > sub > such > sup > sure > t > t's > take > taken > tell > tends > th > than > thank > thanks > thanx > that > that's > thats > the > their > theirs > them > themselves > then > thence > there > there's > thereafter > thereby > therefore > therein > theres > thereupon > these > they > they'd > they'll > they're > they've > think > third > this > thorough > thoroughly > those > though > three > through > throughout > thru > thus > to > together > too > took > toward > towards > tried > tries > truly > try > trying > twice > two 
> u > un > under > unfortunately > unless > unlikely > until > unto > up > upon > us > use > used > useful > uses > using > usually > uucp > v > value > various > very > via > viz > vs > w > want > wants > was > wasn't > way > we > we'd > we'll > we're > we've > welcome > well > went > were > weren't > what > what's > whatever > when > whence > whenever > where > where's > whereafter > whereas > whereby > wherein > whereupon > wherever > whether > which > while > whither > who > who's > whoever > whole > whom > whose > why > will > willing > wish > with > within > without > won't > wonder > would > would > wouldn't > x > y > yes > yet > you > you'd > you'll > you're > you've > your > yours > yourself > yourselves > z > zero |
From: moedusa <mo...@in...> - 2003-11-14 01:10:54
|
Concerning tokenization: is it possible to reuse the tokenization API and code from Lucene (http://jakarta.apache.org/lucene)? It has HTML tokenizers as well as stemmers, with English, French, German, Russian and Chinese implementations based on the Snowball (http://snowball.tartarus.org/) algorithm... If I am not mistaken, though, the tokenization is based on JavaCC, while the stemming is not... |
From: Nick L. <nl...@es...> - 2003-11-14 01:02:10
|
> > 3) the dreaded "s", a result no doubt of incorrectly > tokenizing possessive nouns > and pronouns, contractions etc. Anybody have a good > algorithm for handling > this? > One way to handle it would be to run a stemmer (search for "Porter Stemmer") on each word before classifying it. |
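A minimal sketch of one way to avoid the stray "s" token, assuming the tokenizer can be changed to keep apostrophes inside words and then drop a trailing possessive marker. This is a simple stand-in, not the Porter stemmer and not part of Classifier4J; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Classifier4J.
public class PossessiveNormalizer {

    // Lower-cases the word and removes a trailing "'s" or a bare trailing "'".
    // Note that contractions such as "it's" are also reduced to "it".
    static String stripPossessive(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("'s")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("'")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Splits on runs of characters that are not letters, digits or apostrophes,
    // so "Nick's" stays one token instead of becoming "Nick" and "s".
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            String w = stripPossessive(raw);
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }
        return tokens;
    }
}
```

A real stemmer would go further and also conflate plurals and verb forms; this sketch only addresses the possessive case raised above.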
From: Nick L. <nl...@es...> - 2003-11-14 01:01:12
|
> > 2) "we" see several occurrences of useless pronouns in this > list. This can be > addressed by an improved "stop list". There is evidently an > excellent paper > written on the topic of stop lists aptly named "A stop list for > general text" by > Christopher Fox published in ACM SIGIR Forum Volume 24 Issue 2 > 1989 ISSN:0163-5840. If anyone has access to this paper, please advise. > Here's a list of stop words I've been saving to add into Classifier4J sometime (from <ftp://ftp.cs.cornell.edu/pub/smart/>). a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently d definitely described despite did didn't different do does doesn't doing don't done down downwards during e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except f far few fifth first five followed following follows for former formerly forth four from further furthermore g get gets getting given gives go goes going gone got gotten greetings h had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated
indicates inner insofar instead into inward is isn't it it'd it'll it's its itself j just k keep keeps kept know knows known l last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere o obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own p particular particularly per perhaps placed please plus possible presumably probably provides q que quite qv r rather rd re really reasonably regarding regardless regards relatively respectively right s said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two u un under unfortunately unless unlikely until unto up upon us use used useful uses using usually uucp v value various very via viz vs w want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without 
won't wonder would would wouldn't x y yes yet you you'd you'll you're you've your yours yourself yourselves z zero |
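A list like the one above could be loaded into a set and applied to tokens before classification. The following is a rough sketch under that assumption; `StopListFilter` is a hypothetical class and is not Classifier4J's own stop list API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Classifier4J.
public class StopListFilter {

    private final Set<String> stopWords;

    // Accepts the stop words as one whitespace-separated string, as posted above.
    // Duplicates in the input (such as the repeated "would") collapse in the set.
    public StopListFilter(String whitespaceSeparatedWords) {
        String normalized = whitespaceSeparatedWords.trim().toLowerCase(Locale.ROOT);
        stopWords = new HashSet<>(Arrays.asList(normalized.split("\\s+")));
    }

    // Case-insensitive membership test.
    public boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Returns the tokens that are not stop words, preserving their order.
    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isStopWord(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Because the list contains contractions such as "we've", the tokenizer would need to keep apostrophes inside words for those entries to match.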