Thread: RE: [Classifier4j-devel] Bayesian Case Study
From: Nick L. <nl...@es...> - 2003-11-14 01:01:12
> 2) "we" see several occurrences of useless pronouns in this list. This can be
> addressed by an improved "stop list". There is evidently an excellent paper
> written on the topic of stop lists, aptly named "A stop list for general text"
> by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2, 1989,
> ISSN 0163-5840. If anyone has access to this paper, please advise.

Here's a list of stop words I've been saving to add into Classifier4J sometime
(from <ftp://ftp.cs.cornell.edu/pub/smart/>):

a a's able about above according accordingly across actually after afterwards
again against ain't all allow allows almost alone along already also although
always am among amongst an and another any anybody anyhow anyone anything
anyway anyways anywhere apart appear appreciate appropriate are aren't around
as aside ask asking associated at available away awfully b be became because
become becomes becoming been before beforehand behind being believe below
beside besides best better between beyond both brief but by c c'mon c's came
can can't cannot cant cause causes certain certainly changes clearly co com
come comes concerning consequently consider considering contain containing
contains corresponding could couldn't course currently d definitely described
despite did didn't different do does doesn't doing don't done down downwards
during e each edu eg eight either else elsewhere enough entirely especially et
etc even ever every everybody everyone everything everywhere ex exactly example
except f far few fifth first five followed following follows for former
formerly forth four from further furthermore g get gets getting given gives go
goes going gone got gotten greetings h had hadn't happens hardly has hasn't
have haven't having he he's hello help hence her here here's hereafter hereby
herein hereupon hers herself hi him himself his hither hopefully how howbeit
however i i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed
indicate indicated indicates inner insofar instead into inward is isn't it it'd
it'll it's its itself j just k keep keeps kept know knows known l last lately
later latter latterly least less lest let let's like liked likely little look
looking looks ltd m mainly many may maybe me mean meanwhile merely might more
moreover most mostly much must my myself n name namely nd near nearly necessary
need needs neither never nevertheless new next nine no nobody non none noone
nor normally not nothing novel now nowhere o obviously of off often oh ok okay
old on once one ones only onto or other others otherwise ought our ours
ourselves out outside over overall own p particular particularly per perhaps
placed please plus possible presumably probably provides q que quite qv r
rather rd re really reasonably regarding regardless regards relatively
respectively right s said same saw say saying says second secondly see seeing
seem seemed seeming seems seen self selves sensible sent serious seriously
seven several shall she should shouldn't since six so some somebody somehow
someone something sometime sometimes somewhat somewhere soon sorry specified
specify specifying still sub such sup sure t t's take taken tell tends th than
thank thanks thanx that that's thats the their theirs them themselves then
thence there there's thereafter thereby therefore therein theres thereupon
these they they'd they'll they're they've think third this thorough thoroughly
those though three through throughout thru thus to together too took toward
towards tried tries truly try trying twice two u un under unfortunately unless
unlikely until unto up upon us use used useful uses using usually uucp v value
various very via viz vs w want wants was wasn't way we we'd we'll we're we've
welcome well went were weren't what what's whatever when whence whenever where
where's whereafter whereas whereby wherein whereupon wherever whether which
while whither who who's whoever whole whom whose why will willing wish with
within without won't wonder would wouldn't x y yes yet you you'd you'll you're
you've your yours yourself yourselves z zero
From: Nick L. <nl...@es...> - 2003-11-14 01:02:10
> 3) the dreaded "s", a result no doubt of incorrectly tokenizing possessive
> nouns and pronouns, contractions, etc. Anybody have a good algorithm for
> handling this?

One way to handle it would be to run a stemmer (search for "Porter Stemmer") on each word before classifying it.
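Nick's suggestion can be sketched as follows. Note the `stem()` below is a crude stand-in that only strips a possessive "'s" or a trailing plural "s"; it is not the Porter algorithm, so a real Porter or Snowball stemmer should be swapped in for actual use:

```java
// Sketch: where a stemmer slots into the classification pipeline.
// stem() is a crude stand-in (strips only "'s" and a trailing plural
// "s"), NOT the real Porter algorithm.
public class StemSketch {

    // Crude suffix stripper, for illustration only.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("'s")) {
            w = w.substring(0, w.length() - 2);
        } else if (w.endsWith("s") && w.length() > 3 && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Stem every token before it reaches the classifier.
    public static String[] stemAll(String[] tokens) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = stem(tokens[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : stemAll(new String[] {"investor's", "pills"})) {
            System.out.println(s);
        }
    }
}
```

With this in place, "investor's" and "investors" both collapse to "investor", so the classifier counts them as the same word.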
From: Matt C. <MCo...@my...> - 2003-11-14 04:36:23
Looking for Java stemmers, I found these:

Lovins Stemmer
http://sourceforge.net/projects/stemmers/

Snowball
Source Code: http://snowball.tartarus.org/snowball_java.tgz
Home Page: http://snowball.tartarus.org/

I don't even know what this is:
http://mailweb.udlap.mx/~hermes/javadoc/mx/udlap/ict/u_dl_a/irserver/qprocessors/EnglishStemmer.html

This is evidently the OFFICIAL Porter stemmer:
http://www.tartarus.org/~martin/PorterStemmer/

Lucene evidently uses Snowball, as previously stated by Moedusa.

One important piece of information I picked up from the vector-space information was to run the stop list BEFORE stemming.

That's it for now; surely one of these will do the trick.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: Matt C. <MCo...@my...> - 2003-11-14 05:23:42
Attachments:
stemmed.txt
stemtest.txt
Attached are input and output files from the Snowball stemmer. Clearly we need to remove punctuation before stemming with this one. Does this look OK?

Anybody know why these stemmers like using input strings and single-character inputs? How do we quickly and easily send a string to this class?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
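One answer to the "how do we send a string" question is to wrap the String in a `java.io.StringReader` and drive the char-oriented API from that. The `CharConsumingStemmer` interface below is a hypothetical stand-in, loosely modelled on the Porter demo's char-at-a-time style; the real class you download will have its own method names:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch: adapting a String to a stemmer that consumes characters one
// at a time. CharConsumingStemmer is a hypothetical stand-in; check
// the actual stemmer class for its real API.
public class StringFeedSketch {

    // Assumed char-at-a-time stemmer API.
    interface CharConsumingStemmer {
        void add(char ch);
        String result();
    }

    // Feed an entire String through a Reader, one character at a time.
    public static String feed(String input, CharConsumingStemmer stemmer)
            throws IOException {
        Reader reader = new StringReader(input);
        int ch;
        while ((ch = reader.read()) != -1) {
            stemmer.add((char) ch);
        }
        return stemmer.result();
    }

    // Demo: a stand-in "stemmer" that just collects what it is given,
    // proving the whole string arrives intact.
    public static String collect(String input) throws IOException {
        final StringBuilder sb = new StringBuilder();
        return feed(input, new CharConsumingStemmer() {
            public void add(char ch) { sb.append(ch); }
            public String result() { return sb.toString(); }
        });
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect("running"));
    }
}
```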
From: Matt C. <MCo...@my...> - 2003-11-14 05:29:06
Attachments:
stemmed.txt
stemtest.txt
Another little Snowball stemming test. I suppose consistency is the key to the stemming process, whatever the outcome.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
From: moedusa <mo...@in...> - 2003-11-14 05:39:16
Matt Collier wrote:
> Another little snowball stemming test. I suppose consistency is the key to
> the stemming process whatever the outcome.

I am afraid that any text should first be:

a) tokenised (strip markup, if it exists, or any other symbols, and get raw text out)
b) cleaned of stop words

and only after that stemmed...
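The ordering moedusa describes (tokenise, then stop list, then stem) might look like this. The stop set and the trailing-"s" stemmer are toy stand-ins for illustration, not Classifier4J code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the ordering described above:
//   a) tokenise (strip symbols, keep raw text)
//   b) drop stop words
//   c) stem what is left
public class PipelineSketch {

    // Toy stop set; a real one would come from the SMART list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "of", "and"));

    // a) lowercase and split on anything that is not a letter/apostrophe
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // b) stop-word filter
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // c) toy stemmer: strip a trailing plural "s"
    public static List<String> stemAll(List<String> tokens) {
        List<String> stemmed = new ArrayList<>();
        for (String t : tokens) {
            if (t.endsWith("s") && t.length() > 3 && !t.endsWith("ss")) {
                t = t.substring(0, t.length() - 1);
            }
            stemmed.add(t);
        }
        return stemmed;
    }

    public static List<String> process(String text) {
        return stemAll(removeStopWords(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(process("The pills and the filters"));
    }
}
```

Running the stop list before stemming (as Matt noted from the vector-space material) matters because the stop list holds surface forms like "does" and "doesn't", which a stemmer would mangle before they could be matched.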
From: Nick L. <nl...@es...> - 2003-11-14 01:22:40
> 4) By the match_counts on these words, I can see that each occurrence of a
> word in a single document goes to the database. I don't see how this
> behavior is going to produce the desired result, at least in my case. I
> have run across several papers written about the effects of word frequency
> on text classification. Anybody have any experience in this area?

Are you saying that a document that contains the word "tax" twice adds it twice to the database? This is correct. Logically, a document that contains the same word multiple times is "more about" that word.

As a general point, I'm not sure you are really going to find Bayesian classification a great match for deciding what kind of document something is, simply because I don't think you can fairly compare the scores documents get in various categories and say that if a score is higher in one than the other it is a better match. For instance, if you have two categories (say Tax and Investments), then you can't say that the word "Tax" in a document means that it is not about "Investments".

However, most people use Bayesian classification for simple boolean Match/Not Match (e.g. Spam/Not Spam) matching. In that case there are certain words that you almost never want to see in matching records (e.g. that pill that starts with a V, which I won't name in order to avoid setting off everyone's spam filters).

Have you looked at Vector Space algorithms? <http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt> and <http://www.perl.com/lpt/a/2003/02/19/engine.html>. I'd love to have enough time to implement one of these properly....
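For anyone curious what the vector-space approach looks like at its core, here is a minimal term-frequency cosine-similarity sketch. It omits the stop list, stemming, and tf-idf weighting, and is not taken from either of the linked implementations:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal vector-space sketch: each document becomes a term-frequency
// vector, and two documents are compared by the cosine of the angle
// between their vectors (1.0 = identical direction, 0.0 = no overlap).
public class VectorSpaceSketch {

    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z']+")) {
            if (t.isEmpty()) continue;
            tf.merge(t, 1, Integer::sum);  // count repeated words
        }
        return tf;
    }

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            int x = a.getOrDefault(term, 0);
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> tax = termFrequencies("tax return tax office");
        Map<String, Integer> invest = termFrequencies("investment fund office");
        System.out.println(cosine(tax, invest));
    }
}
```

Because the scores are all cosines in the same space, they can be compared across categories directly, which is exactly the property Bayesian per-category scores lack in the argument above.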
From: Matt C. <MCo...@my...> - 2003-11-14 03:04:22
> As a general point I'm not sure you are really going to find Bayesian
> classification a great match for deciding what kind of a document something
> is, simply because I don't think you can fairly compare the scores documents
> get in various categories and say if a score is higher in one than the other
> it is a better match.
>
> For instance, if you have two categories (say Tax and Investments), then you
> can't say that the word "Tax" in a document means that it is not about
> "Investments".

If this is true, I would then ask you how and why POPFile is using a Bayesian algorithm to do exactly this? Have they deviated somehow from a true Bayesian calculation?

The vector stuff sounds really cool too! Can you have that working by next week? :)

Matt
From: Nick L. <nl...@es...> - 2003-11-14 01:33:36
> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Friday, 14 November 2003 11:44 AM
> To: cla...@li...
> Subject: RE: [Classifier4j-devel] Bayesian Case Study
>
> Very nice. Should we keep these in a flat file? This would make a lot of
> sense in my opinion.

That makes sense to me.

> Do we want to modify the default tokenizer and stop list provider, or do we
> want to extend it?

Create a new implementation of the IStopWordProvider interface that reads from a resource. You might want to read a bit about Java interfaces if you haven't already.

> If we want to extend it, can you please shortcut me to doing this. I think I
> understand that we will create a class that "extends default tokenizer" etc,
> but how will this new class be used by the other classes and methods such as
> bayesian.classify? Surely we won't have to modify all this code, or perhaps
> we do. I don't know... which is why I'm asking... :)

Yes, it is a valid question. Fortunately, we thought of this when we coded it a while ago (pat myself on the back!). There is a constructor for BayesianClassifier that looks like:

public BayesianClassifier(IWordsDataSource wd, ITokenizer tokenizer, IStopWordProvider swp)

which allows you to specify your own stop-word provider. As a general rule, most of Classifier4J is coded against interfaces, which makes this kind of change pretty easy. It means it is very flexible; it's just that we don't have many non-standard implementations....
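To make the flat-file idea concrete, here is a sketch of a stop-word provider that loads one word per line from a Reader (in real code, a classpath resource stream). The thread never shows the interface's methods, so the single `isStopWord(String)` method below is an assumption about IStopWordProvider's shape; check the actual Classifier4J source before implementing it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Sketch of a resource-backed stop-word provider. The interface below
// is an ASSUMED shape for Classifier4J's IStopWordProvider; the real
// interface may differ.
public class FileStopWordSketch {

    // Assumed shape of IStopWordProvider.
    interface IStopWordProvider {
        boolean isStopWord(String word);
    }

    // Reads one stop word per line, case-insensitively.
    static class ReaderStopWordProvider implements IStopWordProvider {
        private final Set<String> stopWords = new HashSet<>();

        ReaderStopWordProvider(Reader source) throws IOException {
            BufferedReader in = new BufferedReader(source);
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim().toLowerCase();
                if (!line.isEmpty()) stopWords.add(line);
            }
        }

        public boolean isStopWord(String word) {
            return stopWords.contains(word.toLowerCase());
        }
    }

    // In real code the Reader would wrap
    // getClass().getResourceAsStream("/stopwords.txt").
    public static IStopWordProvider fromText(String text) throws IOException {
        return new ReaderStopWordProvider(new StringReader(text));
    }

    public static void main(String[] args) throws IOException {
        IStopWordProvider swp = fromText("a\nthe\nof\n");
        System.out.println(swp.isStopWord("The"));
        System.out.println(swp.isStopWord("tax"));
    }
}
```

Such a provider would then be passed to the three-argument BayesianClassifier constructor quoted above, with no changes to the rest of the code.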
From: Peter L. <pe...@le...> - 2003-11-14 03:21:32
On Thu, 13 Nov 2003 21:06:58 -0600, "Matt Collier" wrote:
> The vector stuff sounds really cool too! Can you have that working by next
> week? :)

Next week? How about by tomorrow? ;)
From: Nick L. <nl...@es...> - 2003-11-14 03:47:09
> > As a general point I'm not sure you are really going to find Bayesian
> > classification a great match for deciding what kind of a document something
> > is, simply because I don't think you can fairly compare the scores
> > documents get in various categories and say if a score is higher in one
> > than the other it is a better match.
> >
> > For instance, if you have two categories (say Tax and Investments), then
> > you can't say that the word "Tax" in a document means that it is not about
> > "Investments".
>
> If this is true, I would then ask you how and why POPFile is using a
> Bayesian algorithm to do exactly this? Have they deviated somehow from a
> true Bayesian calculation?

Hmm.. that is a fair point. I should really do some experimentation.

> The vector stuff sounds really cool too! Can you have that working by next
> week? :)

Yeah, if someone offers to pay :)
From: Matt C. <MCo...@my...> - 2003-11-14 01:11:43
Very nice. Should we keep these in a flat file? This would make a lot of sense in my opinion.

Do we want to modify the default tokenizer and stop list provider, or do we want to extend it?

If we want to extend it, can you please shortcut me to doing this. I think I understand that we will create a class that "extends default tokenizer" etc, but how will this new class be used by the other classes and methods such as bayesian.classify? Surely we won't have to modify all this code, or perhaps we do. I don't know... which is why I'm asking... :)

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN