classifier4j-devel Mailing List for Classifier4J (Page 8)
Status: Beta
Brought to you by:
nicklothian
From: Brent L J. <br...@bj...> - 2004-01-01 04:21:16
|
I'm sure this is a fairly common question, but I didn't see it in the mailing list archives.

I'm using MySQL, and when I first create a JDBCWordsDataSource it creates a table 'word_probability'. But when I run it after that I get the following exception:

net.sf.classifier4J.bayesian.WordsDataSourceException: Problem creating table
    at net.sf.classifier4J.bayesian.JDBCWordsDataSource.createTable(JDBCWordsDataSource.java:252)
    at net.sf.classifier4J.bayesian.JDBCWordsDataSource.<init>(JDBCWordsDataSource.java:100)
    at com.li.sentinel.agent.classify.Classifier.main(Classifier.java:32)
Caused by: java.sql.SQLException: General error, message from server: "Table 'word_probability' already exists"
...

Even though the table exists, it looks like the following line of code isn't finding it. In JDBCWordsDataSource.java(233):

    ResultSet rs = dbm.getTables(null, null, "WORD_PROBABILITY", null);

Apparently I get an empty result set. This is when using 0.5. Any ideas?

Thanks,
- Brent |
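One plausible cause, offered as an assumption and not as a diagnosis of the Classifier4J source: depending on the driver and server configuration, the table-name pattern passed to `DatabaseMetaData.getTables` may be matched case-sensitively, so "WORD_PROBABILITY" would not find a table stored as 'word_probability'. A minimal sketch of a check that sidesteps this by listing the tables and comparing names without regard to case follows. `TableCheck`, `containsIgnoreCase` and `tableExists` are hypothetical names and are not part of Classifier4J.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class TableCheck {
    /** Pure helper: true if any listed name equals the wanted name, ignoring case. */
    static boolean containsIgnoreCase(List<String> names, String wanted) {
        for (String name : names) {
            if (name != null && name.equalsIgnoreCase(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Lists every table visible to the connection and compares names case-insensitively. */
    static boolean tableExists(Connection con, String table) throws SQLException {
        DatabaseMetaData dbm = con.getMetaData();
        List<String> names = new ArrayList<>();
        // "%" matches any table name; the comparison is done in Java instead.
        try (ResultSet rs = dbm.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (rs.next()) {
                names.add(rs.getString("TABLE_NAME"));
            }
        }
        return containsIgnoreCase(names, table);
    }
}
```

With a check like this, `tableExists(con, "WORD_PROBABILITY")` would report the table as present whether the server stored the name in upper or lower case, at the cost of listing all tables once.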
From: Brent L J. <br...@bj...> - 2003-12-30 19:30:57
|
> What type of datasource will you be using?
>
> If you need to classify based on multiple categories, you will need to use
> a "JDBC Datasource" such as mySQL to contain your corpus which is what I have
> experience with. If you only need a single category you can use the inbuilt
> datasource which I have no experience with.

At first a single category (spam), but I'm planning on allowing the user to do more classification for email filtering and such. I do plan on using a JDBC datasource to contain the "corpus" (I assume that's where the classification data is stored).

Thanks,
- Brent |
From: Nick L. <nl...@es...> - 2003-12-30 05:36:32
|
> In the case of using any JDBC Datasource I found that it was critical to
> implement connection pooling and I implemented this in my own haphazard way in
> my code. I'm not sure if connection pooling has made its way into the CVS
> build yet or not.

The Datasource connection manager is included in C4J 0.5. This will allow you to use JDBC datasources which implement connection pooling (e.g. the one in Tomcat).

<http://classifier4j.sourceforge.net/apidocs/net/sf/classifier4J/bayesian/DataSourceJDBCConnectionManager.html> |
From: Nick L. <nl...@es...> - 2003-12-30 05:35:28
|
Yes it is for 0.5. I'm not sure why it is still saying it is 0.4.

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Monday, 29 December 2003 2:35 PM
> To: Classifier4J
> Subject: [Classifier4j-devel] JavaDoc Versioning
> Importance: Low
>
> http://classifier4j.sourceforge.net/apidocs/index.html
>
> The JavaDoc is reporting it is for version 0.4 but it appears to me that it
> has been updated for version 0.5?
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-12-30 04:05:04
|
http://classifier4j.sourceforge.net/apidocs/index.html The JavaDoc is reporting it is for version 0.4 but it appears to me that it has been updated for version 0.5? Matt Collier RemoteIT mco...@my... 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-12-30 03:58:35
|
Hi Brent,

If you check the archives for messages from me, you will find a number of messages that include some sample code. If you can't find anything that suits you, I would be happy to resubmit a sample.

What type of datasource will you be using?

If you need to classify based on multiple categories, you will need to use a "JDBC Datasource" such as mySQL to contain your corpus, which is what I have experience with. If you only need a single category you can use the inbuilt datasource, which I have no experience with.

In the case of using any JDBC Datasource I found that it was critical to implement connection pooling, and I implemented this in my own haphazard way in my code. I'm not sure if connection pooling has made its way into the CVS build yet or not.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Brent L Johnson" <br...@bj...>
To: <cla...@li...>
Date: Mon, 29 Dec 2003 19:30:28 -0500
Subject: [Classifier4j-devel] Samples?

> I'm sure this is a huge newbie question but is there
> any sample code out there that uses Classifier4J? In
> particular, the Bayesian classifier?
>
> I'm working on a project for doing server-side email
> classification (i.e. by subject or "Spam") and I'm
> interested in using Classifier4J to do this
> (and also the summarizer for viewing summarized emails).
>
> Thanks,
>
> - Brent |
From: Brent L J. <br...@bj...> - 2003-12-30 00:30:44
|
I'm sure this is a huge newbie question but is there any sample code out there that uses Classifier4J? In particular, the Bayesian classifier? I'm working on a project for doing server-side email classification (i.e. by subject or "Spam") and I'm interested in using Classifier4J to do this (and also the summarizer for viewing summarized emails). Thanks, - Brent |
From: Nick L. <ni...@ma...> - 2003-12-18 21:39:48
|
What database are you running against?

----- Original Message -----
From: "ASARI Takashi" <as...@so...>
To: <cla...@li...>
Sent: Friday, December 19, 2003 1:01 AM
Subject: [Classifier4j-devel] bugfix on JDBCWordsDataSource.java

> Hello, I'm very interested in Classifier4J,
> and now using a CVS version of it.
>
> I found a very simple bug in the code, so I am sending you a diff.
> Please have a look.
>
> --
> ASARI Takashi |
From: ASARI T. <as...@so...> - 2003-12-18 14:34:02
|
Hello, I'm very interested in Classifier4J, and am now using a CVS version of it.

I found a very simple bug in the code, so I am sending you a diff. Please have a look.

--
ASARI Takashi |
From: Nick L. <ni...@ma...> - 2003-12-17 11:58:11
|
I've just released Classifier4J version 0.5. Some of the new features include:

* JDBCWordsDataSource now properly stores the connection info set in the constructor (bug)
* New DataSourceJDBCConnectionManager
* New SimpleHTMLTokenizer
* New CustomizableStopWordProvider
* JDBCWordsDataSource now truncates any words longer than 255 characters
* SimpleWordsDataSource is now Serializable
* Removal of dependency on commons-lang

See http://classifier4j.sourceforge.net/ and http://sourceforge.net/projects/classifier4j

Nick |
From: Nick L. <nl...@es...> - 2003-11-20 23:13:33
|
Jon Udell on using Bayesian classification to categorise items:

<http://weblog.infoworld.com/udell/2003/11/20.html#a851>
<http://www.xml.com/pub/a/2003/11/19/udell.html>

Nick |
From: Nick L. <nl...@es...> - 2003-11-19 01:03:07
|
Nice pickup. I guess we should truncate the words to 255 chars at the start of the method.

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Wednesday, 19 November 2003 11:27 AM
> To: Classifier4J
> Subject: [Classifier4j-devel] JDBCWordsDataSource.updateWordProbability fails
>
> JDBCWordsDataSource.updateWordProbability fails if word.length() > 255.
>
> The size of word in the database is varchar(255).
>
> "SELECT 1 FROM word_probability WHERE word = ? AND category = ?"
>
> Correctly returns no records, because a string containing 255 "A"s does not
> equal a string containing 256 "A"s.
>
> The method then proceeds to insert a "new" value, word = 256 "A"s, which is
> of course truncated to 255 characters. A duplicate word value key violation
> occurs.
>
> I have temporarily corrected this issue thusly:
>
> if ( !rs.next())
>
> changes to:
>
> if ( !rs.next() && word.length() <= 255 )
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
From: Matt C. <MCo...@my...> - 2003-11-19 00:54:34
|
JDBCWordsDataSource.updateWordProbability fails if word.length() > 255.

The size of word in the database is varchar(255).

    "SELECT 1 FROM word_probability WHERE word = ? AND category = ?"

correctly returns no records, because a string containing 255 "A"s does not equal a string containing 256 "A"s.

The method then proceeds to insert a "new" value, word = 256 "A"s, which is of course truncated to 255 characters. A duplicate word value key violation occurs.

I have temporarily corrected this issue thusly:

    if ( !rs.next())

changes to:

    if ( !rs.next() && word.length() <= 255 )

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
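The failure described in this message comes from the lookup and the insert seeing different keys: the SELECT compares the full 256-character word, while the stored column holds at most 255 characters. A minimal sketch of the truncate-first approach suggested in the reply to this message follows. `WordKey` and `truncate` are hypothetical names, and the 255 limit is taken from the varchar(255) column mentioned above.

```java
final class WordKey {
    /** Column width of word_probability.word as described in the message. */
    static final int MAX_WORD_LENGTH = 255;

    /** Returns the word cut to the column width, so lookup and insert use the same key. */
    static String truncate(String word) {
        if (word == null || word.length() <= MAX_WORD_LENGTH) {
            return word;
        }
        return word.substring(0, MAX_WORD_LENGTH);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 256; i++) {
            sb.append('A');
        }
        // The 256-character word is reduced to the 255 characters the column can hold.
        System.out.println(truncate(sb.toString()).length()); // prints 255
    }
}
```

Applying the truncation once, before both the SELECT and the INSERT, means the 255-character and 256-character strings of "A"s map to the same row instead of colliding on insert.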
From: Nick L. <nl...@es...> - 2003-11-18 22:20:00
|
I had a think about this last night. I'll probably do moedusa's idea about automatic resource loading, with a default filename ("defaultstopwords.txt"?), and then have a constructor that takes a name, so you can specify a non-default list. Does that sound reasonable?

I intend to start committing the changes/additions we discussed over the last week tonight, starting with a simple html parser, a stop word list provider, the datasource (pooling) word-datasource and hopefully getting to a stemmer.

It'll take a while to show up on the anonymous CVS, though (up to a couple of days sometimes), so if anyone would like faster access give me your sourceforge username and I'll add you for read-only access to the Classifier4J CVS.

Nick

> -----Original Message-----
> From: Matt Collier [mailto:MCo...@my...]
> Sent: Wednesday, 19 November 2003 8:24 AM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] New Stop Words Provider
>
> Attached is GammaStopWordsProvider.java. I discovered and implemented the
> ArrayList class.
>
> Still need to devise a way for users to pass the path to their custom stop
> list, or implement moedusa's idea about automatic resource location.
>
> Also should throw an exception either in addition to or instead of printing an
> error message.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
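The plan discussed in this thread (a default list loaded as a resource, a constructor or factory that takes a name, one word per line, and an exception on failure) can be sketched as below. This is a standalone illustration under stated assumptions, not the provider that shipped in Classifier4J: `StopWordList`, `read` and `fromResource` are hypothetical names, and only the `isStopWord` method is meant to mirror what a stop word provider offers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopWordList {
    private final Set<String> words = new HashSet<>();

    /** Reads one stop word per line; blank lines are skipped. */
    static StopWordList read(Reader source) {
        StopWordList list = new StopWordList();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    list.words.add(word);
                }
            }
        } catch (IOException e) {
            // Fail loudly instead of printing an error message and carrying on.
            throw new UncheckedIOException(e);
        }
        return list;
    }

    /** Loads a named list from the classpath, e.g. "defaultstopwords.txt". */
    static StopWordList fromResource(String name) {
        InputStream in = StopWordList.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new IllegalArgumentException("Stop word resource not found: " + name);
        }
        return read(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    boolean isStopWord(String word) {
        return word != null && words.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

Loading from the classpath avoids a hard-coded path such as c:/stoplist/english.stop, and the `Reader` overload keeps the parsing testable without any file on disk.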
From: Matt C. <MCo...@my...> - 2003-11-18 21:52:20
|
Attached is GammaStopWordsProvider.java. I discovered and implemented the ArrayList class.

Still need to devise a way for users to pass the path to their custom stop list, or implement moedusa's idea about automatic resource location.

Also should throw an exception either in addition to or instead of printing an error message.

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: cla...@li...
Date: Sat, 15 Nov 2003 13:23:54 -0600
Subject: Re: [Classifier4j-devel] New Stop Words Provider

> Attached, find BetaStopWordsProvider which EXTENDS DefaultStopWordsProvider.
> I think I'm getting the hang of this.
>
> To use this, you need to do something like this in your code:
>
>     ICategorisedWordsDataSource wds=null; //define wds how you like
>     IStopWordProvider swp=new BetaStopWordsProvider();
>     ITokenizer tok=new DefaultTokenizer();
>
>     BayesianClassifier classifier = new BayesianClassifier(wds,tok,swp);
>
> Everything is become clear to me now!
>
> One question remains in my mind: is it correct to say that our html stripper
> and stemmer will both have to work out of ITokenizer/DefaultTokenizer?
>
> Place BetaStopWordsProvider.java in the same directory as your
> DefaultStopWordsProvider.java, make sure you have a stop-list at
> c:/stoplist/english.stop and you should be in business.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN
>
> -----Original Message-----
> From: "Matt Collier" <MCo...@my...>
> To: "Classifier4J" <cla...@li...>
> Date: Sat, 15 Nov 2003 12:37:03 -0600
> Subject: [Classifier4j-devel] New Stop Words Provider
>
> > Attached is an alternate stop words provider for classifier4J. I simply
> > copied the whole of DefaultStopWordsProvider.java and renamed it to
> > AlphaStopWordsProvider.java.
> >
> > I am pretty sure that this is not the correct way to do this since there is a
> > comment about overriding the getStopWords method, but I'm not sure how to do
> > this right now. I wanted to get this code out for review. Please advise.
> >
> > This reads the stop list from a file "c:/stoplist/english.stop". You will
> > need to download the stop list or create your own. There is a link on the
> > wiki site for the stop-list that Nick found:
> >
> > http://www.ishmaelswiki.org/wiki/index.php/TextClassification
> >
> > There should be a single word on each line of your stop list file.
> >
> > Matt Collier
> > RemoteIT
> > mco...@my...
> > 877-4-NEW-LAN |
From: Nick L. <ni...@ma...> - 2003-11-18 08:57:59
|
You can customize this by passing a different regular expression to the constructor of DefaultTokenizer.

----- Original Message -----
From: "Matt Collier" <MCo...@my...>
To: "Classifier4J" <cla...@li...>
Sent: Sunday, November 16, 2003 6:12 AM
Subject: [Classifier4j-devel] The case for numerical tokens

> What's the word on numerical tokens in the word probability database? Do they
> stay or do they go?
>
> All I know is I've got a slew of them and I doubt they are serving much
> purpose.
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
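To illustrate the idea in the reply above, here is a minimal standalone sketch of regex-based tokenizing in which digits count as separators, so purely numerical tokens never reach the word database. It uses only `String.split`; how DefaultTokenizer applies the expression passed to its constructor is not shown in this thread, so treating the expression as a split pattern is an assumption. `LettersOnlyTokens` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class LettersOnlyTokens {
    /** Splits the input on the given separator pattern and drops empty tokens. */
    static List<String> tokenize(String input, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split(separatorRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Any run of characters that are not ASCII letters acts as a separator,
        // which removes "66" and "2003" along with the whitespace.
        System.out.println(tokenize("Order 66 shipped in 2003", "[^A-Za-z]+"));
    }
}
```

The pattern `[^A-Za-z]+` is deliberately crude: it also discards accented letters and splits words such as "mp3", so a real deployment would likely want a more careful expression.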
From: Nick L. <nl...@es...> - 2003-11-18 04:35:02
|
> In my word_probability database, I currently have no "nonMatchingCount"s.

Yes, that will cause problems!

> Therefore all my word probabilities are turning out as LOWER_BOUND,
> NEUTRAL_PROBABILITY, or .99 since effectively matchingCount/matchingCount = 1.
> BayesianClassifier.normaliseSignificance() presumably adjusts this
> outcome from 1 to .99.
>
> This I believe represents a major difference between the current method and my
> understanding of POPFile's method.
>
> At this point, POPFile is calculating:
>
> Occurrences of Word A in Category XYZ / Total Occurrences of ALL words in
> Category XYZ.
>
> In other words:
> match_count of A where Category=XYZ / sum(match_count) from category XYZ.

Classifier4J calculates the probability of a word matching as match_count/(match_count + non_match_count).

I guess the difference between the two methods is quite important. I'm trying to analyse what it means and which is more useful.

Consider the following case (based on my actual database of words): I want to analyse the sentence "Apache Jakarta is a Java Site" to see if it matches my "I would probably be interested in this" criteria. I am expecting that it will match.

// we need to calculate xy/(xy + z)
// where z = (1-x)(1-y)

Total of select sum(match_count) from word_probability = 15359

Apache:  M=16, NM=2,  C4J-P=0.8889                 PF-P=0.001
Jakarta: M=16, NM=0,  C4J-P=0.99 (using cut-off)   PF-P=0.001
is      = stop word
a       = stop word
Java:    M=98, NM=13, C4J-P=0.6805                 PF-P=0.0064
Site:    M=7,  NM=4,  C4J-P=0.6364                 PF-P=0.0005

For Classifier4J, the calculation goes:

(0.8889)(0.99)(0.6805)(0.6364)/((0.8889)(0.99)(0.6805)(0.6364) + (1 - 0.8889)(1 - 0.99)(1 - 0.6805)(1 - 0.6364))
= 0.3811065397722/(0.3811065397722 + (0.1111)(0.01)(0.3195)(0.3636))
= 0.3811065397722/(0.3811065397722 + 0.0001290650922)
= 0.3811065397722/0.3812356048644
= 0.9996

For POPFile:

(0.001)(0.001)(0.0064)(0.0005)/((0.001)(0.001)(0.0064)(0.0005) + (1 - 0.001)(1 - 0.001)(1 - 0.0064)(1 - 0.0005))
= 0.0000000000032/(0.0000000000032 + (0.999)(0.999)(0.9936)(0.9995))
= 0.0000000000032/(0.0000000000032 + 0.992904589291032)
= 0.0000000000032/0.992904589294232
= pretty close to zero

Now I realize they do their stuff with logs to get around this, but I don't really think you can call that Bayesian. Bayes's theorem looks like: <http://www.paulgraham.com/naivebayes.html>

> This is my interpretation of the method discussed at:
> http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137
>
> Have I overlooked something, or is this just a difference between the two
> calculations?

I don't think you've overlooked anything. |
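The arithmetic in this message can be reproduced with a few lines of code. The sketch below implements the xy/(xy + z) combination exactly as written above, where xy is the product of the per-word probabilities and z is the product of their complements, and feeds it the two sets of per-word values quoted in the message. `CombineProbabilities` and `combine` are hypothetical names, not Classifier4J methods.

```java
final class CombineProbabilities {
    /** Returns xy/(xy + z), with xy = product of p and z = product of (1 - p). */
    static double combine(double[] probabilities) {
        double product = 1.0;
        double complement = 1.0;
        for (double p : probabilities) {
            product *= p;
            complement *= (1.0 - p);
        }
        return product / (product + complement);
    }

    public static void main(String[] args) {
        // Per-word values for Apache, Jakarta, Java and Site as quoted in the message.
        double[] c4j = {0.8889, 0.99, 0.6805, 0.6364};
        double[] popfile = {0.001, 0.001, 0.0064, 0.0005};
        System.out.printf("Classifier4J-style inputs: %.4f%n", combine(c4j));
        System.out.printf("POPFile-style inputs:      %.3e%n", combine(popfile));
    }
}
```

The same formula gives a result near 1 for the first set and near 0 for the second, which is the point of the comparison: the combination step assumes each input is a per-word probability of a match, and POPFile-style frequencies are not on that scale.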
From: Matt C. <MCo...@my...> - 2003-11-18 03:10:34
|
WordProbability.calculateProbability includes the following:

    if (matchingCount == 0) {
        if (nonMatchingCount == 0) {
            result = IClassifier.NEUTRAL_PROBABILITY;
        } else {
            result = IClassifier.LOWER_BOUND;
        }
    } else {
        result = BayesianClassifier.normaliseSignificance((double)matchingCount / (double) (matchingCount + nonMatchingCount));
    }

In my word_probability database, I currently have no "nonMatchingCount"s. Therefore all my word probabilities are turning out as LOWER_BOUND, NEUTRAL_PROBABILITY, or .99 since effectively matchingCount/matchingCount = 1. BayesianClassifier.normaliseSignificance() presumably adjusts this outcome from 1 to .99.

This I believe represents a major difference between the current method and my understanding of POPFile's method.

At this point, POPFile is calculating:

Occurrences of Word A in Category XYZ / Total Occurrences of ALL words in Category XYZ.

In other words:
match_count of A where Category=XYZ / sum(match_count) from category XYZ.

This is my interpretation of the method discussed at:
http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137

Have I overlooked something, or is this just a difference between the two calculations?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: "Matt Collier" <MCo...@my...>
To: "Classifier4J" <cla...@li...>
Date: Mon, 17 Nov 2003 15:33:19 -0600
Subject: [Classifier4j-devel] Fwd: calculateOverallProbability Questions

> Can someone explain to me what is happening in calculateOverallProbability?
>
> The "probability" for each word drawn into this method via calcWordsProbability
> is .99 if at least one occurrence of the word exists in the database in the given
> category and .5 (Neutral) if the word does not occur in the given category.
>
> This does not seem right to me.
>
> I am not sure when, where, how and why the probability on the words is
> getting assigned as described.
>
> Another thing that is confusing me is that several times during the course of
> this method, the variable "z" goes to 0 (zero) and the process continues.
> Attached is the tail end of a log of this method. If z goes to zero over and
> over, what is the point of performing this calculation? It seems the
> calculation would only take into account those words that are processed after
> the very last time z goes to zero.
>
> I simply added:
>
>     System.out.println("Z : [" + z +"] Word : [" + wps[i].getWord()+"] Probability : [" + wps[i].getProbability() + "]");
>
> after each assignment of z in BayesianClassifier.calculateOverallProbability()
>
> Also, z is recalculated on each occurrence of a particular word. Is this
> proper?
>
> Matt Collier
> RemoteIT
> mco...@my...
> 877-4-NEW-LAN |
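The two per-word estimates contrasted in this message can be put side by side in code. The sketch below is an illustration under stated assumptions: the 0.99 cut-off and the 0.5 neutral value are quoted in this thread, but the value 0.01 for the lower bound and the clamping behaviour of normaliseSignificance are assumptions. `WordEstimates`, `c4jStyle` and `popfileStyle` are hypothetical names.

```java
final class WordEstimates {
    static final double LOWER_BOUND = 0.01; // assumed value, not confirmed in the thread
    static final double UPPER_BOUND = 0.99; // the ".99" cut-off mentioned in the thread
    static final double NEUTRAL = 0.5;      // the ".5 (Neutral)" mentioned in the thread

    /** match / (match + nonMatch), clamped into [LOWER_BOUND, UPPER_BOUND]. */
    static double c4jStyle(long matching, long nonMatching) {
        if (matching == 0) {
            return nonMatching == 0 ? NEUTRAL : LOWER_BOUND;
        }
        double p = (double) matching / (double) (matching + nonMatching);
        return Math.max(LOWER_BOUND, Math.min(UPPER_BOUND, p));
    }

    /** Occurrences of the word in the category / occurrences of all words in it. */
    static double popfileStyle(long matching, long totalInCategory) {
        return (double) matching / (double) totalInCategory;
    }

    public static void main(String[] args) {
        // With no non-matching counts, every seen word lands on the upper cut-off.
        System.out.println(c4jStyle(16, 0));
        System.out.println(popfileStyle(16, 15359));
    }
}
```

With no non-matching counts recorded, `c4jStyle` can only return the neutral value or the upper cut-off, which matches the behaviour reported here; the remedy on that side is training with non-matching text, not a change of formula.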
From: Matt C. <MCo...@my...> - 2003-11-17 21:30:58
|
Can someone explain to me what is happening in calculateOverallProbability?

The "probability" for each word drawn into this method via calcWordsProbability is .99 if at least one occurrence of the word exists in the database in the given category and .5 (Neutral) if the word does not occur in the given category.

This does not seem right to me.

I am not sure when, where, how and why the probability on the words is getting assigned as described.

Another thing that is confusing me is that several times during the course of this method, the variable "z" goes to 0 (zero) and the process continues. Attached is the tail end of a log of this method. If z goes to zero over and over, what is the point of performing this calculation? It seems the calculation would only take into account those words that are processed after the very last time z goes to zero.

I simply added:

    System.out.println("Z : [" + z +"] Word : [" + wps[i].getWord()+"] Probability : [" + wps[i].getProbability() + "]");

after each assignment of z in BayesianClassifier.calculateOverallProbability()

Also, z is recalculated on each occurrence of a particular word. Is this proper?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN |
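One possible reading of the z behaviour described here, offered as an assumption and not as a statement about the Classifier4J source: a running product of many small factors such as (1 - 0.99) eventually underflows to exactly 0.0 in double arithmetic. The sketch below demonstrates the underflow and shows one common workaround, summing logarithms and converting back only at the end. `LogSpaceCombine` and its methods are hypothetical names.

```java
final class LogSpaceCombine {
    /** Multiplies (1 - p) by itself n times, the way a running product would. */
    static double naiveComplementProduct(double p, int n) {
        double z = 1.0;
        for (int i = 0; i < n; i++) {
            z *= (1.0 - p);
        }
        return z;
    }

    /**
     * Computes xy/(xy + z) without forming xy or z directly.
     * Each probability must lie strictly between 0 and 1.
     */
    static double combineInLogSpace(double[] probabilities) {
        double logProduct = 0.0;
        double logComplement = 0.0;
        for (double p : probabilities) {
            logProduct += Math.log(p);
            logComplement += Math.log(1.0 - p);
        }
        // xy/(xy + z) == 1/(1 + z/xy) == 1/(1 + exp(log z - log xy))
        return 1.0 / (1.0 + Math.exp(logComplement - logProduct));
    }

    public static void main(String[] args) {
        // 200 words at the 0.99 cut-off: the direct product of complements is lost.
        System.out.println(naiveComplementProduct(0.99, 200));
    }
}
```

If a product that has collapsed to zero is then treated as "not yet initialised", only the words seen after the last collapse would influence the result, which would be consistent with the concern raised above; whether that is what happens in calculateOverallProbability would need checking against the source.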
From: moedusa <mo...@in...> - 2003-11-17 07:03:51
|
Nick Lothian wrote:
> Our stemmers & stop words will be language
> specific. Unfortunately I don't see a way around this, unless there is some
> magic way to generate a stop word list & stemmer in any language...

Could we take the same approach Lucene does? In short (if you have never used Lucene), here is an article on indexing with Lucene: http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html.

So, to index documents in Russian, I do something like this:

    RussianAnalyzer analyzer = new RussianAnalyzer(RussianCharsets.UnicodeRussian);
    IndexWriter writer = new IndexWriter(indexPath, analyzer, false /* do not create index, it exists */);
    writer.addDocument(toLuceneDocument(dto));
    writer.optimize();
    writer.close();

where IndexWriter is something like a JDBC provider. Analyzers are tokenizers and stemmers in one. There are also simple analyzers that just convert all text to lowercase letters.

"...The second parameter provides the implementation of Analyzer that should be used for pre-processing the text before it is indexed. This [*NOT FROM MY CODE, see article for context (moedusa)*] particular implementation of Analyzer eliminates stop words, converts tokens to lower case, and performs a few other small input modifications, such as eliminating periods from acronyms" (http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html).

So, to deal with encoding, I need only my (Russian) analyzer. It should be a stemming analyzer, if there is an implementation, or just a stop-words analyzer. Initial stop words are stored as an array, and could also be instantiated from anywhere else; that is up to the developer. There is no code in Lucene to search for a stop-word list or anything like that...

It means that you, as author, need only provide the API :) We'll do the rest. |
From: Nick L. <nl...@es...> - 2003-11-17 06:48:14
|
> Nick Lothian wrote:
> > So you are suggesting using the dublin core tags for deciding what category
> > a document says it is in when training?
>
> Well, if it is not too time-consuming to implement, it could be a nice
> option. As far as I know, the DC metadata set is the only near-standard way to
> write real metadata to meta tags, since metas in the html spec are almost
> undefined, and left to the document author to decide what to do with them...

Yes - it's a good idea. I'm not sure if I'll get to implement it, though ;-)

> > With respect to non-ASCII text, why does C4J need to know what encoding the
> > source is in? I know the definition of word breaks etc. is encoding (and language)
> > specific, but this is a limitation of the current default tokenizer, too. Is
> > that the only reason to find the encoding?
>
> I am not very informed on how the Bayesian algorithm works. I know that it
> can be used without knowing the encoding, but we have talked about
> stemmers and stop-words, and it seems that this stuff is
> language-encoding specific... Correct me if I am wrong.

Yes, I think you are right. Our stemmers & stop words will be language specific. Unfortunately I don't see a way around this, unless there is some magic way to generate a stop word list & stemmer in any language... |
From: moedusa <mo...@in...> - 2003-11-17 06:39:37
|
Nick Lothian wrote:
> So you are suggesting using the dublin core tags for deciding what category
> a document says it is in when training?

Well, if it is not too time-consuming to implement, it could be a nice option. As far as I know, the DC metadata set is the only near-standard way to write real metadata to meta tags, since metas in the html spec are almost undefined, and left to the document author to decide what to do with them...

> With respect to non-ASCII text, why does C4J need to know what encoding the
> source is in? I know the definition of word breaks etc. is encoding (and language)
> specific, but this is a limitation of the current default tokenizer, too. Is
> that the only reason to find the encoding?

I am not very informed on how the Bayesian algorithm works. I know that it can be used without knowing the encoding, but we have talked about stemmers and stop-words, and it seems that this stuff is language-encoding specific... Correct me if I am wrong. |
From: Nick L. <nl...@es...> - 2003-11-17 06:27:47
|
> Nick Lothian wrote:
> > What are peoples general requirements for an HTML Tokenizer?
> >
> > Personally, I want to get rid of all the tags and just get the pure text of
> > the document.
>
> I think meta tags are required if you need to classify (or train on)
> already classified html documents. Also remember Dublin Core meta tags
> (http://www.ietf.org/rfc/rfc2731.txt). But alts, titles etc. could be
> skipped, since the only real metadata is in meta tags... Also remember that
> if you need to classify non-ASCII text, the only source for the encoding is
> the meta tag.

So you are suggesting using the dublin core tags for deciding what category a document says it is in when training? That's a good idea - I hadn't thought of that.

With respect to non-ASCII text, why does C4J need to know what encoding the source is in? I know the definition of word breaks etc. is encoding (and language) specific, but this is a limitation of the current default tokenizer, too. Is that the only reason to find the encoding? |
From: moedusa <mo...@in...> - 2003-11-17 06:14:39
|
Nick Lothian wrote:
> What are peoples general requirements for an HTML Tokenizer?
>
> Personally, I want to get rid of all the tags and just get the pure text of
> the document.

I think meta tags are required if you need to classify (or train on) already classified html documents. Also remember Dublin Core meta tags (http://www.ietf.org/rfc/rfc2731.txt). But alts, titles etc. could be skipped, since the only real metadata is in meta tags... Also remember that if you need to classify non-ASCII text, the only source for the encoding is the meta tag. |
From: Nick L. <nl...@es...> - 2003-11-17 05:50:43
|
What are peoples general requirements for an HTML Tokenizer?

Personally, I want to get rid of all the tags and just get the pure text of the document.

I'm thinking about writing a tokenizer based on this (obviously cleaned up and turned into a tokenizer):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Stack;

    public class Main {
        static Stack stack = new Stack();

        public static void main(String[] args) throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader("input.html"));
            StringBuffer contents = new StringBuffer();
            String line = reader.readLine();
            while (line != null) {
                contents.append(line);
                line = reader.readLine();
            }
            reader.close();

            char[] chars = contents.toString().toCharArray();
            for (int i = 0; i < chars.length; i++) {
                if (chars[i] == '<') {
                    // entering a tag
                    stack.push(Boolean.TRUE);
                } else if (chars[i] == '>') {
                    // leaving a tag; ignore a stray '>' outside any tag
                    if (!stack.isEmpty()) {
                        stack.pop();
                    }
                } else if (stack.size() == 0) {
                    // outside all tags: this is document text
                    System.out.print(chars[i]);
                }
            }
        }
    }

What do people think? The advantage is that it doesn't require any external libraries. The disadvantage is that it can't return things like meta tag information, or things in alt or text attributes.

Opinions?

Nick

> -----Original Message-----
> From: moedusa [mailto:mo...@in...]
> Sent: Sunday, 16 November 2003 9:10 PM
> To: cla...@li...
> Subject: Re: [Classifier4j-devel] HTML Tokenize v0.000001 Ready for review
>
> > Matt Collier wrote:
> >
> >> See attached, you will need Xerces and NekoHTML in your classpath.
>
> Just to make a note: there is one more option to deal with HTML soup
> (when you need to clean up MSWord HTML, for example). It seems that
> NekoHTML does the same thing, but there is one more library called JTidy
> (http://lempinen.net/sami/jtidy/) based on code from the W3C Tidy
> (http://www.w3.org/People/Raggett/tidy/). Since I have not worked with
> NekoHTML, I cannot compare them, but, concerning JTidy, I must say that
> it is a pretty good library. It can be used like a JavaBean
> (http://sourceforge.net/docman/display_doc.php?docid=1298&group_id=13153),
> and, finally, it has a very nice option: draconianWord2000Cleaning
> (http://www.w3.org/People/Raggett/tidy/#word2000). I used it for this
> kind of thing. Also it is not bound to a concrete Xerces version. |
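As a rough illustration of the "cleaned up and turned into a tokenizer" step mentioned in this message, the sketch below applies the same tag-depth idea to a string and returns the words. It is a guess at the shape such a tokenizer could take, not the SimpleHTMLTokenizer that later shipped; `TagStripper`, `stripTags` and `tokenize` are hypothetical names, and the choice to emit a space where a tag closes is an addition so that words on either side of a tag are not joined.

```java
final class TagStripper {
    /** Removes everything between '<' and the matching '>' and keeps the rest. */
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // number of unclosed '<' characters seen so far
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                if (depth > 0) {
                    depth--;
                    if (depth == 0) {
                        text.append(' '); // a closed tag separates words
                    }
                }
                // a stray '>' outside any tag is dropped
            } else if (depth == 0) {
                text.append(c);
            }
        }
        return text.toString();
    }

    /** Splits the tag-free text on whitespace. */
    static String[] tokenize(String html) {
        String trimmed = stripTags(html).trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        for (String token : tokenize("<p>Hello <b>big</b> world</p>")) {
            System.out.println(token);
        }
    }
}
```

Like the original, this needs no external libraries and cannot see meta tags or attribute values; it also treats a '<' inside ordinary text as the start of a tag, which a parser such as NekoHTML or JTidy would handle more gracefully.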