classifier4j-devel Mailing List for Classifier4J (Page 5)
Status: Beta
Brought to you by:
nicklothian
From: Wayne <des...@ho...> - 2004-11-28 22:47:08
|
My Bayesian test program compiles fine, but I get this error when I try to run it:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
        at net.sf.classifier4J.bayesian.WordProbability.calculateProbability(WordProbability.java:167)
        at net.sf.classifier4J.bayesian.WordProbability.setMatchingCount(WordProbability.java:138)
        at net.sf.classifier4J.bayesian.WordProbability.<init>(WordProbability.java:115)
        at net.sf.classifier4J.bayesian.SimpleWordsDataSource.addMatch(SimpleWordsDataSource.java:94)
        at testing.Test1.main(Test1.java:15)

I am using Eclipse 3.1M2 and have added Classifier4J-0.51.jar as an external JAR library. This version of Eclipse uses JDK 5.0. Does anyone know what settings I need in Eclipse to run it?

Here is the test code in my project:

    package testing;

    import net.sf.classifier4J.ClassifierException;
    import net.sf.classifier4J.IClassifier;
    import net.sf.classifier4J.bayesian.BayesianClassifier;
    import net.sf.classifier4J.bayesian.IWordsDataSource;
    import net.sf.classifier4J.bayesian.SimpleWordsDataSource;
    import net.sf.classifier4J.bayesian.WordsDataSourceException;

    public class Test1 {

        public static void main(String[] args) {
            IWordsDataSource wds = new SimpleWordsDataSource();
            try {
                wds.addMatch("Blah");
            } catch (WordsDataSourceException e) {
                e.printStackTrace();
            }

            IClassifier classifier = new BayesianClassifier(wds);
            try {
                dReturn = classifier.classify("Blah Happy Holidays");
            } catch (ClassifierException e1) {
                e1.printStackTrace();
            }
            System.out.println(dReturn);
        }

        private static double dReturn;
    }

Thanks
-Wayne
|
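A NoClassDefFoundError for org/apache/commons/logging/LogFactory usually means the Apache Commons Logging jar is not on the run-time classpath, even though compilation succeeded with only the Classifier4J jar. The following is a minimal diagnostic sketch, not part of Classifier4J: the class name ClasspathCheck and its method are hypothetical, and it only reports whether a class can be located by the current class loader.

```java
// Diagnostic sketch (hypothetical helper, not part of Classifier4J):
// reports whether a named class can be located at run time.
final class ClasspathCheck {

    /** Returns true if the named class is visible to this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.logging.LogFactory";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
        // Printing the effective classpath helps spot a missing or mistyped jar path.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

If the class is reported as not found when run from the same Eclipse launch configuration, adding the commons-logging jar to the project's build path alongside the Classifier4J jar is the likely fix.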
From: Nick L. <nl...@es...> - 2004-09-27 03:19:48
|
Some of you might find this interesting: "Create Intelligent E-mail Filters with JavaMail and Classifier4j" <http://www.devx.com/opensource/Article/22019> Nick |
From: Nick L. <nl...@es...> - 2004-09-09 23:16:00
|
If that works better for you, then do it. I'd suggest something like (untested!):

    public class YourClassifier extends BayesianClassifier {

        protected double calculateOverallProbability(WordProbability[] wps) {
            if (wps == null || wps.length == 0) {
                return IClassifier.NEUTRAL_PROBABILITY;
            } else {
                return super.calculateOverallProbability(wps);
            }
        }
    }

The classifier hierarchy is designed to be extended like that.

Nick

> -----Original Message-----
> From: David Spencer [mailto:dav...@ya...]
> Sent: Friday, 10 September 2004 4:29 AM
> To: cla...@li...
> Subject: [Classifier4j-devel] RE: Fwd: calculateOverallProbability Questions
> Importance: Low
|
From: David S. <dav...@ya...> - 2004-09-09 18:58:48
|
I just stumbled across this thread from last year:

http://sourceforge.net/mailarchive/forum.php?thread_id=3483166&forum_id=34026

I'm having similar "troubles" in the sense that classify() is returning 0.99 too often, and it's because some of the words have either zero matches or zero non-matches.

The question is, does it make sense to ignore words that don't have at least 1 match and at least 1 non-match?

It's easy enough to extend BayesianClassifier and override calculateOverallProbability(), and in my experiment it seems to work "better", though I guess you could argue it's not fair to ignore such words: maybe a given word will always be a match or a non-match, so it should be considered somehow.

Anyway, the code mod I made is at the bottom of this fragment - I just added 2 lines:

    protected double calculateOverallProbability(WordProbability[] wps)
    {
        if (wps == null || wps.length == 0)
        {
            return IClassifier.NEUTRAL_PROBABILITY;
        }
        else
        {
            // we need to calculate xy/(xy + z)
            // where z = (1-x)(1-y)

            // firstly, calculate z and xy
            double z = 0d;
            double xy = 0d;
            for (int i = 0; i < wps.length; i++)
            {
                // dss begin
                if (wps[i].getMatchingCount() == 0) continue;
                if (wps[i].getNonMatchingCount() == 0) continue;
                // dss end
                ...
|
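The idea in the message above can be sketched without the library. This is an illustration only, not Classifier4J's actual implementation: WordStat is a hypothetical stand-in for the per-word statistics, the neutral value of 0.5 is an assumption, and the formula is the xy/(xy + z) combination named in the code comments, generalised to any number of words.

```java
// Standalone sketch of the xy/(xy + z) combination with the proposed skip rule.
// WordStat and NEUTRAL are assumptions made for illustration.
final class OverallProbability {

    static final double NEUTRAL = 0.5; // assumed neutral probability

    /** Hypothetical per-word statistics: probability plus raw counts. */
    static final class WordStat {
        final double probability;
        final int matches;
        final int nonMatches;

        WordStat(double probability, int matches, int nonMatches) {
            this.probability = probability;
            this.matches = matches;
            this.nonMatches = nonMatches;
        }
    }

    /**
     * Combines per-word probabilities as xy / (xy + z), where xy is the product
     * of the probabilities and z is the product of their complements.
     * When skipOneSided is true, words never seen as a match or never seen as
     * a non-match are ignored, as proposed in the message.
     */
    static double combine(WordStat[] stats, boolean skipOneSided) {
        if (stats == null || stats.length == 0) {
            return NEUTRAL;
        }
        double xy = 1.0;
        double z = 1.0;
        int used = 0;
        for (WordStat s : stats) {
            if (skipOneSided && (s.matches == 0 || s.nonMatches == 0)) {
                continue; // word is one-sided: ignore it
            }
            xy *= s.probability;
            z *= 1.0 - s.probability;
            used++;
        }
        if (used == 0 || xy + z == 0.0) {
            return NEUTRAL; // nothing usable left to combine
        }
        return xy / (xy + z);
    }
}
```

With two words at 0.9 and 0.8 the result is 0.72 / (0.72 + 0.02), roughly 0.973. Adding a one-sided word at 0.99 pushes the unfiltered result above 0.999, which is the "0.99 too often" effect the skip rule is meant to dampen.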
From: Nick L. <nl...@es...> - 2004-09-02 23:23:46
|
> On Thu, 2 Sep 2004 16:44:43 +0930, Nick Lothian <nl...@es...> wrote:
> > > DefaultTokenizer can only work for Latin languages. I'm planning to
> > > write a CJKTokenizer to split Chinese characters.
> >
> > Why does DefaultTokenizer only work for Latin languages? There is a
> > constructor that will let you pass in a custom regexp to split on - is that
> > not sufficient?
>
> Some Asian languages are not like English: words are not separated by
> spaces or any other characters. The text of a sentence is continuous.
>
> Some discussion about CJK word segmentation:
> http://www.webmasterworld.com/forum32/284.htm

I can't read that thread - it is marked members only.

I knew that Asian languages didn't split based on spaces, but I did think it was possible to split based on a regexp. (See n-gram tokenization in http://sourceforge.net/mailarchive/forum.php?thread_id=3404351&forum_id=8740 and Zope's CJKSplitter: http://www.zope.org/Members/panjunyong/CJKSplitter).

How are you planning on doing it? I've seen some discussion of dictionary-based splitting - is that what you are planning?

In any case, I'm happy to accept patches to SimpleHTMLTokenizer to make it work how you'd like.

Nick
|
From: Leo L. <leo...@gm...> - 2004-09-02 09:14:50
|
On Thu, 2 Sep 2004 16:44:43 +0930, Nick Lothian <nl...@es...> wrote:
> > DefaultTokenizer can only work for Latin languages. I'm planning to
> > write a CJKTokenizer to split Chinese characters.
>
> Why does DefaultTokenizer only work for Latin languages? There is a
> constructor that will let you pass in a custom regexp to split on - is that
> not sufficient?

Some Asian languages are not like English: words are not separated by spaces or any other characters. The text of a sentence is continuous.

Some discussion about CJK word segmentation:
http://www.webmasterworld.com/forum32/284.htm

--
Leo Liang
E-mail: leo...@gm...
Blog (tech & learning): http://aleung.blogbus.com
Blog (photography & outdoor): http://sunnyday.cn2k.net
Delicious bookmark: http://del.icio.us/aleung
|
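One way to tokenize text that has no word separators, and the n-gram approach mentioned elsewhere in this thread, is to emit overlapping character pairs. The sketch below is an assumption-laden illustration, not the CJKTokenizer discussed here: BigramTokenizer is a hypothetical standalone class that does not implement any Classifier4J interface, and it does no dictionary-based segmentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical character-bigram tokenizer; a sketch, not Classifier4J code.
final class BigramTokenizer {

    /**
     * Splits the input on whitespace, then emits overlapping two-character
     * tokens for each run. A run of a single character is emitted as-is.
     */
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        if (input == null) {
            return new String[0];
        }
        for (String run : input.trim().split("\\s+")) {
            if (run.isEmpty()) {
                continue;
            }
            // Work on code points so supplementary characters are not cut in half.
            int[] cps = run.codePoints().toArray();
            if (cps.length == 1) {
                tokens.add(run);
                continue;
            }
            for (int i = 0; i + 1 < cps.length; i++) {
                tokens.add(new String(cps, i, 2));
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```

For a four-character run this produces three overlapping pairs, so a classifier sees every adjacent character pair as a "word" without needing to know where real word boundaries fall.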
From: Nick L. <nl...@es...> - 2004-09-02 07:17:47
|
> DefaultTokenizer can only work for Latin languages. I'm planning to
> write a CJKTokenizer to split Chinese characters.

Why does DefaultTokenizer only work for Latin languages? There is a constructor that will let you pass in a custom regexp to split on - is that not sufficient?

I should also point out that SimpleHTMLTokenizer is probably insufficient for almost any real-world usage - it will break on mismatched tags, for instance.

Nick
|
From: Leo L. <leo...@gm...> - 2004-09-02 05:35:15
|
> Why would you want to use any of the functionality of SimpleHTMLTokenizer
> without also using DefaultTokenizer?
>
> SimpleHTMLTokenizer doesn't really do a great deal more than
> DefaultTokenizer, and I would like to understand which parts of it you want
> to reuse.
>
> Nick

DefaultTokenizer can only work for Latin languages. I'm planning to write a CJKTokenizer to split Chinese characters.

--
Leo Liang
|
From: Nick L. <nl...@es...> - 2004-09-02 03:53:46
|
> Hi,
>
> Now, SimpleHTMLTokenizer inherits from DefaultTokenizer. If I make a
> new ITokenizer implementation, I have to rewrite an HTML tokenizer.
>
> If SimpleHTMLTokenizer used the decorator pattern, it could be re-used in
> other ITokenizer implementations.
>
>     --------------------> ITokenizer
>     |                      |        |
>     -- SimpleHTMLTokenizer     DefaultTokenizer

Why would you want to use any of the functionality of SimpleHTMLTokenizer without also using DefaultTokenizer?

SimpleHTMLTokenizer doesn't really do a great deal more than DefaultTokenizer, and I would like to understand which parts of it you want to reuse.

Nick
|
From: Leo L. <leo...@gm...> - 2004-09-02 03:27:57
|
Hi,

Now, SimpleHTMLTokenizer inherits from DefaultTokenizer. If I make a new ITokenizer implementation, I have to rewrite an HTML tokenizer.

If SimpleHTMLTokenizer used the decorator pattern, it could be re-used in other ITokenizer implementations.

    --------------------> ITokenizer
    |                      |        |
    -- SimpleHTMLTokenizer     DefaultTokenizer

--
Leo Liang
|
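The decorator arrangement proposed above can be sketched as follows. This is only an illustration of the pattern: the Tokenizer interface here is a hypothetical stand-in for ITokenizer, WhitespaceTokenizer and HtmlStrippingTokenizer are invented names, and the tag-stripping regex is deliberately naive, so it shares the weakness with malformed markup that comes up elsewhere in this thread.

```java
// Decorator sketch; all three types are hypothetical, not Classifier4J classes.
interface Tokenizer {
    String[] tokenize(String input);
}

/** Plain tokenizer: splits on runs of whitespace. */
final class WhitespaceTokenizer implements Tokenizer {
    public String[] tokenize(String input) {
        String text = (input == null) ? "" : input.trim();
        return text.isEmpty() ? new String[0] : text.split("\\s+");
    }
}

/** Decorator: strips tags, then hands the remaining text to any other tokenizer. */
final class HtmlStrippingTokenizer implements Tokenizer {
    private final Tokenizer delegate;

    HtmlStrippingTokenizer(Tokenizer delegate) {
        this.delegate = delegate;
    }

    public String[] tokenize(String input) {
        // Naive tag removal: replaces anything between '<' and '>' with a space.
        String text = (input == null) ? "" : input.replaceAll("<[^>]*>", " ");
        return delegate.tokenize(text);
    }
}
```

Because the HTML handling wraps a tokenizer rather than extending one, the same wrapper could sit in front of a whitespace tokenizer, a regexp tokenizer, or a CJK tokenizer, for example new HtmlStrippingTokenizer(new WhitespaceTokenizer()).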
From: Nick L. <ni...@ma...> - 2004-08-07 15:58:08
|
That should read:

That will _tokenise_ the text passed to the teachMatch & teachNonMatch methods as it goes.

> Sorry - I've been away for a few days.
>
> To train Classifier4J, do something like this:
>
> http://sourceforge.net/mailarchive/forum.php?thread_id=5110155&forum_id=34026
>
> That will y the text passed to the teachMatch & teachNonMatch methods
> as it goes.
>
> To classify after training use the classify(String) and/or
> isMatch(String) methods.
>
> Nick
|
From: Nick L. <ni...@ma...> - 2004-08-07 12:30:16
|
Sorry - I've been away for a few days.

To train Classifier4J, do something like this:

http://sourceforge.net/mailarchive/forum.php?thread_id=5110155&forum_id=34026

That will y the text passed to the teachMatch & teachNonMatch methods as it goes.

To classify after training use the classify(String) and/or isMatch(String) methods.

Nick
|
From: satmeet <ja...@sa...> - 2004-08-03 10:54:41
|
Hi,

I looked at the code again and have gone through the archives of the mailing list (first 5 months). I am writing down pseudo-code for a Bayesian filter; maybe you could help me visualize it in terms of Classifier4J. I am also uploading this to www.satmeet.com/bayesian.html (it's optimized for IE; sorry, I was in a hurry).

I would like to know the sequence for making tokens from text and then using them for training and classification. Could you tell me the flow of Classifier4J according to the pattern below? I know you are busy people, but I would be very grateful if you could help me.

In pseudo-code, training is:

Given: an email message, X, and a label Ci ∈ {CN, CS}
1. Break X into its tokens, <x1, ..., xk>
2. For each token, xj:
   (a) Increment the counter for token xj for class Ci
   (b) Increment the count of total tokens in class Ci
3. Increment the total number of email messages for class Ci

And classification can be written as:

Given: an UNLABELED email message, X
1. PN := Pr[CN]
2. PS := Pr[CS]
3. Break X into its tokens, <x1, ..., xk>
4. For each token, xj:
   (a) PN := PN • Pr[xj | CN]
   (b) PS := PS • Pr[xj | CS]
5. If PN > PS then return NORMAL
6. else return SPAM

where

   Pr[CN] = (# NORMAL emails) / (total # emails) = (# NORMAL emails) / (# SPAM + # NORMAL)
   Pr[CS] = (# SPAM emails) / (total # emails) = (# SPAM emails) / (# SPAM + # NORMAL)
   Pr[xj | Ci] = (# of tokens of type xj seen in class Ci) / (total # of tokens seen in class Ci)

Thanking you

Satmeet
|
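The pseudo-code above maps fairly directly onto a small program. The sketch below follows its training and classification steps but is not how Classifier4J itself works internally; NaiveBayesSketch and its method names are hypothetical. Two deliberate departures from the pseudo-code are assumptions of this sketch: it sums logarithms instead of multiplying probabilities (to avoid underflow), and it uses add-one smoothing so that an unseen token does not force a probability of zero.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the pseudo-code above; hypothetical class, not Classifier4J code.
final class NaiveBayesSketch {

    enum Label { NORMAL, SPAM }

    private final Map<Label, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<Label, Integer> totalTokens = new HashMap<>();
    private final Map<Label, Integer> messageCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    NaiveBayesSketch() {
        for (Label label : Label.values()) {
            tokenCounts.put(label, new HashMap<>());
            totalTokens.put(label, 0);
            messageCounts.put(label, 0);
        }
    }

    /** Step "break X into its tokens": lower-case, split on non-word characters. */
    private static String[] tokens(String message) {
        return message.toLowerCase().split("\\W+");
    }

    /** Training, steps 1-3 of the pseudo-code. */
    void train(String message, Label label) {
        for (String token : tokens(message)) {
            if (token.isEmpty()) continue;
            tokenCounts.get(label).merge(token, 1, Integer::sum); // 2(a)
            totalTokens.merge(label, 1, Integer::sum);            // 2(b)
            vocabulary.add(token);
        }
        messageCounts.merge(label, 1, Integer::sum);              // 3
    }

    /** log Pr[C] plus the sum of log Pr[xj | C], with add-one smoothing. */
    private double logScore(String message, Label label) {
        int totalMessages = messageCounts.get(Label.NORMAL) + messageCounts.get(Label.SPAM);
        double score = Math.log((double) messageCounts.get(label) / totalMessages);
        double denominator = totalTokens.get(label) + vocabulary.size();
        for (String token : tokens(message)) {
            if (token.isEmpty()) continue;
            int count = tokenCounts.get(label).getOrDefault(token, 0);
            score += Math.log((count + 1.0) / denominator);
        }
        return score;
    }

    /** Classification, steps 1-6 of the pseudo-code. */
    Label classify(String message) {
        if (messageCounts.get(Label.NORMAL) == 0 || messageCounts.get(Label.SPAM) == 0) {
            throw new IllegalStateException("train on both NORMAL and SPAM messages first");
        }
        return logScore(message, Label.NORMAL) > logScore(message, Label.SPAM)
                ? Label.NORMAL : Label.SPAM;
    }
}
```

The guard in classify() reflects a point made repeatedly in this archive: a classifier trained on only one class cannot produce meaningful results.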
From: Nick L. <nl...@es...> - 2004-07-30 04:04:45
|
Glad that it worked for you. Can we ask what the research was?

For an acknowledgement you just need something like:

"This software includes software developed by the Classifier4J (http://classifier4j.sourceforge.net) team"

Nick

-----Original Message-----
From: Kashif [mailto:ks...@ai...]
Sent: Friday, 30 July 2004 1:19 PM
To: cla...@li...
Subject: [Classifier4j-devel] Acknowlegement of Classifier4J
Importance: Low
|
From: Kashif <ks...@ai...> - 2004-07-30 03:49:09
|
Hi Nick

Classifier4J is working fine now with my email filter. Thanks for your help.

Before I submit my research, I would like to know whether I need to acknowledge the C4J team. Is there a required format or statement that I should include in my code?

Kashif
|
From: Nick L. <ni...@ma...> - 2004-07-18 12:02:28
|
> for (int i=0; i<n; i++) {
>     double result[] = new double[n];
>     result[i] = sclassifier.classify(message[i].getSubject());
>     System.out.println("The Probability of the message no. " + i + " is: " + result[i] );
> }

I suspect this code isn't quite doing what you want it to do, either - the line

    double result[] = new double[n];

should probably be before the loop....

Nick
|
From: <br...@bj...> - 2004-07-18 05:13:35
|
> You MUST teach non matches as well as matches - otherwise you
> will get the results you are currently getting.
>
> With most spam-type filters, you have a set of "spam" (which
> is used to train spam matches), and a set of normal mail (or
> "ham") which is used to train non-matches.

I must say that I ran into this exact problem when I first used C4J. I did my spam classification and was surprised when it marked everything as spam. I sent a message to this list, and someone (most likely Nick.. hehe) informed me that you need to sample both match and non-match messages.

I don't know what project you are using this for. I was working on an email spam classifier, so here's what I did: I exported two sets of messages, spam and non-spam. Then I wrote a class that reads a directory for a set of mbox-style messages. From there it parsed them, separating out the subject and body. Then it tokenized the messages into whitespace-separated words and re-formed them into a string. I then ran teachMatch and teachNonMatch on them depending on the known message type (spam or not spam).

I'm not sure tokenizing and re-forming is really needed, since I think C4J does that internally anyway (in some form or fashion).

Anyway... it seems to work pretty well :) It's not as good as SpamBayes, but that's only because I've been teaching SpamBayes much longer than C4J. I'm actually thinking of writing a program to read the SpamBayes database and insert the necessary data into the C4J database. I've just been having problems exporting the SpamBayes database into something usable (damn Python).

- Brent
|
From: Nick L. <ni...@ma...> - 2004-07-18 03:59:46
|
You don't need to do anything with defaultStopWords - it is automatically used.

You MUST teach non matches as well as matches - otherwise you will get the results you are currently getting.

With most spam-type filters, you have a set of "spam" (which is used to train spam matches), and a set of normal mail (or "ham") which is used to train non-matches.

Nick
|
From: Kashif <ks...@ai...> - 2004-07-16 08:11:23
|
Hi

The filter is now working on the black list and white list when I compare the "from" field.

Now I want to apply the filtering to the "subject" field, but it gives me 0.5 or 0.99 no matter what subject I use.

At the moment I am doing this:

1) Transfer each line (which is a single word) of "defaultStopWords.txt" into an array, stopWordListArray[].
2) Create another instance of IWordsDataSource (swds) and of ITrainableClassifier (sclassifier).
3) Use a for loop to teach matches. I know that I should also train non-matches, but I am not sure with what.
4) I was wondering whether C4J uses defaultStopWords.txt automatically, or whether we have to load the list somehow.

Here's my code:

    IWordsDataSource swds = new SimpleWordsDataSource();
    ITrainableClassifier sclassifier = new BayesianClassifier(swds);

    for (int i=0; i<stopWordListArray.length; i++) {
        sclassifier.teachMatch(stopWordListArray[i]);
    }

    for (int i=0; i<n; i++) {
        double result[] = new double[n];
        result[i] = sclassifier.classify(message[i].getSubject());
        System.out.println("The Probability of the message no. " + i + " is: " + result[i] );
    }

Thanks heaps for your help
|
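In the second loop above, the result array is allocated inside the loop body, so a fresh array is created on every iteration and the earlier scores are discarded. A minimal sketch of the usual fix follows. It does not call Classifier4J: SubjectScores is a hypothetical helper, and the classifier is passed in as a plain function so the example runs on its own.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical helper showing the array allocated once, before the loop.
final class SubjectScores {

    /** Scores every subject line and returns all scores in one array. */
    static double[] scoreAll(String[] subjects, ToDoubleFunction<String> classify) {
        double[] result = new double[subjects.length]; // allocated once, outside the loop
        for (int i = 0; i < subjects.length; i++) {
            result[i] = classify.applyAsDouble(subjects[i]);
        }
        return result;
    }
}
```

With a real classifier, the function argument would be whatever call produces the score for one subject line; any checked exception it throws would need to be handled inside that function.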
From: Nick L. <nl...@es...> - 2004-07-15 23:19:02
|
> Right - your first two for-loops are training your
> classifier to learn which messages are your
> whitelist messages. Once this is done you can
> test any other message against your training
> to get a "rating". When you run the classify()
> method it will return a value between 0.0 and 1.0.
>
> 0 meaning that the new message you ran classify
> on is definitely a blacklist message... 1.0 meaning
> it's definitely a whitelist message.
>
> At least this is the way I'm using Classifier4J..
> not sure if it's the absolute correct way :)
>
> My rules are similar to SpamBayes in that
> I mark anything with a 0.9 and above as a definite
> match.. anything below that is considered a partial
> match (I use C4J as a spam filter against email msgs).
>
> Nick - correct me if I'm wrong? I'm no C4J expert,
> but using it this way seems to work pretty well
> for me.

Yes, that is exactly correct.

You can use IClassifier.isMatch(String) to do the same thing - each instance of the classifier has a setCutoff() method (I think that's the name) to set the exact point above which anything will be marked as spam.
|
From: <br...@bj...> - 2004-07-15 16:45:12
|
Right - your first two for-loops are training your classifier to learn which messages are your whitelist messages. Once this is done, you can test any other message against your training to get a "rating". When you run the classify() method, it will return a value between 0.0 and 1.0.

0 means that the new message you ran classify() on is definitely a blacklist message; 1.0 means it's definitely a whitelist message.

At least this is the way I'm using Classifier4J.. not sure if it's the absolute correct way :)

My rules are similar to SpamBayes in that I mark anything with a 0.9 and above as a definite match; anything below that is considered a partial match (I use C4J as a spam filter against email messages).

Nick - correct me if I'm wrong? I'm no C4J expert, but using it this way seems to work pretty well for me.

- Brent

> -----Original Message-----
> From: Nick Lothian [mailto:nl...@es...]
> Sent: Thursday, July 15, 2004 3:25 AM
> To: 'cla...@li...'
> Subject: RE: [Classifier4j-devel] Next Step after Training?
|
From: Nick L. <nl...@es...> - 2004-07-15 07:27:17
|
Yes, that looks fine.

The line "double result = classifier.classify(bayMsgs[i].getSubject());" is doing the filtering.

Nick

-----Original Message-----
From: Kashif [mailto:ks...@ai...]
Sent: Thursday, 15 July 2004 3:54 PM
To: cla...@li...
Subject: [Classifier4j-devel] Next Step after Training?
Importance: Low
|
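The classify() call only yields a score between 0.0 and 1.0; turning that score into a decision needs a cutoff, such as the 0.9 used by one poster elsewhere in this thread. The sketch below shows that rule on its own. CutoffRule and its names are hypothetical, and it intentionally avoids the library's own cutoff setter, whose exact name is not confirmed in this thread.

```java
// Hypothetical thresholding helper; does not use the Classifier4J API.
final class CutoffRule {

    /** Cutoff value taken from the rule of thumb quoted in this thread. */
    static final double DEFAULT_CUTOFF = 0.9;

    /** A score at or above the cutoff counts as a definite match. */
    static boolean isDefiniteMatch(double score, double cutoff) {
        return score >= cutoff;
    }

    /** Convenience overload using the default cutoff. */
    static boolean isDefiniteMatch(double score) {
        return isDefiniteMatch(score, DEFAULT_CUTOFF);
    }
}
```

Which class counts as the "match" depends on training: in the code in this thread the whitelist was taught as matches, so a high score there means a whitelisted message, not spam.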
From: Kashif <ks...@ai...> - 2004-07-15 06:24:06
|
Hi

I solved the earlier problem with the "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/" error. I had to install the latest version of the JDK.

Thanks, Nick, for your help with training the filter.

I used the following code:

    // For Black List Arrays
    for (int i=0; i<blackListArray.length; i++) {
        classifier.teachNonMatch(blackListArray[i]);
    }

    // For White List Arrays
    for (int i=0; i<whiteListArray.length; i++) {
        classifier.teachMatch(whiteListArray[i]);
    }

    // For BayMsgs Arrays
    // Applying Bayesian Filter on subject
    // BayMsgs is an array of Message Object.
    for (int i=0; i<bayMsgs.length; i++) {
        double result = classifier.classify(bayMsgs[i].getSubject());
    }

Does this look OK to you?

Can you please let me know what the next steps are, and what I should do next to apply the Bayesian filter to the subject field only of the bayMsgs array, which holds the Message objects?

Regards, and thanks for your help.
|
From: Nick L. <nl...@es...> - 2004-07-15 02:50:54
|
That looks correct to me. You should need JUnit in your classpath to run it.

What version of Java are you using?

-----Original Message-----
From: Kashif [mailto:ks...@ai...]
Sent: Thursday, 15 July 2004 12:07 PM
To: cla...@li...
Subject: [Classifier4j-devel] Error Help: NoClassDefFoundError: org/apache/commons/logging/Lo
Importance: Low
|
From: Kashif <ks...@ai...> - 2004-07-15 02:37:01
|
Hi

My class compiles fine and without errors, but when I run it I get the following error. I have included the commons-logging jar and junit jar files in my classpath. Is there anything else I am supposed to do?

Here's my classpath:

    .;C:\Java\Classes;C:\Java;C:\Java\JarClasses\activation.jar;C:\Java\JarClasses\mail.jar;C:\Java\src;C:\Java\Classifier4J;C:\Java\commons-logging-1.0.3.jar;C:\Java\junit-3.8.1.jar

Any suggestions?

Here is the error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
        at net.sf.classifier4J.bayesian.WordProbability.calculateProbability(WordProbability.java:167)
        at net.sf.classifier4J.bayesian.WordProbability.setMatchingCount(WordProbability.java:138)
        at net.sf.classifier4J.bayesian.WordProbability.<init>(WordProbability.java:115)
        at net.sf.classifier4J.bayesian.SimpleWordsDataSource.addNonMatch(SimpleWordsDataSource.java:107)
        at net.sf.classifier4J.bayesian.BayesianClassifier.teachNonMatch(BayesianClassifier.java:269)
        at net.sf.classifier4J.bayesian.BayesianClassifier.teachNonMatch(BayesianClassifier.java:218)
        at net.sf.classifier4J.bayesian.BayesianClassifier.teachNonMatch(BayesianClassifier.java:190)
        at GetEmail.main(GetEmail.java:115)

Thanks for your help

Kashif
|