RE: [Classifier4j-devel] Bayesian Classification

hehe we're telling each other about the same documents.  Only difference is, 
you evidently understand what they mean.

I should qualify my earlier statement about my classification results.  I have 
about 40 categories.  The same document will score 0.99 in several different 
categories.  How am I to determine which category is best?  Is this expected, or 
is there some deficiency in my data?

My application is classifying web sites by business category: Insurance, 
Printing, Accounting, Attorney, etc.  I have already manually classified a 
fairly large number of sites for my corpus, treating each entire web site as 
one document.  I then try to classify a new web site, again as a single 
document, in the same fashion.

Is it correct to say that the existence of a particular word only counts one 
time per document?  This seems to be a key point in the POPFile documentation 
as I understand it.  Word frequency within a single document counts for 
nothing.  Is this correct?
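
To make the question concrete, here is a minimal sketch of the two counting models I mean. It does not claim to show what POPFile or Classifier4J actually does; the class name, method names, and the crude tokenizer are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Classifier4J or POPFile.
class WordCounting {
    // Deliberately crude tokenizer: lower-case, split on anything
    // that is not a letter or digit.
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Presence model: each distinct word counts once per document.
    public static Set<String> presence(String text) {
        return new TreeSet<>(tokens(text));
    }

    // Frequency model: every occurrence of a word is counted.
    public static Map<String, Integer> frequency(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens(text)) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String site = "Insurance quotes: auto insurance, home insurance";
        System.out.println(presence(site));   // "insurance" appears once
        System.out.println(frequency(site));  // "insurance" is counted 3 times
    }
}
```

Under the presence model a site that says "insurance" three times looks the same as one that says it once; under the frequency model it does not.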

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: Nick Lothian <nl...@es...>
To: "'cla...@li...'" <classifier4j-de...@li...>
Date: Thu, 13 Nov 2003 14:34:15 +1030
Subject: RE: [Classifier4j-devel] Bayesian Classification

> > 
> > BayesianClassifier classifier = new BayesianClassifier(wds);
> > double probability = classifier.classify("category", "text to be classified");
> > 
> 
> Yep, looks good.
> 
> > This is functioning fine but most of my probabilities are 
> > either 0.01 or 0.99.
> > 
> 
> Yes, that is pretty much the way it works. See
> <http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
> for why this is.
> 
> POPFile uses logarithms to get around this (which is actually quite a good
> idea). Classifier4J uses cut-offs to avoid underflow and overflow.
> 
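
A minimal sketch of why the scores pile up at 0.01 and 0.99, and how a log-based score can still separate categories. It assumes the usual product form for combining per-word probabilities and 0.01/0.99 as the cut-offs; I have not checked either against the Classifier4J source, and the class and method names are hypothetical.

```java
// Hypothetical illustration, not Classifier4J's actual code.
class CombineSketch {
    // Assumed cut-offs, chosen to match the 0.01 / 0.99 values seen above.
    static final double LOWER = 0.01;
    static final double UPPER = 0.99;

    public static double clamp(double p) {
        return Math.max(LOWER, Math.min(UPPER, p));
    }

    // Product form: prod(p) / (prod(p) + prod(1 - p)), clamped to the cut-offs.
    public static double combine(double[] wordProbs) {
        double yes = 1.0;
        double no = 1.0;
        for (double p : wordProbs) {
            double c = clamp(p);
            yes *= c;
            no *= 1.0 - c;
        }
        return clamp(yes / (yes + no));
    }

    // The same evidence as a sum of log-odds. The sum is not clamped,
    // so two categories that both saturate at 0.99 can still be ranked.
    public static double logOdds(double[] wordProbs) {
        double sum = 0.0;
        for (double p : wordProbs) {
            double c = clamp(p);
            sum += Math.log(c / (1.0 - c));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up per-word probabilities for one document in two categories.
        double[] insurance = {0.9, 0.9, 0.9, 0.8};
        double[] accounting = {0.9, 0.9, 0.8, 0.8};
        System.out.println(combine(insurance));   // 0.99
        System.out.println(combine(accounting));  // 0.99
        System.out.println(logOdds(insurance) > logOdds(accounting)); // true
    }
}
```

With only four moderately strong words the product form already hits the upper cut-off in both categories, while the log-odds sums still differ.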
> > I saw somewhere in the source that the algorithm is choosing X most
> > significant words.  Is there an easy way for me to determine what these
> > words are on a category by category basis?
> > 
> 
> No, you can't do this on a category by category basis. I haven't found it
> makes a big difference anyway, so I'd test changing this setting on a single
> category before you spend a lot of time changing this.
> 
> > I think the fact that I still have html in my training data is causing me
> > difficulty.  Does this sound right?
> > 
> 
> It depends on the task. In your typical Spam classifier HTML is an important
> indicator. I use Classifier4J for classifying RSS feeds and I don't strip
> HTML (That's not to say it wouldn't work better if I did - I just haven't
> tried it).
> 
> In a lot of cases the HTML supplies a surprising amount of useful data which
> Classifier4J can use.
> 
> 
> > As for the issues with the Bayesian tokenizer:
> > ---- correspondence between Pete and Nick on 2003-08-09
> > Pete> Look into the current Tokenizer - For example, "1.4" currently gets
> > split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is
> > split into "peter" and "s". Shouldn't this be "peter's"? It's probably
> > worth coming up with a set of test cases.
> >  
> > Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with
> > URLs: at the moment http://www.google.com/something gets split up, but I
> > think it probably shouldn't (?)
> > ----
> > 
> > Is this still outstanding?
> > 
> 
> Yes, these points are outstanding.
> 
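
As a starting point for those test cases, here is a rough regex-based sketch that keeps "1.4", "peter's" and URLs whole. It is not the Classifier4J tokenizer and does not use its interfaces; the class name and the pattern are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone tokenizer, not Classifier4J's implementation.
class TokenizerSketch {
    // First alternative: an http/https URL up to the next whitespace,
    // angle bracket or double quote.
    // Second alternative: runs of letters/digits, optionally joined by a
    // single apostrophe or dot (so "1.4" and "peter's" stay whole).
    private static final Pattern TOKEN = Pattern.compile(
            "https?://[^\\s<>\"]+"
            + "|[\\p{L}\\p{N}]+(?:['.][\\p{L}\\p{N}]+)*");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "Java 1.4 is peter's pick, see http://www.google.com/something"));
    }
}
```

Known gaps: a URL followed directly by punctuation keeps that punctuation, and two sentences with no space after the full stop ("end.Next") are joined into one token.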
> > How sophisticated is this Bayesian classifier when compared with POPFile
> > or SpamAssassin?  There is some interesting reading about the POPFile
> > engine at:
> > http://sourceforge.net/docman/?group_id=63137
> > 
> 
> SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and
> filters mail like that (AFAIK?)
> 
> Classifier4J is an (almost) pure, naive Bayesian classifier (POPFile is also
> "naive" in the technical meaning of the word - it treats each word as being
> independent). It implements Bayes' theorem with very little variation (you
> can specify to only use the X most significant words, and it uses cut-offs
> to avoid arithmetic underflow).
> 
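
A sketch of what "use only the X most significant words" could look like, assuming significance means distance of a word's probability from the neutral value 0.5. That definition, the class name, and the sample probabilities are assumptions for illustration, not taken from the Classifier4J source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Classifier4J's actual code.
class SignificantWords {
    // Return the x words whose probability is farthest from 0.5.
    public static List<String> mostSignificant(Map<String, Double> wordProbs, int x) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(wordProbs.entrySet());
        // Sort by descending distance from the neutral probability 0.5.
        entries.sort((a, b) -> Double.compare(
                Math.abs(b.getValue() - 0.5),
                Math.abs(a.getValue() - 0.5)));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(x, entries.size()); i++) {
            out.add(entries.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up per-word probabilities for one category.
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("premium", 0.95);
        probs.put("the", 0.5);
        probs.put("invoice", 0.1);
        probs.put("policy", 0.8);
        System.out.println(mostSignificant(probs, 2)); // [premium, invoice]
    }
}
```

Note that a word with a very low probability ("invoice" here) ranks as highly as one with a very high probability, since both are strong evidence one way or the other.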
> I'm more interested in investigating a Vector-Space classifier than
> investing a huge amount of time modifying the Bayesian algorithm, especially
> since the modifications most people do are aimed at detecting Spam, which
> isn't my core goal with C4J. OTOH, if someone can suggest a change that
> improves performance or gives some other tangible gain then I'm interested.
> 
> 
> > POPFile has been designed to classify emails into "buckets" or
> > categories.  Evidently, there are some mathematical shortcuts if you're
> > trying to classify a message against several different categories.
> > 
> 
> Can you point me at them? I didn't see them in that doco in the POPFile
> project.
> 
> > One critical point made in POPFile is that words NOT in a document may be
> > as important as the words that ARE in a document.  Does C4J take this
> > into account?
> > 
> 
> I'm not quite sure what you (or they?) mean here. Can you point me at what
> they say?
> 
> Classifier4J does (kind of) take words that are not in a document into
> account. If a particular word (say "Java") isn't in a document then the
> document won't get the score-boost of having that word in there.
> 
> Nick
> 
> 
> _______________________________________________
> Classifier4j-devel mailing list
> Cla...@li...
> https://lists.sourceforge.net/lists/listinfo/classifier4j-devel