[Classifier4j-devel] Bayesian with multiple categories

Apologies in advance if this comes through in HTML; I'm stuck
on Lotus Notes here at work.

I have a large body of legislative text, around 400,000 individual
paragraphs, each of which has been hand-categorized into one of
five categories.

Since I have a few hundred thousand still to go, I thought the
Bayesian classifier could give me a leg up on this process.

So I wrote a little trainer that does something like the
following:

switch(existingcategory){
  case "category1":
        classifier.TeachMatch("category1", mytext);
        classifier.TeachNonMatch("category2", mytext);
        classifier.TeachNonMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
  case "category2":
        classifier.TeachNonMatch("category1", mytext);
        classifier.TeachMatch("category2", mytext);
        classifier.TeachNonMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
  case "category3":
        classifier.TeachNonMatch("category1", mytext);
        classifier.TeachNonMatch("category2", mytext);
        classifier.TeachMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
  case "category4":
        ...
}
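
The same one-vs-rest training can be written as a loop rather than a
case per category. This is only a rough sketch in Python: the
RecordingClassifier class and the teach_match / teach_non_match names
are hypothetical stand-ins that mirror the calls above, not the
library's actual API.

```python
CATEGORIES = ["category1", "category2", "category3", "category4", "category5"]


class RecordingClassifier:
    """Hypothetical stand-in that only records calls (not the real library)."""

    def __init__(self):
        self.matches = []
        self.non_matches = []

    def teach_match(self, category, text):
        self.matches.append((category, text))

    def teach_non_match(self, category, text):
        self.non_matches.append((category, text))


def train_one_vs_rest(classifier, existing_category, text, categories=CATEGORIES):
    """Teach `text` as a match for its own category and a non-match for the rest."""
    if existing_category not in categories:
        raise ValueError(f"unknown category: {existing_category!r}")
    for category in categories:
        if category == existing_category:
            classifier.teach_match(category, text)
        else:
            classifier.teach_non_match(category, text)
```

Each paragraph therefore produces one match and four non-matches, which
is exactly what the switch statement does.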

The problem is, *one* of the categories is *much* more common than
the others, so it gets more matches and fewer non-matches for almost
*any* word.

So, now when I send a new string through the trained classifier and
compare the scores, that category almost always wins out, and in a
big way (generally around 99% for it, 1% for the others).
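
To make the skew concrete, here is a small Python sketch with made-up
numbers. I don't know how the classifier computes its per-word score
internally, so raw_count_score below is an ASSUMPTION (matches divided
by matches plus non-matches); the category sizes and the word frequency
are hypothetical too. The word in the example appears in 1 of every 10
paragraphs in every category, so it should tell the classifier nothing.

```python
# Hypothetical paragraph counts: one dominant category and four small ones.
CATEGORY_SIZES = {
    "category1": 90_000,
    "category2": 2_500,
    "category3": 2_500,
    "category4": 2_500,
    "category5": 2_500,
}


def word_counts(category, sizes, one_in=10):
    """Match / non-match counts for a word found in 1 of every `one_in`
    paragraphs of every category (an uninformative word)."""
    matches = sizes[category] // one_in
    non_matches = sum(n // one_in for c, n in sizes.items() if c != category)
    return matches, non_matches


def raw_count_score(matches, non_matches):
    """ASSUMED scoring rule: share of the word's occurrences that were matches."""
    return matches / (matches + non_matches)


def rate_score(matches, non_matches, n_in, n_out):
    """Alternative: compare occurrence rates, so category size cancels out."""
    in_rate = matches / n_in
    out_rate = non_matches / n_out
    return in_rate / (in_rate + out_rate)


def compare(sizes=CATEGORY_SIZES):
    """Return {category: (raw_count_score, rate_score)} for the uninformative word."""
    total = sum(sizes.values())
    results = {}
    for category, n_in in sizes.items():
        m, nm = word_counts(category, sizes)
        results[category] = (
            raw_count_score(m, nm),
            rate_score(m, nm, n_in, total - n_in),
        )
    return results


if __name__ == "__main__":
    for category, (raw, rate) in compare().items():
        print(f"{category}: raw={raw:.3f} rate-normalised={rate:.3f}")
```

With these numbers the raw-count rule gives the word 0.9 for the big
category and 0.025 for each small one, while comparing rates gives 0.5
everywhere. If the real scoring behaves like the raw-count rule, that
would explain what I am seeing.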

Am I training this classifier wrong, or is this a limitation of 
using Bayesian filters with more than two categories or with a 
corpus that is unevenly distributed among the categories?

I thought maybe I should try the VectorClassifier instead, but I
have *tens of thousands* of strings in each category that I need to
train it on, and the docs state that you can't incrementally train
it (which, I presume, means I would need to concatenate the entire
training corpus into one string per category).
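
If my reading of the docs is right, preparing the input would look
roughly like the following Python sketch. The function name and the
(category, text) pair format are my own assumptions, and this only
builds the per-category strings; it does not call the VectorClassifier.

```python
from collections import defaultdict


def concatenate_by_category(labelled_paragraphs, separator="\n"):
    """Join all paragraphs of each category into one string.

    `labelled_paragraphs` is an iterable of (category, text) pairs.
    Paragraph order within a category is preserved.
    """
    grouped = defaultdict(list)
    for category, text in labelled_paragraphs:
        grouped[category].append(text)
    return {category: separator.join(texts) for category, texts in grouped.items()}
```

With tens of thousands of paragraphs per category, each of these
strings would be very large, which is my main worry with this approach.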

Any help would be greatly appreciated...

--
Richard S. Tallent
ERM (Beaumont, TX)
409-833-7755
