[Classifier4j-devel] Bayesian with multiple categories
From: <Ric...@er...> - 2006-03-01 15:04:26
Apologies in advance if this comes through in HTML, I'm stuck
on Lotus Notes here at work.
I have a bunch of legislative text, around 400,000 individual
paragraphs, each of which has been hand-categorized into one of
five categories.
Since I have a few hundred thousand still to go, I thought the
Bayesian classifier could give me a leg up on this process.
So I wrote a little trainer that does something like the
following:
    // Teach the text as a match for its own category and as a
    // non-match for each of the other four.
    switch (existingcategory) {
        case "category1":
            classifier.TeachMatch("category1", mytext);
            classifier.TeachNonMatch("category2", mytext);
            classifier.TeachNonMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category2":
            classifier.TeachNonMatch("category1", mytext);
            classifier.TeachMatch("category2", mytext);
            classifier.TeachNonMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category3":
            classifier.TeachNonMatch("category1", mytext);
            classifier.TeachNonMatch("category2", mytext);
            classifier.TeachMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category4":
            ...
    }
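
The same match/non-match pattern can probably be written as a single
loop over the category names. This is only a rough sketch in Java: the
CategoryTrainer interface is a hypothetical stand-in that mirrors the
two calls above, not the library's actual API, and recordCalls exists
only to show which calls one training example produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface mirroring the two calls used in the trainer above;
// it is an assumption for illustration, not the library's real API.
interface CategoryTrainer {
    void teachMatch(String category, String text);
    void teachNonMatch(String category, String text);
}

class OneVsRestTraining {
    static final String[] CATEGORIES = {
        "category1", "category2", "category3", "category4", "category5"
    };

    // Teach the text as a match for its own category and as a non-match
    // for every other category.
    static void train(CategoryTrainer trainer, String existingCategory, String text) {
        for (String category : CATEGORIES) {
            if (category.equals(existingCategory)) {
                trainer.teachMatch(category, text);
            } else {
                trainer.teachNonMatch(category, text);
            }
        }
    }

    // Stand-in trainer that only records the calls it receives, in order.
    static List<String> recordCalls(String existingCategory, String text) {
        final List<String> calls = new ArrayList<>();
        train(new CategoryTrainer() {
            public void teachMatch(String c, String t) { calls.add("match:" + c); }
            public void teachNonMatch(String c, String t) { calls.add("nonmatch:" + c); }
        }, existingCategory, text);
        return calls;
    }
}
```

Each paragraph still produces one match and four non-matches, exactly
as in the switch version.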
The problem is, *one* of the categories is *much* more common than
the others, so it gets more matches and fewer non-matches for almost
*any* word.
So, now when I send a new string through the trained classifier and
compare the scores, that category almost always wins out, and in a
big way (generally around 99% for it, 1% for the others).
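
To make the skew concrete, here is a small sketch in Java. It assumes
a per-word score of the simple form matches / (matches + nonMatches);
that formula is my assumption for illustration and may not be what the
library actually computes. The counts are made up: a word that appears
in 10% of the paragraphs of every category, with 300,000 paragraphs in
the common category and 25,000 in each of the other four.

```java
class SkewIllustration {
    // Assumed word score (an illustration, not the library's formula).
    static double wordScore(long matches, long nonMatches) {
        return (double) matches / (matches + nonMatches);
    }

    // wordCounts[i] is how many paragraphs of category i contain the word.
    // With the training scheme above, the word's matches for the target
    // category come from that category's own paragraphs, and its
    // non-matches come from the paragraphs of all other categories.
    static double scoreFor(int target, long[] wordCounts) {
        long matches = wordCounts[target];
        long nonMatches = 0;
        for (int i = 0; i < wordCounts.length; i++) {
            if (i != target) {
                nonMatches += wordCounts[i];
            }
        }
        return wordScore(matches, nonMatches);
    }

    public static void main(String[] args) {
        // Hypothetical counts: 10% of 300,000 and 10% of 25,000 (four times).
        long[] wordCounts = {30000, 2500, 2500, 2500, 2500};
        for (int i = 0; i < wordCounts.length; i++) {
            System.out.println("category" + (i + 1) + ": " + scoreFor(i, wordCounts));
        }
    }
}
```

Under those assumptions the word scores 0.75 for the common category
and 0.0625 for each of the others, even though it is equally frequent
in all five, which looks like the effect described above.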
Am I training this classifier wrong, or is this a limitation of
using Bayesian filters with more than two categories or with a
corpus that is unevenly distributed among the categories?
I thought maybe I should try the VectorClassifier instead, but I
have *tens of thousands* of strings in each category that I need to
train it on, and the docs state that you can't incrementally train
it (which, I presume, means I would need to concatenate the entire
training corpus into one string per category).
Any help would be greatly appreciated...
--
Richard S. Tallent
ERM (Beaumont, TX)
409-833-7755