Re: [Classifier4j-devel] Bayesian with multiple categories
From: Joe S. <sca...@gm...> - 2006-03-10 13:48:25
Richard -
I was wondering what you ended up doing on this -- I have a similar
situation.
joe
On 3/2/06, Nick Lothian <ni...@ma...> wrote:
>
> See inline
>
>
> Ric...@er... wrote:
>
>
> Apologies in advance if this comes through in HTML, I'm stuck
> on Lotus Notes here at work.
>
> I have a bunch of legislative text, around 400,000 individual
> paragraphs, that have each been hand-categorized into one of
> five categories.
>
> Since I have a few hundred thousand still to go, I thought the
> Bayesian classifier could give me a leg up on this process.
>
> So I wrote a little trainer that does something like the
> following:
>
>     switch (existingcategory) {
>         case "category1":
>             classifier.TeachMatch("category1", mytext);
>             classifier.TeachNonMatch("category2", mytext);
>             classifier.TeachNonMatch("category3", mytext);
>             classifier.TeachNonMatch("category4", mytext);
>             classifier.TeachNonMatch("category5", mytext);
>             break;
>         case "category2":
>             classifier.TeachNonMatch("category1", mytext);
>             classifier.TeachMatch("category2", mytext);
>             classifier.TeachNonMatch("category3", mytext);
>             classifier.TeachNonMatch("category4", mytext);
>             classifier.TeachNonMatch("category5", mytext);
>             break;
>         case "category3":
>             classifier.TeachNonMatch("category1", mytext);
>             classifier.TeachNonMatch("category2", mytext);
>             classifier.TeachMatch("category3", mytext);
>             classifier.TeachNonMatch("category4", mytext);
>             classifier.TeachNonMatch("category5", mytext);
>             break;
>         case "category4":
>             ...
>     }
>
> The problem is, *one* of the categories is *much* more common than
> the others, so it gets more matches and fewer non-matches for almost
> *any* word.
>
> So, now when I send a new string through the trained classifier and
> compare the scores, that category almost always wins out, and in a
> big way (generally around 99% for it, 1% for the others).
>
> It isn't really possible to compare scores across categories to say that
> one category is the "best" category.
>
> All the Bayesian classifier will do is say if something matches the
> current category. As you've seen it does that well - you'll typically end
> up with a very high score (99%) or a very low score (1%) and not much in
> between.
>
> Perhaps you could classify the big category last, and only check it if
> none of the other ones find a match.
>
>
> Am I training this classifier wrong, or is this a limitation of
> using Bayesian filters with more than two categories or with a
> corpus that is unevenly distributed among the categories?
>
> I thought maybe I should try the VectorClassifier instead, but I
> have *tens of thousands* of strings in each category that I need to
> train it on, and the docs state that you can't incrementally train
> it (which, I presume, means I would need to concatenate the entire
> training corpus into one string per category).
>
>
> That just means that the training interfaces aren't properly implemented
> (yet). I've attached an updatable HashMapTermVectorStorage that fixes this
> (I haven't tested it though) - it might give you something to start from.
>
> Nick
>
>
> package net.sf.classifier4J.vector;
>
> import java.io.Serializable;
> import java.util.HashMap;
> import java.util.Hashtable;
> import java.util.Map;
> import java.util.Set;
>
>
> public class MyHashMapTermVectorStorage implements TermVectorStorage,
>         Serializable {
>     private static final long serialVersionUID = 1L;
>     private Map storage;
>
>
>     public MyHashMapTermVectorStorage(int amount)
>     {
>         storage = new HashMap(amount);
>     }
>
>
>
>     public MyHashMapTermVectorStorage()
>     {
>         storage = new HashMap();
>     }
>
>     /**
>      * @see net.sf.classifier4J.vector.TermVectorStorage#addTermVector(java.lang.String, net.sf.classifier4J.vector.TermVector)
>      */
>     public void addTermVector(String category, TermVector termVector) {
>         //storage.put(category, termVector);
>         //modified: Abelssoft, Sven Abels, 16.03.2005:
>
>         TermVector old = (TermVector) storage.get(category);
>         if (old == null) storage.put(category, termVector);
>         else
>         {
>             old.add(termVector);
>             storage.put(category, old);
>         }
>     }
>
>     /**
>      * @see net.sf.classifier4J.vector.TermVectorStorage#getTermVector(java.lang.String)
>      */
>     public TermVector getTermVector(String category) {
>         return (TermVector) storage.get(category);
>     }
>
>     public int size()
>     {
>         if (storage == null) return 0;
>         return storage.size();
>     }
>
> }
>
>
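Nick's suggestion above (check the small categories first and fall back to the big one only if none of them match) might be sketched roughly as follows. This is only an illustration: the Scorer interface, the firstMatch method, and the cutoff parameter are hypothetical names, not part of Classifier4J, and the scorer would have to be backed by whatever per-category match probability the trained classifier returns.

```java
final class CascadeSketch {

    /** Hypothetical stand-in for "how strongly does this text match this category" (0.0 to 1.0). */
    interface Scorer {
        double score(String category, String text);
    }

    /**
     * Checks the categories in the given order and returns the first one whose
     * score reaches the cutoff. Scores are never compared across categories;
     * each category only answers "match or not". Put the dominant category
     * last so it is consulted only when nothing else matched.
     * Returns null if no category matches at all.
     */
    static String firstMatch(Scorer scorer, String[] orderedCategories,
                             String text, double cutoff) {
        for (int i = 0; i < orderedCategories.length; i++) {
            String category = orderedCategories[i];
            if (scorer.score(category, text) >= cutoff) {
                return category;
            }
        }
        return null;
    }
}
```

With the five categories from the original post, orderedCategories would list the four smaller categories first and the dominant one last. The order among the smaller categories still matters when more than one of them matches, so this is a workaround for the imbalance rather than a fix.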
>
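The attached storage class calls old.add(termVector), but the message does not show what that method does. Assuming it sums per-term counts, the incremental-training idea can be illustrated with plain maps. The class below is a hypothetical stand-in, not the Classifier4J TermVector or TermVectorStorage API.

```java
import java.util.HashMap;
import java.util.Map;

final class TermCountMergeSketch {

    // category -> (term -> accumulated count)
    private final Map<String, Map<String, Integer>> storage =
            new HashMap<String, Map<String, Integer>>();

    /** Adds the given term counts to whatever is already stored for the category. */
    void addTermCounts(String category, Map<String, Integer> counts) {
        Map<String, Integer> existing = storage.get(category);
        if (existing == null) {
            existing = new HashMap<String, Integer>();
            storage.put(category, existing);
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer old = existing.get(entry.getKey());
            int sum = (old == null ? 0 : old.intValue()) + entry.getValue().intValue();
            existing.put(entry.getKey(), Integer.valueOf(sum));
        }
    }

    /** Returns the accumulated counts for the category, or null if none were added. */
    Map<String, Integer> getTermCounts(String category) {
        return storage.get(category);
    }
}
```

Under that assumption, each training string can be fed in on its own instead of concatenating the whole corpus into one string per category.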