Thread: [Classifier4j-devel] Bayesian with multiple categories
|
From: <Ric...@er...> - 2006-03-01 15:04:26
|
Apologies in advance if this comes through in HTML, I'm stuck
on Lotus Notes here at work.
I have a bunch of legislative text, around 400,000 individual
paragraphs, that have each been hand-categorized into one of
five categories.
Since I have a few hundred thousand still to go, I thought the
Bayesian classifier could give me a leg up on this process.
So I wrote a little trainer that does something like the
following:
switch (existingcategory) {
    case "category1":
        classifier.TeachMatch("category1", mytext);
        classifier.TeachNonMatch("category2", mytext);
        classifier.TeachNonMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
    case "category2":
        classifier.TeachNonMatch("category1", mytext);
        classifier.TeachMatch("category2", mytext);
        classifier.TeachNonMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
    case "category3":
        classifier.TeachNonMatch("category1", mytext);
        classifier.TeachNonMatch("category2", mytext);
        classifier.TeachMatch("category3", mytext);
        classifier.TeachNonMatch("category4", mytext);
        classifier.TeachNonMatch("category5", mytext);
        break;
    case "category4":
        ...
}
The problem is, *one* of the categories is *much* more common than
the others, so it gets more matches and fewer non-matches for almost
*any* word.
So, now when I send a new string through the trained classifier and
compare the scores, that category almost always wins out, and in a
big way (generally around 99% for it, 1% for the others).
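To see why the skew described above arises, here is a minimal Java sketch. It assumes a per-word probability of the form matches / (matches + nonMatches), clamped to [0.01, 0.99]; that formula, the class name WordSkewDemo, and the corpus sizes are illustrative assumptions, not taken from the NClassifier or Classifier4J source.

```java
// Illustrative only: an assumed per-word probability model, not library code.
class WordSkewDemo {
    // Probability that a word indicates a category, given how often the word
    // was taught as a match and as a non-match for that category.
    static double wordProbability(long matches, long nonMatches) {
        if (matches + nonMatches == 0) {
            return 0.5; // no evidence either way
        }
        double p = (double) matches / (matches + nonMatches);
        return Math.max(0.01, Math.min(0.99, p));
    }

    public static void main(String[] args) {
        // Hypothetical corpus: one category with 8000 paragraphs, four with 500 each.
        long big = 8000, small = 500, smallCategories = 4;
        // Suppose a word occurs once in 10% of the paragraphs of every category,
        // so it carries no real information about the category.
        long bigHits = big / 10;     // 800
        long smallHits = small / 10; // 50

        // Big category: taught as a match 800 times, as a non-match 4 * 50 times.
        double pBig = wordProbability(bigHits, smallCategories * smallHits);
        // A small category: match 50 times, non-match 800 + 3 * 50 times.
        double pSmall = wordProbability(
            smallHits, bigHits + (smallCategories - 1) * smallHits);

        System.out.println("big=" + pBig + " small=" + pSmall);
    }
}
```

Under these assumptions the uninformative word scores 0.8 for the large category and 0.05 for each small one, purely because of the category sizes.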
Am I training this classifier wrong, or is this a limitation of
using Bayesian filters with more than two categories or with a
corpus that is unevenly distributed among the categories?
I thought maybe I should try the VectorClassifier instead, but I
have *tens of thousands* of strings in each category that I need to
train it on, and the docs state that you can't incrementally train
it (which, I presume, means I would need to concatenate the entire
training corpus into one string per category).
Any help would be greatly appreciated...
--
Richard S. Tallent
ERM (Beaumont, TX)
409-833-7755
----------------------------------------------
This electronic mail message may contain information which is (a) LEGALLY
PRIVILEGED, PROPRIETARY IN NATURE, OR OTHERWISE PROTECTED BY LAW FROM
DISCLOSURE, and (b) intended only for the use of the Addressee (s) names
herein. If you are not the Addressee (s), or the person responsible for
delivering this to the Addressee (s), you are hereby notified that
reading, copying, or distributing this message is prohibited. If you have
received this electronic mail message in error, please contact us
immediately at (281) 600-1000 and take the steps necessary to delete the
message completely from your computer system. Thank you, Environmental
Resources Management. Please visit ERM's web site: http://www.erm.com
|
From: Nick L. <ni...@ma...> - 2006-03-02 11:56:50
|
package net.sf.classifier4J.vector;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class MyHashMapTermVectorStorage implements TermVectorStorage, Serializable {
    private static final long serialVersionUID = 1L;

    // Maps a category name to its accumulated TermVector.
    private Map storage;

    public MyHashMapTermVectorStorage(int amount) {
        storage = new HashMap(amount);
    }

    public MyHashMapTermVectorStorage() {
        storage = new HashMap();
    }

    /**
     * @see net.sf.classifier4J.vector.TermVectorStorage#addTermVector(java.lang.String, net.sf.classifier4J.vector.TermVector)
     */
    public void addTermVector(String category, TermVector termVector) {
        //storage.put(category, termVector);
        //modified: Abelssoft, Sven Abels, 16.03.2005:
        // Merge into any existing vector instead of overwriting it, so a
        // category can be trained incrementally. This relies on TermVector
        // providing an add(TermVector) method.
        TermVector old = (TermVector) storage.get(category);
        if (old == null) {
            storage.put(category, termVector);
        } else {
            old.add(termVector);
            storage.put(category, old);
        }
    }

    /**
     * @see net.sf.classifier4J.vector.TermVectorStorage#getTermVector(java.lang.String)
     */
    public TermVector getTermVector(String category) {
        return (TermVector) storage.get(category);
    }

    public int size() {
        if (storage == null) return 0;
        return storage.size();
    }
}
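The idea behind the storage class above is that each new training text is merged into the category's existing term vector rather than replacing it. A self-contained sketch of that merge, using plain maps of term counts, might look like the following. TermCountStore, addTermCounts, and countTerms are hypothetical names for illustration; this does not use or reproduce the Classifier4J TermVector class, whose merge behaviour is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an incrementally trainable term-vector store.
class TermCountStore {
    private final Map<String, Map<String, Integer>> countsByCategory = new HashMap<>();

    // Adds the term counts of one training text to the category's running totals.
    void addTermCounts(String category, Map<String, Integer> termCounts) {
        Map<String, Integer> totals =
            countsByCategory.computeIfAbsent(category, k -> new HashMap<>());
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }

    // Returns the accumulated counts, or null if the category was never trained.
    Map<String, Integer> getTermCounts(String category) {
        return countsByCategory.get(category);
    }

    // Deliberately naive tokenizer: lower-cases and splits on whitespace.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TermCountStore store = new TermCountStore();
        // Two separate training calls for the same category accumulate.
        store.addTermCounts("category1", countTerms("tax on fuel"));
        store.addTermCounts("category1", countTerms("fuel tax credit"));
        System.out.println(store.getTermCounts("category1"));
    }
}
```

With this shape there is no need to concatenate a whole category's corpus into one string; each paragraph can be fed in as it is read.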
|
|
From: Joe S. <sca...@gm...> - 2006-03-10 13:48:25
|
Richard -
I was wondering what you ended up doing on this -- I have a similar
situation
joe
On 3/2/06, Nick Lothian <ni...@ma...> wrote:
>
> See inline
>
>
> Ric...@er... wrote:
>
>
> Apologies in advance if this comes through in HTML, I'm stuck
> on Lotus Notes here at work.
>
> I have a bunch of legislative text, around 400,000 individual
> paragraphs, that have each been hand-categorized into one of
> five categories.
>
> Since I have a few hundred thousand still to go, I thought the
> Bayesian classifier could give me a leg up on this process.
>
> So I wrote a little trainer that does something like the
> following:
>
> switch(existingcategory){
> case "category1":
> classifier.TeachMatch("category1", mytext);
> classifier.TeachNonMatch("category2", mytext);
> classifier.TeachNonMatch("category3", mytext);
> classifier.TeachNonMatch("category4", mytext);
> classifier.TeachNonMatch("category5", mytext);
> break;
> case "category2":
> classifier.TeachNonMatch("category1", mytext);
> classifier.TeachMatch("category2", mytext);
> classifier.TeachNonMatch("category3", mytext);
> classifier.TeachNonMatch("category4", mytext);
> classifier.TeachNonMatch("category5", mytext);
> break;
> case "category3":
> classifier.TeachNonMatch("category1", mytext);
> classifier.TeachNonMatch("category2", mytext);
> classifier.TeachMatch("category3", mytext);
> classifier.TeachNonMatch("category4", mytext);
> classifier.TeachNonMatch("category5", mytext);
> break;
> case "category4":
> ...
> }
>
> The problem is, *one* of the categories is *much* more common than
> the others, so it gets more matches and fewer non-matches for almost
> *any* word.
>
> So, now when I send a new string through the trained classifier and
> compare the scores, that category almost always wins out, and in a
> big way (generally around 99% for it, 1% for the others).
>
> It isn't really possible to compare scores across categories to say that
> one category is the "best" category.
>
> All the Bayesian classifier will do is say if something matches the
> current category. As you've seen it does that well - you'll typically end
> up with a very high score (99%) or a very low score (1%) and not much in
> between.
>
> Perhaps you could classify the big category last, and only check it if
> none of the other ones find a match.
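The ordering suggested above (test the rarer categories first and fall back to the common one) can be sketched as follows. This is a minimal illustration under stated assumptions: FallbackCategorizer and categorize are hypothetical names, and the BiPredicate stands in for whatever per-category match check the trained classifier exposes; it is not the Classifier4J or NClassifier API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper: tries rare categories in order, then falls back.
class FallbackCategorizer {
    // Returns the first rare category whose matcher accepts the text, or the
    // fallback (common) category when none of them does.
    static String categorize(String text,
                             List<String> rareCategories,
                             String fallbackCategory,
                             BiPredicate<String, String> isMatch) {
        for (String category : rareCategories) {
            if (isMatch.test(category, text)) {
                return category;
            }
        }
        return fallbackCategory;
    }

    public static void main(String[] args) {
        // Stand-in matcher for demonstration only: a real one would ask the
        // trained classifier whether the text matches the given category.
        BiPredicate<String, String> keywordMatcher =
            (category, text) -> text.contains(category);
        List<String> rare =
            Arrays.asList("category2", "category3", "category4", "category5");

        System.out.println(
            categorize("text about category3", rare, "category1", keywordMatcher));
        System.out.println(
            categorize("unrelated text", rare, "category1", keywordMatcher));
    }
}
```

Note that this takes the first rare category that matches, so when several could match, the order of the list decides the result.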
>
>
> Am I training this classifier wrong, or is this a limitation of
> using Bayesian filters with more than two categories or with a
> corpus that is unevenly distributed among the categories?
>
> I thought maybe I should try the VectorClassifier instead, but I
> have *tens of thousands* of strings in each category that I need to
> train it on, and the docs state that you can't incrementally train
> it (which, I presume, means I would need to concatenate the entire
> training corpus into one string per category).
>
>
> That means just that the training interfaces aren't properly implemented
> (yet). I've attached an updatable HashMapTermVectorStorage that fixes this
> (I haven't tested it though) - it might give you something to start from.
>
> Nick
|
|
From: <Ric...@er...> - 2006-03-10 14:46:18
|
cla...@li... wrote on 03/10/2006 07:48:19 AM:
>> It isn't really possible to compare scores across categories to say
>> that one category is the "best" category.
>> All the Bayesian classifier will do is say if something matches the
>> current category.
> I was wondering what you ended up doing on this -- I have
> a similar situation
I'm actually using a port of Classifier4J for .NET called
NClassifier, which is based on Classifier4J 0.51, so there
is no working VectorClassifier implementation. I've given
up for now and will re-evaluate when the NClassifier library
catches up--no billable time available to port the updates
myself.
--Richard
|
From: Nick L. <ni...@ma...> - 2006-03-12 11:38:22
|
You could try running Classifier4J in .NET under IKVM
(http://www.ikvm.net/). I'd imagine that it would work pretty well.
Let me know if it works!
Nick
Ric...@er... wrote:
> I'm actually using a port of Classifier4J for .NET called
> NClassifier, which is based on Classifier4J 0.51, so there
> is no working VectorClassifier implementation. I've given
> up for now and will re-evaluate when the NClassifier library
> catches up--no billable time available to port the updates
> myself.
>
> --Richard