classifier4j-devel Mailing List for Classifier4J (Page 4)
Status: Beta
Brought to you by:
nicklothian
From: Nick L. <ni...@ma...> - 2006-05-07 12:29:22
|
Hi Nadja,

Have you tried pooling your database connections?

Nick

Nadja Senoucci wrote:
> Hello all,
>
> I am trying out Classifier4J as a possible tool for categorizing news messages. I have several thousand test files of varying length at the moment and 12 different categories. With that amount of data I have to use JDBCWordsDataSource (I naturally get "out of memory" errors with SimpleWordDataSource) or something similar. Also, I chose to use JDBCWordsDataSource over JDBMWordsDataSource mostly because I couldn't figure out how to properly use JDBMWordsDataSource (I can't find the source code for it and there doesn't seem to be much documentation that I can find for it either).
>
> Anyway, long story short: I keep getting the "net.sf.classifier4J.bayesian.WordsDataSourceException: Problem updating WordProbability" while still training some texts for my first category, and it seems that the underlying problem here is another exception: java.net.SocketException: "java.net.BindException: Address already in use: connect". The MySQL documentation tells me that this happens when an application is trying to open too many connections within a short time span.
>
> Now what I am basically doing code-wise is this (the code has been simplified so that it only includes necessary information):
>
>     Iterator iter = list.iterator(); // list is an ArrayList of filenames to train with for this category
>     while (iter.hasNext()) {
>         nextFile = (String) iter.next();
>         text = TextUtilities.getText(nextFile); // returns the contents of the file as plain text
>         tokenizedText = this.tokenizer.tokenize(text);
>         for (int i = 0; i < tokenizedText.length; i++) {
>             jdbcDataSource.addMatch(pool, tokenizedText[i]);
>         }
>     }
>
> I hope this piece of code will still be readable once I send the email. :)
>
> Some things seem to get entered into the database table before the exception occurs.
>
> I also tried using the classifier so I wouldn't have to add every single token but could train an entire message at once, but I still got the same exception and it seemed like no data at all made it to the database.
>
> Can anyone help me with this? I just can't figure out how to solve this problem. Wouldn't surprise me if it was some really stupid mistake on my part. :)
>
> Regards,
> Nadja
>
> _______________________________________________
> Classifier4j-devel mailing list
> Cla...@li...
> https://lists.sourceforge.net/lists/listinfo/classifier4j-devel |
From: Nadja S. <sen...@21...> - 2006-05-02 19:39:18
|
Hello all,

I am trying out Classifier4J as a possible tool for categorizing news messages. I have several thousand test files of varying length at the moment and 12 different categories. With that amount of data I have to use JDBCWordsDataSource (I naturally get "out of memory" errors with SimpleWordDataSource) or something similar. Also, I chose to use JDBCWordsDataSource over JDBMWordsDataSource mostly because I couldn't figure out how to properly use JDBMWordsDataSource (I can't find the source code for it and there doesn't seem to be much documentation that I can find for it either).

Anyway, long story short: I keep getting the "net.sf.classifier4J.bayesian.WordsDataSourceException: Problem updating WordProbability" while still training some texts for my first category, and it seems that the underlying problem here is another exception: java.net.SocketException: "java.net.BindException: Address already in use: connect". The MySQL documentation tells me that this happens when an application is trying to open too many connections within a short time span.

Now what I am basically doing code-wise is this (the code has been simplified so that it only includes necessary information):

    Iterator iter = list.iterator(); // list is an ArrayList of filenames to train with for this category
    while (iter.hasNext()) {
        nextFile = (String) iter.next();
        text = TextUtilities.getText(nextFile); // returns the contents of the file as plain text
        tokenizedText = this.tokenizer.tokenize(text);
        for (int i = 0; i < tokenizedText.length; i++) {
            jdbcDataSource.addMatch(pool, tokenizedText[i]);
        }
    }

I hope this piece of code will still be readable once I send the email. :)

Some things seem to get entered into the database table before the exception occurs.

I also tried using the classifier so I wouldn't have to add every single token but could train an entire message at once, but I still got the same exception and it seemed like no data at all made it to the database.

Can anyone help me with this? I just can't figure out how to solve this problem. Wouldn't surprise me if it was some really stupid mistake on my part. :)

Regards,
Nadja |
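[Editor's note: Nick's pooling suggestion attacks the connection churn directly; an application-level complement is to stop hitting the data source once per token occurrence and instead aggregate token counts in memory first, so each distinct word needs only one write. The sketch below shows just the aggregation step; the class and method names are illustrative, not Classifier4J API.]

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Classifier4J API): count each token in memory,
// so the database is written to once per distinct word instead of once
// per token occurrence.
public class BatchedTraining {

    // Aggregate token occurrences into a word -> count map.
    public static Map<String, Integer> countTokens(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"market", "rises", "market"};
        Map<String, Integer> counts = countTokens(tokens);
        System.out.println(counts.get("market")); // 2
        System.out.println(counts.get("rises"));  // 1
    }
}
```

With one batched write per map entry (for example a single PreparedStatement executed via addBatch/executeBatch over these counts, on one pooled connection), the number of database round trips drops from one per token to one per distinct word.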
From: Nick L. <ni...@ma...> - 2006-03-12 11:38:22
|
You could try running Classifier4J in .NET under IKVM (http://www.ikvm.net/). I'd imagine that it would work pretty well. Let me know if it works!

Nick

Ric...@er... wrote:
> cla...@li... wrote on 03/10/2006 07:48:19 AM:
> >> It isn't really possible to compare scores across categories to say that one category is the "best" category. All the Bayesian classifier will do is say if something matches the current category.
> >
> > I was wondering what you ended up doing on this -- I have a similar situation
>
> I'm actually using a port of Classifier4J for .NET called NClassifier, which is based on Classifier4J 0.51, so there is no working VectorClassifier implementation. I've given up for now and will re-evaluate when the NClassifier library catches up--no billable time available to port the updates myself.
>
> --Richard
>
> ----------------------------------------------
> This electronic mail message may contain information which is (a) LEGALLY PRIVILEGED, PROPRIETARY IN NATURE, OR OTHERWISE PROTECTED BY LAW FROM DISCLOSURE, and (b) intended only for the use of the Addressee(s) named herein. If you are not the Addressee(s), or the person responsible for delivering this to the Addressee(s), you are hereby notified that reading, copying, or distributing this message is prohibited. If you have received this electronic mail message in error, please contact us immediately at (281) 600-1000 and take the steps necessary to delete the message completely from your computer system. Thank you, Environmental Resources Management. Please visit ERM's web site: http://www.erm.com |
From: <Ric...@er...> - 2006-03-10 14:46:18
|
cla...@li... wrote on 03/10/2006 07:48:19 AM:

>> It isn't really possible to compare scores across categories to say that one category is the "best" category. All the Bayesian classifier will do is say if something matches the current category.
>
> I was wondering what you ended up doing on this -- I have a similar situation

I'm actually using a port of Classifier4J for .NET called NClassifier, which is based on Classifier4J 0.51, so there is no working VectorClassifier implementation. I've given up for now and will re-evaluate when the NClassifier library catches up--no billable time available to port the updates myself.

--Richard

----------------------------------------------
This electronic mail message may contain information which is (a) LEGALLY PRIVILEGED, PROPRIETARY IN NATURE, OR OTHERWISE PROTECTED BY LAW FROM DISCLOSURE, and (b) intended only for the use of the Addressee(s) named herein. If you are not the Addressee(s), or the person responsible for delivering this to the Addressee(s), you are hereby notified that reading, copying, or distributing this message is prohibited. If you have received this electronic mail message in error, please contact us immediately at (281) 600-1000 and take the steps necessary to delete the message completely from your computer system. Thank you, Environmental Resources Management. Please visit ERM's web site: http://www.erm.com |
From: Joe S. <sca...@gm...> - 2006-03-10 13:48:25
|
Richard -

I was wondering what you ended up doing on this -- I have a similar situation

joe

On 3/2/06, Nick Lothian <ni...@ma...> wrote:
> See inline
>
> Ric...@er... wrote:
> > Apologies in advance if this comes through in HTML, I'm stuck on Lotus Notes here at work.
> >
> > I have a bunch of legislative text, around 400,000 individual paragraphs, that have each been hand-categorized into one of five categories. Since I have a few hundred thousand still to go, I thought the Bayesian classifier could give me a leg up on this process. So I wrote a little trainer that does something like the following:
> >
> >     switch (existingcategory) {
> >         case "category1":
> >             classifier.TeachMatch("category1", mytext);
> >             classifier.TeachNonMatch("category2", mytext);
> >             classifier.TeachNonMatch("category3", mytext);
> >             classifier.TeachNonMatch("category4", mytext);
> >             classifier.TeachNonMatch("category5", mytext);
> >             break;
> >         case "category2":
> >             classifier.TeachNonMatch("category1", mytext);
> >             classifier.TeachMatch("category2", mytext);
> >             classifier.TeachNonMatch("category3", mytext);
> >             classifier.TeachNonMatch("category4", mytext);
> >             classifier.TeachNonMatch("category5", mytext);
> >             break;
> >         case "category3":
> >             classifier.TeachNonMatch("category1", mytext);
> >             classifier.TeachNonMatch("category2", mytext);
> >             classifier.TeachMatch("category3", mytext);
> >             classifier.TeachNonMatch("category4", mytext);
> >             classifier.TeachNonMatch("category5", mytext);
> >             break;
> >         case "category4":
> >             ...
> >     }
> >
> > The problem is, *one* of the categories is *much* more common than the others, so it gets more matches and fewer non-matches for almost *any* word. So, now when I send a new string through the trained classifier and compare the scores, that category almost always wins out, and in a big way (generally around 99% for it, 1% for the others).
>
> It isn't really possible to compare scores across categories to say that one category is the "best" category. All the Bayesian classifier will do is say if something matches the current category. As you've seen it does that well - you'll typically end up with a very high score (99%) or a very low score (1%) and not much in between.
>
> Perhaps you could classify the big category last, and only check it if none of the other ones find a match.
>
> > Am I training this classifier wrong, or is this a limitation of using Bayesian filters with more than two categories or with a corpus that is unevenly distributed among the categories? I thought maybe I should try the VectorClassifier instead, but I have *tens of thousands* of strings in each category that I need to train it on, and the docs state that you can't incrementally train it (which, I presume, means I would need to concatenate the entire training corpus into one string per category).
>
> That means just that the training interfaces aren't properly implemented (yet). I've attached an updatable HashMapTermVectorStorage that fixes this (I haven't tested it though) - it might give you something to start from.
>
> Nick
>
>     package net.sf.classifier4J.vector;
>
>     import java.io.Serializable;
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class MyHashMapTermVectorStorage implements TermVectorStorage, Serializable {
>         private static final long serialVersionUID = 1L;
>         private Map storage;
>
>         public MyHashMapTermVectorStorage(int amount) {
>             storage = new HashMap(amount);
>         }
>
>         public MyHashMapTermVectorStorage() {
>             storage = new HashMap();
>         }
>
>         /**
>          * @see net.sf.classifier4J.vector.TermVectorStorage#addTermVector(java.lang.String, net.sf.classifier4J.vector.TermVector)
>          */
>         public void addTermVector(String category, TermVector termVector) {
>             // modified: Abelssoft, Sven Abels, 16.03.2005:
>             TermVector old = (TermVector) storage.get(category);
>             if (old == null) {
>                 storage.put(category, termVector);
>             } else {
>                 old.add(termVector);
>                 storage.put(category, old);
>             }
>         }
>
>         /**
>          * @see net.sf.classifier4J.vector.TermVectorStorage#getTermVector(java.lang.String)
>          */
>         public TermVector getTermVector(String category) {
>             return (TermVector) storage.get(category);
>         }
>
>         public int size() {
>             if (storage == null) return 0;
>             return storage.size();
>         }
>     } |
From: Nick L. <ni...@ma...> - 2006-03-02 11:56:50
|
package net.sf.classifier4J.vector;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class MyHashMapTermVectorStorage implements TermVectorStorage, Serializable {
    private static final long serialVersionUID = 1L;
    private Map storage;

    public MyHashMapTermVectorStorage(int amount) {
        storage = new HashMap(amount);
    }

    public MyHashMapTermVectorStorage() {
        storage = new HashMap();
    }

    /**
     * @see net.sf.classifier4J.vector.TermVectorStorage#addTermVector(java.lang.String, net.sf.classifier4J.vector.TermVector)
     */
    public void addTermVector(String category, TermVector termVector) {
        // modified: Abelssoft, Sven Abels, 16.03.2005:
        TermVector old = (TermVector) storage.get(category);
        if (old == null) {
            storage.put(category, termVector);
        } else {
            old.add(termVector);
            storage.put(category, old);
        }
    }

    /**
     * @see net.sf.classifier4J.vector.TermVectorStorage#getTermVector(java.lang.String)
     */
    public TermVector getTermVector(String category) {
        return (TermVector) storage.get(category);
    }

    public int size() {
        if (storage == null) return 0;
        return storage.size();
    }
} |
From: <Ric...@er...> - 2006-03-01 15:04:26
|
Apologies in advance if this comes through in HTML, I'm stuck on Lotus Notes here at work.

I have a bunch of legislative text, around 400,000 individual paragraphs, that have each been hand-categorized into one of five categories. Since I have a few hundred thousand still to go, I thought the Bayesian classifier could give me a leg up on this process. So I wrote a little trainer that does something like the following:

    switch (existingcategory) {
        case "category1":
            classifier.TeachMatch("category1", mytext);
            classifier.TeachNonMatch("category2", mytext);
            classifier.TeachNonMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category2":
            classifier.TeachNonMatch("category1", mytext);
            classifier.TeachMatch("category2", mytext);
            classifier.TeachNonMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category3":
            classifier.TeachNonMatch("category1", mytext);
            classifier.TeachNonMatch("category2", mytext);
            classifier.TeachMatch("category3", mytext);
            classifier.TeachNonMatch("category4", mytext);
            classifier.TeachNonMatch("category5", mytext);
            break;
        case "category4":
            ...
    }

The problem is, *one* of the categories is *much* more common than the others, so it gets more matches and fewer non-matches for almost *any* word. So, now when I send a new string through the trained classifier and compare the scores, that category almost always wins out, and in a big way (generally around 99% for it, 1% for the others).

Am I training this classifier wrong, or is this a limitation of using Bayesian filters with more than two categories or with a corpus that is unevenly distributed among the categories?

I thought maybe I should try the VectorClassifier instead, but I have *tens of thousands* of strings in each category that I need to train it on, and the docs state that you can't incrementally train it (which, I presume, means I would need to concatenate the entire training corpus into one string per category).

Any help would be greatly appreciated...

--
Richard S. Tallent
ERM (Beaumont, TX)
409-833-7755

----------------------------------------------
This electronic mail message may contain information which is (a) LEGALLY PRIVILEGED, PROPRIETARY IN NATURE, OR OTHERWISE PROTECTED BY LAW FROM DISCLOSURE, and (b) intended only for the use of the Addressee(s) named herein. If you are not the Addressee(s), or the person responsible for delivering this to the Addressee(s), you are hereby notified that reading, copying, or distributing this message is prohibited. If you have received this electronic mail message in error, please contact us immediately at (281) 600-1000 and take the steps necessary to delete the message completely from your computer system. Thank you, Environmental Resources Management. Please visit ERM's web site: http://www.erm.com |
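[Editor's note: the five-arm switch in Richard's message can collapse to a single loop -- teach a match for the document's own category and a non-match for every other one. The sketch below uses a stub classifier that merely records calls so the shape of the loop is visible; the method names mirror Richard's (NClassifier-style) snippet and are illustrative, not a real library API.]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: replace the repetitive switch with one loop over the category
// list. The classifier here is a stub that records which calls were made.
public class TrainingLoop {

    static class RecordingClassifier {
        final List<String> calls = new ArrayList<>();
        void teachMatch(String category, String text) { calls.add("match:" + category); }
        void teachNonMatch(String category, String text) { calls.add("nonmatch:" + category); }
    }

    // Teach a match for the true category, a non-match for all others.
    static void train(RecordingClassifier classifier, String existingCategory,
                      String text, String[] categories) {
        for (String c : categories) {
            if (c.equals(existingCategory)) {
                classifier.teachMatch(c, text);
            } else {
                classifier.teachNonMatch(c, text);
            }
        }
    }

    public static void main(String[] args) {
        RecordingClassifier classifier = new RecordingClassifier();
        String[] categories = {"category1", "category2", "category3"};
        train(classifier, "category2", "some paragraph", categories);
        System.out.println(classifier.calls);
        // [nonmatch:category1, match:category2, nonmatch:category3]
    }
}
```

The loop does not fix the class-imbalance problem Richard describes, but it makes experiments (for example, sub-sampling non-matches for the dominant category) a one-line change instead of a five-arm edit.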
From: karl w. <we...@ho...> - 2006-02-07 18:33:49
|
On 7 Feb 2006, at 12.28, Nick Lothian wrote:
> karl wettin wrote:
>> I really like the simplicity of the C4J API, but think it's too bad there is no support for other than nominal values. Have you considered to add support for numeric values in the interfaces? I'd love to plug in the Weka J48 (and others) to the same API. The C4J classifiers could simply treat the numeric values as nominal. It would also be nice with more than one dimension of classes per classifier, i.e. rows and columns.
>>
>> How about that? Or is this outside the intended scope of C4J? Perhaps I should make my own facade that handles both Weka and C4J in a simple way?
>
> I'm quite interested in this, but I have to admit I don't understand the distinction between numeric & nominal values. Can you explain this some (to save me some googling-time...)?

Very short: nominal values are strings. Numeric values are integer/floating point values. They are both classes, but numerical values are easier to bend in either direction than a nominal value.

-- karl |
From: Nick L. <ni...@ma...> - 2006-02-07 11:28:09
|
karl wettin wrote:
> I really like the simplicity of the C4J API, but think it's too bad there is no support for other than nominal values. Have you considered to add support for numeric values in the interfaces? I'd love to plug in the Weka J48 (and others) to the same API. The C4J classifiers could simply treat the numeric values as nominal. It would also be nice with more than one dimension of classes per classifier, i.e. rows and columns.
>
> How about that? Or is this outside the intended scope of C4J? Perhaps I should make my own facade that handles both Weka and C4J in a simple way?

I'm quite interested in this, but I have to admit I don't understand the distinction between numeric & nominal values. Can you explain this some (to save me some googling-time...)?

Nick |
From: karl w. <we...@ho...> - 2006-02-07 03:39:38
|
I really like the simplicity of the C4J API, but think it's too bad there is no support for anything other than nominal values. Have you considered adding support for numeric values in the interfaces? I'd love to plug in the Weka J48 (and others) to the same API. The C4J classifiers could simply treat the numeric values as nominal. It would also be nice to have more than one dimension of classes per classifier, i.e. rows and columns.

How about that? Or is this outside the intended scope of C4J? Perhaps I should make my own facade that handles both Weka and C4J in a simple way?

-- karl |
From: karl w. <we...@ho...> - 2006-02-06 14:47:55
|
On 6 Feb 2006, at 15.31, Jeff Thorne wrote:
> I would like to analyze each user's post for various words and expressions before publishing their post to the DB. I was wondering if someone could shed some light on the best way to tackle this problem with Classifier4j or another api if doing so makes more sense?
>
> How would the performance be with classifier4J and which classifier4j datasource and classifier do you recommend we use.

I doubt you want to use C4J for this. I would probably build n-grams of the words and the text and weight them up, to make sure no one is trying to hide the profanities in other words or by misspelling them. The Lucene spell check library does this for you. And really fast.

An easier way out would be to simply match the text against the words:

    for (String profanity : profanities) {
        if (input.indexOf(profanity) >= 0) {  // >= 0: indexOf returns 0 for a match at the very start
            reportProfanity(input);
        }
    }

-- karl |
From: Joe S. <sca...@gm...> - 2005-12-10 00:23:38
|
Thank you.

On 12/9/05, Mike Heath <mh...@av...> wrote:
> Indeed there is:
> http://sourceforge.net/mailarchive/forum.php?forum_id=34026 |
From: Mike H. <mh...@av...> - 2005-12-09 21:21:52
|
Indeed there is: http://sourceforge.net/mailarchive/forum.php?forum_id=34026 On Fri, 2005-12-09 at 07:58 -0500, Joe Scanlon wrote: > |
From: Joe S. <sca...@gm...> - 2005-12-09 12:58:37
|
From: Scanlon, J. <Joe...@Li...> - 2005-07-11 13:09:35
|
subscribe Joe Scanlon Principal Software Engineer ACS Application Common Utilities Liberty Mutual Group Joe...@Li... ph: (603) 245-1934 fax: (603) 245-0715 cell: (603) 489-8231 |
From: karl w. <ka...@sn...> - 2005-02-26 12:29:42
|
On Fri 2005-02-04 at 19:59 +1030, Nick Lothian wrote:
> I've just released Classifier4J 0.6. This new release includes a rather nice (I think) new classifier (the VectorClassifier) based on

Rather nice indeed. I just got around to trying it out.

What do you think about extending the vector with hierarchies? The parent/child delta could be used as negative data for the parent, or something like that. Any instinctive thoughts? I could set off some hours for that.

-- karl |
From: Nick L. <ni...@ma...> - 2005-02-04 09:26:16
|
I've just released Classifier4J 0.6. This new release includes a rather nice (I think) new classifier (the VectorClassifier) based on the vector space search algorithm. This particular classifier is fast, doesn't require training for non-matches, and is very suitable for sorting data into various categories.

The build system is now totally based on Maven, and I've moved to a new CVS module (newbuild) to implement this.

Let me know if you find any bugs.

Nick |
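[Editor's note: the vector-space idea behind the announced VectorClassifier can be illustrated with a toy cosine similarity between term-frequency vectors. This is an illustration of the general algorithm, not Classifier4J's implementation.]

```java
import java.util.HashMap;
import java.util.Map;

// Toy vector-space similarity: turn each text into a term-frequency map
// and compare the two vectors by their cosine. 1.0 means identical word
// distributions; 0.0 means no words in common.
public class CosineSketch {

    // Build a term-frequency vector from whitespace-separated text.
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            v.merge(w, 1, Integer::sum);
        }
        return v;
    }

    // Cosine of the angle between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int x : b.values()) {
            nb += x * x;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine(tf("java classifier"), tf("java classifier"))); // 1.0
        System.out.println(cosine(tf("java"), tf("python")));                     // 0.0
    }
}
```

A vector-space classifier scores a document against each category's accumulated term vector this way, which is why it needs no non-match training: an unrelated document simply produces a low cosine.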
From: Mike H. <mh...@av...> - 2004-12-29 19:23:30
|
C4J relies on Naive Bayes (http://en.wikipedia.org/wiki/Naive_Bayes), which means that, in order to classify something, you need to teach it what each class is AND what each class is not. For comparison purposes as you've described in your message, I'm not sure that C4J is a good solution.

-Mike

On Sun, 2004-12-26 at 15:37, Colin Bell wrote:
> Hi all
>
> I would like to start with saying what an exciting piece of software C4J is, thanks to all those involved.
>
> I have written a bit of code to use C4J to compare documents (in this case stored in a JDBC database) to each other and find out how similar they are. I pick the document from which I am to compare, and then add each word of it to a SimpleWordsDataSource using a loop (wds.addMatch(wordList[i])). I then use BayesianClassifier(wds) to get the result of each document.
>
> Problem is that my results are obviously very poor (always 0.99, sometimes 0.5) because I don't have any non-matches. Does anyone have an idea on how I could do this? What could I possibly use as non-matches, or am I missing a trick?
>
> Many thanks
>
> Regards
>
> Colin |
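[Editor's note: a toy illustration of Mike's point -- not Classifier4J's actual code. With only match training, every seen word's probability is 1.0, which the classifier clamps to 0.99: exactly the scores Colin reports. A single non-match immediately changes the picture.]

```java
import java.util.HashMap;
import java.util.Map;

// Toy word-probability model showing why non-match training is required.
public class WhyNonMatches {

    // word -> {matchCount, nonMatchCount}
    static final Map<String, int[]> counts = new HashMap<>();

    static void teach(String word, boolean match) {
        int[] c = counts.computeIfAbsent(word, k -> new int[2]);
        c[match ? 0 : 1]++;
    }

    // P(match | word) = matches / (matches + nonMatches), clamped away
    // from the degenerate values 0 and 1.
    static double wordProbability(String word) {
        int[] c = counts.get(word);
        if (c == null) return 0.5; // never seen: neutral
        double p = (double) c[0] / (c[0] + c[1]);
        return Math.min(0.99, Math.max(0.01, p));
    }

    public static void main(String[] args) {
        teach("java", true);                          // only a match taught
        System.out.println(wordProbability("java"));  // 0.99 -- always "similar"
        teach("java", false);                         // one non-match
        System.out.println(wordProbability("java"));  // 0.5
    }
}
```

This is why Colin only ever sees 0.99 (every word has been taught as a match) or 0.5 (the word has never been seen at all).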
From: Colin B. <co...@ga...> - 2004-12-26 22:37:17
|
Hi all

I would like to start with saying what an exciting piece of software C4J is, thanks to all those involved.

I have written a bit of code to use C4J to compare documents (in this case stored in a JDBC database) to each other and find out how similar they are. I pick the document from which I am to compare, and then add each word of it to a SimpleWordsDataSource using a loop (wds.addMatch(wordList[i])). I then use BayesianClassifier(wds) to get the result of each document.

Problem is that my results are obviously very poor (always 0.99, sometimes 0.5) because I don't have any non-matches. Does anyone have an idea on how I could do this? What could I possibly use as non-matches, or am I missing a trick?

Many thanks

Regards

Colin |
From: Wayne S. <wds...@oa...> - 2004-12-13 17:05:54
|
Thanks. It makes more sense now.

-----Original Message-----
From: cla...@li... [mailto:cla...@li...] On Behalf Of cla...@li...
Sent: Sunday, December 12, 2004 11:12 PM
To: cla...@li...
Subject: Classifier4j-devel digest, Vol 1 #78 - 2 msgs

Today's Topics:

1. RE: What does this method do - normaliseSignificance() (Nick Lothian)

Message: 1
From: Nick Lothian <nic...@es...>
To: "'cla...@li...'" <cla...@li...>
Subject: RE: [Classifier4j-devel] What does this method do - normaliseSignificance()
Date: Mon, 13 Dec 2004 08:51:29 +1030

> On Fri, 2004-12-10 at 17:45, Wayne Snyder wrote:
> > I understand just about everything that's going on in this package, except for the following method:
> >
> > Class BayesianClassifier
> > protected static double normaliseSignificance(double sig)
> >
> > Could you please explain the role it plays.
>
> I am not a Classifier4J developer but I've used Classifier4J quite a bit and have done a lot of research on Naive Bayesian Classifiers.
>
> Stated simply, probabilities of 0 mess up a Naive Bayesian Classifier and probabilities of 1 don't change anything. It basically boils down to the fact that anything multiplied by 0 is 0 and multiplying by 1 doesn't change anything. BayesianClassifier.normaliseSignificance(double) simply removes the 1's and the 0's and replaces them with 0.99 and 0.01, respectively.
>
> For a good explanation of the magic that is Naive Bayesian Classification, check out:
> http://en.wikipedia.org/wiki/Naive_Bayesian_classifier
>
> -Mike

That's exactly what that method does.

Nick |
From: Nick L. <nic...@es...> - 2004-12-12 22:25:53
|
> On Fri, 2004-12-10 at 17:45, Wayne Snyder wrote:
> > I understand just about everything that's going on in this package, except for the following method:
> >
> > Class BayesianClassifier
> > protected static double normaliseSignificance(double sig)
> >
> > Could you please explain the role it plays.
>
> I am not a Classifier4J developer but I've used Classifier4J quite a bit and have done a lot of research on Naive Bayesian Classifiers.
>
> Stated simply, probabilities of 0 mess up a Naive Bayesian Classifier and probabilities of 1 don't change anything. It basically boils down to the fact that anything multiplied by 0 is 0 and multiplying by 1 doesn't change anything. BayesianClassifier.normaliseSignificance(double) simply removes the 1's and the 0's and replaces them with 0.99 and 0.01, respectively.
>
> For a good explanation of the magic that is Naive Bayesian Classification, check out:
> http://en.wikipedia.org/wiki/Naive_Bayesian_classifier
>
> -Mike

That's exactly what that method does.

Nick |
From: Mike H. <mh...@av...> - 2004-12-12 03:50:34
|
On Fri, 2004-12-10 at 17:45, Wayne Snyder wrote:
> I understand just about everything that's going on in this package, except for the following method:
>
> Class BayesianClassifier
> protected static double normaliseSignificance(double sig)
>
> Could you please explain the role it plays.

I am not a Classifier4J developer but I've used Classifier4J quite a bit and have done a lot of research on Naive Bayesian Classifiers.

Stated simply, probabilities of 0 mess up a Naive Bayesian Classifier and probabilities of 1 don't change anything. It basically boils down to the fact that anything multiplied by 0 is 0 and multiplying by 1 doesn't change anything. BayesianClassifier.normaliseSignificance(double) simply removes the 1's and the 0's and replaces them with 0.99 and 0.01, respectively.

For a good explanation of the magic that is Naive Bayesian Classification, check out:
http://en.wikipedia.org/wiki/Naive_Bayesian_classifier

-Mike |
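[Editor's note: a minimal sketch of the behaviour Mike describes -- clamp probabilities away from the degenerate values 0 and 1. This is a re-implementation for illustration, not Classifier4J's actual source.]

```java
// Sketch of the clamping Mike describes: a probability of exactly 1
// would dominate the product and 0 would zero it out, so both are
// replaced with near-extreme values instead.
public class NormaliseSketch {

    static double normaliseSignificance(double sig) {
        if (sig >= 1.0) return 0.99; // a 1 would make the word absolutely decisive
        if (sig <= 0.0) return 0.01; // a 0 would wipe out the whole product
        return sig;                  // everything in between passes through
    }

    public static void main(String[] args) {
        System.out.println(normaliseSignificance(1.0)); // 0.99
        System.out.println(normaliseSignificance(0.0)); // 0.01
        System.out.println(normaliseSignificance(0.5)); // 0.5
    }
}
```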
From: Wayne S. <wds...@oa...> - 2004-12-11 00:48:42
|
I understand just about everything that's going on in this package, except for the following method:

    Class BayesianClassifier
    protected static double normaliseSignificance(double sig)

Could you please explain the role it plays.

Thanks

Wayne |
From: Nick L. <nic...@es...> - 2004-11-28 22:53:55
|
BTW, you will need to train non-matches as well as matches in order to get sensible results.

Nick

-----Original Message-----
From: Nick Lothian [mailto:nic...@es...]
Sent: Monday, 29 November 2004 9:16 AM
To: cla...@li...
Subject: RE: [Classifier4j-devel] Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
Importance: Low

You need apache commons logging. See http://classifier4j.sourceforge.net/dependencies.html

Nick

-----Original Message-----
From: Wayne [mailto:des...@ho...]
Sent: Monday, 29 November 2004 9:17 AM
To: cla...@li...
Subject: [Classifier4j-devel] Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
Importance: Low

My Bayesian test program compiles fine but I get this error when I try to run it:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
        at net.sf.classifier4J.bayesian.WordProbability.calculateProbability(WordProbability.java:167)
        at net.sf.classifier4J.bayesian.WordProbability.setMatchingCount(WordProbability.java:138)
        at net.sf.classifier4J.bayesian.WordProbability.<init>(WordProbability.java:115)
        at net.sf.classifier4J.bayesian.SimpleWordsDataSource.addMatch(SimpleWordsDataSource.java:94)
        at testing.Test1.main(Test1.java:15)

I am using Eclipse 3.1M2 and have added the Classifier4J-0.51.jar as an external JAR library. This version of Eclipse uses JDK 5.0. Does anyone know what settings I need in Eclipse to run?

Here is the test code in my project:

    package testing;

    import net.sf.classifier4J.ClassifierException;
    import net.sf.classifier4J.IClassifier;
    import net.sf.classifier4J.bayesian.BayesianClassifier;
    import net.sf.classifier4J.bayesian.IWordsDataSource;
    import net.sf.classifier4J.bayesian.SimpleWordsDataSource;
    import net.sf.classifier4J.bayesian.WordsDataSourceException;

    public class Test1 {

        private static double dReturn;

        public static void main(String[] args) {
            IWordsDataSource wds = new SimpleWordsDataSource();
            try {
                wds.addMatch("Blah");
            } catch (WordsDataSourceException e) {
                e.printStackTrace();
            }
            IClassifier classifier = new BayesianClassifier(wds);
            try {
                dReturn = classifier.classify("Blah Happy Holidays");
            } catch (ClassifierException e1) {
                e1.printStackTrace();
            }
            System.out.println(dReturn);
        }
    }

Thanks

-Wayne |
From: Nick L. <nic...@es...> - 2004-11-28 22:50:08
|
You need apache commons logging. See http://classifier4j.sourceforge.net/dependencies.html

Nick

-----Original Message-----
From: Wayne [mailto:des...@ho...]
Sent: Monday, 29 November 2004 9:17 AM
To: cla...@li...
Subject: [Classifier4j-devel] Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
Importance: Low

My Bayesian test program compiles fine but I get this error when I try to run it:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
        at net.sf.classifier4J.bayesian.WordProbability.calculateProbability(WordProbability.java:167)
        at net.sf.classifier4J.bayesian.WordProbability.setMatchingCount(WordProbability.java:138)
        at net.sf.classifier4J.bayesian.WordProbability.<init>(WordProbability.java:115)
        at net.sf.classifier4J.bayesian.SimpleWordsDataSource.addMatch(SimpleWordsDataSource.java:94)
        at testing.Test1.main(Test1.java:15)

I am using Eclipse 3.1M2 and have added the Classifier4J-0.51.jar as an external JAR library. This version of Eclipse uses JDK 5.0. Does anyone know what settings I need in Eclipse to run?

Here is the test code in my project:

    package testing;

    import net.sf.classifier4J.ClassifierException;
    import net.sf.classifier4J.IClassifier;
    import net.sf.classifier4J.bayesian.BayesianClassifier;
    import net.sf.classifier4J.bayesian.IWordsDataSource;
    import net.sf.classifier4J.bayesian.SimpleWordsDataSource;
    import net.sf.classifier4J.bayesian.WordsDataSourceException;

    public class Test1 {

        private static double dReturn;

        public static void main(String[] args) {
            IWordsDataSource wds = new SimpleWordsDataSource();
            try {
                wds.addMatch("Blah");
            } catch (WordsDataSourceException e) {
                e.printStackTrace();
            }
            IClassifier classifier = new BayesianClassifier(wds);
            try {
                dReturn = classifier.classify("Blah Happy Holidays");
            } catch (ClassifierException e1) {
                e1.printStackTrace();
            }
            System.out.println(dReturn);
        }
    }

Thanks

-Wayne |