From: Aaron A. <aa...@cs...> - 2007-10-26 20:45:33
Hongbin,

Two excellent questions.

For asymmetric cost, there are two methods that are currently available and reasonably well documented:

1) You can over/under-sample your examples so that the weight placed on the small number of positive examples increases relative to the total weight of all examples. While this is a hack, it may be sufficient, and it is the simplest way to get asymmetric cost. For this method, I recommend using LogLossBoost, as it will leave the positive examples with more weight than AdaBoost (AdaBoost increases the weight of the negative examples at an exponential rate, whereas LogLossBoost does so at roughly a linear rate). However, try both algorithms and compare the results (error rates, margin curves, etc.). There is a resample.py script that may be a good way to resample (it keeps things in order), though you'll likely have to edit the source since I wrote it quickly and likely didn't do a good job. (A rough sketch of this kind of resampling is appended after the quoted message below.)

2) You can give each negative example an initial weight in the spec file using the "weight" feature (see the online documentation for creating the spec file). This weight can be anything in the range [0, 1]. I know that this worked at one point in time, but it has had trouble lately. It is roughly equivalent in effectiveness to option 1, but it requires less memory. (A sketch of one way to compute such weights is also appended below.)

3) Yes, I know I said there were two methods, but here's a third one anyway. I am currently working on asymmetric cost for BrownBoost and NormalBoost. I'll likely do release 1.4.1 in the next couple of weeks; it will have asymmetric cost for BrownBoost (documented, etc.) and an initial attempt at bug-free asymmetric cost with NormalBoost. BrownBoost is fairly simple to parameterize and NormalBoost is only slightly more complicated. I will post documentation on the website and send you an email in the next week or so when asymmetric-cost BrownBoost is finished. I'm guessing NormalBoost will take another week or two.

For cascade classifiers, a friend of mine was going to code this up, but he just hasn't gotten around to it. The main changes that would need to be made are in ./jboost/src/jboost/atree/ and ./jboost/src/jboost/booster. In particular, I think the PredictorNode (which I recently changed to be evaluated in iteration order, not DFS) will need to check whether the next prediction pushes us over the limit, in which case we can stop evaluating weak hypotheses. Once an example is considered obsolete, it can be marked as such (though this will increase the memory size of an example, something we're trying to keep to a minimum). The PredictorNode and Booster can then check whether the example is obsolete or not. (A conceptual sketch of this early-rejection logic is appended below.) If you're interested in implementing this, let me know and I can go into greater detail. Otherwise, I'll mention it again to the guy who is going to need it for something else.

Let me know if you have any other questions, and I'll shoot you another email when I get asymmetric BrownBoost and NormalBoost working in the next couple of weeks.

Aaron

On Fri, 26 Oct 2007, hongbin wang wrote:

> Hi Aaron,
>
> Thanks for your excellent software. I have a question
> about how to set asymetric cost. Basically I have a
> imbalanced dataset include small number of positive
> samples and large negative samples. Another question
> is that, how to extend your code into cascade
> classifier?
>
> Many thanks
>
> Hongbin
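P.S. Here are the sketches I mentioned above. First, a minimal oversampling sketch for option 1. This is not the bundled resample.py, just an illustration of the idea; the file names, the position of the class label, and the duplication factor are assumptions you'd adjust for your own data format:

# Minimal oversampling sketch (not the bundled resample.py): duplicate each
# positive example so the total weight on positives rises relative to the
# negatives, while keeping the examples in their original order.
# Assumes one example per line with the class label as the last
# comma-separated field, terminated by ';' -- adjust for your data format.
def oversample(in_path, out_path, pos_label="1", factor=5):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(line)
            fields = line.strip().rstrip(";").split(",")
            if fields and fields[-1].strip() == pos_label:
                # write (factor - 1) extra copies of each positive example
                for _ in range(factor - 1):
                    fout.write(line)

oversample("train.data", "train_oversampled.data")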
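Second, one way you might compute the initial per-example weights for option 2: negatives are down-weighted so that both classes start with roughly equal total weight. How the value is attached to each example depends on how the "weight" feature is declared in your spec file, so check the online documentation for the exact format:

# Sketch of computing initial per-example weights in [0, 1] for option 2:
# negatives get weight n_pos / n_neg and positives get 1.0, so the two
# classes start with roughly equal total weight.
def initial_weights(labels, pos_label="1"):
    n_pos = sum(1 for y in labels if y == pos_label)
    n_neg = len(labels) - n_pos
    neg_weight = float(n_pos) / n_neg if n_neg else 1.0
    return [1.0 if y == pos_label else neg_weight for y in labels]

# Example: 2 positives and 6 negatives -> each negative gets weight 2/6.
print(initial_weights(["1", "0", "0", "1", "0", "0", "0", "0"]))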
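Finally, a conceptual sketch of the cascade-style early rejection described above. This is plain Python rather than the actual PredictorNode/Booster change, and the class and threshold names are made up:

# Conceptual sketch of cascade-style early rejection (not the actual jboost
# PredictorNode): weak hypotheses are evaluated in iteration order, and an
# example is marked obsolete as soon as its running score falls below the
# stage threshold, so the remaining weak hypotheses are skipped for it.
class Example(object):
    def __init__(self, features):
        self.features = features
        self.score = 0.0
        self.obsolete = False  # set once the example is rejected

def cascade_score(example, weak_hyps, stage_thresholds):
    # weak_hyps: callables mapping features to a real-valued prediction
    # stage_thresholds[t]: minimum running score required after hypothesis t
    for t, h in enumerate(weak_hyps):
        if example.obsolete:
            break
        example.score += h(example.features)
        if example.score < stage_thresholds[t]:
            example.obsolete = True  # reject; skip remaining weak hypotheses
    return example.score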