Hongbin,
Two excellent questions.
For asymmetric cost, there are two methods that are currently available and
reasonably well documented:
1) You can over/under-sample your examples such that the weight placed on
the small number of positive examples increases w.r.t. the total weight of
all examples. While this is a hack, it may be sufficient and is the
simplest way to do asymmetric cost. For this method, I recommend using
LogLossBoost as it will leave the positive examples with more weight than
AdaBoost (which will increase the weight of the negative examples at an
exponential rate, whereas LogLossBoost will do this at a linear rate --
give or take). However, try both algorithms and compare results (error
rates, margin curves, etc.). There is a resample.py script that may be
a good way to resample (it keeps things in order), though you'll likely
have to edit the source since I wrote it quickly and likely didn't do a
very good job (there's a rough sketch of the idea just after item 3
below).
2) You can give each negative example an initial weight in the spec file
using the "weight" feature (see the online documentation for creating the
spec file). This weight can be anything in the range [0, 1]. I know
this worked at one point in time, but it's had trouble lately. It is
also roughly equivalent in effectiveness to option 1, but
it requires less memory.
3) Yes, I know I said there were two methods, but here's a 3rd one
anyways. I am currently working on asymmetric cost to be used with
BrownBoost & NormalBoost. I'll likely do release 1.4.1 in the next couple
weeks and it will have asymmetric cost for BrownBoost (documented, etc) and
an initial attempt at bug-free asymmetric cost with NormalBoost.
BrownBoost is fairly simple to parameterize and NormalBoost is only
slightly more complicated. I will post documentation on the website and
send you an email in the next week or so when asymmetric cost BrownBoost is
finished. I'm guessing NormalBoost will take another week or two.
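Since I can't vouch for resample.py as it stands, here's a rough,
order-preserving oversampling sketch in Python to show what I mean in
option 1. This is not the actual script; the positive label "+1", the
duplication factor, and the assumption that each example sits on one
line with its class label as the last comma-separated field are all
placeholders, so adjust them to match your spec/data files:

#!/usr/bin/env python
# Rough oversampling sketch -- not the real resample.py.
# Assumes one example per line, label as the last comma-separated field.
import sys

def oversample(in_path, out_path, positive_label="+1", factor=5):
    # Write positives `factor` times and negatives once, preserving the
    # original order of the file.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            label = line.strip().rstrip(";").split(",")[-1].strip()
            copies = factor if label == positive_label else 1
            for _ in range(copies):
                dst.write(line)

if __name__ == "__main__":
    # e.g. python oversample.py train.data train.over.data +1 5
    in_path, out_path, pos_label, k = sys.argv[1:5]
    oversample(in_path, out_path, pos_label, int(k))

Pick the factor according to the false-negative/false-positive
trade-off you care about; a factor of 5, for example, makes each
positive count as much as five negatives at the start of boosting.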
For cascade classifiers, a friend of mine was going to code this up, but
he just hasn't gotten around to it. The main changes that would need to
be made are in ./jboost/src/jboost/atree/ and ./jboost/src/jboost/booster.
In particular, I think the PredictorNode (which I recently changed to
be evaluated in iteration order, not DFS) will need to check whether
the next prediction pushes us over the limit, in which case we can stop
evaluating weak hypotheses. Once an example is considered to be
obsolete, it can be
marked as such (though this will increase the memory size of an example,
something we're trying to keep to a minimum). The PredictorNode and
Booster can then check to see if the example is obsolete or not. If
you're interested in implementing this, let me know and I can go into
greater detail. Otherwise, I'll mention it again to the guy who is
going to need it for some other project.
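To make that a bit more concrete, here's a small Python sketch of the
early-termination logic. The real change would live in the Java under
jboost.atree and jboost.booster; the class names, the per-stage
thresholds, and the obsolete flag below are purely illustrative, not
JBoost's actual API:

class PredictorStage:
    def __init__(self, predict, reject_threshold):
        self.predict = predict              # weak hypothesis: features -> score
        self.reject_threshold = reject_threshold

class Example:
    def __init__(self, features):
        self.features = features
        self.obsolete = False               # set once a stage rejects the example

def cascade_score(stages, example):
    # Evaluate stages in iteration order and stop as soon as the running
    # score drops below a stage's rejection threshold.
    score = 0.0
    for stage in stages:
        score += stage.predict(example.features)
        if score < stage.reject_threshold:
            example.obsolete = True         # PredictorNode/Booster can skip it later
            break
    return score

The obsolete flag trades one extra field per example for the ability to
skip that example in both the PredictorNode and the Booster, which is
the memory-versus-speed trade-off I mentioned above.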
Let me know if you have any other questions, and I'll shoot you
another email when I get asymmetric cost working for BrownBoost and
NormalBoost in the next couple of weeks.
Aaron
On Fri, 26 Oct 2007, hongbin wang wrote:
> Hi Aaron,
>
> Thanks for your excellent software. I have a question
> about how to set asymmetric cost. Basically I have an
> imbalanced dataset with a small number of positive
> samples and a large number of negative samples. Another
> question is: how do I extend your code into a cascade
> classifier?
>
> Many thanks
>
> Hongbin