From: Aaron A. <aa...@cs...> - 2007-10-26 20:45:33
Hongbin,

Two excellent questions.

For asymmetric cost, there are two methods that are currently available and reasonably well documented:

1) You can over/under-sample your examples so that the weight placed on the small number of positive examples increases relative to the total weight of all examples. While this is a hack, it may be sufficient, and it is the simplest way to get asymmetric cost. For this method, I recommend using LogLossBoost, as it will leave the positive examples with more weight than AdaBoost (AdaBoost increases the weight of the negative examples at an exponential rate, whereas LogLossBoost does so at roughly a linear rate). However, try both algorithms and compare the results (error rates, margin curves, etc.). There is a resample.py script that may be a good way to resample (it keeps things in order), though you'll likely have to edit the source since I wrote it quickly and likely didn't do a good job. (A rough sketch of this kind of resampling is appended after the quoted message below.)

2) You can give each negative example an initial weight in the spec file using the "weight" feature (see the online documentation for creating the spec file). This weight can be anything in the range [0, 1]. I know that this worked at one point in time, but it has had trouble lately. It is roughly equivalent in effectiveness to option 1, but it requires less memory. (A sketch of one way to compute such weights is also appended below.)

3) Yes, I know I said there were two methods, but here's a third one anyway. I am currently working on asymmetric cost for BrownBoost and NormalBoost. I'll likely do release 1.4.1 in the next couple of weeks; it will have asymmetric cost for BrownBoost (documented, etc.) and an initial attempt at bug-free asymmetric cost with NormalBoost. BrownBoost is fairly simple to parameterize and NormalBoost is only slightly more complicated. I will post documentation on the website and send you an email in the next week or so when asymmetric-cost BrownBoost is finished. I'm guessing NormalBoost will take another week or two.

For cascade classifiers, a friend of mine was going to code this up, but he just hasn't gotten around to it. The main changes that would need to be made are in ./jboost/src/jboost/atree/ and ./jboost/src/jboost/booster. In particular, I think the PredictorNode (which I recently changed to be evaluated in iteration order, not DFS) will need to check whether the next prediction pushes us over the limit, in which case we can stop evaluating weak hypotheses. Once an example is considered obsolete, it can be marked as such (though this will increase the memory size of an example, something we're trying to keep to a minimum). The PredictorNode and Booster can then check whether the example is obsolete or not. (A conceptual sketch of this early-rejection logic is appended below.) If you're interested in implementing this, let me know and I can go into greater detail. Otherwise, I'll mention it again to the guy who is going to need it for something else.

Let me know if you have any other questions, and I'll shoot you another email when I get asymmetric BrownBoost and NormalBoost working in the next couple of weeks.

Aaron

On Fri, 26 Oct 2007, hongbin wang wrote:

> Hi Aaron,
>
> Thanks for your excellent software. I have a question
> about how to set asymetric cost. Basically I have a
> imbalanced dataset include small number of positive
> samples and large negative samples. Another question
> is that, how to extend your code into cascade
> classifier?
>
> Many thanks
>
> Hongbin
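P.S. Here are the sketches I mentioned above. First, a minimal oversampling sketch for option 1. This is not the bundled resample.py, just an illustration of the idea; the file names, the position of the class label, and the duplication factor are assumptions you'd adjust for your own data format:

# Minimal oversampling sketch (not the bundled resample.py): duplicate each
# positive example so the total weight on positives rises relative to the
# negatives, while keeping the examples in their original order.
# Assumes one example per line with the class label as the last
# comma-separated field, terminated by ';' -- adjust for your data format.
def oversample(in_path, out_path, pos_label="1", factor=5):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(line)
            fields = line.strip().rstrip(";").split(",")
            if fields and fields[-1].strip() == pos_label:
                # write (factor - 1) extra copies of each positive example
                for _ in range(factor - 1):
                    fout.write(line)

oversample("train.data", "train_oversampled.data")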
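Second, one way you might compute the initial per-example weights for option 2: negatives are down-weighted so that both classes start with roughly equal total weight. How the value is attached to each example depends on how the "weight" feature is declared in your spec file, so check the online documentation for the exact format:

# Sketch of computing initial per-example weights in [0, 1] for option 2:
# negatives get weight n_pos / n_neg and positives get 1.0, so the two
# classes start with roughly equal total weight.
def initial_weights(labels, pos_label="1"):
    n_pos = sum(1 for y in labels if y == pos_label)
    n_neg = len(labels) - n_pos
    neg_weight = float(n_pos) / n_neg if n_neg else 1.0
    return [1.0 if y == pos_label else neg_weight for y in labels]

# Example: 2 positives and 6 negatives -> each negative gets weight 2/6.
print(initial_weights(["1", "0", "0", "1", "0", "0", "0", "0"]))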
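Finally, a conceptual sketch of the cascade-style early rejection described above. This is plain Python rather than the actual PredictorNode/Booster change, and the class and threshold names are made up:

# Conceptual sketch of cascade-style early rejection (not the actual jboost
# PredictorNode): weak hypotheses are evaluated in iteration order, and an
# example is marked obsolete as soon as its running score falls below the
# stage threshold, so the remaining weak hypotheses are skipped for it.
class Example(object):
    def __init__(self, features):
        self.features = features
        self.score = 0.0
        self.obsolete = False  # set once the example is rejected

def cascade_score(example, weak_hyps, stage_thresholds):
    # weak_hyps: callables mapping features to a real-valued prediction
    # stage_thresholds[t]: minimum running score required after hypothesis t
    for t, h in enumerate(weak_hyps):
        if example.obsolete:
            break
        example.score += h(example.features)
        if example.score < stage_thresholds[t]:
            example.obsolete = True  # reject; skip remaining weak hypotheses
    return example.score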