From: Aaron A. <aa...@cs...> - 2008-08-13 04:50:43
On Tue, 12 Aug 2008, Gungor Polatkan wrote:

> The major issue for me was how the current algorithm uses the
> weights. Is the implementation buggy, or the idea behind it? If the
> implementation is buggy, the thing I am curious about is the idea
> behind weighting the data in the current implementation.

The idea itself is not buggy. The current implementation is buggy.

> I also looked at the code and I am confused about the variable names.
> Does it refer to the distribution updated at each iteration of
> boosting as weights as well?

The member variable m_sampleWeights of AdaBoost is the weight as read
in from the data file. The member variable m_weights is an array of
weights, one for each example, for a given iteration. m_weights is
sometimes referred to as "D" in AdaBoost papers, or the "distribution
across examples".

The variable name reminds me that m_sampleWeights is meant to be used
with boosting-by-sampling, which I'm not sure was ever fully
implemented with a standard interface. It should still work for your
purposes... but it doesn't, since there's likely a bug somewhere.
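To make the distinction concrete, here is roughly what a boosting loop
does with the two kinds of weights. This is a sketch of textbook
AdaBoost, not the JBoost source; the names sampleWeights and d are
only meant to mirror m_sampleWeights and m_weights:

public final class AdaBoostWeightsSketch {

    /**
     * @param sampleWeights per-example weights from the data file
     *                      (the analogue of m_sampleWeights)
     * @param labels        class labels, each -1 or +1
     * @param predictions   predictions[t][i] = h_t(x_i), each -1 or +1
     * @param alphas        alphas[t] = coefficient of weak hypothesis t
     * @return the final distribution over examples (the analogue of
     *         m_weights, i.e. "D")
     */
    static double[] finalDistribution(double[] sampleWeights, int[] labels,
                                      int[][] predictions, double[] alphas) {
        int n = labels.length;
        double[] d = new double[n];

        // D_1(i) is the normalized sample weight: the data-file
        // weights only choose the starting distribution.
        double total = 0.0;
        for (double w : sampleWeights) total += w;
        for (int i = 0; i < n; i++) d[i] = sampleWeights[i] / total;

        // Each round: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        for (int t = 0; t < alphas.length; t++) {
            double z = 0.0;
            for (int i = 0; i < n; i++) {
                d[i] *= Math.exp(-alphas[t] * labels[i] * predictions[t][i]);
                z += d[i];
            }
            for (int i = 0; i < n; i++) d[i] /= z;  // normalize by Z_t
        }
        return d;
    }
}

The point is that the sample weights only set the initial
distribution; from then on boosting reweights the examples on its own
each round, which is also why a bug in applying the initial weights
can distort every later round.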
Aaron

> Aaron Arvey wrote:
>> Hi Gungor,
>>
>> Glad to hear you're working with JBoost!
>>
>> See comments inline below.
>>
>> On Tue, 12 Aug 2008, Gungor Polatkan wrote:
>>
>>> 1) First question is about the weight input. The meaning (higher
>>> weight implies greater importance to classify correctly) is
>>> fundamentally important for us, since it is the heart of our
>>> research project. How does the algorithm do that? Is there any
>>> paper related to this idea, or is it just a practical empirical
>>> method of changing the initial distribution? Do you know anything
>>> about that? Any information about this will help me very much.
>>> Also, what is the bug currently in the weighting? Looking forward
>>> to the news...
>>>
>>> |weight| an initial weighting of the data (higher weight implies
>>> greater importance to classify correctly). THERE IS A BUG IN
>>> WEIGHTING IN MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional
>>
>> The bug in weighting has still not been fixed. All I know is that
>> the final output from data with weights is not as would be expected
>> (I verified this myself several months ago). The weighting itself
>> is read in correctly (from what I could tell by the output), but
>> the way it is applied is somehow buggy. It is only applied in a
>> couple of locations, so it is somewhat unnerving that it causes
>> such abnormal behavior.
>>
>> There are many other ways to reweight your data besides the
>> provided weight option. Depending on how large and extreme your
>> class distributions are, you can simply oversample the smaller
>> class prior to input to JBoost. Keep in mind that the first weak
>> hypothesis is always "Class is +1", so any lopsidedness in the data
>> will be reweighted by the score given to this classifier and the
>> subsequent reweighting of examples.
>>
>> NOTE: The "Class is +1" classifier will rebalance data that isn't
>> *too* skewed in class distribution. The fact that it doesn't
>> balance the classes when they are heavily skewed is considered a
>> small bug (I verified this as well, around the same time I verified
>> the weight bug). However, if you oversample the data so that the
>> classes aren't *too* skewed (I've done 10:1 without a problem),
>> then sliding the score for "Class is +1" should give you control
>> over sensitivity/specificity.
>>
>>> 2) Second question is about the weak learner JBoost uses. Since my
>>> data features are real-valued (not binary or discrete, but real
>>> numbers from -inf to +inf), I think I should use decision stumps
>>> with real thresholds. Does the algorithm consider such a thing (a
>>> simpler stump for binary features and a different one for
>>> real-valued features)?
>>
>> If you run JBoost with default boosting parameters, it will use
>> decision stumps for weak learners. Boolean values can be seen as a
>> subset of real values (-1 is false, +1 is true), and the decision
>> stump would then be "< 0" for false and "> 0" for true.
>>
>> Also, I believe I remember there's a bug with "+inf" and "-inf"
>> values (as may happen in your output). I'd recommend replacing all
>> -inf and +inf values with finite values smaller/larger than all
>> other values; the weak learning algorithms will treat the smallest
>> (largest) values as -inf (+inf). There is a small sketch of this at
>> the bottom of this message.
>>
>> Try the default parameters for boosting and let me know if you need
>> any more guidance on this topic.
>>
>>> 3) For the modification, is all the source code in the src folder?
>>
>> Yes. There are some scripts in jboost-VERSION/scripts that are
>> helpful for visualizing the output, but all the code you'll likely
>> want to edit is in jboost-VERSION/src.
>>
>> Let me know if this answers your questions or if you have any other
>> inquiries.
>>
>> Aaron
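P.S. On the +inf/-inf point above: here is a hypothetical
pre-processing step, not part of JBoost (the helper name clampColumn
and the margin of 1.0 are my own inventions), that replaces infinities
in one feature column with finite sentinels just beyond the observed
range before the data file is written:

import java.util.Arrays;

public final class ClampInfinities {

    /** Replaces +/-inf in one feature column with finite sentinels. */
    static double[] clampColumn(double[] values) {
        // Find the finite range of the column.
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            if (!Double.isInfinite(v) && !Double.isNaN(v)) {
                max = Math.max(max, v);
                min = Math.min(min, v);
            }
        }
        // Map +inf above the finite max and -inf below the finite min.
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] == Double.POSITIVE_INFINITY)      out[i] = max + 1.0;
            else if (values[i] == Double.NEGATIVE_INFINITY) out[i] = min - 1.0;
            else                                            out[i] = values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] col = {0.5, Double.POSITIVE_INFINITY, -2.0,
                        Double.NEGATIVE_INFINITY};
        // Prints [0.5, 1.5, -2.0, -3.0]
        System.out.println(Arrays.toString(clampColumn(col)));
    }
}

Any finite sentinel beyond the observed range should behave the same
way, since the stumps only compare feature values against thresholds.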