From: Gungor P. <pol...@Pr...> - 2008-08-12 19:15:12
Hi Everybody,

I am working on a modification of boosting for genomic data, and I have some questions about JBoost.

1) My first question is about the weight input. Its meaning (higher weight implies greater importance to classify correctly) is fundamentally important for us, since it is the heart of our research project. How does the algorithm do that? Is there a paper related to this idea, or is it just a practical empirical method that works by changing the initial distribution? Do you guys know anything about that? Any information about this will help me very much. Also, what is the bug currently in the weighting? Looking forward to the news. The documentation entry reads:

    |weight| an initial weighting of the data (higher weight implies greater
    importance to classify correctly). THERE IS A BUG IN WEIGHTING IN MOST
    VERSIONS. MORE NEWS SOON. default = 1.0. Optional.

2) My second question is about the weak learner JBoost uses. Since my features are real-valued (not binary or discrete, but real numbers from -inf to +inf), I think I should use decision stumps with real thresholds. Does the algorithm account for this (a simpler stump for binary features and another kind for real-valued ones)?

3) For the modification, is all of the source code in the src folder?

Best,
Gungor
From: Aaron A. <aa...@cs...> - 2008-08-12 19:43:50
Hi Gungor,

Glad to hear you're working with JBoost! See comments inline below.

On Tue, 12 Aug 2008, Gungor Polatkan wrote:

> 1) My first question is about the weight input. Its meaning (higher
> weight implies greater importance to classify correctly) is fundamentally
> important for us, since it is the heart of our research project. How does
> the algorithm do that? Is there a paper related to this idea, or is it
> just a practical empirical method that works by changing the initial
> distribution? Also, what is the bug currently in the weighting?

The bug in weighting has still not been fixed. All I know is that the final output from data with weights is not as would be expected (I verified this myself several months ago). The weighting itself is read in correctly (as far as I could tell from the output), but the way it is applied is somehow buggy. It is only applied in a couple of locations, so it is somewhat unnerving that it causes such abnormal behavior.

There are many other ways to reweight your data besides the provided weight option. Depending on how large and extreme your class imbalance is, you can simply oversample the smaller class before feeding the data to JBoost. Keep in mind that the first weak hypothesis is always "Class is +1", so any lopsidedness in the data will be reweighted by the score given to this classifier and the subsequent reweighting of examples.

NOTE: The "Class is +1" classifier will rebalance data that isn't *too* skewed in class distribution. The fact that it doesn't balance the classes when they are heavily skewed is considered a small bug (I verified this as well, around the same time I verified the weight bug). However, if you oversample the data so that the classes aren't *too* skewed (I've done 10:1 without problems), then sliding the score for "Class is +1" should give you control over sensitivity/specificity.

> 2) My second question is about the weak learner JBoost uses. Since my
> features are real-valued (not binary or discrete, but real numbers from
> -inf to +inf), I think I should use decision stumps with real thresholds.
> Does the algorithm account for this (a simpler stump for binary features
> and another kind for real-valued ones)?

If you run JBoost with the default boosting parameters, it will use decision stumps as weak learners. Boolean values can be seen as a subset of real values (-1 is false, +1 is true), and the decision stumps would then be "<0" for false and ">0" for true.

Also, I believe I remember there's a bug with "+inf" and "-inf" feature values (as may appear in some data files). I'd recommend replacing all +inf values with a real value larger than all other values (and all -inf values with one smaller than all other values). The weak learning algorithms will treat the largest (smallest) values as +inf (-inf).

Try the default parameters for boosting and let me know if you need any more guidance on this topic.

> 3) For the modification, is all of the source code in the src folder?

Yes. There are some scripts in jboost-VERSION/scripts that are helpful for visualizing the output, but all the code you'll likely want to edit is in jboost-VERSION/src.

Let me know if this answers your questions or if you have any other inquiries.

Aaron
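To make the two preprocessing suggestions above concrete (oversampling the smaller class and replacing infinite feature values), here is a minimal sketch. It assumes a comma-separated data file whose last field is a +1/-1 label; the file names, the 10:1 target ratio, and the BIG sentinel value are illustrative assumptions, not part of JBoost itself.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class RebalanceData {

        // Illustrative sentinel replacing +inf/-inf; pick something larger
        // (smaller) than any real value occurring in your data.
        private static final double BIG = 1e12;

        public static void main(String[] args) throws IOException {
            // Hypothetical file names; adjust to your own .data layout.
            List<String> lines = Files.readAllLines(Paths.get("train.data"));
            List<String> pos = new ArrayList<>();
            List<String> neg = new ArrayList<>();

            for (String line : lines) {
                String[] f = line.split(",");
                // Clamp infinite feature values, as recommended above.
                for (int i = 0; i < f.length - 1; i++) {
                    String v = f[i].trim();
                    if (v.equals("+inf") || v.equals("inf") || v.equals("Infinity")) {
                        f[i] = Double.toString(BIG);
                    } else if (v.equals("-inf") || v.equals("-Infinity")) {
                        f[i] = Double.toString(-BIG);
                    }
                }
                // Assumes the label is the last comma-separated field.
                if (f[f.length - 1].trim().startsWith("+1")) {
                    pos.add(String.join(",", f));
                } else {
                    neg.add(String.join(",", f));
                }
            }

            // Oversample the smaller class by duplication until the class
            // ratio is roughly 10:1 or better (the ratio reported to work).
            List<String> small = pos.size() < neg.size() ? pos : neg;
            List<String> large = pos.size() < neg.size() ? neg : pos;
            List<String> out = new ArrayList<>(large);
            int copies = Math.max(1, (int) Math.ceil(
                    large.size() / (10.0 * Math.max(1, small.size()))));
            for (int c = 0; c < copies; c++) {
                out.addAll(small);
            }

            Files.write(Paths.get("train.balanced.data"), out);
        }
    }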
From: Gungor P. <pol...@Pr...> - 2008-08-12 22:57:50
Hi Aaron,

Thank you very much for your detailed explanations. The major issue for me was how the current algorithm uses the weights. Is the implementation buggy, or the idea behind it? If only the implementation is buggy, then what I am curious about is the idea behind weighting the data in the current implementation. I am looking forward to good news on this.

I also looked at the code, and I am confused about the variable names. Is the distribution that is updated at each iteration of boosting also referred to as "weights"?

Thanks again,
Best,
Gungor
From: Aaron A. <aa...@cs...> - 2008-08-13 04:50:43
On Tue, 12 Aug 2008, Gungor Polatkan wrote:

> The major issue for me was how the current algorithm uses the weights. Is
> the implementation buggy, or the idea behind it? If only the
> implementation is buggy, then what I am curious about is the idea behind
> weighting the data in the current implementation.

The idea itself is not buggy. The current implementation is buggy.

> I also looked at the code, and I am confused about the variable names. Is
> the distribution that is updated at each iteration of boosting also
> referred to as "weights"?

The member variable m_sampleWeights of AdaBoost is the weight as read in from the data file. The member variable m_weights is an array of weights, one for each example, for a given iteration. m_weights is sometimes referred to as "D" in AdaBoost papers, or the "distribution across examples".

The variable name reminds me that m_sampleWeights is meant to be used with boosting-by-sampling, which I'm not sure was ever fully implemented with a standard interface. It should still work for your purposes... but it doesn't, since there's likely a bug somewhere.

Aaron
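For readers puzzling over the same distinction: below is a minimal sketch of how an initial per-example weight conventionally seeds AdaBoost's distribution, and how that distribution is then updated each round. The field names m_sampleWeights and m_weights follow the description above, but everything else is an illustrative reconstruction of textbook AdaBoost, not JBoost's actual (buggy) implementation.

    // Sketch only: how initial per-example weights are normally folded into
    // AdaBoost's example distribution ("D" in the AdaBoost papers).
    public class AdaBoostWeightSketch {
        private final double[] m_sampleWeights; // initial weights from the data file
        private final double[] m_weights;       // current distribution D_t over examples

        public AdaBoostWeightSketch(double[] sampleWeights) {
            m_sampleWeights = sampleWeights.clone();
            m_weights = new double[sampleWeights.length];
            // D_1(i) is proportional to the user-supplied weight, so a heavier
            // example costs more to misclassify from the very first iteration.
            double total = 0;
            for (double w : m_sampleWeights) total += w;
            for (int i = 0; i < m_weights.length; i++) {
                m_weights[i] = m_sampleWeights[i] / total;
            }
        }

        // Standard AdaBoost update:
        //   D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h(x_i)) / Z_t
        // labels[i] is +1/-1; predictions[i] is the weak hypothesis output on i.
        public void update(double alpha, int[] labels, double[] predictions) {
            double z = 0;
            for (int i = 0; i < m_weights.length; i++) {
                m_weights[i] *= Math.exp(-alpha * labels[i] * predictions[i]);
                z += m_weights[i];
            }
            for (int i = 0; i < m_weights.length; i++) {
                m_weights[i] /= z; // renormalize so D_{t+1} sums to 1
            }
        }
    }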