From: Aaron A. <aa...@cs...> - 2008-08-13 04:50:43
On Tue, 12 Aug 2008, Gungor Polatkan wrote:

> The major issue for me was how the current algorithm uses the
> weights. Is the implementation buggy, or the idea behind it? If the
> implementation is buggy, the thing I am curious about is the idea
> behind weighting the data in the current implementation.

The idea itself is not buggy. The current implementation is buggy.

> I also looked at the code and I am confused about the variable names.
> Does it refer to the distribution updated at each iteration of
> boosting as weights as well?

The member variable m_sampleWeights of AdaBoost is the weight as read
in from the data file. The member variable m_weights is an array of
weights, one for each example, for a given iteration. m_weights is
sometimes referred to as "D" in AdaBoost papers, or the "distribution
across examples".

The variable name reminds me that m_sampleWeights is meant to be used
with boosting-by-sampling, which I'm not sure was ever fully
implemented with a standard interface. It should still work for your
purposes... but it doesn't, since there's likely a bug somewhere.
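To make the distinction concrete, here is roughly what a boosting loop
does with the two kinds of weights. This is a sketch of textbook
AdaBoost, not the JBoost source; the names sampleWeights and d are
only meant to mirror m_sampleWeights and m_weights:

public final class AdaBoostWeightsSketch {

    /**
     * @param sampleWeights per-example weights from the data file
     *                      (the analogue of m_sampleWeights)
     * @param labels        class labels, each -1 or +1
     * @param predictions   predictions[t][i] = h_t(x_i), each -1 or +1
     * @param alphas        alphas[t] = coefficient of weak hypothesis t
     * @return the final distribution over examples (the analogue of
     *         m_weights, i.e. "D")
     */
    static double[] finalDistribution(double[] sampleWeights, int[] labels,
                                      int[][] predictions, double[] alphas) {
        int n = labels.length;
        double[] d = new double[n];

        // D_1(i) is the normalized sample weight: the data-file
        // weights only choose the starting distribution.
        double total = 0.0;
        for (double w : sampleWeights) total += w;
        for (int i = 0; i < n; i++) d[i] = sampleWeights[i] / total;

        // Each round: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        for (int t = 0; t < alphas.length; t++) {
            double z = 0.0;
            for (int i = 0; i < n; i++) {
                d[i] *= Math.exp(-alphas[t] * labels[i] * predictions[t][i]);
                z += d[i];
            }
            for (int i = 0; i < n; i++) d[i] /= z;  // normalize by Z_t
        }
        return d;
    }
}

The point is that the sample weights only set the initial
distribution; from then on boosting reweights the examples on its own
each round, which is also why a bug in applying the initial weights
can distort every later round.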
Aaron

> Aaron Arvey wrote:
>> Hi Gungor,
>>
>> Glad to hear you're working with JBoost!
>>
>> See comments inline below.
>>
>> On Tue, 12 Aug 2008, Gungor Polatkan wrote:
>>
>>> 1) First question is about the weight input. The meaning (higher
>>> weight implies greater importance to classify correctly) is
>>> fundamentally important for us, since it is the heart of our
>>> research project. How does the algorithm do that? Is there any
>>> paper related to this idea, or is it just a practical empirical
>>> method of changing the initial distribution? Do you know anything
>>> about that? Any information about this will help me very much.
>>> Also, what is the bug currently in the weighting? Looking forward
>>> to the news...
>>>
>>> |weight| an initial weighting of the data (higher weight implies
>>> greater importance to classify correctly). THERE IS A BUG IN
>>> WEIGHTING IN MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional
>>
>> The bug in weighting has still not been fixed. All I know is that
>> the final output from data with weights is not as would be expected
>> (I verified this myself several months ago). The weighting itself
>> is read in correctly (from what I could tell by the output), but
>> the way it is applied is somehow buggy. It is only applied in a
>> couple of locations, so it is somewhat unnerving that it causes
>> such abnormal behavior.
>>
>> There are many other ways to reweight your data besides the
>> provided weight option. Depending on how large and extreme your
>> class distributions are, you can simply oversample the smaller
>> class prior to input to JBoost. Keep in mind that the first weak
>> hypothesis is always "Class is +1", so any lopsidedness in the data
>> will be reweighted by the score given to this classifier and the
>> subsequent reweighting of examples.
>>
>> NOTE: The "Class is +1" classifier will rebalance data that isn't
>> *too* skewed in class distribution. The fact that it doesn't
>> balance the classes when they are heavily skewed is considered a
>> small bug (I verified this as well, around the same time I verified
>> the weight bug). However, if you oversample the data so that the
>> classes aren't *too* skewed (I've done 10:1 without a problem),
>> then sliding the score for "Class is +1" should give you control
>> over sensitivity/specificity.
>>
>>> 2) Second question is about the weak learner JBoost uses. Since my
>>> data features are real-valued (not binary or discrete, but real
>>> numbers from -inf to +inf), I think I should use decision stumps
>>> with real thresholds. Does the algorithm consider such a thing (a
>>> simpler stump for binary features and a different one for
>>> real-valued features)?
>>
>> If you run JBoost with default boosting parameters, it will use
>> decision stumps for weak learners. Boolean values can be seen as a
>> subset of real values (-1 is false, +1 is true), and the decision
>> stump would then be "< 0" for false and "> 0" for true.
>>
>> Also, I believe I remember there's a bug with "+inf" and "-inf"
>> values (as may happen in your output). I'd recommend replacing all
>> -inf and +inf values with finite values smaller/larger than all
>> other values; the weak learning algorithms will treat the smallest
>> (largest) values as -inf (+inf). There is a small sketch of this at
>> the bottom of this message.
>>
>> Try the default parameters for boosting and let me know if you need
>> any more guidance on this topic.
>>
>>> 3) For the modification, is all the source code in the src folder?
>>
>> Yes. There are some scripts in jboost-VERSION/scripts that are
>> helpful for visualizing the output, but all the code you'll likely
>> want to edit is in jboost-VERSION/src.
>>
>> Let me know if this answers your questions or if you have any other
>> inquiries.
>>
>> Aaron
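P.S. On the +inf/-inf point above: here is a hypothetical
pre-processing step, not part of JBoost (the helper name clampColumn
and the margin of 1.0 are my own inventions), that replaces infinities
in one feature column with finite sentinels just beyond the observed
range before the data file is written:

import java.util.Arrays;

public final class ClampInfinities {

    /** Replaces +/-inf in one feature column with finite sentinels. */
    static double[] clampColumn(double[] values) {
        // Find the finite range of the column.
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            if (!Double.isInfinite(v) && !Double.isNaN(v)) {
                max = Math.max(max, v);
                min = Math.min(min, v);
            }
        }
        // Map +inf above the finite max and -inf below the finite min.
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] == Double.POSITIVE_INFINITY)      out[i] = max + 1.0;
            else if (values[i] == Double.NEGATIVE_INFINITY) out[i] = min - 1.0;
            else                                            out[i] = values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] col = {0.5, Double.POSITIVE_INFINITY, -2.0,
                        Double.NEGATIVE_INFINITY};
        // Prints [0.5, 1.5, -2.0, -3.0]
        System.out.println(Arrays.toString(clampColumn(col)));
    }
}

Any finite sentinel beyond the observed range should behave the same
way, since the stumps only compare feature values against thresholds.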