Re: [Jboost-users] Questions about JBoost implementations

Hey Glenn,

I haven't used JBoost in a while, but I have a couple of guesses that may 
answer your questions.

> I ran Predict ("java -cp .:../dist/jboost.jar Predict <
> spambase.data") against the original data.  I got two columns of
> output that looked like
> 
> 5.00073612523801        -5.00073612523801
> 11.864681207163063      -11.864681207163063
> 8.780744089260097       -8.780744089260097
> ...
> Why are there two columns with the same magnitudes?  I'm guessing that
> these are is/is not spam scores, but they seem redundant.
> 

Guess: One is the margin and the other is the classification score.  You 
can check this by computing labels*column1 (or labels*column2) and seeing 
whether the result matches the other column.
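
If it helps, here is a rough sketch of that check in Python.  The function 
name and the toy numbers are made up for illustration; they are not JBoost 
output, and you would substitute the two Predict columns and the labels 
from spambase.data.

```python
def is_margin_pair(labels, col_a, col_b, tol=1e-9):
    """Return True if col_b[i] equals labels[i] * col_a[i] on every row."""
    return all(abs(y * a - b) <= tol for y, a, b in zip(labels, col_a, col_b))

# Hypothetical toy values, NOT real JBoost output.
labels = [+1, -1, +1]
scores = [2.0, -1.5, 0.5]
margins = [2.0, 1.5, 0.5]      # labels * scores, row by row
negated = [-2.0, 1.5, -0.5]    # plain sign flip of scores

print(is_margin_pair(labels, scores, margins))  # True
print(is_margin_pair(labels, scores, negated))  # False
```

Since the labels are +1 or -1, the check is symmetric, so testing one 
direction is enough.  If it fails on rows labeled +1 because the two 
columns are exact negatives there, then my margin guess is wrong.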

> It would seem that changing a value in the first line of spambase.data 
> would change the classification score I see above, but it doesn't.  I 
> changed the first value in
> 
> 0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,+1;
> 
> from 0 to other values, but the first classification score
> (5.00073612523801) didn't change.  Why is that?

Guess: Boosting doesn't produce a linear classifier, so not every feature 
has to affect the score.  Depending on the number of iterations used, the 
model may use fewer dimensions than exist in the data.  In fact, even if 
you change every value in an example, the score may still be the same, as 
long as no value crosses a threshold.  This is because JBoost uses 
thresholding weak classifiers.  If you look at the actual tree (either in 
the raw file or via the visualization described in the documentation), you 
should be able to determine which dimensions were used and at what 
thresholds.  If you change one of these dimensions so that it is on the 
other side of the threshold, you should see a change in the output value.
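
To make the threshold point concrete, here is a small Python sketch.  It 
is a deliberately simplified stand-in (a flat sum of threshold rules), not 
the tree JBoost actually learned; the feature indices, thresholds, and 
weights are all invented for illustration.

```python
# Hypothetical threshold weak classifiers:
# (feature_index, threshold, score_if_above, score_otherwise).
# These values are made up; they do not come from a real JBoost model.
STUMPS = [
    (1, 0.5, 1.2, -0.8),
    (3, 100.0, 0.9, -0.4),
]

def score(example):
    """Sum the contribution of every threshold rule for one example."""
    total = 0.0
    for idx, threshold, above, otherwise in STUMPS:
        total += above if example[idx] > threshold else otherwise
    return total

example = [0.0, 0.778, 3.756, 278.0]
base = score(example)

# Feature 0 is not used by any rule, so changing it has no effect.
unused_changed = [99.0, 0.778, 3.756, 278.0]
# Feature 1 is used, but 0.9 is still above the 0.5 threshold.
same_side = [0.0, 0.9, 3.756, 278.0]
# Feature 1 now falls below the 0.5 threshold, so the score changes.
crossed = [0.0, 0.1, 3.756, 278.0]

print(score(unused_changed) == base)  # True
print(score(same_side) == base)       # True
print(score(crossed) == base)         # False
```

Only the last change moves a used feature across its threshold, and it is 
the only one that changes the score.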

Hope that helps!

Aaron