From: Aaron A. <aa...@cb...> - 2010-12-13 19:51:56
Hey Glenn,
I haven't used JBoost in a while, but I have a couple of guesses that may
answer your questions.
> I ran Predict ("java -cp .:../dist/jboost.jar Predict <
> spambase.data") against the original data. I got two columns of
> output that looked like
>
> 5.00073612523801 -5.00073612523801
> 11.864681207163063 -11.864681207163063
> 8.780744089260097 -8.780744089260097
> ...
> Why are there two columns with the same magnitudes? I'm guessing that
> these are is/is not spam scores, but they seem redundant.
>
Guess: One column is the margin and the other is the classification
score. You can check this by computing labels*column1 and labels*column2
and seeing whether either result matches the other column.
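
Here is a rough sketch of that check in Java, in case it helps. It assumes you have already parsed the labels (+1/-1) and the two Predict output columns into arrays; the class name, method name, and sample numbers below are made up for illustration and are not part of JBoost.

```java
public class MarginCheck {
    // Returns true if labels[i] * candidate[i] equals other[i] for every row
    // (up to a small tolerance). Labels are assumed to be +1 or -1.
    static boolean matchesLabelTimes(int[] labels, double[] candidate, double[] other) {
        for (int i = 0; i < labels.length; i++) {
            if (Math.abs(labels[i] * candidate[i] - other[i]) > 1e-9) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical values, NOT real JBoost output.
        int[] labels = {+1, -1, +1};
        double[] column1 = {5.0, -2.0, 3.0};
        double[] column2 = {5.0, 2.0, 3.0};

        // If this prints true, one column is the label times the other,
        // which is consistent with a score/margin pair.
        System.out.println(matchesLabelTimes(labels, column1, column2));
    }
}
```

Because the labels are +1 or -1, the test is symmetric: if labels*column1 matches column2, then labels*column2 matches column1 as well. To tell which column is the margin, look for the one that is positive on every correctly classified example.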
> It would seem that changing a value in the first line of spambase.data
> would change the classification score I see above, but it doesn't. I
> changed the first value in
>
> 0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,+1;
>
> from 0 to other values, but the first classification score
> (5.00073612523801) didn't change. Why is that?
Guess: Boosting doesn't produce a linear classifier. Depending on the
number of iterations used, the classifier may use fewer dimensions than
exist in the data. In fact, even if you change every value in an example,
the score may still be the same. This is because JBoost uses thresholding
weak classifiers. If you look at the actual tree (either in the raw file,
or see the documentation about visualization), you should be able to
determine which dimensions were used and at what thresholds. If you change
one of these dimensions so that it falls on the other side of its
threshold, you should see a change in the output value.
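
To make the thresholding point concrete, here is a toy sketch in Java. It is a flat sum of two threshold tests, which is a simplification and not JBoost's actual tree format; the feature indices, thresholds, and output values are all invented for illustration.

```java
public class StumpScoreDemo {
    // Two hypothetical threshold weak classifiers ("stumps").
    // Stump k looks only at feature FEATURE[k].
    static final int[] FEATURE = {52, 6};
    static final double[] THRESHOLD = {0.05, 0.01};
    static final double[] ABOVE = {1.5, 0.75};   // added when the value is > threshold
    static final double[] BELOW = {-1.0, -0.5};  // added otherwise

    // The score is the sum of the stump outputs; features that no stump
    // tests cannot influence it at all.
    static double score(double[] x) {
        double s = 0.0;
        for (int k = 0; k < FEATURE.length; k++) {
            s += x[FEATURE[k]] > THRESHOLD[k] ? ABOVE[k] : BELOW[k];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] x = new double[57];  // made-up example, all zeros to start
        x[52] = 0.778;
        System.out.println("original:              " + score(x));

        x[0] = 99.0;   // feature 0 is not tested by any stump
        System.out.println("changed unused feature: " + score(x));

        x[52] = 0.9;   // still on the same side of the 0.05 threshold
        System.out.println("same side of threshold: " + score(x));

        x[6] = 0.5;    // crosses the 0.01 threshold
        System.out.println("crossed a threshold:    " + score(x));
    }
}
```

Only the last change moves the score, because it is the only one that pushes a tested feature across its threshold.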
Hope that helps!
Aaron