From: Aaron A. <aa...@cb...> - 2010-12-13 19:51:56
Hey Glenn,

I haven't used JBoost in a while, but I have a couple of guesses that may answer your questions.

> I ran Predict ("java -cp .:../dist/jboost.jar Predict <
> spambase.data") against the original data. I got two columns of
> output that looked like
>
> 5.00073612523801 -5.00073612523801
> 11.864681207163063 -11.864681207163063
> 8.780744089260097 -8.780744089260097
> ...
> Why are there two columns with the same magnitudes? I'm guessing that
> these are is/is not spam scores, but they seem redundant.
>

Guess: One is the margin and the other is the classification score. You can check this by computing labels*column1 or labels*column2 and seeing whether the result matches the other column.

> It would seem that changing a value in the first line of spambase.data
> would change the classification score I see above, but it doesn't. I
> changed the first value in
>
> 0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,+1;
>
> from 0 to other values, but the first classification score
> (5.00073612523801) didn't change. Why is that?

Guess: Boosting doesn't produce a linear classifier. Depending on the number of iterations used, the classifier may use fewer dimensions than exist in the data. In fact, even if you change every value in an example, the score may still be the same, as long as no value crosses a threshold. This is because JBoost uses thresholding weak classifiers. If you look at the actual tree (either in the raw file, or see the documentation about visualization), you should be able to determine which dimensions were used and at what thresholds. If you change one of these dimensions so that it is on the other side of the threshold, you should see a change in the output value.

Hope that helps!

Aaron
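
To make the thresholding guess concrete, here is a minimal sketch of a boosted score built from thresholded weak classifiers. It is not JBoost code and does not read a JBoost model: the feature indices, thresholds, and per-branch scores in STUMPS are made up for illustration, and the example vector is hypothetical. It only shows why editing a value may leave the score unchanged.

```python
# Toy ensemble of thresholded weak classifiers (decision stumps).
# ASSUMPTION: feature indices, thresholds and scores below are invented
# for illustration; they are not taken from a real JBoost tree.
STUMPS = [
    # (feature index, threshold, score if value <= threshold, score if value > threshold)
    (51, 0.5, -1.25, 1.5),
    (54, 3.0, -0.75, 2.0),
    (56, 100.0, -0.5, 1.0),
]


def score(example):
    """Sum the contribution of every stump for one example."""
    total = 0.0
    for feature, threshold, below, above in STUMPS:
        total += below if example[feature] <= threshold else above
    return total


# Hypothetical 57-dimensional example; only three entries are non-zero.
example = [0.0] * 57
example[51] = 0.778
example[54] = 3.756
example[56] = 278.0

base = score(example)

# 1. Feature 0 is not tested by any stump, so changing it has no effect.
changed_unused = list(example)
changed_unused[0] = 99.0

# 2. Feature 54 is tested, but 10.0 is still above its threshold of 3.0.
same_side = list(example)
same_side[54] = 10.0

# 3. Moving feature 54 below its threshold flips that stump's contribution.
crossed = list(example)
crossed[54] = 1.0

print(base, score(changed_unused), score(same_side), score(crossed))
```

In this toy model only the third edit changes the score, because it is the only one that moves a tested dimension to the other side of its threshold.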