From: Glenn M. <gle...@gm...> - 2010-12-13 22:16:38
Hi Aaron,

Thanks for the prompt reply. Your comment on using several weak
thresholding classifiers makes sense. Changing all the values (0 -> 1)
in the first data line did indeed change the classification score.

I don't think you're right about the two columns, though. Since they
always have the same magnitude, I looked into the code and saw that it
is in fact printing out {p, -p}, where p, it seems, is the prediction.
It turns out that margin information can be generated when the tree is
created.

The generated comment for predict(String[] as) says it returns "an
array of scores corresponding to the classes: +1 and -1". Are "classes"
the same as labels?

Thanks,
Glenn

On Mon, Dec 13, 2010 at 12:51 PM, Aaron Arvey <aa...@cb...> wrote:
> Hey Glenn,
>
> I haven't used JBoost in a while, but I have a couple of guesses that may
> answer your questions.
>
>> I ran Predict ("java -cp .:../dist/jboost.jar Predict <
>> spambase.data") against the original data. I got two columns of
>> output that looked like
>>
>> 5.00073612523801 -5.00073612523801
>> 11.864681207163063 -11.864681207163063
>> 8.780744089260097 -8.780744089260097
>> ...
>> Why are there two columns with the same magnitudes? I'm guessing that
>> these are is/is not spam scores, but they seem redundant.
>>
>
> Guess: One is margin and the other is classification score. You can
> determine this by looking at labels*column1 or labels*column2 and seeing
> whether the results match the other column.
>
>> It would seem that changing a value in the first line of spambase.data
>> would change the classification score I see above, but it doesn't. I
>> changed the first value in
>>
>> 0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,+1;
>>
>> from 0 to other values, but the first classification score
>> (5.00073612523801) didn't change. Why is that?
>
> Guess: Boosting doesn't produce a linear classifier. Depending on the
> number of iterations used, you may have used fewer dimensions than exist
> in the data. In fact, even if you change every value in an example, the
> score may still be the same. This is due to JBoost using thresholding
> weak classifiers. If you look at the actual tree (either at the raw file
> or see documentation about visualization), you should be able to determine
> which dimensions were used and at what thresholds. If you change one of
> these dimensions so that it is on the other side of the threshold, you
> should see a change in output value.
>
> Hope that helps!
>
> Aaron
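
P.S. To make the threshold behaviour and the {p, -p} output concrete, here is a minimal sketch. It is not JBoost's code or API: the class StumpScoreSketch, the Stump type, the methods score, classScores and margin, and all thresholds and weights are made up for illustration, and it uses a flat sum of stumps rather than the tree JBoost builds. It assumes the score p is a sum of threshold votes, that the two printed columns are p and -p, and that the margin is label * p.

```java
import java.util.Arrays;

/** Hypothetical sketch of a boosted sum of threshold stumps; not JBoost's API. */
public class StumpScoreSketch {

    /** One thresholding weak classifier: votes +alpha if x[feature] > threshold, else -alpha. */
    public static final class Stump {
        final int feature;
        final double threshold;
        final double alpha;

        public Stump(int feature, double threshold, double alpha) {
            this.feature = feature;
            this.threshold = threshold;
            this.alpha = alpha;
        }

        double vote(double[] x) {
            return x[feature] > threshold ? alpha : -alpha;
        }
    }

    /** Score p: the sum of all stump votes for example x. */
    public static double score(Stump[] stumps, double[] x) {
        double p = 0.0;
        for (Stump s : stumps) {
            p += s.vote(x);
        }
        return p;
    }

    /** Two-column output {p, -p}: one score for class +1 and one for class -1. */
    public static double[] classScores(Stump[] stumps, double[] x) {
        double p = score(stumps, x);
        return new double[] {p, -p};
    }

    /** Margin of a labeled example: label * p, positive when the sign of p matches the label. */
    public static double margin(Stump[] stumps, double[] x, int label) {
        return label * score(stumps, x);
    }

    public static void main(String[] args) {
        // Made-up stumps that only look at features 1 and 3.
        Stump[] stumps = {
            new Stump(1, 0.5, 2.0),
            new Stump(3, 10.0, 1.5),
        };
        double[] x = {0.0, 0.64, 0.64, 61.0};
        System.out.println(Arrays.toString(classScores(stumps, x))); // [3.5, -3.5]

        x[0] = 99.0; // feature 0 is used by no stump, so the score cannot change
        System.out.println(score(stumps, x)); // 3.5

        x[1] = 0.9; // still on the same side of the 0.5 threshold
        System.out.println(score(stumps, x)); // 3.5

        x[1] = 0.1; // crosses the threshold, so the first stump flips its vote
        System.out.println(score(stumps, x)); // -0.5
        System.out.println(margin(stumps, x, +1)); // -0.5
    }
}
```

In this toy model the second column carries no extra information, and the margin equals column 1 for examples labeled +1 and column 2 for examples labeled -1.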