From: Aaron A. <aa...@cs...> - 2008-08-16 00:49:19
On Fri, 15 Aug 2008, Viren Jain wrote:
> Sounds great, I will check it out. One question, what do you mean "by
> default"? I.e., if I do not include the -a option at all? Or with no
> iteration specified?
If you do not specify -a (or if you specify "-a 0"), BrownBoost algorithms
will print output on their final iteration. I'm going to make this the
default behavior for all boosting algorithms.
Aaron
> On Aug 15, 2008, at 5:54 PM, Aaron Arvey wrote:
>
> Hey Viren,
>
> Update the CVS and the data will now be output on the last iteration by
> default when using BrownBoost or subclasses.
>
> Just so you know, you are using code that is about to be completely rewritten,
> which is fine, but you may want to save a copy of the version of the JBoost
> code you currently have so that possible bugs in the new code do not destroy
> any good results you achieve with this version. Alternatively, the
> "improvements" in BrownBoost may give you even better performance!
>
> Also, as long as you're testing out some of the new JBoost features, you may
> want to check out the new R visualization scripts in ./scripts. There's a
> README file with basic documentation and more documentation will appear on
> the website soon. The python files are outdated, probably buggy, and
> generate ugly pictures. If you try the R visualizations and have any
> problems, let me know and CC the jboost-users list.
>
> Glad to hear you're having a good time with JBoost!
>
> Aaron
>
>
> On Fri, 15 Aug 2008, Viren Jain wrote:
>
>> Hi Aaron,
>> One other random question- I've started experimenting with BrownBoost,
>> which is useful due to lots of noisy examples in my training set. Since I
>> use the -r option to specify how "long" to train, is there a way to tell
>> JBoost to only output training/test info at the last epoch (even though I
>> don't know what the last epoch will necessarily be)?
>> Thanks! And great job with JBoost, it's really a fun and useful tool to
>> experiment with boosting.
>> Thanks again,
>> Viren
>>
>> On Aug 13, 2008, at 5:06 PM, Aaron Arvey wrote:
>>
>> There are a few papers on asymmetric classification. Unfortunately, most of
>> them focus on changing weights on examples (or slack variables in SVMs) and
>> not on maximizing the margin. Maximizing margins is one of the biggest
>> reasons boosting and SVMs are so successful, so ignoring this when doing
>> asymmetric prediction would seem to be a bad idea.
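
For concreteness, here is a generic sketch (not JBoost code, and not from any particular paper) of the example-reweighting idea: plain AdaBoost over one-dimensional decision stumps, except that each example's initial weight is proportional to its misclassification cost. All data and the 5:1 cost ratio are made up.

```python
import math

# Generic sketch (not JBoost code) of the example-reweighting approach
# to asymmetric cost: AdaBoost over 1-D decision stumps, with initial
# weights proportional to each example's misclassification cost.

def stump(x, thresh, sign):
    """Predict sign if x > thresh, else -sign."""
    return sign if x > thresh else -sign

def train(xs, ys, costs, rounds=5):
    total = sum(costs)
    w = [c / total for c in costs]          # cost-weighted initial weights
    ensemble = []                           # (alpha, thresh, sign) triples
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        err, thresh, sign = min(
            (sum(wi for wi, x, y in zip(w, xs, ys) if stump(x, t, s) != y), t, s)
            for t in sorted(set(xs)) for s in (+1, -1))
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thresh, sign))
        # Standard AdaBoost weight update, then renormalize.
        w = [wi * math.exp(-alpha * y * stump(x, thresh, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def score(ensemble, x):
    return sum(a * stump(x, t, s) for a, t, s in ensemble)

xs = [1.0, 2.0, 3.0, 4.0]       # toy feature values
ys = [-1, -1, +1, +1]           # toy labels
costs = [5.0, 5.0, 1.0, 1.0]    # negatives cost 5x to misclassify
ens = train(xs, ys, costs)
print([1 if score(ens, x) > 0 else -1 for x in xs])
```

Note this only changes the starting distribution; as the thread points out, it does nothing to directly maximize the margin.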
>>
>> Some examples can be found at a website I have setup to remind myself of
>> all the papers in boosting. Just google "boosting papers."
>>
>> Aaron
>>
>>
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Great! The compilation and fix seem to work.
>>> The sliding bias approach is basically what I am doing now; only using
>>> positive classifications with some minimum margin. Is there an interesting
>>> prior literature out there on learning with asymmetric cost..? I am
>>> interested in this issue.
>>> Thanks again for all your help, sorry about all the trouble!
>>> Viren
>>> On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>> I downloaded the CVS version and tried to compile it, but ran into the
>>>> following errors. Do you think this is probably just an outdated Java
>>>> version issue?
>>>>
>>>> [javac] Compiling 46 source files to /home/viren/jboost/build
>>>> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
>>>> [javac]   symbol  : method getPositiveInitialPotential()
>>>> [javac]   location: class jboost.booster.BrownBoost
>>>> [javac]   System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
>>>> ... etc.
>>> This is the region that is currently under heavy construction. Instead of
>>> trying to debug this, I just commented out the calls and everything
>>> compiled for me. Try a 'cvs update' and 'ant jar'.
>>>> Regarding the cost function - the distribution has maybe 70% negative
>>>> examples and 30% positive examples, so not horrendously imbalanced, but
>>>> for our application a false positive is extremely costly (let's say, 5
>>>> times as costly) compared to a false negative.
>>> For the time being, you can just use a "sliding bias" approach, where
>>> you add a value X to the root node (which is the same as adding X to the
>>> final score, since all examples go through the root). The root node
>>> actually balances the weights on the dataset (it satisfies
>>> E_{w_{i+1}}[h(x)y] = 0, see the equation in Schapire & Singer '99, where
>>> h(x) == "Always +1").
>>> This isn't perfect, but check again in a couple weeks and we should have a
>>> much more technical (and much cooler) approach available based on some new
>>> math from drifting games.
>>> If this doesn't give very good results (no value of X gives you the
>>> sensitivity/specificity tradeoff desired), I've picked up a few other
>>> hacks that may help.
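
A toy illustration of the sliding-bias idea above (made-up scores, labels, and cost ratio, not JBoost output): add X to every final score and pick the X that minimizes an asymmetric cost, here 5 * FP + FN for the 5:1 ratio mentioned earlier.

```python
# Toy sketch (made-up data, not JBoost output) of the sliding-bias trick:
# scan candidate biases X and keep the one minimizing 5 * FP + FN.

def confusion(scores, labels, bias):
    """Predict +1 when score + bias > 0; return (false_pos, false_neg)."""
    fp = sum(1 for s, y in zip(scores, labels) if s + bias > 0 and y == -1)
    fn = sum(1 for s, y in zip(scores, labels) if s + bias <= 0 and y == +1)
    return fp, fn

scores = [0.9, 0.4, -0.1, -0.6, 0.2, -0.3]   # hypothetical held-out margins
labels = [+1, -1, +1, -1, -1, +1]

def cost(bias):
    fp, fn = confusion(scores, labels, bias)
    return 5 * fp + fn   # false positives five times as costly

best_x = min([-0.5, -0.25, 0.0, 0.25, 0.5], key=cost)
print(best_x)
```

Sweeping X over a finer grid on a held-out set is the usual way to trace out the full sensitivity/specificity curve.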
>>> Aaron
>>>> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>>>> That was implemented in the BrownBoost and NormalBoost boosters.
>>>> Unfortunately, these boosters are currently (as in 5 minutes ago) being
>>>> completely rewritten.
>>>> There are still plenty of ways to artificially change the cost function,
>>>> and these work well for some applications.
>>>> What exactly are you trying to do? How asymmetric are the costs of your
>>>> mispredictions? How asymmetric is the distribution of your classes?
>>>> Aaron
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>> Great! One last question (for now :). I was a little confused by the
>>>>> documentation regarding asymmetric cost functions: is it currently
>>>>> possible to change the cost function such that false positives are more
>>>>> costly than false negatives?
>>>>> Thanks,
>>>>> Viren
>>>>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>>>> Viren,
>>>>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>>>> This will certainly cause a speed up in the release of the next version.
>>>>> Let me know if you have any problems with the CVS release.
>>>>> Nice catch!
>>>>> Aaron
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>> Hey Aaron,
>>>>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an
>>>>>> empty output in the .boosting.info files. I am using release 1.4, but I
>>>>>> can try the repository version.
>>>>>> Thanks!
>>>>>> Viren
>>>>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>>>> Hey Viren,
>>>>>> I just tried out
>>>>>> cd jboost/demo
>>>>>> ../jboost -numRounds 10 -a 9 -S stem
>>>>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>> ../jboost -numRounds 10 -a -2 -S stem
>>>>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>> And I see that this outputs the second to last iteration. When I try
>>>>>> cd jboost/demo
>>>>>> ../jboost -numRounds 10 -a 10 -S stem
>>>>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>> ../jboost -numRounds 10 -a -2 -S stem
>>>>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>> I see that the final iteration is output.
>>>>>> Let me know what you see when you run the above. If you see something
>>>>>> different, perhaps there used to be a bug and it was corrected. The code
>>>>>> to output files by the "-a" switch was recently updated, so perhaps
>>>>>> this bug was corrected (I updated it and have no memory of fixing this
>>>>>> bug, but perhaps I did...). Are you perhaps using an old version of
>>>>>> JBoost? Perhaps try out the cvs repository and see if that fixes your
>>>>>> problem.
>>>>>> Aaron
>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>> Thanks again, Aaron.
>>>>>>> I double-checked things and it seems I still see discrepancies in the
>>>>>>> classifier outputs. The exact jboost command I am using is:
>>>>>>> ... jboost.controller.Controller -S test_old_background_mporder
>>>>>>> -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m
>>>>>>> classify_background.m
>>>>>>> I assume there is some sort of zero-based counting, since if I use -a 300
>>>>>>> the .info.testing and .info.training files are 0 bytes. So if this is
>>>>>>> correct, then test_old_background_mporder.test.boosting.info should
>>>>>>> have identical outputs to those generated from the same examples by
>>>>>>> using classify_background.m?
>>>>>>> Again, thanks so much!
>>>>>>> Viren
>>>>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>> I'm actually using text strings for the labels, i.e., in the spec
>>>>>>>> file I have the line "labels (merge, split)" and then for
>>>>>>>> each example in training/test, I output the appropriate string. Do
>>>>>>>> you recommend I use (-1,1) instead?
>>>>>>> That's fine. I just assumed that since you said the labels were
>>>>>>> inverted, that meant you were using -1/+1. Using text is perfectly
>>>>>>> okay.
>>>>>>>> Also, what is the iteration on which Jboost outputs the matlab file
>>>>>>>> when I use the -m option? The last one?
>>>>>>> Yes, it is the last iteration. There should probably be an option
>>>>>>> (like -a) to output this more often.
>>>>>>> Aaron
>>>>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>>>> Hi Viren,
>>>>>>>> The inverted label is a result of JBoost using its own internal
>>>>>>>> labeling system. If you swap the order of how you specify the labels
>>>>>>>> (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get
>>>>>>>> the correct label.
>>>>>>>> I haven't heard about the difference in score before. Are you
>>>>>>>> perhaps looking at the scores for the wrong iteration? Are you using
>>>>>>>> "-a -1" or "-a -2" switches to obtain the appropriate score/margin
>>>>>>>> output files? Are you perhaps getting training and testing sets mixed
>>>>>>>> up?
>>>>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>>>>> directory) and it looks like everything is fine. If you can send
>>>>>>>> your train/test files or reproduce the bug on the spambase dataset,
>>>>>>>> please send me the exact parameters you're using and I'll see if it's
>>>>>>>> a bug, poor documentation, or a misunderstanding of some sort.
>>>>>>>> Thanks for the heads up on the potential bug in the matlab scores.
>>>>>>>> Aaron
>>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>>>>> Jboost. I also asked it to output a matlab script I could use to
>>>>>>>>> classify examples with in the future. However, I was wondering why
>>>>>>>>> the matlab script outputs slightly different values than I would get
>>>>>>>>> by classifying the training/test set directly using Jboost (for
>>>>>>>>> example, the sign of the classifier output is always opposite to
>>>>>>>>> what Jboost produces, and at most I have seen a 0.1469 discrepancy
>>>>>>>>> in the actual value after accounting for the sign issue). Has anyone
>>>>>>>>> encountered this issue, or am I perhaps doing something incorrectly?