From: Aaron A. <aa...@cs...> - 2008-08-16 03:25:11
Hey Viren,

Yes, I do remember a 300 iteration limit. I think that was put in because BrownBoost rarely went into an infinite loop, so an upper limit was placed on the number of iterations so that, when left running in the background, it didn't fill up the disk.

However, if you don't stop before 300 iterations, look at the .info file and see whether the error is decreasing. If it is, increase the iteration limit to 3000 (or another number appropriate for your application). If the error is not decreasing, look at the intermediate .boosting.info files to see whether the remaining time is decreasing. If that is not decreasing either... then you've found one of those rare seemingly infinite loops.

The best place to set an iteration bound for BrownBoost is at line 433 in Controller.java.

Keep the comments, requests, and bugs coming!

Aaron

On Fri, 15 Aug 2008, Viren Jain wrote:

> OK.. By the way, I tried the new code from the CVS repository. BrownBoost
> runs, but I think it always stops at the 300th learning iteration
> (regardless of what I use as the -r parameter?) I will double check this...
> And let me know if you don't want bug reports yet :)
>
> Thanks,
> Viren
>
> On Aug 15, 2008, at 8:49 PM, Aaron Arvey wrote:
>
> On Fri, 15 Aug 2008, Viren Jain wrote:
>
>> Sounds great, I will check it out. One question, what do you mean "by
>> default"? I.e., if I do not include the -a option at all? Or with no
>> iteration specified?
>
> If you do not specify -a (or if you specify "-a 0"), BrownBoost algorithms
> will print output on their final iteration. I'm going to make this the
> default behavior for all boosting algorithms.
>
> Aaron
>
>> On Aug 15, 2008, at 5:54 PM, Aaron Arvey wrote:
>>
>> Hey Viren,
>>
>> Update the CVS and the data will now be output on the last iteration by
>> default when using BrownBoost or subclasses.
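The "is the error still decreasing?" check above is easy to script. This is a minimal sketch only: the layout of the .info file is an assumption here (I pretend each data row ends with a numeric error column), so adjust the parsing to whatever your JBoost version actually writes.

```python
# Sketch: decide whether to raise the iteration limit by checking
# whether the reported error is still trending downward.
# ASSUMPTION: each data line of the .info file ends with a numeric
# error value; header and non-numeric lines are skipped.

def error_is_decreasing(lines, window=20):
    """Return True if the last `window` error values trend downward."""
    errors = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            errors.append(float(fields[-1]))  # assumed: error in last column
        except ValueError:
            continue  # skip headers / non-numeric lines
    if len(errors) < 2:
        return False
    tail = errors[-window:]
    return tail[-1] < tail[0]

# Example with made-up values:
fake_info = ["iter error", "1 0.40", "2 0.35", "3 0.31", "4 0.30"]
print(error_is_decreasing(fake_info))  # True: error still dropping
```

If this prints True near iteration 300, raising the bound (or the constant in Controller.java) is probably worthwhile; if not, check the remaining-time column as Aaron suggests.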
>>
>> So you know, you are using code that is about to be completely rewritten,
>> which is fine, but you may want to save a copy of the version of the
>> JBoost code you currently have, so that possible bugs in the new code do
>> not destroy any good results you achieve with this version. Alternatively,
>> the "improvements" in BrownBoost may give you even better performance!
>>
>> Also, as long as you're testing out some of the new JBoost features, you
>> may want to check out the new R visualization scripts in ./scripts.
>> There's a README file with basic documentation, and more documentation
>> will appear on the website soon. The python files are outdated, probably
>> buggy, and generate ugly pictures. If you try the R visualizations and
>> have any problems, let me know and CC the jboost-users list.
>>
>> Glad to hear you're having a good time with JBoost!
>>
>> Aaron
>>
>> On Fri, 15 Aug 2008, Viren Jain wrote:
>>
>>> Hi Aaron,
>>> One other random question- I've started experimenting with BrownBoost,
>>> which is useful due to lots of noisy examples in my training set. Since
>>> I use the -r option to specify how "long" to train, is there a way to
>>> tell JBoost to only output training/test info at the last epoch (even
>>> though I don't know what the last epoch will necessarily be)?
>>> Thanks! And great job with JBoost, it's really a fun and useful tool to
>>> experiment with boosting.
>>> Thanks again,
>>> Viren
>>> On Aug 13, 2008, at 5:06 PM, Aaron Arvey wrote:
>>> There are a few papers on asymmetric classification. Unfortunately, most
>>> of them focus on changing weights on examples (or slack variables in
>>> SVMs) and not on maximizing the margin. Maximizing margins is one of the
>>> biggest reasons boosting and SVMs are so successful, so ignoring it when
>>> doing asymmetric prediction would seem to be a bad idea.
>>> Some examples can be found at a website I have set up to remind myself
>>> of all the papers in boosting.
>>> Just google "boosting papers."
>>> Aaron
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>> Great! The compilation and fix seem to work.
>>>> The sliding bias approach is basically what I am doing now; only using
>>>> positive classifications with some minimum margin. Is there any
>>>> interesting prior literature out there on learning with asymmetric
>>>> cost? I am interested in this issue.
>>>> Thanks again for all your help, sorry about all the trouble!
>>>> Viren
>>>> On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>> I downloaded the CVS version and tried to compile it, but ran into the
>>>>> following errors. Do you think this is probably just an outdated Java
>>>>> version issue?
>>>>>
>>>>> [javac] Compiling 46 source files to /home/viren/jboost/build
>>>>> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148:
>>>>>         cannot find symbol
>>>>> [javac] symbol  : method getPositiveInitialPotential()
>>>>> [javac] location: class jboost.booster.BrownBoost
>>>>> [javac] System.out.println("\tPotential loss of positive examples m_booster: "
>>>>>         + b.getPositiveInitialPotential());
>>>>> .... etc
>>>> This is the region that is currently under heavy construction. Instead
>>>> of trying to debug this, I just commented out the calls and everything
>>>> compiled for me. Try a 'cvs update' and 'ant jar'.
>>>>> Regarding the cost function - the distribution has maybe 70% negative
>>>>> examples and 30% positive examples, so it's not horrendously
>>>>> imbalanced, but for our application a false positive is extremely
>>>>> costly (let's say, 5 times as costly) compared to a false negative.
>>>> For the time being, you can just use a "sliding bias" approach, where
>>>> you add a value X to the root node (which is the same as adding X to
>>>> the final score, since all examples go through the root).
>>>> The root node actually balances the weights on the dataset (it
>>>> satisfies E_{w_{i+1}}[h(x)y] = 0; see the equation in Schapire & Singer
>>>> 99, where h(x) == "Always +1").
>>>> This isn't perfect, but check again in a couple of weeks and we should
>>>> have a much more technical (and much cooler) approach available, based
>>>> on some new math from drifting games.
>>>> If this doesn't give very good results (no value of X gives you the
>>>> sensitivity/specificity tradeoff desired), I've picked up a few other
>>>> hacks that may help.
>>>> Aaron
>>>>> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>>>>> That was implemented in the BrownBoost and NormalBoost boosters.
>>>>> Unfortunately, these boosters are currently (as in 5 minutes ago)
>>>>> being completely rewritten.
>>>>> There are still plenty of ways to artificially change the cost
>>>>> function, and these work well for some applications.
>>>>> What exactly are you trying to do? How asymmetric are the costs of
>>>>> your mispredictions? How asymmetric is the distribution of your
>>>>> classes?
>>>>> Aaron
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>> Great! One last question (for now :). I was a little confused by the
>>>>>> documentation regarding asymmetric cost functions: is it currently
>>>>>> possible to change the cost function such that false positives are
>>>>>> more costly than false negatives?
>>>>>> Thanks,
>>>>>> Viren
>>>>>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>>>>> Viren,
>>>>>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>>>>> This will certainly cause a speed up in the release of the next
>>>>>> version. Let me know if you have any problems with the CVS release.
>>>>>> Nice catch!
>>>>>> Aaron
>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>> Hey Aaron,
>>>>>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get
>>>>>>> empty output in the .boosting.info files. I am using release 1.4,
>>>>>>> but I can try the repository version.
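[Editor's note on the "sliding bias" trick quoted above: it can be sketched in a few lines. This is an illustration with made-up scores and labels, not JBoost code; the 5:1 false-positive cost matches the application discussed in the thread.]

```python
# Sketch of the "sliding bias" approach: add a constant X to every
# final score and pick the X that minimizes total asymmetric cost.
# Scores, labels, and the bias grid below are all made up.

def weighted_cost(scores, labels, bias, fp_cost=5.0, fn_cost=1.0):
    """Total misclassification cost after shifting every score by `bias`."""
    cost = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s + bias > 0 else -1
        if pred == 1 and y == -1:
            cost += fp_cost   # false positive: 5x as costly
        elif pred == -1 and y == 1:
            cost += fn_cost   # false negative
    return cost

scores = [0.9, 0.4, -0.1, -0.3, -0.8, 0.2]   # hypothetical final scores
labels = [1, -1, 1, -1, -1, 1]

# Sweep the bias on a held-out set and keep the cheapest setting.
best = min((weighted_cost(scores, labels, b), b)
           for b in [x / 10.0 for x in range(-10, 11)])
print(best)
```

Sweeping X on held-out data (rather than the training set) gives the sensitivity/specificity tradeoff Aaron mentions without retraining the booster.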
>>>>>>> Thanks!
>>>>>>> Viren
>>>>>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>>>>> Hey Viren,
>>>>>>> I just tried out
>>>>>>>   cd jboost/demo
>>>>>>>   ../jboost -numRounds 10 -a 9 -S stem
>>>>>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>>>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>>> and I see that this outputs the second-to-last iteration. When I try
>>>>>>>   cd jboost/demo
>>>>>>>   ../jboost -numRounds 10 -a 10 -S stem
>>>>>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>>>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>>> I see that the final iteration is output.
>>>>>>> Let me know what you see when you run the above. If you see
>>>>>>> something different, perhaps there used to be a bug and it was
>>>>>>> corrected. The code to output files via the "-a" switch was recently
>>>>>>> updated, so perhaps this bug was fixed along the way (I updated it
>>>>>>> and have no memory of fixing this bug, but perhaps I did...). Are
>>>>>>> you perhaps using an old version of JBoost? Perhaps try out the cvs
>>>>>>> repository and see if that fixes your problem.
>>>>>>> Aaron
>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>> Thanks again, Aaron.
>>>>>>>> I double-checked things and it seems I still see discrepancies in
>>>>>>>> the classifier outputs. The exact jboost command I am using is:
>>>>>>>>   ... jboost.controller.Controller -S test_old_background_mporder
>>>>>>>>   -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m
>>>>>>>>   classify_background.m
>>>>>>>> I assume there is some sort of 0 counting, since if I use -a 300
>>>>>>>> the .info.testing and .info.training files are 0 bytes. So if this
>>>>>>>> is correct, then test_old_background_mporder.test.boosting.info
>>>>>>>> should have identical outputs to those generated from the same
>>>>>>>> examples by using classify_background.m?
>>>>>>>> Again, thanks so much!
>>>>>>>> Viren
>>>>>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>>> I'm actually using text strings for the labels. I.e., in the spec
>>>>>>>>> file I have the line "labels (merge, split)" and then for each
>>>>>>>>> example in training/test, I output the appropriate string. Do you
>>>>>>>>> recommend I use (-1,1) instead?
>>>>>>>> That's fine. I just assumed that since you said the labels were
>>>>>>>> inverted, that meant you were using -1/+1. Using text is perfectly
>>>>>>>> okay.
>>>>>>>>> Also, what is the iteration on which JBoost outputs the matlab
>>>>>>>>> file when I use the -m option? The last one?
>>>>>>>> Yes, it is the last iteration. There should probably be an option
>>>>>>>> (like -a) to output this more often.
>>>>>>>> Aaron
>>>>>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>>>>> Hi Viren,
>>>>>>>>> The inverted label is a result of JBoost using its own internal
>>>>>>>>> labeling system. If you swap the order in which you specify the
>>>>>>>>> labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)"),
>>>>>>>>> you'll get the correct label.
>>>>>>>>> I haven't heard about the difference in score before. Are you
>>>>>>>>> perhaps looking at the scores for the wrong iteration? Are you
>>>>>>>>> using the "-a -1" or "-a -2" switches to obtain the appropriate
>>>>>>>>> score/margin output files? Are you perhaps getting the training
>>>>>>>>> and testing sets mixed up?
>>>>>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>>>>>> directory) and it looks like everything is fine. If you can send
>>>>>>>>> your train/test files or reproduce the bug on the spambase
>>>>>>>>> dataset, please send me the exact parameters you're using and
>>>>>>>>> I'll see if it's a bug, poor documentation, or a misunderstanding
>>>>>>>>> of some sort.
>>>>>>>>> Thanks for the heads up on the potential bug in the matlab scores.
>>>>>>>>> Aaron
>>>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT
>>>>>>>>>> using JBoost. I also asked it to output a matlab script I could
>>>>>>>>>> use to classify examples in the future. However, I was wondering
>>>>>>>>>> why the matlab script outputs slightly different values than I
>>>>>>>>>> would get by classifying the training/test set directly using
>>>>>>>>>> JBoost (for example, the sign of the classifier output is always
>>>>>>>>>> opposite to what JBoost produces, and at most I have seen a
>>>>>>>>>> 0.1469 discrepancy in the actual value after accounting for the
>>>>>>>>>> sign issue). Has anyone encountered this issue, or am I perhaps
>>>>>>>>>> doing something incorrectly?
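[Editor's note: the score comparison Viren describes at the bottom of the thread can be automated. The helper below is a sketch with made-up numbers; it assumes you have already extracted the two score lists (from the .boosting.info file and from the exported matlab script) in matching example order.]

```python
# Helper sketch: line up scores from JBoost's .boosting.info output with
# scores from the exported matlab script, then report how many examples
# have flipped signs and the largest magnitude discrepancy.

def compare_scores(jboost_scores, matlab_scores):
    """Return (number of sign flips, max gap between score magnitudes)."""
    sign_flips = sum(1 for a, b in zip(jboost_scores, matlab_scores)
                     if a * b < 0)
    max_gap = max(abs(abs(a) - abs(b))  # compare magnitudes, ignoring sign
                  for a, b in zip(jboost_scores, matlab_scores))
    return sign_flips, max_gap

# Made-up scores mimicking the symptom in the thread: every sign is
# inverted, and one example shows a 0.1469 magnitude discrepancy.
jb = [1.20, -0.55, 0.80]
ml = [-1.20, 0.55, -0.9469]
print(compare_scores(jb, ml))
```

If every sign is flipped, the label-ordering fix Aaron describes ("labels (-1,1)" instead of "labels (1,-1)") should resolve it; a residual nonzero magnitude gap points at a genuine discrepancy between the matlab export and JBoost's own scoring.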