From: Viren J. <vi...@MI...> - 2008-08-13 20:20:59
Great! The compilation and the fix seem to work. The sliding bias approach is
basically what I am doing now: only using positive classifications with some
minimum margin. Is there any interesting prior literature out there on
learning with asymmetric costs? I am interested in this issue.

Thanks again for all your help, and sorry about all the trouble!

Viren

On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:

On Wed, 13 Aug 2008, Viren Jain wrote:

> I downloaded the CVS version and tried to compile it, but ran into the
> following errors. Do you think this is probably just an outdated Java
> version issue?
>
> [javac] Compiling 46 source files to /home/viren/jboost/build
> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
> [javac] symbol  : method getPositiveInitialPotential()
> [javac] location: class jboost.booster.BrownBoost
> [javac] System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
> ... etc.

This is the region that is currently under heavy construction. Instead of
trying to debug this, I just commented out the calls, and everything compiled
for me. Try a 'cvs update' and 'ant jar'.

> Regarding the cost function - the distribution has maybe 70% negative
> examples and 30% positive examples, so not horrendously imbalanced, but
> for our application a false positive is extremely costly (let's say, 5
> times as costly) compared to a false negative.

For the time being, you can just use a "sliding bias" approach, where you add
a value X to the root node (which is the same as adding X to the final score,
since all examples go through the root). The root node actually balances the
weights on the dataset: it satisfies E_{w_{i+1}}[h(x)y] = 0 (see the
corresponding equation in Schapire & Singer '99, with h(x) == "always +1").
This isn't perfect, but check again in a couple of weeks and we should have a
much more technical (and much cooler) approach available, based on some new
math from drifting games.
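To make the sliding-bias idea above concrete, here is a minimal,
self-contained Java sketch (purely illustrative: the hard-coded scores and
labels stand in for the values JBoost writes to the score/margin output
files, and nothing here is JBoost's actual API). It shifts every score by a
bias X and reports the error counts with false positives weighted 5x, as in
the example above:

    // Sliding-bias sketch: adding X to the root node's output is the same
    // as adding X to every final score, so X can be chosen after training.
    public class SlidingBiasDemo {
        public static void main(String[] args) {
            // Toy (score, label) pairs; in practice, read these from the
            // score/margin files that JBoost outputs.
            double[] scores = { 2.1, 0.4, -0.3, -1.7, 0.9, -0.2 };
            int[]    labels = {  +1,  -1,   +1,   -1,  +1,   -1 };

            for (double x = -2.0; x <= 2.0; x += 0.5) {
                int fp = 0, fn = 0;
                for (int i = 0; i < scores.length; i++) {
                    int pred = (scores[i] + x >= 0) ? +1 : -1;
                    if (pred == +1 && labels[i] == -1) fp++;
                    if (pred == -1 && labels[i] == +1) fn++;
                }
                // False positives are ~5x as costly as false negatives.
                double cost = 5.0 * fp + fn;
                System.out.printf("X = %+.1f  FP = %d  FN = %d  cost = %.1f%n",
                                  x, fp, fn, cost);
            }
        }
    }

Sweeping X like this traces out the whole sensitivity/specificity tradeoff,
so you can check whether any single bias reaches the operating point you
need before trying the more technical approaches mentioned below.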
If this doesn't give very good results (no value of X gives you the
sensitivity/specificity tradeoff you want), I've picked up a few other hacks
that may help.

Aaron

> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>
> That was implemented in the BrownBoost and NormalBoost boosters.
> Unfortunately, these boosters are currently (as of 5 minutes ago) being
> completely rewritten.
>
> There are still plenty of ways to artificially change the cost function,
> and these work well for some applications.
>
> What exactly are you trying to do? How asymmetric are the costs of your
> mispredictions? How asymmetric is the distribution of your classes?
>
> Aaron
>
> On Wed, 13 Aug 2008, Viren Jain wrote:
>
>> Great! One last question (for now :). I was a little confused by the
>> documentation regarding asymmetric cost functions: is it currently
>> possible to change the cost function such that false positives are more
>> costly than false negatives?
>> Thanks,
>> Viren
>>
>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>
>> Viren,
>>
>> Yep, I just tried 1.4 and I was able to reproduce your problem. This
>> will certainly cause a speed-up in the release of the next version. Let
>> me know if you have any problems with the CVS release. Nice catch!
>>
>> Aaron
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Hey Aaron,
>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an
>>> empty output in the .boosting.info files. I am using release 1.4, but
>>> I can try the repository version.
>>> Thanks!
>>> Viren
>>>
>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>
>>> Hey Viren,
>>>
>>> I just tried out
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 9 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> and I see that this outputs the second-to-last iteration. When I try
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 10 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> I see that the final iteration is output.
>>>
>>> Let me know what you see when you run the above. If you see something
>>> different, perhaps there used to be a bug and it was corrected. The
>>> code that outputs files via the "-a" switch was recently updated, so
>>> perhaps this bug was fixed then (I updated it and have no memory of
>>> fixing this bug, but perhaps I did...). Are you perhaps using an old
>>> version of JBoost? Perhaps try out the CVS repository and see if that
>>> fixes your problem.
>>>
>>> Aaron
>>>
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>
>>>> Thanks again, Aaron.
>>>>
>>>> I double-checked things and it seems I still see discrepancies in the
>>>> classifier outputs. The exact jboost command I am using is:
>>>>
>>>>   ... jboost.controller.Controller -S test_old_background_mporder -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m classify_background.m
>>>>
>>>> I assume there is some sort of zero-based counting, since if I use
>>>> -a 300 the .info.testing and .info.training files are 0 bytes. So if
>>>> this is correct, then test_old_background_mporder.test.boosting.info
>>>> should have identical outputs to those generated from the same
>>>> examples by using classify_background.m?
>>>>
>>>> Again, thanks so much!
>>>> Viren
>>>>
>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>
>>>>> I'm actually using text strings for the labels, i.e., in the spec
>>>>> file I have the line "labels (merge, split)" and then for each
>>>>> example in training/test, I output the appropriate string. Do you
>>>>> recommend I use (-1,1) instead?
>>>>
>>>> That's fine. I just assumed that since you said the labels were
>>>> inverted, you were using -1/+1. Using text is perfectly okay.
>>>>
>>>>> Also, what is the iteration on which JBoost outputs the matlab file
>>>>> when I use the -m option? The last one?
>>>>
>>>> Yes, it is the last iteration. There should probably be an option
>>>> (like -a) to output this more often.
>>>>
>>>> Aaron
>>>>
>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>
>>>>> Hi Viren,
>>>>>
>>>>> The inverted label is a result of JBoost using its own internal
>>>>> labeling system. If you swap the order in which you specify the
>>>>> labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)"),
>>>>> you'll get the correct label.
>>>>>
>>>>> I haven't heard about the difference in score before. Are you
>>>>> perhaps looking at the scores for the wrong iteration? Are you using
>>>>> the "-a -1" or "-a -2" switch to obtain the appropriate score/margin
>>>>> output files? Are you perhaps getting the training and testing sets
>>>>> mixed up?
>>>>>
>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>> directory) and it looks like everything is fine.
>>>>> If you can send your train/test files or reproduce the bug on the
>>>>> spambase dataset, please send me the exact parameters you're using,
>>>>> and I'll see if it's a bug, poor documentation, or a misunderstanding
>>>>> of some sort.
>>>>>
>>>>> Thanks for the heads-up on the potential bug in the matlab scores.
>>>>>
>>>>> Aaron
>>>>>
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>
>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>> JBoost. I also asked it to output a matlab script I could use to
>>>>>> classify examples in the future. However, I was wondering why the
>>>>>> matlab script outputs slightly different values than I would get by
>>>>>> classifying the training/test set directly using JBoost (for
>>>>>> example, the sign of the classifier output is always opposite to
>>>>>> what JBoost produces, and at most I have seen a 0.1469 discrepancy
>>>>>> in the actual value after accounting for the sign issue). Has
>>>>>> anyone encountered this issue, or am I perhaps doing something
>>>>>> incorrectly?
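To illustrate the label-ordering fix Aaron describes above (a hypothetical
.spec fragment built from the "labels (merge, split)" line mentioned in this
thread; the rest of the spec file is omitted): a declaration like

    labels (merge, split)

determines which class JBoost maps to its internal positive label by the
order of the names, so per Aaron's suggestion, if the signs in the generated
matlab script come out inverted, swapping the declaration to

    labels (split, merge)

flips which class is treated as positive, without any change to the data
files themselves.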