From: Viren J. <vi...@MI...> - 2008-08-13 20:20:59
Great! The compilation and the fix seem to work. The sliding bias approach is
basically what I am doing now: only using positive classifications with some
minimum margin. Is there any interesting prior literature out there on
learning with asymmetric costs? I am interested in this issue.

Thanks again for all your help, and sorry about all the trouble!

Viren

On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:

On Wed, 13 Aug 2008, Viren Jain wrote:

> I downloaded the CVS version and tried to compile it, but ran into the
> following errors. Do you think this is probably just an outdated Java
> version issue?
>
> [javac] Compiling 46 source files to /home/viren/jboost/build
> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
> [javac] symbol  : method getPositiveInitialPotential()
> [javac] location: class jboost.booster.BrownBoost
> [javac] System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
> ... etc.

This is the region that is currently under heavy construction. Instead of
trying to debug this, I just commented out the calls, and everything compiled
for me. Try a 'cvs update' and 'ant jar'.

> Regarding the cost function - the distribution has maybe 70% negative
> examples and 30% positive examples, so not horrendously imbalanced, but
> for our application a false positive is extremely costly (let's say, 5
> times as costly) compared to a false negative.

For the time being, you can just use a "sliding bias" approach, where you add
a value X to the root node (which is the same as adding X to the final score,
since all examples go through the root). The root node actually balances the
weights on the dataset: it satisfies E_{w_{i+1}}[h(x)y] = 0 (see the
corresponding equation in Schapire & Singer '99, with h(x) == "always +1").
This isn't perfect, but check again in a couple of weeks and we should have a
much more technical (and much cooler) approach available, based on some new
math from drifting games.
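To make the sliding-bias idea above concrete, here is a minimal,
self-contained Java sketch (purely illustrative: the hard-coded scores and
labels stand in for the values JBoost writes to the score/margin output
files, and nothing here is JBoost's actual API). It shifts every score by a
bias X and reports the error counts with false positives weighted 5x, as in
the example above:

    // Sliding-bias sketch: adding X to the root node's output is the same
    // as adding X to every final score, so X can be chosen after training.
    public class SlidingBiasDemo {
        public static void main(String[] args) {
            // Toy (score, label) pairs; in practice, read these from the
            // score/margin files that JBoost outputs.
            double[] scores = { 2.1, 0.4, -0.3, -1.7, 0.9, -0.2 };
            int[]    labels = {  +1,  -1,   +1,   -1,  +1,   -1 };

            for (double x = -2.0; x <= 2.0; x += 0.5) {
                int fp = 0, fn = 0;
                for (int i = 0; i < scores.length; i++) {
                    int pred = (scores[i] + x >= 0) ? +1 : -1;
                    if (pred == +1 && labels[i] == -1) fp++;
                    if (pred == -1 && labels[i] == +1) fn++;
                }
                // False positives are ~5x as costly as false negatives.
                double cost = 5.0 * fp + fn;
                System.out.printf("X = %+.1f  FP = %d  FN = %d  cost = %.1f%n",
                                  x, fp, fn, cost);
            }
        }
    }

Sweeping X like this traces out the whole sensitivity/specificity tradeoff,
so you can check whether any single bias reaches the operating point you
need before trying the more technical approaches mentioned below.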
If this doesn't give very good results (no value of X gives you the
sensitivity/specificity tradeoff you want), I've picked up a few other hacks
that may help.

Aaron

> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>
> That was implemented in the BrownBoost and NormalBoost boosters.
> Unfortunately, these boosters are currently (as of 5 minutes ago) being
> completely rewritten.
>
> There are still plenty of ways to artificially change the cost function,
> and these work well for some applications.
>
> What exactly are you trying to do? How asymmetric are the costs of your
> mispredictions? How asymmetric is the distribution of your classes?
>
> Aaron
>
> On Wed, 13 Aug 2008, Viren Jain wrote:
>
>> Great! One last question (for now :). I was a little confused by the
>> documentation regarding asymmetric cost functions: is it currently
>> possible to change the cost function such that false positives are more
>> costly than false negatives?
>> Thanks,
>> Viren
>>
>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>
>> Viren,
>>
>> Yep, I just tried 1.4 and I was able to reproduce your problem. This
>> will certainly cause a speed-up in the release of the next version. Let
>> me know if you have any problems with the CVS release. Nice catch!
>>
>> Aaron
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Hey Aaron,
>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an
>>> empty output in the .boosting.info files. I am using release 1.4, but
>>> I can try the repository version.
>>> Thanks!
>>> Viren
>>>
>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>
>>> Hey Viren,
>>>
>>> I just tried out
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 9 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> and I see that this outputs the second-to-last iteration. When I try
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 10 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> I see that the final iteration is output.
>>>
>>> Let me know what you see when you run the above. If you see something
>>> different, perhaps there used to be a bug and it was corrected. The
>>> code that outputs files via the "-a" switch was recently updated, so
>>> perhaps this bug was fixed then (I updated it and have no memory of
>>> fixing this bug, but perhaps I did...). Are you perhaps using an old
>>> version of JBoost? Perhaps try out the CVS repository and see if that
>>> fixes your problem.
>>>
>>> Aaron
>>>
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>
>>>> Thanks again, Aaron.
>>>>
>>>> I double-checked things and it seems I still see discrepancies in the
>>>> classifier outputs. The exact jboost command I am using is:
>>>>
>>>>   ... jboost.controller.Controller -S test_old_background_mporder -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m classify_background.m
>>>>
>>>> I assume there is some sort of zero-based counting, since if I use
>>>> -a 300 the .info.testing and .info.training files are 0 bytes. So if
>>>> this is correct, then test_old_background_mporder.test.boosting.info
>>>> should have identical outputs to those generated from the same
>>>> examples by using classify_background.m?
>>>>
>>>> Again, thanks so much!
>>>> Viren
>>>>
>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>
>>>>> I'm actually using text strings for the labels, i.e., in the spec
>>>>> file I have the line "labels (merge, split)" and then for each
>>>>> example in training/test, I output the appropriate string. Do you
>>>>> recommend I use (-1,1) instead?
>>>>
>>>> That's fine. I just assumed that since you said the labels were
>>>> inverted, you were using -1/+1. Using text is perfectly okay.
>>>>
>>>>> Also, what is the iteration on which JBoost outputs the matlab file
>>>>> when I use the -m option? The last one?
>>>>
>>>> Yes, it is the last iteration. There should probably be an option
>>>> (like -a) to output this more often.
>>>>
>>>> Aaron
>>>>
>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>
>>>>> Hi Viren,
>>>>>
>>>>> The inverted label is a result of JBoost using its own internal
>>>>> labeling system. If you swap the order in which you specify the
>>>>> labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)"),
>>>>> you'll get the correct label.
>>>>>
>>>>> I haven't heard about the difference in score before. Are you
>>>>> perhaps looking at the scores for the wrong iteration? Are you using
>>>>> the "-a -1" or "-a -2" switch to obtain the appropriate score/margin
>>>>> output files? Are you perhaps getting the training and testing sets
>>>>> mixed up?
>>>>>
>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>> directory) and it looks like everything is fine.
>>>>> If you can send your train/test files or reproduce the bug on the
>>>>> spambase dataset, please send me the exact parameters you're using,
>>>>> and I'll see if it's a bug, poor documentation, or a misunderstanding
>>>>> of some sort.
>>>>>
>>>>> Thanks for the heads-up on the potential bug in the matlab scores.
>>>>>
>>>>> Aaron
>>>>>
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>
>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>> JBoost. I also asked it to output a matlab script I could use to
>>>>>> classify examples in the future. However, I was wondering why the
>>>>>> matlab script outputs slightly different values than I would get by
>>>>>> classifying the training/test set directly using JBoost (for
>>>>>> example, the sign of the classifier output is always opposite to
>>>>>> what JBoost produces, and at most I have seen a 0.1469 discrepancy
>>>>>> in the actual value after accounting for the sign issue). Has
>>>>>> anyone encountered this issue, or am I perhaps doing something
>>>>>> incorrectly?
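To illustrate the label-ordering fix Aaron describes above (a hypothetical
.spec fragment built from the "labels (merge, split)" line mentioned in this
thread; the rest of the spec file is omitted): a declaration like

    labels (merge, split)

determines which class JBoost maps to its internal positive label by the
order of the names, so per Aaron's suggestion, if the signs in the generated
matlab script come out inverted, swapping the declaration to

    labels (split, merge)

flips which class is treated as positive, without any change to the data
files themselves.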