From: Aaron A. <aa...@cs...> - 2008-08-13 20:01:30
On Wed, 13 Aug 2008, Viren Jain wrote:

> I downloaded the CVS version and tried to compile it, but ran into the
> following errors. Do you think this is probably just an outdated Java
> version issue?
>
> [javac] Compiling 46 source files to /home/viren/jboost/build
> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
> [javac] symbol  : method getPositiveInitialPotential()
> [javac] location: class jboost.booster.BrownBoost
> [javac]     System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
> .... etc

This is the region that is currently under heavy construction. Instead of trying to debug this, I just commented out the calls and everything compiled for me. Try a 'cvs update' and 'ant jar'.

> Regarding the cost function - the distribution has maybe 70% negative
> examples and 30% positive examples, so not horrendously imbalanced, but
> for our application a false positive is extremely costly (let's say, 5
> times as costly) compared to a false negative.

For the time being, you can just use a "sliding bias" approach, where you add a value X to the root node (which is the same as adding X to the final score, since all examples go through the root). The root node actually balances the weights on the dataset: it satisfies E_{w_{i+1}}[h(x)y] = 0 with h(x) == "always +1" (see the corresponding equation in Schapire & Singer '99), i.e., after the root's weight update the positive and negative examples carry equal total weight. This isn't perfect, but check back in a couple of weeks and we should have a much more technical (and much cooler) approach available, based on some new math from drifting games.
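To make that concrete, here's a rough sketch of the idea (untested, and the names and score values below are made up for illustration rather than taken from the JBoost API -- it just shows the threshold shift):

    // Sliding-bias sketch: shift the boosted score F(x) by X before
    // taking the sign. X < 0 makes a +1 prediction harder to reach,
    // trading false positives for false negatives.
    public class SlidingBias {
        static int predict(double score, double bias) {
            return (score + bias) >= 0 ? +1 : -1;
        }

        public static void main(String[] args) {
            double[] scores = { -1.3, -0.2, 0.4, 2.1 };  // made-up F(x) values
            for (double bias : new double[] { 0.0, -0.5, -1.0 }) {
                System.out.print("X = " + bias + ":");
                for (double s : scores) {
                    System.out.print(" " + predict(s, bias));
                }
                System.out.println();
            }
        }
    }

You'd sweep X over a range on held-out data and pick the value whose false-positive/false-negative tradeoff best matches your 5:1 costs.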
If this doesn't give very good results (no value of X gives you the sensitivity/specificity tradeoff you want), I've picked up a few other hacks that may help.

Aaron

> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>
> That was implemented in the BrownBoost and NormalBoost boosters.
> Unfortunately, these boosters are currently (as in 5 minutes ago) being
> completely rewritten.
>
> There are still plenty of ways to artificially change the cost function,
> and these work well for some applications.
>
> What exactly are you trying to do? How asymmetric are the costs of your
> mispredictions? How asymmetric is the distribution of your classes?
>
> Aaron
>
>
> On Wed, 13 Aug 2008, Viren Jain wrote:
>
>> Great! One last question (for now :). I was a little confused by the
>> documentation regarding asymmetric cost functions: is it currently
>> possible to change the cost function such that false positives are more
>> costly than false negatives?
>>
>> Thanks,
>> Viren
>>
>>
>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>
>> Viren,
>>
>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>
>> This will certainly speed up the release of the next version.
>> Let me know if you have any problems with the CVS release.
>>
>> Nice catch!
>>
>> Aaron
>>
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Hey Aaron,
>>>
>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get empty
>>> output in the .boosting.info files. I am using release 1.4, but I can
>>> try the repository version.
>>>
>>> Thanks!
>>> Viren
>>>
>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>
>>> Hey Viren,
>>>
>>> I just tried out
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 9 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> and I see that this outputs the second-to-last iteration. When I try
>>>
>>>   cd jboost/demo
>>>   ../jboost -numRounds 10 -a 10 -S stem
>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>
>>> I see that the final iteration is output.
>>>
>>> Let me know what you see when you run the above. If you see something
>>> different, perhaps there used to be a bug and it has since been
>>> corrected. The code that outputs files for the "-a" switch was recently
>>> updated, so perhaps this bug was fixed along the way (I updated it and
>>> have no memory of fixing this bug, but perhaps I did...). Are you
>>> perhaps using an old version of JBoost? Perhaps try out the CVS
>>> repository and see if that fixes your problem.
>>>
>>> Aaron
>>>
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>
>>>> Thanks again, Aaron.
>>>>
>>>> I double-checked things and it seems I still see discrepancies in the
>>>> classifier outputs. The exact jboost command I am using is:
>>>>
>>>>   ... jboost.controller.Controller -S test_old_background_mporder
>>>>     -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m
>>>>     classify_background.m
>>>>
>>>> I assume there is some sort of zero-based counting, since if I use
>>>> -a 300 the .info.testing and .info.training files are 0 bytes. So if
>>>> this is correct, then test_old_background_mporder.test.boosting.info
>>>> should have identical outputs to those generated from the same
>>>> examples by using classify_background.m?
>>>>
>>>> Again, thanks so much!
>>>> Viren
>>>>
>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>
>>>>> I'm actually using text strings for the labels, i.e., in the spec
>>>>> file I have the line "labels (merge, split)" and then for each
>>>>> example in training/test I output the appropriate string. Do you
>>>>> recommend I use (-1,1) instead?
>>>>
>>>> That's fine. I just assumed that since you said the labels were
>>>> inverted, that meant you were using -1/+1. Using text is perfectly
>>>> okay.
>>>>
>>>>> Also, what is the iteration on which JBoost outputs the matlab file
>>>>> when I use the -m option? The last one?
>>>>
>>>> Yes, it is the last iteration. There should probably be an option
>>>> (like -a) to output this more often.
>>>>
>>>> Aaron
>>>>
>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>
>>>>> Hi Viren,
>>>>>
>>>>> The inverted label is a result of JBoost using its own internal
>>>>> labeling system. If you swap the order in which you specify the
>>>>> labels (i.e., instead of "labels (1,-1)" you do "labels (-1,1)")
>>>>> you'll get the correct label.
>>>>>
>>>>> I haven't heard about the difference in score before. Are you perhaps
>>>>> looking at the scores for the wrong iteration? Are you using the
>>>>> "-a -1" or "-a -2" switch to obtain the appropriate score/margin
>>>>> output files? Are you perhaps getting the training and testing sets
>>>>> mixed up?
>>>>>
>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>> directory) and it looks like everything is fine. If you can send your
>>>>> train/test files or reproduce the bug on the spambase dataset, please
>>>>> send me the exact parameters you're using and I'll see if it's a bug,
>>>>> poor documentation, or a misunderstanding of some sort.
>>>>>
>>>>> Thanks for the heads-up on the potential bug in the matlab scores.
>>>>>
>>>>> Aaron
>>>>>
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>
>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>> JBoost. I also asked it to output a matlab script I could use to
>>>>>> classify examples in the future.
>>>>>> However, I was wondering why the matlab script outputs slightly
>>>>>> different values than I would get by classifying the training/test
>>>>>> set directly using JBoost (for example, the sign of the classifier
>>>>>> output is always opposite to what JBoost produces, and at most I
>>>>>> have seen a 0.1469 discrepancy in the actual value after accounting
>>>>>> for the sign issue). Has anyone encountered this issue, or am I
>>>>>> perhaps doing something incorrectly?