From: Aaron A. <aa...@cs...> - 2008-08-16 00:49:19
On Fri, 15 Aug 2008, Viren Jain wrote:

> Sounds great, I will check it out. One question, what do you mean "by
> default"? I.e., if I do not include the -a option at all? Or with no
> iteration specified?

If you do not specify -a (or if you specify "-a 0"), BrownBoost algorithms
will print output on their final iteration. I'm going to make this the
default behavior for all boosting algorithms.

Aaron

> On Aug 15, 2008, at 5:54 PM, Aaron Arvey wrote:
>
> Hey Viren,
>
> Update the CVS and the data will now be output on the last iteration by
> default when using BrownBoost or subclasses.
>
> So you know, you are using code that is about to be completely rewritten,
> which is fine, but you may want to save a copy of the version of the JBoost
> code you currently have so that possible bugs in the new code do not destroy
> any good results you achieve with this version. Alternatively, the
> "improvements" in BrownBoost may give you even better performance!
>
> Also, as long as you're testing out some of the new JBoost features, you may
> want to check out the new R visualization scripts in ./scripts. There's a
> README file with basic documentation, and more documentation will appear on
> the website soon. The python files are outdated, probably buggy, and
> generate ugly pictures. If you try the R visualizations and have any
> problems, let me know and CC the jboost-users list.
>
> Glad to hear you're having a good time with JBoost!
>
> Aaron
>
>
> On Fri, 15 Aug 2008, Viren Jain wrote:
>
>> Hi Aaron,
>> One other random question - I've started experimenting with BrownBoost,
>> which is useful due to lots of noisy examples in my training set. Since I
>> use the -r option to specify how "long" to train, is there a way to tell
>> JBoost to only output training/test info at the last epoch (even though I
>> don't know what the last epoch will necessarily be)?
>> Thanks! And great job with JBoost, it's really a fun and useful tool to
>> experiment with boosting.
>> Thanks again,
>> Viren
>>
>> On Aug 13, 2008, at 5:06 PM, Aaron Arvey wrote:
>>
>> There are a few papers on asymmetric classification. Unfortunately, most of
>> them focus on changing weights on examples (or slack variables in SVMs) and
>> not on maximizing the margin. Margin maximization is one of the biggest
>> reasons boosting and SVMs are so successful, so ignoring it when doing
>> asymmetric prediction would seem to be a bad idea.
>>
>> Some examples can be found at a website I have set up to remind myself of
>> all the papers in boosting. Just google "boosting papers."
>>
>> Aaron
>>
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Great! The compilation and fix seem to work.
>>> The sliding bias approach is basically what I am doing now; only using
>>> positive classifications with some minimum margin. Is there any interesting
>>> prior literature out there on learning with asymmetric cost? I am
>>> interested in this issue.
>>> Thanks again for all your help, sorry about all the trouble!
>>> Viren
>>> On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>> I downloaded the CVS version and tried to compile it, but ran into the
>>>> following errors. Do you think this is probably just an outdated Java
>>>> version issue?
>>>>
>>>> [javac] Compiling 46 source files to /home/viren/jboost/build
>>>> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148:
>>>> cannot find symbol
>>>> [javac] symbol  : method getPositiveInitialPotential()
>>>> [javac] location: class jboost.booster.BrownBoost
>>>> [javac] System.out.println("\tPotential loss of positive examples m_booster: " +
>>>> b.getPositiveInitialPotential());
>>>> ... etc.
>>> This is the region that is currently under heavy construction. Instead of
>>> trying to debug this, I just commented out the calls and everything
>>> compiled for me. Try a 'cvs update' and 'ant jar'.
>>>> Regarding the cost function - the distribution has maybe 70% negative
>>>> examples and 30% positive examples, so not horrendously imbalanced, but
>>>> for our application a false positive is extremely costly (let's say, 5
>>>> times as costly) compared to a false negative.
>>> For the time being, you can just use a "sliding bias" approach, where
>>> you add a value X to the root node (which is the same as adding X to
>>> the final score, since all examples go through the root). The root node
>>> actually balances the weights on the dataset (it satisfies
>>> E_{w_{i+1}}[h(x)y] = 0; see the equation in Schapire & Singer 99, where
>>> h(x) == "Always +1").
>>> This isn't perfect, but check again in a couple of weeks and we should
>>> have a much more technical (and much cooler) approach available based on
>>> some new math from drifting games.
>>> If this doesn't give very good results (no value of X gives you the
>>> sensitivity/specificity tradeoff desired), I've picked up a few other
>>> hacks that may help.
>>> Aaron
>>>> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>>>> That was implemented in the BrownBoost and NormalBoost boosters.
>>>> Unfortunately, these boosters are currently (as in 5 minutes ago) being
>>>> completely rewritten.
>>>> There are still plenty of ways to artificially change the cost function,
>>>> and these work well for some applications.
>>>> What exactly are you trying to do? How asymmetric are the costs of your
>>>> mispredictions? How asymmetric is the distribution of your classes?
>>>> Aaron
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>> Great! One last question (for now :). I was a little confused by the
>>>>> documentation regarding asymmetric cost functions: is it currently
>>>>> possible to change the cost function such that false positives are more
>>>>> costly than false negatives?
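[Editor's note: the "sliding bias" idea above can be sketched in a few lines.
This is a hedged illustration in plain Python, not JBoost code; the scores,
labels, and the confusion() helper are made up for the example. It shifts
every final score by a constant X and sweeps X to trade false positives
against false negatives, using the thread's 5:1 cost ratio.]

```python
# Sliding-bias sketch: shift every boosted score by a constant X, then
# sweep X to trade sensitivity against specificity.
# Scores and labels are illustrative data, not JBoost output.

def confusion(scores, labels, bias):
    """Classify as +1 when score + bias > 0; count errors per class."""
    fp = sum(1 for s, y in zip(scores, labels) if s + bias > 0 and y == -1)
    fn = sum(1 for s, y in zip(scores, labels) if s + bias <= 0 and y == +1)
    return fp, fn

scores = [2.1, 0.4, -0.3, -1.7, 0.9, -0.2]
labels = [+1, -1, +1, -1, +1, -1]

# A false positive is 5x as costly as a false negative (the thread's example),
# so minimize 5*FP + FN over a grid of bias values.
best = min((fp * 5 + fn, x)
           for x in [b / 10 for b in range(-30, 31)]
           for fp, fn in [confusion(scores, labels, x)])
print(best)  # (minimum weighted cost, a bias X that achieves it)
```

With asymmetric costs the optimal X is typically negative here, i.e. the
decision boundary is pushed toward requiring a larger positive margin,
which matches the "only use positive classifications with some minimum
margin" strategy described above.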
>>>>> Thanks,
>>>>> Viren
>>>>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>>>> Viren,
>>>>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>>>> This will certainly cause a speed-up in the release of the next version.
>>>>> Let me know if you have any problems with the CVS release.
>>>>> Nice catch!
>>>>> Aaron
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>> Hey Aaron,
>>>>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get
>>>>>> empty output in the .boosting.info files. I am using release 1.4, but
>>>>>> I can try the repository version.
>>>>>> Thanks!
>>>>>> Viren
>>>>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>>>> Hey Viren,
>>>>>> I just tried out
>>>>>>   cd jboost/demo
>>>>>>   ../jboost -numRounds 10 -a 9 -S stem
>>>>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>> and I see that this outputs the second-to-last iteration. When I try
>>>>>>   cd jboost/demo
>>>>>>   ../jboost -numRounds 10 -a 10 -S stem
>>>>>>   cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>>>   ../jboost -numRounds 10 -a -2 -S stem
>>>>>>   sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>>> I see that the final iteration is output.
>>>>>> Let me know what you see when you run the above. If you see something
>>>>>> different, perhaps there used to be a bug and it was corrected. The
>>>>>> code to output files by the "-a" switch was recently updated, so
>>>>>> perhaps this bug was corrected (I updated it and have no memory of
>>>>>> fixing this bug, but perhaps I did...). Are you perhaps using an old
>>>>>> version of JBoost? Perhaps try out the cvs repository and see if that
>>>>>> fixes your problem.
>>>>>> Aaron
>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>> Thanks again, Aaron.
>>>>>>> I double-checked things and it seems I still see discrepancies in the
>>>>>>> classifier outputs.
>>>>>>> The exact jboost command I am using is:
>>>>>>>   ... jboost.controller.Controller -S test_old_background_mporder
>>>>>>>   -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m
>>>>>>>   classify_background.m
>>>>>>> I assume there is some sort of 0 counting, since if I use -a 300 the
>>>>>>> .info.testing and .info.training files are 0 bytes. So if this is
>>>>>>> correct, then test_old_background_mporder.test.boosting.info should
>>>>>>> have identical outputs to those generated from the same examples by
>>>>>>> using classify_background.m?
>>>>>>> Again, thanks so much!
>>>>>>> Viren
>>>>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>> I'm actually using text strings for the labels, i.e., in the spec
>>>>>>>> file I have the line "labels (merge, split)" and then for each
>>>>>>>> example in training/test, I output the appropriate string. Do you
>>>>>>>> recommend I use (-1,1) instead?
>>>>>>> That's fine. I just assumed that since you said the labels were
>>>>>>> inverted, that meant you were using -1/+1. Using text is perfectly
>>>>>>> okay.
>>>>>>>> Also, what is the iteration on which JBoost outputs the matlab file
>>>>>>>> when I use the -m option? The last one?
>>>>>>> Yes, it is the last iteration. There should probably be an option
>>>>>>> (like -a) to output this more often.
>>>>>>> Aaron
>>>>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>>>> Hi Viren,
>>>>>>>> The inverted label is a result of JBoost using its own internal
>>>>>>>> labeling system. If you swap the order of how you specify the labels
>>>>>>>> (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get
>>>>>>>> the correct label.
>>>>>>>> I haven't heard about the difference in score before. Are you
>>>>>>>> perhaps looking at the scores for the wrong iteration? Are you using
>>>>>>>> the "-a -1" or "-a -2" switches to obtain the appropriate
>>>>>>>> score/margin output files?
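[Editor's note: Viren's guess about zero-based counting is consistent with
the empty-file behavior he describes: if rounds are numbered 0 through
numRounds-1, then with -numRounds 300 the last round is 299 and -a 300
matches no round at all. A trivial sanity check of that arithmetic, in
plain Python rather than JBoost:]

```python
# Sketch of the zero-based round numbering inferred in the thread
# (plain Python arithmetic, not JBoost itself).
num_rounds = 300
iterations = list(range(num_rounds))  # rounds appear to be numbered 0..numRounds-1

last_round = iterations[-1]  # 299: the value to pass as "-a 299"
print(last_round)
print(num_rounds in iterations)  # "-a 300" matches no round -> empty .info files
```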
>>>>>>>> Are you perhaps getting training and testing sets mixed up?
>>>>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo
>>>>>>>> directory) and it looks like everything is fine. If you can send
>>>>>>>> your train/test files or reproduce the bug on the spambase dataset,
>>>>>>>> please send me the exact parameters you're using and I'll see if
>>>>>>>> it's a bug, poor documentation, or a misunderstanding of some sort.
>>>>>>>> Thanks for the heads-up on the potential bug in the matlab scores.
>>>>>>>> Aaron
>>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>>>>> JBoost. I also asked it to output a matlab script I could use to
>>>>>>>>> classify examples with in the future. However, I was wondering why
>>>>>>>>> the matlab script outputs slightly different values than I would
>>>>>>>>> get by classifying the training/test set directly using JBoost
>>>>>>>>> (for example, the sign of the classifier output is always opposite
>>>>>>>>> to what JBoost produces, and at most I have seen a 0.1469
>>>>>>>>> discrepancy in the actual value after accounting for the sign
>>>>>>>>> issue). Has anyone encountered this issue, or am I perhaps doing
>>>>>>>>> something incorrectly?
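[Editor's note: the constant sign flip reported here is consistent with the
label-order explanation given earlier in the thread: swapping the declared
label order effectively negates every score while leaving the induced
classification unchanged. A toy illustration with made-up scores, not
JBoost output:]

```python
# Toy illustration: negating every score (what a swapped label order
# effectively does) flips the sign convention but not the per-example
# predictions, so comparing the two conventions directly shows a
# "constant sign inversion" and nothing more.
scores = [1.3, -0.7, 0.2, -2.4]        # hypothetical boosted scores
flipped = [-s for s in scores]         # same model under swapped label order

preds = [s > 0 for s in scores]        # "positive class" means score > 0
preds_flipped = [s < 0 for s in flipped]  # opposite convention: score < 0
print(preds == preds_flipped)
```

The small residual numeric discrepancy (the 0.1469 above) would not be
explained by this; that part still points at a genuine difference between
the matlab export and JBoost's own scoring.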