From: Aaron A. <aa...@cs...> - 2008-08-13 20:01:30
On Wed, 13 Aug 2008, Viren Jain wrote:
> I downloaded the CVS version and tried to compile it, but ran into
> the following errors. Do you think this is probably just an outdated Java
> version issue?
>
> [javac] Compiling 46 source files to /home/viren/jboost/build
> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
> [javac] symbol  : method getPositiveInitialPotential()
> [javac] location: class jboost.booster.BrownBoost
> [javac] System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
> .... etc
This is the region that is currently under heavy construction. Instead of
trying to debug this, I just commented out the calls and everything
compiled for me. Try a 'cvs update' and 'ant jar'.
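Judging from the error trace above, the commented-out region looks roughly
like the sketch below (a reconstruction, not the exact source; the cast from
m_booster is my assumption):

    // Disabled while the boosters are rewritten -- BrownBoost potential report.
    // BrownBoost b = (BrownBoost) m_booster;  // cast assumed
    // System.out.println("\tPotential loss of positive examples m_booster: "
    //                    + b.getPositiveInitialPotential());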
> Regarding the cost function - the distribution has maybe 70%
> negative examples and 30% positive examples, so not horrendously
> imbalanced, but for our application a false positive is extremely costly
> (let's say, 5 times as costly) compared to a false negative.
For the time being, you can just use a "sliding bias" approach, where you
add a value X to the root node (which is the same as adding X to the
final score, since all examples go through the root). The root node
actually balances the weights on the dataset: it satisfies
E_{w_{i+1}}[h(x)y] = 0 (see the equation in Schapire & Singer 99) with
h(x) == "Always +1".
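As a rough sketch of the idea (not JBoost's API -- the class, method, and
scores below are made up for illustration), the post-processing amounts to:

    // Sliding bias: shift each final boosted score by X before
    // thresholding at zero. Equivalent to adding X at the root node,
    // since every example passes through the root.
    public class SlidingBias {
        static int predict(double rawScore, double x) {
            return (rawScore + x) >= 0 ? +1 : -1;
        }
        public static void main(String[] args) {
            double[] scores = { -1.2, -0.1, 0.3, 2.0 };  // made-up scores
            double x = -0.5;  // negative X makes +1 rarer, trading recall
                              // for fewer (costly) false positives
            for (double s : scores)
                System.out.println(s + " -> " + predict(s, x));
        }
    }

Sweep X over a range and pick the value whose false positive / false
negative tradeoff best matches your 5:1 costs.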
This isn't perfect, but check again in a couple weeks and we should have a
much more technical (and much cooler) approach available based on some new
math from drifting games.
If this doesn't give very good results (no value of X gives you the
desired sensitivity/specificity tradeoff), I've picked up a few other
hacks that may help.
Aaron
> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>
> That was implemented in the BrownBoost and NormalBoost boosters.
> Unfortunately, these boosters are currently (as in 5 minutes ago) being
> completely rewritten.
>
> There are still plenty of ways to artificially change the cost function,
> and these work well for some applications.
>
> What exactly are you trying to do? How asymmetric are the costs of your
> mispredictions? How asymmetric is the distribution of your classes?
>
> Aaron
>
>
> On Wed, 13 Aug 2008, Viren Jain wrote:
>
>> Great! One last question (for now :). I was a little confused by the
>> documentation regarding asymmetric cost functions: is it currently
>> possible to change the cost function such that false positives are more
>> costly than false negatives?
>>
>> Thanks,
>> Viren
>>
>>
>>
>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>
>> Viren,
>>
>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>
>> This will certainly speed up the release of the next version.
>> Let me know if you have any problems with the CVS release.
>>
>> Nice catch!
>>
>> Aaron
>>
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> Hey Aaron,
>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an
>>> empty output in the .boosting.info files. I am using release 1.4, but I
>>> can try the repository version.
>>> Thanks!
>>> Viren
>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>> Hey Viren,
>>> I just tried out
>>> cd jboost/demo
>>> ../jboost -numRounds 10 -a 9 -S stem
>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>> ../jboost -numRounds 10 -a -2 -S stem
>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>> And I see that this outputs the second to last iteration. When I try
>>> cd jboost/demo
>>> ../jboost -numRounds 10 -a 10 -S stem
>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>> ../jboost -numRounds 10 -a -2 -S stem
>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>> I see that the final iteration is output.
>>> Let me know what you see when you run the above. If you see something
>>> different, perhaps there used to be a bug that has since been corrected.
>>> The code that outputs files for the "-a" switch was recently updated, so
>>> the bug may have been fixed along the way (I made that update and have no
>>> memory of fixing this particular bug, but perhaps I did...). Are you using
>>> an old version of JBoost? Try the cvs repository and see if that fixes
>>> your problem.
>>> Aaron
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>> Thanks again, Aaron.
>>>> I double-checked things and it seems I still see discrepancies in the
>>>> classifier outputs. The exact jboost command I am using is:
>>>> ... jboost.controller.Controller -S test_old_background_mporder
>>>> -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m
>>>> classify_background.m
>>>> I assume the rounds are zero-indexed (so with -numRounds 300 they run
>>>> from 0 to 299), since if I use -a 300 the .info.testing and
>>>> .info.training files are 0 bytes. If this is correct, then
>>>> test_old_background_mporder.test.boosting.info should have outputs
>>>> identical to those generated from the same examples by
>>>> classify_background.m?
>>>> Again, thanks so much!
>>>> Viren
>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>> I'm actually using text strings for the labels, i.e., in the spec file
>>>>> I have the line "labels (merge, split)" and then for each
>>>>> example in training/test, I output the appropriate string. Do you
>>>>> recommend I use (-1,1) instead?
>>>> That's fine. I just assumed that since you said the labels were
>>>> inverted, that meant you were using -1/+1. Using text is perfectly
>>>> okay.
>>>>> Also, what is the iteration on which JBoost outputs the matlab file
>>>>> when I use the -m option? The last one?
>>>> Yes, it is the last iteration. There should probably be an option
>>>> (like -a) to output this more often.
>>>> Aaron
>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>> Hi Viren,
>>>>> The inverted label is a result of JBoost using its own internal
>>>>> labeling system. If you swap the order in which you specify the labels
>>>>> (i.e. instead of "labels (1,-1)" you use "labels (-1,1)") you'll get
>>>>> the correct label.
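>>>>> (The same trick works with text labels: e.g. swapping a spec file line
>>>>> like "labels (merge, split)" to "labels (split, merge)" flips which
>>>>> class JBoost treats as positive internally.)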
>>>>> I haven't heard about the difference in score before. Are you perhaps
>>>>> looking at the scores for the wrong iteration? Are you using "-a -1"
>>>>> or "-a -2" switches to obtain the appropriate score/margin output
>>>>> files? Are you perhaps getting training and testing sets mixed up?
>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo directory)
>>>>> and it looks like everything is fine. If you can send your train/test
>>>>> files or reproduce the bug on the spambase dataset, please send me the
>>>>> exact parameters you're using and I'll see if it's a bug, poor
>>>>> documentation, or a misunderstanding of some sort.
>>>>> Thanks for the heads up on the potential bug in the matlab scores.
>>>>> Aaron
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using
>>>>>> JBoost. I also asked it to output a matlab script I could use to
>>>>>> classify examples with in the future. However, I was wondering why
>>>>>> the matlab script outputs slightly different values than I would get
>>>>>> by classifying the training/test set directly using JBoost (for
>>>>>> example, the sign of the classifier output is always opposite to what
>>>>>> JBoost produces, and at most I have seen a 0.1469 discrepancy in the
>>>>>> actual value after accounting for the sign issue). Has anyone
>>>>>> encountered this issue, or am I perhaps doing something incorrectly?