From: Vimal V. <vim...@ya...> - 2008-09-30 09:12:44
Respected Sir, I am doing my research work on boosting methods, and specifically on AdaBoost. I need the BrownBoost algorithm that you have used in the jBoost software, along with some statistics related to BrownBoost. Please provide these so I can do more work in this area. I am waiting for your positive reply. Thanking you, Vimal Vaghela
From: Jason K. <jas...@ro...> - 2008-09-26 22:08:28
Hello,

In looking at the examples, I am finding that the lack of documentation on the examples themselves is really hurting my attempts to understand what the classifier is outputting. It would make sense, for example, if the noisy-line example were completed from end to end, with details on building and running it as well as a description of the input and an interpretation of the results. The information that is provided is insufficient.

I am able to build the example, but when I ran it, the absence of command-line instructions and the very terse comments in the code meant I was unsure what to type on the command line. I had to figure it out myself, which is rather silly for someone trying to use the examples in order to learn the application.

As for the output: the example described on Wikipedia yields a single confidence number, which is intuitive, whereas the output of two numbers makes little sense to me. Is the confidence the sum of the outputs in the vector, is it multidimensional, or is it something else? This information should be on the site so others can make use of the data.

Jason
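Jason's question about collapsing the two-number output into a single confidence is never answered in this archive. One common convention, sketched below, is to take the difference of the two per-class scores; this is an assumption for illustration, not documented jboost behavior, so verify it against your own output files before relying on it.

```python
# Hypothetical sketch: if jboost's two outputs are per-class scores, their
# difference plays the role of the single signed confidence described on
# Wikipedia. This convention is an assumption, not documented jboost behavior.

def signed_confidence(scores):
    """scores: assumed to be [score_for_first_label, score_for_second_label]."""
    if len(scores) != 2:
        raise ValueError("expected a two-element score vector")
    # Positive result favors the first label; the magnitude is the confidence.
    return scores[0] - scores[1]

print(signed_confidence([2.0, 0.5]))  # 1.5: confident in the first label
```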
From: Aaron A. <aa...@cs...> - 2008-08-16 03:25:11
Hey Viren,

Yes, I do remember a 300-iteration limit. I think that was put in because BrownBoost occasionally went into a seemingly infinite loop, so an upper limit was placed on the number of iterations so that, when left running in the background, it didn't fill up the disk.

However, if you don't stop before 300 iterations, look at the .info file and see whether the error is decreasing. If it is decreasing, increase the iteration limit to 3000 (or another number appropriate for your application). If it is not decreasing, look at the intermediate .boosting.info files to see whether the remaining time is decreasing. If that is not decreasing either... then you've found one of those rare seemingly infinite loops. The place to change the iteration bound for BrownBoost is line 433 in Controller.java.

Keep the comments, requests, and bugs coming!

Aaron

On Fri, 15 Aug 2008, Viren Jain wrote:
> OK.. By the way, I tried the new code from the CVS repository. BrownBoost runs, but I think it always stops at the 300th learning iteration (regardless of what I use as the -r parameter?) I will double-check this... And let me know if you don't want bug reports yet :)
>
> Thanks,
> Viren
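Aaron's advice above ("look at the .info file and see if the error is decreasing") can be automated. The sketch below assumes a simple whitespace-separated .info layout (iteration number, then an error column); that layout is a guess, so adjust the column index to whatever your jboost version actually writes.

```python
# Sketch of the "is the error still decreasing?" check from the message above.
# The .info file format is assumed (whitespace-separated rows: iteration
# number first, error in the second column); adapt err_col to your files.

def error_is_decreasing(lines, err_col=1, window=10):
    """Return True if the error column decreased over the last `window` rows."""
    errs = []
    for line in lines:
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip headers, comments, and blank lines
        errs.append(float(parts[err_col]))
    if len(errs) < 2:
        return False
    tail = errs[-window:]
    return tail[-1] < tail[0]

rows = ["0 0.42", "1 0.31", "2 0.25", "3 0.24"]
print(error_is_decreasing(rows))  # True: worth raising the iteration limit
```

If this reports True at the 300-iteration cutoff, the thread suggests raising the bound (line 433 of Controller.java in the version discussed) rather than accepting the truncated run.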
From: Aaron A. <aa...@cs...> - 2008-08-16 00:49:19
On Fri, 15 Aug 2008, Viren Jain wrote:
> Sounds great, I will check it out. One question: what do you mean "by default"? I.e., if I do not include the -a option at all? Or with no iteration specified?

If you do not specify -a (or if you specify "-a 0"), BrownBoost algorithms will print output on their final iteration. I'm going to make this the default behavior for all boosting algorithms.

Aaron
From: Aaron A. <aa...@cs...> - 2008-08-15 21:54:03
Hey Viren,

Update the CVS and the data will now be output on the last iteration by default when using BrownBoost or subclasses.

So you know, you are using code that is about to be completely rewritten, which is fine, but you may want to save a copy of the version of the JBoost code you currently have so that possible bugs in the new code do not destroy any good results you achieve with this version. Alternatively, the "improvements" in BrownBoost may give you even better performance!

Also, as long as you're testing out some of the new JBoost features, you may want to check out the new R visualization scripts in ./scripts. There's a README file with basic documentation, and more documentation will appear on the website soon. The python files are outdated, probably buggy, and generate ugly pictures. If you try the R visualizations and have any problems, let me know and CC the jboost-users list.

Glad to hear you're having a good time with JBoost!

Aaron

On Fri, 15 Aug 2008, Viren Jain wrote:
> Hi Aaron,
> One other random question: I've started experimenting with BrownBoost, which is useful due to lots of noisy examples in my training set. Since I use the -r option to specify how "long" to train, is there a way to tell Jboost to only output training/test info at the last epoch (even though I don't know what the last epoch will necessarily be)?
> Thanks! And great job with JBoost, it's really a fun and useful tool to experiment with boosting.
> Thanks again,
> Viren
>
> On Aug 13, 2008, at 5:06 PM, Aaron Arvey wrote:
>
> There are a few papers on asymmetric classification. Unfortunately, most of them focus on changing weights on examples (or slack variables in SVMs) and not on maximizing the margin. Maximizing margins is one of the biggest reasons boosting and SVMs are so successful, so ignoring this when doing asymmetric prediction would seem to be a bad idea.
>
> Some examples can be found at a website I have set up to remind myself of all the papers in boosting. Just google "boosting papers."
>
> Aaron
>
> On Wed, 13 Aug 2008, Viren Jain wrote:
>
>> Great! The compilation and fix seem to work.
>>
>> The sliding bias approach is basically what I am doing now; only using positive classifications with some minimum margin. Is there any interesting prior literature out there on learning with asymmetric cost? I am interested in this issue.
>>
>> Thanks again for all your help; sorry about all the trouble!
>>
>> Viren
>>
>> On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote:
>>
>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>
>>> I downloaded the CVS version and tried to compile it, but ran into the following errors. Do you think this is probably just an outdated Java version issue?
>>>
>>> [javac] Compiling 46 source files to /home/viren/jboost/build
>>> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: cannot find symbol
>>> [javac] symbol : method getPositiveInitialPotential()
>>> [javac] location: class jboost.booster.BrownBoost
>>> [javac] System.out.println("\tPotential loss of positive examples m_booster: " + b.getPositiveInitialPotential());
>>> .... etc
>>
>> This is the region that is currently under heavy construction. Instead of trying to debug this, I just commented out the calls and everything compiled for me. Try a 'cvs update' and 'ant jar'.
>>
>>> Regarding the cost function - the distribution has maybe 70% negative examples and 30% positive examples, so not horrendously imbalanced, but for our application a false positive is extremely costly (let's say, 5 times as costly) as compared to a false negative.
>>
>> For the time being, you can just use a "sliding bias" approach, where you just add a value X to the root node (which is the same as adding X to the final score, since all examples go through the root). The root node actually balances the weights on the dataset (satisfies E_{w_{i+1}}[h(x)y] = 0; see the equation in Schapire & Singer '99, where h(x) == "Always +1").
>>
>> This isn't perfect, but check again in a couple weeks and we should have a much more technical (and much cooler) approach available based on some new math from drifting games.
>>
>> If this doesn't give very good results (no value of X gives you the sensitivity/specificity tradeoff desired), I've picked up a few other hacks that may help.
>>
>> Aaron
>>
>>> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote:
>>> That was implemented in the BrownBoost and NormalBoost boosters. Unfortunately, these boosters are currently (as in 5 minutes ago) being completely rewritten.
>>> There are still plenty of ways to artificially change the cost function, and these work well for some applications.
>>> What exactly are you trying to do? How asymmetric are the costs of your mispredictions? How asymmetric is the distribution of your classes?
>>> Aaron
>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>> Great! One last question (for now :). I was a little confused by the documentation regarding asymmetric cost functions: is it currently possible to change the cost function such that false positives are more costly than false negatives?
>>>> Thanks,
>>>> Viren
>>>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote:
>>>> Viren,
>>>> Yep, I just tried 1.4 and I was able to reproduce your problem.
>>>> This will certainly cause a speed-up in the release of the next version. Let me know if you have any problems with the CVS release.
>>>> Nice catch!
>>>> Aaron
>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>> Hey Aaron,
>>>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get empty output in the .boosting.info files. I am using release 1.4, but I can try the repository version.
>>>>> Thanks!
>>>>> Viren
>>>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote:
>>>>> Hey Viren,
>>>>> I just tried out
>>>>> cd jboost/demo
>>>>> ../jboost -numRounds 10 -a 9 -S stem
>>>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>> ../jboost -numRounds 10 -a -2 -S stem
>>>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>> and I see that this outputs the second-to-last iteration. When I try
>>>>> cd jboost/demo
>>>>> ../jboost -numRounds 10 -a 10 -S stem
>>>>> cp stem.test.boosting.info stem.test.boosting.info.bak
>>>>> ../jboost -numRounds 10 -a -2 -S stem
>>>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info
>>>>> I see that the final iteration is output.
>>>>> Let me know what you see when you run the above. If you see something different, perhaps there used to be a bug and it was corrected. The code to output files by the "-a" switch was recently updated, so perhaps this bug was corrected (I updated it and have no memory of fixing this bug, but perhaps I did...). Are you perhaps using an old version of JBoost? Perhaps try out the cvs repository and see if that fixes your problem.
>>>>> Aaron
>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>> Thanks again, Aaron.
>>>>>> I double-checked things and it seems I still see discrepancies in the classifier outputs. The exact jboost command I am using is:
>>>>>> ... jboost.controller.Controller -S test_old_background_mporder -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m classify_background.m
>>>>>> I assume there is some sort of 0 counting, since if I use -a 300 the .info.testing and .info.training files are 0 bytes. So if this is correct, then test_old_background_mporder.test.boosting.info should have identical outputs to those generated from the same examples by using classify_background.m?
>>>>>> Again, thanks so much!
>>>>>> Viren
>>>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote:
>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>> I'm actually using text strings for the labels, i.e., in the spec file I have the line "labels (merge, split)" and then for each example in training/test, I output the appropriate string. Do you recommend I use (-1,1) instead?
>>>>>> That's fine. I just assumed that since you said the labels were inverted, that meant you were using -1/+1. Using text is perfectly okay.
>>>>>>> Also, what is the iteration on which Jboost outputs the matlab file when I use the -m option? The last one?
>>>>>> Yes, it is the last iteration. There should probably be an option (like -a) to output this more often.
>>>>>> Aaron
>>>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote:
>>>>>>> Hi Viren,
>>>>>>> The inverted label is a result of JBoost using its own internal labeling system. If you swap the order in which you specify the labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get the correct label.
>>>>>>> I haven't heard about the difference in score before. Are you perhaps looking at the scores for the wrong iteration? Are you using the "-a -1" or "-a -2" switches to obtain the appropriate score/margin output files? Are you perhaps getting training and testing sets mixed up?
>>>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo directory) and it looks like everything is fine. If you can send your train/test files or reproduce the bug on the spambase dataset, please send me the exact parameters you're using and I'll see if it's a bug, poor documentation, or a misunderstanding of some sort.
>>>>>>> Thanks for the heads-up on the potential bug in the matlab scores.
>>>>>>> Aaron
>>>>>>> On Wed, 13 Aug 2008, Viren Jain wrote:
>>>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using Jboost. I also asked it to output a matlab script I could use to classify examples in the future. However, I was wondering why the matlab script outputs slightly different values than I would get by classifying the training/test set directly using Jboost (for example, the sign of the classifier output is always opposite to what Jboost produces, and at most I have seen a 0.1469 discrepancy in the actual value after accounting for the sign issue). Has anyone encountered this issue, or am I perhaps doing something incorrectly?
From: Aaron A. <aa...@cs...> - 2008-08-13 21:06:42
|
There are a few papers on asymmetric classification. Unfortunately, most of them focus on changing weights on examples (or slack variables in SVMs) and not on maximizing the margin. Maximizing margins is one of the biggest reasons boosting and SVMs are so successful, so ignoring this when doing asymmetric prediction would seem to be a bad idea. Some examples can be found at a website I have set up to remind myself of all the papers in boosting. Just google "boosting papers." Aaron On Wed, 13 Aug 2008, Viren Jain wrote: > Great! The compilation and fix seem to work. > > The sliding bias approach is basically what I am doing now; only using > positive classifications with some minimum margin. Is there any interesting > prior literature out there on learning with asymmetric cost? I am > interested in this issue. > > Thanks again for all your help, sorry about all the trouble! > > Viren > > On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote: > > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> I downloaded the CVS version and tried to compile it, but ran into the >> following errors. Do you think this is probably just an outdated Java >> version issue? >> >> [javac] Compiling 46 source files to /home/viren/jboost/build >> [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: >> cannot find symbol [javac] symbol : method getPositiveInitialPotential() >> [javac] location: class jboost.booster.BrownBoost [javac] >> System.out.println("\tPotential loss of positive examples m_booster: " + >> b.getPositiveInitialPotential()); >> .... etc > > > This is the region that is currently under heavy construction. Instead of > trying to debug this, I just commented out the calls and everything > compiled for me. Try a 'cvs update' and 'ant jar'. 
> > >> Regarding the cost function - the distribution has maybe 70% negative >> examples and 30% positive examples, so not horrendously imbalanced, but >> for our application a false positive is extremely costly (lets say, 5 >> times as costly) as compared to a false negative. > > For the moment being, you can just use a "sliding bias" approach. Where > you just add a value X to the root node (which is the same as adding X to > the final score, since all examples go through the root). The root node > actually balances the weights on the dataset (satisfies E_{w_{i+1}}[h(x)y] > = 0, see equation in Schapire & Singer 99, where h(x)=="Always +1"). > > This isn't perfect, but check again in a couple weeks and we should have a > much more technical (and much cooler) approach available based on some new > math from drifting games. > > If this doesn't give very good results (no value of X gives you the > sensitivity/specificity tradeoff desired), I've picked up a few other hacks > that may help. > > Aaron > > > >> On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote: >> >> That was implemented in the BrownBoost and NormalBoost boosters. >> Unfortunately, these boosters are currently (as in 5 minutes ago) being >> completely rewritten. >> >> There are still plenty of ways to artificially change the cost function, >> and these work well for some applications. >> >> What exactly are you trying to do? How asymmetric are your costs of your >> mispredictions? How asymmtric is the distribution of your classes? >> >> Aaron >> >> >> On Wed, 13 Aug 2008, Viren Jain wrote: >> >>> Great! One last question (for now :). I was a little confused by the >>> documentation regarding asymmetric cost functions: is it currently >>> possible to change the cost function such that false positives are more >>> costly than false negatives? >>> Thanks, >>> Viren >>> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote: >>> Viren, >>> Yep, I just tried 1.4 and I was able to reproduce your problem. 
>>> This will certainly cause a speed up in the release of the next version. >>> Let me know if you have any problems with the CVS release. >>> Nice catch! >>> Aaron >>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>> Hey Aaron, >>>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an >>>> empty output in the .boosting.info files. I am using release 1.4, but I >>>> can try the repository version. >>>> Thanks! >>>> Viren >>>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote: >>>> Hey Viren, >>>> I just tried out >>>> cd jboost/demo >>>> ../jboost -numRounds 10 -a 9 -S stem >>>> cp stem.test.boosting.info stem.test.boosting.info.bak >>>> ../jboost -numRounds 10 -a -2 -S stem >>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>>> And I see that this outputs the second to last iteration. When I try >>>> cd jboost/demo >>>> ../jboost -numRounds 10 -a 10 -S stem >>>> cp stem.test.boosting.info stem.test.boosting.info.bak >>>> ../jboost -numRounds 10 -a -2 -S stem >>>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>>> I see that the final iteration is output. >>>> Let me know what you see when you run the above. If you see something >>>> different, perhaps the used to be a bug and it was corrected. The code >>>> to output files by the "-a" switch was recently updated, so perhaps >>>> this bug was corrected (I updated it and have no memory of fixing this >>>> bug, but perhaps I did...). Are you perhaps using an old version of >>>> JBoost? Perhaps try out the cvs repository and see if that fixes your >>>> problem. >>>> Aaron >>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>> Thanks again, Aaron. >>>>> I double checked things and it seems I still discrepancies in the >>>>> classifier outputs. The exact jboost command I am using is: >>>>> ... 
jboost.controller.Controller -S test_old_background_mporder >>>>> -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m >>>>> classify_background.m >>>>> I assume there is some sort of 0 counting, since if I use -a 300 the >>>>> .info.testing and .info.training file are 0 bytes. So if this is >>>>> correct, then test_old_background_mporder.test.boosting.info should >>>>> have identical outputs to those generated from the same examples by >>>>> using classify_background.m? >>>>> Again, thanks so much! >>>>> Viren >>>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: >>>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>>> I'm actually using text strings for the labels. i.e., in the spec >>>>>> file i have line "labels (merge, split)" and then for each >>>>>> example in training/test, I output the appropriate string. Do you >>>>>> recommend I use (-1,1) instead? >>>>> That's fine. I just assumed that since you said the labels were >>>>> inverted, that meant you were using -1/+1. Using text is perfectly >>>>> okay. >>>>>> Also, what is the iteration on which Jboost outputs the matlab file >>>>>> when I use the -m option? The last one? >>>>> Yes, it is the last iteration. There should probably be an option >>>>> (like -a) to output this more often. >>>>> Aaron >>>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >>>>>> Hi Viren, >>>>>> The inverted label is a result of JBoost using it's own internal >>>>>> labeling system. If you swap the order of how you specify the labels >>>>>> (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get >>>>>> the correct label. >>>>>> I haven't heard about the difference in score before. Are you >>>>>> perhaps looking at the scores for the wrong iteration? Are you using >>>>>> "-a -1" or "-a -2" switches to obtain the appropriate score/margin >>>>>> output files? Are you perhaps getting training and testing sets mixed >>>>>> up? 
>>>>>> I just tested ADD_ROOT on the spambase dataset (in the demo >>>>>> directory) and it looks like everything is fine. If you can send >>>>>> your train/test files or reproduce the bug on the spambase dataset, >>>>>> please send me the exact parameters you're using and I'll see if it's >>>>>> a bug, poor documentation, or a misunderstanding of some sort. >>>>>> Thanks for the heads up on the potential bug in the matlab scores. >>>>>> Aaron >>>>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using >>>>>>> Jboost. I also asked it to output a matlab script I could use to >>>>>>> classify examples with in the future. However, I was wondering why >>>>>>> the matlab script outputs slightly different values than I would get >>>>>>> by classifying the training/test set directly using Jboost (for >>>>>>> example, the sign of the classifier output is always opposite to >>>>>>> what Jboost produces, and at most I have seen a 0.1469 discrepancy >>>>>>> in the actual value after accounting for the sign issue). Has anyone >>>>>>> encountered this issue, or am I perhaps doing something incorrectly? |
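[Editor's note: the "changing weights on examples" approach to asymmetric costs mentioned in this thread can be sketched in a few lines. This is an illustration only, not JBoost's implementation; the 5:1 cost ratio and 70/30 class split are just the figures from Viren's use case.]

```python
def cost_sensitive_weights(labels, fp_cost=5.0):
    """Initial boosting weights when a false positive is `fp_cost`
    times as costly as a false negative.

    Negative examples (y = -1) are up-weighted, so a hypothesis that
    misclassifies them (producing false positives) pays proportionally
    more. Illustrative sketch only, not JBoost code.
    """
    raw = [fp_cost if y == -1 else 1.0 for y in labels]
    total = sum(raw)
    return [w / total for w in raw]

# A 30% positive / 70% negative toy distribution, as in the thread:
labels = [+1, +1, +1, -1, -1, -1, -1, -1, -1, -1]
weights = cost_sensitive_weights(labels, fp_cost=5.0)
```

The weights sum to one, and each negative example carries five times the weight of each positive one, so the booster's training loss is dominated by potential false positives.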
From: Viren J. <vi...@MI...> - 2008-08-13 20:20:59
|
Great! The compilation and fix seem to work. The sliding bias approach is basically what I am doing now; only using positive classifications with some minimum margin. Is there an interesting prior literature out there on learning with asymmetric cost..? I am interested in this issue. Thanks again for all your help, sorry about all the trouble! Viren On Aug 13, 2008, at 4:01 PM, Aaron Arvey wrote: On Wed, 13 Aug 2008, Viren Jain wrote: > I downloaded the CVS version and tried to compile it, but ran into > the following errors. Do you think this is probably just an outdated > Java version issue? > > [javac] Compiling 46 source files to /home/viren/jboost/build > [javac] /home/viren/jboost/src/jboost/controller/Controller.java: > 148: cannot find symbol [javac] symbol : method > getPositiveInitialPotential() [javac] location: class > jboost.booster.BrownBoost [javac] System.out.println("\tPotential > loss of positive examples m_booster: " + > b.getPositiveInitialPotential()); > .... etc This is the region that is currently under heavy construction. Instead of trying to debug this, I just commented out the calls and everything compiled for me. Try a 'cvs update' and 'ant jar'. > Regarding the cost function - the distribution has maybe 70% > negative examples and 30% positive examples, so not horrendously > imbalanced, but for our application a false positive is extremely > costly (lets say, 5 times as costly) as compared to a false negative. For the moment being, you can just use a "sliding bias" approach. Where you just add a value X to the root node (which is the same as adding X to the final score, since all examples go through the root). The root node actually balances the weights on the dataset (satisfies E_{w_{i+1}}[h(x)y] = 0, see equation in Schapire & Singer 99, where h(x)=="Always +1"). This isn't perfect, but check again in a couple weeks and we should have a much more technical (and much cooler) approach available based on some new math from drifting games. 
If this doesn't give very good results (no value of X gives you the sensitivity/specificity tradeoff desired), I've picked up a few other hacks that may help. Aaron > On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote: > > That was implemented in the BrownBoost and NormalBoost boosters. > Unfortunately, these boosters are currently (as in 5 minutes ago) > being completely rewritten. > > There are still plenty of ways to artificially change the cost > function, and these work well for some applications. > > What exactly are you trying to do? How asymmetric are your costs of > your mispredictions? How asymmtric is the distribution of your > classes? > > Aaron > > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> Great! One last question (for now :). I was a little confused by >> the documentation regarding asymmetric cost functions: is it >> currently possible to change the cost function such that false >> positives are more costly than false negatives? >> Thanks, >> Viren >> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote: >> Viren, >> Yep, I just tried 1.4 and I was able to reproduce your problem. >> This will certainly cause a speed up in the release of the next >> version. Let me know if you have any problems with the CVS release. >> Nice catch! >> Aaron >> On Wed, 13 Aug 2008, Viren Jain wrote: >>> Hey Aaron, >>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get >>> an empty output in the .boosting.info files. I am using release >>> 1.4, but I can try the repository version. >>> Thanks! >>> Viren >>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote: >>> Hey Viren, >>> I just tried out >>> cd jboost/demo >>> ../jboost -numRounds 10 -a 9 -S stem >>> cp stem.test.boosting.info stem.test.boosting.info.bak >>> ../jboost -numRounds 10 -a -2 -S stem >>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>> And I see that this outputs the second to last iteration. 
When I >>> try >>> cd jboost/demo >>> ../jboost -numRounds 10 -a 10 -S stem >>> cp stem.test.boosting.info stem.test.boosting.info.bak >>> ../jboost -numRounds 10 -a -2 -S stem >>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>> I see that the final iteration is output. >>> Let me know what you see when you run the above. If you see >>> something different, perhaps the used to be a bug and it was >>> corrected. The code to output files by the "-a" switch was >>> recently updated, so perhaps this bug was corrected (I updated it >>> and have no memory of fixing this bug, but perhaps I did...). Are >>> you perhaps using an old version of JBoost? Perhaps try out the >>> cvs repository and see if that fixes your problem. >>> Aaron >>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>> Thanks again, Aaron. >>>> I double checked things and it seems I still discrepancies in the >>>> classifier outputs. The exact jboost command I am using is: >>>> ... jboost.controller.Controller -S test_old_background_mporder - >>>> numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m >>>> classify_background.m >>>> I assume there is some sort of 0 counting, since if I use -a 300 >>>> the .info.testing and .info.training file are 0 bytes. So if this >>>> is correct, then test_old_background_mporder.test.boosting.info >>>> should have identical outputs to those generated from the same >>>> examples by using classify_background.m? >>>> Again, thanks so much! >>>> Viren >>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: >>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>> I'm actually using text strings for the labels. i.e., in the >>>>> spec file i have line "labels (merge, split)" and then for >>>>> each example in training/test, I output the appropriate string. >>>>> Do you recommend I use (-1,1) instead? >>>> That's fine. I just assumed that since you said the labels were >>>> inverted, that meant you were using -1/+1. Using text is >>>> perfectly okay. 
>>>>> Also, what is the iteration on which Jboost outputs the matlab >>>>> file when I use the -m option? The last one? >>>> Yes, it is the last iteration. There should probably be an >>>> option (like -a) to output this more often. >>>> Aaron >>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >>>>> Hi Viren, >>>>> The inverted label is a result of JBoost using it's own internal >>>>> labeling system. If you swap the order of how you specify the >>>>> labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") >>>>> you'll get the correct label. >>>>> I haven't heard about the difference in score before. Are you >>>>> perhaps looking at the scores for the wrong iteration? Are you >>>>> using "-a -1" or "-a -2" switches to obtain the appropriate >>>>> score/margin output files? Are you perhaps getting training and >>>>> testing sets mixed up? >>>>> I just tested ADD_ROOT on the spambase dataset (in the demo >>>>> directory) and it looks like everything is fine. If you can >>>>> send your train/test files or reproduce the bug on the spambase >>>>> dataset, please send me the exact parameters you're using and >>>>> I'll see if it's a bug, poor documentation, or a >>>>> misunderstanding of some sort. >>>>> Thanks for the heads up on the potential bug in the matlab scores. >>>>> Aaron >>>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT >>>>>> using Jboost. I also asked it to output a matlab script I could >>>>>> use to classify examples with in the future. However, I was >>>>>> wondering why the matlab script outputs slightly different >>>>>> values than I would get by classifying the training/test set >>>>>> directly using Jboost (for example, the sign of the classifier >>>>>> output is always opposite to what Jboost produces, and at most >>>>>> I have seen a 0.1469 discrepancy in the actual value after >>>>>> accounting for the sign issue). 
Has anyone encountered this >>>>>> issue, or am I perhaps doing something incorrectly? |
From: Aaron A. <aa...@cs...> - 2008-08-13 20:01:30
|
On Wed, 13 Aug 2008, Viren Jain wrote: > I downloaded the CVS version and tried to compile it, but ran into > the following errors. Do you think this is probably just an outdated Java > version issue? > > [javac] Compiling 46 source files to /home/viren/jboost/build > [javac] /home/viren/jboost/src/jboost/controller/Controller.java:148: > cannot find symbol [javac] symbol : method getPositiveInitialPotential() > [javac] location: class jboost.booster.BrownBoost [javac] > System.out.println("\tPotential loss of positive examples m_booster: " + > b.getPositiveInitialPotential()); > .... etc This is the region that is currently under heavy construction. Instead of trying to debug this, I just commented out the calls and everything compiled for me. Try a 'cvs update' and 'ant jar'. > Regarding the cost function - the distribution has maybe 70% > negative examples and 30% positive examples, so not horrendously > imbalanced, but for our application a false positive is extremely costly > (let's say, 5 times as costly) as compared to a false negative. For the time being, you can just use a "sliding bias" approach, where you just add a value X to the root node (which is the same as adding X to the final score, since all examples go through the root). The root node actually balances the weights on the dataset (it satisfies E_{w_{i+1}}[h(x)y] = 0; see the equation in Schapire & Singer 99, where h(x)=="Always +1"). This isn't perfect, but check again in a couple of weeks and we should have a much more technical (and much cooler) approach available based on some new math from drifting games. If this doesn't give very good results (no value of X gives you the sensitivity/specificity tradeoff desired), I've picked up a few other hacks that may help. Aaron > On Aug 13, 2008, at 3:29 PM, Aaron Arvey wrote: > > That was implemented in the BrownBoost and NormalBoost boosters. > Unfortunately, these boosters are currently (as in 5 minutes ago) being > completely rewritten. 
> > There are still plenty of ways to artificially change the cost function, > and these work well for some applications. > > What exactly are you trying to do? How asymmetric are your costs of your > mispredictions? How asymmtric is the distribution of your classes? > > Aaron > > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> Great! One last question (for now :). I was a little confused by the >> documentation regarding asymmetric cost functions: is it currently >> possible to change the cost function such that false positives are more >> costly than false negatives? >> >> Thanks, >> Viren >> >> >> >> On Aug 13, 2008, at 3:19 PM, Aaron Arvey wrote: >> >> Viren, >> >> Yep, I just tried 1.4 and I was able to reproduce your problem. >> >> This will certainly cause a speed up in the release of the next version. >> Let me know if you have any problems with the CVS release. >> >> Nice catch! >> >> Aaron >> >> >> On Wed, 13 Aug 2008, Viren Jain wrote: >> >>> Hey Aaron, >>> OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an >>> empty output in the .boosting.info files. I am using release 1.4, but I >>> can try the repository version. >>> Thanks! >>> Viren >>> On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote: >>> Hey Viren, >>> I just tried out >>> cd jboost/demo >>> ../jboost -numRounds 10 -a 9 -S stem >>> cp stem.test.boosting.info stem.test.boosting.info.bak >>> ../jboost -numRounds 10 -a -2 -S stem >>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>> And I see that this outputs the second to last iteration. When I try >>> cd jboost/demo >>> ../jboost -numRounds 10 -a 10 -S stem >>> cp stem.test.boosting.info stem.test.boosting.info.bak >>> ../jboost -numRounds 10 -a -2 -S stem >>> sdiff stem.test.boosting.info.bak stem.test.boosting.info >>> I see that the final iteration is output. >>> Let me know what you see when you run the above. If you see something >>> different, perhaps the used to be a bug and it was corrected. 
The code >>> to output files by the "-a" switch was recently updated, so perhaps this >>> bug was corrected (I updated it and have no memory of fixing this bug, >>> but perhaps I did...). Are you perhaps using an old version of JBoost? >>> Perhaps try out the cvs repository and see if that fixes your problem. >>> Aaron >>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>> Thanks again, Aaron. >>>> I double checked things and it seems I still discrepancies in the >>>> classifier outputs. The exact jboost command I am using is: >>>> ... jboost.controller.Controller -S test_old_background_mporder >>>> -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m >>>> classify_background.m >>>> I assume there is some sort of 0 counting, since if I use -a 300 the >>>> .info.testing and .info.training file are 0 bytes. So if this is >>>> correct, then test_old_background_mporder.test.boosting.info should >>>> have identical outputs to those generated from the same examples by >>>> using classify_background.m? >>>> Again, thanks so much! >>>> Viren >>>> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: >>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>> I'm actually using text strings for the labels. i.e., in the spec file >>>>> i have line "labels (merge, split)" and then for each >>>>> example in training/test, I output the appropriate string. Do you >>>>> recommend I use (-1,1) instead? >>>> That's fine. I just assumed that since you said the labels were >>>> inverted, that meant you were using -1/+1. Using text is perfectly >>>> okay. >>>>> Also, what is the iteration on which Jboost outputs the matlab file >>>>> when I use the -m option? The last one? >>>> Yes, it is the last iteration. There should probably be an option >>>> (like -a) to output this more often. >>>> Aaron >>>>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >>>>> Hi Viren, >>>>> The inverted label is a result of JBoost using it's own internal >>>>> labeling system. 
If you swap the order of how you specify the labels >>>>> (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get >>>>> the correct label. >>>>> I haven't heard about the difference in score before. Are you perhaps >>>>> looking at the scores for the wrong iteration? Are you using "-a -1" >>>>> or "-a -2" switches to obtain the appropriate score/margin output >>>>> files? Are you perhaps getting training and testing sets mixed up? >>>>> I just tested ADD_ROOT on the spambase dataset (in the demo directory) >>>>> and it looks like everything is fine. If you can send your train/test >>>>> files or reproduce the bug on the spambase dataset, please send me the >>>>> exact parameters you're using and I'll see if it's a bug, poor >>>>> documentation, or a misunderstanding of some sort. >>>>> Thanks for the heads up on the potential bug in the matlab scores. >>>>> Aaron >>>>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using >>>>>> Jboost. I also asked it to output a matlab script I could use to >>>>>> classify examples with in the future. However, I was wondering why >>>>>> the matlab script outputs slightly different values than I would get >>>>>> by classifying the training/test set directly using Jboost (for >>>>>> example, the sign of the classifier output is always opposite to what >>>>>> Jboost produces, and at most I have seen a 0.1469 discrepancy in the >>>>>> actual value after accounting for the sign issue). Has anyone >>>>>> encountered this issue, or am I perhaps doing something incorrectly? |
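[Editor's note: the "sliding bias" trick described above — add a constant X to the final score, since every example passes through the root — can be sketched as follows. This is an illustration, not JBoost code; `scores` stands in for the margins JBoost writes to the `.boosting.info` files, and the tie-breaking convention (score of exactly zero counts as positive) is an assumption.]

```python
def biased_labels(scores, bias):
    """Apply sign(F(x) + X): adding X to the root node is equivalent
    to adding X to every final score. Ties (exactly zero) count as
    positive here, an assumed convention."""
    return [1 if s + bias >= 0 else -1 for s in scores]

def bias_for_fp_rate(scores, labels, max_fp_rate):
    """Pick the most permissive bias X whose false-positive rate on a
    held-out set is at most max_fp_rate. Candidate biases are the
    values that flip one example at a time; float('-inf') is the
    all-negative fallback, so the loop always returns."""
    neg = sum(1 for y in labels if y == -1)
    for bias in sorted((-s for s in scores), reverse=True) + [float("-inf")]:
        preds = biased_labels(scores, bias)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == -1)
        if (fp / neg if neg else 0.0) <= max_fp_rate:
            return bias
```

Sweeping X over the held-out scores like this traces out the sensitivity/specificity tradeoff Aaron mentions; if no X gives an acceptable operating point, the sliding bias alone is not enough.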
From: Aaron A. <aa...@cs...> - 2008-08-13 19:19:06
|
Viren, Yep, I just tried 1.4 and I was able to reproduce your problem. This will certainly cause a speed-up in the release of the next version. Let me know if you have any problems with the CVS release. Nice catch! Aaron On Wed, 13 Aug 2008, Viren Jain wrote: > Hey Aaron, > OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an > empty output in the .boosting.info files. I am using release 1.4, but I can > try the repository version. > Thanks! > Viren > > On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote: > > Hey Viren, > > I just tried out > > cd jboost/demo > ../jboost -numRounds 10 -a 9 -S stem > cp stem.test.boosting.info stem.test.boosting.info.bak > ../jboost -numRounds 10 -a -2 -S stem > sdiff stem.test.boosting.info.bak stem.test.boosting.info > > And I see that this outputs the second to last iteration. When I try > > cd jboost/demo > ../jboost -numRounds 10 -a 10 -S stem > cp stem.test.boosting.info stem.test.boosting.info.bak > ../jboost -numRounds 10 -a -2 -S stem > sdiff stem.test.boosting.info.bak stem.test.boosting.info > > I see that the final iteration is output. > > Let me know what you see when you run the above. If you see something > different, perhaps there used to be a bug and it was corrected. The code to > output files by the "-a" switch was recently updated, so perhaps this bug > was corrected (I updated it and have no memory of fixing this bug, but > perhaps I did...). Are you perhaps using an old version of JBoost? Perhaps > try out the cvs repository and see if that fixes your problem. > > Aaron > > > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> Thanks again, Aaron. >> >> I double-checked things and it seems I still see discrepancies in the >> classifier outputs. The exact jboost command I am using is: >> >> ... 
jboost.controller.Controller -S test_old_background_mporder >> -numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m >> classify_background.m >> >> I assume there is some sort of 0 counting, since if I use -a 300 the >> .info.testing and .info.training file are 0 bytes. So if this is correct, >> then test_old_background_mporder.test.boosting.info should have identical >> outputs to those generated from the same examples by using >> classify_background.m? >> >> Again, thanks so much! >> Viren >> >> >> >> On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: >> >> On Wed, 13 Aug 2008, Viren Jain wrote: >> >>> I'm actually using text strings for the labels. i.e., in the spec file i >>> have line "labels (merge, split)" and then for each example >>> in training/test, I output the appropriate string. Do you recommend I >>> use (-1,1) instead? >> >> That's fine. I just assumed that since you said the labels were >> inverted, that meant you were using -1/+1. Using text is perfectly okay. >> >>> Also, what is the iteration on which Jboost outputs the matlab file when >>> I use the -m option? The last one? >> >> Yes, it is the last iteration. There should probably be an option (like >> -a) to output this more often. >> >> Aaron >> >> >>> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >>> Hi Viren, >>> The inverted label is a result of JBoost using it's own internal >>> labeling system. If you swap the order of how you specify the labels >>> (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get the >>> correct label. >>> I haven't heard about the difference in score before. Are you perhaps >>> looking at the scores for the wrong iteration? Are you using "-a -1" or >>> "-a -2" switches to obtain the appropriate score/margin output files? >>> Are you perhaps getting training and testing sets mixed up? >>> I just tested ADD_ROOT on the spambase dataset (in the demo directory) >>> and it looks like everything is fine. 
If you can send your train/test >>> files or reproduce the bug on the spambase dataset, please send me the >>> exact parameters you're using and I'll see if it's a bug, poor >>> documentation, or a misunderstanding of some sort. >>> Thanks for the heads up on the potential bug in the matlab scores. >>> Aaron >>> On Wed, 13 Aug 2008, Viren Jain wrote: >>>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using >>>> Jboost. I also asked it to output a matlab script I could use to >>>> classify examples with in the future. However, I was wondering why the >>>> matlab script outputs slightly different values than I would get by >>>> classifying the training/test set directly using Jboost (for example, >>>> the sign of the classifier output is always opposite to what Jboost >>>> produces, and at most I have seen a 0.1469 discrepancy in the actual >>>> value after accounting for the sign issue). Has anyone encountered this >>>> issue, or am I perhaps doing something incorrectly? |
From: Viren J. <vi...@MI...> - 2008-08-13 19:11:30
|
Hey Aaron, OK, so when I run "../jboost -numRounds 10 -a 10 -S stem" I get an empty output in the .boosting.info files. I am using release 1.4, but I can try the repository version. Thanks! Viren On Aug 13, 2008, at 2:58 PM, Aaron Arvey wrote: Hey Viren, I just tried out cd jboost/demo ../jboost -numRounds 10 -a 9 -S stem cp stem.test.boosting.info stem.test.boosting.info.bak ../jboost -numRounds 10 -a -2 -S stem sdiff stem.test.boosting.info.bak stem.test.boosting.info And I see that this outputs the second to last iteration. When I try cd jboost/demo ../jboost -numRounds 10 -a 10 -S stem cp stem.test.boosting.info stem.test.boosting.info.bak ../jboost -numRounds 10 -a -2 -S stem sdiff stem.test.boosting.info.bak stem.test.boosting.info I see that the final iteration is output. Let me know what you see when you run the above. If you see something different, perhaps the used to be a bug and it was corrected. The code to output files by the "-a" switch was recently updated, so perhaps this bug was corrected (I updated it and have no memory of fixing this bug, but perhaps I did...). Are you perhaps using an old version of JBoost? Perhaps try out the cvs repository and see if that fixes your problem. Aaron On Wed, 13 Aug 2008, Viren Jain wrote: > Thanks again, Aaron. > > I double checked things and it seems I still discrepancies in the > classifier outputs. The exact jboost command I am using is: > > ... jboost.controller.Controller -S test_old_background_mporder - > numRounds 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m > classify_background.m > > I assume there is some sort of 0 counting, since if I use -a 300 > the .info.testing and .info.training file are 0 bytes. So if this is > correct, then test_old_background_mporder.test.boosting.info should > have identical outputs to those generated from the same examples by > using classify_background.m? > > Again, thanks so much! 
> Viren > > > > On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> I'm actually using text strings for the labels. i.e., in the spec >> file i have line "labels (merge, split)" and then for each >> example in training/test, I output the appropriate string. Do you >> recommend I use (-1,1) instead? > > That's fine. I just assumed that since you said the labels were > inverted, that meant you were using -1/+1. Using text is perfectly > okay. > >> Also, what is the iteration on which Jboost outputs the matlab file >> when I use the -m option? The last one? > > Yes, it is the last iteration. There should probably be an option > (like -a) to output this more often. > > Aaron > > >> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >> Hi Viren, >> The inverted label is a result of JBoost using its own internal >> labeling system. If you swap the order of how you specify the >> labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") >> you'll get the correct label. >> I haven't heard about the difference in score before. Are you >> perhaps looking at the scores for the wrong iteration? Are you >> using "-a -1" or "-a -2" switches to obtain the appropriate score/ >> margin output files? Are you perhaps getting training and testing >> sets mixed up? >> I just tested ADD_ROOT on the spambase dataset (in the demo >> directory) and it looks like everything is fine. If you can send >> your train/test files or reproduce the bug on the spambase dataset, >> please send me the exact parameters you're using and I'll see if >> it's a bug, poor documentation, or a misunderstanding of some sort. >> Thanks for the heads up on the potential bug in the matlab scores. >> Aaron >> On Wed, 13 Aug 2008, Viren Jain wrote: >>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using >>> Jboost. I also asked it to output a matlab script I could use to >>> classify examples with in the future. 
However, I was wondering why >>> the matlab script outputs slightly different values than I would >>> get by classifying the training/test set directly using Jboost >>> (for example, the sign of the classifier output is always opposite >>> to what Jboost produces, and at most I have seen a 0.1469 >>> discrepancy in the actual value after accounting for the sign >>> issue). Has anyone encountered this issue, or am I perhaps doing >>> something incorrectly? |
From: Aaron A. <aa...@cs...> - 2008-08-13 18:58:55
|
Hey Viren, I just tried out cd jboost/demo ../jboost -numRounds 10 -a 9 -S stem cp stem.test.boosting.info stem.test.boosting.info.bak ../jboost -numRounds 10 -a -2 -S stem sdiff stem.test.boosting.info.bak stem.test.boosting.info And I see that this outputs the second to last iteration. When I try cd jboost/demo ../jboost -numRounds 10 -a 10 -S stem cp stem.test.boosting.info stem.test.boosting.info.bak ../jboost -numRounds 10 -a -2 -S stem sdiff stem.test.boosting.info.bak stem.test.boosting.info I see that the final iteration is output. Let me know what you see when you run the above. If you see something different, perhaps there used to be a bug and it was corrected. The code to output files by the "-a" switch was recently updated, so perhaps this bug was corrected (I updated it and have no memory of fixing this bug, but perhaps I did...). Are you perhaps using an old version of JBoost? Perhaps try out the cvs repository and see if that fixes your problem. Aaron On Wed, 13 Aug 2008, Viren Jain wrote: > Thanks again, Aaron. > > I double checked things and it seems I still see discrepancies in the > classifier outputs. The exact jboost command I am using is: > > ... jboost.controller.Controller -S test_old_background_mporder -numRounds > 300 -b LogLossBoost -ATreeType ADD_ROOT -a 299 -m classify_background.m > > I assume there is some sort of 0 counting, since if I use -a 300 the > .info.testing and .info.training file are 0 bytes. So if this is correct, > then test_old_background_mporder.test.boosting.info should have identical > outputs to those generated from the same examples by using > classify_background.m? > > Again, thanks so much! > Viren > > > > On Aug 13, 2008, at 1:11 PM, Aaron Arvey wrote: > > On Wed, 13 Aug 2008, Viren Jain wrote: > >> I'm actually using text strings for the labels. i.e., in the spec file i >> have line "labels (merge, split)" and then for each example >> in training/test, I output the appropriate string. 
Do you recommend I use >> (-1,1) instead? > > That's fine. I just assumed that since you said the labels were inverted, > that meant you were using -1/+1. Using text is perfectly okay. > >> Also, what is the iteration on which Jboost outputs the matlab file when >> I use the -m option? The last one? > > Yes, it is the last iteration. There should probably be an option (like > -a) to output this more often. > > Aaron > > >> On Aug 13, 2008, at 12:43 PM, Aaron Arvey wrote: >> >> Hi Viren, >> >> The inverted label is a result of JBoost using its own internal labeling >> system. If you swap the order of how you specify the labels (i.e. >> instead of "labels (1,-1)" you do "labels (-1,1)") you'll get the correct >> label. >> >> I haven't heard about the difference in score before. Are you perhaps >> looking at the scores for the wrong iteration? Are you using "-a -1" or >> "-a -2" switches to obtain the appropriate score/margin output files? >> Are you perhaps getting training and testing sets mixed up? >> >> I just tested ADD_ROOT on the spambase dataset (in the demo directory) >> and it looks like everything is fine. If you can send your train/test >> files or reproduce the bug on the spambase dataset, please send me the >> exact parameters you're using and I'll see if it's a bug, poor >> documentation, or a misunderstanding of some sort. >> >> Thanks for the heads up on the potential bug in the matlab scores. >> >> Aaron >> >> >> >> On Wed, 13 Aug 2008, Viren Jain wrote: >> >>> I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using >>> Jboost. I also asked it to output a matlab script I could use to >>> classify examples with in the future. 
However, I was wondering why the >>> matlab script outputs slightly different values than I would get by >>> classifying the training/test set directly using Jboost (for example, >>> the sign of the classifier output is always opposite to what Jboost >>> produces, and at most I have seen a 0.1469 discrepancy in the actual >>> value after accounting for the sign issue). Has anyone encountered this >>> issue, or am I perhaps doing something incorrectly? |
From: Aaron A. <aa...@cs...> - 2008-08-13 16:43:07
|
Hi Viren, The inverted label is a result of JBoost using its own internal labeling system. If you swap the order of how you specify the labels (i.e. instead of "labels (1,-1)" you do "labels (-1,1)") you'll get the correct label. I haven't heard about the difference in score before. Are you perhaps looking at the scores for the wrong iteration? Are you using "-a -1" or "-a -2" switches to obtain the appropriate score/margin output files? Are you perhaps getting training and testing sets mixed up? I just tested ADD_ROOT on the spambase dataset (in the demo directory) and it looks like everything is fine. If you can send your train/test files or reproduce the bug on the spambase dataset, please send me the exact parameters you're using and I'll see if it's a bug, poor documentation, or a misunderstanding of some sort. Thanks for the heads up on the potential bug in the matlab scores. Aaron On Wed, 13 Aug 2008, Viren Jain wrote: > I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using > Jboost. I also asked it to output a matlab script I could use to > classify examples with in the future. However, I was wondering why the > matlab script outputs slightly different values than I would get by > classifying the training/test set directly using Jboost (for example, > the sign of the classifier output is always opposite to what Jboost > produces, and at most I have seen a 0.1469 discrepancy in the actual > value after accounting for the sign issue). Has anyone encountered this > issue, or am I perhaps doing something incorrectly? 
From: Viren J. <vi...@MI...> - 2008-08-13 15:45:09
|
Hello, I trained a LogLossBoost classifier with -ATreeType ADD_ROOT using Jboost. I also asked it to output a matlab script I could use to classify examples with in the future. However, I was wondering why the matlab script outputs slightly different values than I would get by classifying the training/test set directly using Jboost (for example, the sign of the classifier output is always opposite to what Jboost produces, and at most I have seen a 0.1469 discrepancy in the actual value after accounting for the sign issue). Has anyone encountered this issue, or am I perhaps doing something incorrectly? Thanks so much, Viren Jain |
From: Aaron A. <aa...@cs...> - 2008-08-13 04:50:43
|
On Tue, 12 Aug 2008, Gungor Polatkan wrote: > The major issue for me was how the current algorithm uses the weights. Is > the implementation buggy, or the idea behind it? If the > implementation is buggy, the thing I am curious about is the idea behind > weighting the data in the current implementation. The idea itself is not buggy. The current implementation is buggy. > I also looked at the code and I am confused about the variable > names. Is the distribution updated at each > iteration of boosting also referred to as weights? The member variable m_sampleWeights of AdaBoost is the weight as read in from the data file. The member variable m_weights is an array of weights, one for each example, for a given iteration. m_weights is sometimes referred to as "D" in AdaBoost papers, or the "distribution across examples". The variable name reminds me that m_sampleWeights is meant to be used with boosting-by-sampling, which I'm not sure was ever fully implemented with a standard interface. It should still work for your purposes... but it doesn't, since there's likely a bug somewhere. Aaron > Aaron Arvey wrote: >> Hi Gungor, >> >> Glad to hear you're working with JBoost! >> >> See comments inline below. >> >> On Tue, 12 Aug 2008, Gungor Polatkan wrote: >> >>> 1) First question is about the weight input. The meaning (higher weight >>> implies greater importance to classify correctly) is fundamentally >>> important for us since it is the heart of our research project. How does >>> the algorithm do that? Is there any paper related to this idea? or is it >>> just a practical empirical method just by changing the initial >>> distribution? Do you guys know anything about that? Any information >>> about this thing will help me very much. Also what is the bug currently >>> in the weighting? looking for the news... 
>>> |weight| an initial weighting of the data (higher weight implies >>> greater importance to classify correctly) THERE IS A BUG IN WEIGHTING IN >>> MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional >> >> The bug in weighting has still not been fixed. All I know is that the final >> output from data with weights is not as would be expected (I verified this >> myself several months ago). The weighting itself is read in correctly (from >> what I could tell by output), but the way it is applied is somehow buggy. >> It is only applied in a couple locations, so it is somewhat unnerving that >> it causes such abnormal behavior. >> >> There are many other ways to reweight your data other than using the >> provided weight option. Depending how large and extreme your class >> distributions are, you can just oversample the smaller class prior to input >> to JBoost. Keep in mind that the first weak hypothesis is always "Class is >> +1" so any lopsidedness in the data will be reweighted by the score given >> to this classifier and the subsequent reweighting on examples. >> >> NOTE: The "Class is +1" classifier will rebalance data that isn't *too* >> skewed in class distribution. The fact that it doesn't balance the classes >> when they are skewed is considered to be a small bug (I verified this as >> well, around the same time I verified the weight bug). However, if you >> oversample the data so that the classes aren't *too* skewed (I've done 10:1 >> without problem), then sliding the score for "Class is +1" should provide >> you with control over sensitivity/specificity. >> >>> 2) Second question is about the weak learner Jboost use. Since my data >>> features are Real Values (not binary or discrete but -inf to +inf Real >>> Numbers), I think I should use decision stumps with real thresholds. >>> Does the algorithm consider such a thing (for binary feature a simpler >>> stump should be used and for real valued another one)? 
>> >> If you run JBoost with default boosting parameters, it will use decision >> stumps for weak learners. Boolean values can be seen as a subset of real >> values (-1 is false, +1 is true) and the decision stumps would then be "<0" >> for false and ">0" for true. >> >> Also, I believe I remember there's a bug with "+inf" "-inf" values (as may >> happen in feature output). I'd recommend replacing all -inf and +inf values with a >> real value larger than all other values. The weak learning algorithms will >> treat the largest (smallest) values as +inf (-inf). >> >> Try the default parameters for boosting and let me know if you need any >> more guidance on this topic. >> >>> 3) For the modification, are all the source codes in the SRC folder? >> >> Yes. There are some scripts in jboost-VERSION/scripts that are helpful in >> visualizing the output, but all the code you'll likely want to edit is in >> jboost-VERSION/src. >> >> >> >> Let me know if this answers your questions or if you have any other >> inquiries. >> >> Aaron > 
From: Gungor P. <pol...@Pr...> - 2008-08-12 22:57:50
|
Hi Aaron, Thank you very much for your detailed explanations. The major issue for me was how the current algorithm uses the weights. Is the implementation buggy, or the idea behind it? If the implementation is buggy, the thing I am curious about is the idea behind weighting the data in the current implementation. I am looking forward to good news on this thing. I also looked at the code and I am confused about the variable names. Is the distribution that is updated at each iteration of boosting also referred to as the weights? Thanks again, Best, Gungor Aaron Arvey wrote: > Hi Gungor, > > Glad to hear you're working with JBoost! > > See comments inline below. > > On Tue, 12 Aug 2008, Gungor Polatkan wrote: > >> 1) First question is about the weight input. The meaning (higher weight >> implies greater importance to classify correctly) is fundamentally >> important for us since it is the heart of our research project. How does >> the algorithm do that? Is there any paper related to this idea? or is it >> just a practical empirical method just by changing the initial >> distribution? Do you guys know anything about that? Any information >> about this thing will help me very much. Also what is the bug currently >> in the weighting? looking for the news... >> |weight| an initial weighting of the data (higher weight implies >> greater importance to classify correctly) THERE IS A BUG IN WEIGHTING IN >> MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional > > The bug in weighting has still not been fixed. All I know is that the > final output from data with weights is not as would be expected (I > verified this myself several months ago). The weighting itself is read > in correctly (from what I could tell by output), but the way it is > applied is somehow buggy. It is only applied in a couple locations, so > it is somewhat unnerving that it causes such abnormal behavior. 
> > There are many other ways to reweight your data other than using the > provided weight option. Depending how large and extreme your class > distributions are, you can just oversample the smaller class prior to > input to JBoost. Keep in mind that the first weak hypothesis is always > "Class is +1" so any lopsidedness in the data will be reweighted by > the score given to this classifier and the subsequent reweighting on > examples. > > NOTE: The "Class is +1" classifier will rebalance data that isn't > *too* skewed in class distribution. The fact that it doesn't balance > the classes when they are skewed is considered to be a small bug (I > verified this as well, around the same time I verified the weight > bug). However, if you oversample the data so that the classes aren't > *too* skewed (I've done 10:1 without problem), then sliding the score > for "Class is +1" should provide you with control over > sensitivity/specificity. > >> 2) Second question is about the weak learner Jboost use. Since my data >> features are Real Values (not binary or discrete but -inf to +inf Real >> Numbers), I think I should use decision stumps with real thresholds. >> Does the algorithm consider such a thing (for binary feature a simpler >> stump should be used and for real valued another one)? > > If you run JBoost with default boosting parameters, it will use > decision stumps for weak learners. Boolean values can be seen as a > subset of real values (-1 is false, +1 is true) and the decision > stumps would then be "<0" for false and ">0" for true. > > Also, I believe I remember there's a bug with "+inf" "-inf" values (as > may happen in feature output). I'd recommend replacing all -inf and +inf > values with a real value larger than all other values. The weak > learning algorithms will treat the largest (smallest) values as +inf > (-inf). > > Try the default parameters for boosting and let me know if you need > any more guidance on this topic. 
> >> 3)For the modification, are all the source codes in the SRC folder ? > > Yes. There are some scripts in jboost-VERSION/scripts that are helpful > in visualizing the output, but all the code you'll likely want to edit > is in jboost-VERSION/src. > > > > Let me know if this answers your questions or if you have any other > inquiries. > > Aaron |
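Aaron's suggestion above about ±inf feature values could be sketched as a preprocessing step like the following. This is a hypothetical helper, not part of JBoost: the per-column sentinels, the pad of 1.0, and the assumption that every column contains at least one finite value are all mine.

```python
import math

def clip_infinities(rows, pad=1.0):
    """Replace +/-inf feature values with finite sentinels just beyond the
    observed finite range of each column, before the data reaches JBoost.
    Assumes every column has at least one finite value."""
    out = [list(r) for r in rows]
    for c in range(len(rows[0])):
        finite = [r[c] for r in rows if math.isfinite(r[c])]
        lo, hi = min(finite) - pad, max(finite) + pad
        for r in out:
            if r[c] == math.inf:
                r[c] = hi   # a value larger than all other values
            elif r[c] == -math.inf:
                r[c] = lo   # symmetric sentinel for -inf
    return out
```

Since JBoost's weak learners treat the largest (smallest) values as +inf (-inf), the exact pad should not matter as long as the sentinels sit beyond the rest of the range.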
From: Aaron A. <aa...@cs...> - 2008-08-12 19:43:50
|
Hi Gungor, Glad to hear you're working with JBoost! See comments inline below. On Tue, 12 Aug 2008, Gungor Polatkan wrote: > 1) First question is about the weight input. The meaning(higher weight > implies greater importance to classify correctly) is fundamentally > important for us since it is the heart of our research project. How does > the algorithm do that? Is there any paper related to this idea? or is it > just a practical empirical method just by changing the initial > distribution? Do you guys know anything about that? Any information > about this thing will help me very much. Also what is the bug currently > in the weighting? looking for the news... > |weight| an initial weighting of the data (higher weight implies > greater importance to classify correctly) THERE IS A BUG IN WEIGHTING IN > MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional The bug in weighting has still not been fixed. All I know is that the final output from data with weights is not as would be expected (I verified this myself several months ago). The weighting itself is read in correctly (from what I could tell by output), but the way it is applied is somehow buggy. It is only applied in a couple locations, so it is somewhat unnerving that it causes such abnormal behavior. There are many other ways to reweight your data other than using the provided weight option. Depending how large and extreme your class distributions are, you can just oversample the smaller class prior to input to JBoost. Keep in mind that the first weak hypothesis is always "Class is +1" so any lopsidedness in the data will be reweighted by the score given to this classifier and the subsequent reweighting on examples. NOTE: The "Class is +1" classifier will rebalance data that isn't *too* skewed in class distribution. The fact that it doesn't balance the classes when they are skewed is considered to be a small bug (I verified this as well, around the same time I verified the weight bug). 
However, if you oversample the data so that the classes aren't *too* skewed (I've done 10:1 without problem), then sliding the score for "Class is +1" should provide you with control over sensitivity/specificity. > 2) Second question is about the weak learner Jboost use. Since my data > features are Real Values (not binary or discrete but -inf to +inf Real > Numbers), I think I should use decision stumps with real thresholds. > Does the algorithm consider such a thing (for binary feature a simpler > stump should be used and for real valued another one)? If you run JBoost with default boosting parameters, it will use decision stumps for weak learners. Boolean values can be seen as a subset of real values (-1 is false, +1 is true) and the decision stumps would then be "<0" for false and ">0" for true. Also, I believe I remember there's a bug with "+inf" "-inf" values (as may happen in feature output). I'd recommend replacing all -inf and +inf values with a real value larger than all other values. The weak learning algorithms will treat the largest (smallest) values as +inf (-inf). Try the default parameters for boosting and let me know if you need any more guidance on this topic. > 3) For the modification, are all the source codes in the SRC folder? Yes. There are some scripts in jboost-VERSION/scripts that are helpful in visualizing the output, but all the code you'll likely want to edit is in jboost-VERSION/src. Let me know if this answers your questions or if you have any other inquiries. Aaron 
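The oversampling workaround Aaron describes — duplicating minority-class examples so the classes aren't too skewed before writing the JBoost input files — might look roughly like this. It is only a sketch: the function name, the ±1 label convention, and the default 10:1 target ratio (taken from the email) are assumptions.

```python
import random

def oversample(examples, labels, max_ratio=10, seed=0):
    """Duplicate minority-class examples (labels are +1/-1) until the class
    ratio is at most max_ratio:1, so the data is not 'too' skewed."""
    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible
    pos = [e for e, y in zip(examples, labels) if y == +1]
    neg = [e for e, y in zip(examples, labels) if y == -1]
    small, big = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    small_label = +1 if small is pos else -1
    # Grow the minority class up to 1/max_ratio of the majority class.
    target = max(len(small), len(big) // max_ratio)
    extra = [rng.choice(small) for _ in range(target - len(small))]
    new_examples = big + small + extra
    new_labels = [-small_label] * len(big) + [small_label] * (len(small) + len(extra))
    return new_examples, new_labels
```

After rebalancing this way, the score on the initial "Class is +1" hypothesis can be slid to trade sensitivity against specificity, as described above.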
From: Gungor P. <pol...@Pr...> - 2008-08-12 19:15:12
|
Hi Everybody, I am working on a modification of boosting on genomic data. I have some questions on Jboost: 1) First question is about the weight input. The meaning (higher weight implies greater importance to classify correctly) is fundamentally important for us since it is the heart of our research project. How does the algorithm do that? Is there any paper related to this idea? or is it just a practical empirical method just by changing the initial distribution? Do you guys know anything about that? Any information about this thing will help me very much. Also what is the bug currently in the weighting? looking for the news... |weight| an initial weighting of the data (higher weight implies greater importance to classify correctly) THERE IS A BUG IN WEIGHTING IN MOST VERSIONS. MORE NEWS SOON. default = 1.0 Optional 2) Second question is about the weak learner Jboost use. Since my data features are Real Values (not binary or discrete but -inf to +inf Real Numbers), I think I should use decision stumps with real thresholds. Does the algorithm consider such a thing (for binary feature a simpler stump should be used and for real valued another one)? 3) For the modification, are all the source codes in the SRC folder? Best, Gungor 
From: Viren J. <vi...@MI...> - 2008-07-07 17:13:43
|
Ah ok, I misinterpreted the documentation. Thanks very much for the prompt reply! Best, Viren On Jul 7, 2008, at 1:07 PM, Aaron Arvey wrote: A simple misunderstanding. I round the score/margin to 15. The margin is 15, the score is -15. The score has the same sign as the label (-1), thus the margin is positive (correctly labeled). More documentation on the format can be found at http://jboost.sourceforge.net/doc.html#boost_format . Let me know if this clears the problem. Aaron On Mon, 7 Jul 2008, Viren Jain wrote: > Hello, I have trained a classifier that reaches 0 error on the > training set (according to the .info file) and have asked jboost to > produce .train.boosting.info output on the last iteration. However, > when I look at this logfile, my understanding of the format seems to > suggest that this logfile indicates many examples were incorrect. For > example, does > > 280: 15.07121: -15.072121: 0.000: 0.000: -1: > > indicate an incorrect answer (should have been -1, actual was > 15.07121, so margin was -15.072121?) > > Are there any known issues in this output file? > > Thanks. > > > ------------------------------------------------------------------------- > Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW! > Studies have shown that voting for your favorite open source project, > along with a healthy diet, reduces your potential for chronic lameness > and boredom. Vote Now at http://www.sourceforge.net/community/cca08 > _______________________________________________ > jboost-users mailing list > jbo...@li... > https://lists.sourceforge.net/lists/listinfo/jboost-users > |
From: Aaron A. <aa...@cs...> - 2008-07-07 17:07:16
|
A simple misunderstanding. I round the score/margin to 15. The margin is 15, the score is -15. The score has the same sign as the label (-1), thus the margin is positive (correctly labeled). More documentation on the format can be found at http://jboost.sourceforge.net/doc.html#boost_format. Let me know if this clears the problem. Aaron On Mon, 7 Jul 2008, Viren Jain wrote: > Hello, I have trained a classifier that reaches 0 error on the > training set (according to the .info file) and have asked jboost to > produce .train.boosting.info output on the last iteration. However, > when I look at this logfile, my understanding of the format seems to > suggest that this logfile indicates many examples were incorrect. For > example, does > > 280: 15.07121: -15.072121: 0.000: 0.000: -1: > > indicate an incorrect answer (should have been -1, actual was > 15.07121, so margin was -15.072121?) > > Are there any known issues in this output file? > > Thanks. > > > ------------------------------------------------------------------------- > Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW! > Studies have shown that voting for your favorite open source project, > along with a healthy diet, reduces your potential for chronic lameness > and boredom. Vote Now at http://www.sourceforge.net/community/cca08 > _______________________________________________ > jboost-users mailing list > jbo...@li... > https://lists.sourceforge.net/lists/listinfo/jboost-users > |
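The sign convention Aaron describes can be checked mechanically. Below is a minimal sketch of reading one .boosting.info line; only the index, margin, score, and trailing label fields are relied on (the meaning of the remaining columns is not assumed here — see the format documentation linked above).

```python
def parse_info_line(line):
    """Parse one line of a .boosting.info file, e.g.
    '280: 15.07121: -15.072121: 0.000: 0.000: -1:'
    Returns (index, margin, score, label)."""
    fields = [f.strip() for f in line.split(":") if f.strip()]
    return int(fields[0]), float(fields[1]), float(fields[2]), int(fields[-1])

def is_correct(margin):
    # The margin is positive exactly when the score has the same sign
    # as the label, i.e. the example is correctly classified.
    return margin > 0
```

For the line discussed in this thread, the score is negative, the label is -1, and the margin is therefore positive: a correctly labeled example despite the negative-looking score.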
From: Viren J. <vi...@MI...> - 2008-07-07 16:50:07
|
Hello, I have trained a classifier that reaches 0 error on the training set (according to the .info file) and have asked jboost to produce .train.boosting.info output on the last iteration. However, when I look at this logfile, my understanding of the format seems to suggest that this logfile indicates many examples were incorrect. For example, does 280: 15.07121: -15.072121: 0.000: 0.000: -1: indicate an incorrect answer (should have been -1, actual was 15.07121, so margin was -15.072121?) Are there any known issues in this output file? Thanks. |
From: Aaron A. <aa...@cs...> - 2008-07-02 17:02:26
|
I'm not sure if JBoost can squeeze everything into 8GB RAM. Try it out on 20% of the dataset and see if it fits in 2GB. Since Java has an overhead of about 200-400MB and JBoost has an overhead of less than 50MB, if 20% fits in 2GB then 100% should be able to squeeze into 8GB. However, depending on your use of perl, you may have had much more overhead than 450MB. Give it a shot and let me know if it doesn't fit. You'll also want to edit the file 'jboost', which is just a shell script, so that jboost is given more memory by java, e.g. changing exec java -Xmx3000M jboost.controller.Controller $@ to exec java -Xmx7500M jboost.controller.Controller $@ Aaron On Wed, 2 Jul 2008, Mauer, Dan wrote: > Well, I've already built an efficient AdaBoost implementation in Perl, > and as part of that effort I created a full "n-gram presence" table, > where each n-gram has an integer ID, and each document is represented > as a list of the integers which appear in that original document. So I > can easily treat the whole thing as unigrams. My main concern is that > the system handles the data storage very efficiently, and ideally > holding the entirety of the data in RAM (I'm using a 64-bit machine > with 8GB ram, so there's room for it)... would you expect Jboost would > handle this well? > > Thanks! > -d > > -----Original Message----- > From: Aaron Arvey [mailto:aa...@cs...] > Sent: Wednesday, July 02, 2008 12:01 PM > To: Mauer, Dan > Cc: jbo...@li... > Subject: Re: [Jboost-users] very-high-dimensional feature space? > > Hi Dan, > > You should be able to use the "text" feature of JBoost and pass in the > text itself. JBoost then has parsers for n-gram models. While there > exists code for n > 1 grams, it is currently disabled. However, n==1 > grams > are enabled in the CVS repository. 'ngramtype' (fixed, full, or > sparse) > and 'ngramsize' (1,2,3...) are mostly implemented, but not fully > finished > or integrated into JBoost (for unknown reasons). 
> > I'll look into some details on quick ways to tie the code together. In > > the mean time, you can look at the following: > jboost/src/jboost/examples/TextDescription.java > jboost/src/jboost/examples/ngram/*.java > jboost/src/jboost/examples/ngram/SparseNgram.java > > You can also look at the current hack for handling text features as > 1-grams: > http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2 > > You may be able to extend this hack to bigrams with little further > pain/suffering. While this is certainly not an ideal solution, it may > be > much faster. If you get stuck during code edits, feel free to post > specific questions about the code base to jboost-devel. > > Also, if you have an original document with n words, you can create a > document with n + (n-1) words that could use n==1 grams. Just put all > 2 > words together as one word in addition to their one word parts. I > didn't > say that very elegantly, but I'm sure you get the idea. This should in > > turn be stored fairly sparsely (let me know if it isn't). > > I'll get back to you later this week with a further update on text > processing options. If you don't hear back by Monday, just send a > reminder email. > > Aaron > > On Wed, 2 Jul 2008, Mauer, Dan wrote: > >> So I'm trying to figure out whether JBoost is workable for a research >> project, and I was wondering if anyone could help me out. >> >> What I'm doing is (more-or-less) a text categorization task, where my >> feature space consists of unigrams and bigrams (1-2 word terms) which >> either do or do not appear in each document in the collection. There >> are around 5,000,000 different n-grams. As you'd expect, though, the >> truth matrix of document-to-n-gram is extremely sparse - most > documents >> in the collection are on the order of 200 words apiece. 
>> >> I know boosting can work on this set; I've written my own AdaBoost >> implementation which, while it requires a boatload of RAM (5GB or > so), >> runs quite nicely. The results, though, aren't great, I'm sure due > to >> noise. So I wanted to try BrownBoost. But unless I'm reading the >> specs wrong, the input format for JBoost requires that each example > in >> the data set be given an explicit value for every feature. Clearly, >> this is infeasible for this set. So, I was wondering if there is a > way >> to specify the data set by explicitly giving only the TRUE-valued >> features for each example, with all other features being set to > FALSE. >> Or something similar. Or perhaps it's a feature that I could work on >> adding to JBoost, if it wouldn't require any major retooling (I only >> have funding for a week or so of work on this). >> >> Thanks, >> >> -Dan >> >> >> >> ______________________________ >> >> Daniel Mauer >> >> Software Systems Engineer, Sr. >> >> E547 Software Engineering >> >> The MITRE Corporation >> >> ______________________________ >> >> > |
From: Mauer, D. <dm...@mi...> - 2008-07-02 16:50:57
|
Well, I've already built an efficient AdaBoost implementation in Perl, and as part of that effort I created a full "n-gram presence" table, where each n-gram has an integer ID, and each document is represented as a list of the integers which appear in that original document. So I can easily treat the whole thing as unigrams. My main concern is that the system handles the data storage very efficiently, and ideally holding the entirety of the data in RAM (I'm using a 64-bit machine with 8GB ram, so there's room for it)... would you expect Jboost would handle this well? Thanks! -d -----Original Message----- From: Aaron Arvey [mailto:aa...@cs...] Sent: Wednesday, July 02, 2008 12:01 PM To: Mauer, Dan Cc: jbo...@li... Subject: Re: [Jboost-users] very-high-dimensional feature space? Hi Dan, You should be able to use the "text" feature of JBoost and pass in the text itself. JBoost then has parsers for n-gram models. While there exists code for n > 1 grams, it is currently disabled. However, n==1 grams are enabled in the CVS repository. 'ngramtype' (fixed, full, or sparse) and 'ngramsize' (1,2,3...) are mostly implemented, but not fully finished or integrated into JBoost (for unknown reasons). I'll look into some details on quick ways to tie the code together. In the mean time, you can look at the following: jboost/src/jboost/examples/TextDescription.java jboost/src/jboost/examples/ngram/*.java jboost/src/jboost/examples/ngram/SparseNgram.java You can also look at the current hack for handling text features as 1-grams: http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2 You may be able to extend this hack to bigrams with little further pain/suffering. While this is certainly not an ideal solution, it may be much faster. If you get stuck during code edits, feel free to post specific questions about the code base to jboost-devel. 
Also, if you have an original document with n words, you can create a document with n + (n-1) words that could use n==1 grams. Just put all 2 words together as one word in addition to their one word parts. I didn't say that very elegantly, but I'm sure you get the idea. This should in turn be stored fairly sparsely (let me know if it isn't). I'll get back to you later this week with a further update on text processing options. If you don't hear back by Monday, just send a reminder email. Aaron On Wed, 2 Jul 2008, Mauer, Dan wrote: > So I'm trying to figure out whether JBoost is workable for a research > project, and I was wondering if anyone could help me out. > > What I'm doing is (more-or-less) a text categorization task, where my > feature space consists of unigrams and bigrams (1-2 word terms) which > either do or do not appear in each document in the collection. There > are around 5,000,000 different n-grams. As you'd expect, though, the > truth matrix of document-to-n-gram is extremely sparse - most documents > in the collection are on the order of 200 words apiece. > > I know boosting can work on this set; I've written my own AdaBoost > implementation which, while it requires a boatload of RAM (5GB or so), > runs quite nicely. The results, though, aren't great, I'm sure due to > noise. So I wanted to try BrownBoost. But unless I'm reading the > specs wrong, the input format for JBoost requires that each example in > the data set be given an explicit value for every feature. Clearly, > this is infeasible for this set. So, I was wondering if there is a way > to specify the data set by explicitly giving only the TRUE-valued > features for each example, with all other features being set to FALSE. > Or something similar. Or perhaps it's a feature that I could work on > adding to JBoost, if it wouldn't require any major retooling (I only > have funding for a week or so of work on this). 
> > Thanks, > > -Dan > > > > ______________________________ > > Daniel Mauer > > Software Systems Engineer, Sr. > > E547 Software Engineering > > The MITRE Corporation > > ______________________________ > > |
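Dan's "n-gram presence" table, where each n-gram gets an integer ID and each document becomes a list of IDs, can be sketched roughly as follows. This is a hypothetical illustration translated into Java (his implementation is in Perl); the class and method names are not part of JBoost.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an integer-ID n-gram table: each distinct n-gram is assigned
// the next free integer ID on first sight, and a document is encoded as
// the sequence of IDs for its tokens.
public class NgramTable {
    private final Map<String, Integer> ids = new HashMap<>();

    // Return the existing ID for an n-gram, or assign the next free one.
    public int idFor(String ngram) {
        return ids.computeIfAbsent(ngram, k -> ids.size());
    }

    // Encode a tokenized document as the IDs of its tokens.
    public List<Integer> encode(String[] tokens) {
        List<Integer> doc = new ArrayList<>();
        for (String t : tokens) {
            doc.add(idFor(t));
        }
        return doc;
    }
}
```

Because each document stores only the IDs it actually contains, memory scales with total document length rather than with the 5,000,000-feature dense matrix.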
From: Aaron A. <aa...@cs...> - 2008-07-02 16:00:55
|
Hi Dan,

You should be able to use the "text" feature of JBoost and pass in the text itself. JBoost then has parsers for n-gram models. While there exists code for n > 1 grams, it is currently disabled; however, n == 1 grams are enabled in the CVS repository. 'ngramtype' (fixed, full, or sparse) and 'ngramsize' (1, 2, 3, ...) are mostly implemented, but not fully finished or integrated into JBoost (for unknown reasons). I'll look into some details on quick ways to tie the code together. In the meantime, you can look at the following:

jboost/src/jboost/examples/TextDescription.java
jboost/src/jboost/examples/ngram/*.java
jboost/src/jboost/examples/ngram/SparseNgram.java

You can also look at the current hack for handling text features as 1-grams:

http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2

You may be able to extend this hack to bigrams with little further pain/suffering. While this is certainly not an ideal solution, it may be much faster. If you get stuck during code edits, feel free to post specific questions about the code base to jboost-devel.

Also, if you have an original document with n words, you can create a document with n + (n - 1) words that can use n == 1 grams: in addition to the single words, append every pair of adjacent words glued together as one word. This should in turn be stored fairly sparsely (let me know if it isn't).

I'll get back to you later this week with a further update on text-processing options. If you don't hear back by Monday, just send a reminder email.

Aaron |
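Aaron's n + (n - 1) trick above, appending each adjacent word pair as a single glued token so the 1-gram path can see bigrams, might look like this. This is a hypothetical sketch, not JBoost code; the underscore separator is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Turn an n-word document into one with n + (n - 1) tokens: the original
// n unigrams, plus the n - 1 adjacent word pairs joined into single
// "words". A 1-gram model over the expanded document then effectively
// sees both unigram and bigram features.
public class BigramAsUnigram {
    public static List<String> expand(List<String> words) {
        List<String> out = new ArrayList<>(words);          // the n unigrams
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + "_" + words.get(i + 1)); // the n - 1 glued bigrams
        }
        return out;
    }
}
```

For example, a three-word document yields 3 + 2 = 5 tokens, and the representation stays sparse because only tokens that actually occur are emitted.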
From: Mauer, D. <dm...@mi...> - 2008-07-02 13:07:54
|
So I'm trying to figure out whether JBoost is workable for a research project, and I was wondering if anyone could help me out.

What I'm doing is (more or less) a text categorization task, where my feature space consists of unigrams and bigrams (1-2 word terms) which either do or do not appear in each document in the collection. There are around 5,000,000 different n-grams. As you'd expect, though, the truth matrix of document-to-n-gram is extremely sparse: most documents in the collection are on the order of 200 words apiece.

I know boosting can work on this set; I've written my own AdaBoost implementation which, while it requires a boatload of RAM (5 GB or so), runs quite nicely. The results, though, aren't great, I'm sure due to noise. So I wanted to try BrownBoost. But unless I'm reading the specs wrong, the input format for JBoost requires that each example in the data set be given an explicit value for every feature. Clearly, this is infeasible for this set. So I was wondering if there is a way to specify the data set by explicitly giving only the TRUE-valued features for each example, with all other features being set to FALSE. Or something similar. Or perhaps it's a feature that I could work on adding to JBoost, if it wouldn't require any major retooling (I only have funding for a week or so of work on this).

Thanks,

-Dan

______________________________
Daniel Mauer
Software Systems Engineer, Sr.
E547 Software Engineering
The MITRE Corporation
______________________________ |
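To see why the explicit-value format is infeasible at this scale, here is a sketch of the sparse-to-dense expansion Dan wants to avoid: only the indices of TRUE-valued features are given, and everything else defaults to FALSE. This is illustrative only, not JBoost's actual input handling.

```java
// Materialize a dense boolean feature row from a sparse list of TRUE
// indices. With ~5,000,000 features per example, allocating this full
// row for every document is exactly what makes the explicit format
// impractical for a sparse corpus.
public class SparseRow {
    public static boolean[] toDense(int numFeatures, int[] trueIndices) {
        boolean[] row = new boolean[numFeatures]; // all FALSE by default
        for (int i : trueIndices) {
            row[i] = true;
        }
        return row;
    }
}
```

A ~200-word document would set only a few hundred of those millions of entries to TRUE, so a sparse index-list representation is smaller by several orders of magnitude.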
From: Aaron A. <aa...@cs...> - 2008-02-24 19:36:26
|
Hi Filippo,

> Hi, I'm studying JBoost 1.4 this weekend on Windows XP. I succeeded in
> running the demos and my training set. The tree is successfully built.

Good start.

> Now I'd like to use the tree with Java, since it has the predictor just
> built in and I don't want to write a generic source in C to manage all
> types of input. But I cannot understand how to obtain the predictor.java
> from stem.java (for example).

Alright, so I looked through the errors and really don't understand what would cause them. Try the following:

$> ../jboost -j Predict.java -javaOutputClass Predict -javaOutputMethod predict -S stem -numRounds 5
$> javac Predict.java
$> java Predict < stem.train

If this doesn't work, then send me the exact three (give or take) commands you used to create, compile, and run the predict file. Once I can recreate the compile errors, I have a much better chance of fixing any existing bugs.

Aaron |