From: Aaron A. <aa...@cs...> - 2007-10-25 03:50:33
On Wed, 24 Oct 2007, David Rolland wrote:

> I'm starting a new thread because I never received your reply through my
> mail, so I can't reply to the thread.

Huh... I'll see if that's a problem on my end...

> I just downloaded version 1.4, I'll take a look at it really soon. I'm
> still in the process of evaluating JBoost to see if it fits my needs. So
> any change to the output file format doesn't affect me right now. The
> GUI will be a very interesting tool, even though I already made my own
> batch files to call JBoost.

The GUI didn't make the 1.4 cut :-(. It'll be in 1.4.1 or 1.4.2. I can
send you a version with the GUI if you'd like.

> Your explanation of score/margin is clear, but there is still something
> that I don't understand. The graphic for the margin has "Margin" and
> "Cumulative Distribution" as axis labels. In this graphic, the margin
> values are between -1 and +1. This means that the values have been
> remapped from the formula you gave me (margin = score * label). Am I
> right? This remap was actually the purpose of my initial question, but
> I wasn't clear on that point.

Yes, the margin is remapped into the interval [-1,+1]. An unnormalized
margin doesn't really mean much (and all theorems have been proven for
the normalized margin), so we normalize it.

> I began using AdaBoost for image analysis (mainly OCR) where I was
> working, but seeing its interesting performance, I want to try it in
> other domains. I'll give you feedback if it works well. From my previous
> experience with OCR, AdaBoost isn't magical, the input data must be
> chosen carefully and right now I'm working on the input before really
> using JBoost.

All statistics and machine learning algorithms are only as good as their
input. One nice thing about AdaBoost (in contrast to other learning
algorithms) is that feature selection is less important. Thus, you can
throw as many features as you can imagine at the algorithm and let it
figure out which ones are relevant.
At some point in the future we're planning on adding cascaded predictors
similar to Viola & Jones. This may be relevant to your work.

Aaron

> Hi David,
>
> See responses. Also know that for release 1.4 (coming out sometime this
> week most likely) we've completely changed the format of the output files
> and post-processing. We are also in the process of porting all
> Python/Perl code to Java. We are also developing a GUI to make the whole
> process more intuitive, which will probably be released in the next month.
> That being said...
>
> On Mon, 15 Oct 2007, David Rolland wrote:
>
>> First, I tested the visualization tools to make sure I had the same
>> results as shown on
>> http://www.cs.ucsd.edu/~aarvey/jboost/doc.html#visualization. I had to
>> modify the atree2dot2ps.pl script because it does not use the "--dir"
>> option everywhere in the script. Is this the expected behavior?
>
> You're right. The $dirname variable is only used for the $infofilename,
> not the $filename. I've committed the change to CVS. See
> http://sourceforge.net/cvs/?group_id=195659 for details on how to get CVS
> access.
>
>> Second, I was not able to reproduce the margin output. My problem is
>> that I run JBoost on Windows XP and the margin script uses 'cat' and
>> other Unix commands.
>
> Yeah... the scripts were written assuming a few other things too...
> Really, they were written as bandages, not final solutions. We're
> currently working on porting all of this to Java so that we have more
> interoperability.
>
>> I am a C/C++ programmer, I don't know Perl nor Python.
>
> Perl is pretty out of style these days, though it is amazing what some of
> that gross syntax can do. Python has survived a couple of fads (ruby, etc)
> and still seems to me the best widespread scripting language. It may be
> worth a look, even if not for this project.
>
>> Even though these languages are similar to C I still wonder how the
>> files spambase.train.margin and spambase.train.scores can be processed
>> to output the margin graphic.
>
> We use gnuplot as an intermediary...
>
>> And actually, what's the difference between these two files?
>
> One has the "score" of the example, the other has the "margin". The score
> is defined as the value predicted for a given example. The margin is
>
>   if (predicted label of example is correct)
>     return |score|
>   else
>     return -|score|
>
> where |x| is the absolute value of x.
>
>> There only seems to be some positive/negative changes of values. I don't
>> expect you to rewrite margin.py without taking advantage of Unix
>> commands, but if you explain how I can get from the data files to the
>> output graphic, I'll code myself a Windows equivalent.
>
> I'd say the best bet for the time being is to grab cygwin, where
> everything has been tested and *seemed* to work peachy. The second
> best bet would probably be to wait till Wednesday/Thursday for the next
> release (when all of your changes would probably be rendered somewhat moot
> anyways). Third best bet is to edit the code.
>
> The only place where UNIX commands are used in the 1.3.1 margin.py is
> lines 244--255. That is where a label file is created. All labels in
> JBoost are converted into binary values +1, -1. If you have more than two
> labels, just wait till later this week. If you have two labels, then just
> figure out which is mapped to "+1" and which to "-1" (should be fairly
> straightforward). Create the labels file, which is just a series of 1,
> -1 values that is read in. The formula for margin is (line 168)
>
>   margin = score * label
>
> This is identical to what I state above for when +1,-1 are used. So if
> you write a script to create a label file, you can specify the file on the
> command line (via --labels=...), and all your problems should be solved.
>
> Let me know if you have any other questions/comments.
>
> Also, out of curiosity, for what classification task are you using JBoost?
>
> Aaron
>
> _______________________________________________
> jboost-users mailing list
> jbo...@li...
> https://lists.sourceforge.net/lists/listinfo/jboost-users
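
For anyone who wants to reproduce the margin plot without the Unix-dependent script, the relationship between score, label, and margin discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in margin.py: all function names are hypothetical, the scores and labels are assumed to be already loaded into lists (the on-disk layout of the JBoost files is not described here), and the normalization constant is passed in explicitly because the thread does not say how JBoost computes it. In the boosting literature the usual choice is the sum of the absolute weights of the weak hypotheses.

```python
from bisect import bisect_right


def to_binary_labels(raw_labels, positive):
    """Map raw class names to +1/-1; `positive` is the class mapped to +1."""
    return [1 if lab == positive else -1 for lab in raw_labels]


def margin_by_sign(score, label):
    """Margin as |score| if the sign of the score matches the label, else -|score|."""
    predicted = 1 if score >= 0 else -1  # assumption: a score of 0 predicts +1
    return abs(score) if predicted == label else -abs(score)


def margins(scores, labels):
    """Margin as score * label, valid when labels are +1 or -1."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return [s * y for s, y in zip(scores, labels)]


def normalize(values, scale):
    """Divide by a positive scale (assumed known) to map margins into [-1, +1]."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return [v / scale for v in values]


def cumulative_distribution(values):
    """Return (margin, fraction of examples with margin <= that value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


if __name__ == "__main__":
    labels = to_binary_labels(["spam", "ham", "spam"], positive="spam")
    raw = margins([0.5, -2.0, -1.0], labels)
    for value, fraction in cumulative_distribution(normalize(raw, 2.0)):
        print(value, fraction)
```

The pairs returned by `cumulative_distribution` are the x ("Margin") and y ("Cumulative Distribution") values of the plot, so they can be written to a two-column text file and handed to gnuplot or any other plotting tool.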