From: Haiyong X. <hai...@gm...> - 2007-08-15 14:40:53
Hi there,

I found that the current implementation of JBoost mainly focuses on the Alternating Decision Tree (ADT), which means the weak learners are similar to decision stumps. Is there any way to construct a boosting system based on another type of weak learner, say neural networks? Thanks.

-- Haiyong
From: Aaron A. <aa...@cs...> - 2007-08-17 22:16:22
Apologies for the late response. See inline comments.

On Wed, 15 Aug 2007, Haiyong Xu wrote:

> I found that the current implementation of JBoost mainly focuses on the
> Alternating Decision Tree (ADT), which means the weak learners are
> similar to decision stumps.

ADTrees are capable of using any weak learner. The nice thing about using decision stumps with ADTrees is that, since each path is a conjunction, ADTrees are able to form quite a rich representation with a limited base class of hypotheses. See the original paper for more details:

http://www.cse.ucsd.edu/~yfreund/papers/atrees.pdf

> Is there any way to construct a boosting system based on another type of
> weak learner, say neural networks? Thanks.

In JBoost, a weak learner (typically something similar to a decision stump) can be replaced by just about anything, including neural networks. If you wanted to implement a neural network weak learner, you could mimic the existing weak learners at ./src/jboost/learner/. The easiest weak learner to understand is InequalitySplitter.java (which is created by InequalitySplitterBuilder.java). A simple neural net implementation should be fairly straightforward in this framework.

Note that there are a variety of reasons why boosting neural nets may not be the best idea. The margin analysis given by Schapire et al. (1998, Annals of Statistics) gives generalization bounds in terms of the VC dimension of the base classifier; while these bounds are certainly quite loose, they do provide some intuition for why boosting may not overfit.

Also note that JBoost can boost decision trees with the command

  ./jboost ... -ATreeType ADD_ROOT_OR_SINGLES

You also have the other options:

  -ATreeType type   The type of ATree to create. There are several options:
                      ADD_ALL              Create a full ADTree (default).
                      ADD_ROOT             Add splits only at the root, producing a flat
                                           tree. This is equivalent to boosting decision
                                           stumps.
                      ADD_SINGLES          Create a decision tree.
                      ADD_ROOT_OR_SINGLES  Create a linear combination of decision trees.
                                           This is equivalent to simultaneously growing
                                           boosted decision trees.

An existing implementation of boosting with neural nets (and many other weak classifiers) exists in WEKA. Last time I checked, neural nets were under the name Multilayer Perceptron.

Aaron
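To make the "mimic the existing weak learners" suggestion a bit more concrete, here is a rough, self-contained sketch of what a stump-like weak learner and a single-neuron stand-in might look like. The interface and class names below are made up for illustration and are not JBoost's actual Splitter/SplitterBuilder API; a real implementation would mimic the classes under ./src/jboost/learner/ instead.

  // Hypothetical sketch: these names are NOT JBoost's real Splitter API.
  interface WeakLearner {
    // Confidence-rated prediction for one example (the sign gives the class).
    double predict(double[] features);
  }

  // A decision stump in the spirit of InequalitySplitter: threshold one feature.
  class StumpLearner implements WeakLearner {
    private final int featureIndex;
    private final double threshold;

    StumpLearner(int featureIndex, double threshold) {
      this.featureIndex = featureIndex;
      this.threshold = threshold;
    }

    public double predict(double[] features) {
      return features[featureIndex] <= threshold ? -1.0 : +1.0;
    }
  }

  // A single sigmoid neuron, standing in for a small neural-net weak learner.
  class NeuronLearner implements WeakLearner {
    private final double[] weights;
    private final double bias;

    NeuronLearner(double[] weights, double bias) {
      this.weights = weights;
      this.bias = bias;
    }

    public double predict(double[] features) {
      double activation = bias;
      for (int i = 0; i < weights.length; i++) {
        activation += weights[i] * features[i];
      }
      // Squash to (-1, +1) so it plugs into the same boosting machinery as a stump.
      return 2.0 / (1.0 + Math.exp(-activation)) - 1.0;
    }
  }

  class WeakLearnerDemo {
    public static void main(String[] args) {
      double[] x = {0.3, -1.2, 2.5};
      WeakLearner stump = new StumpLearner(2, 1.0);
      WeakLearner neuron = new NeuronLearner(new double[] {0.5, -0.1, 0.2}, 0.0);
      System.out.println("stump:  " + stump.predict(x));   // +1.0
      System.out.println("neuron: " + neuron.predict(x));  // about 0.37
    }
  }

The training half (choosing the threshold or the neuron weights from the weighted examples) would live in a companion builder, which is the role InequalitySplitterBuilder.java plays for InequalitySplitter.java.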
From: Aaron A. <aa...@cs...> - 2007-08-20 22:17:45
On Sat, 18 Aug 2007, Haiyong Xu wrote:

> Thanks for your reply.
>
> Currently, I am involved in a project using pattern recognition
> techniques, and we are building an ensemble system based on the AdaBoost
> method. JBoost is the only library I found which could be practically
> employed in our project. Therefore, I'd like to dig into the detailed
> implementation to see how to adapt it to our requirements.

Great!

> Through reading part of the source code, mainly AdaBoost.java,
> InequalitySplitterBuilder.java, and InstrumentedAlternatingTree.java, I
> found that when choosing a candidate splitter, AdaBoost.java always
> chooses the one that minimizes a loss criterion, which is defined as
> "2*sqrt(w_0 * w_1) - w_0 - w_1" in the function BinaryBag.getLoss().
> Actually, this minimization is not exactly the same as the one defined
> in Freund's original paper "The alternating decision tree learning
> algorithm" (1999) (see steps 1 and 2 in Figure 3).

That is absolutely correct. This is because the weak classifiers c1 and c2 are evaluated in a separate module and the booster is only informed of the (weighted) errors w.r.t. the classes. Using the notation of Freund & Mason 1999, the booster is only told of \sum W- and \sum W+. Then there are two prediction weights determined in two separate phases (one for "YES/TRUE" and one for "NO/FALSE").

This is reminiscent of where this loss function originally came from (Schapire & Singer, "Improved boosting algorithms using confidence-rated predictions", 1999), which might actually be a better reference for understanding much of the reasoning behind the ADTree paper (Freund & Mason 1999). While the loss function is not identical, the motivation is to use a loss which will best separate the classes, not necessarily give the lowest weighted error w.r.t. some potential function (which was the reasoning in the original AdaBoost paper).

You can also implement other loss functions. Two that are on the todo list (http://www.cs.ucsd.edu/~aarvey/jboost/todo.html) are weighted error and information gain. There are many other loss functions that can be used, though the Z-loss of Schapire & Singer '99 has some very nice properties.

> Also, I am still trying to figure out the connection between ADT and the
> AdaBoost algorithm. As we know, the final prediction in AdaBoost is
> f = a1*h1 + a2*h2 + ... + aT*hT, where a1 is a coefficient and h1 is a
> weak learner. In ADT, the final prediction has a similar form, as the
> sum of prediction values. Here, I suppose the prediction value in ADT is
> equivalent to a1*h1 in AdaBoost.

Not exactly. I think the best way to understand the evaluation process of ADTrees is by example. There is an example provided on Wikipedia at http://en.wikipedia.org/wiki/Alternating_decision_tree. There is another example in http://www.cse.ucsd.edu/~yfreund/papers/final-geneclass.pdf.

The major difference in ADTrees is that h_{t} may depend on the evaluation of h_{1}...h_{t-1}. Namely, each path is a conjunction of weak hypotheses. For instance, not all examples are evaluated by the t^{th} hypothesis; only those examples that make it to the decision node of the t^{th} hypothesis are actually evaluated. This allows very weak classes of hypotheses (e.g. decision stumps) to create fairly complex hypotheses (e.g. decision rectangles) without the large number of iterations needed to create the same classifier with standard boosting. While this may lead to overfitting, it does not seem to do so empirically.
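To make the getLoss criterion above a bit more concrete, here is a small sketch (assumed notation, not JBoost's actual BinaryBag code). One way to read the criterion: a block of examples with positive weight w_0 and negative weight w_1 contributes w_0 + w_1 to the Schapire & Singer normalizer Z if no prediction is made on it, and 2*sqrt(w_0*w_1) if the optimal confidence-rated prediction is made, so the quantity being minimized is the change in Z. The prediction value itself is the smoothed 0.5*ln((W+ + eps)/(W- + eps)) of Freund & Mason 1999.

  // Sketch only, not JBoost's BinaryBag: candidate splits are scored by the
  // change in the Schapire & Singer normalizer Z for a block of examples with
  // total positive weight wPlus and total negative weight wMinus.
  class ZLossSketch {
    // 2*sqrt(w0*w1) - w0 - w1: always <= 0, and most negative for the split
    // that best separates the two classes.
    static double getLoss(double wPlus, double wMinus) {
      return 2.0 * Math.sqrt(wPlus * wMinus) - wPlus - wMinus;
    }

    // Smoothed prediction value attached to a node, as in Freund & Mason 1999.
    static double predictionValue(double wPlus, double wMinus, double epsilon) {
      return 0.5 * Math.log((wPlus + epsilon) / (wMinus + epsilon));
    }

    public static void main(String[] args) {
      // Made-up weights for the examples reaching one side of a candidate split.
      double wPlus = 0.30, wMinus = 0.05;
      System.out.println("loss  = " + getLoss(wPlus, wMinus));               // about -0.11
      System.out.println("alpha = " + predictionValue(wPlus, wMinus, 0.01)); // about 0.82
    }
  }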
> The margin-based generalization bound analysis proposed by Schapire is a
> very nice thing. And in recent years, many researchers have put effort
> into algorithms that maximize the margin. See papers:
>
> "Soft Margins for AdaBoost", Ratsch, 17th Annual Conference on
> Computational Learning Theory, 2004.
>
> "Boosting Based on a Smooth Margin", Rudin, 17th Annual Conference on
> Computational Learning Theory, 2004.
>
> "Analysis of Boosting Algorithms Using the Smooth Margin Function",
> Rudin, submitted to the Annals of Statistics, 2007.
>
> And I would like to make use of those margin-maximizing algorithms to
> build an AdaBoost ensemble system. At this time, I am thinking about
> implementing it on the basis of the JBoost architecture.

I have actually not read these three particular papers (they're certainly on my stack). However, I know that a pitfall of some "margin maximizing" algorithms is that they maximize the *minimal* margin, not the margin distribution (which is perhaps somewhat ill-defined as a maximization target). Also keep in mind that while margin maximization is very important, it may not be the only explanation for why AdaBoost generalizes so well (see Reyzin & Schapire 06).

Also, maximizing the minimum margin may allow noisy data to severely impair the learning process. I believe that the "Soft Margins..." (Ratsch) paper discusses noise-resistant boosting. While JBoost doesn't have the exact algorithm of Ratsch, BrownBoost (use the CVS version; there may be small bugs in version 1.3.1) can be used to be noise resistant. This works by setting the exact amount of noise you want to allow, which can be tweaked quickly using the -potential flag. You can then visualize the creation of the margin distribution using the surfing.py script (again, use the CVS version). This may not be exactly what you need, but it may give some intuition as to what is happening in the "boosting the margin" interpretation of boosting.

If you wish to implement a "margin maximizing" booster, I'd recommend starting with one of the simpler methods, e.g. LPBoost. Also, since some of these methods require keeping all previous hypotheses available, I'd recommend keeping the data structures (matrices, etc.) in the booster class. See the (inefficient, though correct) BrownBoost implementation for an example of how to keep past hypotheses stored in the booster class.

> Thank you for your maintenance of the open source JBoost project, and I
> hope we can cooperate on it.

I'm glad to hear that you're interested in contributing! I'm not sure if you've contributed to open source projects in the past; either way, I'd recommend reading http://www.cs.ucsd.edu/~aarvey/jboost/contribute.html (I just added some more content to it) and some of the links.

Cheers,
Aaron
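As a rough sketch of the margin-distribution bookkeeping mentioned above (illustrative only, not JBoost's actual code): with the past hypotheses h_t and coefficients a_t kept in the booster, the normalized margin of example i is y_i * sum_t a_t*h_t(x_i) / sum_t |a_t|, and the "margin distribution" is just this quantity over the whole training set.

  import java.util.Arrays;
  import java.util.List;
  import java.util.function.ToDoubleFunction;

  // Illustrative only; not JBoost's booster classes or the surfing.py script.
  class MarginSketch {
    // margins[i] = y_i * sum_t alpha_t * h_t(x_i) / sum_t |alpha_t|,
    // where each h_t maps an example to a prediction in [-1, +1].
    static double[] margins(double[][] examples, int[] labels,
                            List<ToDoubleFunction<double[]>> hypotheses,
                            double[] alphas) {
      double norm = 0.0;
      for (double a : alphas) norm += Math.abs(a);
      double[] result = new double[examples.length];
      for (int i = 0; i < examples.length; i++) {
        double f = 0.0;
        for (int t = 0; t < hypotheses.size(); t++) {
          f += alphas[t] * hypotheses.get(t).applyAsDouble(examples[i]);
        }
        result[i] = labels[i] * f / norm;
      }
      return result;
    }

    public static void main(String[] args) {
      // Two toy stumps on a single feature, with made-up coefficients.
      List<ToDoubleFunction<double[]>> hs = Arrays.asList(
          x -> x[0] > 0.5 ? 1.0 : -1.0,
          x -> x[0] > 1.5 ? 1.0 : -1.0);
      double[] alphas = {0.7, 0.4};
      double[][] xs = {{0.2}, {1.0}, {2.0}};
      int[] ys = {-1, +1, +1};
      System.out.println(Arrays.toString(margins(xs, ys, hs, alphas)));
      // Prints roughly [1.0, 0.27, 1.0]: all correct, but the middle example
      // is classified with much less confidence.
    }
  }

Margins near +1 are confidently correct predictions and negative margins are mistakes; a noise-tolerant method like BrownBoost deliberately gives up on forcing every margin positive, which is roughly the trade-off the -potential setting controls.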