Random Forests

Help forum, 2012-02-12 to 2012-09-14
  • Vojtěch Rylko

    Vojtěch Rylko - 2012-02-12

    I would like to ask

    a) how are missing values handled

    b) if continuous attributes are fully supported (not only "discretized")

    c) is there any important difference in the algorithm compared to the Breiman
    paper?

    Thanks,

    Vojtěch R.

     
  • Nobody/Anonymous

    a)

    At training-time, missing values are replaced at the last possible moment with
    the most-common value (or the mean, in the case of continuous values) among
    the instances assigned to that sub-branch. (This "default" value is stored in
    the interior node.) At evaluation-time, missing values are assumed to be the
    "default" value stored at the interior node.
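
    The default-value strategy described above might be sketched as follows. This
    is a hypothetical Python illustration, not the actual Waffles C++ code; `None`
    stands in for a missing value, and the function names are invented for the
    example.

```python
from collections import Counter

def default_value(column, is_continuous):
    """Compute the fallback a node would store: the mean for continuous
    attributes, the most-common value for nominal ones."""
    known = [v for v in column if v is not None]
    if is_continuous:
        return sum(known) / len(known)  # mean of the known values
    return Counter(known).most_common(1)[0][0]  # mode of the known values

def fill_missing(column, is_continuous):
    """Replace missing (None) entries with the node's default value, as
    would happen for the instances assigned to a sub-branch."""
    d = default_value(column, is_continuous)
    return [d if v is None else v for v in column]
```

    For example, `fill_missing([1.0, None, 3.0], True)` fills the gap with the
    mean of the known values.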

    This approach is fast, but I have achieved better results by pre-processing
    data with an imputation algorithm to predict missing values. The NonlinearPCA
    algorithm (also included in Waffles) generally does a very good job for this
    purpose, and yields the best results I have found.

    b)

    For continuous attributes, I draw two patterns at random from the remaining
    data and compute the average of their values as a candidate value to split
    on. After 'k' draws, I divide on the candidate value that minimizes entropy.
    (This approach is very efficient, but it differs slightly from other
    implementations.)
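
    That candidate-drawing procedure might look roughly like this. This is a
    hypothetical Python sketch, not the Waffles implementation; the function
    names and the default `k=8` are invented for illustration.

```python
import math
import random

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split_entropy(values, labels, threshold):
    """Weighted entropy of the two branches produced by splitting
    a continuous attribute at the given threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    total = 0.0
    for part in (left, right):
        if part:
            total += len(part) / n * entropy(part)
    return total

def pick_threshold(values, labels, k=8, rng=random):
    """Draw k candidate thresholds, each the average of two randomly
    drawn attribute values, and keep the one with the lowest
    post-split entropy."""
    best_t, best_e = None, float('inf')
    for _ in range(k):
        a, b = rng.choice(values), rng.choice(values)
        t = (a + b) / 2.0
        e = split_entropy(values, labels, t)
        if e < best_e:
            best_t, best_e = t, e
    return best_t
```

    Since each candidate is the average of two observed values, the chosen
    threshold always lies inside the observed range, and the weighted entropy
    after splitting can never exceed the entropy of the unsplit labels.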

    c)

    The code was originally written to conform to the description at
    http://en.wikipedia.org/wiki/Random_forest.

    If you find any substantial differences from the Breiman paper, please report
    them, and I will fix them as bugs (and update Wikipedia too).

    If you are interested in maximizing predictive accuracy, I might recommend
    trying a BMC ensemble instead of a bagging ensemble with your random trees. In
    my experimentation, this consistently outperforms random forest. Also, if you
    include meanmargins trees in your ensemble (also included in Waffles), this
    can improve the diversity within your ensemble and lead to better results with
    fewer trees.

     
  • Vojtěch Rylko

    Vojtěch Rylko - 2012-02-15

    Thanks for the answer.

    Is there any way to obtain an out-of-bag generalization error estimate?

    An estimate of the error rate can be obtained, based on the training data, as
    follows:

    1. At each bootstrap iteration, predict the data not in the bootstrap sample
       (what Breiman calls “out-of-bag”, or OOB, data) using the tree grown with
       the bootstrap sample.

    2. Aggregate the OOB predictions. (On average, each data point will be
       out-of-bag around 36% of the time, so aggregate these predictions.)
       Calculate the error rate, and call it the OOB estimate of the error rate.
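
    The OOB bookkeeping described above might be sketched like this. This is a
    hypothetical Python illustration, not Waffles code; a trivial majority-class
    "stump" stands in for a real random tree, since only the bootstrap/OOB loop
    is the point here.

```python
import random
from collections import Counter, defaultdict

class MajorityStump:
    """Toy stand-in for a tree: predicts the majority class of its
    bootstrap sample. A real random forest would grow a random tree."""
    def fit(self, X, y):
        self.pred = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, x):
        return self.pred

def oob_error(X, y, n_trees=25, rng=random):
    """Estimate error from out-of-bag predictions, as in the quoted
    procedure. Each row is left out of a bootstrap sample with
    probability (1 - 1/n)^n, which approaches 1/e (about 36.8%)."""
    n = len(X)
    votes = defaultdict(list)  # row index -> OOB predictions for that row
    for _ in range(n_trees):
        # Step 1: draw a bootstrap sample; predict the rows left out of it.
        idx = [rng.randrange(n) for _ in range(n)]
        oob = set(range(n)) - set(idx)
        model = MajorityStump().fit([X[i] for i in idx], [y[i] for i in idx])
        for i in oob:
            votes[i].append(model.predict(X[i]))
    # Step 2: aggregate OOB predictions by majority vote; compute error rate.
    wrong = sum(1 for i, v in votes.items()
                if Counter(v).most_common(1)[0][0] != y[i])
    return wrong / max(len(votes), 1)
```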

     
  • Mike Gashler

    Mike Gashler - 2012-02-15

    I use the following similar approach:

    1- Use cross-validation to estimate the accuracy of the model (meaning the
    whole ensemble).

    2- Train the model (meaning the whole ensemble) with all available data.

    The advantages of this approach include:

    1- It works with arbitrary models, not just bagged ensembles.

    2- The cross-validation step may be performed with an arbitrary number of
    repetitions to improve the accuracy of the prediction.

    If you can identify any reasons why Breiman's more specific approach is
    superior, I would be happy to implement it.
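
    In outline, those two steps might look like this. This is a hypothetical
    Python sketch of the approach, not the Waffles implementation; the function
    names are invented, and a toy majority-class learner stands in for the whole
    ensemble.

```python
import random
from collections import Counter

def cross_val_error(X, y, train_fn, predict_fn, folds=5, reps=3, rng=random):
    """Step 1: estimate the error of the whole model by repeated
    k-fold cross-validation; more reps sharpen the estimate."""
    n = len(X)
    errors = []
    for _ in range(reps):
        order = list(range(n))
        rng.shuffle(order)
        for f in range(folds):
            test = set(order[f::folds])  # every folds-th shuffled index
            train = [i for i in range(n) if i not in test]
            model = train_fn([X[i] for i in train], [y[i] for i in train])
            wrong = sum(1 for i in test if predict_fn(model, X[i]) != y[i])
            errors.append(wrong / max(len(test), 1))
    return sum(errors) / len(errors)

# Toy learner standing in for the ensemble: predict the majority class.
def train_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model
```

    Step 2 is then simply `final_model = train_fn(X, y)`: train once on all
    available data, and report the cross-validation figure as the error estimate.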

     
  • Vojtěch Rylko

    Vojtěch Rylko - 2012-02-21

    Can I ask for new functionality: a waffles_plot for random forests (as for
    decision trees)? Or may I contribute this?

     
  • Mike Gashler

    Mike Gashler - 2012-02-21

    That would be a welcome contribution. Would it print every tree in the forest?
    I have added you as a developer so you may push changes into the Git
    repository if you like. If you need any help getting started, my e-mail
    address can be found at
    http://waffles.sourceforge.net/.

     
  • Nobody/Anonymous

    Just out of curiosity -- what is a BMC ensemble?

     
  • Anonymous

    Anonymous - 2012-06-22

    Mike,

    I'm still a little confused on this. Is BMC implemented in Waffles? I'm not
    crystal clear on which algorithm at
    http://waffles.sourceforge.net/command/learn.html implements this.

    Thanks!

     
  • Mike Gashler

    Mike Gashler - 2012-06-22

    Uh oh, it looks like that documentation is out of date. (I am not sure how
    that happened.) I have updated it.

    Also, if you do

    waffles_learn usage

    it should show you up-to-date documentation, including "bmc".

     
