Random Forests

  • Vojtěch Rylko

    I would like to ask

    a) how are missing values handled?

    b) are continuous attributes fully supported (not only "discretized")?

    c) is there any important difference in the algorithm compared to the Breiman paper?


    Vojtěch R.

  • a)

    At training time, missing values are replaced at the last possible moment with
    the most common value (or the mean, in the case of continuous values) among
    the instances assigned to that sub-branch. (This "default" value is stored in
    the interior node.) At evaluation time, missing values are assumed to be the
    "default" value stored at the interior node.

    This approach is fast, but I have achieved better results by pre-processing
    data with an imputation algorithm to predict missing values. The NonlinearPCA
    algorithm (also included in Waffles) generally does a very good job for this
    purpose, and yields the best results I have found.
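
    (NonlinearPCA itself ships with Waffles; purely to illustrate the
    impute-then-train workflow, here is a stand-in sketch that uses scikit-learn's
    model-based IterativeImputer rather than NonlinearPCA.)

        # Fill missing entries by predicting them from the observed ones,
        # then train on the completed data.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.ensemble import RandomForestClassifier

        X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
        y = np.array([0, 1, 0, 1])

        X_filled = IterativeImputer(random_state=0).fit_transform(X)
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_filled, y)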


    For continuous attributes, I draw two patterns at random from the remaining
    data and use the average of their values for that attribute as a candidate
    threshold to split on. After 'k' such draws, I split on the candidate value
    that minimizes entropy. (This approach is very efficient, but it differs
    slightly from other implementations.)
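
    Roughly, that candidate-threshold procedure looks like the following
    (hypothetical code, not the actual Waffles source; it assumes the usual
    class-label entropy):

        # Draw two rows at random, average their values for this attribute to
        # get a candidate threshold; repeat k times and keep the threshold
        # whose split has the lowest weighted label entropy.
        import math
        import random

        def label_entropy(labels):
            n = len(labels)
            if n == 0:
                return 0.0
            return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                        for c in set(labels))

        def pick_threshold(values, labels, k=8):
            best_t, best_e = None, float("inf")
            for _ in range(k):
                a, b = random.sample(values, 2)      # two patterns drawn at random
                t = (a + b) / 2.0                    # average as candidate threshold
                left = [y for x, y in zip(values, labels) if x < t]
                right = [y for x, y in zip(values, labels) if x >= t]
                e = (len(left) * label_entropy(left)
                     + len(right) * label_entropy(right)) / len(labels)
                if e < best_e:
                    best_t, best_e = t, e
            return best_t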


    The code was originally written to conform to the description at http://en.wi

    If you find any substantial differences with the Breiman paper, please report
    it, and I will fix it as a bug (and update Wikipedia too).

    If you are interested in maximizing predictive accuracy, I might recommend
    trying a BMC ensemble instead of a bagging ensemble with your random trees. In
    my experimentation, this consistently outperforms random forest. Also, if you
    include meanmargins trees in your ensemble (also included in Waffles), this
    can improve the diversity within your ensemble and lead to better results with
    fewer trees.
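
    For anyone unfamiliar with it, BMC is Bayesian Model Combination. The sketch
    below shows one common formulation of the general idea, not Waffles'
    implementation: instead of weighting the trees uniformly as bagging does,
    sample many candidate weightings of the ensemble members, score each by its
    likelihood on held-out data, and average the weightings by their posterior
    weight.

        # `member_probs` holds each tree's predicted class probabilities on a
        # held-out set, shape (n_members, n_samples, n_classes); `y_val` holds
        # the true integer labels.
        import numpy as np

        def bmc_weights(member_probs, y_val, n_draws=100, alpha=1.0, seed=0):
            rng = np.random.default_rng(seed)
            n_members, n_samples, _ = member_probs.shape
            draws = rng.dirichlet([alpha] * n_members, size=n_draws)   # candidate weightings
            log_liks = np.empty(n_draws)
            for i, w in enumerate(draws):
                mix = np.tensordot(w, member_probs, axes=1)            # mixture of member predictions
                log_liks[i] = np.log(mix[np.arange(n_samples), y_val] + 1e-12).sum()
            post = np.exp(log_liks - log_liks.max())                   # posterior over the weightings
            post /= post.sum()
            return post @ draws                                        # posterior-averaged weighting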

  • Vojtěch Rylko

    Thanks for the answer.

    Is there any way to obtain an out-of-bag generalization error estimate?

    An estimate of the error rate can be obtained, based on the training data,
    by the following:

    1. At each bootstrap iteration, predict the data not in the bootstrap
       sample (what Breiman calls “out-of-bag”, or OOB, data) using the tree
       grown with the bootstrap sample.

    2. Aggregate the OOB predictions. (On average, each data point would be
       out-of-bag around 36% of the time, so aggregate these predictions.)
       Calculate the error rate, and call it the OOB estimate of the error rate.
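
    A rough sketch of that procedure (illustration only, using scikit-learn
    decision trees as the base learners; it assumes integer class labels
    0..n_classes-1):

        # Bootstrap, train a tree, predict the out-of-bag rows, aggregate votes,
        # and report the error rate on rows that were OOB at least once.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def oob_error(X, y, n_trees=100, seed=0):
            rng = np.random.default_rng(seed)
            n, n_classes = len(y), len(np.unique(y))
            votes = np.zeros((n, n_classes))
            for _ in range(n_trees):
                idx = rng.integers(0, n, size=n)              # bootstrap sample (with replacement)
                oob = np.setdiff1d(np.arange(n), idx)         # rows left out of the bootstrap sample
                if len(oob) == 0:
                    continue
                tree = DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])
                for row, pred in zip(oob, tree.predict(X[oob])):
                    votes[row, pred] += 1
            covered = votes.sum(axis=1) > 0                   # rows that were OOB at least once
            return np.mean(votes[covered].argmax(axis=1) != y[covered])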

  • Mike Gashler

    I use the following similar approach:

    1- Use cross-validation to estimate the accuracy of the model (meaning the
    whole ensemble).

    2- Train the model (meaning the whole ensemble) with all available data.

    The advantages of this approach include:

    1- It works with arbitrary models, not just bagged ensembles.

    2- The cross-validation step may be performed with an arbitrary number of
    repetitions to improve the accuracy of the prediction.

    If you can identify any reasons why Breiman's more specific approach is
    superior, I would be happy to implement it.
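
    A minimal sketch of this two-step approach, with scikit-learn's random forest
    standing in for the ensemble (illustration only, not Waffles code):

        # Step 1: estimate accuracy with cross-validation; step 2: retrain on all data.
        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        X, y = load_iris(return_X_y=True)
        model = RandomForestClassifier(n_estimators=100, random_state=0)

        scores = cross_val_score(model, X, y, cv=5)   # step 1: cross-validated accuracy estimate
        print("estimated accuracy:", scores.mean())
        model.fit(X, y)                               # step 2: train on all available data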

  • Vojtěch Rylko

    Could I ask for new functionality - a waffles_plot for random forests (like
    the one for decision trees)? Or may I contribute this?

  • Mike Gashler

    That would be a welcome contribution. Would it print every tree in the forest?
    I have added you as a developer so you may push changes into the Git
    repository if you like. If you need any help getting started, my e-mail
    address can be found at

  • Just out of curiosity -- what is a BMC ensemble?


  • Anonymous


    I'm still a little confused on this. Is BMC implemented in Waffles? I'm not
    crystal clear on which algorithm in http://waffles.sourceforge.net/command/le
    implements this.


  • Mike Gashler

    Uh oh, it looks like that documentation is out of date. (I am not sure how
    that happened.) I have updated it.

    Also, if you do

    waffles_learn usage

    it should show you up-to-date documentation, including "bmc".


