I would like to ask:
a) how are missing values handled?
b) are continuous attributes fully supported (not only "discretized")?
c) is there any important difference in the algorithm compared to the Breiman
paper?
Thanks,
Vojtěch R.
a)
At training time, missing values are replaced at the last possible moment with
the most-common value (or the mean, in the case of continuous values) among
the instances assigned to that sub-branch. (This "default" value is stored in
the interior node.) At evaluation time, missing values are assumed to be the
"default" value stored at the interior node.
This approach is fast, but I have achieved better results by pre-processing
data with an imputation algorithm to predict missing values. The NonlinearPCA
algorithm (also included in Waffles) generally does a very good job for this
purpose, and yields the best results I have found.
b)
For continuous attributes, I draw two patterns at random from the remaining
data, and compute the average of their values as a candidate value to split
on. After 'k' draws, I divide on the candidate value that minimizes entropy.
(This approach is very efficient, but it differs slightly from other
implementations.)
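In Python terms, the procedure is something like this sketch (again just an
illustration of the described technique, not the Waffles source; the function
names are my own):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr, threshold):
    left = [y for r, y in zip(rows, labels) if r[attr] < threshold]
    right = [y for r, y in zip(rows, labels) if r[attr] >= threshold]
    if not left or not right:
        return float("inf")            # degenerate split, never chosen
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def pick_threshold(rows, labels, attr, k, rng=random.Random(0)):
    """Draw two patterns at random and average their attribute values to form
    a candidate threshold; after k draws, keep the candidate whose split has
    the lowest weighted entropy."""
    best_t, best_e = None, float("inf")
    for _ in range(k):
        a, b = rng.choice(rows), rng.choice(rows)
        t = (a[attr] + b[attr]) / 2.0
        e = split_entropy(rows, labels, attr, t)
        if e < best_e:
            best_t, best_e = t, e
    return best_t
```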
c)
The code was originally written to conform to the description at
http://en.wikipedia.org/wiki/Random_forest.
If you find any substantial differences from the Breiman paper, please report
them, and I will fix it as a bug (and update Wikipedia too).
If you are interested in maximizing predictive accuracy, I might recommend
trying a BMC ensemble instead of a bagging ensemble with your random trees. In
my experimentation, this consistently outperforms random forest. Also, if you
include meanmargins trees in your ensemble (also included in Waffles), this
can improve the diversity within your ensemble and lead to better results with
fewer trees.
Thanks for the answer.
Is there any way to obtain an out-of-bag generalization error estimate?
An estimate of the error rate can be obtained, based on the training data,
by the following:
1. At each bootstrap iteration, predict the data not in the bootstrap sample
(what Breiman calls "out-of-bag", or OOB, data) using the tree grown with the
bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point would be
out-of-bag around 36% of the times, so aggregate these predictions.)
Calculate the error rate, and call it the OOB estimate of error rate.
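In code terms, the quoted procedure amounts to something like this minimal
Python sketch (train_tree and predict are hypothetical placeholders for
whatever tree learner is used):

```python
import random
from collections import Counter

def oob_error(rows, labels, train_tree, predict, n_trees,
              rng=random.Random(0)):
    """Each tree votes on the rows left out of its bootstrap sample; the
    majority OOB vote per row is compared against the true label."""
    n = len(rows)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        tree = train_tree([rows[i] for i in in_bag],
                          [labels[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):              # out-of-bag rows
            votes[i][predict(tree, rows[i])] += 1
    scored = [(v.most_common(1)[0][0], labels[i])
              for i, v in enumerate(votes) if v]
    return sum(pred != truth for pred, truth in scored) / len(scored)
```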
I use the following similar approach:
1- Use cross-validation to estimate the accuracy of the model (meaning the
whole ensemble).
2- Train the model (meaning the whole ensemble) with all available data.
The advantages of this approach include:
1- It works with arbitrary models, not just bagged ensembles.
2- The cross-validation step may be performed with an arbitrary number of
repetitions to improve the accuracy of the estimate.
If you can identify any reasons why Breiman's more specific approach is
superior, I would be happy to implement it.
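For concreteness, the two steps above might look like this (a minimal Python
sketch assuming a model object with fit/predict methods; this is not the
Waffles API):

```python
import random

def estimate_then_train(rows, labels, make_model, folds=10, reps=5,
                        rng=random.Random(0)):
    """Step 1: repeated cross-validation to estimate the whole ensemble's
    accuracy. Step 2: retrain the same kind of model on all available data."""
    scores = []
    for _ in range(reps):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        for f in range(folds):
            test = set(idx[f::folds])
            model = make_model()
            model.fit([rows[i] for i in idx if i not in test],
                      [labels[i] for i in idx if i not in test])
            correct = sum(model.predict(rows[i]) == labels[i] for i in test)
            scores.append(correct / len(test))
    final = make_model()
    final.fit(rows, labels)                     # step 2: train on everything
    return final, sum(scores) / len(scores)     # model + accuracy estimate
```

Note that nothing here assumes bagging, which is why the approach works with
arbitrary models.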
I also use your approach with cross-validation.
Breiman uses this for variable importance, proximities, etc.:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
Can I ask for new functionality: a waffles_plot for random forests (like
there is for decision trees)? Or may I contribute this?
That would be a welcome contribution. Would it print every tree in the forest?
I have added you as a developer so you may push changes into the Git
repository if you like. If you need any help getting started, my e-mail
address can be found at
http://waffles.sourceforge.net/.
Just out of curiosity -- what is a BMC ensemble?
BMC expands to Bayesian Model Combination. It is an ensemble technique that
uses statistics to combine models more effectively than bagging. Here's the
Wikipedia entry about it:
http://en.wikipedia.org/wiki/Ensemble_learning#Bayesian_model_combination
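As I read that description, the rough idea is to sample many candidate
weightings of the models, score each weighting by its likelihood on held-out
data, and average the weightings by their posterior mass. A toy Python
sketch (my own illustration, not the Waffles implementation):

```python
import math
import random

def bmc_weights(true_class_probs, n_samples=1000, rng=random.Random(0)):
    """true_class_probs[m][i]: probability model m assigns to the correct
    class of held-out row i. Sample candidate model weightings from a flat
    Dirichlet, score each by its log-likelihood on the held-out rows, and
    return the posterior-weighted average weighting."""
    M, n = len(true_class_probs), len(true_class_probs[0])
    candidates = []
    for _ in range(n_samples):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(M)]  # ~ Dirichlet(1s)
        w = [x / sum(g) for x in g]
        ll = sum(math.log(max(sum(w[m] * true_class_probs[m][i]
                                  for m in range(M)), 1e-12))
                 for i in range(n))
        candidates.append((w, ll))
    top = max(ll for _, ll in candidates)        # stabilize the exponentials
    post = [math.exp(ll - top) for _, ll in candidates]
    z = sum(post)
    return [sum(p * w[m] for (w, _), p in zip(candidates, post)) / z
            for m in range(M)]
```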
Mike,
I'm still a little confused on this. Is BMC implemented in Waffles? I'm not
crystal clear on which algorithm in
http://waffles.sourceforge.net/command/learn.html implements this.
Thanks!
Uh oh, it looks like that documentation is out of date. (I am not sure how
that happened.) I have updated it.
Also, if you do
waffles_learn usage
it should show you up-to-date documentation, including "bmc".