I would like to ask:
a) how are missing values handled?
b) are continuous attributes fully supported (not only "discretized")?
c) is there any important difference in the algorithm compared to the Breiman
paper?
Thanks,
Vojtěch R.
a)
At training time, missing values are replaced at the last possible moment with
the most-common value (or the mean, in the case of continuous values) among
the instances assigned to that sub-branch. (This "default" value is stored in
the interior node.) At evaluation time, missing values are assumed to be the
"default" value stored at the interior node.
This approach is fast, but I have achieved better results by pre-processing
data with an imputation algorithm to predict missing values. The NonlinearPCA
algorithm (also included in Waffles) generally does a very good job for this
purpose, and yields the best results I have found.
b)
For continuous attributes, I draw two patterns at random from the remaining
data, and compute the average of their values as a candidate value to split
on. After 'k' draws, I divide on the candidate value that minimizes entropy.
(This approach is very efficient, but it differs slightly from other
implementations.)
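In Python terms, the procedure is something like this sketch (again just an
illustration of the described technique, not the Waffles source; the function
names are my own):

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr, threshold):
    left = [y for r, y in zip(rows, labels) if r[attr] < threshold]
    right = [y for r, y in zip(rows, labels) if r[attr] >= threshold]
    if not left or not right:
        return float("inf")            # degenerate split, never chosen
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def pick_threshold(rows, labels, attr, k, rng=random.Random(0)):
    """Draw two patterns at random and average their attribute values to form
    a candidate threshold; after k draws, keep the candidate whose split has
    the lowest weighted entropy."""
    best_t, best_e = None, float("inf")
    for _ in range(k):
        a, b = rng.choice(rows), rng.choice(rows)
        t = (a[attr] + b[attr]) / 2.0
        e = split_entropy(rows, labels, attr, t)
        if e < best_e:
            best_t, best_e = t, e
    return best_t
```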
c)
The code was originally written to conform to the description at
http://en.wikipedia.org/wiki/Random_forest.
If you find any substantial differences from the Breiman paper, please report
them, and I will fix it as a bug (and update Wikipedia too).
If you are interested in maximizing predictive accuracy, I might recommend
trying a BMC ensemble instead of a bagging ensemble with your random trees. In
my experimentation, this consistently outperforms random forest. Also, if you
include meanmargins trees in your ensemble (also included in Waffles), this
can improve the diversity within your ensemble and lead to better results with
fewer trees.
Thanks for the answer.
Is there any way to obtain an out-of-bag generalization error estimate?
An estimate of the error rate can be obtained, based on the training data,
by the following:
1. At each bootstrap iteration, predict the data not in the bootstrap sample
(what Breiman calls "out-of-bag", or OOB, data) using the tree grown with the
bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point would be
out-of-bag around 36% of the times, so aggregate these predictions.)
Calculate the error rate, and call it the OOB estimate of error rate.
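In code terms, the quoted procedure amounts to something like this minimal
Python sketch (train_tree and predict are hypothetical placeholders for
whatever tree learner is used):

```python
import random
from collections import Counter

def oob_error(rows, labels, train_tree, predict, n_trees,
              rng=random.Random(0)):
    """Each tree votes on the rows left out of its bootstrap sample; the
    majority OOB vote per row is compared against the true label."""
    n = len(rows)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        tree = train_tree([rows[i] for i in in_bag],
                          [labels[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):              # out-of-bag rows
            votes[i][predict(tree, rows[i])] += 1
    scored = [(v.most_common(1)[0][0], labels[i])
              for i, v in enumerate(votes) if v]
    return sum(pred != truth for pred, truth in scored) / len(scored)
```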
I use the following similar approach:
1- Use cross-validation to estimate the accuracy of the model (meaning the
whole ensemble).
2- Train the model (meaning the whole ensemble) with all available data.
The advantages of this approach include:
1- It works with arbitrary models, not just bagged ensembles.
2- The cross-validation step may be performed with an arbitrary number of
repetitions to improve the accuracy of the estimate.
If you can identify any reasons why Breiman's more specific approach is
superior, I would be happy to implement it.
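For concreteness, the two steps above might look like this (a minimal Python
sketch assuming a model object with fit/predict methods; this is not the
Waffles API):

```python
import random

def estimate_then_train(rows, labels, make_model, folds=10, reps=5,
                        rng=random.Random(0)):
    """Step 1: repeated cross-validation to estimate the whole ensemble's
    accuracy. Step 2: retrain the same kind of model on all available data."""
    scores = []
    for _ in range(reps):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        for f in range(folds):
            test = set(idx[f::folds])
            model = make_model()
            model.fit([rows[i] for i in idx if i not in test],
                      [labels[i] for i in idx if i not in test])
            correct = sum(model.predict(rows[i]) == labels[i] for i in test)
            scores.append(correct / len(test))
    final = make_model()
    final.fit(rows, labels)                     # step 2: train on everything
    return final, sum(scores) / len(scores)     # model + accuracy estimate
```

Note that nothing here assumes bagging, which is why the approach works with
arbitrary models.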
I also use your approach with cross-validation.
Breiman uses this for variable importance, proximities, etc.:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
Can I ask for new functionality: a waffles_plot for random forests (like
there is for decision trees)? Or may I contribute this?
That would be a welcome contribution. Would it print every tree in the forest?
I have added you as a developer so you may push changes into the Git
repository if you like. If you need any help getting started, my e-mail
address can be found at
http://waffles.sourceforge.net/.
Just out of curiosity -- what is a BMC ensemble?
BMC expands to Bayesian Model Combination. It is an ensemble technique that
uses statistics to combine models more effectively than bagging. Here's the
Wikipedia entry about it:
http://en.wikipedia.org/wiki/Ensemble_learning#Bayesian_model_combination
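As I read that description, the rough idea is to sample many candidate
weightings of the models, score each weighting by its likelihood on held-out
data, and average the weightings by their posterior mass. A toy Python
sketch (my own illustration, not the Waffles implementation):

```python
import math
import random

def bmc_weights(true_class_probs, n_samples=1000, rng=random.Random(0)):
    """true_class_probs[m][i]: probability model m assigns to the correct
    class of held-out row i. Sample candidate model weightings from a flat
    Dirichlet, score each by its log-likelihood on the held-out rows, and
    return the posterior-weighted average weighting."""
    M, n = len(true_class_probs), len(true_class_probs[0])
    candidates = []
    for _ in range(n_samples):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(M)]  # ~ Dirichlet(1s)
        w = [x / sum(g) for x in g]
        ll = sum(math.log(max(sum(w[m] * true_class_probs[m][i]
                                  for m in range(M)), 1e-12))
                 for i in range(n))
        candidates.append((w, ll))
    top = max(ll for _, ll in candidates)        # stabilize the exponentials
    post = [math.exp(ll - top) for _, ll in candidates]
    z = sum(post)
    return [sum(p * w[m] for (w, _), p in zip(candidates, post)) / z
            for m in range(M)]
```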
Mike,
I'm still a little confused on this. Is BMC implemented in Waffles? I'm not
crystal clear on which algorithm in
http://waffles.sourceforge.net/command/learn.html implements this.
Thanks!
Uh oh, it looks like that documentation is out of date. (I am not sure how
that happened.) I have updated it.
Also, if you do
waffles_learn usage
it should show you up-to-date documentation, including "bmc".