I haven't dug deeply into the source code yet, so I thought I would ask here
first if the random forest learner has any way to determine variable
importance, as in the original Breiman paper?
Thanks, Bill
My implementation of RF randomly chooses k candidate attributes, and divides
on the one that minimizes entropy. If k=1, then divisions are completely
random. As k approaches the number of attributes, behavior approaches that of
C4.5 decision trees. You can also create bagging ensembles of other kinds of
trees, or other learning algorithms. Waffles also contains several other
ensemble techniques. For example, when I use the BayesianModelCombination
ensemble technique with random trees, it outperforms random forest more often
than not.
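The split rule described above can be sketched in a few lines. This is a minimal illustration with made-up names (the actual Waffles implementation is C++, and categorical attributes are assumed for simplicity):

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_split(rows, labels, k):
    """Pick the best of k randomly chosen candidate attributes.

    k=1 gives a completely random split; as k approaches the number
    of attributes, this approaches ordinary greedy (C4.5-style)
    split selection.
    """
    n_attrs = len(rows[0])
    candidates = random.sample(range(n_attrs), k)
    best_attr, best_score = None, float("inf")
    for a in candidates:
        # Weighted entropy of the partition induced by attribute a.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        score = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        if score < best_score:
            best_attr, best_score = a, score
    return best_attr
```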
I am looking for a replacement for RandomJungle, which no longer supplies
source code. I don't need classification; I need a ranking of the most
important attributes, which the original Breiman paper/implementation and
RandomJungle provided in output files called <foo>.importance. We are dealing
with over 500,000 attributes and are building a new algorithm called
Evaporative Cooling that combines ReliefF with Random Forest to rank
attributes for other bioinformatics analyses, not
classification/prediction. Thanks, Bill
LEO BREIMAN
Statistics Department, University of California, Berkeley, CA 94720
Editor: Robert E. Schapire
Abstract. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ∗ ∗ ∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
A forest of trees is impenetrable as far as simple interpretations of its mechanism go. In some applications, **analysis of medical experiments for example, it is critical to understand the interaction of variables that is providing the predictive accuracy.** A start on this problem is made by using internal out-of-bag estimates, and verification by reruns using only selected variables.
I have a tool that will rank attributes by importance. For example, the
following command will display a ranked list of attributes by importance:
waffles_dimred attributeselector data.arff
It works with regression problems as well as classification problems.
Unfortunately, it does not use random forest. It uses logistic regression and
removes the least-important attribute one at a time. I am planning to improve
this tool so that the user can specify a preferred model, but I haven't gotten
around to it yet.
I know WEKA contains several methods for doing attribute ranking, but I'm not
familiar enough to know how well they support doing it for regression
problems.
Mike,
Logistic regression is the classic way to find attributes of importance in
genetic epidemiology, but it fails when attributes interact in complex ways,
particularly epistatic interactions. We have developed several modifications
to ReliefF, which can handle interactions well, but we want to couple them
with a good main-effects detector. Weka is a great tool, but it lacks
parallel processing in its ReliefF implementation, and Java just won't cut it
compared to optimized C++. I am leaving my current post, so I will no longer
be working on this problem; it will be left to whoever comes after me. I want
to suggest they extend your random forests implementation to include
Breiman's importance algorithm.
Moore, J.H., White, B.C. Tuning ReliefF for genome-wide genetic analysis.
Lecture Notes in Computer Science 4447, 166-175 (2007).
McKinney, B.A., Reif, D.M., White, B.C., Crowe, J.E., Moore, J.H. Evaporative
cooling feature selection for genotypic data involving interactions.
Bioinformatics 23, 2113-2120 (2007).
Thanks,
Bill
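For readers unfamiliar with the algorithm mentioned above, the core idea can be sketched in a simplified single-neighbor "Relief" form (the cited papers describe the full multi-neighbor ReliefF; names and the [0, 1] feature scaling are assumptions of this sketch):

```python
import numpy as np

def relief_weights(X, y, n_iter=100, rng=None):
    """Single-neighbor Relief attribute weights (simplified ReliefF).

    Attributes whose values separate an instance from its nearest
    neighbor of the *other* class (nearest miss) gain weight; those
    that differ from the nearest neighbor of the *same* class
    (nearest hit) lose weight.  Assumes X is scaled to [0, 1].
    """
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to instance i
        dists[i] = np.inf                      # exclude the instance itself
        same, other = y == y[i], y != y[i]
        hit = np.where(same, dists, np.inf).argmin()
        miss = np.where(other, dists, np.inf).argmin()
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```

Because the update compares whole instances to their neighbors, an attribute can earn weight even when it matters only in combination with others, which is why this family of methods handles interactions better than per-attribute tests.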
The...
... challenge that genetic epidemiologists and computational biologists face is the statistical modeling problem. That is, what is the most appropriate way to model the relationship between combinations of genetic variations or gene expression variables and clinical endpoints? One traditional approach to modeling the relationship between genetic variations or gene expression variables and discrete clinical outcomes is logistic regression (Hosmer and Lemeshow 2000). Logistic regression is a parametric statistical approach for relating one or more independent or explanatory variables to a dependent or outcome variable (e.g. disease status) that follows a binomial distribution. However, as reviewed by Moore and Williams (2002), the number of possible interaction terms grows exponentially as each additional main effect is included in the logistic regression model. Thus, logistic regression, like most parametric statistical methods, is limited in its ability to deal with interaction data involving many simultaneous factors.
And...
Similarly, in order to detect the epistatic interaction of two loci, a full logistic regression model with at most 9 parameters can be fitted and tested, and the p-values should be multiplied by L(L-1)/2 according to the Bonferroni correction. Because the number of SNPs is typically huge (e.g., several hundred thousand) in genome-wide case-control studies, an exhaustive search for all possible combinations of SNPs is computationally impractical. To overcome this limitation, the stepwise logistic regression approach first selects a small fraction (ε, e.g., 10%) of loci according to the significance of their single-locus associations and then tests the interactions between the selected loci. The determination of the fraction ε is, however, not guided. An approach that is able to automatically determine such a small set of good candidate markers is therefore preferred.
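To put a number on that L(L-1)/2 term at the genome-wide scale discussed elsewhere in this thread:

```python
# Number of pairwise interaction tests for L loci: L*(L-1)/2.
def n_pairs(L):
    return L * (L - 1) // 2

# A panel of 600,000 SNPs implies ~1.8e11 pairwise tests, which is
# why an exhaustive two-locus scan is computationally impractical.
print(n_pairs(600_000))   # 179999700000
```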
The random forest provides another randomization mechanism to estimate the
importance of individual features. When a decision tree is constructed, the
correct classifications for the OOB samples can be counted. Now, for a feature
v, randomly permute its values in the OOB samples and again count the correct
classifications. The average of the difference in these two counts over all
trees in a forest is then defined as the raw importance of the feature v.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648748/?tool=pubmed
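That permutation scheme can be sketched in a few lines. This is a minimal, model-agnostic version with made-up names; a faithful OOB version would score each tree on its own out-of-bag rows rather than a single evaluation set:

```python
import numpy as np

def permutation_importance(model, X, y, rng=None):
    """Raw permutation importance: the drop in accuracy when one
    feature's values are shuffled, per the scheme quoted above.
    `model` only needs a predict(X) method, so nothing here is
    specific to random forests."""
    rng = rng or np.random.default_rng(0)
    base = np.mean(model.predict(X) == y)      # unpermuted accuracy
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                  # permute feature j only
        scores.append(base - np.mean(model.predict(Xp) == y))
    return np.array(scores)
```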
Makes sense. It seems this approach could be implemented generally, to work
with any model. Is there any reason it is tied to random forest?
If you or the new people who will be working on this problem would like access
to our code repository, just send me an e-mail and I'll set it up. (I try to
keep the barriers-to-development at an absolute minimum.)
The only reason is that there's a great deal of literature already citing this
method of ranking biological variables and random forests are very fast. We
are dealing with very large data, taking days of processing on tens of cores.
When it comes down to it, we are trusting it works better than other models
based on previous research. Do you have any suggestions for models that would
be amenable to this process and perform as efficiently at finding
interactions? There are many papers out there that like random forests for
variable selection combined with classifiers such as SVMs. In fact, our
evaporative cooling paper uses random forests plus a naive Bayes classifier.
It is the efficient selection of variables from hundreds of thousands that is
tricky. We want BOTH main effects AND interaction effects. ReliefF and RF seem
to do very well on small tests, but it remains to be seen how they fare on
real data on the order of 5,000 individuals/rows and 600,000
variables/attributes/columns.
We are trying to use RandomJungle, but they are no longer releasing source
code. It works great for binary class association, but we're having trouble
reverse engineering the regression part, since there's no documentation for
the library.
Thanks for your help!