I haven't dug deeply into the source code yet, so I thought I would ask here
first if the random forest learner has any way to determine variable
importance, as in the original Breiman paper?
Thanks, Bill
My implementation of RF randomly chooses k candidate attributes, and divides
on the one that minimizes entropy. If k=1, then divisions are completely
random. As k approaches the number of attributes, behavior approaches that of
C4.5 decision trees. You can also create bagging ensembles of other kinds of
trees, or other learning algorithms. Waffles also contains several other
ensemble techniques. For example, when I use the BayesianModelCombination
ensemble technique with random trees, it outperforms random forest more often
than not.
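The split rule described above can be sketched in a few lines. This is a minimal illustration with made-up names (the actual Waffles implementation is C++, and categorical attributes are assumed for simplicity):

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_split(rows, labels, k):
    """Pick the best of k randomly chosen candidate attributes.

    k=1 gives a completely random split; as k approaches the number
    of attributes, this approaches ordinary greedy (C4.5-style)
    split selection.
    """
    n_attrs = len(rows[0])
    candidates = random.sample(range(n_attrs), k)
    best_attr, best_score = None, float("inf")
    for a in candidates:
        # Weighted entropy of the partition induced by attribute a.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        score = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        if score < best_score:
            best_attr, best_score = a, score
    return best_attr
```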
I am looking for a replacement for RandomJungle, which no longer supplies
source code. I don't need classification; I need a ranking of the most
important attributes, which the original Breiman paper/implementation and
RandomJungle provided in output files called <foo>.importance. We are dealing
with over 500,000 attributes and are building a new algorithm called
Evaporative Cooling that combines ReliefF with Random Forest to rank
attributes for other bioinformatics analyses, not
classification/prediction. Thanks, Bill
LEO BREIMAN
Statistics Department, University of California, Berkeley, CA 94720
Editor: Robert E. Schapire
Abstract. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ∗ ∗ ∗, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
A forest of trees is impenetrable as far as simple interpretations of its mechanism go. In some applications, **analysis of medical experiments for example, it is critical to understand the interaction of variables that is providing the predictive accuracy.** A start on this problem is made by using internal out-of-bag estimates, and verification by reruns using only selected variables.
I have a tool that will rank attributes by importance. For example, the
following command will display a ranked list of attributes by importance:
waffles_dimred attributeselector data.arff
It works with regression problems as well as classification problems.
Unfortunately, it does not use random forest. It uses logistic regression and
removes the least-important attribute one at a time. I am planning to improve
this tool so that the user can specify a preferred model, but I haven't gotten
around to it yet.
I know WEKA contains several methods for doing attribute ranking, but I'm not
familiar enough to know how well they support doing it for regression
problems.
Mike,
Logistic regression is the classic way to find attributes of importance in
genetic epidemiology, but it fails when attributes interact in complex ways,
particularly epistatic interactions. We have developed several modifications
to ReliefF, which can handle interactions well, but we want to couple them
with a good main-effects detector. Weka is a great tool, but it lacks
parallel processing in its ReliefF implementation, and Java just won't cut it
compared to optimized C++. I am leaving my current post, so I will no longer
be working on this problem; it will be left to whoever comes after me. I want
to suggest they extend your random forests implementation to include
Breiman's importance algorithm.
Moore, J.H., White, B.C. Tuning ReliefF for genome-wide genetic analysis.
Lecture Notes in Computer Science 4447, 166-175 (2007).
McKinney, B.A., Reif, D.M., White, B.C., Crowe, J.E., Moore, J.H. Evaporative
cooling feature selection for genotypic data involving interactions.
Bioinformatics 23, 2113-2120 (2007).
Thanks,
Bill
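For readers unfamiliar with the algorithm mentioned above, the core idea can be sketched in a simplified single-neighbor "Relief" form (the cited papers describe the full multi-neighbor ReliefF; names and the [0, 1] feature scaling are assumptions of this sketch):

```python
import numpy as np

def relief_weights(X, y, n_iter=100, rng=None):
    """Single-neighbor Relief attribute weights (simplified ReliefF).

    Attributes whose values separate an instance from its nearest
    neighbor of the *other* class (nearest miss) gain weight; those
    that differ from the nearest neighbor of the *same* class
    (nearest hit) lose weight.  Assumes X is scaled to [0, 1].
    """
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to instance i
        dists[i] = np.inf                      # exclude the instance itself
        same, other = y == y[i], y != y[i]
        hit = np.where(same, dists, np.inf).argmin()
        miss = np.where(other, dists, np.inf).argmin()
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```

Because the update compares whole instances to their neighbors, an attribute can earn weight even when it matters only in combination with others, which is why this family of methods handles interactions better than per-attribute tests.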
The...
... challenge that genetic epidemiologists and computational biologists face is the statistical modeling problem. That is, what is the most appropriate way to model the relationship between combinations of genetic variations or gene expression variables and clinical endpoints? One traditional approach to modeling the relationship between genetic variations or gene expression variables and discrete clinical outcomes is logistic regression (Hosmer and Lemeshow 2000). Logistic regression is a parametric statistical approach for relating one or more independent or explanatory variables to a dependent or outcome variable (e.g. disease status) that follows a binomial distribution. However, as reviewed by Moore and Williams (2002), the number of possible interaction terms grows exponentially as each additional main effect is included in the logistic regression model. Thus, logistic regression, like most parametric statistical methods, is limited in its ability to deal with interaction data involving many simultaneous factors.
And...
Similarly, in order to detect the epistatic interaction of two loci, a full logistic regression model with at most 9 parameters can be fitted and tested, and the p-values should be multiplied by L(L-1)/2 according to the Bonferroni correction. Because the number of SNPs is typically huge (e.g., several hundred thousand) in genome-wide case-control studies, an exhaustive search for all possible combinations of SNPs is computationally impractical. To overcome this limitation, the stepwise logistic regression approach first selects a small fraction (ε, e.g., 10%) of loci according to the significance of their single-locus associations and then tests the interactions between the selected loci. The determination of the fraction ε is, however, not guided. An approach that is able to automatically determine such a small set of good candidate markers is therefore preferred.
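To put a number on that L(L-1)/2 term at the genome-wide scale discussed elsewhere in this thread:

```python
# Number of pairwise interaction tests for L loci: L*(L-1)/2.
def n_pairs(L):
    return L * (L - 1) // 2

# A panel of 600,000 SNPs implies ~1.8e11 pairwise tests, which is
# why an exhaustive two-locus scan is computationally impractical.
print(n_pairs(600_000))   # 179999700000
```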
The random forest provides another randomization mechanism to estimate the
importance of individual features. When a decision tree is constructed, the
correct classifications for the OOB samples can be counted. Now, for a feature
v, randomly permute its values in the OOB samples and again count the correct
classifications. The average of the difference in these two counts over all
trees in a forest is then defined as the raw importance of the feature v.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648748/?tool=pubmed
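That permutation scheme can be sketched in a few lines. This is a minimal, model-agnostic version with made-up names; a faithful OOB version would score each tree on its own out-of-bag rows rather than a single evaluation set:

```python
import numpy as np

def permutation_importance(model, X, y, rng=None):
    """Raw permutation importance: the drop in accuracy when one
    feature's values are shuffled, per the scheme quoted above.
    `model` only needs a predict(X) method, so nothing here is
    specific to random forests."""
    rng = rng or np.random.default_rng(0)
    base = np.mean(model.predict(X) == y)      # unpermuted accuracy
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                  # permute feature j only
        scores.append(base - np.mean(model.predict(Xp) == y))
    return np.array(scores)
```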
Makes sense. It seems this approach could be implemented generally, to work
with any model. Is there any reason it is tied to random forest?
If you or the new people who will be working on this problem would like access
to our code repository, just send me an e-mail and I'll set it up. (I try to
keep the barriers-to-development at an absolute minimum.)
The only reason is that there's a great deal of literature already citing this
method of ranking biological variables and random forests are very fast. We
are dealing with very large data, taking days of processing on tens of cores.
When it comes down to it, we are trusting it works better than other models
based on previous research. Do you have any suggestions for models that would
be amenable to this process and perform as efficiently at finding
interactions? There are many papers out there that like random forests for
variable selection combined with classifiers such as SVMs. In fact, our
evaporative cooling paper uses random forests plus a naive Bayes classifier.
It is the efficient selection of variables from hundreds of thousands that is
tricky. We want BOTH main effects AND interaction effects. ReliefF and RF seem
to do very well on small tests, but it remains to be seen how they fare on
real data on the order of 5,000 individuals/rows and 600,000
variables/attributes/columns.
We are trying to use RandomJungle, but they are no longer releasing source
code. It works great for binary class association, but we're having trouble
reverse engineering the regression part, since there's no documentation for
the library.
Thanks for your help!