
Random Forest Variable Importance Measure?

Help
Anonymous
2011-12-25
2012-09-14
  • Anonymous

    Anonymous - 2011-12-25

    I haven't dug deeply into the source code yet, so I thought I would ask here
    first if the randomforest learner has any way to determine variable
    importance, as in the original Breiman paper?

    Thanks, Bill

     
  • Mike Gashler

    Mike Gashler - 2011-12-26

    My implementation of RF randomly chooses k candidate attributes, and divides
    on the one that minimizes entropy. If k=1, then divisions are completely
    random. As k approaches the number of attributes, behavior approaches that of
    C4.5 decision trees. You can also create bagging ensembles of other kinds of
    trees, or other learning algorithms. Waffles also contains several other
    ensemble techniques. For example, when I use the BayesianModelCombination
    ensemble technique with random trees, it outperforms random forest more often
    than not.
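    The k-candidate split Mike describes can be sketched in a few lines. The
    following is an illustrative Python sketch of the idea (choose k random
    candidate attributes, divide on the one minimizing entropy), not the
    actual Waffles C++ code; the binary threshold-at-median split is a
    simplifying assumption for the sketch:

    ```python
    import math
    import random

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        counts = {}
        for y in labels:
            counts[y] = counts.get(y, 0) + 1
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def choose_split(rows, labels, k):
        """Pick the best of k randomly chosen attributes: the one whose
        (binary, threshold-at-median) split minimizes the weighted entropy
        of the two children. With k=1 the choice is completely random; as k
        approaches the attribute count, behavior approaches a greedy
        (C4.5-style) split. Returns (attribute index, split score)."""
        n_attrs = len(rows[0])
        candidates = random.sample(range(n_attrs), min(k, n_attrs))
        best_attr, best_score = None, float("inf")
        for a in candidates:
            values = sorted(r[a] for r in rows)
            thresh = values[len(values) // 2]
            left = [y for r, y in zip(rows, labels) if r[a] < thresh]
            right = [y for r, y in zip(rows, labels) if r[a] >= thresh]
            if not left or not right:
                continue  # degenerate split (e.g. constant attribute)
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(rows)
            if score < best_score:
                best_attr, best_score = a, score
        return best_attr, best_score
    ```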

     
  • Nobody/Anonymous

    I am looking for a replacement for RandomJungle, which no longer supplies
    source code. I don't need classification; I need a ranking of the most
    important attributes, which the original Breiman paper/implementation and
    RandomJungle provided in output files called <foo>.importance. We are
    dealing with over 500,000 attributes and are building a new algorithm
    called Evaporative Cooling that combines ReliefF with Random Forest for
    rankings of attributes to go into other bioinformatics analyses, not
    classification/prediction. Thanks, Bill

     
  • Nobody/Anonymous

    LEO BREIMAN

    Statistics Department, University of California, Berkeley, CA 94720

    Editor: Robert E. Schapire

    Abstract. Random forests are a combination of tree predictors such that
    each tree depends on the values of a random vector sampled independently
    and with the same distribution for all trees in the forest. The
    generalization error for forests converges a.s. to a limit as the number
    of trees in the forest becomes large. The generalization error of a
    forest of tree classifiers depends on the strength of the individual
    trees in the forest and the correlation between them. Using a random
    selection of features to split each node yields error rates that compare
    favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning:
    Proceedings of the Thirteenth International Conference, ∗∗∗, 148–156),
    but are more robust with respect to noise. Internal estimates monitor
    error, strength, and correlation and these are used to show the response
    to increasing the number of features used in the splitting. Internal
    estimates are also used to measure variable importance. These ideas are
    also applicable to regression.

     
  • Nobody/Anonymous

    1. Exploring the random forest mechanism

      A forest of trees is impenetrable as far as simple interpretations of
      its mechanism go. In some applications, **analysis of medical
      experiments for example, it is critical to understand the interaction
      of variables that is providing the predictive accuracy.** A start on
      this problem is made by using internal out-of-bag estimates, and
      verification by reruns using only selected variables.

     
  • Mike Gashler

    Mike Gashler - 2011-12-27

    I have a tool that will rank attributes by importance. For example, the
    following command will display a ranked list of attributes by importance:

    waffles_dimred attributeselector data.arff

    It works with regression problems as well as classification problems.
    Unfortunately, it does not use random forest. It uses logistic regression, and
    removes the least-important attribute one-at-a-time. I am planning to improve
    this tool so that the user can specify a preferred model, but I haven't gotten
    around to it yet.

    I know WEKA contains several methods for doing attribute ranking, but I'm not
    familiar enough to know how well they support doing it for regression
    problems.
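    For readers curious what the procedure Mike describes looks like, here is
    a rough Python sketch of the idea: train a logistic regression, drop the
    least-important attribute, repeat, and read the ranking off the removal
    order. The tiny gradient-descent trainer and the use of |weight| as the
    importance score are assumptions made for illustration, not the actual
    internals of the `attributeselector` tool:

    ```python
    import math

    def train_logreg(X, y, epochs=500, lr=0.5):
        """Tiny SGD logistic regression (no bias, no regularization);
        returns one weight per column."""
        w = [0.0] * len(X[0])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                z = sum(wj * xj for wj, xj in zip(w, xi))
                p = 1.0 / (1.0 + math.exp(-z))
                for j in range(len(w)):
                    w[j] += lr * (yi - p) * xi[j]
        return w

    def rank_attributes(X, y):
        """Backward elimination: repeatedly retrain and drop the attribute
        with the smallest absolute weight. Returns attribute indices ordered
        from least important (removed first) to most important."""
        cols = [list(col) for col in zip(*X)]
        remaining = list(range(len(X[0])))
        ranking = []
        while remaining:
            Xsub = [[cols[j][i] for j in remaining] for i in range(len(X))]
            w = train_logreg(Xsub, y)
            worst = min(range(len(remaining)), key=lambda j: abs(w[j]))
            ranking.append(remaining.pop(worst))
        return ranking
    ```

    Retraining after each removal is what distinguishes this from simply
    sorting the weights of a single fit: a weight can change once a
    correlated attribute has been dropped.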

     
  • Nobody/Anonymous

    Mike,

    Logistic regression is the classic way to find attributes of importance
    in genetic epidemiology, but it fails when attributes interact in complex
    ways, particularly epistatic interactions. We have developed several
    modifications to ReliefF, which can handle interactions well, but we want
    a good main-effects detector to "couple" with the interaction effects.
    Weka is a great tool, but it lacks parallel processing in its ReliefF
    implementation. And Java just won't cut it compared to optimized C++. I
    am leaving my current post, so I will no longer be working on this
    problem. It will be left to whoever comes after me. I want to suggest
    they extend your random forests implementation to include Breiman's
    importance algorithm.

    Moore, J.H., White, B.C. Tuning ReliefF for genome-wide genetic analysis.
    Lecture Notes in Computer Science 4447, 166-175 (2007).

    McKinney, B.A., Reif, D.M., White, B.C., Crowe, J.E., Moore, J.H. Evaporative
    cooling feature selection for genotypic data involving interactions.
    Bioinformatics 23, 2113-2120 (2007).

    Thanks,

    Bill

     
  • Nobody/Anonymous

    The...

    ... challenge that genetic epidemiologists and computational biologists
    face is the statistical modeling problem. That is, what is the most
    appropriate way to model the relationship between combinations of
    genetic variations or gene expression variables and clinical endpoints?
    One traditional approach to modeling the relationship between genetic
    variations or gene expression variables and discrete clinical outcomes
    is logistic regression (Hosmer and Lemeshow 2000). Logistic regression
    is a parametric statistical approach for relating one or more
    independent or explanatory variables to a dependent or outcome variable
    (e.g. disease status) that follows a binomial distribution. However, as
    reviewed by Moore and Williams (2002), the number of possible
    interaction terms grows exponentially as each additional main effect is
    included in the logistic regression model. Thus, logistic regression,
    like most parametric statistical methods, is limited in its ability to
    deal with interaction data involving many simultaneous factors.

    And...

    Similarly, in order to detect the epistatic interaction of two loci, a
    full logistic regression model with at most 9 parameters can be fitted
    and tested, and the p-values should be multiplied by L(L-1)/2 according
    to the Bonferroni correction. Because the number of SNPs is typically
    huge (e.g., several hundred thousand) in genome-wide case-control
    studies, an exhaustive search for all possible combinations of SNPs is
    computationally impractical. To overcome this limitation, the stepwise
    logistic regression approach first selects a small fraction (ε, e.g.,
    10%) of loci according to the significance of their single-locus
    associations and then tests the interactions between the selected loci.
    The determination of the fraction ε is, however, not guided. An approach
    that is able to automatically determine such a small set of good
    candidate markers is therefore preferred.
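    The arithmetic in the quoted passage is easy to make concrete: with L
    loci there are L(L-1)/2 pairs, and the Bonferroni correction divides the
    significance level by that test count. A small Python sketch (the
    function name is mine, for illustration only):

    ```python
    def bonferroni_pairs(num_loci, alpha=0.05):
        """Number of locus pairs in an exhaustive pairwise scan, and the
        per-test significance threshold after Bonferroni correction."""
        tests = num_loci * (num_loci - 1) // 2
        return tests, alpha / tests
    ```

    For the 600,000 SNPs mentioned elsewhere in this thread, that works out
    to roughly 1.8 × 10^11 pairwise tests, which is why the exhaustive
    search is called impractical.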

     
  • Bill White

    Bill White - 2011-12-29

    The random forest provides another randomization mechanism to estimate the
    importance of individual features. When a decision tree is constructed, the
    correct classifications for the OOB samples can be counted. Now, for a feature
    v, randomly permute its values in the OOB samples and again count the correct
    classifications. The average of the difference in these two counts over all
    trees in a forest is then defined as the raw importance of the feature v.

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2648748/?tool=pubmed
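    Breiman's permutation procedure described above is model-agnostic in
    spirit, which anticipates Mike's follow-up question. Here is a minimal
    Python sketch of it, assuming fitted trees expose a hypothetical
    `predict(row)` method and that the out-of-bag row indices for each tree
    were recorded during bagging:

    ```python
    import random

    def permutation_importance(trees, oob_indices, X, y, feature, seed=0):
        """Raw importance of `feature`, per Breiman: for each tree, count
        correct OOB classifications before and after randomly permuting that
        feature's values within the OOB samples, and average the drop in
        correct counts over all trees. `oob_indices[t]` lists the rows tree
        t did not train on."""
        rng = random.Random(seed)
        total_drop = 0.0
        for tree, oob in zip(trees, oob_indices):
            # Correct OOB classifications with the data intact.
            correct = sum(tree.predict(X[i]) == y[i] for i in oob)
            # Permute the feature's values among the OOB samples only.
            permuted = [X[i][feature] for i in oob]
            rng.shuffle(permuted)
            correct_perm = 0
            for i, v in zip(oob, permuted):
                row = list(X[i])
                row[feature] = v
                correct_perm += tree.predict(row) == y[i]
            total_drop += correct - correct_perm
        return total_drop / len(trees)
    ```

    Nothing here inspects the tree's internals, only its predictions, which
    is why the same procedure would work with any base model, not just
    random forest.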

     
  • Mike Gashler

    Mike Gashler - 2011-12-29

    Makes sense. It seems this approach could be implemented generally, to work
    with any model. Is there any reason it is tied to random forest?

    If you or the new people who will be working on this problem would like access
    to our code repository, just send me an e-mail and I'll set it up. (I try to
    keep the barriers-to-development at an absolute minimum.)

     
  • Bill White

    Bill White - 2011-12-29

    The only reason is that there's a great deal of literature already citing
    this method of ranking biological variables, and random forests are very
    fast. We are dealing with very large data, taking days of processing on
    tens of cores. When it comes down to it, we are trusting that it works
    better than other models based on previous research. Do you have any
    suggestions for models that would be amenable to this process and perform
    as efficiently at finding interactions? There are many papers out there
    that like random forests for variable selection combined with classifiers
    such as SVMs. In fact, our evaporative cooling paper uses random forests
    plus a naive Bayes classifier. It is the efficient selection of variables
    from hundreds of thousands that is tricky. We want BOTH main effects AND
    interaction effects. ReliefF and RF seem to do very well on small tests,
    but it remains to be seen how they do on real data on the order of 5,000
    individuals/rows and 600,000 variables/attributes/columns.

    We are trying to use RandomJungle, but they are not releasing source code
    any more. It works great for binary class association, but we're having
    trouble trying to reverse engineer the regression part of it, since
    there's no documentation for the library.

    Thanks for your help!

     
