Random Forests / by Siobhán Cronin

How it works (in a nutshell)

This ensemble method constructs many decision trees during the training phase, with each split considering only a random subset of the available features, and outputs the mode (for classification) or mean (for regression) of the individual trees' predictions. In this way, Random Forests build up a prediction by pooling the output of a suite of "weak" learners. Because each tree is trained on a random sample of the data, the combined estimate varies less than any single tree, thereby reducing the risk of overfitting.
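
As a rough, illustrative sketch of the idea (not a canonical implementation), the snippet below bags scikit-learn decision trees on bootstrap samples, restricts each split to a random subset of features, and takes the mode of the trees' votes as the forest's prediction. The dataset, tree count, and other parameter choices are placeholders.

```python
# A minimal sketch: bag decision trees on bootstrap samples, restrict each
# split to a random subset of features, and take the mode of the trees'
# votes as the forest's prediction.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset at each split
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])         # one row of votes per tree
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])  # mode per sample
print("ensemble accuracy on training data:", (forest_pred == y).mean())
```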

Applications

  • Bioinformatics, including "classifying different types of samples using gene expression of microarrays data, identifying disease associated genes from genome wide association studies, recognizing the important elements in protein sequences, or identifying protein-protein interactions."[4]
  • Breast cancer prediction [5]

Perks

  • "Forests of trees splitting with oblique hyperplanes, if randomly restricted to be sensitive to only selected feature dimensions, can gain accuracy as they grow without suffering from overtraining" [1]
  • Reduces overfitting by training each tree independently on a random sample of the data. [2]

Drawbacks [3]

  • While fast to train, Random Forests can be slow to generate predictions.
  • Will not perform well when predictions must extrapolate beyond the range of the dependent or independent variables seen in training (see the sketch after this list).
  • The produced results can be difficult to interpret.
  • Performs less well than other algorithms (e.g. linear regression) when the relationship between independent and dependent variables is linear.
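
To make the extrapolation and linearity drawbacks concrete, the hedged sketch below fits a random forest and a linear regression to a purely linear relationship and asks both to predict outside the training range; the specific data and numbers are illustrative only.

```python
# Illustrative sketch: a random forest cannot extrapolate beyond the range of
# its training targets, whereas a linear model can.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X_train = np.arange(0.0, 10.0, 0.1).reshape(-1, 1)
y_train = 2 * X_train.ravel()                     # a purely linear relationship

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_new = np.array([[20.0]])                        # well outside the training range
print("forest:", forest.predict(X_new))           # plateaus near max(y_train), ~19.8
print("linear:", linear.predict(X_new))           # extrapolates to ~40
```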

Resources

[1] Wikipedia
[2] Random Forests and Boosting
[3] When is a random forest a poor choice relative to other algorithms?
[4] Random Forests for bioinformatics
[5] Prediction of Breast Cancer using Random Forest, Support Vector Machines and Naïve Bayes