Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?
Abstract:
In the Life Sciences, 'omics' data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients, certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some of the, to the best of our knowledge, rarely or never used RF properties that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
Wouter G. Touw, Jumamurat R. Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels, Sacha A. F. T. van Hijum
Publication Detail:
Type: Journal Article
Journal: Briefings in bioinformatics (Medline TA / ISO abbreviation: Brief Bioinform)
From MEDLINE(R)/PubMed(R), a database of the U.S. National Library of Medicine
Journal Information
Journal ID (nlm-ta): Brief Bioinform
Journal ID (iso-abbrev): Brief. Bioinformatics
Journal ID (publisher-id): bib
Journal ID (hwp): bib
Publisher: Oxford University Press
Article Information
© The Author 2012. Published by Oxford University Press.
Received: 30 March 2012; Accepted: 26 May 2012
Print publication date: May 2013
Electronic publication date: 10 July 2012
PMC release date: 10 July 2012
Volume: 14; Issue: 3; Pages: 315-326
PMC ID: 3659301
DOI: 10.1093/bib/bbs034
Publisher ID: bbs034
Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?
Wouter G. Touw*
Jumamurat R. Bayjanov*
Lex Overmars*
Lennart Backus*
Jos Boekhorst*
Michiel Wels*
Sacha A. F. T. van Hijum*
Correspondence: Corresponding author. Sacha A. F. T. van Hijum. E-mail: svhijum@cmbi.ru.nl
BACKGROUND
Development of high-throughput techniques and accompanying technology to manage and mine large-scale data has led to a revolution in Systems Biology in the last decade []. 'Omics' technologies such as genomics, transcriptomics, proteomics, metabolomics, epigenomics and metagenomics allow rapid and parallel collection of massive amounts of different types of data for the same model system. Software tools to manage [], visualize [] and integratively analyse omics-scale data are crucial to deal with their inherent complexity and ultimately uncover new biology. For example, knowledge on both gene expression and protein abundance may better explain a phenotype than gene expression or protein abundance separately. Machine learning algorithms in particular play a central role in the process of knowledge extraction [, ]. They are applied for supervised pattern recognition in data sets: typically they are used to train a classification model that allows separating samples of different classes (e.g. healthy or ill) based on variables (e.g. SNPs in a Genome-Wide Association Study or GWAS), and to estimate which variables were important for this task (see below).
The Random Forest (RF) algorithm [] has become very popular for pattern recognition in omics-scale data, mainly because RF provides two aspects that are very important for data mining: high prediction accuracy and information on variable importance for classification. The prediction performance of RF compares well to other classification algorithms [] such as support vector machines (SVMs, [, ]), artificial neural networks [], Bayesian classifiers [, ], logistic regression [], k-nearest-neighbours [], discriminant analysis such as Fisher’s linear discriminant analysis [] and regularized discriminant analysis [], partial least squares (PLS, []) and decision trees such as classification and regression trees (CARTs, []). The theoretical and practical aspects of many of those algorithms and their application in biology have been discussed elsewhere (for example [, , ]). SVM and RF are arguably the most widely used classification techniques in the Life Sciences. Comparisons between the prediction accuracy of SVM and RF have been made several times [e.g. 24–29]. Although the performance of carefully tuned SVMs is generally slightly better than RF [], RF offers unique advantages over SVM (see below). Further comparisons between SVM and RF will not be discussed here.
Life Science data sets typically have many more variables than samples. This problem is known as the 'curse of dimensionality' or the small n, large p problem []. For instance, genomics, transcriptomics, proteomics and GWAS data sets suffer from this problem, with in general thousands of measurements of genes, transcripts, proteins or SNPs determined for only dozens of samples []. RF effectively handles these data sets by training many decision trees using subsets of the data. Furthermore, RF has the potential to unravel variable interactions, which are ubiquitous in data sets generated in the Life Sciences. Interactions can for example be expected between SNPs in GWAS [], between microbiota in metagenomics [], between physicochemical properties of peptides in proteomic biomarker discovery studies [] and between cellular levels of gene products in gene-expression studies []. Additionally, the combinations of variables that together define molecules, e.g. mass spectrometry m/z ratios or Nuclear Magnetic Resonance chemical shifts, can distinguish phenotypes in metabolomics and metabonomics []. A final example includes combinations of several protein characteristics influencing the success rate in structural genomics []. In summary, its versatility makes RF a very suitable technique to investigate high-throughput data in this omics era.
Recent reviews aimed towards a more specialized audience have discussed the use of RF in (i) a broad scientific context [], (ii) genomics research [] and (iii) genetic association studies []. Here, we focus on the application of RF for supervised classification in the Life Sciences. In addition to reviewing the different uses of RF, we provide ideas to make this algorithm even more suitable for uncovering complex interactions from omics data. First, we introduce the general characteristics of RF for the reader who is not familiar with RF, followed by its use to tackle problems in data analysis. We also discuss rarely used properties of RF that allow determining interaction between variables. RF even has the potential to characterize these interactions for sample subclasses (e.g. groups of patients for which a SNP combination is predictive, while for a different group of patients the same SNP combination is not). Here, we discuss several research strategies that may allow exploiting RF to its full potential.
HOW DOES RF WORK?
Predictive RF models (from now on referred to as RFM) are non-parametric, hard to over-train, relatively robust to outliers and noise, and fast to train. The RF algorithm can be used without tuning of algorithm parameters, although a better classification model can often easily be obtained by optimization of very few parameters (see below) []. RF trains an ensemble of individual decision trees based on samples, their class designation and variables. Every tree in the forest is built using a random subset of samples and variables (Figure 1), hence the name RF. The RF description by Breiman serves as a general reference for this section [, ].
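As an illustration of this out-of-the-box behaviour, the following minimal sketch trains an RFM with the 'randomForest' R package (discussed under RF IMPLEMENTATIONS below) on R's built-in iris data set; the data set is a stand-in for any samples-by-variables matrix and is our choice, not the review's.

    library(randomForest)  # R port of Breiman and Cutler's original code
    data(iris)             # stand-in data: 150 samples, 4 variables, 3 classes
    set.seed(42)
    # Defaults: 500 trees; mtry = floor(sqrt(p)) variables tried at each split
    rf <- randomForest(Species ~ ., data = iris)
    print(rf)              # reports the OOB error estimate and confusion matrix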
Suppose a forest of decision trees (e.g. CARTs) is constructed based on a given data set. For each tree, a different training set is created by randomly sampling samples (e.g. patient samples) from the data set with replacement, resulting in a training set, or 'bootstrap' set, containing about two-thirds of the samples in the original data set. The remaining samples in the original data set are the 'out-of-bag' (OOB) samples. The tree is grown using the bootstrap data set by recursive partitioning (Figure 1). For every tree 'node', variables are randomly selected from the set of all variables and evaluated for their ability to split the data (Figure 1). The variable resulting in the largest decrease in impurity is chosen to separate the samples at each 'parent node', starting at the top node, into two subsets, ending up in two distinct 'child nodes'. In RF, the impurity measure is the Gini impurity. A decrease in Gini impurity is related to an increase in the amount of order in the sample classes introduced by a split in the decision tree. After the bootstrap data has been split at the top node, the splitting process is repeated. The partitioning is finished when the final nodes, 'terminal nodes' or 'leaves', are either (i) 'pure', i.e. they contain only samples belonging to the same class, or (ii) contain a specified number of samples. A classification tree is usually grown until the terminal nodes are pure, even if that results in terminal nodes containing a single sample. The tree is thus grown to its maximal size; it is not 'pruned'. After a forest has been fully grown, the training process is completed. The RFM can subsequently be used to predict the class of a new sample. Every classification tree in the forest casts an unweighted vote for the sample, after which the majority vote determines the class of the sample.
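The split-selection criterion can be made concrete in a few lines of R. The sketch below computes the Gini impurity of a set of class labels and the size-weighted decrease in impurity achieved by a candidate split; the labels and the split are invented for illustration.

    # Gini impurity of a set of class labels: 1 - sum over classes of p_k^2
    gini <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    # Decrease in Gini impurity for a split of 'labels' into a left child
    # (rows in left_idx) and a right child, weighted by child-node sizes
    gini_decrease <- function(labels, left_idx) {
      n <- length(labels)
      gini(labels) -
        length(left_idx) / n * gini(labels[left_idx]) -
        (n - length(left_idx)) / n * gini(labels[-left_idx])
    }

    classes <- factor(c("ill", "ill", "ill", "ill", "healthy",
                        "healthy", "healthy", "ill", "healthy", "healthy"))
    gini_decrease(classes, left_idx = 1:4)  # candidate split isolating 4 'ill' samples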
Although a single tree from the RFM is a weak classifier because it is trained on a subset of the data, the combination of all trees in a forest is a strong classifier []. Random selection of candidate variables for splitting ensures a low correlation between trees and prevents over-training of an RFM. Therefore, trees in an RFM need not be pruned, in contrast to classical decision trees that do not use random selection of variables []. The expected error rate of classification of new samples by a classifier is usually estimated by cross-validation procedures, such as leave-one-out or K-fold cross-validation []. In K-fold cross-validation, the original data are randomly partitioned into K subsets (folds). Each of the K folds is used once as a test set, while the other K - 1 folds are used as training data to construct a classifier. The average of the K error rates is the expected error rate of the classification of new samples when the classifier is built with all samples. In leave-one-out cross-validation, a single sample is left out from the training set. General cross-validation procedures are unnecessary to predict the classification performance of a given RFM. A cross-validation is already built-in, as each tree in the forest has its own training (bootstrap) and test (OOB) data.
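In code, the built-in estimate is simply read off the trained model, whereas an explicit K-fold cross-validation needs a loop. A sketch continuing the hypothetical iris example (the fold assignment is random and ours):

    library(randomForest)
    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    rf$err.rate[rf$ntree, "OOB"]   # built-in OOB error estimate after all trees
    rf$confusion                   # OOB confusion matrix

    # Explicit 5-fold cross-validation, for comparison
    folds <- sample(rep(1:5, length.out = nrow(iris)))
    errs <- sapply(1:5, function(k) {
      fit <- randomForest(Species ~ ., data = iris[folds != k, ])
      mean(predict(fit, iris[folds == k, ]) != iris$Species[folds == k])
    })
    mean(errs)                     # typically close to the OOB estimate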
IMPORTANT VARIABLES FOR CLASS PREDICTION
In addition to an internal cross-validation, RF also calculates estimates of variable importance for classification []. Importance estimates can be very useful to interpret the relevance of variables for the data set under study. The importance scores can for example be used to identify biomarkers [] or as a filter to remove non-informative variables []. Two frequently used types of RF variable importance measure exist. The mean decrease in classification accuracy is based on permutation. For each tree, the classification accuracy of the OOB samples is determined both with and without random permutation of the values of the variable. The prediction accuracy after permutation is subtracted from the prediction accuracy before permutation and averaged over all trees in the forest to give the permutation importance value. The second importance measure is the Gini importance of a variable and is calculated as the sum of the Gini impurity decreases of every node in the forest for which that variable was used for splitting. The use of different variable importance measures is discussed below in more detail.
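Both measures are exposed by the randomForest package when the model is trained with importance = TRUE; a minimal sketch, again on the stand-in iris data:

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(rf, type = 1)  # permutation importance (mean decrease in accuracy)
    importance(rf, type = 2)  # Gini importance (mean decrease in node impurity)
    varImpPlot(rf)            # plots both rankings side by side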
The importance of variables for classification of a single sample is provided by RF as the local importance. It thus shows a direct link between variables and samples. As discussed in more detail below, the differences in local importance between samples can for example be used to detect variables that are important for a subset of samples of the same class (e.g. the important variables for a subtype of cancer in a data set with cancer patients and healthy subjects as classes). The local importance score is derived from all trees for which the sample was not used to train the tree (and is therefore OOB). The percentage of votes for the correct class in the permuted OOB data is subtracted from the percentage of votes for the correct class in the original OOB data to assign a local importance score for the variable of which the values were permuted. The score reflects the impact on correct classification of a given sample: negative, 0 (the variable is neutral) or positive. Local importances are rarely used and are noisier than global importances, but a robust estimation of local importance values can be obtained by running the same classification several times [] and for instance averaging the local importance scores.
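In the randomForest package, local importances are requested with localImp = TRUE and returned as a variables-by-samples matrix; the averaging over repeated runs suggested above might look as follows (10 runs is our arbitrary choice):

    library(randomForest)
    rf <- randomForest(iris[, 1:4], iris$Species, localImp = TRUE)
    dim(rf$localImportance)     # variables x samples matrix of casewise scores

    # Average local importances over repeated runs for a more robust estimate
    runs <- replicate(10,
      randomForest(iris[, 1:4], iris$Species, localImp = TRUE)$localImportance,
      simplify = FALSE)
    local_imp <- Reduce(`+`, runs) / length(runs)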
PROXIMITY SCORES ALLOW DETERMINING SIMILARITY BETWEEN SAMPLES
RF not only generates variable-related information such as variable importance measures, but also calculates the proximity between samples. The proximity between similar samples is high. For proximity calculations, all samples in the original data set are classified by the forest. The proximity between two samples is calculated as the number of times the two samples end up in the same terminal node of a tree, divided by the number of trees in the forest. Provided sufficient variables are included in the RFM, outliers or mislabelled samples can be defined as samples whose proximity to all other samples from the same class is small. Identification of outliers or mislabelled samples serves as important feedback for the biologist who, if necessary, can correct for experimental mistakes. Similarly, subclasses can in principle be identified by finding samples that have similar proximities to all other samples of the same class. Subclasses in a data set with healthy and diseased subjects can for example be severe and mild subtypes of the disease. Proximity scores also allow the identification of prototypes, representative samples of a group of samples. The variable values of prototypes may explain how those variables relate to the classification of the group. Proximity scores may also be used to construct multidimensional scaling (MDS) plots. MDS plots aim to visualize the dissimilarity (calculated as 1 – proximity) between samples typically in a two-dimensional plot, so that the distances between data points are proportional to the dissimilarities. A good class separation may be obtained by plotting the first two scaling coordinates against each other, provided they capture sufficient information.
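The proximity-based analyses described above (outlier detection, prototypes and MDS plots) are all available from a model trained with proximity = TRUE; a sketch, in which the outlier threshold is a common rule of thumb rather than a fixed part of the method:

    library(randomForest)
    rf <- randomForest(iris[, 1:4], iris$Species, proximity = TRUE)
    out <- outlier(rf)          # within-class outlyingness from the proximities
    which(out > 10)             # candidate outliers/mislabelled samples (rough threshold)
    MDSplot(rf, iris$Species, k = 2)   # MDS plot of the 1 - proximity dissimilarities
    classCenter(iris[, 1:4], iris$Species, rf$proximity)  # per-class prototypes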
RF IMPLEMENTATIONS
The RF algorithm is available in many different open source software packages. Conveniently, the ‘randomForest’ package [] is available as an R implementation [] of the original RF code by Breiman and Cutler []. It is probably the most referred RF implementation because it is easy to use and the user benefits from other R data processing functionality. Recently, a framework for tree growing called Random Jungle (RJ) was developed []. It is currently the fastest implementation of RF, allows parallel computation of trees and is therefore very suited for the analysis of genome-wide data. The Willows package was also designed for tree-based analysis of genome-wide data by maximizing the use of computer memory []. The WEKA workbench [] is a data mining environment that includes several machine learning algorithms including RF. The workbench allows for easy pre-processing of data and comparison between RF and other algorithms.
RF IN THE LIFE SCIENCES
Table 1 lists a non-exhaustive, yet in our opinion representative, number of studies that applied RF in different areas of the Life Sciences. A summary of the use of RF features in these areas is also provided in Table 1. The publications include many highly cited papers and papers that we included because they describe noteworthy use of RF properties. A detailed overview of the use of RF in these publications, as well as metadata on them, can be found in Supplementary Table S1.
Three-quarters of the studies exploited the variable importance output of the RF algorithm (Table 1). For example, information on variable importance has been used to identify risk-associated SNPs in a genome-wide association study [], to determine important genes and pathways for the classification of micro-array gene-expression data [] and to identify factors that can be used to predict protein–protein interactions []. Very few studies report on the use of an iterative variable selection procedure [] to select the most relevant variables and optimize the prediction accuracy of the RFM, although the classification accuracy improved when such a protocol was applied [, , , ] (Supplementary Table S1). In several data mining pipelines, important variables were selected from an RFM, which were subsequently used in other analysis techniques [, ].
Improving prediction accuracy has also been researched. In addition to a better separation of the samples of different classes, the variables of an accurate RFM are likely to be more relevant than those of a less accurate RFM. The number of variables to select for the best split at each node, mtry, was already marked as a tuning parameter by Breiman []. Varying the number of trees in the forest may also improve the OOB error. One-fourth of the papers tuned and optimized the value of mtry and the number of trees. A single study not only regulated the size of the forest but also the size of the trees by varying the minimal node size []. The improvement of the prediction accuracy, however, was negligible. In contrast, Segal reported that a better prediction accuracy may be achieved by regulation of the tree size via limiting the number of splits or the size of nodes for which splitting is allowed []. Boulesteix et al. [] also recommended tuning tree depth and minimal node size in the context of genetic association studies. Alternative voting schemes, such as weighted voting, may improve classification accuracy [] too, but have not been applied in the papers listed in Table 1.
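With the randomForest package, these tuning knobs correspond to tuneRF (searching over mtry by OOB error), the ntree argument, and the nodesize/maxnodes arguments that limit tree size; a sketch, with all concrete values chosen for illustration:

    library(randomForest)
    set.seed(2)
    # Search over mtry, doubling/halving until the OOB error stops improving by 1%
    tuned <- tuneRF(iris[, 1:4], iris$Species, ntreeTry = 500,
                    stepFactor = 2, improve = 0.01)

    # Vary forest size and restrict tree size (cf. Segal's regulation of tree size)
    rf_small_trees <- randomForest(Species ~ ., data = iris, ntree = 2000,
                                   nodesize = 5,   # minimal terminal-node size
                                   maxnodes = 8)   # cap on terminal nodes per tree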
Zhang and Wang pointed out that the interpretation of an RFM may be less practical than the interpretation of a single decision tree classifier due to the many trees in a forest. In a single tree, it is clear in which level of the tree and with what cut-off a variable is used to make a split. In a forest, a variable may or may not be present in a given tree, and if it is present, it may be so at different levels in the tree and have different cut-offs. They proposed to shrink a full forest to a smaller forest having a manageable number of trees and a level of prediction accuracy similar to the original RFM []. The smallest forest is one of the attempts to modify RF, or use RF in combination with other methods, in order to increase the prediction accuracy or model interpretability of RFMs (Table 1). Several other modifications were reviewed by Verikas et al. []. RF has not only been used in combination with other techniques, but several studies also combined multiple RFMs in a pipeline for better classification results (Table 1) [, , ]. RF has also been used in conjunction with dimension reduction techniques [, ]. For example, RF has been applied after PLS (PLS-RF, []). Sampson and colleagues argued that the loadings (relative contribution of variables to the variability in the data) produced by PLS allow for meaningful interpretation of the association between variables and disease. De Lobel et al. [] have used RF as a pre-screening method to remove noisy SNPs before multifactor-dimensionality reduction in genetic association studies. Additionally, RF has been incorporated in a transductive confidence machine [], a framework that allows the prediction of classifiers to be complemented with a confidence value that can be set by the user prior to classification [].
NEGLECTED RF PROPERTIES
RF has several properties that allow extracting relevant trends from data with complex variable relations, such as omics data sets. Nevertheless, these properties have, to our knowledge, not yet been exploited to their full extent and only a few studies have explored their potential. Below we discuss the most important ones.
Proximity values are a measure of similarity between samples. A few studies used proximity values to detect outliers [, , ], training an RFM before and after removal of the outliers. The OOB prediction accuracy may improve after removing the outliers []. However, a comparison between the OOB errors of the first and the second model was not reported in all cases [].
In addition to outlier detection, studies listed in Table 1 used proximity scores in MDS plots [, , ] and for class discovery from RF clustering results []. Analogous to their role in clustering, proximity scores in supervised classification also have the potential to allow discovering subclasses of data samples and even to identify corresponding prototypic variable values. However, we did not come across literature examples of utilization of the RF proximity measure for identification of subclasses or variable prototypes.
LOCAL IMPORTANCE
The global variable importance generated by RF captures the classification impact of variables on all samples. The local variable importance is an estimate of the importance of a variable for the classification of a single sample. Local importance may therefore reveal specific variable importance patterns within groups of samples that may not be evident from global importance values. In other words, variables that are important for a subset of samples from the same class could show a clear local importance signal, while this signal would be lost in the global measure. Nevertheless, only one study in the Life Sciences reported the use of local importances in data analysis (Table 1). In this study, the local importance measure was exploited to predict microRNAs (miRNAs) that are significantly associated with the modification of expression of specific mRNAs []. Local importance instead of global importance was used in a regression RF analysis because the authors assumed that only a subset of miRNAs would significantly contribute to the regression fit. Recently, we developed PhenoLink, a method that links phenotypes to omics data sets []. Local importances were applied for variable selection using two criteria: (i) a removal criterion: variables with a negative or neutral local importance for the majority of samples of a class do not positively contribute to the classification and are removed; and (ii) a selection criterion: variables with a positive local importance for at least a few samples (typically 3) or for a percentage of samples (at least 10%) of a class are retained. Classification of a metabolomics data set consisting of 9303 headspace (gas-phase) GC-MS metabolomics-based measurements (variables) for 45 different bacterial samples resulted in a classification (OOB) error of 71% (results not shown). After removal of 8587 'garbage' variables the classification error was reduced to 18%. This dramatic reduction of the classification error is due to the 'garbage' variables making it more difficult for RF to recognize the informative variables. The positive selection criterion resulted in the same classification error, but with an additional 210 variables removed and a total of 506 variables relevant for separating the bacterial samples based on headspace metabolites. PhenoLink was used effectively to remove redundant or even confusing variables and to detect variables that were important for a subset of samples in a number of studies, ranging from gene-trait matching and metabolomics-transcriptomics matching to the identification of biomarkers based on a variety of data sources []. Altogether, utilization of local importances is promising for many omics data sets and has the potential to uncover variables important for subsets of samples.
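A minimal sketch of the selection criterion applied to a variables-by-samples local importance matrix (such as local_imp from the earlier sketch); the helper name, its defaults and the apply/tapply formulation are ours for illustration, not PhenoLink's implementation:

    # Keep a variable if it has positive local importance for at least
    # min_samples samples, or min_fraction of the samples, of some class
    select_variables <- function(local_imp, classes,
                                 min_samples = 3, min_fraction = 0.10) {
      keep <- apply(local_imp, 1, function(imp) {
        any(tapply(imp > 0, classes, function(pos)
          sum(pos) >= min_samples || mean(pos) >= min_fraction))
      })
      rownames(local_imp)[keep]
    }

    # e.g. informative <- select_variables(local_imp, iris$Species)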
CONDITIONAL RELATIONSHIPS AND VARIABLE INTERACTIONS
For data sets generated in the Life Sciences, e.g. metabolomics and proteomics measurements, gene expression data and GWAS, variables (e.g. SNPs in genetic association studies) are typically important for a subset of samples of the same class (e.g. patients), and conditional relations between variables might be important for a subset of samples. For example, certain SNPs or SNP combinations may be important for one subgroup of patients and not important for another subgroup.
Variable interactions have been reported to increase the global variable importance value []. The importance value itself however only provides the combined importance of the variable and all its interactions with other variables, but does not specify the actual variable interactions. Interactions between two variables can be inferred from a classification tree if a variable systematically makes a split on the other variable more likely or less likely than expected compared to variables without interactions. A recent paper reviewed the ability to identify SNP interactions by variations of logic regression, RF and Bayesian logistic regression []. For RF, an interaction importance measure was defined. However, the actual SNP interactions were not identified by the interaction importance, but rather by a relatively high variable importance measure. As Chen and colleagues discussed, the problem with their interaction importance measure was that two interacting SNPs need to be jointly selected in a tree branch relatively often. Furthermore, in the branches further down the tree the interaction of SNP A and B may have to be prominent in the presence of other variables in order to show a signal in the interaction importance [].
Interactions between variables will often go hand in hand with conditional dependencies between the variables, i.e. variable B contributes to classification given that variable A is present above B in the tree. Conditional relations between variables are implicitly taken into account by the conditional inference forest algorithm (cforest, implemented in the party package [] in R). cforest is a variant of RF that has been designed for unbiased variable selection (discussed below) []. Like RF, cforest generates a variable importance measure. Variable importance measures are currently a subject of debate, and rankings produced using permutation importance may be preferred over Gini importance rankings when variables: (i) are correlated [, ], (ii) vary in their scale of measurement (e.g. continuous and categorical variables) [, ] and (iii) vary in their number of categories [, ]. These variable characteristics are common in Life Science data sets, e.g. for patient parameters (for instance a dichotomous categorical variable such as 'has dog': yes/no; another discrete variable such as 'number of children': 0, 1, 2, 3, 4; and a continuous variable 'IgG blood level': 0-20 g/l) and gene expression (continuous) versus SNP data (categorical). In combination with subsampling instead of bootstrap sampling, the splitting criterion of cforest has been reported to be less biased than the RF criterion []. The algorithm to determine the conditional importance measure generated by cforest explicitly takes into account the conditional relationships. However, as in RF, conditional relationships are still implicit in the importance value output of cforest.
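With the party package, the unbiased forest and its (conditional) permutation importance are a few calls; a sketch in which mtry = 2 and ntree = 500 are illustrative values, and where the conditional variant is markedly slower:

    library(party)
    cf <- cforest(Species ~ ., data = iris,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))
    varimp(cf)                      # permutation importance, unbiased setup
    varimp(cf, conditional = TRUE)  # conditional permutation importance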
Analysis of individual RFM tree structures might be a good strategy to investigate interactions between variables. If variable A precedes variable B significantly more often than expected for variables without interactions, B is likely conditionally dependent on A. Recently, in a GWAS study the genetic variants underlying age-related macular degeneration (AMD) were investigated []. The authors analysed tree structures and proposed an importance measure based on associations between a variable (SNP) and the response variable (trait), conditional on other variables (other SNPs). For a given SNP, the forest was searched for nodes where that SNP was used as a splitting variable. A conditional Chi-square statistic was calculated for each of those nodes using SNPs that preceded the SNP in the same tree. The maximal conditional Chi-square (MCC) importance was defined as the highest Chi-square value of all nodes where the SNP was used as a splitting variable. The MCC value thus quantifies the relationship between a phenotype and a SNP given its preceding SNPs in the RFM.
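No public MCC implementation exists (see below), but the raw material for this kind of tree-structure analysis is accessible, e.g. via getTree in the randomForest package. The sketch below merely tabulates how often one variable splits directly below another within a tree, a rough, assumption-laden proxy for conditional dependence, not the MCC statistic itself:

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    tree1 <- getTree(rf, k = 1, labelVar = TRUE)  # structure of the first tree

    # Tabulate (parent variable, child variable) pairs over the internal nodes
    parent_child <- function(tree) {
      internal <- which(tree$status == 1)         # status -1 marks terminal nodes
      do.call(rbind, lapply(internal, function(i) {
        kids <- c(tree[i, "left daughter"], tree[i, "right daughter"])
        kids <- kids[tree$status[kids] == 1]      # keep internal children only
        if (length(kids) == 0) return(NULL)
        data.frame(parent = tree[i, "split var"], child = tree[kids, "split var"])
      }))
    }
    table(parent_child(tree1))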
The interactions between alleles of patients or healthy people in these SNPs were shown in a tree-like graph. The effects of the conditional relationships between variables for all samples of a given class are directly visible in these graphs. Partial dependence plots [] may reveal the same information as they show how the classification of a data set is altered as a function of a subset of variables (usually one or two) after accounting for the average effects of all other variables in the model. CARTscans [] allow visualization of conditional dependencies on categorical variables. However, multidimensional partial dependence plots or CARTscans have to be manually inspected to derive concrete interactions between variables.
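Partial dependence plots are directly available for a trained RFM via partialPlot in the randomForest package; a one-variable sketch on the stand-in iris data:

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris)
    # Dependence of the vote for one class on a single variable, averaging
    # over the empirical distribution of all other variables in the model
    partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
                which.class = "versicolor")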
The MCC importance can probably also be applied to other high-throughput data with numerous noisy and only a few important variables, as long as the node size is sufficient []. To date, however, no publicly available MCC implementation exists. Importantly, none of the above-described studies allow deriving a minimum set of variables and their interactions required to classify a given data set. Such a minimum set is essential in reducing the complexity of a biomarker and increasing its interpretability. In addition, it could very well be that variable interactions are relevant only for a subset of samples of the same class. Generating this potentially crucial information for a given data set would require supplementing, for instance, the MCC algorithm of Wang and co-workers with a clustering of samples based on, e.g. local variable importance or RF proximity scores, and subsequently selecting the variables and/or variable interactions that explain the classification of a given subset of samples of the same class. A publicly available and validated MCC implementation might therefore be promising for the discovery of variable interactions in proteomics, metabolomics, genomics and transcriptomics data using RF, especially if the implementation would also include the determination of variable interactions for subsets of samples and visualization tools that support interpretation of such complex relationships.
For inspiration, we provide a concept visualization of interacting variables, relevant for subsets of samples, different from the visualizations discussed earlier. The visualization might be a typical result from extensive omics data mining from the trees in an RFM (Figure 2). Linking the samples of the same subclass using evidence-based graphs, much like those from STRING [], could furthermore allow the viewer to see and understand the (other) biological connection(s) between samples that are found to be linked by (interacting) variables identified in this data-driven approach.
CONCLUSION
The RF algorithm has been widely used in the Life Sciences. It is suited for both regression and classification tasks, for example the prediction of the disease state of patients (samples) using expression characteristics of genes (variables). However, RF has predominantly been used in a straightforward way as a classifier without preceding variable selection and parameter tuning, or as a variable filter prior to using other prediction algorithms. RF is an elegant and powerful algorithm allowing the extraction of additional relevant knowledge from omics data, such as conditional relations between variables and interactions between variables for subsets of samples. Exploiting local importances, proximity values and analysis of individual trees could prove to be a compass to unlocking this information from complex omics data.
Key points
RF is widely used in the Life Sciences because RF classification models are versatile, have a high prediction accuracy and provide additional information such as variable importances.
RF is often used as a black box, without parameter optimization, variable selection or exploitation of proximity values and local importances.
RF is a unique and valuable tool to analyse variable interactions and conditional relationships for data sets in which (combinations of) variables are important for subsets of samples, typically for omics data generated in the Life Sciences.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Wouter Touw is a master's student of Molecular Life Sciences at Radboud University Nijmegen, the Netherlands. He specializes in bioinformatics and structural biology.
Jumamurat Bayjanov is a postdoctoral researcher at the Radboud University Medical Centre, the Netherlands. He is involved in analyzing next-generation sequence data and developing machine-learning tools.
Lex Overmars is a PhD student at the Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre. His research focuses on the analysis of prokaryotic regulatory elements.
Lennart Backus is developing phylogenomics techniques for sequence-based prediction of microbial interactions at the Radboud University Medical Centre in a PhD project funded by TI Food and Nutrition.
Jos Boekhorst is a bioinformatician at NIZO food research. He uses computational tools to unravel links between microbes, food, health and disease.
Michiel Wels is group leader bioinformatics at NIZO food research and is involved in applying bioinformatics approaches to different food-related research questions.
Sacha van Hijum is a senior scientist bioinformatics at NIZO food research and group leader of the bacterial genomics group, Centre for Molecular and Biomolecular Informatics at the Radboud University Medical Centre. Bioinformatics research at the bacterial genomics group focuses on establishing the relation between microbial consortia and health.
References
Ideker T,Galitski T,Hood L. A new approach to decoding life: systems biologyAnnu Rev Genomics Hum GenetYear: 701654
Kitano H. Systems biology: a brief overviewScienceYear:
Chuang H-Y,Hofree M,Ideker T. A decade of systems biologyAnnu Rev Cell Dev BiolYear: 0604711
Ghosh S,Matsuoka Y,Asai Y,et al. Software for systems biology: from tools to integrated platformsNat Rev GenetYear: 2048662
Gehlenborg N,O’Donoghue SI,Baliga NS,et al. Visualization of omics data for systems biologyNat MethodsYear: 195258
Larranaga P. Machine learning in bioinformaticsBrief BioinformYear: 761367
Verikas A,Gelzinis A,Bacauskiene M. Mining data with random forests: a survey and results of new testsPattern RecognitYear:
Breiman L. Random ForestsMach LearnYear:
Boser BE,Guyon IM,Vapnik VN. A training algorithm for optimal margin classifiersIn: Proceedings of the fifth annual workshop on Computational learning theory - COLT ’92,
Cortes C,Vapnik V. Support-vector networksMach LearnYear:
McCulloch WS,Pitts W. A logical calculus of the ideas immanent in nervous activityBull Math BiophysYear:
Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brainPsychol RevYear:
Rumelhart DE,Hinton GE,Williams RJ. Learning representations by back-propagating errorsNatureYear:
Friedman N,Geiger D,Goldszmidt M. Bayesian network classifiersMach LearnYear:
Minsky M. Steps toward artificial intelligenceProc IREYear:
Kleinbaum DG,Kupper LL,Chambless LE. Logistic regression analysis of epidemiologic data: theory and practiceCommun Stat TheoryYear:
Fix E,Hodges JL. Discriminatory analysis-nonparametric discrimination: consistency propertiesInt Stat RevYear:
Fisher RA. The use of multiple measurements in taxonomic problemsAnn Hum GenetYear:
Friedman JH. Regularized discriminant analysisJ Am Stat AssocYear:
Wold H. Soft modeling by latent variables: the nonlinear iterative partial least squares approachPerspectives in Probability and Statistics, Papers in Honour of M. S. BartlettYear: 1975
Breiman L,Friedman JH,Olshen RA,et al. Classification and regression treesThe Wadsworth Statistics Probability SeriesYear:
Hastie T,Tibshirani R,Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and PredictionYear: 20092nd ednNew YorkSpringer-Verlag
Tarca AL,Carey VJ,Chen X-wen,et al. Machine learning and its applications to biologyPLoS Comput BiolYear: 04446
Statnikov A,Wang L,Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classificationBMC BioinformaticsYear: 7401
Díaz-Uriarte R,Alvarez de Andrés S. Gene selection and classification of microarray data using random forestBMC BioinformaticsYear: 26
Jiang P,Wu H,Wang W,et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined featuresNucleic Acids ResYear:
Pang H,Lin A,Holford M,et al. Pathway analysis using random forests classification and regressionBioinformaticsYear:
Bao L,Cui Y. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary informationBioinformaticsYear:
Qi Y,Bar-Joseph Z,Klein-Seetharaman J. Evaluation of different biological data and computational classification methods for use in protein interactionBioinformaticsYear: 0
Bellman RE, Rand Corporation. Dynamic ProgrammingYear: 1957PrincetonPrinceton University Press342
Somorjai RL,Dolenko B,Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautionsBioinformaticsYear:
Bureau A,Dupuis J,Falls K,et al. Identifying SNPs predictive of phenotype using random forestsGenet EpidemiolYear: 5593090
Sampson DL,Parker TJ,Upton Z,et al. A comparison of methods for classifying clinical samples based on proteomics data: a case study for statistical and machine learning approachesPLoS OneYear: 1969867
Moore JH,Asselbergs FW,Williams SM. Bioinformatics challenges for genome-wide association studiesBioinformaticsYear: 0053841
Arumugam M,Raes J,Pelletier E,et al. Enterotypes of the human gut microbiomeNatureYear:
Fusaro VA,Mani DR,Mesirov JP,et al. Prediction of high-responding peptides for targeted protein assays by mass spectrometryNat BiotechnolYear: 169245
Nicholson JK,Connelly J,Lindon JC,et al. Metabonomics: a platform for studying drug toxicity and gene functionNat Rev Drug DiscovYear: 120097
Goh C-S,Lan N,Douglas SM,et al. Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysisJ Mol BiolYear:
Chen X,Ishwaran H. Random forests for genomic data analysisGenomicsYear: 546560
Goldstein BA,Polley EC,Briggs FBS. Random forests for genetic association studiesStat Appl Genet Mol BiolYear:
Breiman L,Cutler A. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/.
Stone M. Cross-validatory choice and assessment of statistical predictionsJ Roy Stat Soc B MetYear:
Bayjanov JR,Molenaar D,Tzeneva V,Siezen RJ,van Hijum SAFT. PhenoLink – a web-tool for linking phenotype to ~omics data for bacteria: application to gene-trait matching for Lactobacillus plantarum strainsBMC GenomicsYear: 59291
Liaw A,Wiener M. Classification and regression by randomForestR NewsYear:
R Development Core Team. R.A Language and Environment for Statistical ComputingYear: 2012
Schwarz DF,König IR,Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional dataBioinformaticsYear: 0505004
Zhang H,Wang M,Chen X. Willows: a memory efficient tree and forest construction packageBMC BioinformaticsYear: 16535
Frank E,Hall M,Trigg L,et al. Data mining in bioinformatics using WekaBioinformaticsYear:
Alvarez S,Diaz-Uriarte R,Osorio A,et al. A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylationClin Cancer ResYear:
Briggs FBS,Bartlett SE,Goldstein BA,et al. Evidence for CRHR1 in multiple sclerosis using supervised machine learning and meta-analysis in 12,566 individualsHum Mol GenetYear:
Caporaso JG,Lauber CL,Costello EK,et al. Moving pictures of the human microbiomeGenome BiolYear: 1624126
Chen CCM,Schwender H,Keith J,et al. Methods for identifying snp interactions: a review on variations of logic regression, random forest and bayesian logistic regressionIEEE/ACM Trans Comput Biol BioinfYear:
Christensen BC,Houseman EA,Godleski JJ,et al. Epigenetic profiles distinguish pleural mesothelioma from normal pleura and predict lung asbestos burden and clinical outcomeCancer ResYear: 9118007
De Lobel L,Geurts P,Baele G,et al. A screening methodology based on Random Forests to improve the detection of gene-gene interactionsEur J Hum Genet,Year:
Dutilh BE,Jurgelenaite R,Szklarczyk R,et al. FACIL: fast and accurate genetic code inference and logoBioinformaticsYear:
Lunetta KL,Hayward LB,Segal J,et al. Screening large-scale association study data: exploiting interactions using random forestsBMC GenetYear: 316
Ma D,Xiao J,Li Y,et al. Feature importance analysis in guide strand identification of microRNAsComput Biol ChemYear: 704258
Meijerink M,van Hemert S,Taverne N,et al. Identification of genetic loci in Lactobacillus plantarum that modulate the immune response of dendritic cells using comparative genome hybridizationPloS OneYear: 0498715
Rödelsperger C,Guo G,Kolanczyk M,et al. Integrative analysis of genomic, functional and protein interaction data predicts long-range enhancer-target gene interactionsNucleic Acids ResYear:
Roshan U,Chikkagoudar S,Wei Z,et al. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forestNucleic Acids ResYear: 1317188
Tsou JA,Galler JS,Siegmund KD,et al. Identification of a panel of sensitive and specific DNA methylation markers for lung adenocarcinomaMol CancerYear: 182
van Hemert S,Meijerink M,Molenaar D,et al. Identification of Lactobacillus plantarum genes modulating the cytokine response of human peripheral blood mononuclear cellsBMC MicrobiolYear: 80958
Vingerhoets J,Tambuyzer L,Azijn H,et al. Resistance profile of etravirine: combined analysis of baseline genotypic and phenotypic data from the randomized, controlled Phase III clinical studiesAIDSYear: 0051805
Enot DP,Beckmann M,Draper J. On the interpretation of high throughput MS based metabolomics fingerprints with random forestMetabolomicsYear: .
Gupta S,Aires-de-Sousa J. Comparing the chemical spaces of metabolites and available chemicals: models of metabolite-likenessMol DiversYear: 447158
Pino Del Carpio D,Basnet RK,De Vos RCH,et al. Comparative methods for association studies: a case study on metabolite variation in a Brassica rapa core collectionPloS OneYear: 1602927
Finehout EJ,Franck Z,Choe LH,et al. Cerebrospinal fluid proteomic biomarkers for Alzheimer’s diseaseAnn NeurolYear: 167789
Hettick JM,Kashon ML,Slaven JE,et al. Discrimination of intact mycobacteria at the strain level: a combined MALDI-TOF MS and biostatistical analysisProteomicsYear: 7109381
Munro NP,Cairns DA,Clarke P,et al. Urinary biomarker profiling in transitional cell carcinomaInt J CancerYear:
Gunther EC,Stone DJ,Gerwien RW,et al. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitroProc Natl Acad Sci USAYear:
Guo L,Ma Y,Ward R,et al. Constructing molecular classifiers for the accurate prognosis of lung adenocarcinomaClin Cancer ResYear:
Nannapaneni P,Hertwig F,Depke M,et al. Defining the structure of the general stress regulon of Bacillus subtilis using targeted microarray analysis and Random Forest classificationMicrobiologyYear:
Riddick G,Song H,Ahn S,et al. Predicting in vitro drug sensitivity using Random ForestsBioinformaticsYear: 134890
Tsuji S,Midorikawa Y,Takahashi T,et al. Potential responders to FOLFOX therapy for colorectal cancer by Random Forests analysisBr J CancerYear: 201117
Wang X,Simon R. Microarray-based cancer prediction using single genesBMC BioinformaticsYear: 82331
Wuchty S,Arjona D,Li A,et al. Prediction of associations between microRNAs and gene expression in glioma biologyPloS OneYear: 1358821
Bordner AJ. Predicting protein-protein binding sites in membrane proteinsBMC BioinformaticsYear: 78442
Chen X-wen,Jeong JC. Sequence-based prediction of protein interaction sites with an integrative methodBioinformaticsYear: 9153136
Dybowski JN,Heider D,Hoffmann D. Prediction of co-receptor usage of HIV-1 from genotypePLoS Comput BiolYear:
Han P,Zhang X,Norton RS,et al. Large-scale prediction of long disordered regions in proteins using random forestsBMC BioinformaticsYear: 505
Heider D,Verheyen J,Hoffmann D. Predicting Bevirimat resistance of HIV-1 from genotypeBMC BioinformaticsYear: 9140
Hillenmeyer ME,Ericson E,Davis RW,et al. Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug actionGenome BiolYear: 0226027
Li Y,Fang Y,Fang J. Predicting residue-residue contacts using random forest modelsBioinformaticsYear: 201117
Li Y,Wen Z,Xiao J,et al. Predicting disease-associated substitution of a single amino acid by analyzing residue interactionsBMC BioinformaticsYear: 3604
Lin N,Wu B,Jansen R,et al. Information assessment on predicting protein-protein interactionsBMC BioinformaticsYear: 1499
Marino SR,Lin S,Maiers M,et al. Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantationBone Marrow TransplantYear: 1441965
Medema MH,Zhou M,van Hijum SAFT,et al. A predicted physicochemically distinct sub-proteome associated with the intracellular organelle of the anammox bacterium Kuenenia stuttgartiensisBMC GenomicsYear: 59862
Nayal M,Honig B. On the nature of cavities on protein surfaces: application to the identification of drug-binding sitesProteinsYear:
Nimrod G,Szilágyi A,Leslie C,et al. Identification of DNA-binding proteins using structural, electrostatic and evolutionary featuresJ Mol BiolYear:
Radivojac P,Vacic V,Haynes C,et al. Identification, analysis, and prediction of protein ubiquitination sitesProteinsYear: 9722269
Shi T,Seligson D,Belldegrun AS,et al. Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinomaMod PatholYear: 5529185
Slabbinck B,De Baets B,Dawyndt P,et al. Towards large-scale FAME-based bacterial species identification using machine learning techniquesSyst Appl MicrobiolYear: 9237256
Springer C,Adalsteinsson H,Young MM,et al. PostDOCK: a structural, empirical approach to scoring protein ligand complexesJ Med ChemYear:
Tognazzo S,Emanuela B,Rita FA,et al. Probabilistic classifiers and automated cancer registration: an exploratory applicationJ Biomed InformYear: 20077
Wang H,Lin C,Yang F,et al. Hedged predictions for traditional Chinese chronic gastritis diagnosis with confidence machineComput Biol MedYear: 9386299
Wiseman SM,Melck A,Masoudi H,et al. Molecular phenotyping of thyroid tumors identifies a marker panel for differentiated thyroid cancer diagnosisAnn Surg OncolYear:
Zhang G,Li H,Fang B. Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid compositionProcess BiochemYear:
Kim Y,Wojciechowski R,Sung H,et al. Evaluation of random forests performance for genome-wide association studies in the presence of interaction effectsBMC ProcYear: 8058
Segal M. Machine learning benchmarks and random forest regressionTechnical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco, 2004114
Boulesteix A-L,Bender A,Lorenzo Bermejo J,et al. Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendationsBrief BioinformYear:
Robnik-Sikonja M. Boulicaut JF,et al.Improving Random ForestsMachine Learning: ECML 2004 ProceedingsYear: 2004Vol. 3201BerlinSpringer35970
Zhang H,Wang M. Search for the smallest random forestStat InterfaceYear: 5560
Gammerman A,Vovk V,Vapnik V. Learning by transductionIn: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence,
Strobl C,Boulesteix A-L,Zeileis A,et al. Bias in random forest variable importance measures: illustrations, sources and a solutionBMC BioinformaticsYear: 353
Strobl C,Boulesteix A-L,Kneib T,et al. Conditional variable importance for random forestsBMC BioinformaticsYear: 0558
Hothorn T,Bühlmann P,Dudoit S,et al. Survival ensemblesBiostatisticsYear: 344280
Hothorn T,Hornik K,Zeileis A. Unbiased recursive partitioning: a conditional inference frameworkJ Comput Graph StatYear:
Nicodemus KK,Malley JD. Predictor correlation impacts machine learning algorithms: implications for genomic studiesBioinformaticsYear:
Nicodemus KK,Malley JD,Strobl C,et al. The behaviour of random forest permutation-based variable importance measures under predictor correlationBMC BioinformaticsYear: 87966
Nicodemus KK. Letter to the Editor: On the stability and ranking of predictors from random forest variable importance measuresBrief BioinformYear: 1498552
Wang M,Chen X,Zhang H. Maximal conditional chi-square importance in random forestsBioinformaticsYear: 130032
Friedman JH. Greedy function approximation: a gradient boosting machineAnn StatYear: 2
Nason M,Emerson S,LeBlanc M. CARTscans: a tool for visualizing complex modelsJ Comput Graph StatYear:
Szklarczyk D,Franceschini A,Kuhn M,et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scoredNucleic Acids ResYear:
[Figure ID: bbs034-F1]
Training of an individual tree of an RFM. The tree is built based on a data matrix (shown within the ellipses). This matrix consists of samples (S1-S10; e.g. individuals) belonging to two classes (encircled crosses or plus signs; e.g. healthy and ill) and measurements for each sample for different variables (V1-V5; e.g. SNPs). Dice: random selection. Dashed lines: randomly selected samples and variables. For each tree, a bootstrap set is created by sampling samples from the data set at random and with replacement until it contains as many samples as there are in the data set. The random selection will contain about 63% of the samples in the original data set. In this example, the bootstrap set contains seven unique samples (samples S3-S9; non-selected samples S1, S2 and S10 are faded). For every node (indicated as ellipses) a few variables are randomly selected (the other non-selected variables are faded; by default RF selects the square root of the total number of variables) and evaluated for their ability to split the data. The variable resulting in the largest decrease in impurity is chosen to define the splitting rule. In case of the top node, this is V4 and for the second node on the left-hand side this is V2 (indicated with the black arrows). This process is repeated until the nodes are pure (indicated with round-edged boxes): they contain samples of the same class (encircled cross or plus signs).
[Figure ID: bbs034-F2]
Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest. In this hypothetical case, a supervised classification was performed on samples from two classes (encircled crosses or plus signs; e.g. healthy individuals or patients). Dissection of the random forest trees might result in the further (unsupervised) distinction of subsets of samples. Top panel: variables (V1-Vn; e.g. SNPs in a GWAS study), their values (1 or 0) and interactions. Bottom panel: subsets (separated by the dashed lines) of samples from the pure classes that are predicted by a given interaction between variables. An interpretation example: provided that SNP4 (V4) is present, SNP2 (V2) allows the distinction between two subsets (consisting of healthy individuals 6, 7, 8, 9 and patients 2, 5 and s). If SNP4 is absent, then the patient samples 1, 3, 4 and t can be classified. In case SNP1 (V1) is absent and SNP5 (V5) is present, a subset of healthy individuals consisting of samples a, b, c and d can be classified. Note that in this example, apparently no subset can be distinguished if SNP1 (V1) is present or SNP5 (V5) is absent.
[TableWrap ID: bbs034-T1]
Table 1. Random Forest use in Life Sciences publications ordered by data type or origin (e.g. Metabolomics, Proteomics, Transcriptomics). For each application area, the table lists the publications, the RF features used (mtry varied (a); number of trees varied; tree size varied (b); variable importance; local importance; conditional importance; variable interactions (c); alternative voting scheme; RF algorithm modified; RF in pipeline) and other possible uses of RF. The number of studies in different application areas that use RF properties or that report adaptations to RF is indicated.
(a) Number of variables to select for the best split at each node. (b) By changing the number of splits or node size. (c) Inferred from tree structure.
Article Categories: Papers
Keywords: Random Forest, variable importance, local importance, conditional relationships, variable interaction, proximity.