Home
Name Modified Size InfoDownloads / Week
Tutorial 2021-09-29
VPR_scripts.zip 2023-05-24 13.4 kB
README-VPR.md 2023-04-17 11.0 kB
Totals: 3 Items   24.4 kB 0



header

VPR - Variables Prioritisation using RandomForest

VPR1.0 is a simple R script collection that will allow you to perform variables prioritisation/selection basing on a RandomForest machine learning approach. The method was developed originally for selectiong genetic variables related to the microbiome structure, but the approach can be traslated to a moltitude of different datasets (environmental variables, metabolomics, microbiome, virome..). The two scripts, when unzipped, perform the same prioritisation pipeline:
1 - VPR_classification.R performs a prioritisation based on classification (the prediction object is based on the probability of the individual of being classified in one of the classes to predict -minimum 2-);
2 - VPR_regression.R performs a prioritisation based on regression (the prediction object is based on the predicted value, like for a conventional regression).

This prioritisation approach is based, ad described, on RandomForest (RF). The RF algortihm is built-up in order to reliably predict outcomes on datasets containing intrinsically the information to do so. Here RF is used for a different aim, since the user doesn't know if the data used actually contains the information necessary to perform a successful RF approach. For this reason, poor performances of the model indicates that the data you provided as input probably has globally nothing or little to do with the predicted outcome (or maybe, that this approach is not the right one). Finally, expect some variability on different prioritisation runs if the overall performance tends to be low.

REQUIREMENTS and INPUT

Requirements
VPR1.0 is an Rscript and requires R installed on your working machine. The packages needed to run the pipeline are limited to:

- ranger
- randomForestExplainer
- pROC
- reshape2

The running time of the whole script is aleatory and strictly dependant on the number of the variables included in your input file (e.g running a prioritisation on 250 SNPs for 480 individuals takes approx. 10 minutes, performing 200 + 200 iterations).

Input parameters
Once the VPR script intended to be used is downloaded in your working directory, it should be invocked using the following command:

Rscript VPR_classification.R --args ntrees=.. training=.. nrep=.. nsel=.. minroc=..

All the arguments after the flag --args need to be passed to the script in order to make it work
ntrees : the number of trees to build (suggested range 150-300)
training : percentage of observation to be split into the training-validation dataset (It may vary basing on your sample size, 70-90(%) range is suggested)
nrep : the number of iterative repeats to perform in order to produce the list of relevant variables for the subsequent variables selection process (suggested range up to 1000)
nsel : the number of model to build in order to evaluate the relevance of the variables selected in the previous iterative repeats
minroc : minimum ROC value to be accepted for the initial variable prioritisation - Use this option when submitting a classification (suggested 0.60 - THIS VALUE WILL NOT REFLECT THE ROC VALUE OF YOUR FINAL MODEL)
mincor : minimum Correlation (Pearson's pairwise correlation) value to be accepted for the initial variable prioritisation - Use this option when submitting a regression (suggested 0.1 - THIS VALUE WILL NOT REFLECT THE CORRELATION VALUE OF YOUR FINAL MODEL)

This is an example of a test run of the classification script:
Rscript VPR_classification.R --args ntrees=150 training=80 nrep=200 minroc=0.6 nsel=200

This is an example of a test run of the regression script:
Rscript VPR_regression.R --args ntrees=150 training=80 nrep=200 mincor=0.1 nsel=200

Input file
The script requires a single input file in order to work. This input file is not specified in the input args, and it is automatically inputed when the script is launched. The input file MUST be named input_data.tsv and be formatted as .tsv file. The column containing the values to predict (prediction or regression) MUST be named 'response'. 'response' content MUST be a character factor for running VPR_classification.R script (as reported below). Numeric factors for classification purpose will lead the script to encounter an error. Predictors have to be specified by the columns and subjects by the rows. An example of the file structure is reported below:

/ response #2 #3 #4 #5 #6 #7 #8 #9 #10 #11
ID1 Group1 283 290 6 289 85 7 287 22 2 26
--- --- --- --- --- --- --- --- --- --- --- ---
IDn Group3 28 90 286 289 28 287 287 272 6 29

OUTPUT
The script output consist of 4 files:
- final_rf_model.rda is the final model produced by the pipeline
- Training, Validation and External_testes.rda are the splitted individuals used for the production of the model (useful to run subsequent correlations or analyses, if relevant)
- VPR_output.html is the main output of the process, and it wraps inside useful insights about your model performance. Most of the graphs reported in the final output are generated using randomForestExplainer. Check their web page in order to have a broader explanation of what's reported in the final output.

USING VPR ON REAL DATA - A brief tutorial
Here is reported a simple application of VPR on real world data, in detail the Mercedes used cars dataset of UK. The .tsv table used in this tutorial is contained in the tutorial folder.
The data table contains 3118 entries (cars) and 8 predictors, including our response variable: engine type. What are we asking at VPR, is to select the variables that have a stronger relation with the car's engine type (if any), by performing a classification exercise. The engine types are indeed categories (petrol, diesel, hybrid..), and are predicted using the VPR_classification.R script.
Once the dataset is downloaded in the same directory with the VPR script, we can start invoking the script through command line:
Rscript VPR_classification.R --args ntrees=150 training=80 nrep=1000 minroc=0.6 nsel=200
As you can see, the amount of iterations both for variables selection and prioritisation are "limited" to 1000 , since the high number of observations (cars) will significantly decrease the speed of the script (due to the permutational-required time for the computation of the single variables importance in every iteration)
The script then proceeds to split the dataset accordingly to the ratio of Training-Validation/Test we specified as input.

image

When the iterative modelling process is finished, on the command line it will be printed the actual performance of our model produced after variable prioritisation, in this case in the form of mROC AUC (since we ran a RF classification).

The mROC AUC of your model is 0.907678929607086

As we can see, the performance of the model obtained is pretty high, indicating that the process was successful in the variable selection process.
In more detail, from the file forest_explained.html, we can see that out of the initial 8 predictors, 5 have been finally selected as relevant for the outcome. Among these, we see that miles per gallon (mpg) and the EnginseSize have stronger impact on the prediction.

image

SIDENOTES
- The regression script usually take a bit more time if compared to the prediction one. It is perfectly normal, there is anything wrong in the data submitted
- The numbers progressively appearing on the screen represents the iteration step for the model. The most time-consuming step is the first one, in which the model is repeated nrep times. In detail, the process that slows down the iteration is the computation of the significance of the single variables in the prediction (measure_importance function in the randomForestExplainer package)
- Sometimes the script can give an error and terminate the execution. This is usually due to the random shuffling of the data that may cause a problem in the pipeline. Another possible cause of this inconvenience could be a too low amount of predictors. In any case, if an unexpected interruption is encountered, I strongly suggest to run again the command with the same parameters: it usually doesn't appear for two times in a row. If the error persists, I suggest to check again your data and maybe consider that VPR is not suitable to prioritize your variables.

Acknowledgements
This pipeline was originally conceived and structured by Matteo Soverini. Crucial inputs for its polishing and finalisation were obtained from Morten Arendt Rasmussen, Jonathan Thorsen and ,in general, from all the COPSAC working group.

LICENSE ON THE PRESENT RELEASE

Copyright © 2021, Matteo Soverini
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Source: README-VPR.md, updated 2023-04-17