SNooPer Wiki

Brought to you by: jfspinella

User Manual

SNooPer version 0.02

Synopsis

SNooPer.pl -help [brief help message] -man [full documentation]
Training:
SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w [path_to_weka] [options]
Classify/Evaluate:
SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [classify/evaluate] -m [model] -w [path_to_weka] [options]
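
As an illustration of the synopsis, the following Python sketch assembles a SNooPer command line as an argument list. The helper name build_snooper_command, the paths in the usage example and launching the script through perl are assumptions made for this example. Only flags described in this manual are used.

```python
def build_snooper_command(input_dir, output_dir, analysis1, analysis2,
                          weka_jar, model_dir=None, extra=None):
    """Return a SNooPer.pl argument list (hypothetical helper, not part of SNooPer)."""
    if analysis1 not in ("somatic", "germline"):
        raise ValueError("-a1 must be 'somatic' or 'germline'")
    if analysis2 not in ("train", "classify", "evaluate"):
        raise ValueError("-a2 must be 'train', 'classify' or 'evaluate'")

    cmd = ["perl", "SNooPer.pl",
           "-i", input_dir, "-o", output_dir,
           "-a1", analysis1, "-a2", analysis2,
           "-w", weka_jar]

    # -m is only meaningful for "classify" and "evaluate".
    if analysis2 in ("classify", "evaluate"):
        if model_dir is None:
            raise ValueError("-m is required for classify/evaluate")
        cmd += ["-m", model_dir]
    elif model_dir is not None:
        raise ValueError("-m should not be set when training")

    cmd += list(extra or [])
    return cmd


# Placeholder paths; replace them with real locations before running.
train_cmd = build_snooper_command("/data/train_in", "/data/train_out",
                                  "somatic", "train", "/opt/weka/weka.jar",
                                  extra=["-t", "300"])
# The list could then be passed to subprocess.run(train_cmd, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.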

Description

SNooPer is a highly versatile data mining approach that uses Leo Breiman's Random Forest classification models to accurately call somatic variants in low-pass sequencing data.
SNooPer requires a training phase during which a training dataset (a subset of validated positions) is
used to construct a model that can then be applied to call variants on an extended test dataset.

For the training phase ("train"), the user must provide 2 types of files:

pileup files (.pu): pileup files with characteristics similar to those of the test dataset to which the trained model will be
applied.
>Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
>Germline analysis format: tset_sample_id.pu
vcf files (.vcf): validation files that ideally provide orthogonal validation of the positions contained in the pileup files.
>Somatic analysis format: vset_T_sample_id.vcf
>Germline analysis format: vset_sample_id.vcf

>Each position in the pileup files must be tested a priori so that its class (true variant or sequencing
error) is known by comparison with the vcf files.
If a variant is present in the corresponding validation file, it is considered an actual variant. If
the variant is absent from the validation file, it is considered an error.
>To be recognized as the validation file of a given .pu file, the .vcf file must carry the
same sample_id.
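
The naming rules above can be checked before launching a training run. The sketch below groups file names by sample_id and reports samples that lack one of the expected files. The function name and the role labels (tumor, normal, pileup, validation) are assumptions made for this example and are not part of SNooPer.

```python
import re


def pair_training_files(filenames, analysis="somatic"):
    """Group training files by sample_id; return (complete, incomplete)."""
    if analysis == "somatic":
        patterns = {
            "tumor": r"tset_T_(.+)\.pu",
            "normal": r"tset_N_(.+)\.pu",
            "validation": r"vset_T_(.+)\.vcf",
        }
    else:  # germline
        patterns = {
            "pileup": r"tset_(.+)\.pu",
            "validation": r"vset_(.+)\.vcf",
        }

    samples = {}
    for name in filenames:
        for role, pattern in patterns.items():
            match = re.fullmatch(pattern, name)
            if match:
                samples.setdefault(match.group(1), {})[role] = name
                break

    # A sample is complete when every expected role has a file.
    complete = {sid: files for sid, files in samples.items()
                if len(files) == len(patterns)}
    incomplete = sorted(set(samples) - set(complete))
    return complete, incomplete


files = ["tset_T_s1.pu", "tset_N_s1.pu", "vset_T_s1.vcf",
         "tset_T_s2.pu", "tset_N_s2.pu"]
complete, incomplete = pair_training_files(files, analysis="somatic")
# Here sample "s2" has no vset file, so it is reported as incomplete.
```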

For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
the paths to the model that is to be applied and to the pileup files from the test dataset:
>Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
>Germline analysis format: tset_sample_id.pu

>>Note that input file names must carry either the prefix tset (for training or test dataset, depending on the context) with the .pu extension, or the prefix vset (for validation dataset) with the .vcf extension.

Author

Jean-Francois Spinella, jfspinella@gmail.com.
CHU Sainte-Justine Research Center, Université de Montréal, Montreal, Qc, Canada.

Date

March 2016

Requirements

>Weka must be installed. The current version of SNooPer was tested with weka-3-6-10.
>R must be installed. The current version of SNooPer was tested with R/3.2.1.
>Bedtools is required if the BlackList (-r) or germDB_track (-g) options are used. The current version of SNooPer was tested with bedtools-2.17.0.

>For the development and testing of SNooPer:
The BlackList track corresponded to the RepeatMasker track downloaded from UCSC. "Assembly" has to be set according to the reference used to map your sequences, "Group" was set to Variation and Repeats, and "Track" was set to RepeatMasker. The track was downloaded in a .bed format.

>The germline database used as germDB_track corresponded to the 1000 Genomes database downloaded from http://www.1000genomes.org/. The track was formatted as a .bed file.

Options

-help <brief help message>

-man <full documentation>

-a1 <type_of_analysis1> Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be done based on N samples provided (and additional germline data if provided, see germDB_track -g option).

-a2 <type_of_analysis2> Can take the following values: "train", "classify" or "evaluate".
-> if "train" is selected, a model will be trained based on the comparison of the training dataset (tset) and the validation dataset (vset). A subset of the data provided (subset chosen with the -v and -nv options or automatically selected) for which the class is known (0/1 = non-validated/validated = not shared by tset and vset / shared by tset and vset) will be used for training. Therefore, a partially overlapping dataset between tset and vset must be provided. Final classification of the complete data will be done based on the trained model. Furthermore, evaluation of the model will be performed using a subset excluded beforehand.
-> if "classify" is selected, the provided test dataset (tset) is classified using a model created previously. This model has to be in an .arff format (see Weka documentation for more info).
-> if "evaluate" is selected, the provided dataset (tset) is classified using a model created previously. The purpose of this option is to evaluate a previously created model based on the classification of an independent dataset (never used to train the model). To evaluate the model, the class of each variant in the dataset must be known. Therefore, the data from both tset and vset must be provided. These data should be located in a new directory containing these files only.
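
The 0/1 class used for training and evaluation can be pictured with the short sketch below. It assumes that a variant is identified by a (chromosome, position, alternate allele) tuple. This key is an assumption made for the example, since the manual only states that a variant shared by tset and vset is validated.

```python
def label_candidates(tset_variants, vset_variants):
    """Return {variant: class}: 1 if shared by tset and vset, 0 otherwise."""
    validated = set(vset_variants)
    return {variant: int(variant in validated) for variant in tset_variants}


# Hypothetical candidate variants seen in the pileup (tset)...
tset = [("chr1", 1000, "T"), ("chr1", 2000, "G"), ("chr2", 500, "A")]
# ...and variants confirmed in the validation file (vset).
vset = [("chr1", 1000, "T"), ("chr2", 500, "A"), ("chr3", 42, "C")]

classes = label_candidates(tset, vset)
n_validated = sum(classes.values())
n_nonvalidated = len(classes) - n_validated
```

Variants that appear only in vset, such as the chr3 entry here, are not candidates in tset and therefore receive no label in this sketch.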

-i <input_directory> Complete path to your input directory.

-o <output_directory> Complete path to your output directory (input and output can be located in the same directory).

-m <path_to_model> Complete path to the directory of a previously trained model. This option should be set only if the type of analysis 2 is "classify" or "evaluate".

-w <path_to_weka> Complete path to the weka.jar executable.

-a3 <type_of_analysis3> [optional] Can take the following values: "SNP" or "Indel". The default value is "SNP".

-a4 <attributes_selection> [optional] Can take the following values: "off", "MI" or "BestFirst". The default value is "off". If "MI" is selected (Weka InfoGainAttributeEval + Ranker), the worth of each attribute is evaluated by measuring its information gain with respect to the class, and attributes are ranked by their individual evaluations. Attributes presenting less than 0.001 bits of mutual information are discarded. If "BestFirst" is selected (Weka CfsSubsetEval + BestFirst), the value of a subset of attributes is evaluated by considering the individual predictive ability of each feature along with the degree of redundancy between them, and the space of attribute subsets is searched by greedy hill climbing augmented with a backtracking facility.
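
To make the "MI" criterion concrete, the sketch below computes the information gain (in bits) of discrete attributes with respect to the class and keeps those reaching the 0.001-bit threshold. It is a plain-Python illustration and not Weka's implementation: it handles discrete values only, and the attribute names in the usage example are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(attribute_values, class_labels):
    """Information gain of one discrete attribute with respect to the class."""
    total = len(class_labels)
    groups = {}
    for value, label in zip(attribute_values, class_labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(class_labels) - conditional


def select_attributes(attributes, class_labels, threshold=0.001):
    """Rank attributes by information gain and drop those below threshold."""
    gains = {name: information_gain(values, class_labels)
             for name, values in attributes.items()}
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, gain in ranked if gain >= threshold]
    return kept, gains


labels = [0, 0, 1, 1]
attributes = {
    "informative": ["low", "low", "high", "high"],   # separates the classes
    "uninformative": ["a", "b", "a", "b"],           # independent of the class
}
kept, gains = select_attributes(attributes, labels)
```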

-b <path_to_bedtool> [optional] Complete path to bedtools binary file.

-bqv <bqv> Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.

-c <contamination> [optional] Fraction of normal cells in the tumor sample. Can take a value between 0 and 1. Default value is 0.

-cf <covered_filter_N> [optional] Can take the following values: "on" or "off". If the filter is "on", only positions with a minimum coverage of "coveragefilter_N" (see the -cn option) in the N will be considered in the T for somatic analysis. Default value is on.

-cm <cost_matrix> [optional] Used to adjust the weight of mistakes on a class (see http://weka.wikispaces.com/). The cost matrix has to be defined in a single-line format, using commas to separate values. Ex: 0.0,5.0,1.0,0.0 sets the weight on false positives to 5 and the weight on false negatives to 1.
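
The single-line cost matrix can be read as a 2x2 table. The sketch below parses it and computes the total cost of a confusion matrix. It assumes row-major order with rows as the actual class and columns as the predicted class (class 0 = non-validated, class 1 = validated), which is consistent with the example above but is not stated explicitly in this manual.

```python
def parse_cost_matrix(text):
    """Parse 'c00,c01,c10,c11' into [[c00, c01], [c10, c11]]."""
    values = [float(v) for v in text.split(",")]
    if len(values) != 4:
        raise ValueError("expected exactly 4 comma-separated values")
    return [values[0:2], values[2:4]]


def total_cost(cost, confusion):
    """Sum of cost[i][j] * confusion[i][j] over the 2x2 table."""
    return sum(cost[i][j] * confusion[i][j]
               for i in range(2) for j in range(2))


cost = parse_cost_matrix("0.0,5.0,1.0,0.0")

# Hypothetical confusion matrix: rows = actual class, columns = predicted.
# 10 false positives (actual 0, predicted 1), 5 false negatives.
confusion = [[90, 10],
             [5, 95]]
penalty = total_cost(cost, confusion)  # 5.0 * 10 + 1.0 * 5
```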

-cn <coveragefilter_N> [optional] Defines the minimum coverage required for a position to be considered in the N files during a Somatic analysis or a Germline analysis. If a position in the T file doesn't reach the coverage limit in the N file, the position can't be called Somatic and won't be considered. Default value is 8.

-ct <coveragefilter_T> [optional] Defines the minimum coverage required for a position to be considered in the T file during a Somatic analysis. Default value is 8.

-fi <freqinf> [optional] Defines the inferior limit of allele frequency for a variant position to be considered in the T file during a Somatic analysis. Default value is 0.

-fs <freqsup> [optional] Defines the superior limit of allele frequency for a variant position to be considered in the T file during a Somatic analysis. Default value is 1.
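
The coverage and allele frequency options (-cf, -cn, -ct, -fi, -fs) combine into a simple per-position test, sketched below with the documented defaults. The function name, the inclusive comparisons and the definition of allele frequency as variant reads divided by coverage are assumptions made for this example.

```python
def passes_somatic_filters(cov_t, cov_n, var_reads_t,
                           ct=8, cn=8, fi=0.0, fs=1.0, cf="on"):
    """Return True if a tumor position passes coverage and frequency filters."""
    if cov_t == 0 or cov_t < ct:
        return False                      # not enough coverage in T (-ct)
    if cf == "on" and cov_n < cn:
        return False                      # not enough coverage in N (-cf, -cn)
    freq = var_reads_t / cov_t            # assumed allele frequency definition
    return fi <= freq <= fs               # bounds set by -fi and -fs


# 5 variant reads out of 20 in T, 10 reads in N: kept with default settings.
kept_default = passes_somatic_filters(cov_t=20, cov_n=10, var_reads_t=5)
# Same position with only 5 reads in N: rejected unless -cf is "off".
kept_low_n = passes_somatic_filters(cov_t=20, cov_n=5, var_reads_t=5)
kept_low_n_cf_off = passes_somatic_filters(cov_t=20, cov_n=5, var_reads_t=5,
                                           cf="off")
```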

-g <path_to_germDB_track> [optional] Complete path to any germline variant database track. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process.

-id <job_id> [optional] The output file name will be: SNooPer_output_job_id_date.

-ind <indel_filter> [optional] Can take the following values: "on" or "off" when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.
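
For illustration, the sketch below shows one way to detect pileup lines that contain indels, using the samtools pileup convention in which insertions and deletions appear in the read-bases column as +N or -N followed by N bases. It is not SNooPer's own code. It assumes tab-separated lines with the read bases in the fifth column, and it does not treat the "*" deletion placeholder as an indel.

```python
import re

# "^" marks a read start and is followed by one mapping-quality character,
# which may itself be "+" or "-", so these pairs are removed first.
READ_START = re.compile(r"\^.")
INDEL = re.compile(r"[+-]\d+[ACGTNacgtn]+")


def has_indel(read_bases):
    """Return True if the read-bases string contains an insertion or deletion."""
    cleaned = READ_START.sub("", read_bases)
    return INDEL.search(cleaned) is not None


def drop_indel_lines(pileup_lines):
    """Keep only pileup lines whose read-bases column has no indel."""
    kept = []
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and has_indel(fields[4]):
            continue
        kept.append(line)
    return kept


lines = ["chr1\t100\tA\t3\t.,+1T.\tIII",
         "chr1\t101\tC\t3\t.,T\tIII"]
snp_only = drop_indel_lines(lines)
```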

-k <cross_validation> [optional] Integer defining the number of folds (k) of the k-fold cross-validation used to train the model. This option must be set only if the type of analysis 2 is "train" or "classify". Default value is 10.

-mem <memory> [optional] The user can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. The user can also redirect temporary JVM files using the format: -Djava.io.tmpdir=/path/to/tmpdir

-mqv <mqv> [optional] Minimum mapping quality value (phred) of a read in order for it to be retained as "High Quality" in the variant calling process. Default value is 20.

-nN <nbvar_N> [optional] Defines the number of supporting variant reads required for a position to be considered in the N files during a Germline or Somatic analysis.

-nT <nbvar_T> [optional] Defines the number of supporting variant reads required for a position to be considered in the T files during a Somatic analysis.

-nv <nb_of_non_validated_var_to_train> [optional] Number of non-validated variants (discordant between tset and vset) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

-p1 <tech> [optional] Technology/chemistry used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". The default value is ">Illumina-1.8" (Illumina 1.8 or higher).
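
The technology matters because quality characters are encoded with different ASCII offsets. The sketch below decodes a quality string and counts the bases reaching the bqv threshold, using the standard offsets of 64 for Illumina 1.3 and 1.5 and 33 for Illumina 1.8 and higher. How SNooPer handles this internally is not described in this manual. "Solid" and "Solexa" are left out of the example, and the inclusive threshold is an assumption.

```python
# Standard ASCII offsets for Phred-scaled qualities.
PHRED_OFFSETS = {
    "Illumina-1.3": 64,
    "Illumina-1.5": 64,
    ">Illumina-1.8": 33,
}


def decode_qualities(quality_string, tech=">Illumina-1.8"):
    """Convert a quality string into a list of Phred scores."""
    offset = PHRED_OFFSETS[tech]
    return [ord(char) - offset for char in quality_string]


def count_high_quality(quality_string, bqv=20, tech=">Illumina-1.8"):
    """Count bases whose Phred score is at least bqv (assumed inclusive)."""
    return sum(q >= bqv for q in decode_qualities(quality_string, tech))


scores = decode_qualities("I5!")             # Phred+33 encoding
n_high = count_high_quality("I5!", bqv=20)
old_scores = decode_qualities("h", tech="Illumina-1.3")  # Phred+64 encoding
```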

-q <qual_filter> [optional] Can take the following values: "on", "on+", "off" or "off+". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" is selected, all attributes will be considered, including those that depend on quality. Default value is on.

-r <path_to_blacklist> [optional] Complete path to the BlackList track. This black list usually corresponds to problematic regions in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process.
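
SNooPer relies on Bedtools for this step. The pure-Python sketch below only illustrates the overlap rule between a track in .bed format and a pileup position. It assumes the usual conventions: BED intervals are 0-based and half-open, while pileup positions are 1-based. The function names are invented for the example.

```python
def load_bed_intervals(lines):
    """Read BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    intervals = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def in_blacklist(chrom, pos, intervals):
    """Return True if the 1-based position falls inside any BED interval."""
    return any(start < pos <= end
               for start, end in intervals.get(chrom, []))


# Hypothetical one-line track: covers 1-based positions 100 to 200 on chr1.
blacklist = load_bed_intervals(["chr1\t99\t200\trepeat"])
masked = in_blacklist("chr1", 150, blacklist)
```

A linear scan is enough for a short example. For genome-wide tracks, sorted intervals with a binary search, or Bedtools itself, would be the practical choice.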

-s <somatic_pvalue> [optional] Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele counts. Only variants with a P-value less than or equal to this threshold will be retained. The value must be set between 0 and 1. The default value is 0.1.
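
As a worked illustration of this filter, the sketch below computes a one-tailed Fisher's exact test from a 2x2 table of variant and reference read counts in the tumor and normal samples. The table layout and the direction of the tail (enrichment of variant reads in the tumor) are assumptions made for the example, since the manual does not detail them.

```python
from math import comb


def fisher_one_tailed(tumor_var, tumor_ref, normal_var, normal_ref):
    """P(variant reads in tumor >= observed) under the hypergeometric null."""
    n_tumor = tumor_var + tumor_ref
    n_normal = normal_var + normal_ref
    n_var = tumor_var + normal_var
    denominator = comb(n_tumor + n_normal, n_var)
    upper = min(n_var, n_tumor)
    numerator = sum(comb(n_tumor, x) * comb(n_normal, n_var - x)
                    for x in range(tumor_var, upper + 1))
    return numerator / denominator


# Hypothetical counts: 5 of 10 tumor reads carry the variant, 0 of 10 in normal.
p_value = fisher_one_tailed(tumor_var=5, tumor_ref=5,
                            normal_var=0, normal_ref=10)
keep_as_somatic = p_value <= 0.1   # default -s threshold
```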

-t <tree> [optional] Number of trees to build the model. Default value is 300.

-v <nb_of_validated_var_to_train> [optional] Number of validated variants (concordant between tset and vset) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

-vf <validated_variant_fraction> [optional] Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming.

-vr <validated_nonvalidated_ratio> [optional] Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost sensitive learning can be used to improve the algorithm’s performance.
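
The interplay between -v, -nv, -vf and -vr can be sketched as follows. The explicit counts take precedence, and otherwise the sizes are derived from the fraction and the ratio. The rounding and the capping at the number of available variants are assumptions made for this example, and SNooPer's actual default calculation may differ.

```python
def training_subset_sizes(n_validated, n_nonvalidated,
                          v=None, nv=None, vf=1.0, vr=0.1):
    """Return (validated, non-validated) counts to use for training."""
    # -v prevails over -vf when provided.
    n_val = v if v is not None else round(vf * n_validated)
    n_val = min(n_val, n_validated)

    # -nv prevails over -vr; -vr is non-validated / validated.
    n_non = nv if nv is not None else round(vr * n_val)
    n_non = min(n_non, n_nonvalidated)
    return n_val, n_non


# Hypothetical dataset: 1000 validated and 5000 non-validated variants.
default_sizes = training_subset_sizes(1000, 5000)
half_sizes = training_subset_sizes(1000, 5000, vf=0.5)
explicit_sizes = training_subset_sizes(1000, 5000, v=200, nv=50)
```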