Recent changes to User Manual

User Manual modified by jfspinella

jfspinella — Wed, 13 Apr 2016 10:24:35 -0000

--- v11
+++ v12
@@ -1,6 +1,6 @@
 **USER MANUAL**
 ============
-*SNooPer version 0.01*
+*SNooPer version 0.02*

 -----
 ##### Synopsis #####

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:50:02 -0000

--- v10
+++ v11
@@ -76,11 +76,9 @@

 `-i <input_directory>` Complete path to your input directory.    

-`-o <output_directory>` Complete path to your output directory (input and output can be located in the
-same directory).
+`-o <output_directory>` Complete path to your output directory (input and output can be located in the same directory).

-`-m <path_to_model>` Complete path to the directory of a previously trained model. This option should
-be set only if the type of analysis 2 is "classify" or "evaluate".
+`-m <path_to_model>` Complete path to the directory of a previously trained model. This option should be set only if the type of analysis 2 is "classify" or "evaluate".

 `-w <path_to_weka>` Complete path to the weka.jar executable.

@@ -90,8 +88,7 @@

 `-b <path_to_bedtool> [optional]` Complete path to bedtools binary file. 

-`-bqv <bqv>` Base quality value (phred) of a variation to be considered as "High Quality". Default value
-is 20.
+`-bqv <bqv>` Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.

 `-c <contamination> [optional]` Fraction of normal cells in the tumor sample. Can take a value between 0 and 1. Default value is 0.
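
Note on the Phred thresholds: `-bqv` and `-mqv` are both expressed as Phred-scaled values. Under the standard Phred convention, a score Q corresponds to an error probability P = 10^(-Q/10), so the default of 20 means roughly a 1 in 100 chance of error. The sketch below only illustrates that conversion; the helper name `phred_to_error_prob` is hypothetical and not part of SNooPer.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Print the error probability for a few common thresholds.
    for q in (10, 20, 30):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

Raising `-bqv` or `-mqv` above 20 therefore keeps only bases or reads with a lower expected error rate, at the cost of discarding more data.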

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:47:48 -0000

--- v9
+++ v10
@@ -42,8 +42,8 @@

 -----
 ##### Author #####
-Jean-Francois Spinella - Sainte-Justine UHC Research Center, University of Montreal.
-Contact: jfspinella@gmail.com
+Jean-Francois Spinella, jfspinella@gmail.com.
+CHU Sainte-Justine Research Center, Université de Montréal, Montreal, Qc, Canada.

 -----
 ##### Date #####
@@ -56,88 +56,89 @@
 **>**[Bedtools](https://code.google.com/p/bedtools/downloads/list) if BlackList (-r) or germDB_track (-g) options are applied. The current version of SNooPer was tested with version bedtools-2.17.0.

 **>**For the development and testing of SNooPer:
-The   BlackList   track   corresponded   to   the   RepeatMasker   track   downloaded   from  [UCSC](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). "Assembly" has to be set according to the
-reference used to map your sequences, "Group" was set to Variation and Repeats, and "Track" was
-set to RepeatMasker. The track was downloaded in a .bed format.
+The BlackList track corresponded to the RepeatMasker track downloaded from [UCSC](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). "Assembly" has to be set according to the reference used to map your sequences, "Group" was set to Variation and Repeats, and "Track" was set to RepeatMasker. The track was downloaded in a .bed format.

 **>**The germline database used as germDB_track corresponded to the 1000 Genomes database downloaded from [http://www.1000genomes.org/](http://www.1000genomes.org/). The track was formatted in a .bed format.

 -----
 ##### Options #####
+
 `-help <brief help message>`

 `-man <full documentation>`

--`a1 <type_of_analysis1>` Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be done based on provided N samples and the eventual provided additional normal data.
+`-a1 <type_of_analysis1>` Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be done based on the N samples provided (and on additional germline data if provided, see the germDB_track -g option).

--`a2 <type_of_analysis2>` Can take the following values: "train", "classify" or "evaluate".
-  **->** if "train" is selected, a model will be trained based on the comparison of the data generated from 2 different sequencing platforms. A subset of the data provided (subset chosen with the -v and -nv options or automatically selected) for whom the class is known (0/1 = non-validated/validated = not shared by platform1 and 2 / shared by platform1 and 2) will be used for the training. Therefore, a partially overlapping dataset between platform1 and platform2 has to be provided. A final classification of the complete data will be done base on the trained model. Furthermore, an evaluation of the model will be done using a subset of the data   never   seen   by   the   model.                                                                    
-  **->** if "classify" is selected, the provided dataset is classified using a model created previously. This model has to be in an .arff format (see Weka documentation for more info).
-  **->** if "evaluate" is selected, the provided dataset is classified using a model created previously. The purpose of this option is an evaluation of an already created model based on the classification of an independant dataset (never used to train the model). To evaluate the model, the class of each variant of the dataset has to been known. Therefore, the data from both platform1 and platform2 have to be provided. These data should be located in a fresh directory containing these files only.
+`-a2 <type_of_analysis2>` Can take the following values: "train", "classify" or "evaluate". 
+**->** if "train" is selected, a model will be trained based on the comparison of the training dataset (tset) and the validation dataset (vset). A subset of the data provided (subset chosen with the -v and -nv options or automatically selected) for which the class is known (0/1 = non-validated/validated = not shared by tset and vset / shared by tset and vset) will be used for training. Therefore, a partially overlapping dataset between tset and vset must be provided. Final classification of the complete data will be done based on the trained model. Furthermore, evaluation of the model will be performed using a subset excluded beforehand.
+**->** if "classify" is selected, the provided test dataset (tset) is classified using a model created previously. This model has to be in an .arff format (see Weka documentation for more info).
+**->** if "evaluate" is selected, the provided dataset (tset) is classified using a model created previously. The purpose of this option is to evaluate a previously created model based on the classification of an independent dataset (never used to train the model). To evaluate the model, the class of each variant in the dataset must be known. Therefore, the data from both tset and vset must be provided. These data should be located in a new directory containing these files only.

-`-i   <input_directory>`   Complete   path   to   your   input   directory.            
+`-i <input_directory>` Complete path to your input directory.    

-`-o <output_directory>` Complete path to your output directory (input and output can be in the same directory).
+`-o <output_directory>` Complete path to your output directory (input and output can be located in the
+same directory).

-`-m <model>` Complete path to the directory of an already trained model. This option has to be set only if the type of analysis 2 is "classify" or "evaluate".
+`-m <path_to_model>` Complete path to the directory of a previously trained model. This option should
+be set only if the type of analysis 2 is "classify" or "evaluate".

-`-w <path_to_weka>` Complete path to Weka software, optimally to weka.jar script.
+`-w <path_to_weka>` Complete path to the weka.jar executable.

 `-a3 <type_of_analysis3> [optional]` Can take the following values: "SNP" or "Indel". The default value is "SNP".

-`-a4 <attributes_selection> [optional]` Can take the following value: "off", "MI" or "BestFirst". The   default   value   is   "off".   If   "MI"   is   selected   (Weka   InfoGainAttributeEval   +   Ranker): evaluation the worth of an attribute by measuring the information gain with respect to the class + ranking of attributes by their individual evaluations. Attributes will be discarded if presenting   less  than   0.001   bits   of  mutual   information.   If   "BestFirst"   is   selected   (Weka CfsSubsetEval + BestFirst): evaluation of the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them + evaluation of the space of attribute subsets by greedy hillclimbing augmented with a backtracking facility.
+`-a4 <attributes_selection> [optional]` Can take the following values: "off", "MI" or "BestFirst". The default value is "off". If "MI" is selected (Weka InfoGainAttributeEval + Ranker): evaluate the worth of an attribute by measuring the information gain with respect to the class + rank attributes by their individual evaluations. Attributes will be discarded if they present less than 0.001 bits of mutual information. If "BestFirst" is selected (Weka CfsSubsetEval + BestFirst): evaluate the value of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them + search the space of attribute subsets by greedy hillclimbing augmented with a backtracking facility.

-`-b <path_to_bedtool> [optional]` Complete path to the Bedtools software, optimally to bedtools binary.
+`-b <path_to_bedtool> [optional]` Complete path to bedtools binary file. 

-`-bqv <bqv>` Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.
+`-bqv <bqv>` Base quality value (phred) of a variation to be considered as "High Quality". Default value
+is 20.

-`-c <contamination> [optional]` Fraction of normal cells in the tumoral sample. Can take a value between 0 and 1. Default value is 0.
+`-c <contamination> [optional]` Fraction of normal cells in the tumor sample. Can take a value between 0 and 1. Default value is 0.

-`-cf <covered_filter_N> [optional]` Can take the following values: "on" or "off". If the filter is "on", only positions presenting at least a coverage of "coveragefilter_N" in the N will be considered in the T for somatic analysis. Default value is on.
+`-cf <covered_filter_N> [optional]` Can take the following values: "on" or "off". If the filter is "on", only positions with a minimum coverage of "coveragefilter_N" in the N will be considered in the T for somatic analysis. Default value is on.

-`-cm   <cost_matrix>   [optional]`   Used   to   adjuste   the   weight   of   mistakes   on   a   class   (see
-[http://weka.wikispaces.com/CostMatrix](http://weka.wikispaces.com/CostMatrix). The cost matrix has to be define in a single line format ex:` [0.0 5.0; 1.0 0.0]`, here the weight on false positive is 5 and on false negative is 1.
+`-cm <cost_matrix> [optional]` Used to adjust the weight of mistakes on a class (see [http://weka.wikispaces.com/](http://weka.wikispaces.com/)). The cost matrix has to be defined in a single-line format using commas to separate values, e.g. 0.0,5.0,1.0,0.0; here the weight on false positives is 5 and on false negatives is 1.

 `-cn <coveragefilter_N> [optional]` Defines the minimum coverage for a position to be considered in the N files during a Somatic analysis or the Germline analysis. If a position in the T file doesn't reach the coverage limit in the N file, the position can't be called Somatic and won't be considered. Default value is 8.

-`-ct <coveragefilter_T> [optional]` Defines the minimum of coverage for a position to be considered in the T files during a Somatic analysis. Default value is 8.
+`-ct <coveragefilter_T> [optional]` Defines the minimum coverage required for a position to be considered in the T file during a Somatic analysis. Default value is 8.

-`-fi <freqinf> [optional]` Defines the inferior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 0.
+`-fi <freqinf> [optional]` Defines the inferior limit of allele frequency for a variant position to be considered in the T file during a Somatic analysis. Default value is 0.

-`-fs <freqsup> [optional]` Defines the superior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 1.
+`-fs <freqsup> [optional]` Defines the superior limit of allele frequency for a variant position to be considered in the T file during a Somatic analysis. Default value is 1.

-`-g <path_to_germDB_track> [optional]` Complete path to any germline variants database track. This black list usually corresponds to problematic region in the genome. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+`-g <path_to_germDB_track> [optional]` Complete path to any germline variant database track. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process.

 `-id <job_id> [optional]` The output file name will be: SNooPer_output_job_id_date.

-`-ind   <indel_filter>   [optional]`   Can   take   the   following   values:   "on"   or   "off"   when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.
+`-ind <indel_filter> [optional]` Can take the following values: "on" or "off" when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.

-`-k <cross_validation> [optional]` Integer to define the k-fold cross-validation used to train the model. This option has to be set only if the type of analysis 2 is "train" or "classify". Default value is 10.
+`-k <cross_validation> [optional]` Integer to define the k-fold cross-validation used to train the model. This option must be set only if the type of analysis 2 is "train" or "classify". Default value is 10.

-`-mem <memory> [optional]` You can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. Do not use the -Xms parameter. Using this option, you can also set where the JVM will write temporary files by using the format: -Djava.io.tmpdir=/path/to/tmpdir
+`-mem <memory> [optional]` The user can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. The user can also redirect temporary JVM files using the format: -Djava.io.tmpdir=/path/to/tmpdir

-`-mqv <mqv> [optional]` Mapping quality value (phred) of a read presenting a variation to be considered as "High Quality". Default value is 20.
+`-mqv <mqv> [optional]` Minimum mapping quality value (phred) of a read in order for it to be retained as "High Quality" in the variant calling process. Default value is 20.

-`-nN <nbvar_N> [optional]` Defines the number of variant for a position to be considered in the N files during a Germline analysis or Somatic analysis.
+`-nN <nbvar_N> [optional]` Defines the number of supporting variant reads required for a position to be considered in the N files during a Germline or Somatic analysis.

-`-nT <nbvar_T> [optional]` Defines the number of variant for a position to be considered in the T files during a Somatic analysis.
+`-nT <nbvar_T> [optional]` Defines the number of supporting variant reads required for a position to be considered in the T files during a Somatic analysis.

-`-nv   <nb_of_non_validated_var_to_train>   [optional]`   Number   of   non-validated   variants (disconcordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.
+`-nv <nb_of_non_validated_var_to_train> [optional]` Number of non-validated variants (discordant between tset and vset) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

-`-p1 <platform1> [optional]` Platform used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". Default value is the Illumina-1.8 or higher ">Illumina-1.8".
+`-p1 <tech> [optional]` Technology/chemistry used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". Default value is the Illumina-1.8 or higher ">Illumina-1.8".

-`-q <qual_filter> [optional]` Can take the following values: "on", "on+", "off" or "off". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" are selected, all attributes will be considered including those depending of the quality. Default value is on.
+`-q <qual_filter> [optional]` Can take the following values: "on", "on+", "off" or "off+". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" are selected, all attributes will be considered including those that depend on quality. Default value is on.

-`-r <path_to_blacklist> [optional]` Complete path to the BlackList track. This black list usually corresponds to problematic region in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+`-r <path_to_blacklist> [optional]` Complete path to the BlackList track. This black list usually corresponds to problematic regions in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process.

-`-s <somatic_pvalue> [optional]` Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele count. Only variants presenting a P-value <= to this value will be conserved. The default value is 0.1. The value must be between 0 and 1.
+`-s <somatic_pvalue> [optional]` Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele count. Only variants presenting a P-value less than or equal to this value will be retained. The default value is 0.1. The value must be set between 0 and 1.

 `-t <tree> [optional]` Number of trees to build the model. Default value is 300.

-`-v   <nb_of_validated_var_to_train>   [optional]`   Number   of   validated   variants   (concordant between platform 1 and 2) used to train your model. If no value is provided, a default value will   be   calculated   from   the   input   file.   It   prevails   over   validated_variant_fraction   and validated_nonvalidated_ratio.
+`-v <nb_of_validated_var_to_train> [optional]` Number of validated variants (concordant between tset and vset) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

-`-vf <validated_variant_fraction> [optional]` Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming. 
+`-vf <validated_variant_fraction> [optional]` Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming.

-`-vr  <validated_nonvalidated_ratio>  [optional]`  Ratio  (nb  of non-validated variants /  nb  of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset   is   extremely   imbalanced,   a   cost   sensitive   learning   can   be   useful   to   improve performances
+`-vr <validated_nonvalidated_ratio> [optional]` Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost-sensitive learning can be used to improve the algorithm’s performance.

 -----
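
Note on the `-cm` format introduced in this revision: the comma-separated value 0.0,5.0,1.0,0.0 can be read as a 2x2 matrix. The sketch below assumes the four values are given in row-major order, which matches the manual's example (second value = weight on false positives, third value = weight on false negatives). The function names are hypothetical and this is not SNooPer's own parsing code.

```python
from typing import Dict, List


def parse_cost_matrix(flat: str) -> List[List[float]]:
    """Parse a 'a,b,c,d' string into a 2x2 matrix, assuming row-major order."""
    values = [float(v) for v in flat.split(",")]
    if len(values) != 4:
        raise ValueError("expected exactly 4 comma-separated values")
    return [values[0:2], values[2:4]]


def describe_costs(matrix: List[List[float]]) -> Dict[str, float]:
    """Name the off-diagonal weights following the manual's example."""
    return {
        "false_positive": matrix[0][1],
        "false_negative": matrix[1][0],
    }


if __name__ == "__main__":
    m = parse_cost_matrix("0.0,5.0,1.0,0.0")
    print(m)
    print(describe_costs(m))
```

With the manual's example, a false positive is penalised five times more heavily than a false negative, which can help when the training dataset is strongly imbalanced.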

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:31:00 -0000

--- v8
+++ v9
@@ -16,6 +16,7 @@
 SNooPer   is   a   highly   versatile   data   mining   approach   that   uses   Leo   Breiman's   Random   Forest classification models to accurately call somatic variants in low-pass sequencing data.
 SNooPer requires a training phase during which a training dataset (a subset of validated positions) is
 used to construct a model that can be then applied to call variants on an extended test dataset.
+
 For the training phase ("train"), the user must provide 2 types of files:
 1. **pileup files** (.pu) with similar characteristics as the test dataset on which the trained model will be
 applied. 
@@ -31,17 +32,18 @@
 the variant is absent from the validation file, the variant will be considered as an error.
 **>**To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
 same sample_id.
-`*`For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
+
+For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
 the paths to the model that is to be applied and to the pileup files from the test dataset:
-    **-**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
-    **-**Germline analysis format: tset_sample_id.pu
+**>**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
+**>**Germline analysis format: tset_sample_id.pu

-**>**Note   that   input   files   must   contain   the   prefix   tset   (for   training   or   test   dataset,   depending   on   the context) and the .pu extension or vset (for validation dataset) and the .vcf extension. 
+**>>**Note   that   input   files   must   contain   the   prefix   tset   (for   training   or   test   dataset,   depending   on   the context) and the .pu extension or vset (for validation dataset) and the .vcf extension. 

 -----
 ##### Author #####
 Jean-Francois Spinella - Sainte-Justine UHC Research Center, University of Montreal.
-jfspinella@gmail.com
+Contact: jfspinella@gmail.com

 -----
 ##### Date #####
@@ -49,17 +51,16 @@

 -----
 ##### Requirements #####
+**>**[Weka](http://sourceforge.net/projects/weka/) has to be installed. The current version of SNooPer has been tested with the version weka-3-6-10.
+**>**[R](https://www.r-project.org/); the current version of SNooPer was tested with version R/3.2.1.
+**>**[Bedtools](https://code.google.com/p/bedtools/downloads/list) if BlackList (-r) or germDB_track (-g) options are applied. The current version of SNooPer was tested with version bedtools-2.17.0.

+**>**For the development and testing of SNooPer:
+The   BlackList   track   corresponded   to   the   RepeatMasker   track   downloaded   from  [UCSC](http://genome.ucsc.edu/cgi-bin/hgTables?command=start). "Assembly" has to be set according to the
+reference used to map your sequences, "Group" was set to Variation and Repeats, and "Track" was
+set to RepeatMasker. The track was downloaded in a .bed format.

-**>**[Weka](http://sourceforge.net/projects/weka/) has to be installed. The current version of SNooPer has been tested with the version
-weka-3-6-10.
-**>**[Bedtools](https://code.google.com/p/bedtools/downloads/list) has to be installed if BlackList or germDB_track options are used. The current
-version   of   SNooPer   has   been   tested   with   the   version   bedtools-2.17.0.
-**>**During weka development, the BlackList track corresponded to the RepeatMasker track downloaded   from   [UCSC](http://genome.ucsc.edu/cgi-bin/hgTables?command=start).
-"Assembly" has to be set according to the reference you used to map your sequences,
-"Group" was set to Variation and Repeats, and "Track" was set to RepeatMasker. The track was downloaded in a .bed format.
-**>**During weka development, the germline database used as germDB_track corresponded to
-the [1000 Genomes](http://www.1000genomes.org/) database. The track was formated in a .bed format.
+**>**The germline database used as germDB_track corresponded to the 1000 Genomes database downloaded from [http://www.1000genomes.org/](http://www.1000genomes.org/). The track was formatted in a .bed format.

 -----
 ##### Options #####

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:23:31 -0000

--- v7
+++ v8
@@ -29,12 +29,12 @@
 error) is known by comparison with the vcf files.
 If a variant is present in the corresponding validation file, it will be considered as an actual variant. If
 the variant is absent from the validation file, the variant will be considered as an error.
-`*****`To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
+**>**To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
 same sample_id.
-**>**For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
+`*`For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
 the paths to the model that is to be applied and to the pileup files from the test dataset:
-    **>**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
-    **>**Germline analysis format: tset_sample_id.pu
+    **-**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
+    **-**Germline analysis format: tset_sample_id.pu

 **>**Note   that   input   files   must   contain   the   prefix   tset   (for   training   or   test   dataset,   depending   on   the context) and the .pu extension or vset (for validation dataset) and the .vcf extension.
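
Note on the naming convention: since a .vcf validation file is matched to a .pu file through its sample_id, it can be useful to check an input directory listing before training. The sketch below is a hypothetical pre-flight check, not part of SNooPer; it assumes the file name patterns given in the manual (tset_T_sample_id.pu / vset_T_sample_id.vcf for somatic analysis, tset_sample_id.pu / vset_sample_id.vcf for germline analysis) and treats everything between the prefix and the extension as the sample_id.

```python
import re
from typing import Iterable, Set


def missing_validation(filenames: Iterable[str], analysis: str = "somatic") -> Set[str]:
    """Return sample_ids that have a tset pileup but no vset .vcf with the same sample_id."""
    if analysis == "somatic":
        pu_pattern = re.compile(r"^tset_T_(.+)\.pu$")
        vcf_pattern = re.compile(r"^vset_T_(.+)\.vcf$")
    elif analysis == "germline":
        pu_pattern = re.compile(r"^tset_(.+)\.pu$")
        vcf_pattern = re.compile(r"^vset_(.+)\.vcf$")
    else:
        raise ValueError("analysis must be 'somatic' or 'germline'")

    pileup_ids: Set[str] = set()
    vcf_ids: Set[str] = set()
    for name in filenames:
        pu_match = pu_pattern.match(name)
        vcf_match = vcf_pattern.match(name)
        if pu_match:
            pileup_ids.add(pu_match.group(1))
        elif vcf_match:
            vcf_ids.add(vcf_match.group(1))
    # Pileups without a matching validation file cannot be labelled for training.
    return pileup_ids - vcf_ids


if __name__ == "__main__":
    listing = ["tset_T_s1.pu", "tset_N_s1.pu", "vset_T_s1.vcf", "tset_T_s2.pu", "tset_N_s2.pu"]
    print(missing_validation(listing, "somatic"))
```

The analysis type is passed explicitly because the file name alone is ambiguous: a germline sample_id could itself begin with "T_".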

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:22:35 -0000

--- v6
+++ v7
@@ -13,8 +13,7 @@
 -----

 ##### Description #####
-SNooPer   is   a   highly   versatile   data   mining   approach   that   uses   Leo   Breiman's   Random   Forest
-classification models to accurately call somatic variants in low-pass sequencing data.
+SNooPer   is   a   highly   versatile   data   mining   approach   that   uses   Leo   Breiman's   Random   Forest classification models to accurately call somatic variants in low-pass sequencing data.
 SNooPer requires a training phase during which a training dataset (a subset of validated positions) is
 used to construct a model that can be then applied to call variants on an extended test dataset.
 For the training phase ("train"), the user must provide 2 types of files:
@@ -30,7 +29,7 @@
 error) is known by comparison with the vcf files.
 If a variant is present in the corresponding validation file, it will be considered as an actual variant. If
 the variant is absent from the validation file, the variant will be considered as an error.
-**>**To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
+`*****`To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
 same sample_id.
 **>**For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
 the paths to the model that is to be applied and to the pileup files from the test dataset:

User Manual modified by jfspinella

jfspinella — Thu, 17 Mar 2016 20:20:58 -0000

--- v5
+++ v6
@@ -3,7 +3,7 @@
 *SNooPer version 0.01*

 -----
-##### General usage #####
+##### Synopsis #####
 `SNooPer.pl -help [brief help message] -man [full documentation]`
 **Training:**
 `SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w [pathtoweka] [options]`
@@ -13,39 +13,45 @@
 -----

 ##### Description #####
-SNooPer is a highly versatile data mining approach that uses Leo Breiman's Random Forest
-classification   models   to   accurately   call   somatic   variants   in   low-depth   sequencing   data.
-SNooPer needs firstly to be trained using a training set to construct a model that can be then
-used to call variants on an extended test set.
-The   training   set   has   to   be   constituted   of   2   types   of   files:                         
-1. **pileup files** (.pu) presenting equivalent characteristics of the test set on which the trained
-model  will  be  applied.
-**>**Somatic analysis format: Platform1_T_sample_id.pu and Platform1_N_sample_id.pu        
-**>**Germline analysis format: Platform1_sample_id.pu
-2. **vcf files** (.vcf) validation  files that are  ideally orthogonal validations of the positions
-contained   in   the   pileup   files.                                                                    
-**>**Somatic  analysis  format:  Platform2_T_sample_id.vcf                                                     
-**>**Germline analysis format: Platform2_sample_id.vcf
+SNooPer   is   a   highly   versatile   data   mining   approach   that   uses   Leo   Breiman's   Random   Forest
+classification models to accurately call somatic variants in low-pass sequencing data.
+SNooPer requires a training phase during which a training dataset (a subset of validated positions) is
+used to construct a model that can then be applied to call variants on an extended test dataset.
+For the training phase ("train"), the user must provide 2 types of files:
+1. **pileup files** (.pu) with similar characteristics as the test dataset on which the trained model will be
+applied. 
+**>**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
+**>**Germline analysis format: tset_sample_id.pu
+2. **vcf files** (.vcf) validation files that are ideally orthogonal validations of the positions contained in the pileup files.
+**>**Somatic analysis format: vset_T_sample_id.vcf
+**>**Germline analysis format: vset_sample_id.vcf

-**>**Each position contained in the pileup files have to be tested so the class (actual variant or
-error)   will   be   known   by   comparison   with   the   vcf   files.                                        
-If a variant is present in the corresponding validation file, it will be considered as an actual
-variant. If the variant is absent from the validation file, the variant will be considered as an
-error.
-**>**To be considered as the corresponding validation file of a .pu file, the .vcf file has to present
-the same sample_id. 
-
+**>**Each position in the pileup files must be tested a priori so that the class (true variant or sequencing
+error) is known by comparison with the vcf files.
+If a variant is present in the corresponding validation file, it will be considered as an actual variant. If
+the variant is absent from the validation file, the variant will be considered as an error.
+**>**To be considered as the corresponding validation file of a .pu file, the .vcf file has to present the
+same sample_id.
+**>**For the classification phase ("classify") or to evaluate a model ("evaluate"), the user simply provides
+the paths to the model that is to be applied and to the pileup files from the test dataset:
+    **>**Somatic analysis format: tset_T_sample_id.pu and tset_N_sample_id.pu
+    **>**Germline analysis format: tset_sample_id.pu
+    
+**>**Note that input files must contain the prefix tset (for training or test dataset, depending on the context) and the .pu extension, or the prefix vset (for validation dataset) and the .vcf extension.

 -----
 ##### Author #####
 Jean-Francois Spinella - Sainte-Justine UHC Research Center, University of Montreal.
+jfspinella@gmail.com

 -----
 ##### Date #####
-January 2016
+March 2016

 -----
 ##### Requirements #####
+
+
 **>**[Weka](http://sourceforge.net/projects/weka/) has to be installed. The current version of SNooPer has been tested with the version
 weka-3-6-10.
 **>**[Bedtools](https://code.google.com/p/bedtools/downloads/list) has to be installed if BlackList or germDB_track options are used. The current
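
The file naming convention described in this revision (tset_*.pu pileups matched to vset_*.vcf validation files through a shared sample_id) can be checked before launching a run. The following is a minimal sketch in Python, not part of SNooPer itself: the function name `pair_inputs` and the regular expressions are assumptions inferred from the formats listed in the manual, and SNooPer's own matching rules may differ.

```python
import re
from collections import defaultdict

def pair_inputs(filenames, analysis="somatic"):
    """Group pileup (.pu) and validation (.vcf) file names by sample_id.

    The patterns are inferred from the manual's naming convention; they are
    an assumption, not SNooPer's actual implementation.
    """
    if analysis == "somatic":
        pu_re = re.compile(r"^tset_[TN]_(?P<sample>.+)\.pu$")
        vcf_re = re.compile(r"^vset_T_(?P<sample>.+)\.vcf$")
    else:  # germline
        pu_re = re.compile(r"^tset_(?P<sample>.+)\.pu$")
        vcf_re = re.compile(r"^vset_(?P<sample>.+)\.vcf$")

    samples = defaultdict(lambda: {"pileups": [], "vcf": None})
    unmatched = []
    for name in filenames:
        pu = pu_re.match(name)
        if pu:
            samples[pu.group("sample")]["pileups"].append(name)
            continue
        vcf = vcf_re.match(name)
        if vcf:
            samples[vcf.group("sample")]["vcf"] = name
            continue
        # Anything else does not follow the tset/vset convention.
        unmatched.append(name)
    return dict(samples), unmatched

files = ["tset_T_s1.pu", "tset_N_s1.pu", "vset_T_s1.vcf", "notes.txt"]
samples, unmatched = pair_inputs(files, analysis="somatic")
print(samples)
print(unmatched)
```

A sample whose `vcf` entry is still `None` after pairing has no validation file with a matching sample_id, which matters for the "train" analysis.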

User Manual modified by jfspinella

jfspinella — Mon, 07 Mar 2016 19:13:49 -0000

--- v4
+++ v5
@@ -4,16 +4,11 @@

 -----
 ##### General usage #####
-**>**SNooPer.pl -help [brief help message] -man [full documentation]
+`SNooPer.pl -help [brief help message] -man [full documentation]`
 **Training:**
-~~~~
-SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w [path_to_weka] [options]
-~~~~
+`SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w [path_to_weka] [options]`
 **Classify/Evaluate:**
-~~~~
-SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [classify/evaluate] -m [model] -w [path_to_weka] [options]
-~~~~
-
+`SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [classify/evaluate] -m [model] -w [path_to_weka] [options]`

 -----
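
The two usage lines above differ only in the `-a2` value and in whether `-m` is present. As a hedged illustration, the Python sketch below assembles the argument list for either mode using only the flags shown in the synopsis (`-i`, `-o`, `-a1`, `-a2`, `-m`, `-w`). The helper name `build_snooper_command` is hypothetical, and the sketch only builds the command without running it.

```python
def build_snooper_command(input_dir, output_dir, analysis1, analysis2,
                          weka_path, model_path=None):
    """Return the SNooPer.pl argument list for a training or classify/evaluate run.

    Hypothetical helper: it mirrors the synopsis in the manual and performs
    only the consistency checks that the manual states explicitly.
    """
    if analysis1 not in ("somatic", "germline"):
        raise ValueError("-a1 must be 'somatic' or 'germline'")
    if analysis2 not in ("train", "classify", "evaluate"):
        raise ValueError("-a2 must be 'train', 'classify' or 'evaluate'")
    # The manual says -m is set only for "classify" or "evaluate".
    if analysis2 == "train" and model_path is not None:
        raise ValueError("-m should not be set when -a2 is 'train'")
    if analysis2 != "train" and model_path is None:
        raise ValueError("-m is required when -a2 is 'classify' or 'evaluate'")

    cmd = ["SNooPer.pl", "-i", input_dir, "-o", output_dir,
           "-a1", analysis1, "-a2", analysis2]
    if model_path is not None:
        cmd += ["-m", model_path]
    cmd += ["-w", weka_path]
    return cmd

# Example paths are placeholders.
print(" ".join(build_snooper_command(
    "/data/in", "/data/out", "somatic", "train", "/opt/weka/weka.jar")))
```

The resulting list can be passed to `subprocess.run` once the placeholder paths are replaced with real ones.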

@@ -140,4 +135,3 @@
 `-vr <validated_nonvalidated_ratio> [optional]` Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost-sensitive learning can be useful to improve performance.

 -----
-

User Manual modified by jfspinella

jfspinella — Mon, 07 Mar 2016 19:11:34 -0000

--- v3
+++ v4
@@ -6,11 +6,13 @@
 ##### General usage #####
 **>**SNooPer.pl -help [brief help message] -man [full documentation]
 **Training:**
-**>**SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w
-[path_to_weka] [options]
+~~~~
+SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [train] -w [path_to_weka] [options]
+~~~~
 **Classify/Evaluate:**
-**>**SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2
-[classify/evaluate] -m [model] -w [path_to_weka] [options]
+~~~~
+SNooPer.pl -i [input_directory] -o [output_directory] -a1 [type_of_analysis1] -a2 [classify/evaluate] -m [model] -w [path_to_weka] [options]
+~~~~

 -----
@@ -61,81 +63,81 @@

 -----
 ##### Options #####
--help <brief help message>
+`-help <brief help message>`

--man <full documentation>
+`-man <full documentation>`

--a1 <type_of_analysis1> Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be based on the provided N samples and any additional normal data provided.
+`-a1 <type_of_analysis1>` Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be based on the provided N samples and any additional normal data provided.

--a2 <type_of_analysis2> Can take the following values: "train", "classify" or "evaluate".
+`-a2 <type_of_analysis2>` Can take the following values: "train", "classify" or "evaluate".
   **->** if "train" is selected, a model will be trained based on the comparison of the data generated from 2 different sequencing platforms. A subset of the data provided (subset chosen with the -v and -nv options or automatically selected) for which the class is known (0/1 = non-validated/validated = not shared by platform1 and 2 / shared by platform1 and 2) will be used for the training. Therefore, a partially overlapping dataset between platform1 and platform2 has to be provided. A final classification of the complete data will be done based on the trained model. Furthermore, an evaluation of the model will be done using a subset of the data never seen by the model.
   **->** if "classify" is selected, the provided dataset is classified using a model created previously. This model has to be in an .arff format (see Weka documentation for more info).
   **->** if "evaluate" is selected, the provided dataset is classified using a model created previously. The purpose of this option is an evaluation of an already created model based on the classification of an independent dataset (never used to train the model). To evaluate the model, the class of each variant of the dataset has to be known. Therefore, the data from both platform1 and platform2 have to be provided. These data should be located in a fresh directory containing these files only.

--i <input_directory> Complete path to your input directory.
+`-i <input_directory>` Complete path to your input directory.

--o <output_directory> Complete path to your output directory (input and output can be in the same directory).
+`-o <output_directory>` Complete path to your output directory (input and output can be in the same directory).

--m <model> Complete path to the directory of an already trained model. This option has to be set only if the type of analysis 2 is "classify" or "evaluate".
+`-m <model>` Complete path to the directory of an already trained model. This option has to be set only if the type of analysis 2 is "classify" or "evaluate".

--w <path_to_weka> Complete path to Weka software, optimally to weka.jar script.
+`-w <path_to_weka>` Complete path to Weka software, optimally to weka.jar script.

--a3 <type_of_analysis3> [optional] Can take the following values: "SNP" or "Indel". The default value is "SNP".
+`-a3 <type_of_analysis3> [optional]` Can take the following values: "SNP" or "Indel". The default value is "SNP".

--a4 <attributes_selection> [optional] Can take the following values: "off", "MI" or "BestFirst". The default value is "off". If "MI" is selected (Weka InfoGainAttributeEval + Ranker): evaluation of the worth of an attribute by measuring the information gain with respect to the class + ranking of attributes by their individual evaluations. Attributes will be discarded if they present less than 0.001 bits of mutual information. If "BestFirst" is selected (Weka CfsSubsetEval + BestFirst): evaluation of the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them + evaluation of the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.
+`-a4 <attributes_selection> [optional]` Can take the following values: "off", "MI" or "BestFirst". The default value is "off". If "MI" is selected (Weka InfoGainAttributeEval + Ranker): evaluation of the worth of an attribute by measuring the information gain with respect to the class + ranking of attributes by their individual evaluations. Attributes will be discarded if they present less than 0.001 bits of mutual information. If "BestFirst" is selected (Weka CfsSubsetEval + BestFirst): evaluation of the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them + evaluation of the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.

--b <path_to_bedtool> [optional] Complete path to the Bedtools software, optimally to bedtools binary.
+`-b <path_to_bedtool> [optional]` Complete path to the Bedtools software, optimally to bedtools binary.

--bqv <bqv> Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.
+`-bqv <bqv>` Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.

--c <contamination> [optional] Fraction of normal cells in the tumoral sample. Can take a value between 0 and 1. Default value is 0.
+`-c <contamination> [optional]` Fraction of normal cells in the tumoral sample. Can take a value between 0 and 1. Default value is 0.

--cf <covered_filter_N> [optional] Can take the following values: "on" or "off". If the filter is "on", only positions presenting at least a coverage of "coveragefilter_N" in the N will be considered in the T for somatic analysis. Default value is on.
+`-cf <covered_filter_N> [optional]` Can take the following values: "on" or "off". If the filter is "on", only positions presenting at least a coverage of "coveragefilter_N" in the N will be considered in the T for somatic analysis. Default value is on.

--cm <cost_matrix> [optional] used to adjust the weight of mistakes on a class (see
-[http://weka.wikispaces.com/CostMatrix](http://weka.wikispaces.com/CostMatrix)). The cost matrix has to be defined in a single-line format, e.g. [0.0 5.0; 1.0 0.0]; here the weight on false positives is 5 and on false negatives is 1.
+`-cm <cost_matrix> [optional]` Used to adjust the weight of mistakes on a class (see
+[http://weka.wikispaces.com/CostMatrix](http://weka.wikispaces.com/CostMatrix)). The cost matrix has to be defined in a single-line format, e.g. `[0.0 5.0; 1.0 0.0]`; here the weight on false positives is 5 and on false negatives is 1.

--cn <coveragefilter_N> [optional] Defines the minimum coverage for a position to be considered in the N files during a Somatic analysis or the Germline analysis. If a position in the T file doesn't reach the coverage limit in the N file, the position can't be called Somatic and won't be considered. Default value is 8.
+`-cn <coveragefilter_N> [optional]` Defines the minimum coverage for a position to be considered in the N files during a Somatic analysis or the Germline analysis. If a position in the T file doesn't reach the coverage limit in the N file, the position can't be called Somatic and won't be considered. Default value is 8.

--ct <coveragefilter_T> [optional] Defines the minimum of coverage for a position to be considered in the T files during a Somatic analysis. Default value is 8.
+`-ct <coveragefilter_T> [optional]` Defines the minimum of coverage for a position to be considered in the T files during a Somatic analysis. Default value is 8.

--fi <freqinf> [optional] Defines the inferior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 0.
+`-fi <freqinf> [optional]` Defines the inferior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 0.

--fs <freqsup> [optional] Defines the superior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 1.
+`-fs <freqsup> [optional]` Defines the superior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 1.

--g <path_to_germDB_track> [optional] Complete path to any germline variants database track. This black list usually corresponds to problematic region in the genome. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+`-g <path_to_germDB_track> [optional]` Complete path to any germline variants database track. This black list usually corresponds to problematic region in the genome. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.

--id <job_id> [optional] The output file name will be: SNooPer_output_job_id_date.
+`-id <job_id> [optional]` The output file name will be: SNooPer_output_job_id_date.

--ind <indel_filter> [optional] Can take the following values: "on" or "off" when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.
+`-ind <indel_filter> [optional]` Can take the following values: "on" or "off" when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.

--k <cross_validation> [optional] Integer to define the k-fold cross-validation used to train the model. This option has to be set only if the type of analysis 2 is "train" or "classify". Default value is 10.
+`-k <cross_validation> [optional]` Integer to define the k-fold cross-validation used to train the model. This option has to be set only if the type of analysis 2 is "train" or "classify". Default value is 10.

--mem <memory> [optional] You can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. Do not use the -Xms parameter. Using this option, you can also set where the JVM will write temporary files by using the format: -Djava.io.tmpdir=/path/to/tmpdir
+`-mem <memory> [optional]` You can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. Do not use the -Xms parameter. Using this option, you can also set where the JVM will write temporary files by using the format: -Djava.io.tmpdir=/path/to/tmpdir

--mqv <mqv> [optional] Mapping quality value (phred) of a read presenting a variation to be considered as "High Quality". Default value is 20.
+`-mqv <mqv> [optional]` Mapping quality value (phred) of a read presenting a variation to be considered as "High Quality". Default value is 20.

--nN <nbvar_N> [optional] Defines the number of variant for a position to be considered in the N files during a Germline analysis or Somatic analysis.
+`-nN <nbvar_N> [optional]` Defines the number of variant for a position to be considered in the N files during a Germline analysis or Somatic analysis.

--nT <nbvar_T> [optional] Defines the number of variant for a position to be considered in the T files during a Somatic analysis.
+`-nT <nbvar_T> [optional]` Defines the number of variant for a position to be considered in the T files during a Somatic analysis.

--nv <nb_of_non_validated_var_to_train> [optional] Number of non-validated variants (discordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.
+`-nv <nb_of_non_validated_var_to_train> [optional]` Number of non-validated variants (discordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

--p1 <platform1> [optional] Platform used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". Default value is the Illumina-1.8 or higher ">Illumina-1.8".
+`-p1 <platform1> [optional]` Platform used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". Default value is the Illumina-1.8 or higher ">Illumina-1.8".

--q <qual_filter> [optional] Can take the following values: "on", "on+", "off" or "off+". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" is selected, all attributes will be considered, including those depending on the quality. Default value is on.
+`-q <qual_filter> [optional]` Can take the following values: "on", "on+", "off" or "off+". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" is selected, all attributes will be considered, including those depending on the quality. Default value is on.

--r <path_to_blacklist> [optional] Complete path to the BlackList track. This black list usually corresponds to problematic region in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+`-r <path_to_blacklist> [optional]` Complete path to the BlackList track. This black list usually corresponds to problematic region in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.

--s <somatic_pvalue> [optional] Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele count. Only variants presenting a P-value <= to this value will be conserved. The default value is 0.1. The value must be between 0 and 1.
+`-s <somatic_pvalue> [optional]` Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele count. Only variants presenting a P-value <= to this value will be conserved. The default value is 0.1. The value must be between 0 and 1.

--t <tree> [optional] Number of trees to build the model. Default value is 300.
+`-t <tree> [optional]` Number of trees to build the model. Default value is 300.

--v <nb_of_validated_var_to_train> [optional] Number of validated variants (concordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.
+`-v <nb_of_validated_var_to_train> [optional]` Number of validated variants (concordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.

--vf <validated_variant_fraction> [optional] Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming. 
+`-vf <validated_variant_fraction> [optional]` Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming. 

--vr <validated_nonvalidated_ratio> [optional] Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost-sensitive learning can be useful to improve performance.
+`-vr <validated_nonvalidated_ratio> [optional]` Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost-sensitive learning can be useful to improve performance.

 -----
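
The `-cm` option above takes the cost matrix as a single-line string such as `[0.0 5.0; 1.0 0.0]`. To make that format concrete, here is a small Python sketch that parses such a string into rows of numbers. The function name `parse_cost_matrix` is hypothetical, and the reading of the cells (row 0, column 1 as the false-positive weight; row 1, column 0 as the false-negative weight) simply follows the manual's own example rather than a check against Weka.

```python
def parse_cost_matrix(text):
    """Parse a single-line cost matrix like "[0.0 5.0; 1.0 0.0]" into a list of rows."""
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError("cost matrix must be enclosed in square brackets")
    # Rows are separated by ';', values within a row by whitespace.
    rows = [[float(value) for value in row.split()]
            for row in stripped[1:-1].split(";")]
    size = len(rows)
    if any(len(row) != size for row in rows):
        raise ValueError("cost matrix must be square")
    return rows

matrix = parse_cost_matrix("[0.0 5.0; 1.0 0.0]")
false_positive_weight = matrix[0][1]  # 5.0 in the manual's example
false_negative_weight = matrix[1][0]  # 1.0 in the manual's example
print(matrix, false_positive_weight, false_negative_weight)
```

Validating the string this way before a long training run can catch a malformed matrix early.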

User Manual modified by jfspinella

jfspinella — Mon, 07 Mar 2016 19:03:32 -0000

--- v2
+++ v3
@@ -59,8 +59,83 @@
 **>**During weka development, the germline database used as germDB_track corresponded to
 the [1000 Genomes](http://www.1000genomes.org/) database. The track was formatted in .bed format.

+-----
+##### Options #####
+-help <brief help message>

+-man <full documentation>

+-a1 <type_of_analysis1> Can take the following values: "somatic" or "germline". "somatic" means that the somatic evaluation will be based on the provided N samples and any additional normal data provided.

+-a2 <type_of_analysis2> Can take the following values: "train", "classify" or "evaluate".
+  **->** if "train" is selected, a model will be trained based on the comparison of the data generated from 2 different sequencing platforms. A subset of the data provided (subset chosen with the -v and -nv options or automatically selected) for which the class is known (0/1 = non-validated/validated = not shared by platform1 and 2 / shared by platform1 and 2) will be used for the training. Therefore, a partially overlapping dataset between platform1 and platform2 has to be provided. A final classification of the complete data will be done based on the trained model. Furthermore, an evaluation of the model will be done using a subset of the data never seen by the model.
+  **->** if "classify" is selected, the provided dataset is classified using a model created previously. This model has to be in an .arff format (see Weka documentation for more info).
+  **->** if "evaluate" is selected, the provided dataset is classified using a model created previously. The purpose of this option is an evaluation of an already created model based on the classification of an independent dataset (never used to train the model). To evaluate the model, the class of each variant of the dataset has to be known. Therefore, the data from both platform1 and platform2 have to be provided. These data should be located in a fresh directory containing these files only.

+-i <input_directory> Complete path to your input directory.

+-o <output_directory> Complete path to your output directory (input and output can be in the same directory).
+
+-m <model> Complete path to the directory of an already trained model. This option has to be set only if the type of analysis 2 is "classify" or "evaluate".
+
+-w <path_to_weka> Complete path to Weka software, optimally to weka.jar script.
+
+-a3 <type_of_analysis3> [optional] Can take the following values: "SNP" or "Indel". The default value is "SNP".
+
+-a4 <attributes_selection> [optional] Can take the following values: "off", "MI" or "BestFirst". The default value is "off". If "MI" is selected (Weka InfoGainAttributeEval + Ranker): evaluation of the worth of an attribute by measuring the information gain with respect to the class + ranking of attributes by their individual evaluations. Attributes will be discarded if they present less than 0.001 bits of mutual information. If "BestFirst" is selected (Weka CfsSubsetEval + BestFirst): evaluation of the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them + evaluation of the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.
+
+-b <path_to_bedtool> [optional] Complete path to the Bedtools software, optimally to bedtools binary.
+
+-bqv <bqv> Base quality value (phred) of a variation to be considered as "High Quality". Default value is 20.
+
+-c <contamination> [optional] Fraction of normal cells in the tumoral sample. Can take a value between 0 and 1. Default value is 0.
+
+-cf <covered_filter_N> [optional] Can take the following values: "on" or "off". If the filter is "on", only positions presenting at least a coverage of "coveragefilter_N" in the N will be considered in the T for somatic analysis. Default value is on.
+
+-cm <cost_matrix> [optional] used to adjust the weight of mistakes on a class (see
+[http://weka.wikispaces.com/CostMatrix](http://weka.wikispaces.com/CostMatrix)). The cost matrix has to be defined in a single-line format, e.g. [0.0 5.0; 1.0 0.0]; here the weight on false positives is 5 and on false negatives is 1.
+
+-cn <coveragefilter_N> [optional] Defines the minimum coverage for a position to be considered in the N files during a Somatic analysis or the Germline analysis. If a position in the T file doesn't reach the coverage limit in the N file, the position can't be called Somatic and won't be considered. Default value is 8.
+
+-ct <coveragefilter_T> [optional] Defines the minimum of coverage for a position to be considered in the T files during a Somatic analysis. Default value is 8.
+
+-fi <freqinf> [optional] Defines the inferior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 0.
+
+-fs <freqsup> [optional] Defines the superior limit of allele frequency for a variant position to be considered in the T files during a Somatic analysis. Default value is 1.
+
+-g <path_to_germDB_track> [optional] Complete path to any germline variants database track. This black list usually corresponds to problematic region in the genome. If such a file is provided and if the type_of_analysis1 is "somatic", the variations located at these positions will be considered as germline during the somatic variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+
+-id <job_id> [optional] The output file name will be: SNooPer_output_job_id_date.
+
+-ind <indel_filter> [optional] Can take the following values: "on" or "off" when type_of_analysis3 is "SNP". If the filter is "on", pileup lines containing indels won't be considered during the SNP calling process. Default value is on.
+
+-k <cross_validation> [optional] Integer to define the k-fold cross-validation used to train the model. This option has to be set only if the type of analysis 2 is "train" or "classify". Default value is 10.
+
+-mem <memory> [optional] You can extend the memory available for the virtual machine by setting appropriate options. Ex: -Xmx2g to set it to 2GB. Do not use the -Xms parameter. Using this option, you can also set where the JVM will write temporary files by using the format: -Djava.io.tmpdir=/path/to/tmpdir
+
+-mqv <mqv> [optional] Mapping quality value (phred) of a read presenting a variation to be considered as "High Quality". Default value is 20.
+
+-nN <nbvar_N> [optional] Defines the number of variant for a position to be considered in the N files during a Germline analysis or Somatic analysis.
+
+-nT <nbvar_T> [optional] Defines the number of variant for a position to be considered in the T files during a Somatic analysis.
+
+-nv <nb_of_non_validated_var_to_train> [optional] Number of non-validated variants (discordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.
+
+-p1 <platform1> [optional] Platform used to produce the data to be classified. Can take the following values: "Solid", "Solexa", "Illumina-1.3", "Illumina-1.5" or ">Illumina-1.8". Default value is the Illumina-1.8 or higher ">Illumina-1.8".
+
+-q <qual_filter> [optional] Can take the following values: "on", "on+", "off" or "off+". If the filter is "on" or "on+", only variants matching the selected bqv and mqv values will be considered. If "on+" or "off+" is selected, all attributes will be considered, including those depending on the quality. Default value is on.
+
+-r <path_to_blacklist> [optional] Complete path to the BlackList track. This black list usually corresponds to problematic region in the genome. If such a file is provided, the variations located in these regions won't be considered during the variant calling process. Be careful to provide the track corresponding to the same reference you used to map your sequences.
+
+-s <somatic_pvalue> [optional] Somatic P-value filter based on a one-tailed Fisher's exact test comparing the somatic and germline allele count. Only variants presenting a P-value <= to this value will be conserved. The default value is 0.1. The value must be between 0 and 1.
+
+-t <tree> [optional] Number of trees to build the model. Default value is 300.
+
+-v <nb_of_validated_var_to_train> [optional] Number of validated variants (concordant between platform 1 and 2) used to train your model. If no value is provided, a default value will be calculated from the input file. It prevails over validated_variant_fraction and validated_nonvalidated_ratio.
+
+-vf <validated_variant_fraction> [optional] Fraction of the validated variants to be used for training. The default value is 1. Note that if the number of validated positions is large, the analysis can be time-consuming. 
+
+-vr <validated_nonvalidated_ratio> [optional] Ratio (nb of non-validated variants / nb of validated variants) in the training dataset. The default value is 0.1. Note that, if the training dataset is extremely imbalanced, cost-sensitive learning can be useful to improve performance.
+
+-----
+
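
The `-s` option above filters on a one-tailed Fisher's exact test comparing somatic and germline allele counts. The Python sketch below shows how such a P-value can be computed from a 2x2 table of variant and reference read counts in the tumor and normal samples. The table layout, the choice of tail (testing for an excess of variant reads in the tumor) and the function name `fisher_one_tailed` are assumptions made for illustration; the manual does not specify how SNooPer builds the table.

```python
from math import comb

def fisher_one_tailed(tumor_alt, tumor_ref, normal_alt, normal_ref):
    """One-tailed Fisher's exact test P-value for an excess of variant reads in the tumor.

    Assumed 2x2 table (not confirmed by the manual):
        rows    = tumor, normal
        columns = variant reads, reference reads
    """
    tumor_total = tumor_alt + tumor_ref
    normal_total = normal_alt + normal_ref
    alt_total = tumor_alt + normal_alt
    # Number of ways to choose which reads carry the variant, margins fixed.
    denominator = comb(tumor_total + normal_total, alt_total)
    upper = min(tumor_total, alt_total)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    numerator = sum(
        comb(tumor_total, k) * comb(normal_total, alt_total - k)
        for k in range(tumor_alt, upper + 1)
    )
    return numerator / denominator

p_value = fisher_one_tailed(tumor_alt=5, tumor_ref=5, normal_alt=0, normal_ref=10)
print(p_value, p_value <= 0.1)  # 0.1 is the default threshold stated for -s
```

With identical tumor and normal counts and no variant reads, the function returns 1.0, so such a position would not pass the filter.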