SolSNP
---
About SolSNP
--
SolSNP is a SNP variant caller that uses Kolmogorov-Smirnov tests to compare the evidence at each locus against the reference distributions of the possible genotypes.
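As a rough illustration of the idea only (this README does not describe SolSNP's actual implementation), the sketch below computes a Kolmogorov-Smirnov-style distance between the observed reference/non-reference base fractions at a locus and idealized diploid genotype distributions, then picks the closest genotype. The function names, the two-category simplification, and the expected fractions (1.0, 0.5, 0.0) are assumptions made for this example.

```python
# Illustrative sketch only: names and expected fractions are assumptions,
# not taken from the SolSNP source code.

# Expected fraction of reference bases under each idealized diploid genotype.
EXPECTED_REF_FRACTION = {
    "HomozygoteReference": 1.0,
    "Heterozygote": 0.5,
    "HomozygoteNonReference": 0.0,
}


def ks_distance(observed_ref_fraction, expected_ref_fraction):
    """Maximum gap between two cumulative distributions over (ref, non-ref).

    With only two categories, both CDFs reach 1.0 at the second category,
    so the maximum gap is the difference at the first category.
    """
    return abs(observed_ref_fraction - expected_ref_fraction)


def closest_genotype(ref_count, alt_count):
    """Return (genotype, distance) for the smallest KS-style distance."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no bases mapped at this locus")
    observed = ref_count / total
    candidates = (
        (genotype, ks_distance(observed, expected))
        for genotype, expected in EXPECTED_REF_FRACTION.items()
    )
    return min(candidates, key=lambda pair: pair[1])


# 11 reference bases and 9 non-reference bases sit closest to a heterozygote.
print(closest_genotype(ref_count=11, alt_count=9))
```

A real caller would also account for base qualities, strand, and sequencing error, none of which are modelled here.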
Releases
--
1.0 - Initial release.
1.01 - Fixes and feature improvements.
1.02 - VCF 4.0 support, more accurate variant confidence value, bug fixes, GENERATE_GENOTYPES option.
1.1 - Increased performance, reduced memory footprint, Picard library update, many bug fixes and new options (MAXIMUM_COVERAGE, multiple regions, OUTPUT_MODE).
Usage
--
SolSNP is a command-line Java application, packaged as a JAR file. To run it, open a command prompt, change to the directory that contains the JAR file, and type:
java -jar SolSNP.jar INPUT=Input.SAM OUTPUT=snp.gff [ARGUMENT]=[VALUE] [...]
Where each [ARGUMENT] is one of the options described in the Arguments section below.
Example
--
java -jar SolSNP.jar INPUT=test/test.sam REFERENCE_SEQUENCE=test/test_ref.fasta OUTPUT=test.gff VALIDATION_STRINGENCY=SILENT PLOIDY=Haploid
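When SolSNP is driven from a script, the ARGUMENT=VALUE pairs can be assembled programmatically. The following is a minimal sketch, assuming Python 3 and that `SolSNP.jar` is in the current directory; the helper name `build_solsnp_command` is hypothetical, and the argument values are those of the example above.

```python
def build_solsnp_command(arguments, jar_path="SolSNP.jar"):
    """Build the argument vector for a SolSNP run from a dict of options."""
    command = ["java", "-jar", jar_path]
    # Each option is passed as a single ARGUMENT=VALUE token.
    command.extend(f"{name}={value}" for name, value in arguments.items())
    return command


command = build_solsnp_command({
    "INPUT": "test/test.sam",
    "REFERENCE_SEQUENCE": "test/test_ref.fasta",
    "OUTPUT": "test.gff",
    "VALIDATION_STRINGENCY": "SILENT",
    "PLOIDY": "Haploid",
})
print(" ".join(command))

# To actually launch SolSNP (requires Java and the JAR file):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.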
Arguments
--
INPUT (Required) - The input SAM/BAM file. The file contains the complete alignment record for a reference sequence. Note: The order of sequences in this file must match the order of the reference sequences provided by the REFERENCE_SEQUENCE argument below.
OUTPUT (Required) - The main output file name, containing the final variant list. The OUTPUT_FORMAT argument can be used to set which format will be used for this output.
REFERENCE_SEQUENCE (Required) - The FASTA-formatted file of the reference sequence used for alignment of the input SAM/BAM file.
KNOWN_CALLS (Optional) - Used to provide a file containing a secondary set of calls/genotypes, against which the SolSNP output will be compared for concordance. This must be a 4-column, tab-delimited file of the format:
[UNIQUE_ID REFERENCE_SEQUENCE_NAME LOCUS GENOTYPE]
If a 'known calls' file is provided, a supplemental 'false negatives' output file will be created, with the name of the main output file, followed by ".falsenegatives". It will contain all calls in the known calls file that were not also generated by SolSNP or that were filtered out.
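A known-calls file in this layout can be produced with a few lines of Python. This is a sketch under assumptions: the README does not specify the genotype encoding or whether a header line is allowed, so the IDs, loci, and genotype strings below are made-up placeholders, and the helper names are hypothetical.

```python
import csv
import io

# Placeholder records: (UNIQUE_ID, REFERENCE_SEQUENCE_NAME, LOCUS, GENOTYPE).
# The genotype encoding shown here is an assumption, not documented by SolSNP.
known_calls = [
    ("call_1", "MT", 152, "AG"),
    ("call_2", "MT", 410, "CC"),
]


def write_known_calls(records, handle):
    """Write records as 4-column, tab-delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in records:
        if len(record) != 4:
            raise ValueError(f"expected 4 columns, got {len(record)}: {record!r}")
        writer.writerow(record)


def false_negatives_path(output_path):
    """Name of the supplemental false-negatives file for a given OUTPUT."""
    return output_path + ".falsenegatives"


buffer = io.StringIO()
write_known_calls(known_calls, buffer)
print(buffer.getvalue(), end="")
print(false_negatives_path("snp.gff"))
```

Writing to a real file only requires replacing the `io.StringIO` buffer with `open(path, "w", newline="")`.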
STRAND_MODE (Optional, defaults to Consensus) - Defines how SolSNP uses strand information for each locus. The options are as follows:
* VariantConsensus - Separate calls are made on the nucleotides mapped to the forward and reverse strands. A variant is called only if both calls agree on the existence of a variant.
* GenotypeConsensus - Separate calls are made on the nucleotides mapped to the forward and reverse strands. A variant is called only if both calls agree on the existence of a variant and on the genotype.
* None - Strand information is ignored.
* OneStrandAndTotal - A call is made on each strand separately, and a further call is made ignoring strand information. The final call requires that the strand-independent call and at least one of the strand calls agree on the existence of a variant.
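The four strand modes can be summarized as a small decision function. This is a sketch of the rules as worded above, not SolSNP's code: representing each call as a genotype string (or `None` when no variant was called) is an assumption, and the function name is hypothetical.

```python
def variant_is_reported(mode, forward, reverse, total):
    """Decide whether a variant is reported under a given STRAND_MODE.

    forward, reverse: calls made on each strand separately.
    total: the call made with strand information ignored.
    Each call is a genotype string, or None if no variant was called.
    """
    if mode == "VariantConsensus":
        # Both strands must see a variant; genotypes may differ.
        return forward is not None and reverse is not None
    if mode == "GenotypeConsensus":
        # Both strands must see a variant and agree on the genotype.
        return forward is not None and forward == reverse
    if mode == "None":
        # Only the strand-independent call matters.
        return total is not None
    if mode == "OneStrandAndTotal":
        # The strand-independent call plus at least one strand call.
        return total is not None and (forward is not None or reverse is not None)
    raise ValueError(f"unknown STRAND_MODE: {mode}")


# A variant seen on the forward strand only:
for mode in ("VariantConsensus", "GenotypeConsensus", "None", "OneStrandAndTotal"):
    print(mode, variant_is_reported(mode, forward="AG", reverse=None, total="AG"))
```

The sketch only answers whether a variant is reported; how the final genotype is chosen when the calls differ is not specified here.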
SUMMARY (Optional, defaults to false) - Specifies whether the summary metrics directory is to be created upon execution. Please refer to the output section below for more information.
FILTER (Optional, defaults to 0.0) - The minimum confidence score allowed for calls. Ranges from 0 to 1.0. Calls made with a confidence score lower than this value will be discarded.
CALL_BIAS (Optional, defaults to 0.0) - This value can be used to force a bias towards a variant prediction, as opposed to a prediction of a match with the reference sequence. This value can be used to experimentally and simplistically counter-weight potential biases towards the reference genome that the sequencing or alignment process may have introduced.
MINIMUM_BASE_QUALITY (Optional, defaults to 0) - Aligned bases with a quality lower than this value will not be processed.
OUTPUT_FORMAT (Optional, defaults to GFF) - The output format for the call text file. The options available in this version are 'GFF' (as described at http://www.sanger.ac.uk/Software/formats/GFF/ ) and 'VCF' (as described at http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 ).
OUTPUT_MODE (Optional, defaults to 'Variants') - Specifies which records will be returned in the output file, depending on the call made. Possible values are 'Variants', 'VariantsAndReference', 'AllCallable' and 'KnownCalls'.
PLOIDY (Optional, defaults to 'Diploid') - Configures the ploidy of the analyzed sample. Possible values are 'Diploid' and 'Haploid'.
MINIMUM_COVERAGE (Optional, defaults to 3) - The minimum number of bases that must be mapped to a locus for it to be evaluated.
MAXIMUM_COVERAGE (Optional, defaults to 0/None) - The maximum number of bases that may be mapped to a locus for it to be evaluated.
MINIMUM_MAPQ (Optional, defaults to 1) - The lowest mapping quality a read can have and still be included in the pileup.
REGION (Optional, defaults to full file) - Specifies which region is to be analyzed. The argument is a string in the form "sequence_name,starting_locus,ending_locus" (without the quotes). Note that this option requires the input to be in BAM format, and a BAM index file (.bai) to be present in the same directory.
REGION_FILE (Optional, defaults to 'none') - Specifies multiple regions to be analyzed (overrides the REGION argument if that one is also given). The file must contain a series of lines, each one representing a region in the format "sequence_name:start-end" (without the quotes).
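Note that the two options use different separators: commas for REGION and colon/hyphen for REGION_FILE. A small sketch of formatting both from the same tuples follows; the helper names and the example coordinates are hypothetical.

```python
def format_region_argument(sequence_name, start, end):
    """REGION value in the form sequence_name,starting_locus,ending_locus."""
    return f"{sequence_name},{start},{end}"


def format_region_file_line(sequence_name, start, end):
    """One REGION_FILE line in the form sequence_name:start-end."""
    return f"{sequence_name}:{start}-{end}"


# Placeholder regions: (sequence name, start, end).
regions = [("20", 76000, 78000), ("MT", 1, 3000)]

# Single region on the command line.
print("REGION=" + format_region_argument(*regions[0]))

# Multiple regions, one per line, as the content of a region file.
region_file_text = "\n".join(format_region_file_line(*r) for r in regions) + "\n"
print(region_file_text, end="")
```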
GENERATE_GENOTYPES (Optional, defaults to true) - Generates genotype information if the output format supports it.
Output
---
Main Output File
--
The main output file contains the final, filtered list of called variants, with a single line per call.
If the GFF output format is used, the file will look similar to the sample below:
snp_MT_1 ks-snp-call snp 152 152 0.960026 . . [metadata]
snp_MT_2 ks-snp-call snp 410 410 0.859517 . . [metadata]
snp_MT_3 ks-snp-call snp 2485 2485 0.900490 . . [metadata]
If VCF is used, the file has a format similar to this sample:
20 76713 . C G 14.8 0 [metadata]
20 77613 [known call ID] G A 30.0 0 [metadata]
20 77864 . C T 30.0 0 [metadata]
Please refer to the GFF and VCF format definitions for more information.
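For downstream processing, the GFF records can be read with a short parser. The sketch below follows the standard GFF column order (sequence name, source, feature, start, end, score, strand, frame, attributes); treating the score column as the call confidence is an assumption, and the `GffCall` type and `parse_gff_line` helper are hypothetical names.

```python
from typing import NamedTuple


class GffCall(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    attributes: str


def parse_gff_line(line):
    """Parse one GFF record into a GffCall.

    Splits on any run of whitespace so the sample lines parse as shown;
    for real tab-delimited GFF files, splitting on tabs only is safer.
    """
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 GFF columns: {line!r}")
    attributes = fields[8] if len(fields) > 8 else ""
    return GffCall(fields[0], fields[1], fields[2],
                   int(fields[3]), int(fields[4]), float(fields[5]), attributes)


sample = """\
snp_MT_1 ks-snp-call snp 152 152 0.960026 . . [metadata]
snp_MT_2 ks-snp-call snp 410 410 0.859517 . . [metadata]
snp_MT_3 ks-snp-call snp 2485 2485 0.900490 . . [metadata]
"""

calls = [parse_gff_line(line) for line in sample.splitlines()]
# Keep only loci whose score is at least 0.9.
confident_loci = [call.start for call in calls if call.score >= 0.9]
print(confident_loci)
```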
False Negatives File
--
The false negatives file contains all records in the 'known calls' file that were not called as a variant by SolSNP using the provided data. The format is identical to that of the main output file.
Summary Metrics Directory
---
The summary metrics directory is created if the SUMMARY option is enabled. It contains information about the calls made, and their relationship with the known calls.
summary.txt:
--
Assorted general metrics regarding the results and the alignments:
- Transition and Transversion counts.
- Weighted allelic balance ratios as calculated at the known calls loci.
- A summary of the mismatch (transition) rates between a reference sequence base and a mapped base. Tab-delimited 3-column format.
* The combination of two nucleotides X->Y, indicating a transition from reference X to mapped base Y
* The number of occurrences of this transition.
* The transition percentage versus the total mapped bases on X.
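The three-column mismatch table lends itself to simple post-processing. The sketch below parses such rows and tallies strict transitions (A<->G, C<->T) against transversions. The exact text layout of summary.txt is an assumption based on the description above, the example rows are placeholders rather than real SolSNP output, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(reference_base, mapped_base):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {reference_base, mapped_base}
    return reference_base != mapped_base and (pair <= PURINES or pair <= PYRIMIDINES)


def parse_mismatch_rows(lines):
    """Parse rows of the form X->Y<TAB>count<TAB>percentage into a dict."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        change, count, percentage = line.split("\t")
        reference_base, mapped_base = change.split("->")
        # Tolerate an optional trailing percent sign (an assumption).
        rows[(reference_base, mapped_base)] = (int(count), float(percentage.rstrip("%")))
    return rows


# Placeholder rows, not real SolSNP output.
example = ["A->G\t120\t0.60", "C->T\t95\t0.48", "A->C\t40\t0.20"]
rows = parse_mismatch_rows(example)

transitions = sum(n for (x, y), (n, _) in rows.items() if is_transition(x, y))
transversions = sum(n for (x, y), (n, _) in rows.items() if not is_transition(x, y))
print(transitions, transversions)
```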
validation.txt:
--
A category matrix of called variants versus known genotypes. The known-calls categories run horizontally; the calls made by SolSNP run vertically.
The categories are:
NoCall: The algorithm was not able to provide an accurate genotype call.
Unknown: There is no information about the locus.
Uncallable: The available data did not meet the requirements of the filters as configured for the execution.
HomozygoteReference: Genotype agrees with the reference.
Heterozygote: Genotype is a heterozygous variant.
HomozygoteNonReference: Genotype is a homozygous variant.
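To make the orientation concrete, the following sketch builds such a matrix from pairs of categories, with SolSNP calls as rows and known calls as columns. It illustrates the structure only: the actual text layout of validation.txt is not specified here, the observations are placeholders, and `category_matrix` is a hypothetical helper.

```python
from collections import Counter

CATEGORIES = [
    "NoCall", "Unknown", "Uncallable",
    "HomozygoteReference", "Heterozygote", "HomozygoteNonReference",
]


def category_matrix(pairs):
    """Count (solsnp_category, known_category) pairs into a square matrix.

    Rows are SolSNP categories (vertical); columns are known-call
    categories (horizontal), both in CATEGORIES order.
    """
    counts = Counter(pairs)
    return [[counts[(row, column)] for column in CATEGORIES] for row in CATEGORIES]


# Placeholder observations: (SolSNP call, known call) for each locus.
observations = [
    ("Heterozygote", "Heterozygote"),
    ("Heterozygote", "Heterozygote"),
    ("NoCall", "HomozygoteNonReference"),
]

matrix = category_matrix(observations)
for name, row in zip(CATEGORIES, matrix):
    print(name + "\t" + "\t".join(str(value) for value in row))
```

Counts on the diagonal are loci where SolSNP and the known calls fall in the same category; off-diagonal counts are disagreements.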