SolSNP is a SNP variant caller that uses Kolmogorov-Smirnov tests to compare the observed evidence at each locus against the reference distributions of the possible genotypes.
1.0 - Initial release.
SolSNP is a command-line Java application, packaged as a JAR file. To run it, open the operating system's command prompt, change to the directory in which the JAR file is located, and type:
java -jar SolSNP.jar INPUT=Input.SAM OUTPUT=snp.gff [ARGUMENT]=[VALUE] [...]
Where each [ARGUMENT] is one of the options described in the section below.
INPUT=test/test.sam REFERENCE_SEQUENCE=test/test_ref.fasta OUTPUT=test.gff VALIDATION_STRINGENCY=SILENT PLOIDY=Haploid
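For scripted runs, the invocation above can be assembled programmatically. The following is a minimal Python sketch and not part of SolSNP: the helper name build_solsnp_command is hypothetical, and it assumes that SolSNP.jar is in the working directory and that java is on the PATH. It only builds the argument list; the line that would launch the program is left commented out.

```python
import subprocess  # only needed if the run line at the bottom is enabled


def build_solsnp_command(jar="SolSNP.jar", **arguments):
    """Return the argv list for a SolSNP run.

    Each keyword argument becomes one ARGUMENT=VALUE token, in the
    order given. The helper name and signature are illustrative only.
    """
    command = ["java", "-jar", jar]
    for name, value in arguments.items():
        command.append(f"{name}={value}")
    return command


# Same arguments as the example line above.
command = build_solsnp_command(
    INPUT="test/test.sam",
    REFERENCE_SEQUENCE="test/test_ref.fasta",
    OUTPUT="test.gff",
    VALIDATION_STRINGENCY="SILENT",
    PLOIDY="Haploid",
)
print(" ".join(command))

# To actually run it (requires Java and the JAR file to be present):
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.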
INPUT (Required) - The input SAM/BAM file. The file contains the complete alignment record for a reference sequence. Note: The order of sequences in this file must match the order of the reference sequences provided by the REFERENCE_SEQUENCE argument below.
OUTPUT (Required) - The main output file name, containing the final variant list. The OUTPUT_FORMAT argument can be used to set which format will be used for this output.
REFERENCE_SEQUENCE (Required) - The FASTA-formatted file of the reference sequence used for alignment of the input SAM/BAM file.
KNOWN_CALLS (Optional) - Used to provide a file containing a secondary set of calls/genotypes, against which the SolSNP output will be compared for concordance. This needs to be a 4-column, tab-delimited file of the format:
[UNIQUE_ID REFERENCE_SEQUENCE_NAME LOCUS GENOTYPE]
If a 'known calls' file is provided, a supplemental 'false negatives' output file will be created, with the name of the main output file, followed by ".falsenegatives". It will contain all calls in the known calls file that were not also generated by SolSNP or that were filtered out.
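As an illustration of the 4-column layout described above, the following Python sketch reads a known-calls file into a dictionary keyed by sequence name and locus. It is not part of SolSNP. The function name read_known_calls, the identifiers, and the genotype strings in the example are made up for illustration; this document does not specify how genotypes are encoded.

```python
import csv
import io


def read_known_calls(handle):
    """Read a 4-column, tab-delimited known-calls file.

    Returns {(sequence_name, locus): (unique_id, genotype)}.
    Raises ValueError on rows that do not have exactly four columns.
    """
    calls = {}
    for row_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"row {row_number}: expected 4 columns, got {len(row)}")
        unique_id, sequence_name, locus, genotype = row
        calls[(sequence_name, int(locus))] = (unique_id, genotype)
    return calls


# Hypothetical file contents; the IDs and genotype encoding are assumptions.
example = io.StringIO("id_1\tMT\t152\tAG\nid_2\tMT\t410\tCC\n")
known = read_known_calls(example)
print(known[("MT", 152)])
```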
STRAND_MODE (Optional, defaults to Consensus) - Defines how SolSNP uses strand information for each locus. The options are as follows:
* VariantConsensus - Separate calls are made on the mapped nucleotides on the forward and reverse strand. A variant is called only if both calls agree on the existence of a variant.
* GenotypeConsensus - Separate calls are made on the mapped nucleotides on the forward and reverse strand. A variant is called only if both calls agree on the existence of a variant and agree on the genotype.
* PositiveOnly - Only bases on the positive (forward) strand are processed.
* NegativeOnly - Only bases on the negative (reverse) strand are processed.
* None - Strand information is ignored.
* NoneWithStrandInfo - Calls are made per strand, but not used for the final variant call. The auxiliary information is attached to the metadata fields of the output format.
* OneStrandAndTotal - Calls are made on each strand separately, and an additional call is made ignoring strand information. The final call requires that the complete (strand-independent) call and at least one of the strand calls agree on the existence of a variant.
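The strand modes listed above can be paraphrased as a small decision function. The Python sketch below is only a reading of those descriptions, not SolSNP's actual implementation. It assumes each call is a genotype string (or None when no call was made) and that a call counts as a variant when it differs from the reference genotype; the function name and the genotype encoding are hypothetical.

```python
def variant_reported(mode, forward, reverse, total, reference):
    """Decide whether a variant is reported at a locus under a strand mode.

    forward, reverse, total: genotype calls made from forward-strand bases,
    reverse-strand bases, and all bases (None means no call was made).
    reference: the homozygous-reference genotype at this locus.
    """
    def is_variant(call):
        return call is not None and call != reference

    if mode == "VariantConsensus":
        return is_variant(forward) and is_variant(reverse)
    if mode == "GenotypeConsensus":
        return is_variant(forward) and is_variant(reverse) and forward == reverse
    if mode == "PositiveOnly":
        return is_variant(forward)
    if mode == "NegativeOnly":
        return is_variant(reverse)
    if mode in ("None", "NoneWithStrandInfo"):
        # Strand calls, if made, are only attached as metadata.
        return is_variant(total)
    if mode == "OneStrandAndTotal":
        return is_variant(total) and (is_variant(forward) or is_variant(reverse))
    raise ValueError(f"unknown strand mode: {mode}")


# Strands agree that a variant exists but disagree on the genotype.
print(variant_reported("VariantConsensus", "AG", "GG", "AG", "AA"))
print(variant_reported("GenotypeConsensus", "AG", "GG", "AG", "AA"))
```

In this example the first mode reports the variant and the second does not, because GenotypeConsensus additionally requires the two strand genotypes to match.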
SUMMARY (Optional, defaults to false) - Specifies whether the summary metrics directory is to be created upon execution. Please refer to the output section below for more information.
FILTER (Optional, defaults to 0.0) - The minimum confidence score allowed for calls. Ranges from 0 to 1.0. Calls made with a confidence score lower than this value will be discarded.
CALL_BIAS (Optional, defaults to 0.0) - This value can be used to force a bias towards a variant prediction, as opposed to a prediction of a match with the reference sequence. This value can be used to experimentally and simplistically counter-weight potential biases towards the reference genome that the sequencing or alignment process may have introduced.
MINIMUM_BASE_QUALITY (Optional, defaults to 0) - Aligned bases with a quality lower than this value will not be processed.
OUTPUT_FORMAT (Optional, defaults to GFF) - The output format for the call text file. The options available in this version are 'GFF' (as described at http://www.sanger.ac.uk/Software/formats/GFF/ ) and 'VCF' (as described at http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 ).
PLOIDY (Optional, defaults to 'Diploid') - Configures the ploidy of the analyzed sample. Possible values are 'Diploid' and 'Haploid'.
MINIMUM_COVERAGE (Optional, defaults to 3) - The minimum number of bases that must be mapped to a locus for it to be evaluated.
MINIMUM_MAPQ (Optional, defaults to 1) - The minimum mapping quality a read must have to be included in the pileup.
REGION (Optional, defaults to full file) - Specifies which region is to be analyzed. The argument is a string in the form [sequence_name,starting_locus,ending_locus]. Note that this option requires the input to be in BAM form, and for a BAM index file (.bai) to be present in the same directory.
Main Output File
The main output file contains the final, filtered list of called variants, with a single line per call.
If the GFF output format is used, the file will look similar to the sample below:
snp_MT_1 ks-snp-call snp 152 152 0.960026 . . [metadata]
snp_MT_2 ks-snp-call snp 410 410 0.859517 . . [metadata]
snp_MT_3 ks-snp-call snp 2485 2485 0.900490 . . [metadata]
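For downstream processing, GFF lines like the sample above can be split into the nine standard GFF columns. The Python sketch below is an illustration, not part of SolSNP. It assumes the sixth column is the score shown in the sample, and it splits on any whitespace because the sample is displayed with spaces; real GFF files are tab-delimited. The names GffCall and parse_gff_line are hypothetical.

```python
from typing import NamedTuple


class GffCall(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line):
    """Split one GFF line into its nine columns.

    At most eight splits are made, so spaces inside the final
    attributes column are preserved.
    """
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return GffCall(seqname, source, feature, int(start), int(end),
                   float(score), strand, frame, attributes)


sample = [
    "snp_MT_1 ks-snp-call snp 152 152 0.960026 . . [metadata]",
    "snp_MT_2 ks-snp-call snp 410 410 0.859517 . . [metadata]",
    "snp_MT_3 ks-snp-call snp 2485 2485 0.900490 . . [metadata]",
]
calls = [parse_gff_line(line) for line in sample]

# Keep only calls whose score is at least 0.9.
confident = [call.seqname for call in calls if call.score >= 0.9]
print(confident)  # ['snp_MT_1', 'snp_MT_3']
```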
If VCF is used, the file has a format similar to this sample:
20 76713 . C G 14.8 0 [metadata]
20 77613 . G A 30.0 0 [metadata]
20 77864 . C T 30.0 0 [metadata]
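The VCF sample can be read in a similar way. The Python sketch below is an illustration, not part of SolSNP. It assumes the eight columns shown are chromosome, position, ID, reference base, alternate base, quality, filter, and info, in that order, and it classifies each substitution as a transition (A<->G or C<->T) or a transversion. The function names are hypothetical.

```python
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def parse_vcf_line(line):
    """Split one VCF data line into its first eight columns."""
    chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split(None, 7)
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id,
        "ref": ref, "alt": alt, "qual": float(qual),
        "filter": filt, "info": info,
    }


def substitution_class(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    if (ref.upper(), alt.upper()) in TRANSITIONS:
        return "transition"
    return "transversion"


sample = [
    "20 76713 . C G 14.8 0 [metadata]",
    "20 77613 . G A 30.0 0 [metadata]",
    "20 77864 . C T 30.0 0 [metadata]",
]
# Header lines in a real VCF file start with '#'; skip them.
records = [parse_vcf_line(line) for line in sample if not line.startswith("#")]
classes = [substitution_class(r["ref"], r["alt"]) for r in records]
print(classes)  # ['transversion', 'transition', 'transition']
```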
Please refer to the GFF and VCF format definitions for more information.
False Negatives File
The false negatives file contains all records in the 'known calls' file that were not called as a variant by SolSNP using the provided data. The format is identical to that of the main output file.
Summary Metrics Directory
The summary metrics directory is created if the SUMMARY option is enabled. It contains information about the calls made, and their relationship with the known calls.
Assorted general metrics regarding the results and the alignments:
- Transition and Transversion counts.
- Weighted allelic balance ratios as calculated at the known calls loci.
- A summary of the mismatch (transition) rates between a reference sequence base and a mapped base. Tab-delimited 3-column format.
* The combination of two nucleotides X->Y, indicating a transition from reference X to mapped base Y
* The number of occurrences of this transition.
* The transition percentage versus the total mapped bases on X.
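The three-column mismatch summary described above can be loaded with a few lines of Python. This sketch is an illustration, not part of SolSNP. The example rows are made-up numbers, and it is an assumption that the third column is a plain number, optionally followed by a percent sign.

```python
import io


def read_mismatch_summary(handle):
    """Read rows of 'X->Y<TAB>count<TAB>percentage'.

    Returns a list of (reference_base, mapped_base, count, percentage).
    """
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        change, count, percentage = line.split("\t")
        reference_base, mapped_base = change.split("->")
        rows.append((reference_base, mapped_base, int(count),
                     float(percentage.rstrip("%"))))
    return rows


# Hypothetical file contents, for illustration only.
example = io.StringIO("A->G\t120\t0.8\nA->C\t30\t0.2\n")
rows = read_mismatch_summary(example)

# Find the most frequent mismatch by occurrence count.
most_common = max(rows, key=lambda row: row[2])
print(most_common[0], "->", most_common[1])
```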
A category matrix of called variants versus known genotypes. The known-calls categorizations are horizontal; the calls created by SolSNP are vertical.
The categories are:
NoCall: The algorithm was not able to provide an accurate genotype call.
Unknown: There is no information about the locus.
Uncallable: The available data did not meet the requirements of the filters as configured in the execution.
HomozygoteReference: Genotype agrees with reference.
Heterozygote: Genotype is heterozygote variant.
HomozygoteNonReference: Genotype is homozygote variant.
A set of files for each Category-Category combination, named in the scheme 'Category1Category2.txt' . Each file is a 2-column tab-delimited text file with:
a) A pileup depth
b) The number of elements called at that pileup depth
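Each of these two-column files is effectively a histogram of pileup depth. As an illustration (not part of SolSNP), the Python sketch below reads one such file and computes the mean depth over all elements. The function names and the example contents are hypothetical.

```python
import io


def read_depth_histogram(handle):
    """Read 'depth<TAB>count' rows into a {depth: count} dictionary."""
    histogram = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        depth, count = line.split("\t")
        histogram[int(depth)] = int(count)
    return histogram


def mean_depth(histogram):
    """Mean pileup depth, weighted by the number of elements at each depth."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return sum(depth * count for depth, count in histogram.items()) / total


# Hypothetical file contents, for illustration only.
example = io.StringIO("3\t10\n4\t20\n5\t10\n")
histogram = read_depth_histogram(example)
print(sum(histogram.values()), mean_depth(histogram))  # 40 4.0
```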