25 Sep, 2013
This project has now been replaced by SRST2 - Short Read Sequence Typing for Bacterial Pathogens, available at http://katholt.github.io/srst2/
The new SRST2 program does gene typing as well as MLST (e.g. typing resistance genes, virulence genes, etc). SRST2 is faster and more accurate than the old SRST, using bowtie2 to achieve local alignments (no need for flanking sequences) and a brand new scoring system.
(1) Download latest MLST data (sequences and profiles) from: http://pubmlst.org/data/
(a) ST profiles. This should be tab-delimited text, with the first column indicating the ST and subsequent columns indicating the locus variants that make up the ST. The first row should indicate the label or name of each locus.
E.g. for S. pneumoniae (http://pubmlst.org/data/profiles/spneumoniae.txt) the file looks like this:
ST aroe gdh_ gki_ recP spi_ xpt_ ddl_
1 1 1 1 1 1 1 1
2 1 1 4 1 18 13 18
3 1 5 1 8 14 11 14
(b) Locus variant sequences are in fasta format, one fasta file per locus. If downloaded from the above link, they will have extension ‘.tfa’ (if you got them elsewhere they might have another extension, this is fine). The file names, excluding the extension, must be the locus label and match exactly the labels used in the profiles file.
E.g. for S. pneumoniae, you need to download
http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa
http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa
http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa
http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa
http://pubmlst.org/data/alleles/spneumoniae/recP.tfa
http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa
http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa
These files will contain fasta sequences, one fasta entry per variant, labeled with the allele number, aroe-1, aroe-2, like this example from S. pneumoniae aroe.tfa:
>aroe-1
GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATGAGCTAAGCGATGAA
GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
CTAGACAAGTTACAGGAGCAGACAGGCTTTAAAGTGGATTTGTGT
>aroe-2
GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATAAGCTGAGCGATGAA
GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
CTAGACAAGTTACAGGAGCAGACAGGTTTTAAAGTGGATTTGTGT
Note that some MLST databases use a different character (e.g. ‘_’) to separate the locus label (‘aroe’) from the allele number (‘1’), so it might be aroe_1, aroe_2 rather than aroe-1, aroe-2. This is OK but SRST assumes by default that a dash ‘-‘ is used, so if it is anything other than this you need to specify it in the SRST command via the –n argument.
(2) Download reference sequence (fasta format, .fna) from: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
(3) Generate inputs. Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):
python prepSRST.py -d ref.fna *tfa > summary.txt
Note dependencies can be obtained from:
Biopython http://biopython.org/wiki/Download
BLAST+ ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
EMBOSS http://emboss.sourceforge.net/
The outputs of the script, which are required for running SRST, can be generated in other ways too. These are:
(a) Flanking sequences. A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.
e.g. S. pneumoniae aroe_flanks.fasta:
>up
AGTTGTTGCCAATCCTATTAAGCATTCTATTTCTCCCTTCATCCACAATAGAGCCTTTGA
GGCGACAGCTACCAACGGTGCTTATGTGGCTTGGGAGATT
>down
GTATGCTTTAGAAAATGTTTCTGAACTGCAAGCAAGGATTGTCGAGTCGGATTTACTGGT
CAATGCCACCAGTGTGGGCATGGATGGTCAATCATCCCCA
Note the flanks need to be in the same orientation as the MLST locus, so if the MLST locus sequence is on the forward strand, both flanks need to be on the forward strand too. Basically, SRST will reconstruct a single sequence that extends from 100 bp upstream of the locus to 100 bp downstream, and it will need to be able to do this by concatenating the ‘up’ sequence, the locus variant sequence (from the *.tfa files downloaded in step 1) and the ‘down’ sequence.
(b) Summary file. A tab-delimited text file telling SRST where to find these sequences for each locus:
aroe aroe.tfa aroe_flanks.fasta
ddl_ ddl_.tfa ddl__flanks.fasta
gdh_ gdh_.tfa gdh__flanks.fasta
gki_ gki_.tfa gki__flanks.fasta
recP recP.tfa recP_flanks.fasta
spi_ spi_.tfa spi__flanks.fasta
xpt_ xpt_.tfa xpt__flanks.fasta
(c) Note the prepSRST.py script will also output a genbank file containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
(4) Run SRST script (requires Python 2.6.4, BWA, Samtools):
python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt
e.g.:
python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt
The main argument (unflagged) is the list of fastq files (can be gzip'd) containing sequence reads to process.
-S Summary file, tab-delimited, see 3b
-d MLST profiles database, tab-delimited, see 1a
-l Name of log file to record verbose output (default, stderr)
-n Separator for allele names (default, ‘-‘)
-s Score cut-off value (default, 10)
-w Name of working directory to store mapping data (default, ‘tmp’)
-i Insert size for paired reads (bwa default, 500)
-V Flag to switch on storing of all output (otherwise temporary files are removed)
-b Path to bwa (default, ‘bwa’)
-t Path to samtools (default, ‘samtools’)
The script will determine which read files are paired and which are single, on the basis that paired reads are indicated as _1 and _2, e.g. filenames 'sampleName_1.fastq' and 'sampleName_2.fastq', while single reads will be named without this, e.g. 'sampleName.fastq'.
Dependencies can be downloaded from:
http://samtools.sourceforge.net/
http://bio-bwa.sourceforge.net/
ST information is printed to stdout, while more detailed information is printed to the log file specified via ‘-l’.
(1) ST Information (stdout)
ST information is in the typical format suitable for eBurst analysis, but with an additional column at the end giving the final score (the lowest score across alll MLST loci).
Column 1: read set identifier (extracted from read file names)
Column 2: ST
Columns 3-X: locus variants
Final column: final score for this read set
If a locus variant cannot be assigned (i.e. there is no zero-SNP match to a known locus with a score that exceeds the cut-off), the variant is recorded here as ‘-’. But SRST will try to infer the closest allele (reported in the log) and the closest ST, highlighted with a ‘’.
e.g. 152/1 indicates the closest match was ST152, differing by one locus.
(Check the log file for further details of the equivocal locus).
If a novel combination of known alleles is detected, a temporary ST will be assigned (preceded by ‘NOVEL-’ and reported here (and detailed in the log file).
e.g. ‘NOVEL-7882’ indicates a new combination of known alleles was found.
(2) Log file (-l)
This specifies the reads files that were processed, all variants scoring above the cut-off with no mismatches (the assigned allele is the one with the highest score among these).
If there are no exact matches with scores passing the cut-off, the log file will report the best scoring allele, the number of SNPs called in the data set vs the allele sequence, and the score itself,
e.g.: *2/1/20.4 indicates that the best match was to allele 2, with 1 SNP and a score of 20.4
Also printed are some statistics on the read depth covering each locus (Min/Max/Avg/StdDev).
At the end of the file, the results (alleles & scores) are collated for all input read sets.