Recent changes to User guide

User guide modified by Kat Holt

Kat Holt — Wed, 25 Sep 2013 05:23:16 -0000

--- v6
+++ v7
@@ -1,3 +1,15 @@
+Update - SRST2 Now Available
+------
+**25 Sep, 2013**
+
+**This project has now been replaced by SRST2 - Short Read Sequence Typing for Bacterial Pathogens, available at http://katholt.github.io/srst2/**
+
+**The new SRST2 program does gene typing as well as MLST (e.g. typing resistance genes, virulence genes, etc). SRST2 is faster and more accurate than the old SRST, using bowtie2 to achieve local alignments (no need for flanking sequences) and a brand new scoring system.**
+
+
+
+Legacy User Guide
+----------------
 **(1) Download latest MLST data** (sequences and profiles) from: http://pubmlst.org/data/

 (a) ***ST profiles***. This should be tab-delimited text, with the first column indicating the ST and subsequent columns indicating the locus variants that make up the ST. The first row should indicate the label or name of each locus.
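
As an illustration of the profile layout described in (a), here is a minimal sketch of reading such a file into a lookup table. It is not taken from SRST itself: the function name `load_profiles` is hypothetical, the code is written for Python 3 (SRST targets Python 2.6.4), and it assumes every column after the first is a locus. The example rows are the S. pneumoniae rows quoted elsewhere in this guide.

```python
import csv
import io


def load_profiles(handle):
    """Read tab-delimited ST profiles.

    Returns (loci, profiles), where loci is the list of locus labels from
    the header row and profiles maps a tuple of allele numbers to its ST.
    Assumption: every column after the first is a locus.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]  # first column holds the ST label
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        profiles[tuple(row[1:])] = row[0]
    return loci, profiles


# Example rows copied from the S. pneumoniae profile excerpt in this guide.
example = io.StringIO(
    "ST\taroe\tgdh_\tgki_\trecP\tspi_\txpt_\tddl_\n"
    "1\t1\t1\t1\t1\t1\t1\t1\n"
    "2\t1\t1\t4\t1\t18\t13\t18\n"
    "3\t1\t5\t1\t8\t14\t11\t14\n"
)
loci, profiles = load_profiles(example)
print(loci)
print(profiles[("1", "1", "4", "1", "18", "13", "18")])
```

Keying the table on the allele tuple makes the "which ST is this combination?" lookup a single dictionary access.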

WikiPage User guide modified by Kat Holt

Kat Holt — Sat, 24 Mar 2012 07:19:56 -0000

--- v5 
+++ v6 
@@ -91,29 +91,61 @@
 e.g.:
 python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt
 
-***The main argument*** (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.
-
-
+***The main argument*** (unflagged) is the list of fastq files (can be gzip'd) containing sequence reads to process.
+
+
 ######Other required arguments:######
 
     -S	Summary file, tab-delimited, see 3b
     -d	MLST profiles database, tab-delimited, see 1a
     -l	Name of log file to record verbose output (default, stderr)
 
 
 ######Optional arguments:######
     -n	Separator for allele names (default, ‘-’)
     -s	Score cut-off value (default, 10)
-    -w	Name of working directory to store mapping data (default, ‘tmp’)
-    -p	Flag to indicate data is paired (note info below re paired reads) 
-    -i	Insert size for paired reads (bam default, 500)
+    -w	Name of working directory to store mapping data (default, ‘tmp’) 
+    -i	Insert size for paired reads (bwa default, 500)
     -V	Flag to switch on storing of all output (otherwise temporary files are removed)
     -b	Path to bwa (default, ‘bwa’)
     -t 	Path to samtools (default, ‘samtools’)
 
 ######Paired data:######
-If forward & reverse reads are available, this can be specified using the ‘-p’ flag. It is assumed forward and reverse reads will be adjacent to each other on the command line, e.g. ‘sample1_1.fastq sample1_2.fastq sample2_1.fastq sample2_2.fastq’.
+The script will determine which read files are paired and which are single, on the basis that paired reads are indicated as _1 and _2, e.g. filenames 'sampleName_1.fastq' and 'sampleName_2.fastq', while single reads will be named without this, e.g. 'sampleName.fastq'.
 
 Dependencies can be downloaded from:
 http://samtools.sourceforge.net/
 http://bio-bwa.sourceforge.net/
+
+
+######Output######
+
+ST information is printed to stdout, while more detailed information is printed to the log file specified via ‘-l’.
+
+**(1) ST Information (stdout)**
+
+ST information is in the typical format suitable for eBurst analysis, but with an additional column at the end giving the final score (the lowest score across all MLST loci).
+
+Column 1: read set identifier (extracted from read file names)
+Column 2: ST
+Columns 3-X: locus variants
+Final column: final score for this read set
+
+If a locus variant cannot be assigned (i.e. there is no zero-SNP match to a known locus with a score that exceeds the cut-off), the variant is recorded here as ‘-’. SRST will still try to infer the closest allele (reported in the log) and the closest ST, which is flagged with a ‘*’.
+e.g. *152/1 indicates the closest match was ST152, differing by one locus. 
+(Check the log file for further details of the equivocal locus).
+
+If a novel combination of known alleles is detected, a temporary ST will be assigned (preceded by ‘NOVEL-’) and reported here (and detailed in the log file).
+e.g. ‘NOVEL-7882’ indicates a new combination of known alleles was found.
+
+
+**(2) Log file (-l)**
+
+This lists the read files that were processed and all variants scoring above the cut-off with no mismatches (the assigned allele is the one with the highest score among these).
+
+If there are no exact matches with scores passing the cut-off, the log file will report the best scoring allele, the number of SNPs called in the data set vs the allele sequence, and the score itself, 
+e.g.: *2/1/20.4 indicates that the best match was to allele 2, with 1 SNP and a score of 20.4
+
+Also printed are some statistics on the read depth covering each locus (Min/Max/Avg/StdDev).
+
+At the end of the file, the results (alleles & scores) are collated for all input read sets.
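
The conventions introduced in this revision (read-file naming for paired data, and the `*152/1`, `NOVEL-7882` and `*2/1/20.4` notations in the output) can be sketched as below. This is an illustrative reading of the text above, not SRST's own code: the function names, the returned dictionaries and the accepted `.fq` extension are assumptions, and the code is written for Python 3.

```python
import re
from collections import defaultdict


def group_read_files(filenames):
    """Group fastq names by sample; '<sample>_1' / '<sample>_2' mark pairs."""
    samples = defaultdict(dict)
    for name in filenames:
        # Strip .fastq / .fq with an optional .gz (extension list is assumed).
        stem = re.sub(r"\.(fastq|fq)(\.gz)?$", "", name)
        match = re.match(r"^(.*)_([12])$", stem)
        if match:
            samples[match.group(1)][match.group(2)] = name
        else:
            samples[stem]["single"] = name
    return dict(samples)


def parse_st_field(field):
    """Classify the ST column of the stdout table."""
    if field.startswith("NOVEL-"):
        return {"kind": "novel", "st": field}
    match = re.match(r"^\*(\w+)/(\d+)$", field)
    if match:  # e.g. '*152/1': closest ST 152, differing at 1 locus
        return {"kind": "closest", "st": match.group(1),
                "loci_differing": int(match.group(2))}
    return {"kind": "exact", "st": field}


def parse_log_allele(field):
    """Parse a log entry such as '*2/1/20.4' (allele / SNPs / score)."""
    match = re.match(r"^\*([^/]+)/(\d+)/([\d.]+)$", field)
    if match is None:
        return None
    return {"allele": match.group(1), "snps": int(match.group(2)),
            "score": float(match.group(3))}


print(group_read_files(["sampleName_1.fastq", "sampleName_2.fastq",
                        "other.fastq.gz"]))
print(parse_st_field("*152/1"))
print(parse_log_allele("*2/1/20.4"))
```

Note that this sketch treats a lone `_1` file with no matching `_2` as an incomplete pair rather than a single-end read set; how SRST itself handles that case is not stated here.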

WikiPage User guide modified by Kat Holt

Kat Holt — Thu, 02 Feb 2012 02:20:27 -0000

--- v4 
+++ v5 
@@ -117,6 +117,3 @@
 Dependencies can be downloaded from:
 http://samtools.sourceforge.net/
 http://bio-bwa.sourceforge.net/
-
-Note if your installation means these programs are not accessible to SRST using ‘bwa’ and ‘samtools’, you will need to provide the path 
-

WikiPage User guide modified by Kat Holt

Kat Holt — Thu, 02 Feb 2012 02:20:01 -0000

--- v3 
+++ v4 
@@ -47,56 +47,56 @@
 
 **(3) Generate inputs.** Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):
 
-	python getFlanksMLST.py -d ref.fna *tfa > summary.txt
-
-
+	python prepSRST.py -d ref.fna *tfa > summary.txt
+
+
 Note dependencies can be obtained from:
 Biopython 	http://biopython.org/wiki/Download
 BLAST+	
 EMBOSS	http://emboss.sourceforge.net/
 
 
 The outputs of the script, which are required for running SRST, can be generated in other ways too. These are:
 
 ***(a) Flanking sequences.*** A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.
 
 e.g. S. pneumoniae aroe_flanks.fasta:
 
 >\>up
 AGTTGTTGCCAATCCTATTAAGCATTCTATTTCTCCCTTCATCCACAATAGAGCCTTTGA
 GGCGACAGCTACCAACGGTGCTTATGTGGCTTGGGAGATT
 >\>down
 GTATGCTTTAGAAAATGTTTCTGAACTGCAAGCAAGGATTGTCGAGTCGGATTTACTGGT
 CAATGCCACCAGTGTGGGCATGGATGGTCAATCATCCCCA
 
 Note the flanks need to be in the same orientation as the MLST locus, so if the MLST locus sequence is on the forward strand, both flanks need to be on the forward strand too. Basically, SRST will reconstruct a single sequence that extends from 100 bp upstream of the locus to 100 bp downstream, and it will need to be able to do this by concatenating the ‘up’ sequence, the locus variant sequence (from the *.tfa files downloaded in step 1) and the ‘down’ sequence.
 
 ***(b) Summary file.*** A tab-delimited text file telling SRST where to find these sequences for each locus:
 
     aroe	aroe.tfa	aroe_flanks.fasta
     ddl_	ddl_.tfa	ddl__flanks.fasta
     gdh_	gdh_.tfa	gdh__flanks.fasta
     gki_	gki_.tfa	gki__flanks.fasta
     recP	recP.tfa	recP_flanks.fasta
     spi_	spi_.tfa	spi__flanks.fasta
     xpt_	xpt_.tfa	xpt__flanks.fasta
 
-(c) Note the getFlanks.py script will also output a ***genbank file*** containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
+(c) Note the prepSRST.py script will also output a ***genbank file*** containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
 
 
 **(4) Run SRST script** (requires Python 2.6.4, BWA, Samtools):
 
     python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt
 
 e.g.:
 python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt
 
 ***The main argument*** (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.
 
 
 ######Other required arguments:######
 
-    -P	Summary file, tab-delimited, see 3b
+    -S	Summary file, tab-delimited, see 3b
     -d	MLST profiles database, tab-delimited, see 1a
     -l	Name of log file to record verbose output (default, stderr)
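
The concatenation described in step 3a (the ‘up’ flank, then the locus variant, then the ‘down’ flank) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by SRST or prepSRST.py: `read_fasta` and `build_references` are hypothetical helpers, the sequences are short toy strings rather than real MLST data, and the code is written for Python 3.

```python
def read_fasta(text):
    """Parse fasta-formatted text into {entry_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def build_references(alleles, flanks):
    """Return {allele_name: up + allele + down} for a single locus.

    Both flanks must already be in the same orientation as the locus.
    """
    up, down = flanks["up"], flanks["down"]
    return {name: up + seq + down for name, seq in alleles.items()}


# Toy sequences for illustration only (not real allele or flank data).
flanks = read_fasta(">up\nAAAA\n>down\nTTTT\n")
alleles = read_fasta(">aroe-1\nGGCC\n>aroe-2\nGG\nTC\n")
for name, seq in sorted(build_references(alleles, flanks).items()):
    print(name, seq)
```

With real inputs, `flanks` would come from a “[locuslabel]_flanks.fasta” file and `alleles` from the matching .tfa file, as listed in the summary file.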

WikiPage User guide modified by Kat Holt

Kat Holt — Tue, 31 Jan 2012 00:14:32 -0000

--- v2 
+++ v3 
@@ -4,119 +4,114 @@
 
 E.g. for S. pneumoniae (http://pubmlst.org/data/profiles/spneumoniae.txt) the file looks like this: 
 
-ST | aroe | gdh_ | gki_ | recP | spi_ | xpt_ | ddl_
----------------------------------------------------
-1|1|1|1|1|1|1|1
-2	1	1	4	1	18	13	18
-3	1	5	1	8	14	11	14
-
+    ST	aroe	gdh_	gki_	recP	spi_	xpt_	ddl_
+    1	1	1	1	1	1	1	1
+    2	1	1	4	1	18	13	18
+    3	1	5	1	8	14	11	14
+
 (b) ***Locus variant sequences*** are in fasta format, one fasta file per locus. If downloaded from the above link, they will have extension ‘.tfa’ (if you got them elsewhere they might have another extension, this is fine). The file names, excluding the extension, must be the locus label and match exactly the labels used in the profiles file.
 
-E.g. for S. pneumoniae, you need to download http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa
-
-http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa
-
-http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa 
-
-http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa 
-
-http://pubmlst.org/data/alleles/spneumoniae/recP.tfa
-
-http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa 
-
-http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa
+E.g. for S. pneumoniae, you need to download 
+
+    http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa
+    http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa
+    http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa 
+    http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa 
+    http://pubmlst.org/data/alleles/spneumoniae/recP.tfa
+    http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa 
+    http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa
+
 
 These files will contain fasta sequences, one fasta entry per variant, labeled with the allele number, aroe-1, aroe-2, like this example from S. pneumoniae aroe.tfa:
 
->aroe-1
+>\>aroe-1
 GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
 AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATGAGCTAAGCGATGAA
 GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
 AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
 ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
 GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
 CTAGACAAGTTACAGGAGCAGACAGGCTTTAAAGTGGATTTGTGT
->aroe-2
+>\>aroe-2
 GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
 AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATAAGCTGAGCGATGAA
 GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
 AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
 ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
 GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
 CTAGACAAGTTACAGGAGCAGACAGGTTTTAAAGTGGATTTGTGT
 
 Note that some MLST databases use a different character (e.g. ‘_’) to separate the locus label (‘aroe’) from the allele number (‘1’), so it might be aroe_1, aroe_2 rather than aroe-1, aroe-2. This is OK but SRST assumes by default that a dash ‘-’ is used, so if it is anything other than this you need to specify it in the SRST command via the -n argument.
 
-
-**(2) Download reference sequence** (fasta format, .fna) from:
-
-	ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
-
+**(2) Download reference sequence** (fasta format, .fna) from: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
 
 **(3) Generate inputs.** Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):
 
 	python getFlanksMLST.py -d ref.fna *tfa > summary.txt
 
 
 Note dependencies can be obtained from:
 Biopython 	http://biopython.org/wiki/Download
-BLAST+	ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
+BLAST+	
 EMBOSS	http://emboss.sourceforge.net/
 
 
 The outputs of the script, which are required for running SRST, can be generated in other ways too. These are:
 
 ***(a) Flanking sequences.*** A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.
 
 e.g. S. pneumoniae aroe_flanks.fasta:
 
->up
+>\>up
 AGTTGTTGCCAATCCTATTAAGCATTCTATTTCTCCCTTCATCCACAATAGAGCCTTTGA
 GGCGACAGCTACCAACGGTGCTTATGTGGCTTGGGAGATT
->down
+>\>down
 GTATGCTTTAGAAAATGTTTCTGAACTGCAAGCAAGGATTGTCGAGTCGGATTTACTGGT
 CAATGCCACCAGTGTGGGCATGGATGGTCAATCATCCCCA
 
 Note the flanks need to be in the same orientation as the MLST locus, so if the MLST locus sequence is on the forward strand, both flanks need to be on the forward strand too. Basically, SRST will reconstruct a single sequence that extends from 100 bp upstream of the locus to 100 bp downstream, and it will need to be able to do this by concatenating the ‘up’ sequence, the locus variant sequence (from the *.tfa files downloaded in step 1) and the ‘down’ sequence.
 
 ***(b) Summary file.*** A tab-delimited text file telling SRST where to find these sequences for each locus:
 
-aroe	aroe.tfa	aroe_flanks.fasta
-ddl_	ddl_.tfa	ddl__flanks.fasta
-gdh_	gdh_.tfa	gdh__flanks.fasta
-gki_	gki_.tfa	gki__flanks.fasta
-recP	recP.tfa	recP_flanks.fasta
-spi_	spi_.tfa	spi__flanks.fasta
-xpt_	xpt_.tfa	xpt__flanks.fasta
+    aroe	aroe.tfa	aroe_flanks.fasta
+    ddl_	ddl_.tfa	ddl__flanks.fasta
+    gdh_	gdh_.tfa	gdh__flanks.fasta
+    gki_	gki_.tfa	gki__flanks.fasta
+    recP	recP.tfa	recP_flanks.fasta
+    spi_	spi_.tfa	spi__flanks.fasta
+    xpt_	xpt_.tfa	xpt__flanks.fasta
 
 (c) Note the getFlanks.py script will also output a ***genbank file*** containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
 
 
 **(4) Run SRST script** (requires Python 2.6.4, BWA, Samtools):
 
-python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt
+    python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt
 
 e.g.:
 python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt
 
 ***The main argument*** (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.
 
-***Other required arguments:***
--P	Summary file, tab-delimited, see 3b
--d	MLST profiles database, tab-delimited, see 1a
--l	Name of log file to record verbose output (default, stderr)
-
-***Optional arguments:***
--n	Separator for allele names (default, ‘-’)
--s	Score cut-off value (default, 10)
--w	Name of working directory to store mapping data (default, ‘tmp’)
--p	Flag to indicate data is paired (note info below re paired reads) 
--i	Insert size for paired reads (bam default, 500)
--V	Flag to switch on storing of all output (otherwise temporary files created by SRST are removed)
--b	Path to bwa (default, ‘bwa’)
--t 	Path to samtools (default, ‘samtools’)
-
-***Paired data:***
+
+######Other required arguments:######
+
+    -P	Summary file, tab-delimited, see 3b
+    -d	MLST profiles database, tab-delimited, see 1a
+    -l	Name of log file to record verbose output (default, stderr)
+
+
+######Optional arguments:######
+    -n	Separator for allele names (default, ‘-’)
+    -s	Score cut-off value (default, 10)
+    -w	Name of working directory to store mapping data (default, ‘tmp’)
+    -p	Flag to indicate data is paired (note info below re paired reads) 
+    -i	Insert size for paired reads (bam default, 500)
+    -V	Flag to switch on storing of all output (otherwise temporary files are removed)
+    -b	Path to bwa (default, ‘bwa’)
+    -t 	Path to samtools (default, ‘samtools’)
+
+######Paired data:######
 If forward & reverse reads are available, this can be specified using the ‘-p’ flag. It is assumed forward and reverse reads will be adjacent to each other on the command line, e.g. ‘sample1_1.fastq sample1_2.fastq sample2_1.fastq sample2_2.fastq’.
 
 Dependencies can be downloaded from:

WikiPage User guide modified by Kat Holt

Kat Holt — Mon, 30 Jan 2012 23:59:05 -0000

--- v1 
+++ v2 
@@ -1,122 +1,122 @@
 **(1) Download latest MLST data** (sequences and profiles) from: http://pubmlst.org/data/
 
-
 (a) ***ST profiles***. This should be tab-delimited text, with the first column indicating the ST and subsequent columns indicating the locus variants that make up the ST. The first row should indicate the label or name of each locus. 
 
 E.g. for S. pneumoniae (http://pubmlst.org/data/profiles/spneumoniae.txt) the file looks like this: 
 
 ST | aroe | gdh_ | gki_ | recP | spi_ | xpt_ | ddl_
 ---------------------------------------------------
 1|1|1|1|1|1|1|1
 2	1	1	4	1	18	13	18
 3	1	5	1	8	14	11	14
 
 (b) ***Locus variant sequences*** are in fasta format, one fasta file per locus. If downloaded from the above link, they will have extension ‘.tfa’ (if you got them elsewhere they might have another extension, this is fine). The file names, excluding the extension, must be the locus label and match exactly the labels used in the profiles file.
 
 E.g. for S. pneumoniae, you need to download http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa
 
 http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa
 
 http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa 
 
 http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa 
 
 http://pubmlst.org/data/alleles/spneumoniae/recP.tfa
 
 http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa 
 
 http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa
 
 These files will contain fasta sequences, one fasta entry per variant, labeled with the allele number, aroe-1, aroe-2, like this example from S. pneumoniae aroe.tfa:
 
 >aroe-1
 GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
 AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATGAGCTAAGCGATGAA
 GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
 AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
 ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
 GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
 CTAGACAAGTTACAGGAGCAGACAGGCTTTAAAGTGGATTTGTGT
 >aroe-2
 GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
 AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATAAGCTGAGCGATGAA
 GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
 AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
 ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
 GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
 CTAGACAAGTTACAGGAGCAGACAGGTTTTAAAGTGGATTTGTGT
 
 Note that some MLST databases use a different character (e.g. ‘_’) to separate the locus label (‘aroe’) from the allele number (‘1’), so it might be aroe_1, aroe_2 rather than aroe-1, aroe-2. This is OK but SRST assumes by default that a dash ‘-’ is used, so if it is anything other than this you need to specify it in the SRST command via the -n argument.
 
 
-(2) Download reference sequence (fasta format, .fna) from:
+**(2) Download reference sequence** (fasta format, .fna) from:
 
 	ftp://ftp.ncbi.nih.gov/genomes/Bacteria/
 
 
-(3) Generate inputs. Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):
+**(3) Generate inputs.** Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):
 
 	python getFlanksMLST.py -d ref.fna *tfa > summary.txt
 
 
 Note dependencies can be obtained from:
 Biopython 	http://biopython.org/wiki/Download
 BLAST+	ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
 EMBOSS	http://emboss.sourceforge.net/
 
 
 The outputs of the script, which are required for running SRST, can be generated in other ways too. These are:
 
-(a) Flanking sequences. A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.
+***(a) Flanking sequences.*** A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.
 
 e.g. S. pneumoniae aroe_flanks.fasta:
+
 >up
 AGTTGTTGCCAATCCTATTAAGCATTCTATTTCTCCCTTCATCCACAATAGAGCCTTTGA
 GGCGACAGCTACCAACGGTGCTTATGTGGCTTGGGAGATT
 >down
 GTATGCTTTAGAAAATGTTTCTGAACTGCAAGCAAGGATTGTCGAGTCGGATTTACTGGT
 CAATGCCACCAGTGTGGGCATGGATGGTCAATCATCCCCA
 
 Note the flanks need to be in the same orientation as the MLST locus, so if the MLST locus sequence is on the forward strand, both flanks need to be on the forward strand too. Basically, SRST will reconstruct a single sequence that extends from 100 bp upstream of the locus to 100 bp downstream, and it will need to be able to do this by concatenating the ‘up’ sequence, the locus variant sequence (from the *.tfa files downloaded in step 1) and the ‘down’ sequence.
 
-(b) Summary file. A tab-delimited text file telling SRST where to find these sequences for each locus:
+***(b) Summary file.*** A tab-delimited text file telling SRST where to find these sequences for each locus:
 
 aroe	aroe.tfa	aroe_flanks.fasta
 ddl_	ddl_.tfa	ddl__flanks.fasta
 gdh_	gdh_.tfa	gdh__flanks.fasta
 gki_	gki_.tfa	gki__flanks.fasta
 recP	recP.tfa	recP_flanks.fasta
 spi_	spi_.tfa	spi__flanks.fasta
 xpt_	xpt_.tfa	xpt__flanks.fasta
 
-(c) Note the getFlanks.py script will also output a genbank file containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
-
-
-(4) Run SRST script (requires Python 2.6.4, BWA, Samtools):
+(c) Note the getFlanks.py script will also output a ***genbank file*** containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).
+
+
+**(4) Run SRST script** (requires Python 2.6.4, BWA, Samtools):
 
 python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt
 
 e.g.:
 python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt
 
-The main argument (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.
-
-Other required arguments:
+***The main argument*** (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.
+
+***Other required arguments:***
 -P	Summary file, tab-delimited, see 3b
 -d	MLST profiles database, tab-delimited, see 1a
 -l	Name of log file to record verbose output (default, stderr)
 
-Optional arguments:
+***Optional arguments:***
 -n	Separator for allele names (default, ‘-’)
 -s	Score cut-off value (default, 10)
 -w	Name of working directory to store mapping data (default, ‘tmp’)
 -p	Flag to indicate data is paired (note info below re paired reads) 
 -i	Insert size for paired reads (bam default, 500)
 -V	Flag to switch on storing of all output (otherwise temporary files created by SRST are removed)
 -b	Path to bwa (default, ‘bwa’)
 -t 	Path to samtools (default, ‘samtools’)
 
-Paired data:
+***Paired data:***
 If forward & reverse reads are available, this can be specified using the ‘-p’ flag. It is assumed forward and reverse reads will be adjacent to each other on the command line, e.g. ‘sample1_1.fastq sample1_2.fastq sample2_1.fastq sample2_2.fastq’.
 
 Dependencies can be downloaded from:

WikiPage User guide modified by Kat Holt

Kat Holt — Mon, 30 Jan 2012 23:56:34 -0000

**(1) Download latest MLST data** (sequences and profiles) from: http://pubmlst.org/data/

(a) ***ST profiles***. This should be tab-delimited text, with the first column indicating the ST and subsequent columns indicating the locus variants that make up the ST. The first row should indicate the label or name of each locus.

E.g. for S. pneumoniae (http://pubmlst.org/data/profiles/spneumoniae.txt) the file looks like this:

ST | aroe | gdh_ | gki_ | recP | spi_ | xpt_ | ddl_
---------------------------------------------------
1|1|1|1|1|1|1|1
2	1	1	4	1	18	13	18
3	1	5	1	8	14	11	14

(b) ***Locus variant sequences*** are in fasta format, one fasta file per locus. If downloaded from the above link, they will have extension ‘.tfa’ (if you got them elsewhere they might have another extension, this is fine). The file names, excluding the extension, must be the locus label and match exactly the labels used in the profiles file.

E.g. for S. pneumoniae, you need to download http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa

http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa

http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa

http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa

http://pubmlst.org/data/alleles/spneumoniae/recP.tfa

http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa

http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa

These files will contain fasta sequences, one fasta entry per variant, labeled with the allele number, aroe-1, aroe-2, like this example from S. pneumoniae aroe.tfa:

>aroe-1
GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATGAGCTAAGCGATGAA
GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
CTAGACAAGTTACAGGAGCAGACAGGCTTTAAAGTGGATTTGTGT
>aroe-2
GAAGCGAGTGACTTGGCAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATC
AATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATAAGCTGAGCGATGAA
GCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATAT
AATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAG
ATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGAT
GGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTAC
CTAGACAAGTTACAGGAGCAGACAGGTTTTAAAGTGGATTTGTGT

Note that some MLST databases use a different character (e.g. ‘_’) to separate the locus label (‘aroe’) from the allele number (‘1’), so it might be aroe_1, aroe_2 rather than aroe-1, aroe-2. This is OK but SRST assumes by default that a dash ‘-’ is used, so if it is anything other than this you need to specify it in the SRST command via the -n argument.


(2) Download reference sequence (fasta format, .fna) from:

	ftp://ftp.ncbi.nih.gov/genomes/Bacteria/


(3) Generate inputs. Using these downloaded files and the supplied script, extract 100 bp flanking sequences and prepare the input file telling SRST where to find all the sequences it needs (this script requires Python 2.6.4, Biopython, BLAST+, EMBOSS):

	python getFlanksMLST.py -d ref.fna *tfa > summary.txt


Note dependencies can be obtained from:
Biopython 	http://biopython.org/wiki/Download
BLAST+	ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
EMBOSS	http://emboss.sourceforge.net/


The outputs of the script, which are required for running SRST, can be generated in other ways too. These are:

(a) Flanking sequences. A set of fasta files, one for each MLST locus, containing the upstream and downstream 100 bp sequences (flanking sequences) for that locus. Each file must be labeled: “[locuslabel]_flanks.fasta” (see examples in table below) and contain two entries labeled ‘up’ and ‘down’.

e.g. S. pneumoniae aroe_flanks.fasta:
>up
AGTTGTTGCCAATCCTATTAAGCATTCTATTTCTCCCTTCATCCACAATAGAGCCTTTGA
GGCGACAGCTACCAACGGTGCTTATGTGGCTTGGGAGATT
>down
GTATGCTTTAGAAAATGTTTCTGAACTGCAAGCAAGGATTGTCGAGTCGGATTTACTGGT
CAATGCCACCAGTGTGGGCATGGATGGTCAATCATCCCCA

Note the flanks need to be in the same orientation as the MLST locus, so if the MLST locus sequence is on the forward strand, both flanks need to be on the forward strand too. Basically, SRST will reconstruct a single sequence that extends from 100 bp upstream of the locus to 100 bp downstream, and it will need to be able to do this by concatenating the ‘up’ sequence, the locus variant sequence (from the *.tfa files downloaded in step 1) and the ‘down’ sequence.

(b) Summary file. A tab-delimited text file telling SRST where to find these sequences for each locus:

aroe	aroe.tfa	aroe_flanks.fasta
ddl_	ddl_.tfa	ddl__flanks.fasta
gdh_	gdh_.tfa	gdh__flanks.fasta
gki_	gki_.tfa	gki__flanks.fasta
recP	recP.tfa	recP_flanks.fasta
spi_	spi_.tfa	spi__flanks.fasta
xpt_	xpt_.tfa	xpt__flanks.fasta

(c) Note the getFlanks.py script will also output a genbank file containing your reference genome and features indicating the detected positions of the MLST loci and the flanking sequences that were extracted. This is not required for SRST but is handy to check that all the sequences look correct, e.g. by examining in Artemis (http://www.sanger.ac.uk/resources/software/artemis/).


(4) Run SRST script (requires Python 2.6.4, BWA, Samtools):

python SRST.py -P summary.txt -d db.txt -l log.txt *fastq.gz > out.txt

e.g.:
python SRST.py -P spneumo_inputs.txt -d spneumo.txt -l spneumo_log.txt *fastq.gz > spneumo_SRST.txt

The main argument (unflagged) is the list of fastq files containing sequence reads to process. These can stay gzipped if desired.

Other required arguments:
-P	Summary file, tab-delimited, see 3b
-d	MLST profiles database, tab-delimited, see 1a
-l	Name of log file to record verbose output (default, stderr)

Optional arguments:
-n	Separator for allele names (default, ‘-’)
-s	Score cut-off value (default, 10)
-w	Name of working directory to store mapping data (default, ‘tmp’)
-p	Flag to indicate data is paired (note info below re paired reads)
-i	Insert size for paired reads (bam default, 500)
-V	Flag to switch on storing of all output (otherwise temporary files created by SRST are removed)
-b	Path to bwa (default, ‘bwa’)
-t 	Path to samtools (default, ‘samtools’)

Paired data:
If forward & reverse reads are available, this can be specified using the ‘-p’ flag. It is assumed forward and reverse reads will be adjacent to each other on the command line, e.g. ‘sample1_1.fastq sample1_2.fastq sample2_1.fastq sample2_2.fastq’.

Dependencies can be downloaded from:
http://samtools.sourceforge.net/
http://bio-bwa.sourceforge.net/

Note if your installation means these programs are not accessible to SRST using ‘bwa’ and ‘samtools’, you will need to provide the path
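
To illustrate the allele-name separator handled by the -n argument (step 1b), here is a small sketch of splitting an allele label into its locus and allele number. It is an assumption about how such labels could be split, not SRST's actual implementation: `split_allele_name` is a hypothetical helper, `ddl__3` is a made-up label, and the code is written for Python 3. Splitting on the last occurrence of the separator matters because locus labels such as ‘ddl_’ can themselves contain an underscore.

```python
def split_allele_name(name, sep="-"):
    """Split e.g. 'aroe-1' into ('aroe', 1) using the LAST separator."""
    locus, _, number = name.rpartition(sep)
    if not locus or not number.isdigit():
        raise ValueError(
            "cannot split %r on separator %r" % (name, sep))
    return locus, int(number)


print(split_allele_name("aroe-1"))           # default dash separator
print(split_allele_name("aroe_2", sep="_"))  # database using underscores
print(split_allele_name("ddl__3", sep="_"))  # hypothetical label; locus is 'ddl_'
```

A label that does not contain the expected separator raises `ValueError`, which is one way a wrong -n setting could be surfaced early.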