This is a Python script that uses blastn to extract MLST information from assembled genomes or contigs.
You need to have Python installed, and blastn must be installed and accessible in your path (download BLAST+ from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).
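
If you are unsure whether blastn is accessible in your path, a quick check along the following lines may help. This is only a sketch and not part of mlstBLAST.py: the function name blastn_available is made up for illustration, and shutil.which requires Python 3.3 or later.

```python
import shutil


def blastn_available(executable="blastn"):
    """Return True if the named executable can be found on the PATH.

    Hypothetical helper for illustration; mlstBLAST.py does not provide it.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if blastn_available():
        print("blastn found at", shutil.which("blastn"))
    else:
        print("blastn not found; install BLAST+ and add it to your PATH")
```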

Usage:
python mlstBLAST.py -s Sp_summary.txt -d spneumoniae.txt genome1.fasta genome2.fasta genome3.fasta

Options:
  -h, --help            show this help message and exit
  -s SUMMARY, --summary=SUMMARY
                        text file giving paths to allele sequences (one
                        line/file per locus)
  -d DATABASE, --database=DATABASE
                        MLST profile database (col1=ST, other cols=loci, must
                        have loci names in header)
  -n NAMESEP, --namesep=NAMESEP
                        separator for allele names (either '-' (default) or '_')


Required inputs are:

(1) Locus variant sequences in fasta format, available from http://pubmlst.org/data/

E.g. for S. pneumoniae, download these files:

http://pubmlst.org/data/alleles/spneumoniae/aroe.tfa
http://pubmlst.org/data/alleles/spneumoniae/ddl_.tfa
http://pubmlst.org/data/alleles/spneumoniae/gdh_.tfa
http://pubmlst.org/data/alleles/spneumoniae/gki_.tfa
http://pubmlst.org/data/alleles/spneumoniae/recP.tfa
http://pubmlst.org/data/alleles/spneumoniae/spi_.tfa
http://pubmlst.org/data/alleles/spneumoniae/xpt_.tfa
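
If you prefer to fetch these files from Python rather than a browser, something like the sketch below could be used. The URLs are exactly those listed above and may have moved since this README was written; the function names allele_urls and download_alleles are made up for illustration and are not part of mlstBLAST.py.

```python
import os
import urllib.request

# Base URL and file names copied from the list above (assumed still valid).
BASE_URL = "http://pubmlst.org/data/alleles/spneumoniae/"
ALLELE_FILES = ["aroe.tfa", "ddl_.tfa", "gdh_.tfa", "gki_.tfa",
                "recP.tfa", "spi_.tfa", "xpt_.tfa"]


def allele_urls(base_url=BASE_URL, filenames=ALLELE_FILES):
    """Build the full download URL for each allele file."""
    return [base_url + name for name in filenames]


def download_alleles(dest_dir="."):
    """Download every allele file into dest_dir and return the saved paths.

    Hypothetical helper; nothing is downloaded until this is called.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in allele_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Calling download_alleles() with no arguments would save the seven files into the current directory.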


(2) A text file listing the locations of these fasta files (-s).

E.g. you can generate a list of the files above using this command:

ls *.tfa > Sp_summary.txt

which gives you a file called 'Sp_summary.txt' containing this list:

aroe.tfa
ddl_.tfa
gdh_.tfa
gki_.tfa
recP.tfa
spi_.tfa
xpt_.tfa
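
On systems without 'ls' (or if you would rather stay in Python), roughly the same list can be produced as sketched below. The function name write_summary is made up for illustration. Like the 'ls' command above, it writes bare file names, so it assumes you will run mlstBLAST.py from the directory that holds the .tfa files.

```python
import glob
import os


def write_summary(directory=".", summary_name="Sp_summary.txt", pattern="*.tfa"):
    """Write one allele file name per line and return the names written.

    Hypothetical helper mirroring 'ls *.tfa > Sp_summary.txt'.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    names = [os.path.basename(path) for path in paths]
    with open(os.path.join(directory, summary_name), "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return names


if __name__ == "__main__":
    for name in write_summary():
        print(name)
```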


(3) An MLST profile database, which can be downloaded from http://pubmlst.org/data/ (-d).

E.g. for S. pneumoniae, download this file:
http://pubmlst.org/data/profiles/spneumoniae.txt
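
To illustrate the layout the script expects (col1=ST, other cols=loci, loci names in the header), here is a rough sketch of how such a table could be read and used to look up an ST from a combination of alleles. It assumes the file is tab-delimited and treats every column after the first as a locus; load_profiles is a made-up name, the rows in the example are invented toy values rather than real STs, and this is not necessarily how mlstBLAST.py reads the file.

```python
import csv
import io


def load_profiles(handle):
    """Return (loci, profiles); profiles maps a tuple of alleles to an ST.

    Assumes a tab-delimited table with ST in column 1 and one column per
    locus, with the locus names given in the header line.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        profiles[tuple(row[1:len(header)])] = row[0]
    return loci, profiles


if __name__ == "__main__":
    # Invented two-locus table, for illustration only.
    toy = io.StringIO("ST\taroe\tgdh_\n1\t1\t1\n2\t1\t2\n")
    loci, profiles = load_profiles(toy)
    print(loci)
    print(profiles.get(("1", "2"), "0"))
```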


(4) Your assembled sequences, with each strain in a separate fasta/multifasta formatted file.


(5) You may need to check which delimiter is used in the locus variant sequences to separate the locus name from the variant number (-n).
Some MLST databases use a different character (e.g. '_') to separate the locus label ('aroe') from the allele number ('1'), so the names might be aroe_1, aroe_2 rather than aroe-1, aroe-2. This is OK, but the script assumes by default that a dash '-' is used, so if the files use anything else you need to specify it in the command via the -n argument.
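
One way to check the delimiter without opening each file by hand is sketched below. It assumes allele names end in the allele number, and takes the character immediately before that number as the separator; split_allele_name and separator_in_file are made-up names and are not part of mlstBLAST.py.

```python
import re

# Locus label, then a single '-' or '_', then the trailing allele number.
ALLELE_NAME = re.compile(r"^>?(?P<locus>.+)(?P<sep>[-_])(?P<number>\d+)$")


def split_allele_name(header_line):
    """Return (locus, separator, number) for a fasta header, or None."""
    tokens = header_line.split()
    if not tokens:
        return None
    match = ALLELE_NAME.match(tokens[0])
    if match is None:
        return None
    return match.group("locus"), match.group("sep"), match.group("number")


def separator_in_file(fasta_path):
    """Return the separator used by the first header line in a fasta file."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = split_allele_name(line)
                return parts[1] if parts else None
    return None


if __name__ == "__main__":
    print(split_allele_name(">aroe-1"))
    print(split_allele_name(">aroe_2"))
```

If the separator turns out to be '_', add -n _ to the mlstBLAST.py command.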



Output:

Column 1 = strain name, taken from the input files (e.g. genome1, genome2, genome3)
Column 2 = ST with perfect match to the genome
			if no perfect match is found, this will be 0
			if a novel combination of known alleles is identified, this will be given a new number
Columns 3-n = allele (locus variant) for each locus, where perfect matches to known variants were identified
Subsequent columns = closest ST and locus variants
			where no perfect match is found for a given locus, the nearest allele will be reported, followed by the % nucleotide identity of the match and the % of the sequence length covered by the match
			(if no matches with >90% identity and >90% length are found, none will be reported)
			where perfect or imperfect allele matches were obtained for all loci, the closest ST will be reported
			if a novel combination of known alleles is identified, this will be given a new number
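
The reporting rules above can be summarised in a short sketch. This is an illustration of the thresholds described in this README, not the code used inside mlstBLAST.py: the function name classify_hit is made up, and treating a "perfect" match as 100% identity over 100% of the allele length is an assumption.

```python
def classify_hit(identity, coverage, min_identity=90.0, min_coverage=90.0):
    """Classify a blastn hit against a known allele.

    identity and coverage are percentages. Returns "perfect", "closest",
    or None when the hit falls at or below the 90% thresholds.
    """
    if identity == 100.0 and coverage == 100.0:
        return "perfect"  # assumed definition of a perfect match
    if identity > min_identity and coverage > min_coverage:
        return "closest"  # reported with its % identity and % length
    return None  # nothing is reported for this locus


if __name__ == "__main__":
    for hit in [(100.0, 100.0), (98.5, 100.0), (85.0, 100.0)]:
        print(hit, classify_hit(*hit))
```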
Source: mlstBLAST_README.txt, updated 2012-08-16