Efficient methods for estimating genetic diversity among the microorganisms are essential for understanding evolutionary history, geographic distribution, pathogenicity and virulence. In the past decades, numerous methods have been developed for typing of bacteria and fungi. Multilocus sequence typing (MLST) based DNA sequencing results, which can be easily archived and shared among different laboratories. MLST is one of the most reliable and informative method for molecular genotyping, and it has been adopted in many bacterial and fungal studies.
MLSTEasy was designed for next generation sequencing technology (PacBio CCS or Roche 454 platform) based MSLT methods. MLST-Easy, can automatically identify the barcodes and primers used in the PCR reaction, corrects sequencing errors, generates the MLST profile for each isolate, predicts the potential heterozygous locus, and outputs different alleles. The source code can be downloaded from https://github.com/cyyuan2002/mlstez.git.
MLSTEasy was written in Python, version 2.7.6. The graphic user interface (GUI) was created by PyQt4 (http://www.riverbankcomputing.com/software/pyqt/download) and Qt Designer (http://qt-project.org/doc/qt-4.8/designer-manual.html). Mac version were tested under Mac OS X 10.9, and Windows version were tested under Windows Vista and Windows 7. The software runs on IBM-compatible PC under 32/64-bit Windows, and Mac OS X 10.6+. The minimum hardware requirements for the program are:
a processor based on the Intel Pentium 4/AMD Athlon
200 MB of RAM memory
hard drive with more than 200 MB available space
Windows XP or later version, Mac OS X 10.6+ with MUSCLE installed (http://www.drive5.com/muscle/downloads.htm)
The recommended hardware requirements for the program are:
a multiple core processor based on the Intel Core 2 Due/AMD Athlon II (or higher)
4 GB of RAM memory
hard drive with more than 1 GB available space
Windows XP or later version, Mac OS X 10.6+ with MUSCLE installed (http://www.drive5.com/muscle/downloads.htm)
A new project can be created by clicking the "New Project" button on toolbar or by selecting "Project->New Project" on menu. Creating a new project requires "Sequencing Files", "Barcode File", "Primer File" and properly set of the "Parameters".
MLSTEZ supports FASTA and FASTQ data file formats. Corresponding data format needs to be selected in the "Advanced" parameters. When you use FASTQ file as input, the corresponding scoring system (phred33 or phred64) needs to be selected in the "Advanced" parameters. Users can obtain the scoring system using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) or asking the information from sequencing facility. Mutilple files with the same file format and scoring system can be used at one time.
FASTA file format must begin with the symbol '>' in the first line of the file; the sequence name is the first word after that symbol. Additional characters in this line are considered to be comments. The sequence data starts in the second line. Nucleotide data can be written in one or more lines. For more detail informaiton of FASTA Format please check http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml for more details.
>m140505
AGACTGGACCCACAGCGGGCGAGAGAAGTACAAGCCCCCGCGACTGGAGTAATTCTTAGT
GATCATTAATCTTTTCTAGACTTTGCTTGACTGAGCTTGACTCAACTTAAAACGTTTGCT
TGACCAGCCTATTAGAGCCACCGTCAGGTCGGGTCAACAACTATTCAAAGTTTGATTTGC
CCATCCCCTCTTTGACTATGCTATAAGCACACCCACTGCATACACTTGGCAGCCCCCCCC
>m140507
AAGACTGGACCCACAGCGGGCGAGAGAAGTTACAAGCCCCCCGCGACTGGAGTAATTCTT
AGTGATCATTAATCTTTTCTAGACTTTGCTTGACTGAGCTTGACTCAACTTAAAACGTTT
GCTTGACCAGCCTATTAGAGCCACCGTCAGGTCGGGTCAACAACTATTCAAAGTTTGATT
GCCCATCCCCTCTTGACTATGCTATAAGCACACCCACTGCATACACTTGGCAGCCCCCCT
>m140510
AGACTGGACCCACAGCGGGCGAGAGAAGTTACAGGCCCCCCGCGACTGGAGTAATCCTTA
TGATCATTAATCTTTTCTAGACTTTGCTTGACTGAGCTTGACTCAACTTAAAACGTTTG
CTTGACCAGCCTATTAGAGCCACCGTCAGGTCGGGTCAACAACTATTCAAAGTTTGATTG
CCCATCCCCTCTTGACTATGCTATAAGCACACCCACTGCATACACTTGGCAGCCCCCCCT
CTCACCATCCATACCGCATTTACCCATTTTTCATTCCGGCTCACTACCACTATCAAAGTC
CCCCACGACTGGAAAAGTAACAAATACTAGTACTATAATACTAATACTAATTGACTTGCT
A FASTQ file normally uses four lines per sequence. The first line begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line); the second line is the rw sequence letters; the third line begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again; and the last line encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT...
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-...
MLSTEZ uses CSV file for importing barcode information. Users can use Microsoft Excel to generate CSV file using "Save As…" and selecting "Comma Seperated Values .csv" as output format. MLSTEZ 2.0 supports asymmetric barcode design, so you can save more money and put more samples into one batch. The new barcode file contains two (symmetric design) or three columns (asymmetric design). MLSTEZ 2.0 will switch between different modes based on the columns provided in the barcode file. The second column contains the names of strains/samples related to the barcodes in the first column, and the second column contains barcode sequences, which do not contains padding sequence or universal primer.
Isolates_A,gcgctctgtgtgcagc,gcgctctgtgtgcagc
Isolates_B,agagtactacatatga,agagtactacatatga
Isolates_C,cgtgtgcatagatcgc,cgtgtgcatagatcgc
Isolates_D,atgtatctcgactgca,atgtatctcgactgca
gcgctctgtgtgcagc,StrainA
agagtactacatatga,StrainB
tcatgagtcgacacta,StrainC
CSV file is used for importing primer information in MLSTEZ. The primer file contains three columns, which are locus name, upper primer of locus and lower primer of locus. Universal primer sequence should be removed from upper/lower primer sequences.
LOCUS1,TCTAATCGAAATGGTCAAGG,CGCAGCTGTTCGTCTGGATA
LOCUS2,AATCGTCAAGGAGACCAACG,CGTCACCAGACTTGACGAAC
LOCUS3,GATGGTTATGAACGAGAGGT,CTTACAGTCAGTATCGGACT
The output folder is used to store the project file, project.nma, and other results including consensus sequence for each loci, unmapped reads and allele sequences for heterozygous loci. Same output folder cannot be used for different project, otherwise, the project file and other output files will be overwritten by the newly built project.
All the advanced parameters are automatically saved in system after first time use. User can click "Advanced" button to expand the parameter panel in order to change the settings. The parameters are set according to following instructions:
MLSTEZ has four major functions in data analysis:
After user has created a project, user can click "project setting" button in toolbar or select "Job settings" in "Project" menu to choose programs in the analysis. User can select "Run the whole process", which will run all four programs one by one. Alternatly, user also can selected one to several programs they interested in. After programs are selected, the "Run" button/menu will be enabled.
This function is used to identify barcodes and primers in the sequencing reads, and it is necessary for the other three functions. Smith-Waterman algorithm is used for identification the barcode and primer sequences in the reads based on user settings (see "Advanced parameters"). After the analysis, MLSTEZ will show "Read Length", "Alignment Ratio", "Length Range" and "Sample Stats" in the software interface.
The consensus sequences will be generated for the locus with more than minimal number of valid reads. All the consensus sequences will be output into the "Output Folder" in FASTA format automatically. Each locus will be output as single file named with "cons.LOCUSNAME.fasta", and each consensus is named with the locus name and sample name. The barcode and primer sequences have removed from the consensus, and the sequence directions have been adjusted based on the given upper and lower primers. A table of all sample loci will be shown in the main interface after analysis is completed. Double click on the the sample name shows all the generated consensus sequences for the sample, and double click on single locus shows the corresponding consensus sequence.
The reads failed to identify by barcode or primer sequences can be output by this function for further analysis. The output file is stored in "Output Folder" named with "UnmappedReads.seq". Sequences after <nobarcode> are the reads that are failed to identify barcode on one or both ends. Sequences after certain barcode sequences are the reads that are failed to identify primer sequences on one or both ends after barcode indentification.</nobarcode>
MLSTEZ can identify possible heterozygous locus based on the sequence differences among the reads. Five valid reads is the minimal requirements for the analysis. If two different sequence clusters are identified, the software will generate consensus sequences for both clusters. A possible heterozygous locus requires more than three nucleotide differences among the two concensus sequences. A table view of the result are shown in the main interface. The locus labelled and "Yes" and orange background indicate this locus might be a heterozygous locus. Double click the grid will show the consensus sequences for both alleles. The locus labelled with "NA" indicates the read number is less than the minimal requirements for this analysis. All the consensus sequences are saved in "Het.cons.fasta" under "Output Folder". The consensus sequences are named with SAMPLENAMELOCUSNAMEallele1/2.
All of the project information is saved in "project.nma" under "Output Folder" automatically. User can open a existing project by clicking "Open Project" button in toolbar or by selecting "Open Project" in "Project" menu and then select the folder that was used as "Output Folder" before. All the results will be loaded in the software, and the user can even run the analysis functions that have not been processed.
The project merge function is designed for the samples that have been sequenced more than once in different batches in order to get higher read coverage. Different project can be merged together based on the sample names. The sequence reads identified by the sample name in different project will be merged together. After the merge step, user can "generate consensus sequences", "dump the unmapped reads" and identify the heteozygous locus using the merged data.
Souce code can be downloaded from https://bitbucket.org/cyyuan2002/mlstez