MESPA is a python based pipeline that builds gene models from fragmented genome assemblies using proteins that have greater than 70% amino acid identity or higher between species. Therefore, depending upon the level of evolutionary divergence between a genome and the protein gene set this may apply to many or a small number of your genes of interest. Generally, MESPA can work on proteins from the same species or from a species that is less than 100 MY divergent.
MESPA’s central function is to efficiently and accurately scaffold genomic assemblies using protein sequences, and then provide gene models for those scaffolds, producing high quality super-scaffolds with annotations of their gene coding regions. Even in a highly fragmented genome assembly, the resulting contigs or scaffolds contain complete exons for nearly all the genes in the genome; the average metazoan exon size is < 200bp and our genome assemblies from across a wide diversity of taxa have N50’s that are always many times larger. Fragment based shotgun genome assembly also has its greatest problems with repetitive and low complexity sequence which exons lack, thereby biasing the assembled content to include exons. Using a set of protein sequences from a related species, MESPA identifies exons in the draft genome and scaffolds them into gene models, thereby improving the genomic assembly for genic regions as well as generating high quality gene model annotations (i.e. a Gene Feature Format (gff) files). The resulting gene-model scaffolds are then combined with the un-scaffolded contigs of the original genome assembly to make a revised assembly.
Running MESPA generates the following files in the working directory:
1. onlygenemodels.fa - Genemodels in fasta format
2. remaining_contigs.fa - Contigs not used in genemodels in FASTA format
3. scaffolds.mfa - Combination of remaining_contigs.fa and onlygenemodels.fa
4. scaffolds.gff - Gene feature information for each gene-model
5. output_summary.txt - Statistics generated during gene-model construction
6. assembly_edited.fa - Sequence from the genome assembly provided as an input to MESPA in the same order as the original file but with a reformatted header. This is done to prevent memory conflicts that sometimes appear when long contig headers are used with spaln.
7. A range of intermediate files that MESPA generates along the way, which can be used for valiadting the results are created in subdirectory named mespa_tmp
MESPA intermediate files which can be used for validating the results:
Here we assume that you have successfully installed all of the prerequisite programs from above. Edit the configuration file (mespa.cfg), and alter the paths to make the specific to your local computer. It is recommended that you make a new copy for your local computer. Further, we recommend that you make a new configuration file for each run of your MESPA, so that you can keep track of the parameters used in your runs.
Note that for Walkthrough 1, you do not need to set blast and xenobiotic filtering parameters
An example config file is shown below (mespa.cfg):
[Required]
path_to_blastp = /data/programs/ncbi-blast-n-rmblast-2.2.28+/
path_to_spaln = /data/programs/spaln2.1.4/src/
path_to_spaln_seqdb = /data/DB/annotation/spaln_db/seqdb
path_to_spaln_table = /data/DB/annotation/spaln_db/table
path_to_AsmQC = /data/programs/scripts/
numthreads = 60 # number of blast and spaln threads [default: 2]
[Blast param]
db = /data/DB/blast/uniref90/uniref90.lat.fasta # path to database
evalue = # evalue cut-off for blast [default: 1e-5]
[xenobiotic filtering table]
taxonomyinfo = /data/projects/ram/gmodeler/taxonomy-lineage%3Ainsecta.tab # uniprot taxonomy file
[scaffolding]
spacer= # Number of N's between two contigs in a scaffold [default:100]
overlap_dist = #aa overlap between two alignments [ default 25]
disjoint_dist = #gap distance between two alignments [default 25]
[duplication detection - beta/ in development (use at your own risk)]
dup_protein_length = #minimum length of the protein to be considered for duplication event [default:300]
dup_coverage = #minimum coverage to be considered for duplication [ratio between 0-100, default:0.8]
The working directory must contain
1. the configuration file (e.g. your_mespa.cfg)
2. genome assembly file (e.g. genome.fa)
3. protein gene set file (e.g. protein_set.fa)
Now invoke MESPA by running:
python mespa.py -a <path_to_assembly_file> -p <path_to_protein_file> -s <path_to_mespa.cfg> -c <integer>
where,
-c [1-4] tells the script which stages of MESPA you are interested in running:
1 if you would like to run the full pipeline
2 if you would to like run uniref90 BLAST elsewhere
3 if you have run BLAST elsewhere and want to continue with scaffolding
4 if you do not want to perform any xenobiotic filtering
Example script where we are using option 4 and mespa.py is located in the folder /path/to/software/:
python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 4
In addition to the steps above in Walkthrough 1, here you need to generate the taxonomy files for filtering, as well as conduct the blast steps to identify the xenobiotics.
Select taxonomy in the search bar located on the top of the uniprot webpage (http://www.uniprot.org/) and enter the name of the classification that you want to include in your analysis. Ex: Lepidoptera
In the next page click download and in the resulting pop-up select download all and hit go. This will download the taxonomy table.
First, the format of the headers in the uniref90 database are altered by converting the spaces to semi colons as this will help keep the information intact in the BLAST output table, otherwise the taxonomy information will be unavailable for xenobiotic filtering.
awk '{gsub(" ",";",$0);print $0}' uniref_fasta_file > outputfile
and then do,
/path_to_makeblastdb/makeblastdb -in <uniref_protein_file> -dbtype prot
This will create the necessary blast database files.
Note: This step has to be performed only once, as once you have the database files ready you can use them for further runs. But if you would like to use an updated uniref90 remember to perform these steps after you have downloaded the uniref90 file from the website.
The working directory must contain the configuration file (e.g. your_mespa.cfg), genome assembly file (e.g. genome.fa), protein gene set file (e.g. protein_set.fa).
Now you can invoke MESPA by running:
python mespa.py -a <path_to_assembly_file> -p <path_to_protein_file> -s <path_to_mespa.cfg> -c <integer>
where:
-c [1-4] tells the script which stages of MESPA you are interested in running:
1 if you would like to run the full pipeline
2 if you would to like run uniref90 BLAST elsewhere
3 if you have run BLAST elsewhere and want to continue with scaffolding
4 if you do not want to perform any xenobiotic filtering
Note that you can run the full script, all of the BLAST steps will be conducted on your local computer and this can be a significant rate limiting step in analysis. If you are searching many proteins, we recommend running MESPA with option -c 2.
python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 2
This will cause MESPA to stop after writing the file spaln_translated_proteins_<genome_filename_prefix>_reformatted.fa to the folder where MESPA was run. This file contains protein sequences generated by spaln after searching the genome for query proteins. This is the file set that is used as input to BLAST for database searching for xenobiotic filtering. You should then run this file using this script
/path_to_blastx/blastx -db /path_to_uniref90_database/uniref_db -query spaln_genome_reformatted.fa -out blastresults.tab -evalue -max_target_seqs 1 -num_threads 40 -outfmt 6
Once this finishes, make sure that the output was written to blastresults.tab and was in tabular format.
Then, with this outfile in the main folder where you originally ran MESPA, run MESPA again, but this time with ‘-c 3’. MESPA will then just start at the point of using these BLAST results.
python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 3
NOTE: If you have run the uniref90 blast elsewhere rename the blast output as blastresults.tab and resume MESPA with -c 3