MESPA Wiki

Brought to you by: ramprasadne

Manual

Manual for MESPA

Introduction
Overview
Prerequisites that should be installed on the local computer
MESPA output
Walkthrough 1: MESPA without xenobiotic filtering
1. Prepare the config file
2. Running MESPA
Walkthrough 2: MESPA with xenobiotic filtering
1. Downloading taxonomy files
2. Preparing BLAST databases
3. Running MESPA
How to cite MESPA

Introduction

MESPA is a python based pipeline that builds gene models from fragmented genome assemblies using proteins that have greater than 70% amino acid identity or higher between species. Therefore, depending upon the level of evolutionary divergence between a genome and the protein gene set this may apply to many or a small number of your genes of interest. Generally, MESPA can work on proteins from the same species or from a species that is less than 100 MY divergent.

Overview

MESPA’s central function is to efficiently and accurately scaffold genomic assemblies using protein sequences, and then provide gene models for those scaffolds, producing high quality super-scaffolds with annotations of their gene coding regions. Even in a highly fragmented genome assembly, the resulting contigs or scaffolds contain complete exons for nearly all the genes in the genome; the average metazoan exon size is < 200bp and our genome assemblies from across a wide diversity of taxa have N50’s that are always many times larger. Fragment based shotgun genome assembly also has its greatest problems with repetitive and low complexity sequence which exons lack, thereby biasing the assembled content to include exons. Using a set of protein sequences from a related species, MESPA identifies exons in the draft genome and scaffolds them into gene models, thereby improving the genomic assembly for genic regions as well as generating high quality gene model annotations (i.e. a Gene Feature Format (gff) files). The resulting gene-model scaffolds are then combined with the un-scaffolded contigs of the original genome assembly to make a revised assembly.

Prerequisites that should be installed on the local computer

Python (version 2.7.x)
Perl (version 5.18.x)
Biopython (version >1.59)
Spaln (version 2.1.4)
NCBI-Blast+ (Version >2.2.28)
Uniref90 in FASTA format (can be downloaded from http://www.uniprot.org/downloads)
A relevant taxonomy file from the Uniref database.
Genome assembly file containing contigs (or scaffolds) in fasta format
Protein gene set file containing protein sequences in fasta format
AsmQc.pl (part of MESPA package)

MESPA output

Running MESPA generates the following files in the working directory:
1. onlygenemodels.fa - Genemodels in fasta format
2. remaining_contigs.fa - Contigs not used in genemodels in FASTA format
3. scaffolds.mfa - Combination of remaining_contigs.fa and onlygenemodels.fa
4. scaffolds.gff - Gene feature information for each gene-model
5. output_summary.txt - Statistics generated during gene-model construction
6. assembly_edited.fa - Sequence from the genome assembly provided as an input to MESPA in the same order as the original file but with a reformatted header. This is done to prevent memory conflicts that sometimes appear when long contig headers are used with spaln.
7. A range of intermediate files that MESPA generates along the way, which can be used for valiadting the results are created in subdirectory named mespa_tmp

MESPA intermediate files which can be used for validating the results:

blastresults.tab - blast result file generated from blasting the protein sequences from the genome against uniref90 for xenobiotic filtering(present only when xenobiotic filtering is performed).
order_scaff.txt - Contigs order information used to build scaffolds.
timelog.txt - time log for each major step in MESPA.
list_before_scaff.txt - Alignment information for each protein query in the genome to be used for scaffolding.
list_after_scaff.txt - Alignment information for each protein query in the genome that was not used for scaffolding.
potential_duplicates.tab - Potential duplicates for a protein query
spaln_translated_proteins_assembly_edited.fa - Protein sequences generated by spaln after searching the genome file (assembly_edited.fa) for query proteins.
spaln_translated_proteins_assembly_edited_reformatted.fa - Protein sequences in spaln_translated_proteins_assembly_edited.fa are reformatted to confirm with fasta format. This file is used as input to BLAST for xenobiotic filtering.
spaln_assembly_edited.gff - Alignment information between the proteins and the genome in gff format.
spaln_assembly_edited.mb - Alignment information between the proteins and the genome in megablast format
spaln_translated_proteins_scaffolds.fa - Protein sequences generated by spaln after searching the scaffolds file (scaffolds.mfa) for query proteins.
spaln_translated_proteins_scaffolds_reformatted.fa - Protein sequences in spaln_translated_proteins_scaffolds.fa are reformatted to confirm with fasta format.
spaln_scaffolds.mb - Alignment information between the proteins and the scaffolds in megablast format
validation_statistics.tab - Coverage summary for the proteins before and after scaffolding
cov_table.tab - Coverage information for each protein found in the genome before and after scaffolding
spaln_alignment_summary.tab - Alignment information for each protein query in the genome
xeno_filtered.txt - list of xenobiotic contigs
protein_scaff_relationship.tab - List of proteins present in each scaffold.
scaffolds_for_manual_curation.txt - Scaffold pairs that have conflicting orientation information are joined together arbitrarily when making the final scaffolds. These scaffolds need to be manually curated to establish the right order and orientation.

Walkthrough 1: MESPA without xenobiotic filtering

1. Prepare the config file

Here we assume that you have successfully installed all of the prerequisite programs from above. Edit the configuration file (mespa.cfg), and alter the paths to make the specific to your local computer. It is recommended that you make a new copy for your local computer. Further, we recommend that you make a new configuration file for each run of your MESPA, so that you can keep track of the parameters used in your runs.

Note that for Walkthrough 1, you do not need to set blast and xenobiotic filtering parameters

An example config file is shown below (mespa.cfg):

[Required]
path_to_blastp = /data/programs/ncbi-blast-n-rmblast-2.2.28+/
path_to_spaln = /data/programs/spaln2.1.4/src/
path_to_spaln_seqdb = /data/DB/annotation/spaln_db/seqdb
path_to_spaln_table = /data/DB/annotation/spaln_db/table
path_to_AsmQC = /data/programs/scripts/
numthreads = 60 # number of blast and spaln threads [default: 2]

[Blast param]
db = /data/DB/blast/uniref90/uniref90.lat.fasta # path to database
evalue = # evalue cut-off for blast [default: 1e-5]

[xenobiotic filtering table]
taxonomyinfo = /data/projects/ram/gmodeler/taxonomy-lineage%3Ainsecta.tab # uniprot taxonomy file

[scaffolding]
spacer= # Number of N's between two contigs in a scaffold [default:100]
overlap_dist = #aa overlap between two alignments [ default 25]
disjoint_dist = #gap distance between two alignments [default 25]

[duplication detection - beta/ in development (use at your own risk)]
dup_protein_length = #minimum length of the protein to be considered for duplication event [default:300]
dup_coverage = #minimum coverage to be considered for duplication [ratio between 0-100, default:0.8]

2. Running MESPA

The working directory must contain
1. the configuration file (e.g. your_mespa.cfg)
2. genome assembly file (e.g. genome.fa)
3. protein gene set file (e.g. protein_set.fa)
Now invoke MESPA by running:

python mespa.py -a <path_to_assembly_file> -p <path_to_protein_file> -s <path_to_mespa.cfg> -c <integer>

where,
-c [1-4] tells the script which stages of MESPA you are interested in running:
1 if you would like to run the full pipeline
2 if you would to like run uniref90 BLAST elsewhere
3 if you have run BLAST elsewhere and want to continue with scaffolding
4 if you do not want to perform any xenobiotic filtering

Example script where we are using option 4 and mespa.py is located in the folder /path/to/software/:

python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 4

Walkthrough 2: MESPA with xenobiotic filtering

In addition to the steps above in Walkthrough 1, here you need to generate the taxonomy files for filtering, as well as conduct the blast steps to identify the xenobiotics.

1. Downloading taxonomy files

Select taxonomy in the search bar located on the top of the uniprot webpage (http://www.uniprot.org/) and enter the name of the classification that you want to include in your analysis. Ex: Lepidoptera

In the next page click download and in the resulting pop-up select download all and hit go. This will download the taxonomy table.

2. Preparing BLAST databases

First, the format of the headers in the uniref90 database are altered by converting the spaces to semi colons as this will help keep the information intact in the BLAST output table, otherwise the taxonomy information will be unavailable for xenobiotic filtering.

awk '{gsub(" ",";",$0);print $0}' uniref_fasta_file > outputfile

and then do,

/path_to_makeblastdb/makeblastdb -in <uniref_protein_file> -dbtype prot

This will create the necessary blast database files.
Note: This step has to be performed only once, as once you have the database files ready you can use them for further runs. But if you would like to use an updated uniref90 remember to perform these steps after you have downloaded the uniref90 file from the website.

3. Running MESPA

The working directory must contain the configuration file (e.g. your_mespa.cfg), genome assembly file (e.g. genome.fa), protein gene set file (e.g. protein_set.fa).

Now you can invoke MESPA by running:

python mespa.py -a <path_to_assembly_file> -p <path_to_protein_file> -s <path_to_mespa.cfg> -c <integer>

where:
-c [1-4] tells the script which stages of MESPA you are interested in running:
1 if you would like to run the full pipeline
2 if you would to like run uniref90 BLAST elsewhere
3 if you have run BLAST elsewhere and want to continue with scaffolding
4 if you do not want to perform any xenobiotic filtering

Note that you can run the full script, all of the BLAST steps will be conducted on your local computer and this can be a significant rate limiting step in analysis. If you are searching many proteins, we recommend running MESPA with option -c 2.

python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 2

This will cause MESPA to stop after writing the file spaln_translated_proteins_<genome_filename_prefix>_reformatted.fa to the folder where MESPA was run. This file contains protein sequences generated by spaln after searching the genome for query proteins. This is the file set that is used as input to BLAST for database searching for xenobiotic filtering. You should then run this file using this script

/path_to_blastx/blastx -db /path_to_uniref90_database/uniref_db -query spaln_genome_reformatted.fa -out blastresults.tab -evalue -max_target_seqs 1 -num_threads 40 -outfmt 6

Once this finishes, make sure that the output was written to blastresults.tab and was in tabular format.

Then, with this outfile in the main folder where you originally ran MESPA, run MESPA again, but this time with ‘-c 3’. MESPA will then just start at the point of using these BLAST results.

python /path/to/software/mespa.py -a genome.fa -p protein_set.fa -s your_mespa.cfg -c 3

NOTE: If you have run the uniref90 blast elsewhere rename the blast output as blastresults.tab and resume MESPA with -c 3