MicrobeGPS requires SAM files as input. This means that you have to map the reads to the reference genomes before you can start the MicrobeGPS analysis. The SAM format is a standard format for reporting alignments of reads to reference genomes. The majority of current read mappers support this format.
We implemented our own SAM parser, which is compatible with the SAM format specification. As a consequence, MicrobeGPS cannot load binary BAM files, as would be possible with SAMtools. MicrobeGPS calculates the mapping quality of each alignment and therefore searches for the CIGAR string and the NM tag (number of mismatches). If the NM tag is missing, MicrobeGPS will assume 0 mismatches. If the CIGAR string is missing, the behavior is undefined.
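The tag handling described above can be illustrated with a small sketch. This is not the MicrobeGPS parser itself; the function name parse_sam_line is hypothetical, and the code only assumes the standard SAM column layout (CIGAR in column 6, optional tags from column 12 onwards). It shows one way to read the CIGAR string and fall back to 0 mismatches when the NM tag is absent.

```python
def parse_sam_line(line):
    """Return (read name, reference name, CIGAR, NM) for one SAM line.

    Hypothetical helper for illustration; returns None for header
    lines (starting with '@') and for records with fewer than the
    11 mandatory SAM columns.
    """
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        return None
    read_name, ref_name, cigar = fields[0], fields[2], fields[5]
    mismatches = 0  # default used when no NM tag is present
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            mismatches = int(tag[5:])
            break
    return read_name, ref_name, cigar, mismatches
```

A line without an NM tag therefore yields a mismatch count of 0, matching the behavior described above. A missing CIGAR string (reported as * in SAM) is passed through unchanged here, so the caller has to decide how to treat it.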
The read mapping step is essential for MicrobeGPS to work correctly. Please respect the following hints:
-k mode does the trick here).
-k 80 would be appropriate.
Composing the reference genome database is as essential as the read mapping step. The following hints can help you compose your own database:
gi| followed by numbers and terminated by | or any non-number character.
Further, it is important to use an appropriate database. When in doubt, we recommend including more genomes rather than fewer. MicrobeGPS can deal with duplicate and superfluous genomes. Here you can find some popular collections of genomes:
The identifiers of the FASTA reference sequences are often cryptic and hinder interpretation. Further, there are often multiple reference sequences that belong to one organism, e.g. a chromosome and a plasmid, or multiple contigs. In these cases it is desirable to
Both are done in the Calculate Reference Table step. Here, you can choose between two options:
GI map: If your reference sequences contain GI numbers, we recommend using this option. The GI numbers must occur in the sequence identifier, start with gi|, and end with a TAB or |. For this option, you need a file that maps the GI numbers to NCBI taxid numbers. MicrobeGPS comes with a built-in GI map that comprises mappings for the NCBI bacterial reference genomes. Further, you can download a more comprehensive GI mapping file for all known bacterial sequences from the download section. The GI mapping files are excerpts from the extremely large gi_taxid_nucl.dmp file from the NCBI Taxonomy FTP site. With this option, all sequences belonging to the same organism (represented by the taxid) are grouped and the correct scientific name is fetched from the taxonomy (see below).
Mapping file: We only recommend using this approach if the reference sequences do not have the gi| tag. You need to provide a tab-separated file that maps sequence IDs to reference names. Note that multiple IDs may map to one reference name. Example:
contig_1.0001[tab]Organism One
contig_1.0002[tab]Organism One
contig_2.0001[tab]Organism Two
contig_2.0002[tab]Organism Two
contig_3.0001[tab]Another One
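The grouping performed by the two options can be sketched as follows. This is a minimal illustration, not the code used by MicrobeGPS. All function names are hypothetical, the GI map is assumed to be a two-column file (GI number, whitespace, taxid) like gi_taxid_nucl.dmp, and [tab] in the example above is assumed to stand for a real tab character.

```python
import re
from collections import defaultdict

# "gi|" followed by digits; the number ends at the first non-digit.
GI_PATTERN = re.compile(r"gi\|(\d+)")

def extract_gi(seq_id):
    """Return the GI number in a sequence identifier, or None."""
    match = GI_PATTERN.search(seq_id)
    return int(match.group(1)) if match else None

def load_gi_map(lines):
    """Parse 'GI<whitespace>taxid' lines into a dict (assumed format)."""
    gi_to_taxid = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            gi_to_taxid[int(parts[0])] = int(parts[1])
    return gi_to_taxid

def group_by_taxid(seq_ids, gi_to_taxid):
    """GI map option: group sequence IDs by the taxid of their GI."""
    groups = defaultdict(list)
    for seq_id in seq_ids:
        taxid = gi_to_taxid.get(extract_gi(seq_id))
        if taxid is not None:  # skip IDs without a known GI
            groups[taxid].append(seq_id)
    return dict(groups)

def group_by_mapping_file(lines):
    """Mapping file option: group sequence IDs by reference name."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        seq_id, tab, ref_name = line.partition("\t")
        if not tab or not ref_name.strip():
            raise ValueError("expected 'ID<TAB>name': %r" % line)
        groups[ref_name.strip()].append(seq_id.strip())
    return dict(groups)
```

In both cases the result maps one organism (a taxid or a reference name) to all of its sequences, so that a chromosome and its plasmids, or several contigs, end up in the same group.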
Taxonomic information can only be used if the GI map is applied. In that case, MicrobeGPS automatically assigns the scientific names to the corresponding taxids and allows you to place the reference genomes in a taxonomic tree. MicrobeGPS builds on the NCBI taxonomy and already comes with all necessary information. If you want to update the taxonomy information, you will have to replace the files names.dmp and nodes.dmp in the directory microbegps/data/taxonomy/ manually. In the future, we plan to provide an update script for this task.
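To give an idea of what the two taxonomy files contain, the following sketch reads scientific names and parent links from them and walks from a taxid up to the root. It is not the loader used by MicrobeGPS, and the function names are hypothetical. It assumes the NCBI taxdump layout, in which fields are separated by tab, pipe, tab and each line ends with tab, pipe.

```python
def split_dmp(line):
    """Split one names.dmp / nodes.dmp line into its fields."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")

def load_scientific_names(lines):
    """Map taxid -> scientific name (other name classes are skipped)."""
    names = {}
    for line in lines:
        fields = split_dmp(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names

def load_parents(lines):
    """Map taxid -> parent taxid from nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = split_dmp(line)
        if len(fields) >= 2:
            parents[int(fields[0])] = int(fields[1])
    return parents

def lineage(taxid, parents):
    """Return the path from taxid up to the root (its own parent)."""
    path = [taxid]
    while taxid in parents and parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path
```

With the real files, the line iterables would simply be the opened names.dmp and nodes.dmp. The lineage of each reference taxid is the kind of information needed to place a genome in a taxonomic tree.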