Data preparation

Martin S. Lindner

Input data

MicrobeGPS requires SAM files as input. This means you have to map the reads to the reference genomes before you can start the MicrobeGPS analysis. SAM is a standard format for reporting the alignment of reads to reference genomes, and the majority of current read mappers support it.

We implemented our own SAM parser, which is compatible with the SAM format specification. As a consequence, MicrobeGPS cannot load binary BAM files, as would be possible with SAMtools. MicrobeGPS calculates the mapping quality of each alignment and therefore looks for the CIGAR string and the NM tag (number of mismatches). If the NM tag is missing, MicrobeGPS assumes 0 mismatches. If the CIGAR string is missing, the behavior is undefined.
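The lookup described above can be illustrated with a minimal Python sketch. This is not the MicrobeGPS parser; the function name `cigar_and_mismatches` is hypothetical, and the sketch only assumes the standard SAM layout (CIGAR in column 6, optional tags from column 12 onward).

```python
import re

# Optional SAM tag of the form NM:i:<integer>
NM_TAG = re.compile(r"^NM:i:(\d+)$")

def cigar_and_mismatches(sam_line):
    """Return (cigar, nm) for one SAM alignment line, or None for a header line.

    Hypothetical helper, not part of MicrobeGPS. A missing NM tag is
    treated as 0 mismatches, mirroring the behavior described above.
    """
    if sam_line.startswith("@"):
        return None  # header lines carry no alignment
    fields = sam_line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    cigar = fields[5]  # column 6; "*" means the CIGAR is unavailable
    nm = 0             # default when no NM tag is present
    for tag in fields[11:]:
        match = NM_TAG.match(tag)
        if match:
            nm = int(match.group(1))
            break
    return cigar, nm
```

Running such a check over a SAM file before the analysis is one way to spot alignments whose CIGAR column is "*".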

Read mapping

The read mapping step is essential for MicrobeGPS to work correctly. Please respect the following hints:

  • Avoid the "best" mode of your mapper, which reports only the best match for each read. MicrobeGPS relies on the information from shared reads. Unfortunately, this mode is very common and is, for example, the default for Bowtie2 (use the -k option there to report multiple matches per read).
  • Report sufficient, but not too many, matches. As a rule of thumb, count the number of genomes in your reference database that are highly similar to each other (the largest cluster). This is roughly the maximum number of matches you should allow for each read. For example, the NCBI Bacteria genomes database contains about 80 genomes that are highly similar to E. coli, so -k 80 would be appropriate for Bowtie2.
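As an illustration of the hints above, the following Python sketch assembles a Bowtie2 call that reports up to a chosen number of matches per read and writes SAM output. The function name and the file names in the usage note are hypothetical; only the Bowtie2 options -x, -U, -k and -S are used, and the value of -k should follow the rule of thumb for your own database.

```python
def bowtie2_command(index_prefix, reads_fastq, out_sam, max_matches):
    """Build the argument list for a Bowtie2 run in -k mode.

    Hypothetical helper: index_prefix is the Bowtie2 index basename,
    reads_fastq a file with unpaired reads, out_sam the SAM output file.
    """
    if max_matches < 1:
        raise ValueError("max_matches must be a positive integer")
    return [
        "bowtie2",
        "-x", index_prefix,      # index of one reference genome bundle
        "-U", reads_fastq,       # unpaired reads
        "-k", str(max_matches),  # report up to this many matches per read
        "-S", out_sam,           # write alignments in SAM format
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, for example with `bowtie2_command("bacteria_part1", "reads.fastq", "part1.sam", 80)`, where all three names are placeholders.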

Reference genome databases

Composing the reference genome database is as essential as the read mapping step. The following hints can help you compose your own database:

  • Use high quality genomes where possible. Draft genomes that are fragmented into thousands of contigs make it harder for MicrobeGPS to calculate the quality metrics, estimate the coverage, and so on.
  • Use a unique name for each sequence (genome or contig).
  • Include the GI number in the name. The GI is a unique sequence identifier and can be mapped to the NCBI taxid. This is essential if you want to use all the taxonomy-related features of MicrobeGPS! MicrobeGPS searches for the standard pattern: gi| followed by numbers and terminated by | or any non-number character.
  • Create bundles of reference genomes. Although MicrobeGPS does not limit the number of SAM files it can analyze, bundling the genomes into a few databases will save you a lot of computation time. Note: some mappers limit the size of reference genome databases. For example, if you are using Bowtie2 together with the NCBI Bacteria database, you will have to create 3 or 4 databases and map against each one separately.
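Two of the hints above (unique names and the gi| pattern) can be checked before mapping. The sketch below is only an approximation of the pattern described in the text, not the MicrobeGPS implementation; both function names and the example header in the usage note are made up for illustration.

```python
import re

# "gi|" followed by digits; the match ends at "|" or any other non-digit
GI_PATTERN = re.compile(r"gi\|(\d+)")

def extract_gi(sequence_name):
    """Return the GI number contained in a sequence name, or None."""
    match = GI_PATTERN.search(sequence_name)
    return int(match.group(1)) if match else None

def duplicate_names(sequence_names):
    """Return the sorted list of names that occur more than once."""
    seen = set()
    duplicates = set()
    for name in sequence_names:
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)
```

For a hypothetical header such as `gi|12345|ref|XX_000001.1| Example organism`, `extract_gi` returns 12345; a name without the gi| tag yields `None`, which signals that the taxonomy-related features would not be available for that sequence.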

Further, it is important to use an appropriate database. When in doubt, we recommend including more genomes rather than fewer. MicrobeGPS can deal with duplicate and surplus genomes. Here are some popular collections of genomes:

  • NCBI RefSeq contains high-quality reference genomes
  • NCBI Bacteria is a large and up-to-date collection of bacterial genomes
  • HMP contains human-microbiome-associated genomes (bacteria, viruses, eukaryotes, archaea)

Reference genome information

The identifiers of the FASTA reference sequences are often cryptic and hinder interpretation. Further, multiple reference sequences often belong to one organism, e.g. a chromosome and a plasmid, or multiple contigs. In these cases it is desirable to

  • put sequences belonging to one organism into a single group
  • and give meaningful names to the group.

Both are done in the Calculate Reference Table step. Here, you can choose between two options:

GI map: If your reference sequences contain GI numbers, we recommend using this option. The GI number must occur in the sequence identifier, start with gi| and end with a TAB or |. For this option you need a file that maps the GI numbers to NCBI taxid numbers. MicrobeGPS comes with a built-in GI map that comprises mappings for the NCBI bacterial reference genomes. In addition, you can download a more comprehensive GI mapping file for all known bacterial sequences from the download section. The GI mapping files are excerpts from the extremely large gi_taxid_nucl.dmp file from the NCBI Taxonomy FTP site. With this option, all sequences belonging to the same organism (represented by the taxid) are grouped, and the correct scientific name is fetched from the taxonomy (see below).
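The grouping performed with a GI map can be sketched in Python as follows. This is an illustration, not MicrobeGPS code: it assumes that the mapping file has the same layout as gi_taxid_nucl.dmp (one GI and one taxid per line, separated by a TAB), and the function names are hypothetical.

```python
import re
from collections import defaultdict

def load_gi_map(lines):
    """Read 'GI<TAB>taxid' lines into a dictionary {gi: taxid}."""
    gi_to_taxid = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        gi, taxid = line.split("\t")[:2]
        gi_to_taxid[int(gi)] = int(taxid)
    return gi_to_taxid

def group_by_taxid(sequence_names, gi_to_taxid):
    """Group sequence names by taxid; unmapped names are collected under None."""
    groups = defaultdict(list)
    for name in sequence_names:
        match = re.search(r"gi\|(\d+)", name)
        taxid = gi_to_taxid.get(int(match.group(1))) if match else None
        groups[taxid].append(name)
    return dict(groups)
```

In this sketch a chromosome and a plasmid that share a taxid end up in the same group, and any sequence left under `None` either lacks the gi| tag or is not covered by the mapping file.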

Mapping file: We recommend this approach only if the reference sequences do not have the gi| tag. You need to provide a tab-separated file that maps sequence IDs to reference names. Note that multiple IDs may map to one reference name. Example:

contig_1.0001[tab]Organism One
contig_1.0002[tab]Organism One
contig_2.0001[tab]Organism Two
contig_2.0002[tab]Organism Two
contig_3.0001[tab]Another One
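To make the structure of such a mapping file concrete, here is a short Python sketch that reads it and collects the sequence IDs per reference name. It is not taken from MicrobeGPS; the function name `read_mapping` is hypothetical, and [tab] in the example above stands for a real TAB character.

```python
from collections import defaultdict

def read_mapping(lines):
    """Parse 'sequence ID<TAB>reference name' lines into {name: [IDs]}."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        sequence_id, reference_name = parts
        groups[reference_name].append(sequence_id)
    return dict(groups)
```

Applied to the five example lines, this would yield three groups: "Organism One" and "Organism Two" with two contigs each, and "Another One" with a single contig.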

Taxonomic information

Taxonomic information can only be used if the GI map is applied. In that case, MicrobeGPS automatically assigns the scientific names to the corresponding taxids and lets you place the reference genomes in a taxonomic tree. MicrobeGPS builds on the NCBI taxonomy and already ships with all necessary information. If you want to update the taxonomy information manually, replace the files names.dmp and nodes.dmp in the directory microbegps/data/taxonomy/. In the future, we plan to provide an update script for this task.
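After replacing the taxonomy files, it can be useful to verify that names.dmp is readable. The following Python sketch extracts the scientific name for each taxid; it assumes the usual NCBI taxdump layout, in which fields are separated by TAB, |, TAB and the fourth field holds the name class. The function name is hypothetical, and MicrobeGPS may read the file differently.

```python
def scientific_names(lines):
    """Return {taxid: scientific name} from the lines of a names.dmp file."""
    names = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        fields = line.split("\t|\t")
        # fields: taxid, name, unique name, name class
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names
```

A typical use would be `scientific_names(open("microbegps/data/taxonomy/names.dmp"))`, where the path is relative to the MicrobeGPS installation; an empty result suggests that the file is missing, truncated, or in a different format.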

