Update databases

Marc Strous

This module creates and updates the metawatt databases. Briefly, it performs the following actions:

  1. It runs "hmmpress" on each .hmm file present in the database folder.
  2. It polls the NCBI assembly database for newly available reference genomes.
  3. It downloads amino acid fasta files of relevant genomes.
  4. It detects and extracts conserved single copy genes using the .hmm files present in the database folder.
  5. It updates the "reference-taxonomy.txt" file.
  6. It updates the "reference-genomes.faa" file.
  7. It creates a diamond database for the "reference-genomes.faa" file.
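
Steps 1 and 7 rely on the external tools hmmer and diamond. As a rough sketch only, the following shell function prints commands of the kind those two steps correspond to, without running them. The function name print_index_commands and its folder argument are hypothetical, and the exact options metawatt passes to these tools are an assumption.

```shell
# Print (not run) hmmpress and diamond commands for a database folder.
# Assumption: "$1" is the folder holding the .hmm files and reference-genomes.faa.
print_index_commands() {
    _db=$1
    for _hmm in "$_db"/*.hmm; do
        [ -e "$_hmm" ] || continue      # skip when no .hmm file is present
        # -f overwrites previously pressed index files
        echo "hmmpress -f $_hmm"
    done
    # diamond appends ".dmnd" to the name given with -d
    echo "diamond makedb --in $_db/reference-genomes.faa -d $_db/reference-genomes.faa"
}

print_index_commands ./databases
```

Because diamond appends ".dmnd" to the database name, the last command would produce "reference-genomes.faa.dmnd", matching the file listed under "Files generated".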

This module requires an active internet connection. Do not interrupt it while it runs: the database administration may become corrupted, in which case you may need to rebuild the database from scratch.

The key parameter for this module is "setNCBIdatabaseQuery", the query used to populate the diamond database from NCBI. By default, this parameter is set to:

superkingdom,Eukaryota,phylum,ftp;superkingdom,Archaea,genus,ftp;superkingdom,Bacteria,genus,ftp;superkingdom,Viruses,family,elink

This is a list of four queries, separated by ";". Each query consists of (a) the taxonomic level of the query, e.g. "superkingdom", (b) the query, (c) the taxonomic level at which the database will be unique, (d) the download method.

Thus, by default, metawatt queries superkingdom "Eukaryota" for new taxa that are unique at the phylum level, superkingdom "Archaea" for new taxa that are unique at the genus level, superkingdom "Bacteria" for new taxa that are unique at the genus level, and superkingdom "Viruses" for new taxa that are unique at the family level. Eukaryota, Bacteria and Archaea are downloaded via ftp, viruses via elink. Ftp is faster, but download of viruses could only be automated via elink.
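
To check a custom query before running the module, it can help to split it into its parts. The sketch below does this for the default value. The helper name describe_query and its output layout are hypothetical and only illustrate the ";" and "," separators described above.

```shell
# Split a setNCBIdatabaseQuery value into queries (";") and fields (",").
# describe_query is a hypothetical helper, not part of metawatt.
describe_query() {
    printf '%s\n' "$1" | tr ';' '\n' |
    while IFS=',' read -r level name unique method; do
        printf '%s=%s unique_at=%s via=%s\n' "$level" "$name" "$unique" "$method"
    done
}

query='superkingdom,Eukaryota,phylum,ftp;superkingdom,Archaea,genus,ftp;superkingdom,Bacteria,genus,ftp;superkingdom,Viruses,family,elink'
describe_query "$query"
```

For the default query this prints four lines, one per superkingdom, each showing the uniqueness level and the download method.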

As of March 2015, this query will yield a database of about 1500 reference genomes.


Fine tuning and adding custom genomes

The taxonomy file "reference-taxonomy.txt" lists all assemblies/genomes available at the NCBI, one taxon per line. For example:

#1046938~GCF_000380865.1~Contig;Archaea;Aenigmarchaeota;unclassified [class];unclassified [order];unclassified [family];unclassified [genus];unclassified Aenigmarchaeota

666510~GCF_000144915.1~Complete Genome;Archaea;Crenarchaeota;Thermoprotei;Acidilobales;Acidilobaceae;Acidilobus;Acidilobus saccharovorans

*1603069~GCF_000928535.1~Unavailable;Viruses;unclassified viruses;Smacovirusgroup [class];Smacovirusgroup [order];Smacovirusgroup [family];Smacovirusgroup [genus];Smacovirusgroup

Each line has three parts, separated by a "~": (a) The taxon id, (b) the sequence id, (c) the status of the genome assembly (e.g. "complete genome" or "contig"), followed by the taxonomic ranks separated by ";".

The character in front of the taxon id indicates the download status of the genome. If the line starts directly with the taxon id, the genome has been downloaded and its proteins are present in the file "reference-genomes.faa". A hash (#) means that the genome has not been downloaded because it is redundant. Redundancy is determined by the query. For example, because by default archaea are specified to be unique at the genus level, only a single genome of "Acidilobus" will be downloaded, and all subsequent "Acidilobus" genomes will be flagged as redundant. An asterisk (*) indicates the genome is "wanted": it is not redundant but could not be downloaded for some reason, for example because it was unavailable.

If you would like to have a specific genome downloaded even though it is redundant, edit the taxonomy file and replace the "#" at the start of the line with a "+". This forces the download of that genome. After a successful download, metawatt removes the "+". If you would like to remove a genome from the database, add a "#" to the beginning of its line. It will be removed from the database when you rerun the module.
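
The status prefixes can be read off mechanically from the first character of each line. The following sketch does so for the three example lines shown earlier. The helper name download_status and its labels are hypothetical, and "skipped" covers both redundant and manually removed genomes.

```shell
# Report the download status encoded by the first character of a taxonomy line.
# download_status is a hypothetical helper, not part of metawatt.
download_status() {
    case "$1" in
        '#'*)   echo "skipped" ;;      # redundant, or removed by the user
        '*'*)   echo "wanted" ;;       # not redundant, but download failed
        '+'*)   echo "forced" ;;       # download requested by the user
        [0-9]*) echo "downloaded" ;;   # no prefix: proteins are in the .faa file
        *)      echo "unknown" ;;
    esac
}

while IFS= read -r line; do
    # Print the status followed by the (prefixed) taxon id.
    printf '%s\t%s\n' "$(download_status "$line")" "${line%%~*}"
done <<'EOF'
#1046938~GCF_000380865.1~Contig;Archaea;Aenigmarchaeota;unclassified [class];unclassified [order];unclassified [family];unclassified [genus];unclassified Aenigmarchaeota
666510~GCF_000144915.1~Complete Genome;Archaea;Crenarchaeota;Thermoprotei;Acidilobales;Acidilobaceae;Acidilobus;Acidilobus saccharovorans
*1603069~GCF_000928535.1~Unavailable;Viruses;unclassified viruses;Smacovirusgroup [class];Smacovirusgroup [order];Smacovirusgroup [family];Smacovirusgroup [genus];Smacovirusgroup
EOF
```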

You can add your own genomes to the database by adding a line to "reference-taxonomy.txt" for each custom genome and then appending your genome's predicted open reading frames to "reference-genomes.faa". Make sure you do not add a hash (#) or asterisk (*) in front of the taxon id. Specify a unique taxon id that is not yet present in the NCBI database; you can enter "grep your-taxon-id reference-taxonomy.txt" on the command line to make sure it is unique. Also note that the taxon id should be a 32 bit integer value, smaller than 2147483647. For example:

999999~XXX~Complete Genome;Archaea;Your phylum;Your class;Your order;Your family;Your genus;Your species
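
A plain grep also matches the id when it occurs inside another id or a sequence id, so a stricter check can be useful. The sketch below tests the three requirements on a custom taxon id: digits only, smaller than 2147483647, and not yet used at the start of any line (with or without a status prefix). The helper name check_taxon_id and its messages are hypothetical.

```shell
# Check a candidate taxon id ($1) against a taxonomy file ($2).
# check_taxon_id is a hypothetical helper, not part of metawatt.
check_taxon_id() {
    _id=$1
    _file=$2
    # Must consist of digits only.
    case "$_id" in
        ''|*[!0-9]*) echo "not a positive integer"; return 1 ;;
    esac
    # Must be smaller than 2147483647 (length test avoids overflow).
    if [ "${#_id}" -gt 10 ] || [ "$_id" -ge 2147483647 ]; then
        echo "too large"; return 1
    fi
    # Must not already start a line, optionally after a #, * or + prefix.
    if grep -Eq "^[#*+]?${_id}~" "$_file"; then
        echo "already in use"; return 1
    fi
    echo "ok"
}

# Example with a throwaway taxonomy file holding one of the lines shown earlier.
sample=$(mktemp)
printf '%s\n' '666510~GCF_000144915.1~Complete Genome;Archaea;Crenarchaeota;Thermoprotei;Acidilobales;Acidilobaceae;Acidilobus;Acidilobus saccharovorans' > "$sample"
check_taxon_id 999999 "$sample"            # id is free
check_taxon_id 666510 "$sample" || true    # id is already taken
rm -f "$sample"
```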

Then, to "reference-genomes.faa", append the amino acid sequences of the predicted open reading frames of your genome in the following way:

>...
>999999~orf0001
MADEALYVYLEGPGATLPEQQQRNNYIFYSPVPFTLYPRGVALLYLRLSIIIPKGYVGCFFSLTDANMSGLYASSRIIHA
GHREELSVLLFNHDDRFYEGRAGDPVACLVMERLIYPSVRQATMI
>999999~orf0002
MSGSNSIMTRLRARSTSCARHHPYTRAQLPRCEENETRASMTEDHPLLPDCDTMTMHSVSCVRGLPCSASFTVLQELPIP
WDMFLNPEELKIMRRCMHLCLCCATIDIFHSQVIHGRENWVLHCHCNQQGSLQCMAGGAVLAVWFRKVILGCMINQRCPW
YRQIVNMHMPKEIMYVGSVFLRERHLIYIKLWYDGHAGAIISDMSFGWSAFNYGLLNNIVIMCCTYCKDLSEIRMRCCAH
RTRKLMLRAIKIMLQDTVDPDPINSSRTERRRQRLLVGLMRHNRPIPFSDYDSHRSSSR
>...

To do this, if the file "my-custom-genome.fasta" contains these sequences in the above format, you could enter (on the command line):

cat my-custom-genome.fasta >> reference-genomes.faa
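
Before appending, it may be worth confirming that every header in your file carries the taxon id. The sketch below lists headers that do not start with ">taxon-id~". It assumes, based on the example above, that this is the required header layout. The helper name bad_fasta_headers is hypothetical.

```shell
# List FASTA headers in file $2 that do not start with ">$1~".
# bad_fasta_headers is a hypothetical helper, not part of metawatt.
bad_fasta_headers() {
    awk -v id="$1" '/^>/ && index($0, ">" id "~") != 1' "$2"
}

# Example with a throwaway file: the second header lacks the taxon id.
sample=$(mktemp)
printf '%s\n' '>999999~orf0001' 'MADEALYVYLEGPGATLPEQQ' '>orf0002' 'MSGSNSIMTRLRARSTSCARHH' > "$sample"
bad_fasta_headers 999999 "$sample"    # prints ">orf0002"
rm -f "$sample"
```

An empty result means all headers match and the file can be appended with the cat command shown above.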

To extract the conserved single copy genes of your genome (used for treeing), untick "Extract profile reference sequences incrementally" in this module's options, and re-run this module.

Before starting this procedure, back up your current working database, consisting of "reference-genomes.faa", "reference-taxonomy.txt", "reference-genomes.faa.dmnd" and the folder "conserved-genes.fasta", so that you can restore it if something goes wrong.
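
One possible way to make such a backup is sketched below: it copies the four items into a time-stamped folder. The function name backup_database, the "metawatt-db-" folder prefix and the example paths are hypothetical; adjust them to your installation.

```shell
# Copy the database files from folder $1 into a time-stamped folder under $2.
# backup_database is a hypothetical helper, not part of metawatt.
backup_database() {
    _src=$1
    _dest=$2/metawatt-db-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$_dest" || return 1
    for _item in reference-genomes.faa reference-taxonomy.txt \
                 reference-genomes.faa.dmnd conserved-genes.fasta; do
        if [ -e "$_src/$_item" ]; then
            cp -R "$_src/$_item" "$_dest/"     # -R also copies the folder
        else
            echo "warning: $_src/$_item not found" >&2
        fi
    done
    echo "$_dest"                              # report where the backup went
}

# Usage (paths are examples only):
# backup_database ./databases ./backups
```

To restore, copy the contents of the backup folder back into the database folder while the module is not running.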


Processing of downloaded amino acid sequences

While the module runs, newly downloaded genomes are first saved in the temp folder. After all new genomes of a given query are downloaded, the module detects and extracts genes using the .hmm files present in the database folder. The module creates a separate fasta file for each of the profiles in the .hmm files. These fasta files are used by the module [Calculate bin phylogeny] to create a concatenated alignment. Finally, the module appends all newly downloaded and extracted amino acid fasta sequences to the files in the database folder. During this stage, do not interrupt the module!

If you set the parameter "extractProfileGenesIncrementally" (Extract profile reference sequences incrementally) to false, the module will extract .hmm profile hits for the complete database, rather than only for the newly downloaded sequences. This is useful if you changed existing .hmm files, added new .hmm files, or added custom genomes to your database. This takes substantially longer than an incremental run.


Runtime

Minutes for updating, an hour for initial database creation

External dependencies

This module requires diamond (version >0.7) and hmmer3.1.


Parameters (type, default)

  • Set query to populate diamond database from NCBI (setNCBIdatabaseQuery, String, "superkingdom,Eukaryota,phylum,ftp;superkingdom,Archaea,genus,ftp;superkingdom,Bacteria,genus,ftp;superkingdom,Viruses,family,elink"): Queries to poll NCBI.

  • Set processors used (setProcessorsUsed, int, 4): The number of processors/cores/threads used for computations.

  • Set temp folder (setTempFolder, String, "/temp/metawatt"): Temp folder used for intermediate files.

  • Set Diamond Block Size (setDiamondBlockSize, double, 2.0): Optimal for 32 GB of memory, adjust proportionally.

  • Set HMM evalue (setHMMevalue, double, 2.0): Maximum evalue for detection of genes.

  • Set minimum length for HMM profile hits (setHMMprofileMinLength, int, 50): Minimum length of detected genes.

  • Extract profile reference sequences incrementally (extractProfileGenesIncrementally, boolean, true): If true, only newly downloaded sequences are searched for genes.
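
The advice for "setDiamondBlockSize" amounts to scaling the default of 2.0 in proportion to the available memory relative to 32 GB. A small sketch of that arithmetic follows. The helper name diamond_block_size is hypothetical, and the result is a starting point rather than a tuned value.

```shell
# Scale the default block size (2.0 at 32 GB) to the memory given in GB as $1.
# diamond_block_size is a hypothetical helper, not part of metawatt.
diamond_block_size() {
    awk -v mem="$1" 'BEGIN { printf "%.1f\n", 2.0 * mem / 32 }'
}

diamond_block_size 64    # prints 4.0
diamond_block_size 16    # prints 1.0
```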


Files generated

  • ./databases/reference-taxonomy.txt: File with taxonomy information.

  • ./databases/reference-genomes.faa: File with amino acid fasta reference sequences.

  • ./databases/reference-genomes.faa.dmnd: Diamond database created from "reference-genomes.faa".


Related

Wiki: Calculate bin phylogeny
Wiki: Getting Started
Wiki: Pipeline modules
