Genome Database - A tool to create a local database of reference genome sequences
Usage: java -jar path/to/GenomeDatabase.jar [options]
By Marc Strous, 2016
This tool enables you to download fasta files of protein and RNA sequences encoded
in reference genomes at NCBI. You can select relevant genomes with a set of queries.
Each query has four fields, separated by commas. Examples of valid queries are:
superkingdom,Bacteria,genus,ftp
superkingdom,Archaea,genus,ftp
superkingdom,Eukaryota,phylum,ftp
superkingdom,Viruses,family,elink
The first query would download (with ftp) all available reference genomes of the
superkingdom Bacteria, limited to one genome per genus. The second query would do
the same for superkingdom Archaea. The third would download all Eukaryotic genomes,
a single representative for each phylum. The fourth would download all available
viral genomes, one representative per family, via the NCBI elink tool.
Multiple queries can be concatenated, separated by "~".
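The query syntax described above can be sketched in Java as follows. This is only
an illustration of the format, not code taken from GenomeDatabase.jar: the class
QuerySketch and the field names rank, taxon, sampleRank and method are hypothetical
labels chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that illustrates the query syntax; not part of the tool itself.
final class QuerySketch {

    // One query: four comma-separated fields (field names are assumptions).
    static final class Query {
        final String rank;        // rank of the selected taxon, e.g. "superkingdom"
        final String taxon;       // name of the selected taxon, e.g. "Bacteria"
        final String sampleRank;  // one genome is kept per taxon of this rank, e.g. "genus"
        final String method;      // download method, "ftp" or "elink"

        Query(String rank, String taxon, String sampleRank, String method) {
            this.rank = rank;
            this.taxon = taxon;
            this.sampleRank = sampleRank;
            this.method = method;
        }
    }

    // Splits a "~"-concatenated query string into individual four-field queries.
    static List<Query> parse(String queries) {
        List<Query> result = new ArrayList<>();
        for (String query : queries.split("~")) {
            String[] fields = query.trim().split(",");
            if (fields.length != 4) {
                throw new IllegalArgumentException(
                        "Expected four comma-separated fields: " + query);
            }
            result.add(new Query(fields[0].trim(), fields[1].trim(),
                                 fields[2].trim(), fields[3].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        String queries = "superkingdom,Bacteria,genus,ftp~superkingdom,Archaea,genus,ftp";
        for (Query q : parse(queries)) {
            System.out.println(q.rank + " " + q.taxon
                    + ": one genome per " + q.sampleRank + " via " + q.method);
        }
    }
}
```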
The program creates three files: a protein fasta file of all protein coding genes
of all genomes ("genome-database.faa"), a nucleotide fasta file of all RNA genes
(rRNA, tRNA, etc) of all genomes ("genome-database.fna") and a taxonomy file
("genome-taxonomy.txt") that lists all downloaded taxa.
If you run the tool in the same folder multiple times, the changes will be
incremental, i.e. information already downloaded will not be downloaded again.
This way, you can easily keep your database up to date.
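The idea of an incremental update can be pictured with the small Java sketch below.
It is an assumption about the general approach, not a description of how the tool
is implemented: the class IncrementalSketch, the method missing and the example
identifiers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of "download only what is not there yet".
final class IncrementalSketch {

    // Returns the candidates that are not yet present locally, in their original
    // order and without duplicates.
    static List<String> missing(Set<String> alreadyDownloaded, List<String> candidates) {
        Set<String> seen = new HashSet<>(alreadyDownloaded);
        List<String> todo = new ArrayList<>();
        for (String candidate : candidates) {
            // Set.add returns false when the candidate was already known.
            if (seen.add(candidate)) {
                todo.add(candidate);
            }
        }
        return todo;
    }

    public static void main(String[] args) {
        // Made-up identifiers, used only to exercise the method.
        Set<String> local = new HashSet<>(Arrays.asList("genome-A", "genome-B"));
        List<String> remote = Arrays.asList("genome-A", "genome-C", "genome-B", "genome-D");
        System.out.println("Still to download: " + missing(local, remote));
    }
}
```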
Optionally, you can use this tool to format your database for diamond searches and
to extract specific genes using an HMM profile database, with hmmsearch. These
options require that the "hmmer" programs and "diamond" are on your PATH.
==========================================
Depends:
wget
hmmer, version 3.1b (optional)
diamond (optional)
an internet connection
==========================================
Options:
-update [queries]      Updates the database by downloading newly available information
                       from the NCBI with the queries provided. Default:
                       superkingdom,Bacteria,genus,ftp~superkingdom,Archaea,genus,ftp
-dir [/path/to/dir]    Builds the database in the specified folder (if omitted, builds
                       the database in the current directory).
-hmm [/path/to/file]   Extracts the genes in the local database that hit an HMM profile.
-e [evalue]            E-value cutoff for hmm searches (default 1e-25).
-diamond [block-size]  Creates a diamond database with the specified block size. See the
                       diamond manual for the default value and for choosing the correct
                       block size.
-processors [x]        Uses x processors for creation of the diamond database.
                       Default 4.
-help                  Prints this text.
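
As an illustration of how the options combine, the Java sketch below assembles a
command line that could be passed to java.lang.ProcessBuilder. It assumes the jar
is launched with "java -jar"; the class InvocationSketch and all paths shown are
hypothetical placeholders, and the sketch only builds the argument list without
starting the tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a GenomeDatabase command line.
final class InvocationSketch {

    // Builds the argument list for an update followed by an hmm extraction.
    static List<String> buildCommand(String jarPath, String queries, String dir,
                                     String hmmFile, String evalue) {
        List<String> command = new ArrayList<>();
        // Assumption: the jar is started with "java -jar".
        command.addAll(Arrays.asList("java", "-jar", jarPath));
        command.addAll(Arrays.asList("-update", queries)); // queries joined with "~"
        command.addAll(Arrays.asList("-dir", dir));        // folder holding the database
        command.addAll(Arrays.asList("-hmm", hmmFile));    // hmm profile file
        command.addAll(Arrays.asList("-e", evalue));       // E-value cutoff
        return command;
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
                "path/to/GenomeDatabase.jar",
                "superkingdom,Bacteria,genus,ftp~superkingdom,Archaea,genus,ftp",
                "/path/to/dir",
                "/path/to/file",
                "1e-25");
        // Print the command; new ProcessBuilder(command).inheritIO().start()
        // would actually run it.
        System.out.println(String.join(" ", command));
    }
}
```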
Copyright Marc Strous, 2016