Genome Downloader
-----------------
* SYNOPSIS
This program was designed to download, from NCBI databases, all genomic information
belonging to a specific taxon. Data from other genomic databases (e.g., EMBL, DDBJ,
TriTrypDB, etc.) can NOT be downloaded.
* INSTALLATION
Place the files genome_downloader.pl and mapping_hasher in a directory listed in your
$PATH environment variable. Run "echo $PATH" (without the quotes) to see which
directories those are on your system.
If you get a "bad interpreter" error when trying to run genome_downloader.pl, you will
need to edit the first line of both programs (using any text editor such as nano, gedit,
etc.). Currently, they are:
#!/usr/bin/env perl
Run "which env" to see where env is installed in your system. If you see, for example,
"/bin/env", then change those first lines to:
#!/bin/env perl
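The shebang edit described above can also be scripted. The following is a minimal sketch, not part of the distribution: the function name fix_shebang is hypothetical, and it assumes a POSIX-like shell with sed and mktemp available.

```shell
# Hypothetical helper (not shipped with genome_downloader): rewrite the first
# line of a script so that it points at the env found on this system.
fix_shebang() {
    script=$1
    env_path=$(command -v env) || return 1   # e.g. /usr/bin/env or /bin/env
    tmp=$(mktemp) || return 1
    # Replace line 1 only; every other line is copied unchanged.
    sed "1s|.*|#!${env_path} perl|" "$script" > "$tmp" &&
        cat "$tmp" > "$script"               # overwrite contents, keep file permissions
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (assumes the file is in the current directory):
# fix_shebang genome_downloader.pl
```

Writing through a temporary file and copying the contents back keeps the executable bit of the original script intact.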
* OPERATING SYSTEM
This program was written to run on Linux-based systems. It might also work on other
POSIX-like systems such as Mac OS, or on Windows systems running Cygwin, as long as all
dependencies are present, but neither has been tested. Support will only be provided
for Linux.
* DESCRIPTION
This program was designed to download all genomes on the NCBI database belonging to a
specific taxon. It uses NCBI's taxonomic information database and genome assembly list
in order to know which sequences to download. Downloads are performed by wget (curl is
currently not supported). An accessory program called mapping_hasher.pl (included) is
also needed.
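To illustrate how an assembly list can drive the selection, here is a rough shell sketch. It is NOT the program's own code (genome_downloader.pl is written in Perl and its internals are not described here). The function name list_assemblies is hypothetical, and the column positions (8 = organism name, 12 = assembly level, 20 = FTP path) are an assumption about the layout of NCBI's tab-separated assembly summary table that should be checked against the table's header line.

```shell
# Hypothetical helper, NOT the program's own code: list the FTP paths of
# assemblies whose organism name contains a search term (case-insensitive),
# optionally restricted to one assembly level.
# Assumed tab-separated columns: 8 = organism_name, 12 = assembly_level,
# 20 = ftp_path.
list_assemblies() {
    term=$1
    level=$2
    table=$3
    awk -F '\t' -v term="$term" -v level="$level" '
        /^#/ { next }                                     # skip header/comment lines
        index(tolower($8), tolower(term)) == 0 { next }   # partial, case-insensitive match
        level != "" && $12 != level { next }              # optional assembly-level filter
        { print $20 }
    ' "$table"
}

# Usage (assumes the table was downloaded beforehand):
# list_assemblies Streptococcus "Complete Genome" assembly_summary_genbank.txt
```

Passing an empty string as the second argument disables the assembly-level filter.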
Previously downloaded taxonomic information can be placed in a central directory for
later reuse (for speed), or regenerated every time the program is run in a directory
that does not contain the files called "n2t" and "nodes". Given the speed of genome
generation nowadays, it is probably worth waiting a little longer and regenerating them.
genome_downloader.pl can limit downloads to certain kinds of files, if so
specified by the user (e.g., only .faa files, or only .fna and .gbff files, etc.).
* DEPENDENCIES
- mapping_hasher (included)
- wget
- Perl
- standard system tools such as tar, less, grep
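A quick way to confirm that the external tools are installed is sketched below. The function name check_deps is hypothetical and is not part of this package; it only relies on the standard "command -v" shell builtin.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Prints one missing name per line and returns non-zero if any are missing.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check the external tools listed above
# check_deps wget perl tar less grep
```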
* USAGE HELP
Run genome_downloader.pl -h, or see below.
genome_downloader.pl v. 1.0
-------------------------------
Usage: genome_downloader.pl -l <list_of_search_terms> [-o output_directory]
Description:
Downloads genome data from NCBI. Genome assembly data is listed in a table (currently,
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt) from the NCBI
FTP site. This program downloads and uses this table to decide which genomic information
to download, and where to obtain it.
Genome data can be searched, with partial names allowed (e.g.: using "Streptococcus"
as the search term will download all genomes whose names contain the word Streptococcus).
Downloads can be filtered to include only certain assembly levels (contigs, complete,
etc., see -a option below) or certain organisms, by NCBI taxonomic identifier (-T option).
This latter option is especially useful when one wants to download all genomic data from a
larger taxon, e.g. Gammaproteobacteria (in which case, the taxon ID 1236 should be used).
Options (can be bundled, e.g. -VwL):
* Mandatory:
-l File with list of search terms, one per line, for genome selection. If a file by
that name does not exist, this will be interpreted as a search term; search is NOT
case sensitive; use \* (backslash plus asterisk) to download all genomes available
(NOT RECOMMENDED without some kind of limit [such as taxonomic identifier], there
are more than 120 thousand of them);
* Optional:
-o Output directory (default: files in current directory, no directories created);
-t File type(s) to download (default: all), by file extension (do NOT include the .gz
part) -- for more than one type, use a comma-delimited line, e.g.: "-t fna,gbff"
or "-t _rna_from_genomic.fna";
-T Limit download by NCBI taxonomic identifier;
-I Directory (if any) where previously generated taxonomic information structures,
files "n2t" and "nodes", are located (default: current directory);
-a Limit download to a comma-delimited list of assembly levels (possibilities:
"Contig", "Scaffold", "Chromosome" or "Complete Genome"; default: all levels);
-b Base URL for NCBI's FTP site (default: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank);
-L Do NOT download NCBI genome assembly list file, but use previously downloaded
local copy instead;
-i Download and display README.txt file from NCBI and exit;
-w Verbose wget messages (default: quiet download);
-V Verbose screen output, use it twice for more verbosity (default: only search terms
not found, if any, printed);
-d Print debug information, use it twice for more information;
-v Print program version and exit;
-h Print help and exit.
Copyright J.M.P. Alves, 2014-2017 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.
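* EXAMPLE
As an illustration of the options listed in the help text, the sketch below asks for the protein files (.faa) of complete Streptococcus genomes. The output directory name is arbitrary, the directory is created beforehand because this document does not say whether -o creates it, and the guard around the call is only there so the snippet fails gracefully when the program is not installed.

```shell
# Example only: the output directory name is arbitrary.
out_dir=streptococcus_proteins
mkdir -p "$out_dir"

if command -v genome_downloader.pl >/dev/null 2>&1; then
    # -l: search term (treated as a term as long as no file has this name)
    # -t: only .faa files; -a: only complete genomes; -o: output directory
    genome_downloader.pl -l Streptococcus -t faa -a "Complete Genome" -o "$out_dir"
else
    echo "genome_downloader.pl not found in PATH; see INSTALLATION" >&2
fi
```

The assembly level is quoted because it contains a space.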