Genome Downloader Code
Downloads genome data from NCBI based on search terms.
Brought to you by: alvesjmp
File | Date | Author | Commit |
---|---|---|---|
README.txt | 2017-10-18 | | [3e5aca] Complete rewrite of genome_downloader.pl to con... |
genome_downloader.pl | 2017-10-18 | | [3e5aca] Complete rewrite of genome_downloader.pl to con... |
mapping_hasher | 2017-10-18 | | [3e5aca] Complete rewrite of genome_downloader.pl to con... |
Genome Downloader
-----------------

* SYNOPSIS

This program was designed to download, from NCBI databases, all genomic
information belonging to a specific taxon. Data from other genomic
databases (e.g., EMBL, DDBJ, TriTrypDB, etc.) can NOT be downloaded.

* INSTALLATION

Place the files genome_downloader.pl and mapping_hasher in a directory
listed in your $PATH environment variable. Run "echo $PATH" (without the
quotes) to see which directories those are on your system.

If you get a "bad interpreter" error when trying to run
genome_downloader.pl, you will need to edit the first line of both
programs (using any text editor such as nano, gedit, etc.). Currently,
they are:

#!/usr/bin/env perl

Run "which env" to see where env is installed on your system. If you
see, for example, "/bin/env", then change those first lines to:

#!/bin/env perl

* OPERATING SYSTEM

This program was written to run on Linux-based systems. It might work on
other POSIX-like systems such as Mac OS, as long as all dependencies are
present, and on Windows systems running Cygwin, but neither has been
tested. Support will only be provided for Linux.

* DESCRIPTION

This program was designed to download all genomes on the NCBI database
belonging to a specific taxon. It uses NCBI's taxonomic information
database and genome assembly list in order to know which sequences to
download. Downloads are performed by wget (curl is currently not
supported). An accessory program called mapping_hasher (included) is
also needed.

Previously downloaded taxonomic information can be placed in a central
directory for later reuse (for speed), or regenerated every time the
program is run in a directory that does not contain the files called
"n2t" and "nodes". Given the speed of genome generation nowadays, it is
probably worth waiting a little longer to regenerate them.

genome_downloader.pl can limit downloads to certain kinds of files, if
so specified by the user (e.g., only .faa files, or only .fna and .gbff
files, etc.).

* DEPENDENCIES

- mapping_hasher (included)
- wget
- Perl
- standard system tools such as tar, less, grep

* USAGE HELP

Run genome_downloader.pl -h, or see below.

genome_downloader.pl v. 1.0
-------------------------------

Usage:
 genome_downloader.pl -l <list_of_search_terms> [-o output_directory]

Description:
 Downloads genome data from NCBI. Genome assembly data is listed in a
 table (currently,
 ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt)
 from the NCBI FTP site. This program downloads and uses this table to
 decide which genomic information to download, and where to obtain it.
 Genome data can be searched, with partial names allowed (e.g.: using
 "Streptococcus" as the search term will download all genomes whose
 names contain the word Streptococcus). Downloads can be filtered to
 include only certain assembly levels (contigs, complete, etc., see -a
 option below) or certain organisms, by NCBI taxonomic identifier (-T
 option). This latter option is especially useful when one wants to
 download all genomic data from a larger taxon, e.g. Gammaproteobacteria
 (in which case, the taxon ID 1236 should be used).

Options (can be bundled, e.g. -VwL):
 * Mandatory:
  -l  File with list of search terms, one per line, for genome
      selection. If a file by that name does not exist, this will be
      interpreted as a search term; search is NOT case sensitive; use \*
      (backslash plus asterisk) to download all genomes available (NOT
      RECOMMENDED without some kind of limit [such as taxonomic
      identifier], there are more than 120 thousand of them);
 * Optional:
  -o  Output directory (default: files in current directory, no
      directories created);
  -t  File type(s) to download (default: all), by file extension (do NOT
      include the .gz part) -- for more than one type, use a
      comma-delimited line, e.g.: "-t fna,gbff" or
      "-t _rna_from_genomic.fna";
  -T  Limit download by NCBI taxonomic identifier;
  -I  Directory (if any) where previously generated taxonomic
      information structures, files "n2t" and "nodes", are located
      (default: current directory);
  -a  Limit download to a comma-delimited list of assembly levels
      (possibilities: "Contig", "Scaffold", "Chromosome" or "Complete
      Genome"; default: all levels);
  -b  Base URL for NCBI's FTP site (default:
      ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank);
  -L  Do NOT download NCBI genome assembly list file, but use previously
      downloaded local copy instead;
  -i  Download and display README.txt file from NCBI and exit;
  -w  Verbose wget messages (default: quiet download);
  -V  Verbose screen output, use it twice for more verbosity (default:
      only search terms not found, if any, printed);
  -d  Print debug information, use it twice for more information;
  -v  Print program version and exit;
  -h  Print help and exit.

Copyright J.M.P. Alves, 2014-2017 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.
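
* ILLUSTRATION: SELECTING ROWS FROM THE ASSEMBLY TABLE

The selection step described above (a case-insensitive partial-name
search, optionally limited by assembly level) can be sketched roughly as
follows. This is only an illustration in Python, NOT the Perl code used
by genome_downloader.pl. It assumes that the assembly table is
tab-delimited, that its last "#" comment line names the columns, and
that columns called assembly_accession, organism_name, assembly_level
and ftp_path exist; check these against the actual file. The sample rows
and the function names read_table and select_assemblies are made up for
this example.

```python
# Made-up excerpt in the layout assumed for the assembly summary table.
# Accessions and URLs are placeholders, not real NCBI records.
SAMPLE = "\n".join([
    "# Made-up excerpt, for illustration only",
    "# " + "\t".join(["assembly_accession", "organism_name",
                      "assembly_level", "ftp_path"]),
    "\t".join(["GCA_EXAMPLE_1", "Streptococcus pyogenes",
               "Complete Genome", "ftp://example.invalid/GCA_EXAMPLE_1"]),
    "\t".join(["GCA_EXAMPLE_2", "Streptococcus mutans",
               "Contig", "ftp://example.invalid/GCA_EXAMPLE_2"]),
    "\t".join(["GCA_EXAMPLE_3", "Escherichia coli",
               "Scaffold", "ftp://example.invalid/GCA_EXAMPLE_3"]),
])


def read_table(text):
    """Return one dict per data row, keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Assumption: the last comment line before the data is the header.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def select_assemblies(rows, term, levels=None):
    """Keep rows whose organism name contains term (case-insensitive).

    If levels is given, also require the assembly level to be one of them.
    """
    needle = term.lower()
    wanted = {level.lower() for level in levels} if levels else None
    hits = []
    for row in rows:
        if needle not in row["organism_name"].lower():
            continue
        if wanted is not None and row["assembly_level"].lower() not in wanted:
            continue
        hits.append(row)
    return hits


rows = read_table(SAMPLE)
# Roughly what "-l streptococcus" selects in this sketch.
all_strep = select_assemblies(rows, "streptococcus")
# Roughly what "-l Streptococcus -a 'Complete Genome'" selects.
complete_strep = select_assemblies(rows, "Streptococcus",
                                   levels=["Complete Genome"])
print([row["assembly_accession"] for row in all_strep])
print([row["ftp_path"] for row in complete_strep])
```

Looking columns up by name rather than by position keeps the sketch
working if columns are added to the table, at the cost of depending on
the header line being present.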