Genome Downloader Code

Downloads genome data from NCBI based on search terms.

Brought to you by: alvesjmp

File	Date	Author	Commit
README.txt	2017-10-18	J.M.P. Alves	[3e5aca] Complete rewrite of genome_downloader.pl to con...
genome_downloader.pl	2017-10-18	J.M.P. Alves	[3e5aca] Complete rewrite of genome_downloader.pl to con...
mapping_hasher	2017-10-18	J.M.P. Alves	[3e5aca] Complete rewrite of genome_downloader.pl to con...

Read Me

Genome Downloader
-----------------

* SYNOPSIS

  This program was designed to download, from NCBI databases, all genomic information 
belonging to a specific taxon. Data from other genomic databases (e.g., EMBL, DDBJ,
TriTrypDB, etc.) can NOT be downloaded.


* INSTALLATION

  Place the files genome_downloader.pl and mapping_hasher in a directory listed in your
$PATH environment variable. Run "echo $PATH" (without the quotes) to see which directories
are listed on your system.
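
  The installation step above can be sketched as a small shell function. This is only
an illustration: the function name "install_tools" and the default target "$HOME/bin"
are assumptions, not part of the distribution.

```shell
# Sketch: copy the two programs into a directory and make sure it is on $PATH.
# "install_tools" and the default "$HOME/bin" are illustrative assumptions.
install_tools() {
    src_dir=$1
    dest_dir=${2:-"$HOME/bin"}
    mkdir -p "$dest_dir" || return 1
    for f in genome_downloader.pl mapping_hasher; do
        cp "$src_dir/$f" "$dest_dir/" || return 1
        chmod +x "$dest_dir/$f" || return 1
    done
    # Add the directory to PATH (current session only) if it is not there yet.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) PATH="$dest_dir:$PATH"; export PATH ;;
    esac
}

# Example: install_tools . "$HOME/bin"
```

  The PATH change only lasts for the current shell session. To make it permanent, the
same directory would have to be added in a shell startup file.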

  If you get a "bad interpreter" error when trying to run genome_downloader.pl, you will 
need to edit the first line of both programs (using any text editor such as nano or
gedit). Currently, that line reads:

#!/usr/bin/env perl

  Run "which env" to see where env is installed in your system. If you see, for example,
"/bin/env", then change those first lines to:

#!/bin/env perl
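
  For those who prefer not to edit the files by hand, the same change can be sketched
in shell. The helper "fix_shebang" is hypothetical. It takes the script to fix and,
optionally, the path to env (by default, whatever "command -v env" reports).

```shell
# Sketch: rewrite the first line of a script to point at the local env.
# "fix_shebang" is a hypothetical helper, not part of the distribution.
fix_shebang() {
    script=$1
    env_path=${2:-$(command -v env)}
    [ -n "$env_path" ] || { echo "env not found" >&2; return 1; }
    tmp_file=$(mktemp) || return 1
    # Replace line 1 only; every other line is copied unchanged.
    { printf '#!%s perl\n' "$env_path"; tail -n +2 "$script"; } > "$tmp_file" || return 1
    # Write back into the original file so that it keeps its permissions.
    cat "$tmp_file" > "$script" && rm -f "$tmp_file"
}

# Example: fix_shebang genome_downloader.pl; fix_shebang mapping_hasher
```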


* OPERATING SYSTEM

  This program was written to run on Linux-based systems. It might work on other
POSIX-like systems such as Mac OS, or on Windows systems running Cygwin, as long as all
dependencies are present, but it has not been tested on either. Support will only be
provided for Linux.


* DESCRIPTION

  This program was designed to download all genomes in the NCBI databases belonging to a 
specific taxon. It uses NCBI's taxonomic information database and genome assembly list
to determine which sequences to download. Downloads are performed by wget (curl is 
currently not supported). An accessory program called mapping_hasher (included) is 
also needed.
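
  To illustrate how the genome assembly list can drive the selection, here is a minimal
sketch. It is not the program's actual code. It assumes the tab-separated layout of
NCBI's assembly_summary_genbank.txt, with the organism name in column 8 and the FTP path
in column 20 (check the header line of your copy). The name "find_assemblies" is
hypothetical.

```shell
# Sketch: list FTP paths of assemblies whose organism name contains a term.
# Assumes organism_name is column 8 and ftp_path is column 20 of the table.
find_assemblies() {
    term=$1
    table=$2
    awk -F '\t' -v term="$term" '
        /^#/ { next }                                     # skip header/comment lines
        index(tolower($8), tolower(term)) { print $20 }   # case-insensitive substring
    ' "$table"
}

# Example: find_assemblies Streptococcus assembly_summary_genbank.txt
```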

  Previously downloaded taxonomic information can be placed in a central directory for
later reuse (for speed), or regenerated every time the program is run in a directory 
that does not contain the files called "n2t" and "nodes". Given how quickly new genomes
are being produced nowadays, it is probably worth waiting a little longer to regenerate
these files.

  genome_downloader.pl can limit downloads to certain kinds of files, if so specified
by the user (e.g., only .faa files, or only .fna and .gbff files).
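
  As a rough illustration of what a per-file download involves, the sketch below builds
the URL of a single file from an assembly's FTP path. It assumes NCBI's naming scheme,
in which each file is named after its directory plus a suffix and ".gz". The helper name
"assembly_file_url" is hypothetical, and this is not necessarily how genome_downloader.pl
builds its URLs.

```shell
# Sketch: build the URL of one file inside an assembly directory.
# Assumes files are named <directory name>_<suffix>.gz (e.g. ..._genomic.fna.gz).
assembly_file_url() {
    ftp_path=${1%/}            # drop a trailing slash, if any
    suffix=$2                  # e.g. genomic.fna or protein.faa
    printf '%s/%s_%s.gz\n' "$ftp_path" "${ftp_path##*/}" "$suffix"
}

# Example (the result could then be handed to wget):
#   assembly_file_url \
#     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2 \
#     genomic.fna
```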


* DEPENDENCIES

  - mapping_hasher (included)
  - wget
  - Perl
  - standard system tools such as tar, less, grep


* USAGE HELP

Run genome_downloader.pl -h, or see below.

genome_downloader.pl v. 1.0
-------------------------------

Usage: genome_downloader.pl -l <list_of_search_terms> [-o output_directory]

Description:

  Downloads genome data from NCBI. Genome assembly data is listed in a table (currently, 
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt) from the NCBI
FTP site. This program downloads and uses this table to decide which genomic information
to download, and where to obtain it.

  Genome data can be searched, with partial names allowed (e.g.: using "Streptococcus"
as the search term will download all genomes whose names contain the word Streptococcus). 
Downloads can be filtered to include only certain assembly levels (contigs, complete,
etc., see -a option below) or certain organisms, by NCBI taxonomic identifier (-T option). 
The latter option is especially useful when one wants to download all genomic data from a 
larger taxon, e.g. Gammaproteobacteria (in which case, the taxon ID 1236 should be used).

Options (can be bundled, e.g. -VwL):

* Mandatory:
  -l   File with list of search terms, one per line, for genome selection. If a file by 
       that name does not exist, this will be interpreted as a search term; search is NOT 
       case sensitive; use \* (backslash plus asterisk) to download all genomes available 
       (NOT RECOMMENDED without some kind of limit [such as taxonomic identifier], since
       there are more than 120 thousand of them);

* Optional:
  -o   Output directory (default: files in current directory, no directories created);
  -t   File type(s) to download (default: all), by file extension (do NOT include the .gz
       part) -- for more than one type, use a comma-delimited line, e.g.: "-t fna,gbff"
       or "-t _rna_from_genomic.fna";
  -T   Limit download by NCBI taxonomic identifier;
  -I   Directory (if any) where previously generated taxonomic information structures,
       files "n2t" and "nodes", are located (default: current directory);
  -a   Limit download to a comma-delimited list of assembly levels (possibilities: 
       "Contig", "Scaffold", "Chromosome" or "Complete Genome"; default: all levels);
  -b   Base URL for NCBI's FTP site (default: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank);
  -L   Do NOT download NCBI genome assembly list file, but use previously downloaded 
       local copy instead;
  -i   Download and display README.txt file from NCBI and exit;
  -w   Verbose wget messages (default: quiet download);
  -V   Verbose screen output, use it twice for more verbosity (default: only search terms 
       not found, if any, printed);
  -d   Print debug information, use it twice for more information;
  -v   Print program version and exit;
  -h   Print help and exit.

Copyright J.M.P. Alves, 2014-2017 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.
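
* EXAMPLES

  The invocations below are a sketch built only from the options documented above. The
output directory names are made up. "run" is a hypothetical wrapper that, by default,
only prints each command so nothing is downloaded by accident. Set DRY_RUN=0 to execute
them for real.

```shell
# Sketch: example invocations using only the options listed in the help text.
# "run" is a hypothetical wrapper; with DRY_RUN=1 (the default here) it only
# prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# All genomes whose names contain "Streptococcus", .fna and .gbff files only.
run genome_downloader.pl -l Streptococcus -t fna,gbff -o strep_genomes

# Every complete genome under Gammaproteobacteria (taxon ID 1236), verbosely.
run genome_downloader.pl -l \* -T 1236 -a "Complete Genome" -V -o gamma_genomes
```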
