DUDes: a top-down taxonomic profiler for metagenomics
Vitor C. Piro (vitorpiro@gmail.com)
Piro, V. C., Lindner, M. S., & Renard, B. Y. (2016). DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics, 32(15), 2272–2280. http://doi.org/10.1093/bioinformatics/btw150
Requirements:
Python 3 with numpy (for DUDes.py) and pandas (for DUDesDB.py only)
Install:
Local installation
git clone https://github.com/pirovc/dudes.git
cd dudes
./DUDes.py -h
Global installation
conda install -c bioconda dudes
DUDes.py -h
or
git clone https://github.com/pirovc/dudes.git
cd dudes
python3 setup.py install
DUDes.py -h
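Before running either script, it can help to confirm that the required modules are importable. The following is a minimal sketch, not part of DUDes; `missing_modules` is a hypothetical helper name, and it only uses the Python standard library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # numpy is needed by DUDes.py; pandas only by DUDesDB.py (see Requirements).
    missing = missing_modules(["numpy", "pandas"])
    print("missing:", ", ".join(missing) if missing else "none")
```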
Usage:
- Download the pre-compiled index and database:
Info | Date | Size | Link |
---|---|---|---|
Archaea + Bacteria - RefSeq Complete Genomes | 2015-03 | 13.2 GB | https://zenodo.org/record/1036748/files/dudesdb_arc-bac_refseq-cg_201503.tar.gz |
Archaea + Bacteria - RefSeq Complete Genomes | 2017-09 | 37.7 GB | https://zenodo.org/record/1037091/files/dudesdb_arc-bac_refseq-cg_201709.tar.gz |
Fungal + Viral - RefSeq Complete Genomes | 2017-09 | 9.5 GB | https://zenodo.org/record/1037288/files/dudesdb_fun-vir_refseq-cg_201709.tar.gz |
- Unpack:
tar zxfv dudesdb_arc-bac_refseq-cg_201709.tar.gz
- Map your reads (fastq) with bowtie2 (any other mapper/index can be used - check -i parameter on DUDes.py):
bowtie2 -x dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709 --no-unal --very-fast -k 10 -1 reads.1.fq -2 reads.2.fq -S mapping_output.sam
- Run DUDes:
DUDes.py -s mapping_output.sam -d dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709.npz -o output_prefix
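The mapping and profiling steps above can also be driven from a script. This is a minimal sketch under stated assumptions: `build_commands` is a hypothetical helper (not part of DUDes), it assumes the unpacked database directory contains both the bowtie2 index and the .npz file under the same prefix as in the commands above, and it only assembles argument lists without running anything.

```python
import shlex


def build_commands(db_dir, db_name, reads1, reads2, output_prefix,
                   sam="mapping_output.sam"):
    """Assemble the bowtie2 and DUDes.py command lines used in the steps above."""
    prefix = f"{db_dir}/{db_name}"
    bowtie2 = ["bowtie2", "-x", prefix, "--no-unal", "--very-fast", "-k", "10",
               "-1", reads1, "-2", reads2, "-S", sam]
    dudes = ["DUDes.py", "-s", sam, "-d", prefix + ".npz", "-o", output_prefix]
    return [bowtie2, dudes]


if __name__ == "__main__":
    commands = build_commands("dudesdb_arc-bac_refseq-cg_201709",
                              "arc-bac_refseq-cg_201709",
                              "reads.1.fq", "reads.2.fq", "output_prefix")
    for cmd in commands:
        # Print only; to execute, pass each list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems if a file name contains spaces.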
Example with sample data:
DUDes.py -s sampledata/hiseq_accuracy_k60.sam -d sampledata/arc-bac_refseq-cg_201503.npz -o sampledata/dudes_profile_output
- The sample data is based on a set of bacterial whole-genome shotgun reads comprising 10 organisms (HiSeq - 10000 reads [1]). The read set was mapped with Bowtie2 [2] against the set of complete genome sequences (dudesdb_arc-bac_refseq-cg_201503).
Custom index and dudes database:
Index your reference file (.fasta) with bowtie2 (any other mapper/index can be used - check -i parameter on DUDes.py):
bowtie2-build -f references.fasta custom_db
Create a dudes database based on the same set of references:
[python3] DUDesDB.py -m 'av' -f references.fasta -n nodes.dmp -a names.dmp -g nucl_gb.accession2taxid -t 12 -o custom_db
- Choose the parameter -m considering the format of the headers in your reference sequences:
New NCBI header [>NC_009925.1 Acaryochloris marina MBIC11017, complete genome.] -m 'av'
Old NCBI header [>gi|158333233|ref|NC_009925.1| Acaryochloris marina MBIC11017, complete genome.] -m 'gi'
- nodes.dmp and names.dmp can be obtained from:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
- nucl_gb.accession2taxid, nucl_wgs.accession2taxid or gi_taxid_nucl.dmp.gz (depending on your reference origin) can be obtained from:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_XX.accession2taxid
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
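To decide between -m 'av' and -m 'gi' for a given reference file, the first header can be inspected programmatically. The following is a rough sketch, not the logic DUDesDB.py itself uses; `guess_reference_mode` and both regular expressions are assumptions based only on the two header examples above.

```python
import re

# Old NCBI header, e.g. ">gi|158333233|ref|NC_009925.1| ..."
_GI_HEADER = re.compile(r"^>gi\|(\d+)\|")
# New NCBI header: first token is accession.version, e.g. ">NC_009925.1 ..."
_AV_HEADER = re.compile(r"^>([^\s|]+\.\d+)(?=\s|$)")


def guess_reference_mode(header):
    """Return (mode, identifier) for a FASTA header, or (None, None) if unknown."""
    match = _GI_HEADER.match(header)
    if match:
        return "gi", match.group(1)
    match = _AV_HEADER.match(header)
    if match:
        return "av", match.group(1)
    return None, None


if __name__ == "__main__":
    for header in (
        ">NC_009925.1 Acaryochloris marina MBIC11017, complete genome.",
        ">gi|158333233|ref|NC_009925.1| Acaryochloris marina MBIC11017, complete genome.",
    ):
        print(guess_reference_mode(header))
```

Headers that match neither pattern return `(None, None)`, which signals that the identifiers probably need to be reformatted before building the database.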
Details:
DUDes.py requires two main input files to perform the taxonomic analysis:
1) a sequence alignment/map file (.sam file)
2) a database generated by DUDesDB.py (.npz file)
DUDesDB.py links taxonomic information to reference sequence identifiers (GI or accession.version). The input to the DUDesDB.py script should be the same set of reference sequences (or a subset with matching identifiers)** used to build the index of the mapping tool.
** It is possible to run DUDes on previously generated alignment/map files with a pre-compiled database (see above), or with a database generated from a different source/date/version than the one used by the mapping tool. The DUDes algorithm filters out references (and matches) not found in the DUDes database before performing the analysis. Note that some information can be lost in this case.
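To estimate how much would be filtered when the SAM file and the database come from different sources, the reference names in the SAM header can be compared against a list of identifiers. This is only a sketch: `sam_reference_names` and `split_by_database` are hypothetical helpers, the identifier set is assumed to be supplied by the user (for example, collected from the FASTA headers), and names are assumed to be directly comparable, as with accession.version headers. It does not read the .npz database.

```python
def sam_reference_names(sam_lines):
    """Collect the SN: values of all @SQ header lines of a SAM file."""
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come before the alignment records
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def split_by_database(names, database_ids):
    """Split reference names into those present in and absent from database_ids."""
    known = [name for name in names if name in database_ids]
    unknown = [name for name in names if name not in database_ids]
    return known, unknown


if __name__ == "__main__":
    # Toy header lines for illustration only.
    toy_sam = [
        "@HD\tVN:1.0\tSO:unsorted\n",
        "@SQ\tSN:NC_022650.1\tLN:55956\n",
        "@SQ\tSN:NC_009925.1\tLN:1000\n",
    ]
    names = sam_reference_names(toy_sam)
    print(split_by_database(names, {"NC_009925.1"}))
```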
Parameters:
$ DUDes.py -h
usage: DUDes.py [-h] -s <sam_file> -d <database_file> [-i <sam_format>]
[-t <threads>] [-x <taxid_start>] [-m <max_read_matches>]
[-a <min_reference_matches>] [-l <last_rank>] [-b <bin_size>]
[-o <output_prefix>] [-v]
optional arguments:
-h, --help show this help message and exit
-s <sam_file> Alignment/mapping file in SAM format. DUDes does not
depend on any specific read mapper, but it requires
header information (@SQ
SN:gi|556555098|ref|NC_022650.1| LN:55956) and
mismatch information (check -i)
-d <database_file> Database file (output from DUDesDB [.npz])
-i <sam_format> SAM file format ['nm': sam file with standard cigar
string plus NM flag (NM:i:[0-9]*) for mismatches count
| 'ex': just the extended cigar string]. Default: 'nm'
-t <threads> # of threads. Default: 1
-x <taxid_start> Taxonomic Id used to start the analysis (1 = root).
Default: 1
-m <max_read_matches>
Keep reads up to this number/percentile of matches (0:
off / 0-1: percentile / >=1: match count). Default: 0
-a <min_reference_matches>
Minimum number/percentage of supporting matches to
consider the reference (0: off / 0-1: percentage /
>=1: read number). Default: 0.001
-l <last_rank> Last considered rank [superkingdom,phylum,class,order,
family,genus,species,strain]. Default: 'species'
-b <bin_size> Bin size (0-1: percentile from the lengths of all
references in the database / >=1: bp). Default: 0.25
-o <output_prefix> Output prefix. Default: STDOUT
-v show program's version number and exit
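The -a option accepts either a fraction or an absolute count, depending on the value. The sketch below shows one reading of that help text; `resolve_min_matches` is a hypothetical helper, and treating the fraction as relative to the total number of matches is an assumption, not something confirmed by the DUDes source.

```python
def resolve_min_matches(value, total_matches):
    """Interpret a -a style value: 0 = off, 0-1 = fraction, >=1 = absolute count.

    Returns None when the filter is off, otherwise the minimum number of matches.
    Assumption: fractions are taken relative to `total_matches`.
    """
    if value < 0:
        raise ValueError("value must be >= 0")
    if value == 0:
        return None  # filter disabled
    if value < 1:
        return value * total_matches  # fraction of all matches
    return value  # already an absolute count


if __name__ == "__main__":
    for value in (0, 0.001, 50):
        print(value, "->", resolve_min_matches(value, 10000))
```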
$ DUDesDB.py -h
usage: DUDesDB.py [-h] [-m <reference_mode>] -f
[<fasta_files> [<fasta_files> ...]] -g
[<ref2tax_files> [<ref2tax_files> ...]] -n <nodes_file>
[-a <names_file>] [-o <output_prefix>] [-t <threads>] [-v]
optional arguments:
-h, --help show this help message and exit
-m <reference_mode> 'gi' uses the GI as the identifier (For headers like:
>gi|158333233|ref|NC_009925.1|) [NCBI is phasing out
sequence GI numbers in September 2016]. 'av' uses the
accession.version as the identifier (for headers like:
>NC_013791.2). Default: 'av'
-f [<fasta_files> [<fasta_files> ...]]
Reference fasta file(s) for header extraction only,
plain or gzipped - the same file used to generate the
read mapping index. Each sequence header '>' should
contain a identifier as defined in the reference mode.
-g [<ref2tax_files> [<ref2tax_files> ...]]
reference id to taxid file(s):
'gi_taxid_nucl.dmp[.gz]' --> 'gi' mode,
'*.accession2taxid[.gz]' --> 'av' mode [from NCBI
taxonomy database
ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
-n <nodes_file> nodes.dmp file [from NCBI taxonomy database
ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
-a <names_file> names.dmp file [from NCBI taxonomy database
ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
-o <output_prefix> Output prefix. Default: dudesdb
-t <threads> # of threads. Default: 1
-v show program's version number and exit
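The files passed with -g map reference identifiers to taxonomic ids. The sketch below illustrates how such files can be read; it is not the DUDesDB.py implementation. `load_ref2tax` is a hypothetical helper, and it assumes the NCBI layouts: tab-separated `accession`, `accession.version`, `taxid`, `gi` columns with a header line for *.accession2taxid, and two tab-separated columns (GI, taxid) without a header for gi_taxid_nucl.dmp.

```python
def load_ref2tax(lines, mode):
    """Build an {identifier: taxid} dict from ref2tax lines for 'av' or 'gi' mode."""
    if mode not in ("av", "gi"):
        raise ValueError("mode must be 'av' or 'gi'")
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if mode == "av":
            if fields[0] == "accession":
                continue  # skip the header line of *.accession2taxid
            mapping[fields[1]] = int(fields[2])  # accession.version -> taxid
        else:
            mapping[fields[0]] = int(fields[1])  # GI -> taxid
    return mapping


if __name__ == "__main__":
    import gzip
    import sys

    # Usage: python3 this_script.py av nucl_gb.accession2taxid[.gz]
    mode, path = sys.argv[1], sys.argv[2]
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        print(len(load_ref2tax(handle, mode)), "identifiers loaded")
```

The full NCBI files are large, so in practice it may be preferable to keep only the identifiers that occur in the reference fasta headers.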
Change log:
2017-11-08 (v0.08): - bug fixes on DUDesDB and multiple gzipped file support for fasta_files and ref2tax_files - distutils installation
2016-11-03 (v0.07): - code changed to python 3 - changed .ddb to a new and smaller database format -> .npz
2016-03-23 (v0.06): - New database format supporting GI or accession.version as an identifier (DUDesDB.py parameter -m). - Check for sam flags - Faster code for identification matrix evaluation
References:
[1] Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.
[2] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods 2012, 9(4), 357–9.