DUDes: a top-down taxonomic profiler for metagenomics

Vitor C. Piro (vitorpiro@gmail.com)

Piro, V. C., Lindner, M. S., & Renard, B. Y. (2016). DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics, 32(15), 2272–2280. http://doi.org/10.1093/bioinformatics/btw150

Requirements:

Python 3 and numpy (DUDes.py), plus pandas (DUDesDB.py only)

Install:

Local installation

git clone https://github.com/pirovc/dudes.git
cd dudes
./DUDes.py -h

Global installation

conda install -c bioconda dudes
DUDes.py -h

or

git clone https://github.com/pirovc/dudes.git
cd dudes
python3 setup.py install
DUDes.py -h

Usage:

  • Download the pre-compiled index and database:
    Info | Date | Size | Link
    Archaea + Bacteria - RefSeq Complete Genomes | 2015-03 | 13.2 GB | https://zenodo.org/record/1036748/files/dudesdb_arc-bac_refseq-cg_201503.tar.gz
    Archaea + Bacteria - RefSeq Complete Genomes | 2017-09 | 37.7 GB | https://zenodo.org/record/1037091/files/dudesdb_arc-bac_refseq-cg_201709.tar.gz
    Fungal + Viral - RefSeq Complete Genomes | 2017-09 | 9.5 GB | https://zenodo.org/record/1037288/files/dudesdb_fun-vir_refseq-cg_201709.tar.gz
  • Unpack:

    tar zxfv dudesdb_arc-bac_refseq-cg_201709.tar.gz

  • Map your reads (fastq) with bowtie2 (any other mapper/index can be used - check -i parameter on DUDes.py):

    bowtie2 -x dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709 --no-unal --very-fast -k 10 -1 reads.1.fq -2 reads.2.fq -S mapping_output.sam

  • Run DUDes:

    DUDes.py -s mapping_output.sam -d dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709.npz -o output_prefix
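
The mapping and profiling steps above can also be chained from a script. The following is a minimal sketch and not part of DUDes: the helper names `bowtie2_command`, `dudes_command` and `run_pipeline` are hypothetical, and the flags are only the ones shown in the commands above. It assumes bowtie2 and DUDes.py are on your PATH.

```python
import subprocess


def bowtie2_command(index_prefix, reads_1, reads_2, sam_out):
    """Build the bowtie2 call from the mapping step (paired-end reads)."""
    return [
        "bowtie2", "-x", index_prefix,
        "--no-unal", "--very-fast", "-k", "10",
        "-1", reads_1, "-2", reads_2,
        "-S", sam_out,
    ]


def dudes_command(sam_file, database_file, output_prefix):
    """Build the DUDes.py call from the profiling step."""
    return ["DUDes.py", "-s", sam_file, "-d", database_file, "-o", output_prefix]


def run_pipeline(db_prefix, reads_1, reads_2, output_prefix,
                 sam_out="mapping_output.sam"):
    """Map the reads, then profile them. db_prefix is the unpacked
    database path without extension (hypothetical convention that mirrors
    the example paths above: <prefix> for bowtie2, <prefix>.npz for DUDes)."""
    commands = (
        bowtie2_command(db_prefix, reads_1, reads_2, sam_out),
        dudes_command(sam_out, db_prefix + ".npz", output_prefix),
    )
    for cmd in commands:
        # check=True stops the pipeline if a step exits with an error
        subprocess.run(cmd, check=True)
```

For the database unpacked above, this could be called as `run_pipeline("dudesdb_arc-bac_refseq-cg_201709/arc-bac_refseq-cg_201709", "reads.1.fq", "reads.2.fq", "output_prefix")`.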

Example with sample data:

DUDes.py -s sampledata/hiseq_accuracy_k60.sam -d sampledata/arc-bac_refseq-cg_201503.npz -o sampledata/dudes_profile_output
  • The sample data is based on a set of bacterial whole-genome shotgun reads comprising 10 organisms (HiSeq - 10000 reads [1]). The read set was mapped with Bowtie2 [2] against the set of complete genome sequences (dudesdb_arc-bac_refseq-cg_201503).

Custom index and dudes database:

Index your reference file (.fasta) with bowtie2 (any other mapper/index can be used - check -i parameter on DUDes.py):

bowtie2-build -f references.fasta custom_db

Create a dudes database based on the same set of references:

DUDesDB.py -m 'av' -f references.fasta -n nodes.dmp -a names.dmp -g nucl_gb.accession2taxid -t 12 -o custom_db
  • Choose the parameter -m considering the format of the headers in your reference sequences:

    New NCBI header [>NC_009925.1 Acaryochloris marina MBIC11017, complete genome.]
        -m 'av'
    Old NCBI header [>gi|158333233|ref|NC_009925.1| Acaryochloris marina MBIC11017, complete genome.]
        -m 'gi'
    
  • nodes.dmp and names.dmp can be obtained from:

    ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
    
  • nucl_gb.accession2taxid, nucl_wgs.accession2taxid or gi_taxid_nucl.dmp.gz (depending on the origin of your references) can be obtained from:

    ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_XX.accession2taxid
    ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
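
If you are unsure which -m value fits your references, the two header styles listed above can be told apart with a short check. This is a sketch for illustration only, not code from DUDesDB.py: the function name `guess_reference_mode` and both regular expressions are assumptions based solely on the two example headers shown above.

```python
import re

# Old NCBI header, e.g. >gi|158333233|ref|NC_009925.1| ...
GI_HEADER = re.compile(r"^>gi\|\d+\|")
# New NCBI header, e.g. >NC_009925.1 ... (accession.version comes first)
AV_HEADER = re.compile(r"^>[A-Za-z0-9_]+\.\d+(\s|$)")


def guess_reference_mode(header):
    """Return 'gi', 'av', or None if the header matches neither style."""
    if GI_HEADER.match(header):
        return "gi"
    if AV_HEADER.match(header):
        return "av"
    return None


def first_header(fasta_path):
    """Return the first '>' line of a plain-text FASTA file, or None."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                return line.rstrip("\n")
    return None
```

A call such as `guess_reference_mode(first_header("references.fasta"))` then suggests the value to pass to -m. A result of None means the headers follow neither pattern and should be inspected by hand.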
    

Details:

DUDes.py requires two main input files to perform the taxonomic analysis: 1) a sequence alignment/map file (.sam file) and 2) a database generated by DUDesDB.py (.npz file).

DUDesDB.py links taxonomic information to reference sequence identifiers (GI or accession.version). The input to the DUDesDB.py script should be the same set of reference sequences (or a subset with matching identifiers)** used to build the index of the mapping tool.

** It is possible to run DUDes on previously generated alignment/map files with a pre-compiled database (see above), or with a database generated from a different source, date or version than the index of the mapping tool. Before performing the analysis, DUDes filters out references (and their matches) that are not found in the DUDes database. Note that some information can be lost in this case.
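
Before running DUDes on an alignment file from another source, it can help to check that the file carries what DUDes.py expects: @SQ header lines and mismatch information (either an NM:i tag or an extended CIGAR string, see the -i parameter). The sketch below is an illustration under those assumptions and is not taken from the DUDes code; the function name `inspect_sam` is hypothetical. It relies only on the standard SAM layout (CIGAR in column 6, optional tags from column 12 onwards).

```python
import re

NM_TAG = re.compile(r"^NM:i:\d+$")


def inspect_sam(lines):
    """Return (has_sq_header, suggested_format) for an iterable of SAM lines.

    suggested_format is 'nm' if the first mapped alignment carries an NM:i
    tag, 'ex' if its CIGAR uses the extended operators '=' or 'X', and
    None if neither kind of mismatch information is found.
    """
    has_sq = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            if line.startswith("@SQ"):
                has_sq = True
            continue
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        cigar, tags = fields[5], fields[11:]
        if cigar == "*":
            continue  # unmapped read, carries no alignment information
        if any(NM_TAG.match(tag) for tag in tags):
            return has_sq, "nm"
        if "=" in cigar or "X" in cigar:
            return has_sq, "ex"
        return has_sq, None
    return has_sq, None
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open("mapping_output.sam") as fh: print(inspect_sam(fh))`. Only the first mapped alignment is examined, so this is a quick sanity check rather than a full validation.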

Parameters:

$ DUDes.py -h

usage: DUDes.py [-h] -s <sam_file> -d <database_file> [-i <sam_format>]
                [-t <threads>] [-x <taxid_start>] [-m <max_read_matches>]
                [-a <min_reference_matches>] [-l <last_rank>] [-b <bin_size>]
                [-o <output_prefix>] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -s <sam_file>         Alignment/mapping file in SAM format. DUDes does not
                        depend on any specific read mapper, but it requires
                        header information (@SQ
                        SN:gi|556555098|ref|NC_022650.1| LN:55956) and
                        mismatch information (check -i)
  -d <database_file>    Database file (output from DUDesDB [.npz])
  -i <sam_format>       SAM file format ['nm': sam file with standard cigar
                        string plus NM flag (NM:i:[0-9]*) for mismatches count
                        | 'ex': just the extended cigar string]. Default: 'nm'
  -t <threads>          # of threads. Default: 1
  -x <taxid_start>      Taxonomic Id used to start the analysis (1 = root).
                        Default: 1
  -m <max_read_matches>
                        Keep reads up to this number/percentile of matches (0:
                        off / 0-1: percentile / >=1: match count). Default: 0
  -a <min_reference_matches>
                        Minimum number/percentage of supporting matches to
                        consider the reference (0: off / 0-1: percentage /
                        >=1: read number). Default: 0.001
  -l <last_rank>        Last considered rank [superkingdom,phylum,class,order,
                        family,genus,species,strain]. Default: 'species'
  -b <bin_size>         Bin size (0-1: percentile from the lengths of all
                        references in the database / >=1: bp). Default: 0.25
  -o <output_prefix>    Output prefix. Default: STDOUT
  -v                    show program's version number and exit
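
The -a option above accepts either a relative or an absolute value: 0 switches the filter off, a value between 0 and 1 is read as a percentage, and a value of 1 or more is read as a number of reads. The sketch below illustrates that convention only. It is not DUDes' implementation, the function name `resolve_min_matches` is hypothetical, and it assumes the relative value is taken against the total number of matches, which the help text does not state.

```python
def resolve_min_matches(value, total_matches):
    """Turn an -a style value into an absolute match threshold.

    0       -> 0 (filter off)
    0 < v<1 -> v * total_matches (ASSUMPTION: relative to all matches)
    v >= 1  -> v (already an absolute count)
    """
    if value < 0:
        raise ValueError("threshold must not be negative")
    if value == 0:
        return 0
    if value < 1:
        return value * total_matches
    return value
```

Under this assumption, the default of 0.001 applied to 10000 matches would correspond to a threshold of about 10 matches, while `-a 50` would require 50 matches regardless of the total.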

$ DUDesDB.py -h

usage: DUDesDB.py [-h] [-m <reference_mode>] -f
                  [<fasta_files> [<fasta_files> ...]] -g
                  [<ref2tax_files> [<ref2tax_files> ...]] -n <nodes_file>
                  [-a <names_file>] [-o <output_prefix>] [-t <threads>] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -m <reference_mode>   'gi' uses the GI as the identifier (For headers like:
                        >gi|158333233|ref|NC_009925.1|) [NCBI is phasing out
                        sequence GI numbers in September 2016]. 'av' uses the
                        accession.version as the identifier (for headers like:
                        >NC_013791.2). Default: 'av'
  -f [<fasta_files> [<fasta_files> ...]]
                        Reference fasta file(s) for header extraction only,
                        plain or gzipped - the same file used to generate the
                        read mapping index. Each sequence header '>' should
                        contain a identifier as defined in the reference mode.
  -g [<ref2tax_files> [<ref2tax_files> ...]]
                        reference id to taxid file(s):
                        'gi_taxid_nucl.dmp[.gz]' --> 'gi' mode,
                        '*.accession2taxid[.gz]' --> 'av' mode [from NCBI
                        taxonomy database
                        ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
  -n <nodes_file>       nodes.dmp file [from NCBI taxonomy database
                        ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
  -a <names_file>       names.dmp file [from NCBI taxonomy database
                        ftp://ftp.ncbi.nih.gov/pub/taxonomy/]
  -o <output_prefix>    Output prefix. Default: dudesdb
  -t <threads>          # of threads. Default: 1
  -v                    show program's version number and exit

Change log:

2017-11-08 (v0.08):
  • Bug fixes in DUDesDB and support for multiple gzipped files in fasta_files and ref2tax_files
  • distutils installation

2016-11-03 (v0.07):
  • Code changed to Python 3
  • Changed .ddb to a new and smaller database format: .npz

2016-03-23 (v0.06):
  • New database format supporting GI or accession.version as an identifier (DUDesDB.py parameter -m)
  • Check for SAM flags
  • Faster code for identification matrix evaluation

References:

[1] Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.

[2] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods 2012, 9(4):357–359.

Source: README.md, updated 2017-11-08