DeconSeq
========

DECONtamination of SEQuence data using a modified version of BWA-SW
(http://deconseq.sourceforge.net).

For more information about BWA-SW go to http://bio-bwa.sourceforge.net/


SETUP
-----

1. Create the databases used for contaminant screening

   Use command:
      bwaXXX index -p database_name -a bwtsw fasta_file_with_ref >out.txt 2>&1 &

      (XXX should be replaced by MAC or 64 based on your system architecture)

   For alternative system architectures, download BWA version 0.5.9 source code
   from http://bio-bwa.sourceforge.net/
   (http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.9.tar.bz2/download);
   extract the file with "tar -xjf bwa-0.5.9.tar.bz2"; replace files in the
   extracted BWA directory with the files in bwasw_modified_source (distributed
   with DeconSeq); run "make" in the BWA directory.

   Notes:   (i) Remove very short or very long sequences from files retrieved
                from public resources. Such sequences are most likely the
                result of misannotations and should not be included in the
                database.
           (ii) Removing duplicates may reduce the database size and speed up
                analysis. Tools such as PRINSEQ (http://prinseq.sourceforge.net)
                can assist with this task.
          (iii) The precompiled bwaMAC binary might crash on Mac OS X 10.6.8.
                On that system, compile the program from source as described
                above.
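
   As a rough illustration of note (i), the length filtering can be sketched
   with awk. The function name filter_fasta_by_length and the thresholds in the
   usage line are hypothetical; DeconSeq does not prescribe specific cutoffs,
   so choose values that suit your reference data. Note that the sketch writes
   each sequence on a single line.

```shell
# Hypothetical helper (not part of DeconSeq): read FASTA on stdin and keep
# only records whose sequence length lies within [min_len, max_len].
# Multi-line sequences are joined, so output sequences are on one line.
filter_fasta_by_length() {
    awk -v min="$1" -v max="$2" '
        # Print the buffered record if its length is within the bounds.
        function emit() {
            if (header != "" && length(seq) >= min && length(seq) <= max) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # append sequence line
        END { emit() }                                 # flush the last record
    '
}
```

   Example use (the cutoffs 200 and 10000000 are placeholders only):
      filter_fasta_by_length 200 10000000 <refs.fasta >refs_filtered.fasta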

2. Change values in DeconSeqConfig.pm (if applicable)

   DB_DIR
   TMP_DIR
   OUTPUT_DIR
   PROG_NAME
   PROG_DIR
   DBS
   DBS_DEFAULT

   Notes: (i) If the bwaXXX program is located in the same directory as the
              deconseq.pl file, please specify ./ as program directory
              (use: "constant PROG_DIR => './';").

3. Setup databases

   Download databases from: ftp://edwards.sdsu.edu:7009/deconseq/db/

   Or create your own database following the steps at:
      http://deconseq.sourceforge.net/manual.html#DB

   In the DeconSeqConfig.pm, specify all your databases as follows:

   use constant DBS => {hsref => {name => 'Human Reference GRCh37',
                                  db => 'hs_ref_GRCh37'},
                        vir => {name => 'Viral genomes',
                                db => 'virDB'}};

   In this example, you have two databases created/downloaded whose file names
   start with hs_ref_GRCh37 and virDB. As the key, you can either reuse the
   database name or specify a shorter/easier name for calling the database in
   DeconSeq. Here, the databases are called from DeconSeq with hsref and vir
   (e.g. -dbs hsref -dbs_retain vir).

   If you have multiple chunks per database (as in the 1GB database chunks on
   the FTP site), you can specify all chunks directly in the config and call
   the database with a single name. In the config file, you would separate the
   database names by commas (no spaces):

   use constant DBS => {hsref => {name => 'Human - Reference GRCh37',
                       db => 'hs_ref_GRCh37_1,hs_ref_GRCh37_2,hs_ref_GRCh37_3'},
                        vir => {name => 'Viral genomes',
                                db => 'virDB'}};

   Here, the human reference genome is split up into three database chunks. In
   the command line, simply call the database by its name (hsref) and it will
   use the three databases defined in the config.
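
   Before running DeconSeq with a multi-chunk entry, it can help to verify that
   every chunk in the comma-separated list has an index in DB_DIR. The sketch
   below is a convenience check, not part of DeconSeq: the function name
   check_db_chunks is hypothetical, and it tests only for a <chunk>.bwt file,
   on the assumption that "bwaXXX index -p <chunk>" writes one.

```shell
# Hypothetical helper (not part of DeconSeq).
# Usage: check_db_chunks DB_DIR chunk1,chunk2,...
# Succeeds only if DB_DIR/<chunk>.bwt exists for every listed chunk.
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
check_db_chunks() (
    db_dir=$1
    missing=0
    set -f        # no filename expansion while splitting the list
    IFS=,         # split the second argument on commas
    for chunk in $2; do
        if [ ! -e "$db_dir/$chunk.bwt" ]; then
            echo "missing index: $db_dir/$chunk.bwt" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
)
```

   Example use (the directory is a placeholder for your DB_DIR value):
      check_db_chunks /path/to/db hs_ref_GRCh37_1,hs_ref_GRCh37_2,hs_ref_GRCh37_3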

4. Processing large input files in a cluster environment

   Since mapping is embarrassingly parallel, there was no need to implement
   parallel processing features in DeconSeq. Splitting the input file into
   chunks and processing them in parallel on separate compute cores will get
   you the same results faster than using the multi-thread option in BWA. If
   you have access to a big cluster, you can split large input data into
   smaller chunks and submit them to the cluster. After processing the chunks,
   simply concatenate the result files with cat. (That is how DeconSeq
   processes the data for the web version.)

   A Perl script to split the input data based on file size or number of chunks
   can be found at: https://sourceforge.net/projects/deconseq/files/misc/

   The file is called splitFasta.pl and you can get details on its use with:
   perl splitFasta.pl -h

   Examples:
   perl splitFasta.pl -verbose -i file.fasta -s 2     #chunks of 2MB
   perl splitFasta.pl -verbose -i file.fasta -n 10    #10 chunks
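
   The split, process, and concatenate workflow described above can be sketched
   as follows. This is only an outline under stated assumptions: process_chunk
   is a hypothetical stand-in that copies its input, and the <chunk>.out naming
   is made up for the example. In practice you would replace its body with your
   deconseq.pl call (or a cluster job submission) and adapt the file names to
   the output DeconSeq actually writes.

```shell
# Hypothetical stand-in for the real per-chunk work. Replace the body with
# your own call, e.g. perl deconseq.pl -f "$1" -dbs <list>, and adjust the
# output file name; "$1.out" is an assumption made for this sketch.
process_chunk() {
    cp "$1" "$1.out"
}

# Usage: run_chunks MERGED_FILE CHUNK...
# Starts one background job per chunk, waits for all of them, then
# concatenates the per-chunk results in the order the chunks were given.
run_chunks() {
    merged=$1
    shift
    for chunk in "$@"; do
        process_chunk "$chunk" &
    done
    wait                      # block until every background job has finished
    : > "$merged"             # create or truncate the merged output file
    for chunk in "$@"; do
        cat "$chunk.out" >> "$merged"
    done
}
```

   The sketch starts all jobs at once, so with many chunks you may want to cap
   the number of concurrent jobs or hand the chunks to a cluster scheduler.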


USAGE
-----

Run as:
   perl deconseq.pl [options] -f <file> -dbs <list> -dbs_retain <list> ...

or rename the file to deconseq, make it executable (chmod +x), and run as:
   ./deconseq [options] -f <file> -dbs <list> -dbs_retain <list> ...

Try 'deconseq -h' for more information on the options.


DEPENDENCIES
------------

The Perl script requires the following modules:

   DeconSeqConfig  (included)
   Data::Dumper
   Getopt::Long
   Pod::Usage
   File::Path      >= 2.07
   Cwd
   FindBin


BUG REPORTS
-----------

If you find a bug please email me at <rschmieder_at_gmail_dot_com> so that I can
make DeconSeq better.


COPYRIGHT AND LICENSE
---------------------

Copyright (C) 2010-2013  Robert SCHMIEDER

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program.  If not, see <http://www.gnu.org/licenses/>.


VERSION HISTORY
---------------

(If not otherwise stated, changes apply to the standalone and web version.)

deconseq-0.4.3:
Fixed issue where multiple chunks per database specified in the config file were
not split correctly during search. (Thanks to Osvaldo Zagordi for pointing out
the problem.)

deconseq-0.4.2:
Added FindBin module to find config file when run from outside the source
directory. Improved readme file. Allow multiple chunks per database to be
specified in the config file and used with a single database name in -dbs.

deconseq-0.4.1:
Fixed issue with missing method generateFastaFromIds. (Thanks to Valmik Desai
for pointing out the problem.) Extended the input format from ACGTN to full
nucleic acid ambiguity code (ACGTURYKMSWBDHVNX-).

deconseq-0.4:
Added FASTQ support to standalone version.

deconseq-0.3.2:
Fixed issues of undefined output during FASTA file generation and grouping when
using option dbs_retain.

deconseq-0.3.1:
Removed debugging print out (3).

deconseq-0.3:
Updated BWA-SW from 0.5.8 to 0.5.9 (bugfix of very rare mismapping and change of
scoring matrix, see http://bio-bwa.sourceforge.net/ for more details). Allow
FASTQ file format for download (web version). Use input file name with added
string as output file name.

deconseq-0.2:
Code cleanup. Added access to bwasw parameters -S (chunk size of input),
-z (Z-best value) and -T (score threshold). Use -f instead of '>' for output
generation. Modified BWA-SW file bwtsw2_aux.c to allow SAM and alternative
DeconSeq output. Modified BWA-SW files bwtsw2_main.c and bwtsw2.h to fix double
defined params (-s given twice, changed one to -S) and added new parameter -A
for alternative output selection. Changed BWA-SW files stdaln.c, stdaln.h and
bwtsw2_aux.c to include R for replacement in CIGAR strings instead of using M.
Added parameter -R to BWA-SW to output extended CIGAR string instead of standard
version. Modified BWA-SW file bwtsw2_aux.c to always mismatch ambiguous base N
in query sequence instead of randomly replacing it by A, C, G or T. Added
parameter -M to BWA-SW to always force a mismatch for Ns in the query.
Web version only:
Fixed hash issue while parsing TSV file in process_data.pl file. Fixed issue in
parsing FASTQ files with no information in '+' header line.

deconseq-0.1:
First public release of DeconSeq.
Source: README.txt, updated 2013-05-05