DeconSeq
========

DECONtamination of SEQuence data using a modified version of BWA-SW
(http://deconseq.sourceforge.net).

For more information about BWA-SW go to http://bio-bwa.sourceforge.net/


SETUP
-----

1. Create the databases used for contaminant screening

   Use command:

       bwaXXX index -p database_name -a bwtsw fasta_file_with_ref >out.txt 2>&1 &

   (XXX should be replaced by MAC or 64 based on your system architecture)

   For alternative system architectures:
   - download the BWA version 0.5.9 source code from
     http://bio-bwa.sourceforge.net/
     (http://sourceforge.net/projects/bio-bwa/files/bwa-0.5.9.tar.bz2/download);
   - extract the file with "tar -xjf bwa-0.5.9.tar.bz2";
   - replace files in the extracted BWA directory with the files in
     bwasw_modified_source (distributed with DeconSeq);
   - run "make" in the BWA directory.

   Notes:
   (i)   It is advised to remove very short or very long sequences from
         files retrieved from public resources. Most likely, those
         sequences are the result of misannotations and should not be
         included in the database.
   (ii)  Removing duplicates may reduce the database size and speed up
         the analysis. Tools such as PRINSEQ
         (http://prinseq.sourceforge.net) can assist with this task.
   (iii) There seem to be some issues with Mac OSX 10.6.8 that will
         require you to compile the program (the bwaMAC version might
         crash).

2. Change values in DeconSeqConfig.pm (if applicable)

       DB_DIR
       TMP_DIR
       OUTPUT_DIR
       PROG_NAME
       PROG_DIR
       DBS
       DBS_DEFAULT

   Notes:
   (i)   If the bwaXXX program is located in the same directory as the
         deconseq.pl file, please specify ./ as program directory:

             use constant PROG_DIR => './';

3. Setup databases

   Download databases from:
   ftp://edwards.sdsu.edu:7009/deconseq/db/

   Or create your own database following the steps at:
   http://deconseq.sourceforge.net/manual.html#DB

   In DeconSeqConfig.pm, specify all your databases as follows:

       use constant DBS => {hsref => {name => 'Human Reference GRCh37',
                                      db => 'hs_ref_GRCh37'},
                            vir => {name => 'Viral genomes',
                                    db => 'virDB'}};

   In this example, you have two databases created/downloaded whose file
   names start with hs_ref_GRCh37 and virDB. The key can either be the
   same as the database name or a shorter/easier name for calling the
   database in DeconSeq. Here, the databases are called from DeconSeq
   with hsref and vir (e.g. -dbs hsref -dbs_retain vir).

   If you have multiple chunks per database (as in the 1GB database
   chunks on the FTP site), you can specify all chunks directly in the
   config and call the database with a single name. In the config file,
   separate the database names by commas (no spaces):

       use constant DBS => {hsref => {name => 'Human - Reference GRCh37',
                                      db => 'hs_ref_GRCh37_1,hs_ref_GRCh37_2,hs_ref_GRCh37_3'},
                            vir => {name => 'Viral genomes',
                                    db => 'virDB'}};

   Here, the human reference genome is split up into three database
   chunks. On the command line, simply call the database by its name
   (hsref) and DeconSeq will use the three database chunks defined in
   the config.

4. Processing large input files in a cluster environment

   Since mapping is embarrassingly parallel, there was no need to
   implement parallel processing features in DeconSeq. In other words,
   splitting the input file into chunks and processing them in parallel
   on the available compute cores will get you the same results faster
   than using the multi-thread option in BWA.

   If you have access to a big cluster, you can split large input data
   into smaller chunks and submit them to the cluster. After the chunks
   have been processed, simply concatenate the result files with cat.
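The split, process and concatenate workflow described above can be sketched in plain shell. This is only an illustration under stated assumptions: the function names split_fasta, process_chunk and run_parallel are hypothetical and not part of DeconSeq, the splitting is done with awk as a stand-in for splitFasta.pl, and process_chunk merely copies its input so that the sketch runs without DeconSeq installed. In a real run, process_chunk would call deconseq.pl on the chunk and collect whatever output files your setup produces.

```shell
#!/bin/sh
# Sketch only: all function names below are hypothetical, not part of DeconSeq.

# Split FASTA file $1 into at most $2 chunks named $3.<k>.fasta.
# Whole records are assigned round-robin, so no sequence is cut in half.
# Assumes the input starts with a '>' header line.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ { k = (count++ % n) + 1; out = prefix "." k ".fasta" }
        { print > out }
    ' "$1"
}

# Placeholder for the real per-chunk job (assumption). In practice this
# would be a DeconSeq run such as: perl deconseq.pl -f "$1" -dbs hsref
# Here it only copies the chunk ($1) to the result file ($2).
process_chunk() {
    cp "$1" "$2"
}

# Run process_chunk on every chunk with prefix $1 in the background,
# wait for all jobs to finish, then concatenate the results into $2.
run_parallel() {
    prefix=$1
    result=$2
    for chunk in "$prefix".*.fasta; do
        [ -e "$chunk" ] || continue
        process_chunk "$chunk" "$chunk.out" &
    done
    wait
    cat "$prefix".*.fasta.out > "$result"
}
```

With these definitions, "split_fasta file.fasta 10 chunk" followed by "run_parallel chunk combined.out" would process ten chunks concurrently on one machine. Because records are distributed round-robin, the concatenated output contains the same records as the input but not necessarily in the same order. On a cluster, the background jobs would be replaced by submissions to your scheduler.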
   (The web version of DeconSeq processes its data in the same way, by
   splitting the input and concatenating the results.)

   A Perl script to split the input data based on file size or number of
   chunks can be found at:
   https://sourceforge.net/projects/deconseq/files/misc/

   The file is called splitFasta.pl and you can get details on its use
   with:

       perl splitFasta.pl -h

   Examples:

       perl splitFasta.pl -verbose -i file.fasta -s 2    #chunks of 2MB
       perl splitFasta.pl -verbose -i file.fasta -n 10   #10 chunks


USAGE
-----

Run as:

    perl deconseq.pl [options] -f <file> -dbs <list> -dbs_retain <list> ...

or rename the file to deconseq, make it executable with chmod +x, and
run as:

    ./deconseq [options] -f <file> -dbs <list> -dbs_retain <list> ...

Try 'deconseq -h' for more information on the options.


DEPENDENCIES
------------

The Perl script requires the following modules:

    DeconSeqConfig (included)
    Data::Dumper
    Getopt::Long
    Pod::Usage
    File::Path >= 2.07
    Cwd
    FindBin


BUG REPORTS
-----------

If you find a bug please email me at <rschmieder_at_gmail_dot_com> so
that I can make DeconSeq better.


COPYRIGHT AND LICENSE
---------------------

Copyright (C) 2010-2013 Robert SCHMIEDER

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see <http://www.gnu.org/licenses/>.


VERSION HISTORY
---------------

(If not otherwise stated, changes apply to the standalone and web
version.)

deconseq-0.4.3: Fixed issue where multiple chunks per database specified
    in the config file were not split correctly during search. (Thanks
    to Osvaldo Zagordi for pointing out the problem.)

deconseq-0.4.2: Added FindBin module to find config file when run from
    outside the source directory.
    Improved readme file.
    Allow multiple chunks per database to be specified in the config
    file and used with a single database name in -dbs.

deconseq-0.4.1: Fixed issue with missing method generateFastaFromIds.
    (Thanks to Valmik Desai for pointing out the problem.)
    Extended the input format from ACGTN to full nucleic acid ambiguity
    code (ACGTURYKMSWBDHVNX-).

deconseq-0.4: Added FASTQ support to standalone version.

deconseq-0.3.2: Fixed issues of undefined output during FASTA file
    generation and grouping when using option dbs_retain.

deconseq-0.3.1: Removed debugging print out (3).

deconseq-0.3: Updated BWA-SW from 0.5.8 to 0.5.9 (bugfix of very rare
    mismapping and change of scoring matrix, see
    http://bio-bwa.sourceforge.net/ for more details).
    Allow FASTQ file format for download (web version).
    Use input file name with added string as output file name.

deconseq-0.2: Code cleanup.
    Added access to bwasw parameters -S (chunk size of input), -z
    (Z-best value) and -T (score threshold).
    Use -f instead of '>' for output generation.
    Modified BWA-SW file bwtsw2_aux.c to allow SAM and alternative
    DeconSeq output.
    Modified BWA-SW files bwtsw2_main.c and bwtsw2.h to fix double
    defined params (-s given twice, changed one to -S) and added new
    parameter -A for alternative output selection.
    Changed BWA-SW files stdaln.c, stdaln.h and bwtsw2_aux.c to include
    R for replacement in CIGAR strings instead of using M.
    Added parameter -R to BWA-SW to output extended CIGAR string instead
    of standard version.
    Modified BWA-SW file bwtsw2_aux.c to always mismatch ambiguous base
    N in query sequence instead of randomly replacing it by A, C, G or
    T.
    Added parameter -M to BWA-SW to always force a mismatch of Ns in
    query.
    Web version only:
    Fixed hash issue while parsing TSV file in process_data.pl file.
    Fixed issue in parsing FASTQ files with no information in '+' header
    line.

deconseq-0.1: First public release of DeconSeq.
Source: README.txt, updated 2013-05-05

