BreakSeek Code
A breakpoint-based algorithm for full spectral range INDEL detection
Brought to you by:
breakseek
File | Date | Author | Commit |
---|---|---|---|
lib | 2015-01-01 |
![]() |
[c26bbc] better memeory allocation for parallel computing |
.gitignore | 2013-04-17 |
![]() |
[862d95] UPDATE: Python Class For SV and Parallelism |
LICENSE.txt | 2014-12-23 |
![]() |
[be98e7] Format Update - Formating the scripts for public! |
README.txt | 2014-12-29 |
![]() |
[b265a0] minor format update - README |
breakseek.py | 2015-01-01 |
![]() |
[0dacd7] bug fix |
sam_prep.py | 2014-12-31 |
![]() |
[4d9330] extreme number of references supported in sam_p... |
Manual of BreakSeek v1.1 Contact: Hui Zhao (kenkvs@126.com) Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, CHINA BreakSeek is a breakpoint-based algorithm for full spectral range INDEL detection, which can unbiasedly and efficiently detect both homozygous and heterozygous INDELs, ranging from several base pairs to over thousands of base pairs, with accurate breakpoint and heterozygosity rate estimations. IMPORTANT: The BreakSeek method is implemented in Python 2.7, please make sure the scripts are running under the Python 2.7 environment. # ---------------------------------------- 1 Commands and arguments How to run BreakSeek: python breakseek.py -f sorted_sam_file -r ref_fasta_file [-optional parameters] The arguments of BreakSeek are as follows: -f the SAM file sorted by read pairs (required) -r the fasta formatted reference sequence used in SAM file (required) -o the output directory, default as the directory of the input SAM file (optional) -m mean of the insert size of read pairs in SAM (optional but recommended) -s sd of the insert size (optional but recommended) -c read depth of the dataset (optional but highly recommended) -q read length (optional) -g the maximum size of ignored CIGAR segments (default 5, recommended) -z sampling size for estimation of -m, -s, -c, -q if not provided (default 8000) -x maximum iteration of EM estimation of local PEM distances (default 2000) -n the number of processors used for breakseek (default 1) Note: If the users are not using default parameters for -m/-s/-c/-q/, breakseek will use the user pre-defined values instead of estimating these values by itself. Therefore, please make sure these paramters are correct. If you are not sure about the mean (-m) or sd (-s) of the insert sizes of read pairs or the read depth (-c), it is recommended to run breakseek simply as: # run breakseek scripts python breakseek.py -f sorted_sam_file -r ref_fasta_file # ---------------------------------------- 2 Preparation before running BreakSeek 2.1 Generation of the SAM file BreakSeek detected INDELs from sorted SAM file generated by the BWA tool, which is a split mapping algorithm (http://bio-bwa.sourceforge.net/bwa.shtml). SAM files generated by other mapping algorithms, that may produce multiple primary alignments for different parts of a single query read sequence, are not supported. Please make sure the SAM file submitted to BreakSeek by -f only contains one line for the mapping of each read. Recommended protocols for running BWA: # shell scripts bwa index -a bwtsw ref.fa bwa aln ref.fa read1.fq > read1.sai bwa aln ref.fa read2.fq > read2.sai bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln_pe.sam Reads from the generated aln_pe.sam are already in pairs and only need to be sorted using sam_prep.py. IMPORTANT: If your SAM file is not generated by bwa aln like aln_pe.sam, be caution that SAM files containing multiple lines for the alignments of different parts of the same read sequence are not supported by BreakSeek. 2.2 Preparation for SAM file For efficiency, the SAM file is required to be sorted in read pairs. Currently, BreakSeek can only detect INDELs from different chromosomes separately and independently. That is, the SAM file must be partitioned by references and all read pairs from the same partitioned SAM file should all be mapped to the same reference. Please run BreakSeek with the partitioned SAM files and the references separately. The sam_prep.py script checks and formats the input SAM file if it meets the requirements. If the SAM file contains read pairs from multiple references, sam_prep.py will partition the SAM file into smaller SAM files by reference, and the partitioned SAM files will then be sorted by read pairs independently and are saved into ${WD}/${REF_NAME}/, where ${WD} is the output directory (default the directory of the original SAM file) and ${REF_NAME} is the name of the reference sequence. If the input SAM file already meets the requirements, sam_prep.py will not re-format it, and thus it is recommended to run sam_prep.py before breakseek.py. A qualified SAM file may look like this: @SQ SN:chr20 LN:63025520 @PG ID:bwa PN:bwa VN:0.7.5a-r405 ERR194147.680911999 117 chr20 59990 0 * = 59990 ...rest_of_the_terms_in_SAM_format... ERR194147.680911999 153 chr20 59990 25 101M = 59990 ...rest_of_the_terms_in_SAM_format... ERR194147.680912000 89 chr20 59991 37 101M = 59991 ...rest_of_the_terms_in_SAM_format... ERR194147.680912000 181 chr20 59991 0 * = 59991 ...rest_of_the_terms_in_SAM_format... ERR194147.680912024 99 chr20 59995 60 101M = 60081 ...rest_of_the_terms_in_SAM_format... ERR194147.680912024 147 chr20 60081 60 101M = 59995 ...rest_of_the_terms_in_SAM_format... ERR194147.680912063 83 chr20 60230 60 101M = 59996 ...rest_of_the_terms_in_SAM_format... ERR194147.680912063 163 chr20 59996 60 101M = 60230 ...rest_of_the_terms_in_SAM_format... That is, the SAM file should be: 1) all read pairs in the SAM file should be mapped to the same and only one reference sequence (usually recorded in @SQ if SAM header is provided) 2) all read pairs are sorted by pairs (reads of the same pair are always together) 3)* only one line for each read in the SAM file *IMPORTANT: the third requirement, determined by how the SAM file is generated, is not checked by sam_prep.py. Please make sure the input SAM file already meet this requirement! That is, there should be no multiple lines in the SAM file for alignments of different parts of the same single read sequence. # ---------------------------------------- 3 An example of running BreakSeek Before we start, please make sure you have installed Python 2.7 and use Mac OS X or Linux operation system. BreakSeek should also work on Window systems, but is not tested. Please download and gunzip BreakSeek and test.zip. test.sam is a SAM file of alignment records generated by bwa aln with the default settings. test.fa is a FASTA file of hg19 chromosome 20 downloaded from UCSC. Enter the directory and type as following in your terminal: # check the SAM file, since test.sam does not meet the requirements, a qualified SAM file testS.sam is produced by sam_prep.py # since the SAM file only contains reads mapped to test.fa, it will not be partitioned # and thus no sub directory by references will be created, the qualified testS.sam will be in the same directory as test.sam python sam_prep.py -f test.sam # run BreakSeek on the sorted SAM file python breakseek.py -f testS.sam -r test.fa INDELs identified by BreakSeek are recorded in ${NAME_OF_SAM_FILE}_INDEL_list.txt in the same directory (default). # ---------------------------------------- 4 Output format Columns of output file are split by tabs ("\t" in shell and python). Here is an example of the output file: ST ED Type Chr HetR Lbreak Rbreak PEM 287035 287035 I chr20 0.54 3/8 3/8 0/18/14 287065 287065 I chr20 0.65 15/5 15/5 0/12/9 373683 373808 D chr20 0.45 10/0 0/16 0/53/17 1051575 1051575 I chr20 0.59 13/13 13/13 0/10/7 1069160 1069230 D chr20 0.76 7/0 0/11 0/24/61 1170760 1170789 D chr20 1.0 4/0 - 0/0/59 1389147 1390817 D chr20 0.47 10/0 - 0/35/21 1389142 1390815 D chr20 0.42 - 0/9 0/41/21 1497990 1498017 D chr20 0.6 5/0 0/9 0/29/29 1746594 1746734 D chr20 0.68 - 0/5 0/15/29 42654167 42654167 I chr20 1.0 15/20 15/20 0/0/0 Each column gives information of the detected INDELs, such as positions and local read counts. From the left, the columns are: ST - start position (left breakpoint) of the INDEL on reference ED - end position (right breakpoint) of the INDEL on reference, for insertions, ST and ED are the same position Type - I for Insertions and D for Deletions Chr - on which reference/chromosome did the INDEL occurred HetR - estimated heterozygosity rate of the INDEL, ranging from 0. to 1. Lbreak - the number of breakreads (#MSreads/#SMreads) at the left breakpoint. For deletions #SMreads is the number of error-reads and should be close to 0; for insertions the two numbers should be of similar size Rbreak - the number of of breakreads (#MSreads/#SMreads) at the right breakpoint. For deletions #MSreads is the number of error-reads and should be close to 0; for insertions the two numbers should be of similar size PEM - classification of local read pairs into #ErroneouslyMapped/#Normally#Mapped/#INDEL_Supporting according to the EM analysis, #ErroneouslyMapped should be close to 0