BreakSeek Code

A breakpoint-based algorithm for full spectral range INDEL detection

Brought to you by: breakseek
Tree [0dacd7] master / History
HTTPS access
File	Date	Author	Commit
lib	2015-01-01	Hui Zhao	[c26bbc] better memeory allocation for parallel computing
.gitignore	2013-04-17	Hui Zhao	[862d95] UPDATE: Python Class For SV and Parallelism
LICENSE.txt	2014-12-23	Hui Zhao	[be98e7] Format Update - Formating the scripts for public!
README.txt	2014-12-29	Hui Zhao	[b265a0] minor format update - README
breakseek.py	2015-01-01	Hui Zhao	[0dacd7] bug fix
sam_prep.py	2014-12-31	Hui Zhao	[4d9330] extreme number of references supported in sam_p...
Read Me

Manual of BreakSeek v1.1
Contact: Hui Zhao (kenkvs@126.com)
Computational Genomics Lab, Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, CHINA

BreakSeek is a breakpoint-based algorithm for full spectral range INDEL detection, which can unbiasedly and efficiently detect both homozygous and heterozygous INDELs, ranging from several base pairs to over thousands of base pairs, with accurate breakpoint and heterozygosity rate estimations. 

IMPORTANT: The BreakSeek method is implemented in Python 2.7, please make sure the scripts are running under the Python 2.7 environment.


# ----------------------------------------
1 Commands and arguments
How to run BreakSeek:
python breakseek.py -f sorted_sam_file -r ref_fasta_file [-optional parameters]

The arguments of BreakSeek are as follows:

	-f	the SAM file sorted by read pairs (required)
	-r	the fasta formatted reference sequence used in SAM file (required)
	-o	the output directory, default as the directory of the input SAM file (optional)
	-m	mean of the insert size of read pairs in SAM (optional but recommended)
	-s	sd of the insert size (optional but recommended)
	-c	read depth of the dataset (optional but highly recommended)
	-q	read length (optional)
	-g	the maximum size of ignored CIGAR segments (default 5, recommended)
	-z	sampling size for estimation of -m, -s, -c, -q if not provided (default 8000)
	-x	maximum iteration of EM estimation of local PEM distances (default 2000)
	-n	the number of processors used for breakseek (default 1)

Note: If the users are not using default parameters for -m/-s/-c/-q/, breakseek will use the user pre-defined values instead of estimating these values by itself. Therefore, please make sure these paramters are correct. If you are not sure about the mean (-m) or sd (-s) of the insert sizes of read pairs or the read depth (-c), it is recommended to run breakseek simply as:
# run breakseek scripts
python breakseek.py -f sorted_sam_file -r ref_fasta_file


# ----------------------------------------
2 Preparation before running BreakSeek

2.1 Generation of the SAM file
BreakSeek detected INDELs from sorted SAM file generated by the BWA tool, which is a split mapping algorithm (http://bio-bwa.sourceforge.net/bwa.shtml). SAM files generated by other mapping algorithms, that may produce multiple primary alignments for different parts of a single query read sequence, are not supported. Please make sure the SAM file submitted to BreakSeek by -f only contains one line for the mapping of each read.

Recommended protocols for running BWA:
# shell scripts
bwa index -a bwtsw ref.fa
bwa aln ref.fa read1.fq > read1.sai
bwa aln ref.fa read2.fq > read2.sai
bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln_pe.sam

Reads from the generated aln_pe.sam are already in pairs and only need to be sorted using sam_prep.py.

IMPORTANT: If your SAM file is not generated by bwa aln like aln_pe.sam, be caution that SAM files containing multiple lines for the alignments of different parts of the same read sequence are not supported by BreakSeek.

2.2 Preparation for SAM file
For efficiency, the SAM file is required to be sorted in read pairs. Currently, BreakSeek can only detect INDELs from different chromosomes separately and independently. That is, the SAM file must be partitioned by references and all read pairs from the same partitioned SAM file should all be mapped to the same reference. Please run BreakSeek with the partitioned SAM files and the references separately.

The sam_prep.py script checks and formats the input SAM file if it meets the requirements. If the SAM file contains read pairs from multiple references, sam_prep.py will partition the SAM file into smaller SAM files by reference, and the partitioned SAM files will then be sorted by read pairs independently and are saved into ${WD}/${REF_NAME}/, where ${WD} is the output directory (default the directory of the original SAM file) and ${REF_NAME} is the name of the reference sequence. If the input SAM file already meets the requirements, sam_prep.py will not re-format it, and thus it is recommended to run sam_prep.py before breakseek.py.

A qualified SAM file may look like this:
@SQ     SN:chr20        LN:63025520
@PG     ID:bwa  PN:bwa  VN:0.7.5a-r405
ERR194147.680911999     117     chr20   59990   0       *       =       59990   ...rest_of_the_terms_in_SAM_format...
ERR194147.680911999     153     chr20   59990   25      101M    =       59990   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912000     89      chr20   59991   37      101M    =       59991   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912000     181     chr20   59991   0       *       =       59991   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912024     99      chr20   59995   60      101M    =       60081   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912024     147     chr20   60081   60      101M    =       59995   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912063     83      chr20   60230   60      101M    =       59996   ...rest_of_the_terms_in_SAM_format...
ERR194147.680912063     163     chr20   59996   60      101M    =       60230   ...rest_of_the_terms_in_SAM_format...

That is, the SAM file should be:
1) all read pairs in the SAM file should be mapped to the same and only one reference sequence (usually recorded in @SQ if SAM header is provided)
2) all read pairs are sorted by pairs (reads of the same pair are always together)
3)* only one line for each read in the SAM file

*IMPORTANT: the third requirement, determined by how the SAM file is generated, is not checked by sam_prep.py. Please make sure the input SAM file already meet this requirement! That is, there should be no multiple lines in the SAM file for alignments of different parts of the same single read sequence.


# ----------------------------------------
3 An example of running BreakSeek
Before we start, please make sure you have installed Python 2.7 and use Mac OS X or Linux operation system. BreakSeek should also work on Window systems, but is not tested.
Please download and gunzip BreakSeek and test.zip.
test.sam is a SAM file of alignment records generated by bwa aln with the default settings.
test.fa is a FASTA file of hg19 chromosome 20 downloaded from UCSC.

Enter the directory and type as following in your terminal:
# check the SAM file, since test.sam does not meet the requirements, a qualified SAM file testS.sam is produced by sam_prep.py
# since the SAM file only contains reads mapped to test.fa, it will not be partitioned 
# and thus no sub directory by references will be created, the qualified testS.sam will be in the same directory as test.sam
python sam_prep.py -f test.sam
# run BreakSeek on the sorted SAM file
python breakseek.py -f testS.sam -r test.fa

INDELs identified by BreakSeek are recorded in ${NAME_OF_SAM_FILE}_INDEL_list.txt in the same directory (default).


# ----------------------------------------
4 Output format
Columns of output file are split by tabs ("\t" in shell and python).
Here is an example of the output file:
ST      ED      Type	Chr    HetR    Lbreak  Rbreak  PEM
287035	287035  I	chr20	0.54    3/8     3/8     0/18/14
287065  287065  I       chr20	0.65    15/5    15/5    0/12/9
373683  373808  D       chr20	0.45    10/0    0/16    0/53/17
1051575 1051575 I       chr20	0.59    13/13   13/13   0/10/7
1069160 1069230 D       chr20	0.76    7/0     0/11    0/24/61
1170760 1170789 D       chr20	1.0     4/0     -       0/0/59
1389147 1390817 D       chr20	0.47    10/0    -       0/35/21
1389142 1390815 D       chr20	0.42    -       0/9     0/41/21
1497990 1498017 D       chr20	0.6     5/0     0/9     0/29/29
1746594 1746734 D       chr20	0.68    -       0/5     0/15/29
42654167	42654167        I       chr20	1.0     15/20   15/20   0/0/0

Each column gives information of the detected INDELs, such as positions and local read counts.
From the left, the columns are:
ST	- start position (left breakpoint) of the INDEL on reference
ED	- end position (right breakpoint) of the INDEL on reference, for insertions, ST and ED are the same position
Type	- I for Insertions and D for Deletions
Chr	- on which reference/chromosome did the INDEL occurred
HetR	- estimated heterozygosity rate of the INDEL, ranging from 0. to 1.
Lbreak	- the number of breakreads (#MSreads/#SMreads) at the left breakpoint. For deletions #SMreads is the number of error-reads and should be close to 0; for insertions the two numbers should be of similar size
Rbreak	- the number of of breakreads (#MSreads/#SMreads) at the right breakpoint. For deletions #MSreads is the number of error-reads and should be close to 0; for insertions the two numbers should be of similar size
PEM	- classification of local read pairs into #ErroneouslyMapped/#Normally#Mapped/#INDEL_Supporting according to the EM analysis, #ErroneouslyMapped should be close to 0
BreakSeek Code

A breakpoint-based algorithm for full spectral range INDEL detection

Branches

Tree [0dacd7] master / Download Snapshot History

Read Me

Tree [0dacd7] master /

History