=======================
|   BALSA - README    |
=======================

Visit the latest version of BALSA/ELSA at:
http://www.l3-bioinfo.com/ 

------------
BALSA is an integrated solution for the secondary analysis of next generation sequencing (NGS) data.

Caveats
-------
1. Please ensure that your machine has ≥50GB of main memory. BALSA will not terminate itself on a machine that has large swap space but insufficient main memory; however, the run is then likely to take weeks instead of hours.

2. Use a local disk, not any kind of network-attached disk, for temporary file storage. Please ensure you have at least 500GB of local storage for analyzing a typical 50x WGS sample. BALSA will not terminate itself if a network-attached file system is used for temporary files, but the run is then likely to take weeks instead of hours.

3. Using an SSD as the local disk for temporary storage won’t significantly improve BALSA’s performance.

4. In BALSA’s paper, 5.5 hours from raw reads to variants was achieved using an Intel i7-3730k@3.2GHz and an nVidia GTX680. In production environments, it is common to use more reliable server-grade hardware, such as Intel Xeon CPUs, which usually have more cores but lower per-core performance, and nVidia Tesla cards, which have ECC memory but a much lower core frequency. BALSA’s high performance is achieved partly by tuning the balance between CPU and GPU utilization, and the default version of BALSA is fine-tuned for the i7-3730k + GTX680.

However, with a different CPU or GPU, BALSA’s performance can degrade:
	i) For CPU, you can set “NumOfCpuThreads” to a value no larger than 16 to utilize more CPU cores; however, BALSA seldom benefits from a value larger than 10, since the bottleneck is then at the GPU. BALSA benefits more from a higher per-core frequency than from a larger number of cores. According to our experiments, two Intel Xeon E5-2620@2.0GHz are about 1.5 times slower than one Intel i7-3730k@3.2GHz with the default version.
	ii) For GPU, according to our experiments, a Tesla K20c is about 1.5 times slower than a GTX680 with the default version. Note that, unlike with CPUs, a higher number of GPU cores does not necessarily improve speed, due to saturation in the GPU’s scheduling logic.

Please note that we are strongly against using non-server-grade hardware in production. However, if you use two Intel Xeon E5-2620 + a Tesla K20c, BALSA can be 1.5-2 times slower than on an i7-3730k@3.2GHz + GTX680 with the default version. This amounts to about 10 hours for analyzing a 50x WGS sample.
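
The memory and storage caveats above can be checked before a run is started. The following is a minimal sketch in Python, not part of the BALSA package. It assumes that "GB" means 10^9 bytes and that the host is POSIX-like (os.sysconf is not available on Windows); the function names are hypothetical. It cannot tell whether the temporary directory is on a network-attached disk, so that still has to be verified by hand.

```python
import os
import shutil

# Thresholds taken from caveats 1 and 2; decimal gigabytes are an assumption.
MIN_MEMORY_BYTES = 50 * 10**9
MIN_DISK_BYTES = 500 * 10**9


def resource_problems(memory_bytes, free_disk_bytes):
    """Return a list of problems; an empty list means both thresholds are met."""
    problems = []
    if memory_bytes < MIN_MEMORY_BYTES:
        problems.append("main memory below 50 GB")
    if free_disk_bytes < MIN_DISK_BYTES:
        problems.append("free temporary storage below 500 GB")
    return problems


def physical_memory_bytes():
    """Total physical RAM via POSIX sysconf (Linux and similar systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def check_host(temp_dir):
    """Check this machine, with temp_dir as the planned temporary-file location."""
    free_bytes = shutil.disk_usage(temp_dir).free
    return resource_problems(physical_memory_bytes(), free_bytes)
```

Calling check_host with the directory intended for the -t prefix returns the unmet requirements, if any.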


Usage Guidelines
----------------
A 2way-BWT index of the target reference genome is required.
A SNP database built from known SNP variants (such as dbSNP),
an IndelDB built from known Indel variants (such as 1000G and Mills),
and a gene region list built from known gene intervals (such as
UCSC known genes) are required for realignment.
For Exome mode, an Exome Region List in GFF format is required.


> Index Building:

2way-BWT
	Please refer to SOAP3-DP for instructions on building the index.
SNP database
	Input: VCF
	Output: Prefix.1list, Prefix.bv
	Usage:
		SnpDBBuilder <2BWT path>.index <VCF> <Output Prefix>
IndelDB
	Input: VCF
	Usage:
		IndelDBBuilder <2BWT path>.index <VCF> <Output file name>
General Region List (e.g. Known Gene List, Exome Region List)
	Input: GFF
	Usage:
		RegionIndexBuilder <2BWT path>.index <GFF> <Output file name>
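
The three builder invocations above can be assembled in one place. The sketch below is a hedged illustration in Python: the program names and argument order come from the usage lines above, while the function name and all file paths are hypothetical. It only builds argument lists; nothing is executed until the caller passes them to subprocess.run.

```python
import subprocess


def index_build_commands(bwt_index, snp_vcf, snp_prefix,
                         indel_vcf, indel_db, region_gff, region_list):
    """Build argument lists for the three index builders.

    bwt_index is the "<2BWT path>.index" argument from the usage lines.
    """
    return [
        ["SnpDBBuilder", bwt_index, snp_vcf, snp_prefix],
        ["IndelDBBuilder", bwt_index, indel_vcf, indel_db],
        ["RegionIndexBuilder", bwt_index, region_gff, region_list],
    ]


def run_all(commands):
    """Run each command in order and stop at the first non-zero exit status."""
    for command in commands:
        subprocess.run(command, check=True)
```

For Exome mode, RegionIndexBuilder would presumably be run a second time with the Exome Region List GFF as input.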


> BALSA (alignment + recalibration + deduplication + realignment + variant calling):

BALSA requires a configuration file with the suffix ini; its file name must be the same as that of the BALSA executable.
BALSA takes reads in FASTQ format as input.

[One pair of input reads]
	Usage:
		balsa pair <2BWT path>.index <dbSNP prefix> <indelDB> <Gene List> <input read file 1_1> <input read file 1_2> <result prefix>
		[-L maxReadLength] (optional: the maximum read length of the input reads)
		[-I] (optional: 64-base quality encoding)
		[-v] (optional: minimum allowed insert size)
		[-u] (optional: maximum allowed insert size)
		[-t tempPrefix] (optional: the prefix for temporary files; if not set, the current directory with the prefix "asc" is used)
		[-snapshot] (optional: if set, SNAPSHOT files are written)
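
As an illustration of the single-pair usage above, the following Python sketch assembles the command line from named arguments. The function and parameter names are hypothetical. It assumes that options follow the positional arguments and that -L and -t take a space-separated value, as the usage lines suggest. The -v and -u options are left out because the usage lines do not show how their values are passed.

```python
def balsa_pair_command(bwt_index, dbsnp_prefix, indel_db, gene_list,
                       reads_1, reads_2, result_prefix,
                       max_read_length=None, quality_base_64=False,
                       temp_prefix=None, snapshot=False):
    """Build the argument list for "balsa pair" with the documented options."""
    command = ["balsa", "pair", bwt_index, dbsnp_prefix, indel_db, gene_list,
               reads_1, reads_2, result_prefix]
    if max_read_length is not None:
        command += ["-L", str(max_read_length)]   # maximum read length
    if quality_base_64:
        command.append("-I")                      # 64-base quality encoding
    if temp_prefix is not None:
        command += ["-t", temp_prefix]            # prefix for temporary files
    if snapshot:
        command.append("-snapshot")               # also write SNAPSHOT files
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from the directory that holds the BALSA executable and its ini file.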

[Two or more pairs of input reads]
If two or more pairs of input read files are used, an input list file is required:
	Input List Format:
		<input read file 1_1>	<input read file 1_2>	<min. insert size>	<max. insert size>	<temporary files' prefix>
		<input read file 2_1>	<input read file 2_2>	<min. insert size>	<max. insert size>	<temporary files' prefix>
		...
		<input read file n_1>	<input read file n_2>	<min. insert size>	<max. insert size>	<temporary files' prefix>
	Usage:
		balsa pair-multi <2BWT path>.index <dbSNP prefix> <indelDB> <Gene List> <input list> <result prefix>
		[-L maxReadLength] (optional: the maximum read length of the input reads)
		[-I] (optional: 64-base quality encoding)
		[-t tempPrefix] (optional: the prefix for temporary files; if not set, the current directory with the prefix "asc" is used)
		[-snapshot] (optional: if set, SNAPSHOT files are written)
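
The input list for pair-multi can be generated rather than written by hand. The Python sketch below assumes the five columns are separated by tab characters, as the layout above suggests, with one line per pair of read files. The function names, file names and insert sizes in the example are hypothetical.

```python
def format_input_list(pairs):
    """Format (reads_1, reads_2, min_insert, max_insert, temp_prefix) tuples.

    Returns the text of the input list, one tab-separated line per pair.
    """
    lines = []
    for reads_1, reads_2, min_insert, max_insert, temp_prefix in pairs:
        fields = [reads_1, reads_2, str(min_insert), str(max_insert), temp_prefix]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


def balsa_pair_multi_command(bwt_index, dbsnp_prefix, indel_db, gene_list,
                             input_list, result_prefix):
    """Build the argument list for "balsa pair-multi" without optional flags."""
    return ["balsa", "pair-multi", bwt_index, dbsnp_prefix, indel_db,
            gene_list, input_list, result_prefix]


# Example with two lanes of the same sample (all values are placeholders).
example_list = format_input_list([
    ("lane1_1.fq", "lane1_2.fq", 200, 600, "/local/tmp/lane1"),
    ("lane2_1.fq", "lane2_2.fq", 200, 600, "/local/tmp/lane2"),
])
```

Writing example_list to a file and passing that file's path as input_list gives the complete pair-multi invocation.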

Output and Result post-processing
---------------------------------
The resulting SNPs/Indels are in output file "<result prefix>.txt".

The output contains details of each variant position. Please refer to the Supplementary Document of BALSA's manuscript
for per-column details.

To transform the output into VCF format, please run
	Usage:
		perl txt2vcf.pl <result prefix>.txt <Sample Name> <Reference in FASTA format>

BALSA also outputs SNAPSHOT files, named as follows:
	1. <result prefix>.SCr
	2. <result prefix>.SCOr
	3. <result prefix>.IVr
	4. <result prefix>.MPr
	5. <result prefix>.Cr
	6. <result prefix>.DSCr

An example of using these SNAPSHOT files is provided in this BALSA package.
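
Before the SNAPSHOT files are handed to a downstream tool, it can help to confirm that all six exist. The following Python sketch is a hypothetical helper, not the example shipped in the package; the suffixes are the ones listed above.

```python
import os

# The six SNAPSHOT suffixes listed in this README.
SNAPSHOT_SUFFIXES = ("SCr", "SCOr", "IVr", "MPr", "Cr", "DSCr")


def snapshot_paths(result_prefix):
    """Return the expected SNAPSHOT file names for a result prefix."""
    return [f"{result_prefix}.{suffix}" for suffix in SNAPSHOT_SUFFIXES]


def missing_snapshots(result_prefix):
    """Return the expected SNAPSHOT files that are not present on disk."""
    return [path for path in snapshot_paths(result_prefix)
            if not os.path.isfile(path)]
```

An empty list from missing_snapshots means the full set for that prefix is present.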

[Somatic Variant Caller]
Somatic Variant Caller accepts two sets of SNAPSHOT files, typically one set from a normal sample and another from a tumor sample.
	Usage:
		somatic-caller <2BWT path>.index <Normal SNAPSHOT prefix> <Tumor SNAPSHOT prefix> <Somatic Score Threshold, typically 20> <Output file name>

[CNV Caller]
Copy-number variation Caller accepts two sets of SNAPSHOT files, typically one set from a normal sample and another from a tumor sample.
	Usage:
		cnv-caller <2BWT path>.index <Normal SNAPSHOT prefix> <Tumor SNAPSHOT prefix> <File Output For CNV Regions>

For details of the results, please refer to the Supplementary Document of BALSA's manuscript.
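
For a tumor/normal pair, both callers take the same index and the same two SNAPSHOT prefixes, so their command lines can be built together. The Python sketch below follows the two usage lines above; the function name and output file names are hypothetical, and the default threshold of 20 is the "typical" value quoted in the usage line.

```python
def tumor_normal_commands(bwt_index, normal_prefix, tumor_prefix,
                          somatic_output, cnv_output, somatic_threshold=20):
    """Build argument lists for somatic-caller and cnv-caller."""
    somatic = ["somatic-caller", bwt_index, normal_prefix, tumor_prefix,
               str(somatic_threshold), somatic_output]
    cnv = ["cnv-caller", bwt_index, normal_prefix, tumor_prefix, cnv_output]
    return somatic, cnv
```

Both prefixes are the <result prefix> values of earlier BALSA runs made with -snapshot set.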

=============
End of README
Source: README.txt, updated 2015-10-09