#####################
# BCP Guide # Copyright (c) 2012 Fernando Palluzzi
#####################
Contact: fernando.palluzzi@studio.unibo.it
0. LICENSE.
This file is part of BCP - The Breast Cancer risk Pipeline.
BCP is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
BCP is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with BCP. If not, see <http://www.gnu.org/licenses/>.
1. INSTALLATION.
1) Move/Copy the BCP folder in your home directory.
2) Start a terminal.
2) Enter in your .bashrc file (you can find it by using 'locate bashrc';
there should be one in your home directory).
3) Modfy your PYTHON and PYTHONPATH variables by modifying the following
two lines (XXXXX stands for the initial variable content):
export PATH=XXXXX:/home/<user>/BCP
export PYTHONPATH=XXXXX:/home/<user>/BCP
If one or both of these line do not exist, please create it.
4) Do not forget to give execution permissions to the BCP files;
e.g.:
$ chmod 700 -R /home/<user>/BCP
5) BCP needs third-party dependencies.
Only Python, Biopython and Pysam are needed to boot BCP. The other
dependencies are needed to make BCP work correctly.
Python http://www.python.org/
Biopython http://biopython.org/wiki/Biopython
SQLite 3 http://www.sqlite.org/
Fastx http://hannonlab.cshl.edu/fastx_toolkit/
R http://www.r-project.org/
SMALT http://www.sanger.ac.uk/resources/software/smalt/
Samtools http://samtools.sourceforge.net/
Pysam http://code.google.com/p/pysam/
ART http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
6) Try to boot bcp, with the command:
$ bcp.py
You should receive the following message:
Welcome to the BCPipeline
Version: 0.2 - October 01, 2012
Usage: bcp.py <command> [options]
Type bcp.py <command> for information about command usage.
Remember to use the right file extension:
Reference genome fa, fas, fasta
Libraries fq, fastq
Alignments sam, bam
ART artificial align. aln
Variants bcf, vcf, tab
ART statistics stat
Quality statistics stats
Length distribution ldst
Coverage distribution cdst
Plots png
Command: convert Convert file format and phred score
index_fastq Index a huge fastq file into a .idx file
clip Clip mids, adapters and primers and split fastq
or .idx files on the base of tag sequences
clean Clean either a fastq or a .idx file
index Index a reference file in fasta format
align Produce .sam and .bam alignments of a fastq file
against its reference in fasta format
pileup Produce a .bcf file from a .bam alignment
call Call variants (convert a .bcf file into a .vcf one)
view Show statistics for fasta, fastq and bam files
stat Produce stats files and graphs
simulate Produce artificial 454 and Illumina reads in
fastq format and alignments in SAM format
intersect Intersect tables
risk Evaluate Breast Cancer Risk by merging clinical
and genome information
run Automated BCP run
7) Once installed BCP, check the dependencies:
$ bcp.py check
BCP dependencies checkup
[ ]: not installed [X]: installed
[X]: python
[X]: sqlite3
[X]: fastx_quality_stats
[X]: R
[X]: smalt_i686
[X]: smalt_MacOSX_i386
[X]: smalt_MacOSX_i686_64bit
[X]: smalt_x86_64
[X]: samtools
[X]: bcftools
[X]: art_454
[X]: art_illumina
2. DEPENDENCIES.
The list of dependencies can be checked as explained above.
The dependencies to be installed are:
Python/Biopython Core functions
SQLite3 Index databases
Fastx Quality stats (it can also be used for clipping and splitting fastq files)
R Statistics and plots
SMALT Alignments
SAMtools Variant calling
Pysam BAM parsing
ART Simulation
3. RUN BCP FOR THE FIRST TIME.
To run BCP correctly the user should remember some rule.
1) Do not merge options. e.g.: write -a -f, not -af; write -q 20, not -q20
2) Read the internal guides to know each single BCP function.
BCP functions are listed at the beginning of this document.
To read the documentation from inside BCP, type from a terminal: bcp.py <function>
3) Try to make a test run.
3.0) Go in the 'test' folder.
3.1) Check the fastq file with:
$ bcp.py view fq BCP/test/test_sim.fastq
# Reads number: 7476
# Ambiguous reads: 388 (5.19%)
# Short reads (length < 30): 0 (0.0%)
# Mean read length: 142.687
# Mean read quality: 58.027
3.2) These reads are not cleaned.
3.3) Launch:
$ bcp.py run BCP/test/test_sim.fastq BCP/test/ref.fasta
3.4) A series of messages inform you about what is happening,
from cleaning to variant calling and plotting.
3.5) If the computation goes well, you will visualize '# Done.'
as last line.
3.6) Test the risk function by typing:
$ bcp.py risk BCP/test/ClinicalData.csv BCP/test/GenomicData.csv
3.7) If no error messages occur, BCP has been correctly installed.
4) Type 'bcp.py run' to visualize what the 'run' function can do.
Anyway, be carefull. The 'run function' do not clip and
the final output is in VCF format. If clipping is needed,
use the 'clip' function. If you want a tabular filtered
file, use the 'call' function on the .bcf produced by
the run function.
'clean' and 'run' will clean permanently uncleaned files!
(uncleaned files will be lost)
5) To have a good BCP walkthrough, type in sequence:
$ bcp.py convert
$ bcp.py index_fastq
$ bcp.py clip
$ bcp.py clean
$ bcp.py index
$ bcp.py align
$ bcp.py pileup
$ bcp.py call
$ bcp.py view
$ bcp.py stat
$ bcp.py simulate
$ bcp.py intersect
$ bcp.py risk
$ bcp.py run
4. BREAST CANCER (BC) RISK ASSESSMENT.
The BC risk function is based on the Gail model for breast cancer risk
assessment (references 1 to 3), written in R. The function needs two tables,
in .csv format, for clinical and genome information. Examples in the BCP/test
folder, called ClinicalData.csv and GenomicData.csv, show how to build the
two tables.
The columns in ClinicalData.csv (i.e. the required clinical information)
are respectively:
Patient ID [any]
Race white=0, african=1, hispanic=2, chinese=3, japanese=4
filippino=5, hawaiian=6, other_pacific=7, other_asian=8
Age Current age [INT]
Risk_age The aim is to calculate risk from Age to Risk_age, so:
Risk_age = Age + follow_up; usually follow_up = 5 years [INT]
Menarche_age Age at first menstrual cycle [INT]
First_live_birth Age at first live birth [INT]
Affected_relatives Number of first-degree relatives affected by BC [INT]
Biopsies Number of biopsies [INT]
Hyperplasia Number of biopsies with detected hyperplasia [INT]
The columns in GenomicData.csv (i.e. the required genome information)
are respectively:
Patient ID [any - the same as ClinicalData.csv]
SNP_ID Usually the rs number [STR]
Chromosome Chromosome number [INT]
Gene_name Gene name that contains the mutation [STR]
Rare_allele The disease allele (A, C, T, G)
OR The Odds-Ratio for the disease allele, related to the
risk of developing BC [FLOAT]
5. BENCHMARKING (optional).
You can benchmark the pipeline if you want. The benchmarkng protocol is
not automated, but you can proceed by using the BCP functions.
First, you can calculate the sensitivity. Sensitivity is the ratio
between the variants detected over all the variants present in the
sample. To know this variability a priori, we must simulate a sequence.
1) Simulate a sequence using the 'simulate' function.
2) You have a simulated .sam file. Convert it to bam using 'convert'.
3) You have now a sorted and indexed bam.
4) Call variants using 'align' and 'pileup' with -a (accurate):
you will obtain a .bcf file.
5) Convert the .bcf file to tabular using 'call' with -T and -S options.
6) Now align the simulated fastq (it should have the _aln.fastq extension)
against the reference using 'run' with -a.
7) Convert the obtained .bcp file to tabular format using 'call'.
8) Intersect the table at point (7) with the table obtained at (5), using 'intersect'.
Remember that table (5) is the reference table.
9) The 'intersect' function work out a percentage: it is the sensitivity.
Specificity is easier to compute. It is defined as the number of variants that
are 'real' variants (i.e. are present in a known variants file).
Suppose to work out variants from a chromosomal region, e.g. chromosome 13.
Provided a known variants file (as downloaded from the UCSC Genome Browser)
from that region, we can intersect our table with the known variants (reference)
table. Once again, the 'inersect' fuction will provide a percentage that is the
value of specificity.
6. ACKNOWLEDGEMENTS.
The author thanks Professor Elena Maestrini and the Biocoputing Group of
Università di Bologna, for their constant support.
7. REFERENCES.
1) Gail M H, Brinton L A, et al. (1989). Projecting individualized probabilities
of developing breast cancer for white females who are being examined
annually. J. Natl. Cancer Inst., 81: 1879.
2) Gail M H (2009). Value of adding Single-Nucleotide Polymorphism genotypes to
a breast cancer risk model. JNCI, 101: 959.
3) Gail M H (2010). Personalized estimates of breast cancer risk in clinical
practice and public health. Stat. Med., 30: 1090.