Home
Name Modified Size InfoDownloads / Week
BCP-0.2.tar.gz 2012-10-02 1.0 MB
README 2012-10-02 10.6 kB
BCP.tar.gz 2012-09-29 1.0 MB
Totals: 3 Items   2.1 MB 0
#####################
#     BCP Guide     #   Copyright (c) 2012 Fernando Palluzzi
#####################

Contact: fernando.palluzzi@studio.unibo.it

0. LICENSE.

    This file is part of BCP - The Breast Cancer risk Pipeline.

    BCP is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    BCP is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with BCP.  If not, see <http://www.gnu.org/licenses/>.


1. INSTALLATION.

1) Move/Copy the BCP folder in your home directory.

2) Start a terminal.

2) Enter in your .bashrc file (you can find it by using 'locate bashrc';
   there should be one in your home directory).

3) Modfy your PYTHON and PYTHONPATH variables by modifying the following
   two lines (XXXXX stands for the initial variable content):

   export PATH=XXXXX:/home/<user>/BCP

   export PYTHONPATH=XXXXX:/home/<user>/BCP

   If one or both of these line do not exist, please create it.

4) Do not forget to give execution permissions to the BCP files;
   e.g.:
         $ chmod 700 -R /home/<user>/BCP

5) BCP needs third-party dependencies.
   Only Python, Biopython and Pysam are needed to boot BCP. The other
   dependencies are needed to make BCP work correctly.

   Python       http://www.python.org/
   Biopython    http://biopython.org/wiki/Biopython
   SQLite 3     http://www.sqlite.org/
   Fastx        http://hannonlab.cshl.edu/fastx_toolkit/
   R            http://www.r-project.org/
   SMALT        http://www.sanger.ac.uk/resources/software/smalt/
   Samtools     http://samtools.sourceforge.net/
   Pysam        http://code.google.com/p/pysam/
   ART          http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

6) Try to boot bcp, with the command:
   
   $ bcp.py
   
   You should receive the following message:
   
   Welcome to the BCPipeline
   Version: 0.2 - October 01, 2012
	
	Usage: bcp.py <command> [options]
	
	Type bcp.py <command> for information about command usage.
	Remember to use the right file extension:
	Reference genome	fa, fas, fasta
	Libraries		fq, fastq
	Alignments		sam, bam
	ART artificial align.	aln
	Variants		bcf, vcf, tab
	ART statistics		stat
	Quality statistics	stats
	Length distribution	ldst
	Coverage distribution	cdst
	Plots			png
	
   Command: convert        Convert file format and phred score
            index_fastq    Index a huge fastq file into a .idx file
            clip           Clip mids, adapters and primers and split fastq
                            or .idx files on the base of tag sequences
            clean          Clean either a fastq or a .idx file
            index          Index a reference file in fasta format
            align          Produce .sam and .bam alignments of a fastq file
                            against its reference in fasta format
            pileup         Produce a .bcf file from a .bam alignment
            call           Call variants (convert a .bcf file into a .vcf one)
            view           Show statistics for fasta, fastq and bam files
            stat           Produce stats files and graphs
            simulate       Produce artificial 454 and Illumina reads in
                            fastq format and alignments in SAM format
            intersect      Intersect tables
            risk           Evaluate Breast Cancer Risk by merging clinical
                            and genome information
            run            Automated BCP run

7) Once installed BCP, check the dependencies:
   
   $ bcp.py check
   
   BCP dependencies checkup

   [ ]: not installed    [X]: installed
   
   [X]: python
   [X]: sqlite3
   [X]: fastx_quality_stats
   [X]: R
   [X]: smalt_i686
   [X]: smalt_MacOSX_i386
   [X]: smalt_MacOSX_i686_64bit
   [X]: smalt_x86_64
   [X]: samtools
   [X]: bcftools
   [X]: art_454
   [X]: art_illumina


2. DEPENDENCIES.

The list of dependencies can be checked as explained above.
The dependencies to be installed are:

   Python/Biopython     Core functions
   SQLite3              Index databases
   Fastx                Quality stats (it can also be used for clipping and splitting fastq files)
   R                    Statistics and plots
   SMALT                Alignments
   SAMtools             Variant calling
   Pysam                BAM parsing
   ART                  Simulation


3. RUN BCP FOR THE FIRST TIME.

To run BCP correctly the user should remember some rule.

1) Do not merge options. e.g.: write -a -f, not -af; write -q 20, not -q20

2) Read the internal guides to know each single BCP function.
   BCP functions are listed at the beginning of this document.
   To read the documentation from inside BCP, type from a terminal: bcp.py <function>

3) Try to make a test run.
   
   3.0) Go in the 'test' folder.
   3.1) Check the fastq file with:
	
	$ bcp.py view fq BCP/test/test_sim.fastq
	
	#              Reads number: 7476
	#           Ambiguous reads: 388 (5.19%)
	# Short reads (length < 30): 0 (0.0%)
	#          Mean read length: 142.687
	#         Mean read quality: 58.027

   3.2) These reads are not cleaned.
   3.3) Launch:
	
	$ bcp.py run BCP/test/test_sim.fastq BCP/test/ref.fasta

   3.4) A series of messages inform you about what is happening,
	from cleaning to variant calling and plotting.
   3.5) If the computation goes well, you will visualize '# Done.'
	as last line.
   3.6) Test the risk function by typing:

        $ bcp.py risk BCP/test/ClinicalData.csv BCP/test/GenomicData.csv

   3.7) If no error messages occur, BCP has been correctly installed.

4) Type 'bcp.py run' to visualize what the 'run' function can do.

	Anyway, be carefull. The 'run function' do not clip and
	the final output is in VCF format. If clipping is needed,
	use the 'clip' function. If you want a tabular filtered
	file, use the 'call' function on the .bcf produced by
	the run function.
	'clean' and 'run' will clean permanently uncleaned files!
	(uncleaned files will be lost)

5) To have a good BCP walkthrough, type in sequence:
	$ bcp.py convert
	$ bcp.py index_fastq
	$ bcp.py clip
	$ bcp.py clean
	$ bcp.py index
	$ bcp.py align
	$ bcp.py pileup
	$ bcp.py call
	$ bcp.py view
	$ bcp.py stat
	$ bcp.py simulate
	$ bcp.py intersect
	$ bcp.py risk
	$ bcp.py run


4. BREAST CANCER (BC) RISK ASSESSMENT.

The BC risk function is based on the Gail model for breast cancer risk
assessment (references 1 to 3), written in R. The function needs two tables,
in .csv format, for clinical and genome information. Examples in the BCP/test
folder, called ClinicalData.csv and GenomicData.csv, show how to build the
two tables.

The columns in ClinicalData.csv (i.e. the required clinical information)
are respectively:

Patient ID            [any]
Race                  white=0, african=1, hispanic=2, chinese=3, japanese=4
                      filippino=5, hawaiian=6, other_pacific=7, other_asian=8
Age                   Current age [INT]
Risk_age              The aim is to calculate risk from Age to Risk_age, so:
                      Risk_age = Age + follow_up; usually follow_up = 5 years [INT]
Menarche_age          Age at first menstrual cycle [INT]
First_live_birth      Age at first live birth [INT]
Affected_relatives    Number of first-degree relatives affected by BC [INT]
Biopsies              Number of biopsies [INT]
Hyperplasia           Number of biopsies with detected hyperplasia [INT]

The columns in GenomicData.csv (i.e. the required genome information)
are respectively:

Patient ID            [any - the same as ClinicalData.csv]
SNP_ID                Usually the rs number [STR]
Chromosome            Chromosome number [INT]
Gene_name             Gene name that contains the mutation [STR]
Rare_allele           The disease allele (A, C, T, G)
OR                    The Odds-Ratio for the disease allele, related to the
                      risk of developing BC [FLOAT]


5. BENCHMARKING (optional).

You can benchmark the pipeline if you want. The benchmarkng protocol is
not automated, but you can proceed by using the BCP functions.
First, you can calculate the sensitivity. Sensitivity is the ratio
between the variants detected over all the variants present in the
sample. To know this variability a priori, we must simulate a sequence.

    1) Simulate a sequence using the 'simulate' function.
    2) You have a simulated .sam file. Convert it to bam using 'convert'.
    3) You have now a sorted and indexed bam.
    4) Call variants using 'align' and 'pileup' with -a (accurate):
       you will obtain a .bcf file.
    5) Convert the .bcf file to tabular using 'call' with -T and -S options.
    6) Now align the simulated fastq (it should have the _aln.fastq extension)
       against the reference using 'run' with -a.
    7) Convert the obtained .bcp file to tabular format using 'call'.
    8) Intersect the table at point (7) with the table obtained at (5), using 'intersect'.
       Remember that table (5) is the reference table.
    9) The 'intersect' function work out a percentage: it is the sensitivity.

Specificity is easier to compute. It is defined as the number of variants that
are 'real' variants (i.e. are present in a known variants file).
Suppose to work out variants from a chromosomal region, e.g. chromosome 13.
Provided a known variants file (as downloaded from the UCSC Genome Browser)
from that region, we can intersect our table with the known variants (reference)
table. Once again, the 'inersect' fuction will provide a percentage that is the
value of specificity.


6. ACKNOWLEDGEMENTS.

The author thanks Professor Elena Maestrini and the Biocoputing Group of
Università di Bologna, for their constant support.


7. REFERENCES.

1) Gail M H, Brinton L A, et al. (1989). Projecting individualized probabilities
          of developing breast cancer for white females who are being examined
          annually. J. Natl. Cancer Inst., 81: 1879.
2) Gail M H (2009). Value of adding Single-Nucleotide Polymorphism genotypes to
          a breast cancer risk model. JNCI, 101: 959.
3) Gail M H (2010). Personalized estimates of breast cancer risk in clinical
          practice and public health. Stat. Med., 30: 1090.

Source: README, updated 2012-10-02