CoDeCZ
from Coverage Depth to Copy number using Z-scores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Author: Marlous Hoogstraat, UMC Utrecht
Reference for tumor data:
1) ...under review...
Reference for congenital disorders / exome based data:
2) Targeted next-generation sequencing: A novel diagnostic tool for primary immunodeficiencies.
Nijman IJ, van Montfrans JM, Hoogstraat M, Boes ML, van de Corput L, Renner ED, van Zon P, van Lieshout S, Elferink MG,
van der Burg M, Vermont CL, van der Zwaag B, Janson E, Cuppen E, Ploos van Amstel JK, van Gijn ME.
J Allergy Clin Immunol. 2013 Oct 15. doi: 10.1016/j.jaci.2013.08.032
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Description:
CoDeCZ derives copy number state from targeted sequencing data on gene or exon level.
Results are NOT true copy number however, but they are semi-quantitative:
It uses the modified z-score to calculate deviation in normalized coverage per specified region. The higher the Z-score,
the higher the copy number. The final z-score depends on the quality of the data and the makeup of the reference pool.
The most important inputfile for CoDeCZ is your design. Several posibilities:
- For an amplicon based design (e.g. IonTorrent AmpliSeq Cancer Hotspot panel, Illumina TruSeq Amplicon Cancer panel)
use the genomic coordinates of the amplicons, and (if not yet present) add a column with the name of the gene these
amplicons belong to.
- For a targeted, exome based design (e.g. SureSelect, NimbleGen or Agilent arrays), use the coordinates of the exons in the
design and (if not yet present) add a column with the name of the gene these exons belong to. This will results in copy number
data on exon-level and will also allow you to detect breakpoints within genes as described in ref 2).
NOTE: If you use whole exome data but only analyze a subset of genes, use the -f option to retreive the total number of
mapped reads per sample for normalization.
- If you're interested in larger events, such as large or even complete chromosome duplications / deletions, and the
resolution of your sequencing data is high enough (e.g. whole exome), you can also create a desing with bigger chunks.
Either use windows of e.g. 10.000bp, or the start/end location of genes within your region of interest, and add a column
with a name for your region (e.g. "chr1","1p","1q","1p31.1").
Examples can be found in CoDeCZ_example.bed.
If a z-score of more than 1 amplicon/exon/gene/chunk within your gene/region of interest reaches significance, CoDeCZ will
bin neighbouring significant chunks and re-calculate the z-score for the whole bin. This score will be reported in the
"significant_regions" file, which is the main outputfile. Also, if more than half of the chunks in your region have a
significant highscore but no significant bins are found, CoDeCZ will calculate a z-score for the full region to test
for significant deviation.
Another important aspect is the reference pool.
If true reference samples are available, i.e. samples that do not have any copy number alterations, I recommend you use
these. It will greatly increase sensitivity of the method. If not, you can use your test samples as a reference as well.
However, if many of your samples ( >25%) contain the same alteration and you use these samples as reference as well,
this alteration will not be picked up.
Also, if you are dealing with chromosomally instable tumor samples, this will probably mean that you will miss deletions and
duplications, but amplifications will still be reported (unless many samples contain amplification of the same gene).
CoDeCZ can work with as little as 4 samples, but increasing the number of samples will greatly increase sensitivity as well.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Requires samtools, R and bedtools version 2.16 or higher.
Samtools to create flagstats if needed, total number of mapped reads is used for normalization.
Bedtools to calculate number of reads per amplicon / exon.
R to calculate z-scores.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Install:
- Extract the CoDeCZ_v1.0.tar.gz tarball to your favourite tools folder, e.g. /usr/local/bin/
- No make or make install needed
- Edit the config file to point CoDeCZ to the locations for samtools and bedtools.
- If you want to use CoDeCZ to calculate z-scores over a large number of regions, e.g. more than
1000 exons / amplicons, it is possible to spread the coverage calculation over cluster nodes.
This is done with basic qsub commands.
The last part of CoDeCZ uses R, so make sure the last job is send to a clusternode with R
installed by setting the HOST option.
If you do not intend to use a cluster, this option is never read so can be anything you like.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Example:
./CoDeCZ_v1.0.sh -t "<PATH/TO/BAMFILES/>*tumor.bam" -r "<PATH/TO/BAMFILES/>*ref.bam" -o myfirstCoDeCZ -d <PATH/TO/BEDFILE/>CoDeCZ_design.bed -f myfirstCoDeCZ.conf -q 150 -sgw
or
./CoDeCZ_v1.0.sh -p <PATH/TO/CONFIG/>test.conf -t "<PATH/TO/BAMFILES/>*tumor.bam" -r "<PATH/TO/MULTIBAMCOV/>reference_pool.multibamcov" -o myfirstCoDeCZ -d <PATH/TO/BEDFILE/>CoDeCZ_design.bed -f myfirstCoDeCZ.conf -q 500 -sgwmu
If CoDeCZ has used the cluster, use ./cleanup.sh to remove all temporary files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage:
OPTIONS:
-p [s] Full path to config file that contains path to bedtools and samtools. Optional: if empty,
the default will be used (CoDeCZ_v1.0.conf).
-t [s] Full path to bams to test for amplifications (required). The use of wildcards (*) is possible, as
well as multiple paths. Use -t only once and surround paths with double quotes, e.g. "/path/1/*bam /path/2/*bam"
-o [s] Outputname for R config and image directory (optional; if not provided, directory
will be created within current PWD)
-d [s] Full path to design file, format: "chr start end name" (e.g. gene)
(required)
-r [s] Full path to bams for common reference pool, or multiBamCov file of reference
samples PREVIOUSLY CREATED BY THIS SCRIPT (optional)
-q [i] Create coverage boxplots of all samples as qc, plot line at i as minimum cutoff
or desired median. RECOMMENDED!
-s Create plots of all regions / genes per test sample. RECOMMENDED!
-g Create plots of all test samples per gene / group.
-c Create z-score boxplots per chromosome for each test sample
-m If a gene only contains 1 exon / amplicon and this exon / amplicon reaches
significance, report it in "sign_regions" outputfile in stead of in "sign_singles"
-w If no significant region or single is found at the end of z-score calculation, but
more than half of the probes/exons do reach 1.5 cutoff, calculate z-score for the
whole gene
-u If needed, use cluster to calculate coverages. If this option is selected,
clusterjobs will be created if the design file contains more than 1000 regions.
-f Use samtools flagstat to calculate total number of mapped reads per sample for normalization.
Use this option if the design used for sequencing is larger than the design used for this
algorithm, e.g. whole exome sequencing but only CNV analysis of a subset of these genes.
-i Remove regions with high IQR (instability). Recommended when detecting 1n / 3n events,
not for amplifications when test samples are part of reference pool.
-n No dynamic binning, recommended when testing very large genesets e.g. whole exome
-y Answer "yes" to all security questions (will OVERWRITE directories if existing and use
testbams as ref if no ref provided)
-a Use alternative cut-offs for significance (define these cut-offs in the config file)
The -q and -s options are recommended as quality checks of your data. The value specified with -q depends on
your data.
NOTE: Plotted here is the number of reads per amplicon / exon, not the base coverage.
If the -r option is not specified, the bams provided with -t will also be used as reference pool. Use this when
no true negatives / control samples are available.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Output:
<OUTPUTNAME>_sign_regions_sorted.tsv: File with regions within any sample of 2 or more significant neighbouring
exons / amplicons within a gene, sorted by % significant of total exons
within the gene and number of significant exons within the gene. Standard.
<OUTPUTNAME>_sign_singles.tsv: File with gene / sample combinations, where only one exon or amplicon
reaches significance. Standard.
<OUTPUTNAME><SAMPLENAME>_scores.tsv Z-scores for every exon / amplicon in the given sample. Standard.
Can be used for downstream analysis or plotting.
<OUTPUTNAME><SAMPLENAME>.png Barplots of z-scores of each gene for the given sample. With -s.
<OUTPUTNAME><SAMPLENAME>_chr.png Boxplots of z-score distribution per chromosome. Can indicate full
chromosome deletions or duplications if the design is large enough. With -c.
<OUTPUTNAME><GENENAME>.png Barplots of z-scores of each sample for the given gene. With -g.
<OUTPUTNAME>QCplot.png Boxplots indicating distribution of the number of reads per amplicon, per
sample. If the distribution for a certain sample is very broad compared to
other samples in the same run, check the <OUTPUTNAME><SAMPLENAME>.png to
determine if this is due to many copy number alterations / chromosomal
instability or if the sample is of bad quality. With -q.
cleanup.sh Bash file to clean up temporary files after clusterjobs. Standard.
runR.sh Bash file to run the full pipeline. Standard.
<OUTPUTNAME><SAMPLENAME>.flagstat Samtools flagstat per sample if not present in the same directory as the
bam-file. Standard.
ref_<OUTPUTNAME>.flagstat File with total number of reads per ref sample. Standard.
test_<OUTPUTNAME>.flagstat File with total number of reads per test sample. Standard.
ref_<OUTPUTNAME>.multiBamCov Number of reads per amplicon / exon in the design for each ref sample. Standard.
test_<OUTPUTNAME>.multiBamCov Number of reads per amplicon / exon in the design for each test sample. Standard.
<OUTPUTNAME>_conf.R R script that calculates z-scores. Not very pretty to read. R history is saved
after running CoDeCZ. Standard.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Known bugs or issues:
Bedtools will give the error "Could not open input bamfiles" on 2 occasions (that I know of):
1) If the bam-files are really not there, or if you have forgotten to set quotes around the paths;
2) If you include bam-files with different types of headers, e.g. if one bam-file does include readgroups while the
other does not.
MultiBamCov files only show zeroes instead of coverage:
Your design probably has "chr1, chr2" to indicate chromosomes, while your bam-files have just "1, 2" or "Chr1, Chr2" or
vice versa. You can check your bam-files using samtools view -H
Sensitivity of the tool is of course highly dependent on type and quality of the input:
if the reference pool contains only normal, 2n samples, sensitivity will be much higher
compared to a analysis with chromosomally unstable tumor samples in the reference pool.
Gene amplifications can always be detected however, but deletions and duplications can be
missed.
If one sample contains a high level gene amplification, it may appear that all other
genes in this sample are deleted. This is unfortunately a representation of the sequence
data, as the amplified locus claims a lot of the reads / sequence capacity. If you rerun the
analysis without this gene and without the -f option, information on the remaining genes can
be obtained. Or, copy the entire output directory, delete the gene from the multibamcov files and
rerun the R-script by using ./runR.sh in the copied directory.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EOF