Clonal Heterogeneity Analysis Tool Introduction
If you are using this package in your publications, please cite: Li et al., A general framework for analyzing tumor subclonality using SNP array and DNA sequencing data
Below is a detailed instruction of usage of the R package CHAT (also available as CRAN-R package). This tool is designed to systematically analyze tumor subclonality using SNP array and DNA sequencing data from tumor/normal pair samples. Clonality of two types of somatic events are estimated in CHAT: somatic copy number alteration (sCNA) and somatic mutation. We use sAGP (segmental aneuploid genome proportion) to denote the fraction of cells carrying a specific sCNA and CCF (cancer cell fraction) to denote the fraction of carriers for a specific somatic mutation.
To acquire the package via CRAN-R repository, simply open an R console, and type:
install.packages('CHAT')
Or you can download the tar.gz file and install the package locally:
install.packages('Path_to_the_package/CHAT_1.0.tar.gz',type='source',repos=NULL)
1. Data Preparation
Before analysis, make sure you have access to two types of data for each tumor/normal pair to complete the full pipeline: allele specific SNP array and DNA sequencing data. To note, the first step of CHAT pipeline is to estimate sAGP using SNP array data, so if sequencing data is not available, users can still estimate sAGP. Allele-specific SNP array data contains two types of signals for each SNP: intensity of A allele and intensity of B allele. A or B is arbitrarily assigned parental allele. In CHAT, the input is commonly used logR ratio (LRR) and B-allele frequency (BAF), transformed from the original intensities:
LRR = log2(intensity of A+B)-1
BAF = intensity of B / intensity of (A+B)
The pipeline is optimized using SNP array data of both Illumina and Affymetrix platforms. It is also possible for users to extract LRR and BAF from sequencing data with decent coverage (≥50X recommended). It is currently not part of CHAT and details for preparing LRR and BAF data can be found in [Extract LRR and BAF signals from next generation sequencing data]
Genome-wide SNP arrays usually contains more than 500K markers. When sample size is large (if 100 tumor samples are analyzes, there will be 200 in total, since each tumor is paired with a normal sample), in R, memory consumption to load the complete data is large. Therefore, I split the original data file into 22 auto-chromosomes. Sex chromosomes are currently not part of the analysis, and is one of my future directions.
User needs to create two directories: BAF/ and LRR/ with writing and reading privileges, save txt files for each of the 22 chromosomes in these directories with names chrN.txt, N=1,2,...22. The columns for BAF and LRR files are:
SNP id (optional), chromosome, position start, position end (optional), tumor_1, normal_1, tumor_2, normal_2, ...
To note, in BAF files, the identifier used in real file for tumor_i and normal_i should be exactly the same as those used in LRR files. In other words, the first line for every BAF file and every LRR file should be identical.
For DNA sequencing data, users should provide variant call format (VCF) by jointly calling each tumor/normal pair. If there are K tumor/normal pairs, there should be K VCF files, all saved in the same directory with only these K files in it. The 10th and 11th columns for the VCF file should be tumor and normal (not necessarily this order). Both indels and SNVs should be contained in one file for each tumor/normal pair.
2. sCNA analysis
CHAT provide two types of units of DNA segments: fixed SNP marker bins and natural CNA discovery (by circular binary segmentation).
getSeg() // type ?getSeg in your console for more details
Example 1: binned segment with 1000 germline heterozygous SNP markers in a bin
getSeg('BAF/','LRR/',sam.col=5,savefile='Test.Rdata',data.type='log',bin=1000)
Example 2: cbs segmentation using multi-thread processing
getSeg('BAF/','LRR/',sam.col=5,nt=T,savefile='Test.Rdata',data.type='log')
3. AGP inference
AGP is short for aneuploid genome proportion, a concept brought about in my previous work. CHAT is able to estimate AGP as well as averaged tumor ploidy:
getAGP()
To perform this analysis, user needs to define the parameters required for the method getAGP():
para <- getPara()
which returns a list of parameters. All the details can be found in the documentation files, but there are several things need to be paid attention to particularly. First, the LRR data obtained from in this fashion usually suffer from technical contraction, for example, the theoretical value for a heterozygous deletion is -1, but due to this artifact, its absolute value is smaller. To fix this issue, in the parameter list, we provide two normalization factors: $LRR_correction_amp and $LRR_correction_del, for normalization of amplification and deletion respectively. The default values are for Illumina 550K array. For a different array, an experienced user should be able to test a range of scaling factors manually on the BAF-LRR plot and pick an optimal one, in terms of allowing the most observed data points to located close to canonical positions. The usage of BAF-LRR plot and concept of canonical positions can also be found in my previous work.
Another important parameter is $thr.penalty. This is the penalty for increasing the copy number of the baseline DNA content. On a BAF-LRR plot with x-axis being |BAF-0.5|, the origin stands for genotype AB. When the majority of tumor genome is doubled, the origin will be AABB, or AAABBB. The genotype for the origin is determined using a global maximization of likelihood approach, to map as many observed data point to the theoretical canonical positions as possible. However, since more canonical positions are amplifications, increasing origin genotype copy number results in better fitting sometimes. So a penalty term is introduced to prevent over-fitting. When the true origin genotype is AABB, we are expected to observe DNA segments with AB, AAB etc to locate beneath the origin. And $thr.penalty is the minimum number of BAF markers in all of these DNA segments. In other words, if we cannot find a segment that is large enough to beneath the origin, we assume the origin is AB rather than AABB.
Below is an example of AGP inference:
para <- getPara()
para$datafile <- 'Test.Rdata' // this is the file generated by getSeg()
para$thr.penalty <- 300
para$savefile <- 'AGP_temp.txt'
getAGP(para=para)
4. sAGP inference
sAGP inference is similar to AGP inference:
para.s <- getPara.sAGP()
para.s$inputdata <- 'Test.Rdata'
para.s$purityfile <- 'AGP_temp.txt' // this is the file generated by getAGP()
para.s$savedata <- 'Test_sAGP.Rdata'
getsAGP(para=para.s)
You may also boost the computational speed by setting before running getsAGP():
para.s$is.multicore <- TRUE
5. CCF inference
CCF inference is performed using getCCF() method:
getCCF('VCF/',sAGPfile='Test_sAGP.Rdata',nt=TRUE,filter=TRUE,output='Test.vcf')
Wiki: Extract LRR and BAF signals from next generation sequencing data
Hi again Bo,
for the getCCF function, I noticed the AD is set to 3 by default (with delimiter ';'). Considering the VCF example I showed you in the sequencing discussion, my AD should be in 4. However, the output file is empty. There is no error in the terminal so I'm assuming the VCF gets loaded but not properly.
Unlike you, my delimiter is not ';' but ':'. In the sequencing discussion, you mentioned VCF with delimiter ',' for AD (ParseVCF) so I'm getting a bit confused.
I would just like to know the real delimiter that getCCF recognize so I can convert my VCF accordingly for both functions (ParseVCF and getCCF) before running everything. I think it would be easier for me to convert to what getCCF recognize and then, change the ParseVCF myself.
Thanks again!
Jean-Sebastien
Hi Jean-Sebastien,
The VCF I am using have two types of delimiters, and looks something like this:
0/1:50,32:82:...
The fields are GP:AD:DP...
I guess you can convert your VCF file into this format and apply CHAT.
It is probably harder for you to modify getSampleCCF or getCCF functions, since they are built in the package. But you can certainly modify ParseVCF.
Thanks,
Bo
I managed to run getSeg(), getAGP() and getsAGP() to obtain the sAGP data properly, e.g.,
... but when I send it to getCCF() along with the vcf files, the resulting output file always turns out empty. I wonder if there are any ways to find out what's going wrong.
Last edit: Mitsuko 2014-12-23
The output of sAGP looks OK to me. In getCCF() step, the most typical problem is the format of VCF files. In my code, I suppose the file includes a tumor/normal pair, with the 10th and 11th columns tumor and normal (or the other way round), and Allele Depth (AD) should be coded as the 3rd (you can also specify a different number) field. Could you please check if your VCF file has this format?
Hi Mitsuko,
I wonder if you still have this problem or you have solved it? I saw your
most recent post and it seems to me that you have solved this issue.
Thanks,
Bo
On Tue, Dec 23, 2014 at 2:05 PM, Mitsuko mitsukok@users.sf.net wrote:
Yes, I solved the problem on my own - I needed to specify data.type="log" when calling getSeg(). I should have read the help file more carefully.
Yes, I found out what's wrong myself before you replied me. I forgot to specify data.type='log' while running getSeg() previously.
I'll update on the other problem I'm having later.
Thanks again,
-Mitsuko
From: Bo Li [mailto:lukeli1987@users.sf.net]
Sent: Tuesday, December 23, 2014 3:34 PM
To: [clonalhetanalysistool:wiki]
Subject: [clonalhetanalysistool:wiki] Re: Discussion for CHAT page
Hi Mitsuko,
I wonder if you still have this problem or you have solved it? I saw your
most recent post and it seems to me that you have solved this issue.
Thanks,
Bo
On Tue, Dec 23, 2014 at 2:05 PM, Mitsuko mitsukok@users.sf.netmitsukok@users.sf.net wrote:
When using getSeg(), I tend to get NaN in the mean values: for example,
seg.dat=get(load('seg.Rdata'))
head(seg.dat)
chr logR.mean bin BAF.mean bin
Sample001.T1 2 1017197 11754168 NaN 1000 0.000000000 1000
Sample001.T1 2 11761252 22448217 NaN 1000 0.006551691 1000
Sample001.T1 2 22450487 37096184 NaN 1000 0.000000000 1000
Sample001.T1 2 37096594 50605337 NaN 1000 0.000000000 1000
Sample001.T1 2 50613676 63529371 NaN 1000 0.000000000 1000
Sample001.T1 2 63533935 77754853 NaN 1000 0.000000000 1000
... I assume this is due to some division by zeros, but how can I avoid
such errors?
Sent from sourceforge.net because you indicated interest in
https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT
To unsubscribe from further messages, please visit
https://sourceforge.net/auth/subscriptions/https://sourceforge.net/auth/subscriptions
Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT
To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/https://sourceforge.net/auth/subscriptions
This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
I prepare my vcf files so that normal and tumor samples appear in the 10th and the 11th field, respectively. The AD data in the INFO field is in the 2nd, so I let AD=2. (The attached vcf file with this message shows for the first 10 lines of my input file.) I run the getCCF() as follows.
Program loads all the vcf files with no error messages, but the output file is empty. I will check if there exist some common entries among those vcf files to see if the problem is caused by the input files or not. Thank you anyway for your help.
Last edit: Mitsuko 2014-12-23
OK, I see the problem. This is a bit of legacy issue -- VCF calls from older variant callers, such as GATK (which is not a good example), will assign a genotype for the tumor sample. I am not familiar with newer callers. But in your file, I don't see a GT field in the 11th columns. getCCF aims to find true somatic mutations. If GT not given, it will assume the first field is GT and use the wrong information. If you are certain that all the mutations are somatic, you can add 0/1 prior to the current tumor INFO column. But you should also make sure that in your normal column, there are GT=0/0 or 1/1. I see many calls have 0/1 but with alternative read depths quite small. I think that may be a technical problem you should pay attention to.
I'm glad you pointed out the problem, because there was in fact a bug in my script that prepared input VCF files. In the example vcf file I sent you, the header was correctly positioned (10th=normal, 11th=tumor), but the data entries for the normal and tumor fields were swapped, which caused the error. Another problem was that the variant caller I used (MuTect) tend to set GT=0 for the normal sample although correct way seems to be GT=0/0:
http://gatkforums.broadinstitute.org/discussion/2485/vcf-output-question
After fixing these problems, I was able to obtain CCF output. Thanks again for your help!
I have one more question: I tend to get no segment information from chromosome 1 and cannot figure out why. The rest of the chromosomes look fine, e.g.,
Input BAF and LRR files have numerous entries in chromosome 1, and so I don't understand why getSeg() returns no segments. I attach the R script I use to produce the seg.Rdata with this post.
I am not sure at this point where could go wrong. May I take a look at your SNP data to diagnose?
I think I found the cause of the problem - I removed the X and Y chromosome entries in my BAF/LRR input files and re-run my script, and the output seg.Rdata contains segments in the chromosome 1 correctly.
Thank you for the help anyways.
You are right -- sex chromosomes are currently not included in my analysis. Glad that you fix it. Thanks for letting me know!
Hi Bo,
I was able to run CHAT following the pipeline you provided.
However, I am facing problems in understanding the output .vcf file.
For example, I take 2 lines from my output .vcf file as example.
In the second line, the SNP is assigned to A1, which is corresponding to the 1st lineage scenario described in your paper. What I am not sure is what does the "NA" mean in the first line? Thanks in advance!
10 134898263 . G T 200 PASS tumor_A001.T.S01;32;72;0;111;NA;NA;NA;NA;NA;NA GT:DP:AD:BQ:MQ:SB:FA 0/0:111,0:111 0/1:40,32:73
11 223764 . A G 200 PASS tumor_A001.T.S01;29;53;1;86;0.57;0.197;0.574;1;4;A1 GT:DP:AD:BQ:MQ:SB:FA 0/0:85,1:86 0/1:24,29:53
Hi Tong,
The NA in these fields indicate that the CHAT cannot estimate information for the corresponding somatic mutation due to the unidentifiability issues for sAGP or CCF. Please check out our paper for more details.
Thanks,
Bo
Hi Bo,
What maybe the problems when I use getSegChr(bb.chr, ll.chr, thr.hets=thr.hets, data.type='log') to calculate segments and get errors:
.Error in nls(y ~ 1/sqrt(2 * pi)/b * exp(-(x - a)^2/2/b^2), start = list(b = 0.06, :
number of iterations exceeded maximum of 50
The erros just take place on some chromosomes and some samples.
Thanks,
Ying
Hi Ying,
You probably need to look at the distribution of BAF markers. There should
be 4 peaks -- use thr.hets to remove the top and bottom peaks. If those two
are not effectively removed, there will be problem.
Best,
Bo
On Fri, Sep 23, 2016 at 2:08 AM, ying ya yingya@users.sf.net wrote:
Hi,
I am tring to get the purity file:
para <- getPara()
para$BAFfilter=1 #i think this needs correction instead of 10
para$datafile <- 'Test.Rdata'
para$thr.penalty <- 300
para$savefile <- 'AGP_temp.txt'
getAGP(para=para)
I keep getting the following error:
Error in dat[delPos, 4] * (1/para$LRR_correction_del) :
non-numeric argument to binary operator
Thanks
Ron