Menu

CHAT

Clonal Heterogeneity Analysis Tool Introduction

If you are using this package in your publications, please cite: Li et al., A general framework for analyzing tumor subclonality using SNP array and DNA sequencing data

Below is a detailed instruction of usage of the R package CHAT (also available as CRAN-R package). This tool is designed to systematically analyze tumor subclonality using SNP array and DNA sequencing data from tumor/normal pair samples. Clonality of two types of somatic events are estimated in CHAT: somatic copy number alteration (sCNA) and somatic mutation. We use sAGP (segmental aneuploid genome proportion) to denote the fraction of cells carrying a specific sCNA and CCF (cancer cell fraction) to denote the fraction of carriers for a specific somatic mutation.

To acquire the package via CRAN-R repository, simply open an R console, and type:

install.packages('CHAT')

Or you can download the tar.gz file and install the package locally:

install.packages('Path_to_the_package/CHAT_1.0.tar.gz',type='source',repos=NULL)

1. Data Preparation

Before analysis, make sure you have access to two types of data for each tumor/normal pair to complete the full pipeline: allele specific SNP array and DNA sequencing data. To note, the first step of CHAT pipeline is to estimate sAGP using SNP array data, so if sequencing data is not available, users can still estimate sAGP. Allele-specific SNP array data contains two types of signals for each SNP: intensity of A allele and intensity of B allele. A or B is arbitrarily assigned parental allele. In CHAT, the input is commonly used logR ratio (LRR) and B-allele frequency (BAF), transformed from the original intensities:

LRR = log2(intensity of A+B)-1
BAF = intensity of B / intensity of (A+B)

The pipeline is optimized using SNP array data of both Illumina and Affymetrix platforms. It is also possible for users to extract LRR and BAF from sequencing data with decent coverage (≥50X recommended). It is currently not part of CHAT and details for preparing LRR and BAF data can be found in [Extract LRR and BAF signals from next generation sequencing data]

Genome-wide SNP arrays usually contains more than 500K markers. When sample size is large (if 100 tumor samples are analyzes, there will be 200 in total, since each tumor is paired with a normal sample), in R, memory consumption to load the complete data is large. Therefore, I split the original data file into 22 auto-chromosomes. Sex chromosomes are currently not part of the analysis, and is one of my future directions.

User needs to create two directories: BAF/ and LRR/ with writing and reading privileges, save txt files for each of the 22 chromosomes in these directories with names chrN.txt, N=1,2,...22. The columns for BAF and LRR files are:

SNP id (optional), chromosome, position start, position end (optional), tumor_1, normal_1, tumor_2, normal_2, ...

To note, in BAF files, the identifier used in real file for tumor_i and normal_i should be exactly the same as those used in LRR files. In other words, the first line for every BAF file and every LRR file should be identical.

For DNA sequencing data, users should provide variant call format (VCF) by jointly calling each tumor/normal pair. If there are K tumor/normal pairs, there should be K VCF files, all saved in the same directory with only these K files in it. The 10th and 11th columns for the VCF file should be tumor and normal (not necessarily this order). Both indels and SNVs should be contained in one file for each tumor/normal pair.

2. sCNA analysis

CHAT provide two types of units of DNA segments: fixed SNP marker bins and natural CNA discovery (by circular binary segmentation).

getSeg() // type ?getSeg in your console for more details

Example 1: binned segment with 1000 germline heterozygous SNP markers in a bin

getSeg('BAF/','LRR/',sam.col=5,savefile='Test.Rdata',data.type='log',bin=1000)

Example 2: cbs segmentation using multi-thread processing

getSeg('BAF/','LRR/',sam.col=5,nt=T,savefile='Test.Rdata',data.type='log')

3. AGP inference

AGP is short for aneuploid genome proportion, a concept brought about in my previous work. CHAT is able to estimate AGP as well as averaged tumor ploidy:

getAGP()

To perform this analysis, user needs to define the parameters required for the method getAGP():

para <- getPara()

which returns a list of parameters. All the details can be found in the documentation files, but there are several things need to be paid attention to particularly. First, the LRR data obtained from in this fashion usually suffer from technical contraction, for example, the theoretical value for a heterozygous deletion is -1, but due to this artifact, its absolute value is smaller. To fix this issue, in the parameter list, we provide two normalization factors: $LRR_correction_amp and $LRR_correction_del, for normalization of amplification and deletion respectively. The default values are for Illumina 550K array. For a different array, an experienced user should be able to test a range of scaling factors manually on the BAF-LRR plot and pick an optimal one, in terms of allowing the most observed data points to located close to canonical positions. The usage of BAF-LRR plot and concept of canonical positions can also be found in my previous work.

Another important parameter is $thr.penalty. This is the penalty for increasing the copy number of the baseline DNA content. On a BAF-LRR plot with x-axis being |BAF-0.5|, the origin stands for genotype AB. When the majority of tumor genome is doubled, the origin will be AABB, or AAABBB. The genotype for the origin is determined using a global maximization of likelihood approach, to map as many observed data point to the theoretical canonical positions as possible. However, since more canonical positions are amplifications, increasing origin genotype copy number results in better fitting sometimes. So a penalty term is introduced to prevent over-fitting. When the true origin genotype is AABB, we are expected to observe DNA segments with AB, AAB etc to locate beneath the origin. And $thr.penalty is the minimum number of BAF markers in all of these DNA segments. In other words, if we cannot find a segment that is large enough to beneath the origin, we assume the origin is AB rather than AABB.

Below is an example of AGP inference:

para <- getPara()
para$datafile <- 'Test.Rdata' // this is the file generated by getSeg()
para$thr.penalty <- 300
para$savefile <- 'AGP_temp.txt'
getAGP(para=para)

4. sAGP inference

sAGP inference is similar to AGP inference:

para.s <- getPara.sAGP()
para.s$inputdata <- 'Test.Rdata'
para.s$purityfile <- 'AGP_temp.txt' // this is the file generated by getAGP()
para.s$savedata <- 'Test_sAGP.Rdata'
getsAGP(para=para.s)

You may also boost the computational speed by setting before running getsAGP():

para.s$is.multicore <- TRUE

5. CCF inference

CCF inference is performed using getCCF() method:

getCCF('VCF/',sAGPfile='Test_sAGP.Rdata',nt=TRUE,filter=TRUE,output='Test.vcf')


Related

Wiki: Extract LRR and BAF signals from next generation sequencing data

Discussion

  • Jean-Sebastien Milanese

    Hi again Bo,

    for the getCCF function, I noticed the AD is set to 3 by default (with delimiter ';'). Considering the VCF example I showed you in the sequencing discussion, my AD should be in 4. However, the output file is empty. There is no error in the terminal so I'm assuming the VCF gets loaded but not properly.

    Unlike you, my delimiter is not ';' but ':'. In the sequencing discussion, you mentioned VCF with delimiter ',' for AD (ParseVCF) so I'm getting a bit confused.

    I would just like to know the real delimiter that getCCF recognize so I can convert my VCF accordingly for both functions (ParseVCF and getCCF) before running everything. I think it would be easier for me to convert to what getCCF recognize and then, change the ParseVCF myself.

    Thanks again!
    Jean-Sebastien

     
    • Bo Li

      Bo Li - 2014-10-03

      Hi Jean-Sebastien,

      The VCF I am using have two types of delimiters, and looks something like this:

      0/1:50,32:82:...

      The fields are GP:AD:DP...

      I guess you can convert your VCF file into this format and apply CHAT.

      It is probably harder for you to modify getSampleCCF or getCCF functions, since they are built in the package. But you can certainly modify ParseVCF.

      Thanks,
      Bo

       
  • Mitsuko

    Mitsuko - 2014-12-23

    I managed to run getSeg(), getAGP() and getsAGP() to obtain the sAGP data properly, e.g.,

    new.dd.dat=get(load('sAGP.Rdata'))
    head(new.dd.dat)
    chr logR.mean bin BAF.mean bin
    Sample1.T1 2 752566 12278932 0.032007233 1000 0.000000000 1000
    Sample1.T1 2 12450829 24516880 0.020524412 1000 0.000000000 1000
    Sample1.T1 2 24531634 41293664 0.006148282 1000 0.039042118 1000
    Sample1.T1 2 41295542 56904162 -0.077622378 1000 0.007352231 1000
    Sample1.T1 2 56904480 68334436 0.179746835 1000 0.084860368 1000
    Sample1.T1 2 68335624 83172684 0.135804702 1000 0.000000000 1000
    seg_purity nb nt
    Sample1.T1 0.0000000 1 2
    Sample1.T1 0.0000000 1 2
    Sample1.T1 0.1192383 0 1
    Sample1.T1 0.0000000 1 2
    Sample1.T1 0.1170898 0 3
    Sample1.T1 0.0000000 1 2

    ... but when I send it to getCCF() along with the vcf files, the resulting output file always turns out empty. I wonder if there are any ways to find out what's going wrong.

     

    Last edit: Mitsuko 2014-12-23
    • Bo Li

      Bo Li - 2014-12-23

      The output of sAGP looks OK to me. In getCCF() step, the most typical problem is the format of VCF files. In my code, I suppose the file includes a tumor/normal pair, with the 10th and 11th columns tumor and normal (or the other way round), and Allele Depth (AD) should be coded as the 3rd (you can also specify a different number) field. Could you please check if your VCF file has this format?

       
    • Bo Li

      Bo Li - 2014-12-23

      Hi Mitsuko,

      I wonder if you still have this problem or you have solved it? I saw your
      most recent post and it seems to me that you have solved this issue.

      Thanks,
      Bo

      On Tue, Dec 23, 2014 at 2:05 PM, Mitsuko mitsukok@users.sf.net wrote:

      When using getSeg(), I tend to get NaN in the mean values: for example,

      seg.dat=get(load('seg.Rdata'))
      head(seg.dat)
      chr logR.mean bin BAF.mean bin
      Sample001.T1 2 1017197 11754168 NaN 1000 0.000000000 1000
      Sample001.T1 2 11761252 22448217 NaN 1000 0.006551691 1000
      Sample001.T1 2 22450487 37096184 NaN 1000 0.000000000 1000
      Sample001.T1 2 37096594 50605337 NaN 1000 0.000000000 1000
      Sample001.T1 2 50613676 63529371 NaN 1000 0.000000000 1000
      Sample001.T1 2 63533935 77754853 NaN 1000 0.000000000 1000

      ... I assume this is due to some division by zeros, but how can I avoid
      such errors?


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       
      • Mitsuko

        Mitsuko - 2014-12-23

        Yes, I solved the problem on my own - I needed to specify data.type="log" when calling getSeg(). I should have read the help file more carefully.

         
      • Mitsuko Korobkin

        Yes, I found out what's wrong myself before you replied me. I forgot to specify data.type='log' while running getSeg() previously.

        I'll update on the other problem I'm having later.

        Thanks again,
        -Mitsuko

        From: Bo Li [mailto:lukeli1987@users.sf.net]
        Sent: Tuesday, December 23, 2014 3:34 PM
        To: [clonalhetanalysistool:wiki]
        Subject: [clonalhetanalysistool:wiki] Re: Discussion for CHAT page

        Hi Mitsuko,

        I wonder if you still have this problem or you have solved it? I saw your
        most recent post and it seems to me that you have solved this issue.

        Thanks,
        Bo

        On Tue, Dec 23, 2014 at 2:05 PM, Mitsuko mitsukok@users.sf.netmitsukok@users.sf.net wrote:

        When using getSeg(), I tend to get NaN in the mean values: for example,

        seg.dat=get(load('seg.Rdata'))
        head(seg.dat)
        chr logR.mean bin BAF.mean bin
        Sample001.T1 2 1017197 11754168 NaN 1000 0.000000000 1000
        Sample001.T1 2 11761252 22448217 NaN 1000 0.006551691 1000
        Sample001.T1 2 22450487 37096184 NaN 1000 0.000000000 1000
        Sample001.T1 2 37096594 50605337 NaN 1000 0.000000000 1000
        Sample001.T1 2 50613676 63529371 NaN 1000 0.000000000 1000
        Sample001.T1 2 63533935 77754853 NaN 1000 0.000000000 1000

        ... I assume this is due to some division by zeros, but how can I avoid
        such errors?


        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT

        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/https://sourceforge.net/auth/subscriptions


        Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT

        To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/https://sourceforge.net/auth/subscriptions

        This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.

         
  • Mitsuko

    Mitsuko - 2014-12-23

    I prepare my vcf files so that normal and tumor samples appear in the 10th and the 11th field, respectively. The AD data in the INFO field is in the 2nd, so I let AD=2. (The attached vcf file with this message shows for the first 10 lines of my input file.) I run the getCCF() as follows.

    getCCF("./VCF",'sAGP.Rdata', output="CCF.vcf", nt=FALSE, nc=10, tc=11, AD=2, filter=TRUE, TCGA=FALSE)

    Program loads all the vcf files with no error messages, but the output file is empty. I will check if there exist some common entries among those vcf files to see if the problem is caused by the input files or not. Thank you anyway for your help.

     

    Last edit: Mitsuko 2014-12-23
    • Bo Li

      Bo Li - 2014-12-23

      OK, I see the problem. This is a bit of legacy issue -- VCF calls from older variant callers, such as GATK (which is not a good example), will assign a genotype for the tumor sample. I am not familiar with newer callers. But in your file, I don't see a GT field in the 11th columns. getCCF aims to find true somatic mutations. If GT not given, it will assume the first field is GT and use the wrong information. If you are certain that all the mutations are somatic, you can add 0/1 prior to the current tumor INFO column. But you should also make sure that in your normal column, there are GT=0/0 or 1/1. I see many calls have 0/1 but with alternative read depths quite small. I think that may be a technical problem you should pay attention to.

       
  • Mitsuko

    Mitsuko - 2014-12-23

    I'm glad you pointed out the problem, because there was in fact a bug in my script that prepared input VCF files. In the example vcf file I sent you, the header was correctly positioned (10th=normal, 11th=tumor), but the data entries for the normal and tumor fields were swapped, which caused the error. Another problem was that the variant caller I used (MuTect) tend to set GT=0 for the normal sample although correct way seems to be GT=0/0:

    http://gatkforums.broadinstitute.org/discussion/2485/vcf-output-question

    After fixing these problems, I was able to obtain CCF output. Thanks again for your help!

     
  • Mitsuko

    Mitsuko - 2014-12-24

    I have one more question: I tend to get no segment information from chromosome 1 and cannot figure out why. The rest of the chromosomes look fine, e.g.,

    data=get(load('seg.Rdata'))
    head(data)
    chr logR.mean bin BAF.mean bin
    Sample1.T1 1 0 0 -0.03060 1000 0.008756967 1000
    Sample1.T2 1 0 0 -0.00440 1000 0.020682437 1000
    Sample1.T1 2 1017197 11754168 -0.06160 1000 0.000000000 1000
    Sample1.T1 2 11761252 22448217 -0.05935 1000 0.006551691 1000
    Sample1.T1 2 22450487 37096184 -0.06015 1000 0.000000000 1000
    Sample1.T1 2 37096594 50605337 -0.05770 1000 0.000000000 1000

    Input BAF and LRR files have numerous entries in chromosome 1, and so I don't understand why getSeg() returns no segments. I attach the R script I use to produce the seg.Rdata with this post.

     
    • Bo Li

      Bo Li - 2014-12-24

      I am not sure at this point where could go wrong. May I take a look at your SNP data to diagnose?

       
      • Mitsuko

        Mitsuko - 2014-12-24

        I think I found the cause of the problem - I removed the X and Y chromosome entries in my BAF/LRR input files and re-run my script, and the output seg.Rdata contains segments in the chromosome 1 correctly.

        Thank you for the help anyways.

         
        • Bo Li

          Bo Li - 2014-12-24

          You are right -- sex chromosomes are currently not included in my analysis. Glad that you fix it. Thanks for letting me know!

           
  • Tong

    Tong - 2015-01-05

    Hi Bo,

    I was able to run CHAT following the pipeline you provided.
    However, I am facing problems in understanding the output .vcf file.

    For example, I take 2 lines from my output .vcf file as example.
    In the second line, the SNP is assigned to A1, which is corresponding to the 1st lineage scenario described in your paper. What I am not sure is what does the "NA" mean in the first line? Thanks in advance!


    10 134898263 . G T 200 PASS tumor_A001.T.S01;32;72;0;111;NA;NA;NA;NA;NA;NA GT:DP:AD:BQ:MQ:SB:FA 0/0:111,0:111 0/1:40,32:73
    11 223764 . A G 200 PASS tumor_A001.T.S01;29;53;1;86;0.57;0.197;0.574;1;4;A1 GT:DP:AD:BQ:MQ:SB:FA 0/0:85,1:86 0/1:24,29:53


     
    • Bo Li

      Bo Li - 2015-01-05

      Hi Tong,

      The NA in these fields indicate that the CHAT cannot estimate information for the corresponding somatic mutation due to the unidentifiability issues for sAGP or CCF. Please check out our paper for more details.

      Thanks,
      Bo

       
  • ying ya

    ying ya - 2016-09-23

    Hi Bo,

    What maybe the problems when I use getSegChr(bb.chr, ll.chr, thr.hets=thr.hets, data.type='log') to calculate segments and get errors:
    .Error in nls(y ~ 1/sqrt(2 * pi)/b * exp(-(x - a)^2/2/b^2), start = list(b = 0.06, :
    number of iterations exceeded maximum of 50

    The erros just take place on some chromosomes and some samples.

    Thanks,
    Ying

     
    • Bo Li

      Bo Li - 2016-09-23

      Hi Ying,

      You probably need to look at the distribution of BAF markers. There should
      be 4 peaks -- use thr.hets to remove the top and bottom peaks. If those two
      are not effectively removed, there will be problem.

      Best,
      Bo

      On Fri, Sep 23, 2016 at 2:08 AM, ying ya yingya@users.sf.net wrote:

      Hi Bo,

      What maybe the problems when I use getSegChr(bb.chr, ll.chr,
      thr.hets=thr.hets, data.type='log') to calculate segments and get errors:
      .Error in nls(y ~ 1/sqrt(2 * pi)/b * exp(-(x - a)^2/2/b^2), start = list(b
      = 0.06, :
      number of iterations exceeded maximum of 50

      The erros just take place on some chromosomes and some samples.

      Thanks,
      Ying


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/clonalhetanalysistool/wiki/CHAT/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       
  • Ron

    Ron - 2018-11-02

    Hi,
    I am tring to get the purity file:

    para <- getPara()
    para$BAFfilter=1 #i think this needs correction instead of 10
    para$datafile <- 'Test.Rdata'
    para$thr.penalty <- 300
    para$savefile <- 'AGP_temp.txt'
    getAGP(para=para)

    I keep getting the following error:

    Error in dat[delPos, 4] * (1/para$LRR_correction_del) :
    non-numeric argument to binary operator

    Thanks
    Ron

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.