Menu

Parameter problem

Anonymous
2014-03-24
2014-03-25
  • Anonymous

    Anonymous - 2014-03-24

    Dear Charles,

    When I use the function of COHCAP.avg.by.island, I encountered the following problem.

    1)The content in the file "450k_avg_by_island_test_CpG_island_filtered-Avg_by_Island.xlsx" and "450k_avg_by_island_test_CpG_island_stats-Avg_by_Island.xlsx" are the same when I use the command as you describe in the manual.

    filtered.islands <- COHCAP.avg.by.island(sample.file, filtered.sites, beta.table, project.name, project.folder, ref="parental")

    2)In the R console, it returns the following results. What's the meaning of the 7th line "29 10"? In the 8th line, I think there are 5 CpG islands to be analyzed. How can I obtain the statistic of all the 5 CpG islands?

    [1] "Reading Sample Description File...."
    [1] 172 6
    [1] 172 6
    [1] "Group: mutant" "Group: parental"
    [1] "Checking CpG Site Stats Table"
    [1] 34 10
    [1] 29 10
    [1] "Analyzing 5 CpG Islands..."
    [1] "Average CpG Sites per CpG Island"
    [1] "Differential Methylation Stats for 2 Groups with Reference"
    [1] 4 7
    [1] 4 7
    [1] "There are 4 differentially methylated islands"
    [1] 4 8
    [1] 4 8
    [1] "Plotting Significant Islands...."

    3)If I set the parameter of methyl.cutoff, unmethyl.cutoff and so on, I got the replicated results in the "450k_avg_by_island_test_CpG_island_filtered-Avg_by_Island.xlsx" file. You can check it using the example data.

    filtered.sites <- COHCAP.site(sample.file, beta.table, project.name, project.folder, ref="parental", methyl.cutoff=0, unmethyl.cutoff=1, delta.beta.cutoff=0.2, pvalue.cutoff=0.05)

    filtered.islands <- COHCAP.avg.by.island(sample.file, filtered.sites, beta.table, project.name, project.folder, ref="parental", methyl.cutoff=0, unmethyl.cutoff=1, delta.beta.cutoff=0.2, pvalue.cutoff=0.05)

    Thanks,

    Jiang

     
  • Charles Warden

    Charles Warden - 2014-03-24

    Hi Jiang,

    Thank you for using COHCAP.

    Let me answer your questions in a slightly different order:

    Question #3) You are using more liberal thresholds than the default settings for methyl.cutoff and unmethyl.cutoff. However, the delta.beta.cutoff and pvalue.cutoff parameters are not necessary (as specified above) because you are setting them to the default parameters (so, you aren't actually changing the parameters). You can use help(COHCAP.site) and help(COHCAP.avg.by.island) to see the default settings for all of the parameters.

    The demo dataset is a significantly truncated dataset (used for testing purposes). It is good for confirming that COHCAP is installed properly, but it isn't a great example of what you should expect to see with a real, full dataset. In particular, this dataset only contains probes for genes mapped to the 4 CpG islands subject to validation in the NAR COHCAP paper (so, all the islands are differentially methylated and the lists should be identical). If you want to see an example where the files will vary more, you can use the full demo dataset as described in the Protocol Exchange listing:

    http://www.nature.com/protocolexchange/protocols/2965#/procedure

    Question #2) These are the sizes of data matrices at various steps of the algorithm. This meant to help me with troubleshooting, but you shouldn't have to normally worry about those values. I'd have to check the code more carefully, but I think those numbers correspond to the number of sites (29 and 34) before and after one of the filtering steps. The main thing to look out for is zeros: that means none of the subsequent steps will work, likely because some filter didn't work (or it could also be because of a formatting issue, if it happens early on in the pipeline).

    There is a minor typo that I will work on fixing (the number is actually one more than the real number of CpG sites because of the sites that don't fall in any CpG islands). For the 4 differentially methylated islands, you should look in the CpG_Island folder. The Excel file contains the statistics and the subfolder contains box plots for each of the 4 significant islands.

    Question #1) If you are using the demo dataset, then please see my response to question #3. If this is your own dataset that you are describing, can you find examples in the "450k_avg_by_island_test_CpG_island_filtered-Avg_by_Island.xlsx" file where the stats don't seem like they should correspond to differentially methylated islands? For example, are the delta.beta values low (say, less than 0.2, if you left the default parameters alone) and FDR/p-value values high (greater than 0.05, if you left the default parameters alone)?

    Thanks,
    Charles

     

    Last edit: Charles Warden 2014-03-24
  • Anonymous

    Anonymous - 2014-03-25

    Dear Charles,

    Maybe I described the problem unclear. In the 3rd question, the replicated results are as follows. Each island has one replicate, although it cannot hinder to read the results. If I don't set the two parameters of methyl.cutoff and unmethyl.cutoff, the results have no replicates.

    filtered.islands <- COHCAP.avg.by.island(sample.file, filtered.sites, beta.table, project.name, project.folder, ref="parental", methyl.cutoff=0, unmethyl.cutoff=1)

    island gene mutant.avg.beta parental.avg.beta
    chr18:55862653-55862873 NEDD4L 0.903906667 0.076935
    chr18:55862653-55862873 NEDD4L 0.903906667 0.076935
    chr17:27044168-27045049 RAB34 0.038574762 0.894215714
    chr17:27046856-27047273 RAB34 0.050071111 0.841995556
    chr17:27044168-27045049 RAB34 0.038574762 0.894215714
    chr17:27046856-27047273 RAB34 0.050071111 0.841995556
    chr10:114712115-114712544 TCF7L2 0.743541905 0.052695238
    chr10:114712115-114712544 TCF7L2 0.743541905 0.052695238
    chr2:17720155-17720417 VSNL1 0.639643333 0.088956667
    chr2:17721537-17722021 VSNL1 0.816211852 0.050505556
    chr2:17720155-17720417 VSNL1 0.639643333 0.088956667
    chr2:17721537-17722021 VSNL1 0.816211852 0.050505556

    Jiang

     
    • Charles Warden

      Charles Warden - 2014-03-25

      Hi Jiang,

      Ok - I understand the problem better now. Thank you.

      I think the problem is due to the lack of ability to define a single methylated group and a single unmethylated group. I will work on fixing this - in the meantime, you are correct that it shouldn't negatively affect the results, but I agree this is annoying.

      Thanks,
      Charles

       

      Last edit: Charles Warden 2014-03-25
  • Anonymous

    Anonymous - 2014-03-25

    Dear Charles,

    When I analyzed my data, I got the following results. It means that there are 1367 CpG islands. I want to extract all of 1367 islands. So I set the parameters as: methyl.cutoff=0,unmethyl.cutoff=1,pvalue.cutoff=1,fdr.cutoff=1,num.sites=1. But it only returns 1022 islands. How to obtain all islands?

    [1] "Analyzing 1367 CpG Islands..."
    [1] "Average CpG Sites per CpG Island"
    [1] "Differential Methylation Stats for 2 Groups with Reference"
    [1] 1022 7
    [1] 2042 7
    [1] "There are 2042 differentially methylated islands"
    [1] 1022 23
    [1] 2042 23
    [1] "Plotting Significant Islands...."

    Jiang

     
  • Charles Warden

    Charles Warden - 2014-03-25

    Hi Jiang,

    As you mentioned before, there are multiple files for the CpG islands: a filtered one in the "CpG_Islands" folder and an unfiltered one in the "Raw_Data" folder.

    How many CpG Islands are in the stats table in the "Raw_Data" folder?

    As an example, I ran the protocol exchange dataset and found 144 unfiltered islands and 128 filtered islands.

    Also, I think this relates to the issue that you see above: the output statistics are wonky because of the thresholds being used. For example, the number of rows in the data matrices shouldn't go from "1022 7" to "2042 7", and you can't have more differentially methylated islands than islands. Maybe I'll fix this my implementing a general solution that ignores the methylated and unmethylated cutoffs if unmethyl.cutoff > methyl.cutoff.

    Additionally, I think that value may have other issues. For example, you clearly pointed out that the truncated dataset has 6 CpG Islands (4 of which significantly vary). When I ran your code, this step said that there were 7 CpG islands (so, it might be reporting +1 the real number of islands, even though I have changed the code so that it should be ignoring the "NA" cells for probes that don't map to a CpG island).

    Once I can update COHCAP (there is currently an update in queue, so this change would be in version 0.99.11), can you please re-try the analysis? I think this should result in 1022 islands instead of 2042 islands, and we can see if that earlier count is affected.

    Thanks,
    Charles

     
  • Anonymous

    Anonymous - 2014-03-25

    Dear Charles,

    The duplicates might be due to the unmethyl.cutoff > methyl.cutoff. Is my understanding correct? I will re-try using the next version. It's a small problem. I can omit the duplicates.

    I still don't know how to obtain all islands. In my analysis, there are 1367 islands. Even if I set the num.sites=1, I only got 1022 islands. Is it attribute to unmethyl.cutoff > methyl.cutoff?

    Thanks,

    Jiang

     
  • Charles Warden

    Charles Warden - 2014-03-25

    Hi Jiang,

    Yes - this is causing duplicates, starting with the COHCAP.site() step. I think this should be fixed in the updates that I just submitted to bioconductor (which will be in version 0.99.11).

    So, it looks like that number is not precise. Based upon the previous examples we discussed, my guess is that the true number should be that number minus 1. However, there may be other factors. For this reason, I have switched that number to be more like the other intermediate values. In the case of the single extra CpG island, it looks like the artificial "NA" islands are getting removed at some step, but not at that point (for whatever reason).

    If you want to keep altering parameters to try and understand COHCAP better, you can see what happens if you set delta.beta.cutoff = 0 for both the site and island analysis. This should cause some degree of increase. However, I don't think these alternative parameters are great for actual analysis. I was trying to design COHCAP to provide high-confidence results. With the exception of the methyl.cutoff change that I recommend for patient data (see "Common Problems & Solutions" the on the FAQ Page), I'm not sure how good the results will be with increasingly liberal parameters. However, I provide these as an option so that users won't ever be stuck without being able to get any results.

    Best,
    Charles

     

    Last edit: Charles Warden 2014-03-25
  • Anonymous

    Anonymous - 2014-03-25

    Dear Charles,

    I have read your codes. I think I find why the number of islands is not consistent. There are some special cases as follows.

    cg04920527 1 153505651 NA chr1:153508424-153508840
    cg24155129 1 153505790 NA chr1:153508424-153508840
    cg16291048 1 153507498 S100A6 chr1:153508424-153508840

    There are three sites in this island. In your code, you only extract the first line, genes[i] <- as.character(info.mat[1,4]). In this case, the gene name in the first line is NA. So it will be omitted. In the results, we cannot find S100A6 and chr1:153508424-153508840.

    I also find why there are duplicates in the results. If I set the cutoff of methyl and unmethyl, the temp.methyl.up and temp.methyl.down are same, filter.table <- rbind(temp.methyl.up, temp.methyl.down). So the rbind produced the duplicates.

    I have a concern about the number of sites used to calculate the islands. If a island has 10 sites, in which 5 sites are differentially methylated, should we use 10 sites to calculate the island methylation level? Why only use 5 differentially methylated sites? This is the reason that "450k_avg_by_island_test_CpG_island_filtered-Avg_by_Island.xlsx" and "450k_avg_by_island_test_CpG_island_stats-Avg_by_Island.xlsx" are similar. Because we only use the differentially methylated sites to calculate the island methylation level, most of the islands are more likely differentially methylated. Thus, the two files are similar.

    Jiang

     
  • Charles Warden

    Charles Warden - 2014-03-25

    Hi Jiang,

    Thanks for your feedback. You are correct about the duplicates and this is what I changed in v0.99.11. I will have to review your other technical comments more carefully (and probably release a patch for this in v0.99.12).

    As for the high-level discussion about the strategy, only the sites that vary are considered for subsequent analysis. This is part of the region why it is important to use the .wig files visually confirm the optimal start and stop coordinates for the island when performing validation (and the other reason is that the shores are included as part of the island, but the island coordinates are provided so that they match the "official" UCSC CpG islands).

    If you require 4 sites, I think the results typically look OK. At least when I have checked the individual results, there are usually additional sites that don't meet the criteria for statistical significance and the sites that do vary tend to be near each other. However, if you used all of the sites, you might miss cases where 1/2 of the sites do make up a clear differentially methylated region but the other sites do not (this does happen a decent amount of the time: for sample, you can see that most of the changes in ESR in Figure 3 of the COHCAP NAR paper actually occur in the north shore).

    However, I would share your concern if you only require 2 sites (I think this is more likely to occur by chance, and you'll be more likely to find the sites are not actually next to each other).

    Please let me know if you have additional questions.

    Thanks,
    Charles

     

    Last edit: Charles Warden 2014-03-26

Anonymous
Anonymous

Add attachments
Cancel





Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.