From: David N. <Dav...@hc...> - 2010-06-29 22:35:50
|
Hello Noboru, Thanks for pointing out that dataset, it's very interesting.... I'm seeing the same issue you are. ScanSeqs -> EnrichedRegionMaker -> 50K peaks! These look very real. A visual inspection shows lots of enrichment, maybe too much enrichment for DESeq. With the MultipleReplicaScanSeqs I don't get anything coming back from DESeq. The problem here is that DESeq really needs biological replicas to get an accurate estimation of the overdispersion in the count data. When no replicas are present, DESeq takes the treatment and control samples and treats them as replicas to estimate the variance. This works fine for typical chIp-seq samples where the number of peaks is relatively small. But when the number gets huge then DESeq thinks it's seeing huge overdispersion between the replicas and drastically down weights the pvalues. Hmm, no easy way to fix this. Would be good to add a catch to warn folks when the overdispersion factor gets too big. For now use the ScanSeqs results. Be sure to remove duplicate reads using the PointDataManipulator. I noticed they were 15%, a bit high. Also, in running the PeakShiftFinder, a window size of 350bp would be better, 100bp peak shift is fine. Lastly, there's a big standard deviation in the peak sizes. Again, this is likely do to the nature of H3K4Me1 marks, so would be good to increase the max gap permitted to say 500-1000 bp in the EnrichedRegionMaker. -cheers, D P.S. Hope you don't mind but I'm going to forward this to the useq-users forum. -- David Austin Nix, PhD | Director HCI Bioinformatics/ Co-Director UofU Bioinformatics Shared Resource | Huntsman Cancer Institute | 2000 Circle of Hope | SLC, UT 84112 | Rm: 3165 | Vc: 801.587.4611 | Fx: 801.585.6458 | dav...@hc... | Skype/iChat: LiveNix | WebSite: http://bioserver.hci.utah.edu | DAS/2: http://bioserver.hci.utah.edu:8080/DAS2DB/genome On 6/29/10 3:04 PM, "Noboru Jo Sakabe" <ns...@uc...> wrote: Hi David, I've been having problems with MultipleReplicaScanSeqs. I ran MACS and QuEST on a public data set and found tens of thousands of peaks with very low FDR. However, when I use MultipleReplicaScanSeqs, I don't get peaks. ScanSeqs works fine. I tried versions 6.4 and 6.6. Below's what I see on my screen, and I get nothing in my results directory (only a Temp... directory). These are the raw data I used: from NCBI GEO GSE19553 -> GSM487453: Input_DNA, GSM487452: H3K4me1.UT Hope this was helfpul in some way. noboru $ java -Xmx4500M -jar /programs/chip/USeq_6.6/Apps/MultipleReplicaScanSeqs -t ../../useq_pointData -c ../../../../input/useq_pointData -s ./results -p 100 -w 200 Arguments: -t ../../useq_pointData -c ../../../../input/useq_pointData -s ./results -p 100 -w 200 Calculating read count stats... 6100635 Treatment Observations 6058571 Control Observations 100 Peak shift 200 Window size 0.5 Minimum Window FDR Scanning chromosomes...... Skipping chrY. No windows found with minimum reads of 10 within a window size of 200 ............... Calculating negative binomial p-values and FDRs in R using DESeq (http://www-huber.embl.de/users/anders/DESeq/). This requires patience, 64bit R, and > 6-8G RAM... Parsing results... WARNING: No significant windows found?! FDR threshold is set to 0.5 . Try restarting with a less stringent threshold. Done! 6 min |