[Useq-users] No results from MultipleReplicaScanSeqs

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hello Noboru,

Thanks for pointing out that dataset, it's very interesting....

I'm seeing the same issue you are.  ScanSeqs -> EnrichedRegionMaker -> 50K peaks!  These look very real. A visual inspection shows lots of enrichment, maybe too much enrichment for DESeq.

With the MultipleReplicaScanSeqs I don't get anything coming back from DESeq.  The problem here is that DESeq really needs biological replicas to get an accurate estimation of the overdispersion in the count data.  When no replicas are present, DESeq takes the treatment and control samples and treats them as replicas to estimate the variance.  This works fine for typical chIp-seq samples where the number of peaks is relatively small.  But when the number gets huge then DESeq thinks it's seeing huge overdispersion between the replicas and drastically down weights the pvalues.  Hmm, no easy way to fix this.  Would be good to add a catch to warn folks when the overdispersion factor gets too big.  For now use the ScanSeqs results.  Be sure to remove duplicate reads using the PointDataManipulator.  I noticed they were 15%, a bit high.  Also, in running the PeakShiftFinder, a window size of 350bp would be better, 100bp peak shift is fine.  Lastly, there's a big standard deviation in the peak sizes. Again, this is likely do to the nature of H3K4Me1 marks, so would be good to increase the max gap permitted to say 500-1000 bp in the EnrichedRegionMaker.

-cheers, D

P.S. Hope you don't mind but I'm going to forward this to the useq-users forum.
--
David Austin Nix, PhD | Director HCI Bioinformatics/ Co-Director UofU Bioinformatics Shared Resource | Huntsman Cancer Institute | 2000 Circle of Hope | SLC, UT 84112 | Rm: 3165 | Vc: 801.587.4611 | Fx: 801.585.6458 | dav...@hc... | Skype/iChat: LiveNix | WebSite: http://bioserver.hci.utah.edu | DAS/2: http://bioserver.hci.utah.edu:8080/DAS2DB/genome

On 6/29/10 3:04 PM, "Noboru Jo Sakabe" <ns...@uc...> wrote:

   Hi David, I've been having problems with MultipleReplicaScanSeqs.
    I ran MACS and QuEST on a public data set and found tens of
thousands of peaks with very low FDR. However, when I use
MultipleReplicaScanSeqs, I don't get peaks. ScanSeqs works fine. I tried
versions 6.4 and 6.6.

    Below's what I see on my screen, and I get nothing in my results
directory (only a Temp... directory).
    These are the raw data I used: from NCBI GEO  GSE19553 -> GSM487453:
Input_DNA, GSM487452: H3K4me1.UT

    Hope this was helfpul in some way.

noboru

$ java -Xmx4500M -jar
/programs/chip/USeq_6.6/Apps/MultipleReplicaScanSeqs -t
../../useq_pointData -c ../../../../input/useq_pointData -s ./results -p
100 -w 200

Arguments: -t ../../useq_pointData -c ../../../../input/useq_pointData
-s ./results -p 100 -w 200

Calculating read count stats...
    6100635 Treatment Observations
    6058571 Control Observations
    100    Peak shift
    200    Window size
    0.5    Minimum Window FDR
Scanning chromosomes......
    Skipping chrY. No windows found with minimum reads of 10 within a
window size of 200
...............
Calculating negative binomial p-values and FDRs in R using DESeq
(http://www-huber.embl.de/users/anders/DESeq/). This requires patience,
64bit R, and > 6-8G RAM...
Parsing results...

WARNING: No significant windows found?! FDR threshold is set to 0.5 .
Try restarting with a less stringent threshold.

Done! 6 min