From: Benjamin L. <benjaminlevinson2010@u.northwestern.edu> - 2013-06-21 23:08:49
|
Hey, Currently that function of Picard Tools MarkDuplicates is only implemented for paired-end reads. Is there an obvious reason why it cannot be extended to single-end reads? If I wanted to apply the formula to single-end reads, is there an obvious way to do this? |
From: Alec W. <al...@br...> - 2013-06-23 17:29:08
|
Hi Benjamin, MarkDuplicates decides that reads are duplicates if the 5' ends of the two pairs are identical. See http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_How_does_MarkDuplicates_work.3F . MarkDuplicates will run on single-end reads, but because there is only one 5' end to compare, the chance that two reads from distinct molecules in the sample (rather than from PCR duplication) nonetheless share the same 5' end is greatly increased. Therefore MarkDuplicates is not considered a great tool for single-end data. A better approach for single-end data would be to decide that two reads are dupes if their 5' ends are the same and the sequence content of the two reads is also similar enough. Since the vast majority of what we do is paired-end, we haven't had the need to implement this. -Alec |
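The single-end approach Alec suggests (same 5' end plus sufficiently similar sequence) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything Picard implements: the `Read` record, the `mark_single_end_duplicates` function, the `max_mismatch_fraction` threshold, and the choice to keep the read with the highest summed base quality are all hypothetical. Sequences are assumed to be given in 5'-to-3' read orientation so that position 0 is always the 5' end.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (not a Picard or pysam type)."""
    name: str
    chrom: str
    strand: str       # "+" or "-"
    five_prime: int   # reference coordinate of the read's 5' end
    seq: str          # bases in 5'->3' read orientation (assumption)
    qual_sum: int     # summed base qualities, used to pick which read to keep


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of mismatching bases over the shared 5' prefix of two reads."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n


def mark_single_end_duplicates(reads, max_mismatch_fraction=0.05):
    """Return the names of reads flagged as duplicates.

    Two reads are treated as duplicates when they share chromosome, strand
    and 5' coordinate AND their sequences differ in at most
    max_mismatch_fraction of compared bases (an assumed threshold).
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.strand, r.five_prime)].append(r)

    duplicates = set()
    for group in groups.values():
        representatives = []  # best-quality read of each sequence cluster
        for r in sorted(group, key=lambda r: r.qual_sum, reverse=True):
            if any(mismatch_fraction(r.seq, rep.seq) <= max_mismatch_fraction
                   for rep in representatives):
                duplicates.add(r.name)
            else:
                representatives.append(r)
    return duplicates


if __name__ == "__main__":
    example = [
        Read("r1", "chr1", "+", 100, "ACGTACGTAC", 300),
        Read("r2", "chr1", "+", 100, "ACGTACGTAC", 280),  # same start, same bases
        Read("r3", "chr1", "+", 100, "ACGTTTTTAC", 290),  # same start, different bases
        Read("r4", "chr1", "+", 250, "ACGTACGTAC", 310),  # different start
    ]
    print(sorted(mark_single_end_duplicates(example)))
```

In the example only `r2` is flagged: `r3` shares the start site but differs in too many bases, so it is treated as coming from a distinct molecule. With real data the threshold would have to allow for sequencing error while still separating genuinely different fragments.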
From: Benjamin L. <benjaminlevinson2010@u.northwestern.edu> - 2013-06-23 19:06:31
|
Ah, yes, that makes sense, and I see now that the formula assumes any observed duplicates are true duplicates and not just multiple reads that happen by chance to share the same start site. However, I think one situation where Picard would be the way to go (as opposed to a method that requires two reads to have exactly the same sequence before labeling one of them a duplicate) is count-based analysis between different samples. If one sample is heterozygous at a position in a given region while the other is not, then requiring every base in the read to match before calling a duplicate can leave up to twice as many reads at that site, after removing duplicates, in the heterozygous sample as in the homozygous one. With Picard you end up with the same number of reads in each sample, as long as the SNP doesn't make aligning more difficult. If you believe the SNP will not impact the counts (i.e., ChIP-seq with the SNP adjacent to a TFBS), then the Picard way would be the way to go, I think. Thanks for the timely reply. |
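The counting argument in Benjamin's reply can be checked with a toy Python example. The reads, sequences, and function names below are made up for illustration; each read is just a `(start, sequence)` tuple, and both samples have four reads piled up at the same start site.

```python
def count_after_position_dedup(reads):
    """Keep one read per start site (the position-only rule)."""
    return len({start for start, _seq in reads})


def count_after_exact_sequence_dedup(reads):
    """Keep one read per (start site, exact sequence) combination."""
    return len(set(reads))


# Hypothetical reads: four per sample, all starting at position 100.
homozygous = [(100, "ACGTA")] * 4
heterozygous = [(100, "ACGTA"), (100, "ACGTA"),
                (100, "ACGGA"), (100, "ACGGA")]  # second allele: T -> G

for label, sample in (("homozygous", homozygous),
                      ("heterozygous", heterozygous)):
    print(label,
          count_after_position_dedup(sample),
          count_after_exact_sequence_dedup(sample))
```

Position-only deduplication leaves one read in each sample, while exact-sequence deduplication leaves one read in the homozygous sample and two in the heterozygous one, which is the between-sample bias described above.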