Thread: [Samtools-help] Estimate library size for single end reads?

[Samtools-help] Estimate library size for single end reads?

From: Benjamin L. <benjaminlevinson2010@u.northwestern.edu> - 2013-06-21 23:08:49

Hey,

Currently the library size estimate in Picard Tools MarkDuplicates is only
implemented for paired-end reads. Is there an obvious reason why it cannot
be extended to single-end reads? If I wanted to apply the formula to
single-end reads, is there an obvious way to do this?

Re: [Samtools-help] Estimate library size for single end reads?

From: Alec W. <al...@br...> - 2013-06-23 17:29:08

Hi Benjamin,

MarkDuplicates decides that two read pairs are duplicates if the 5' ends of
both reads in one pair match those in the other pair.  See
http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_How_does_MarkDuplicates_work.3F

Although MarkDuplicates will work on single-end reads, there is only one 5'
end to compare, so the chance that two reads from distinct molecules in the
sample (rather than from PCR duplication) nonetheless share the same 5' end
is greatly increased.  Therefore MarkDuplicates is not considered a great
tool for single-end data.

A better approach for single-end data would be to decide that two reads are
dupes if 5' ends are the same and also the sequence content of the two
reads is similar enough.  Since the vast majority of what we do is
paired-end, we haven't had the need to implement this.
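
A minimal sketch of the approach described above, for illustration only.
This is not Picard code: the Read record, the mark_duplicates function and
the max_mismatch_rate threshold are all hypothetical names and values, and
the sketch assumes equal-length reads whose 5' coordinate and strand have
already been extracted from the alignments.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (not a Picard or samtools type)."""
    name: str
    chrom: str
    five_prime: int  # 5' coordinate of the read on the reference
    reverse: bool    # True if aligned to the reverse strand
    seq: str


def mismatches(a: str, b: str) -> int:
    """Count mismatching bases over the shared length of two sequences."""
    return sum(x != y for x, y in zip(a, b))


def mark_duplicates(reads, max_mismatch_rate=0.05):
    """Return the names of reads flagged as duplicates.

    Reads are grouped by (chrom, 5' end, strand). Within a group, a read is
    flagged if its sequence is within max_mismatch_rate of a read that was
    already kept; otherwise it is kept as a new representative.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.five_prime, r.reverse)].append(r)

    duplicates = set()
    for group in groups.values():
        kept = []
        for r in group:
            is_dup = False
            for k in kept:
                overlap = min(len(r.seq), len(k.seq))
                if overlap and mismatches(r.seq, k.seq) <= max_mismatch_rate * overlap:
                    is_dup = True
                    break
            if is_dup:
                duplicates.add(r.name)
            else:
                kept.append(r)
    return duplicates
```

Setting max_mismatch_rate to 0 requires identical sequences, while a large
value reduces to comparing the 5' end and strand only.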

-Alec

Re: [Samtools-help] Estimate library size for single end reads?

From: Benjamin L. <benjaminlevinson2010@u.northwestern.edu> - 2013-06-23 19:06:31

Ah, yes, that makes sense, and I see now that the formula assumes any
observed duplicates are true duplicates and not independent reads that
happen to share the same start site.
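
For reference, a rough sketch of the formula in question, assuming it is
the Lander-Waterman relation that the Picard documentation gives for
library size estimation: C/X = 1 - exp(-N/X), where N is the number of
reads (pairs, in Picard's case) examined, C is the number of distinct ones
observed, and X is the unknown library size. The function name and the
bisection solver below are my own illustration, not Picard's code.

```python
import math


def estimate_library_size(total_reads: float, unique_reads: float) -> float:
    """Solve C/X = 1 - exp(-N/X) for X by bisection.

    total_reads is N and unique_reads is C. The estimate is only defined
    when some, but not all, of the reads are duplicates (0 < C < N).
    """
    n, c = float(total_reads), float(unique_reads)
    if not 0 < c < n:
        raise ValueError("need 0 < unique_reads < total_reads")

    def f(x: float) -> float:
        # Positive when x is below the root, negative when above it.
        return c / x - 1.0 + math.exp(-n / x)

    lo = hi = c              # f(c) = exp(-n/c) > 0, so the root lies above c
    while f(hi) > 0:
        hi *= 2.0            # expand the upper bound until the sign changes
    for _ in range(200):     # bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Plugging in single-end counts is mechanically possible, but any reads that
share a start site by chance lower C, which pushes the estimate of X down.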

However, I think there is one situation where Picard would be the way to go
(as opposed to a method that requires two reads to have exactly the same
sequence before labeling one of them a duplicate): count-based analysis
between different samples. Suppose one sample is heterozygous at a position
in a given region and the other is not. If you require every base in the
read to match to call a duplicate, then after removing duplicates the
heterozygous sample can end up with up to twice as many reads over that
position as the homozygous sample. With Picard you end up with the same
number of reads in each sample, provided the SNP doesn't make alignment
more difficult. If you believe the SNP will not affect the counts (i.e.,
ChIP-seq with the SNP adjacent to a TFBS), then the Picard way would be the
way to go, I think.
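
A toy illustration of that counting argument, with made-up reads given as
(start position, strand, sequence) tuples. The count_after_dedup helper is
hypothetical and only mimics the two duplicate definitions being compared.

```python
def count_after_dedup(reads, use_sequence):
    """Count reads left after collapsing duplicates.

    With use_sequence=False, reads sharing start position and strand are
    collapsed (position-only, as Picard does for single-end reads). With
    use_sequence=True, the sequences must also be identical.
    """
    keys = {
        (pos, strand, seq if use_sequence else None)
        for pos, strand, seq in reads
    }
    return len(keys)


# Four reads starting at the same site in each sample (invented data).
homozygous = [(100, "+", "ACGTA")] * 4
heterozygous = [(100, "+", "ACGTA")] * 2 + [(100, "+", "ACCTA")] * 2

for label, sample in (("hom", homozygous), ("het", heterozygous)):
    print(
        label,
        "position-only:", count_after_dedup(sample, use_sequence=False),
        "exact-sequence:", count_after_dedup(sample, use_sequence=True),
    )
```

Position-only collapsing leaves one read per sample here, while exact
matching leaves one read in the homozygous sample and two in the
heterozygous one.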

Thanks for the timely reply.
