Re: [Samtools-help] Algorithm for Picard MarkDuplicates

Well, treating paired reads as single-end reads will generally result
in significant over-marking of duplicates, because single-end reads are
far more likely to appear as duplicates than PE reads.

-t
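
[Illustration] A toy sketch of the over-marking effect described above. It
assumes the 5' coordinates have already been computed; the read names and
positions are made up, and the single-end case is approximated by keying on
read 1 alone. It is not Picard code.

```python
from collections import Counter

# Hypothetical pairs on one chromosome:
# (name, read1 5' position, read1 strand, read2 5' position, read2 strand)
pairs = [
    ("p1", 1000, "+", 1250, "-"),
    ("p2", 1000, "+", 1310, "-"),  # same read1 start as p1, different mate
    ("p3", 1000, "+", 1250, "-"),  # identical to p1 at both ends
    ("p4", 2000, "+", 2300, "-"),
]


def count_duplicates(keys):
    """Count every member of a key group beyond the first as a duplicate."""
    return sum(n - 1 for n in Counter(keys).values())


# Paired-end view: both ends must match for two pairs to share a key.
paired_dups = count_duplicates([(p[1], p[2], p[3], p[4]) for p in pairs])

# Single-end view: only read 1 is visible, so the key is much coarser.
single_dups = count_duplicates([(p[1], p[2]) for p in pairs])

print(paired_dups, single_dups)
```

With both ends in the key, only p3 duplicates p1. Keyed on read 1 alone, p2
collides with them as well, even though its mate maps elsewhere.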

On Apr 26, 2010, at 11:14 AM, Sendu Bala wrote:

> On 26/04/2010 16:08, Tim Fennell wrote:
>> Hey Sendu,
>>
>> I'm not really sure I understand the question. The way Mark Duplicates
>> works, it actually needs to see each read in a pair to determine
>> duplicate status, because it needs access to the CIGAR for both reads.
>> The way this is implemented, it doesn't process the read pair until it's
>> seen both ends - and it knows whether a read is paired or not based on
>> the flags.
>
> When it doesn't find the other read in the pair, can't it just treat
> the read as unpaired and do what it would normally do for single-ended
> duplicate marking? Or could the missing CIGAR information result in
> some cases where a read is marked as a duplicate incorrectly?
>
>
>> On Apr 26, 2010, at 10:31 AM, Sendu Bala wrote:
>>
>>> On 26/04/2010 14:24, Tim Fennell wrote:
>>>> It will not work optimally - it will only detect duplicates for pairs
>>>> where it has access to both ends. So if you split your files by
>>>> chromosome then you'll essentially lose inter-chromosomal duplicate
>>>> marking.
>>>
>>> So even if there's a whole bunch of forward reads stacked up on one chr,
>>> they'll be ignored if their reverse reads aren't in the bam file?
>>>
>>> Is that something that can't be changed because these somehow might not
>>> be 'real' duplicates without knowing the pair information (why not?), or
>>> is it on the to-do list?
>>>
>>>
>>>> On Apr 26, 2010, at 5:35 AM, Sendu Bala wrote:
>>>>
>>>>> On 21/04/2010 17:45, Tim Fennell wrote:
>>>>>> Hi Feiyu,
>>>>>>
>>>>>> The algorithm probably does need describing somewhere in detail,
>>>>>> but I don't believe I have anything handy. Essentially what it
>>>>>> does (for pairs; single-end data is also handled) is to find the
>>>>>> 5' coordinates and mapping orientations of each read pair. When
>>>>>> doing this it takes into account all clipping that has taken
>>>>>> place as well as any gaps or jumps in the alignment. You can
>>>>>> thus think of it as determining "if all the bases from the read
>>>>>> were aligned, where would the 5' most base have been aligned".
>>>>>> It then matches all read pairs that have identical 5' coordinates
>>>>>> and orientations and marks as duplicates all but the "best" pair.
>>>>>> "Best" is defined as the read pair having the highest sum of base
>>>>>> qualities at bases with Q >= 15.
>>>>>
>>>>> Am I right in thinking it will work correctly on a bam that has
>>>>> been split by chromosome? Or will something not work quite right if
>>>>> one read of a pair is missing because it mapped to a different
>>>>> chromosome?
>>>>>
>>>>>
>>>>> -- The Wellcome Trust Sanger Institute is operated by Genome
>>>>> Research Limited, a charity registered in England with number
>>>>> 1021457 and a company registered in England with number 2742969,
>>>>> whose registered office is 215 Euston Road, London, NW1 2BE.
>>>>
>>>
>>
>
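
[Illustration] A minimal sketch of the procedure described in the 21/04/2010
message: find the unclipped 5' coordinate of each end, group pairs with
identical 5' coordinates and orientations, and keep the pair with the highest
sum of base qualities at Q >= 15. The function names and the dict layout of a
"pair" are hypothetical, positions are assumed to be 1-based leftmost aligned
positions as in SAM, and details of the real tool (for example how ties are
broken) are not covered.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Where the 5'-most base would sit if every base had been aligned."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: step left over any leading soft/hard clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so walk across the
    # reference-consuming operations (including gaps) and add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


def score(quals, min_q=15):
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in quals if q >= min_q)


def pair_key(pair):
    """Sorted (chrom, 5' coordinate, orientation) of both ends."""
    return tuple(sorted(
        (e["chrom"], unclipped_five_prime(e["pos"], e["cigar"], e["reverse"]),
         e["reverse"])
        for e in pair["ends"]
    ))


def mark_duplicates(pairs):
    """Return the names of all pairs except the best one in each group."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair_key(pair)].append(pair)
    dups = set()
    for group in groups.values():
        best = max(group, key=lambda p: sum(score(e["quals"]) for e in p["ends"]))
        dups.update(p["name"] for p in group if p is not best)
    return dups


def make_pair(name, pos1, cigar1, pos2, cigar2, q):
    """Build a toy forward/reverse pair on chr1 with uniform quality q."""
    return {"name": name, "ends": [
        {"chrom": "chr1", "pos": pos1, "cigar": cigar1, "reverse": False, "quals": [q] * 50},
        {"chrom": "chr1", "pos": pos2, "cigar": cigar2, "reverse": True, "quals": [q] * 50},
    ]}


pairs = [
    make_pair("a", 100, "50M", 300, "50M", 30),
    make_pair("b", 105, "5S45M", 300, "45M5S", 20),  # same 5' ends once unclipped
    make_pair("c", 100, "50M", 400, "50M", 30),      # different mate position
]
print(mark_duplicates(pairs))
```

Pair "b" is soft-clipped at both ends, so its aligned positions differ from
"a", but its unclipped 5' coordinates match and it is the lower-scoring of the
two. Note that the key needs both ends, which is why a pair whose mate is
absent from the file cannot be matched this way.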