Re: [Samtools-help] Algorithm for Picard MarkDuplicates


It will not work optimally - it will only detect duplicates for pairs where it has access to both ends.  So if you split your files by chromosome, you'll essentially lose duplicate marking for pairs whose two ends map to different chromosomes.

-t

On Apr 26, 2010, at 5:35 AM, Sendu Bala wrote:

> On 21/04/2010 17:45, Tim Fennell wrote:
>> Hi Feiyu,
>> 
>> The algorithm probably does need describing somewhere in detail, but I
>> don't believe I have anything handy.  Essentially what it does (for
>> pairs; single-end data is also handled) is to find the 5' coordinates
>> and mapping orientations of each read pair.  When doing this it takes
>> into account all clipping that has taken place as well as any gaps or
>> jumps in the alignment.  You can thus think of it as determining "if
>> all the bases from the read were aligned, where would the 5' most base
>> have been aligned".  It then matches all read pairs that have
>> identical 5' coordinates and orientations and marks as duplicates all
>> but the "best" pair.  "Best" is defined as the read pair having the
>> highest sum of base qualities, counting only bases with Q >= 15.
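
For anyone who wants to see the idea in code, here is a minimal Python sketch of the steps described above. It is not Picard's implementation, only an illustration under some assumptions: positions are 1-based as in SAM, the function names and the dict layout for a pair are made up for this example, and things the real tool may also consider (library, unpaired reads, optical duplicates) are ignored. The unclipped 5' position is the alignment start minus leading clips for a forward read, and the alignment end plus trailing clips for a reverse read.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # CIGAR ops that advance along the reference
CLIPS = set("SH")              # soft and hard clipping


def unclipped_five_prime(pos, cigar, reverse):
    """Where the 5'-most base would sit if every base had been aligned.

    pos is the 1-based leftmost aligned position (the SAM POS field).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: 5' end is on the left, so undo leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return pos - lead
    # Reverse strand: 5' end is on the right, so walk across the reference
    # span (including deletions and skips) and add the trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return pos + ref_len - 1 + trail


def quality_score(quals, min_q=15):
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in quals if q >= min_q)


def mark_duplicates(pairs):
    """Return the names of pairs that would be flagged as duplicates.

    Each pair is a dict (hypothetical layout) with:
      name  - read name
      ends  - two tuples of (chrom, pos, cigar, reverse)
      quals - base qualities of both reads, as a list of ints
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = tuple(sorted(
            (chrom, unclipped_five_prime(pos, cigar, rev), rev)
            for chrom, pos, cigar, rev in pair["ends"]
        ))
        groups[key].append(pair)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: quality_score(p["quals"]))
        duplicates.update(p["name"] for p in members if p is not best)
    return duplicates
```

Note that the grouping key needs both ends of the pair, which is why a pair with only one end present in the file cannot be matched this way.
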
> 
> Am I right in thinking it will work correctly on a bam that has been split by chromosome? Or will something not work quite right if one read of a pair is missing because it mapped to a different chromosome?
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.