From: Tim F. <tfe...@br...> - 2010-04-26 15:08:48
Hey Sendu,

I'm not really sure I understand the question. The way Mark Duplicates
works, it actually needs to see each read in a pair to determine
duplicate status, because it needs access to the CIGAR for both reads.
The way this is implemented, it doesn't process the read pair until it's
seen both ends, and it knows whether a read is paired based on the
flags.

-t

On Apr 26, 2010, at 10:31 AM, Sendu Bala wrote:

> On 26/04/2010 14:24, Tim Fennell wrote:
>> It will not work optimally - it will only detect duplicates for pairs
>> where it has access to both ends. So if you split your files by
>> chromosome then you'll essentially lose inter-chromosomal duplicate
>> marking.
>
> So even if there's a whole bunch of forward reads stacked up on one
> chr, they'll be ignored if their reverse reads aren't in the bam file?
>
> Is that something that can't be changed because these somehow might
> not be 'real' duplicates without knowing the pair information (why
> not?), or is it on the to-do list?
>
>
>> On Apr 26, 2010, at 5:35 AM, Sendu Bala wrote:
>>
>>> On 21/04/2010 17:45, Tim Fennell wrote:
>>>> Hi Feiyu,
>>>>
>>>> The algorithm probably does need describing somewhere in detail,
>>>> but I don't believe I have anything handy. Essentially what it
>>>> does (for pairs; single-end data is also handled) is to find the
>>>> 5' coordinates and mapping orientations of each read pair. When
>>>> doing this it takes into account all clipping that has taken
>>>> place as well as any gaps or jumps in the alignment. You can
>>>> thus think of it as determining "if all the bases from the read
>>>> were aligned, where would the 5' most base have been aligned".
>>>> It then matches all read pairs that have identical 5' coordinates
>>>> and orientations and marks as duplicates all but the "best" pair.
>>>> "Best" is defined as the read pair having the highest sum of base
>>>> qualities at bases with Q >= 15.
>>>
>>> Am I right in thinking it will work correctly on a bam that has
>>> been split by chromosome? Or will something not work quite right if
>>> one read of a pair is missing because it mapped to a different
>>> chromosome?
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome
>>> Research Limited, a charity registered in England with number
>>> 1021457 and a company registered in England with number 2742969,
>>> whose registered office is 215 Euston Road, London, NW1 2BE.
>>
>
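
For readers following the thread, the algorithm as described in the quoted message above can be sketched roughly as below. This is only an illustration of that description, not Picard's actual code: the `Read` class, the function names, and the in-memory `pairs` dictionary are all hypothetical, and counting hard clips (H) along with soft clips (S) is an assumption based on the phrase "all clipping".

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # CIGAR operations that advance along the reference
CLIPS = set("SH")              # assumption: both soft and hard clips are counted


@dataclass
class Read:                    # hypothetical minimal read record
    ref: str                   # reference (chromosome) name
    pos: int                   # 1-based leftmost aligned position
    cigar: str
    reverse: bool              # True if mapped to the reverse strand
    quals: List[int]           # per-base qualities


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def unclipped_five_prime(read: Read) -> int:
    """Where the 5'-most base would sit if every base had been aligned."""
    ops = parse_cigar(read.cigar)
    if not read.reverse:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return read.pos - lead
    # Reverse strand: the 5' end is on the right, so take the alignment end
    # (gaps and jumps included via D/N) and add back trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return read.pos + ref_len - 1 + trail


def score(read: Read, min_q: int = 15) -> int:
    """Sum of base qualities, counting only bases with Q >= min_q."""
    return sum(q for q in read.quals if q >= min_q)


def pair_key(r1: Read, r2: Read) -> tuple:
    # Both ends are required: 5' coordinate and orientation of each read.
    ends = [(r.ref, unclipped_five_prime(r), r.reverse) for r in (r1, r2)]
    return tuple(sorted(ends))


def mark_duplicates(pairs: Dict[str, Tuple[Read, Read]]) -> Set[str]:
    """Return the names of pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for name, (r1, r2) in pairs.items():
        groups[pair_key(r1, r2)].append(name)
    dups: Set[str] = set()
    for names in groups.values():
        best = max(names, key=lambda n: score(pairs[n][0]) + score(pairs[n][1]))
        dups.update(n for n in names if n != best)
    return dups


pairs = {
    "A": (Read("chr1", 100, "50M", False, [30] * 50),
          Read("chr1", 300, "50M", True, [30] * 50)),
    "B": (Read("chr1", 105, "5S45M", False, [20] * 50),
          Read("chr1", 300, "45M5S", True, [20] * 50)),
    "C": (Read("chr1", 101, "50M", False, [30] * 50),
          Read("chr1", 300, "50M", True, [30] * 50)),
}
duplicates = mark_duplicates(pairs)
```

In this made-up example, pairs A and B share the same unclipped 5' coordinates and orientations once B's soft clips are undone, so the lower-scoring pair B is the one flagged, while C starts one base later and is left alone. Because the key is built from both ends, a pair whose mate is absent from the file (for example after splitting by chromosome) cannot be keyed this way, which matches the limitation discussed in the thread.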