From: Tim F. <tfe...@br...> - 2010-04-26 15:20:05
Well, treating paired reads as single-end reads will generally result in
significant over-marking of duplicates, as single-end reads are far more
likely to appear as duplicates than PE reads.

-t

On Apr 26, 2010, at 11:14 AM, Sendu Bala wrote:

> On 26/04/2010 16:08, Tim Fennell wrote:
>> Hey Sendu,
>>
>> I'm not really sure I understand the question. The way Mark Duplicates
>> works, it actually needs to see each read in a pair to determine
>> duplicate status, because it needs access to the CIGAR for both reads.
>> The way this is implemented, it doesn't process the read pair until it's
>> seen both ends - and it knows whether a read is paired or not based on
>> the flags.
>
> When it doesn't find the other read in the pair, can't it just treat
> the read as unpaired and do what it would normally do for single-ended
> duplicate marking? Or could the missing CIGAR information result in
> some cases where a read is marked as a duplicate incorrectly?
>
>> On Apr 26, 2010, at 10:31 AM, Sendu Bala wrote:
>>
>>> On 26/04/2010 14:24, Tim Fennell wrote:
>>>> It will not work optimally - it will only detect duplicates for pairs
>>>> where it has access to both ends. So if you split your files by
>>>> chromosome then you'll essentially lose inter-chromosomal duplicate
>>>> marking.
>>>
>>> So even if there's a whole bunch of forward reads stacked up on one chr,
>>> they'll be ignored if their reverse reads aren't in the bam file?
>>>
>>> Is that something that can't be changed because these somehow might not
>>> be 'real' duplicates without knowing the pair information (why not?), or
>>> is it on the to-do list?
>>>
>>>> On Apr 26, 2010, at 5:35 AM, Sendu Bala wrote:
>>>>
>>>>> On 21/04/2010 17:45, Tim Fennell wrote:
>>>>>> Hi Feiyu,
>>>>>>
>>>>>> The algorithm probably does need describing somewhere in detail,
>>>>>> but I don't believe I have anything handy. Essentially what it
>>>>>> does (for pairs; single-end data is also handled) is to find the
>>>>>> 5' coordinates and mapping orientations of each read pair. When
>>>>>> doing this it takes into account all clipping that has taken
>>>>>> place as well as any gaps or jumps in the alignment. You can
>>>>>> thus think of it as determining "if all the bases from the read
>>>>>> were aligned, where would the 5' most base have been aligned".
>>>>>> It then matches all read pairs that have identical 5' coordinates
>>>>>> and orientations and marks as duplicates all but the "best" pair.
>>>>>> "Best" is defined as the read pair having the highest sum of base
>>>>>> qualities at bases with Q >= 15.
>>>>>
>>>>> Am I right in thinking it will work correctly on a bam that has
>>>>> been split by chromosome? Or will something not work quite right if
>>>>> one read of a pair is missing because it mapped to a different
>>>>> chromosome?
>>>>>
>>>>> --
>>>>> The Wellcome Trust Sanger Institute is operated by Genome
>>>>> Research Limited, a charity registered in England with number
>>>>> 1021457 and a company registered in England with number 2742969,
>>>>> whose registered office is 215 Euston Road, London, NW1 2BE.
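
For readers who want something concrete, the pair-matching procedure described
in the thread above could be sketched roughly as follows. This is only an
illustration of the description given here, not the actual Mark Duplicates
code: the `Read` record and the functions `unclipped_five_prime`, `pair_key`,
`pair_score` and `mark_duplicate_pairs` are hypothetical names, positions are
assumed to be 1-based as in SAM, and ties between equally good pairs are
broken arbitrarily (the first pair seen wins).

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Set, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference
CLIPS = set("SH")              # soft and hard clipping


@dataclass
class Read:                    # hypothetical minimal record, not a real API
    chrom: str
    pos: int                   # 1-based leftmost aligned position (SAM POS)
    reverse: bool              # True if mapped to the reverse strand
    cigar: str
    quals: List[int]           # Phred base qualities


def unclipped_five_prime(read: Read) -> int:
    """Where the 5'-most base would sit if the clipped bases were aligned."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(read.cigar)]
    if not read.reverse:
        # Forward strand: the 5' end is on the left, so step back over
        # any leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            lead += n
        return read.pos - lead
    # Reverse strand: the 5' end is on the right. Walk the reference span
    # (including deletions and skipped regions), then add trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return read.pos + ref_len - 1 + trail


def pair_key(pair: Tuple[Read, Read]):
    """5' coordinates and orientations of both ends, order-independent."""
    return tuple(sorted((r.chrom, unclipped_five_prime(r), r.reverse)
                        for r in pair))


def pair_score(pair: Tuple[Read, Read]) -> int:
    """Sum of base qualities over bases with Q >= 15, across both reads."""
    return sum(q for r in pair for q in r.quals if q >= 15)


def mark_duplicate_pairs(pairs: List[Tuple[Read, Read]]) -> Set[int]:
    """Return indices of pairs that would be flagged as duplicates."""
    groups = defaultdict(list)
    for i, pair in enumerate(pairs):
        groups[pair_key(pair)].append(i)
    duplicates = set()
    for idxs in groups.values():
        best = max(idxs, key=lambda i: pair_score(pairs[i]))
        duplicates.update(i for i in idxs if i != best)
    return duplicates
```

Note that the key is built from both ends of the pair, which is why a pair
whose mate is missing from the file (for example after splitting a BAM by
chromosome) cannot be matched this way.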