From: Sendu B. <sb...@sa...> - 2010-04-26 15:25:26
On 26/04/2010 16:19, Tim Fennell wrote:
> Well, treating paired reads as single-end reads will generally result in
> a significant over-marking of duplicates as single end reads are far
> more likely to appear as duplicates than PE reads.

I see. Thank you for all the information.

Other than inter-chromosomal duplicate marking, does picard's algorithm
do anything that samtools rmdup does not? So if I'm running on chr bams,
can I just use samtools instead for the speed increase and lower memory
benefits?

> On Apr 26, 2010, at 11:14 AM, Sendu Bala wrote:
>
>> On 26/04/2010 16:08, Tim Fennell wrote:
>>> Hey Sendu,
>>>
>>> I'm not really sure I understand the question. The way Mark Duplicates
>>> works it actually needs to see each read in a pair to determine
>>> duplicate status, because it needs access to the CIGAR for both reads.
>>> The way this is implemented it doesn't process the read pair until it's
>>> seen both ends - and it knows a read is paired or not based on the
>>> flags.
>>
>> When it doesn't find the other read in the pair, can't it just treat
>> the read as unpaired and do what it would normally do for single-ended
>> duplicate marking? Or could the missing CIGAR information result in
>> some cases where a read is marked as a duplicate incorrectly?
>>
>>
>>> On Apr 26, 2010, at 10:31 AM, Sendu Bala wrote:
>>>
>>>> On 26/04/2010 14:24, Tim Fennell wrote:
>>>>> It will not work optimally - it will only detect duplicates for pairs
>>>>> where it has access to both ends. So if you split your files by
>>>>> chromosome then you'll essentially lose inter-chromosomal duplicate
>>>>> marking.
>>>>
>>>> So even if there's a whole bunch of forward reads stacked up on one
>>>> chr, they'll be ignored if their reverse reads aren't in the bam file?
>>>>
>>>> Is that something that can't be changed because these somehow might not
>>>> be 'real' duplicates without knowing the pair information (why not?),
>>>> or is it on the to-do list?
>>>>
>>>>
>>>>> On Apr 26, 2010, at 5:35 AM, Sendu Bala wrote:
>>>>>
>>>>>> On 21/04/2010 17:45, Tim Fennell wrote:
>>>>>>> Hi Feiyu,
>>>>>>>
>>>>>>> The algorithm probably does need describing somewhere in detail,
>>>>>>> but I don't believe I have anything handy. Essentially what it
>>>>>>> does (for pairs; single-end data is also handled) is to find the
>>>>>>> 5' coordinates and mapping orientations of each read pair. When
>>>>>>> doing this it takes into account all clipping that has taken
>>>>>>> place as well as any gaps or jumps in the alignment. You can
>>>>>>> thus think of it as determining "if all the bases from the read
>>>>>>> were aligned, where would the 5' most base have been aligned".
>>>>>>> It then matches all read pairs that have identical 5' coordinates
>>>>>>> and orientations and marks as duplicates all but the "best" pair.
>>>>>>> "Best" is defined as the read pair having the highest sum of base
>>>>>>> qualities at bases with Q >= 15.
>>>>>>
>>>>>> Am I right in thinking it will work correctly on a bam that has
>>>>>> been split by chromosome? Or will something not work quite right if
>>>>>> one read of a pair is missing because it mapped to a different
>>>>>> chromosome?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> The Wellcome Trust Sanger Institute is operated by Genome
>>>>>> Research Limited, a charity registered in England with number
>>>>>> 1021457 and a company registered in England with number 2742969,
>>>>>> whose registered office is 215 Euston Road, London, NW1 2BE.
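For readers who want something concrete, the pair-matching described in the quoted message from Tim Fennell can be sketched roughly as follows. This is only an illustration of that description, not Picard's actual code: the `Read` class, `five_prime`, `score` and `mark_duplicates` are hypothetical names, positions are assumed to be 1-based as in SAM, and every pair is assumed to have both ends present.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Set, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference
CLIPS = set("SH")              # soft and hard clipping


@dataclass
class Read:                    # hypothetical minimal record, not a real API
    name: str
    ref: str
    pos: int                   # 1-based leftmost aligned position
    cigar: str
    reverse: bool
    quals: List[int]


def five_prime(read: Read) -> int:
    """Where the 5'-most base would sit if clipped bases were aligned too."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(read.cigar)]
    if not read.reverse:
        lead = 0
        for n, op in ops:              # clipping before the first aligned base
            if op not in CLIPS:
                break
            lead += n
        return read.pos - lead
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):        # clipping after the last aligned base
        if op not in CLIPS:
            break
        trail += n
    return read.pos + ref_len - 1 + trail


def score(read: Read) -> int:
    """Sum of base qualities, counting only bases with Q >= 15."""
    return sum(q for q in read.quals if q >= 15)


def mark_duplicates(pairs: List[Tuple[Read, Read]]) -> Set[str]:
    """Return the names of pairs that are duplicates of a better pair."""
    groups = defaultdict(list)
    for r1, r2 in pairs:
        ends = sorted((r.ref, five_prime(r), r.reverse) for r in (r1, r2))
        groups[tuple(ends)].append((r1, r2))
    dups = set()
    for members in groups.values():
        best = max(members, key=lambda p: score(p[0]) + score(p[1]))
        dups.update(p[0].name for p in members if p is not best)
    return dups


pairs = [
    (Read("A", "chr1", 100, "50M", False, [30] * 50),
     Read("A", "chr1", 300, "50M", True, [30] * 50)),
    (Read("B", "chr1", 105, "5S45M", False, [20] * 50),   # clipped, same 5' ends
     Read("B", "chr1", 300, "45M5S", True, [20] * 50)),
    (Read("C", "chr1", 101, "50M", False, [30] * 50),     # different 5' end
     Read("C", "chr1", 300, "50M", True, [30] * 50)),
]
print(mark_duplicates(pairs))  # → {'B'}
```

Because the grouping key is built from both ends, a pair whose mate is absent from the file (as happens in a per-chromosome bam) cannot be keyed at all in this sketch, which matches the limitation discussed in the thread.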