[Samtools-help] Picard markduplicates on single-end data with variable read-length

Dear All, 

I have done exome capture on a DNA source with an average fragment length of ~170 bp. Since it is sequenced as paired-end data (2 x 100 bp), most of the pairs will contain overlapping sequence. I have been using a tool called SeqPrep to merge the overlapping pairs in the fastq files.
It is available here: https://github.com/jstjohn/SeqPrep
This creates single-end data from most of my reads.

The problem is that doing:

fastq -> seqprep, single end (for overlapping reads) + paired end (for non-overlapping reads) -> map using BWA -> markduplicates

will give ~35% lower coverage relative to doing

fastq -> map using BWA -> markduplicates -> back to fastq -> seqprep, single end (for overlapping reads) + paired end (for non-overlapping reads) -> map using BWA

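My guess is that merged single-end reads that start at the same position but have different lengths (i.e. come from different fragments) are being marked as duplicates. This could be fixed if MarkDuplicates looked at the length of each single-end read and used it when deciding whether single-end reads are duplicates. Is this something that is about to be implemented in Picard, or that can easily be changed in the code? Or have I missed a solution that is already in place?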
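
To illustrate what I mean, here is a minimal Python sketch. It is not Picard's code; it assumes that single-end reads are grouped by chromosome, 5' position and strand only, and that the read with the highest base quality sum is kept. The `Read` type, the `mark_duplicates` function and the `use_length` switch are hypothetical names made up for this example, and the reads are invented toy data.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int             # 5' position of the read (simplified: forward strand only)
    strand: str
    length: int            # length of the merged single-end read
    base_quality_sum: int  # used to pick which read in a group to keep


def mark_duplicates(reads, use_length=False):
    """Return the names of reads that would be flagged as duplicates.

    With use_length=False, reads are grouped by (chrom, start, strand) only.
    With use_length=True, the read length is added to the grouping key, so
    merged reads of different lengths are never duplicates of each other.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if use_length:
            key += (read.length,)
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest base quality sum, flag the rest.
        best = max(group, key=lambda r: r.base_quality_sum)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


# Toy data: three merged reads starting at the same position.
reads = [
    Read("a", "chr1", 1000, "+", 150, 4500),
    Read("b", "chr1", 1000, "+", 170, 5100),
    Read("c", "chr1", 1000, "+", 170, 5000),
]

position_only = mark_duplicates(reads)
length_aware = mark_duplicates(reads, use_length=True)
print(sorted(position_only))  # ['a', 'c']
print(sorted(length_aware))   # ['c']
```

In this toy case the position-only grouping flags read "a" even though its length shows that it comes from a different fragment than "b" and "c", whereas the length-aware grouping only flags "c".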

Any comments or suggestions are greatly appreciated!

best regards, 

// Johan Lindberg

*****************************************
Johan Lindberg, PhD 
Department of Medical Epidemiology and Biostatistics
Nobels Väg 12A, PO.Box 281
17177 Solna, Sweden
*****************************************