I'm using Bowtie 2.0.0-beta7 on 64-bit Ubuntu Linux. I have single-end Illumina reads with an 11 bp adapter on the 5' end of each read, so I passed "--trim5 11" to Bowtie. In the output SAM file, the full (untrimmed) read sequence is reported. The 12th base (the first base of real data) is aligned to the correct location in the genome, but the SAM record shows the 11 adapter bases aligned to the 11 genome bases preceding that location, even though they don't match.
This would be OK if the CIGAR string reported those 11 leading bases as hard- or soft-clipped (I'm not sure of the difference), but it doesn't: they are reported as matching ("M") the genome at those positions! To me, this is extremely surprising behavior. It seems like a bug, but if it is intentional, I'd be very interested in the rationale. An appropriate fix would be either to exclude the trimmed bases from the output altogether or to modify the CIGAR string so that the trimmed bases are marked as clipped.
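To make the second suggestion concrete, here is a rough sketch of what the corrected record could look like. In SAM, soft-clipped bases ("S") stay in the SEQ field but are not aligned, while hard-clipped bases ("H") are removed from SEQ; in both cases POS refers to the first aligned base. The function name soft_clip_5prime and the example values are made up for illustration, this is not Bowtie code, and it assumes a forward-strand alignment with no clipping already present in the CIGAR string.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume read bases
REF_OPS = set("MDN=X")    # operations that consume reference bases


def soft_clip_5prime(pos, cigar, n):
    """Hypothetical helper: mark the first n read bases as soft-clipped.

    Returns (new_pos, new_cigar). Assumes a forward-strand record whose
    CIGAR has no existing clipping; POS is advanced past every reference
    base that the clipped read bases used to cover.
    """
    if n == 0:
        return pos, cigar
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    remaining = n   # read bases still to be clipped
    new_pos = pos
    rest = []       # operations that survive after the clipped prefix
    for length, op in ops:
        if remaining > 0 and op in QUERY_OPS:
            take = min(length, remaining)
            remaining -= take
            if op in REF_OPS:
                new_pos += take
            if length > take:
                rest.append((length - take, op))
        elif remaining > 0:
            # Deletion or skip inside the clipped prefix: only moves POS.
            if op in REF_OPS:
                new_pos += length
        else:
            rest.append((length, op))
    if remaining > 0:
        raise ValueError("CIGAR covers fewer read bases than requested")
    return new_pos, f"{n}S" + "".join(f"{l}{o}" for l, o in rest)


# A 50-base read reported as 50M starting at position 100,
# with the first 11 bases being adapter:
print(soft_clip_5prime(100, "50M", 11))
```

With these made-up numbers the record would go from POS 100 / CIGAR 50M to POS 111 / CIGAR 11S39M, which keeps the full read in SEQ without claiming that the adapter bases match the genome. For a read aligned to the reverse strand, the 5' end of the read sits at the right-hand end of the SAM record, so the clip would have to go at the end of the CIGAR string instead.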