From: Goldin, I. <gol...@mw...> - 2011-01-25 17:06:26
|
I definitely support the idea of building some similar facility into SamRecord or as a related class. For some work I'm doing I've also had to implement getInsertedBases() and getDeletedBases() methods. My approach is a little different: I only loop over the CIGAR string once, and I keep separate counts of insertions, deletions, and substitutions. Just as an example, my code is enclosed. Best, Ilya ________________________________________ From: Robert Egan [rs...@lb...] Sent: Thursday, January 20, 2011 4:53 PM To: sam...@li... Subject: [Samtools-devel] Correct for indels in GC Bias Metrics quality reporting Hello, I'm working with some Pacific Biosciences data where the error mode is dominated by inserts and deletions. Using the picard CollectGcBiasMetrics module, I noticed that the base quality scores were significantly over estimated. I looked at the formula which Picard uses to calculate this and noticed it only counted #mismatches / #bases, so I fixed it in the patch below to also count insertions and deletions as errors, and now the reported quality is inline with the actual data. Please consider adding this small patch to the next release of picard. Sincerely, Rob Egan Index: src/java/net/sf/samtools/SAMRecord.java =================================================================== --- src/java/net/sf/samtools/SAMRecord.java (revision 703) +++ src/java/net/sf/samtools/SAMRecord.java (working copy) @@ -1162,6 +1162,34 @@ } /** + * Returns the total length of a given CigarOperator over the CIGAR string + */ + public long getTotalCigarElementLength(CigarOperator op) { + long bases = 0; + final Cigar cigar = getCigar(); + for (final CigarElement e : cigar.getCigarElements()) { + if (e.getOperator() == op) { + bases += e.getLength(); + } + } + return bases; + } + + /** + * Returns the number of inserted bases + */ + public long getInsertedBases() { + return getTotalCigarElementLength(CigarOperator.I); + } + + /** + * Returns the number of deleted bases + */ + public long getDeletedBases() { + return getTotalCigarElementLength(CigarOperator.D); + } + + /** * Returns blocks of the read sequence that have been aligned directly to the * reference sequence. Note that clipped portions of the read and inserted and * deleted bases (vs. the reference) are not represented in the alignment blocks. Index: src/java/net/sf/picard/analysis/CollectGcBiasMetrics.java =================================================================== --- src/java/net/sf/picard/analysis/CollectGcBiasMetrics.java (revision 703) +++ src/java/net/sf/picard/analysis/CollectGcBiasMetrics.java (working copy) @@ -149,7 +149,7 @@ if (windowGc >= 0) { ++readsByGc[windowGc]; basesByGc[windowGc] += rec.getReadLength(); - errorsByGc[windowGc] += SequenceUtil.countMismatches(rec, refBases); + errorsByGc[windowGc] += SequenceUtil.countMismatches(rec, refBases) + rec.getInsertedBases() + rec.getDeletedBases(); } } } ------------------------------------------------------------------------------ Protect Your Site and Customers from Malware Attacks Learn about various malware tactics and how to avoid them. Understand malware threats, the impact they can have on your business, and how you can protect your company and customers by using code signing. http://p.sf.net/sfu/oracle-sfdevnl _______________________________________________ Samtools-devel mailing list Sam...@li... https://lists.sourceforge.net/lists/listinfo/samtools-devel |