From: Heng Li <lh...@li...> - 2011-11-28 22:54:30
|
A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation. The backward operation is designed to describe such an alignment: REF:: GCATACGATCGACTAGTCACGT READ: --ATACGATCGA---------- READ: ---------CGACTAGTCAC-- i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags. Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line. The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented. Do you think it is worth having a backward operation? regards, Heng -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |