From: Tim F. <tfe...@br...> - 2011-11-29 14:28:35
|
One thought is that if B is really intended only to provide the ability to store CG and similar data where the read "skips" back a few bases now and again vs. the reference one thing that would make this much easier on those parsing SAM/BAM would be to limit the use of the B operator so that it cannot skip backwards past the beginning of the read. So something like 20M5B20M would be valid, but 50M5000B20M would not be. While tools would still need to explicitly support the B operator it would allow any existing assumptions about not needing to move backwards in the reference from the read start point to still hold. -t On Nov 28, 2011, at 5:18 PM, Heng Li wrote: > A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation. > > The backward operation is designed to describe such an alignment: > > REF:: GCATACGATCGACTAGTCACGT > READ: --ATACGATCGA---------- > READ: ---------CGACTAGTCAC-- > > i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags. > > Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line. > > The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented. > > Do you think it is worth having a backward operation? > > regards, > > Heng > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ------------------------------------------------------------------------------ > All the data continuously generated in your IT infrastructure > contains a definitive record of customers, application performance, > security threats, fraudulent activity, and more. Splunk takes this > data and makes sense of it. IT sense. And common sense. > http://p.sf.net/sfu/splunk-novd2d > _______________________________________________ > Samtools-devel mailing list > Sam...@li... > https://lists.sourceforge.net/lists/listinfo/samtools-devel |