Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi Ewan,

Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?

> REF:: GCATACGATCGACTAGTCACGT  
> READ: --ATACGATCGA----------
> READ: ---------CGACTAGTCAC--

Thanks,

Heng

On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:

> 
> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
> 
>> 
>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>> which captures nearly all the data, can be efficiently compressed and
>> will have little disruption on downstream tools though tools would have
>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>> scheme)
>> 
> 
> 
> Just to add to this (only in writing this email did I realise it): if one has
> some additional tag to store the second quality values in the B() regions one
> can achieve complete representation of CG data in this scheme. 
> 
> 
>> 
>> Complete genomics has two levels of reads:
>> 
>>  The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>> 
>>  The "read fragments" inside of the paired ends, being a structured 
>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>> below) and a 5 base pair "gap".
>> 
>> 
>> The scheme below looks at an aligment at the read-fragment level.
>> 
>> 
>> An alternative scheme is to look at the alignment of the paired ends,
>> with the sequence being the called CG reads with "Ns" in the gap positions
>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>> This is, in effect, a cigar string, and as this will be CG specific, one could
>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>> on per read basis (I think better to have an attribute at the readset level).
>> 
>> 
>> 
>> This "mask" is highly structured, and so even in sam representation probably
>> best represented as a huffman encoded dictionary - certainly in more compressed
>> forms, this would be a huffman encode variable length, and in playing around
>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>> pretty compressible, being ~3.5 bits per mask.
>> 
>> 
>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>> small indels to be represented as a traditional "read level" CIGAR string, 
>> (standard sam) to be mapped post the CG level mask.
>> 
>> 
>> 
>> Just to look at this scheme in terms of legacy code vs non legacy code:
>> 
>>   For legacy code/non-CG aware code:
>> 
>>       There is a long read with "Ns".
>>       The Ns imply no knowledge of the bases - so SNP callers will not make
>> a terrible mistake here
>> 
>> 
>>   For CG aware code:
>> 
>>       You read in the mask, look up the "CG level" string in the dictionary
>> 
>>       Apply the CG level string, including "B" operations
>> 
>>       You can recreate the actual fragments now internally.
>> 
>> 
>> The only small wrinkle is that inside the B() operations there are two quality
>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>> It is also the case that although there are two "chemistry reads" at this point,
>> there are other error processes (eg, template construction) in which the two
>> bases are not independent. So - for any machinery which is sophisticated enough
>> to want to think about 2 quality values will almost certainly do something
>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>> the errors of these B regions separately from the quality codes elsewhere)
>> 
>> 
>> 
>> 
>> 
>> 
>> Does this make sense? Comments?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>> 
>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>> 
>>> The backward operation is designed to describe such an alignment:
>>> 
>>> REF:: GCATACGATCGACTAGTCACGT  
>>> READ: --ATACGATCGA----------
>>> READ: ---------CGACTAGTCAC--
>>> 
>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>> 
>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>> 
>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>> 
>>> Do you think it is worth having a backward operation?
>>> 
>>> regards,
>>> 
>>> Heng
>>> 
>>> -- 
>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>> Limited, a charity registered in England with number 1021457 and a 
>>> company registered in England with number 2742969, whose registered 
>>> office is 215 Euston Road, London, NW1 2BE. 
>>> 
>>> ------------------------------------------------------------------------------
>>> All the data continuously generated in your IT infrastructure 
>>> contains a definitive record of customers, application performance, 
>>> security threats, fraudulent activity, and more. Splunk takes this 
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.