Thread: [Samtools-devel] Propose to add the backward CIGAR operation 'B' | SAM tools

samtools-devel

[Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-11-28 22:36:46

A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.

The backward operation is designed to describe such an alignment:

REF:: GCATACGATCGACTAGTCACGT
READ: --ATACGATCGA----------
READ: ---------CGACTAGTCAC--

i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.

Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.

The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.

Do you think it is worth having a backward operation?

regards,

Heng

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

[Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@li...> - 2011-11-28 22:54:30

A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.

The backward operation is designed to describe such an alignment:

REF:: GCATACGATCGACTAGTCACGT
READ: --ATACGATCGA----------
READ: ---------CGACTAGTCAC--

i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.

Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.

The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.

Do you think it is worth having a backward operation?

regards,

Heng

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Fred L. <fl...@sd...> - 2011-11-29 02:21:19

Currently SAM assumes one segment per alignment record, and my current 
preference is to keep it that way.  I see no easy way to loosen this 
restriction and still have the SAM spec make sense.  So I prefer to keep 
SAM alignment records simple and compatible, and to represent complex 
alignments by grouping together sets of simple alignments.  But if this 
takes up too much space for archival purposes, I offer another 
suggestion below.

By the way, I'm a little bit confused about what the proposed "B" 
operator actually means.  Is it just a backwards reference skip in the 
same segment, or does it actually imply that a new segment is starting?  
If it implies a new segment, then what operator will be used to denote 
this if there is no overlap between the two segments?  E.g.,

    REF:: GCATACGATCGACTAGTCACGT
    READ: --ATACGATCGA----------
    READ: ------------CTAGTCACGT

I understand the need to keep file sizes as small as possible for 
archival purposes, but I'm not sure that changing the SAM spec is the 
way to go.  For a while now I've been thinking about creating a "virtual 
SAM format" that can be easily converted to SAM, but takes up less space 
on disk.  For example, perhaps the SEQ field could be removed entirely 
from the virtual SAM file, but rebuilt dynamically when the file is 
converted to proper SAM format.

So I think that my best suggestion for keeping archive files small is to 
create a SAM "extension" format that can be easily converted to proper 
SAM format.  Instead of building support for "B" and other extensions 
into all the components of samtools (and all other APIs), I suggest that 
a new samtools command be created, such as "samtools convert", that will 
convert SAM extension/virtual formats into proper SAM format.  Then 
people can be as creative as they like in their efforts to save disk space.

FL

> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>
> The backward operation is designed to describe such an alignment:
>
> REF:: GCATACGATCGACTAGTCACGT
> READ: --ATACGATCGA----------
> READ: ---------CGACTAGTCAC--
>
> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>
> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>
> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>
> Do you think it is worth having a backward operation?
>
> regards,
>
> Heng
>

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Ewan B. <bi...@eb...> - 2011-11-29 08:48:22

I'd just to express an alternative way of storing CG data in sam/bam/cram
which captures nearly all the data, can be efficiently compressed and
will have little disruption on downstream tools though tools would have
to become "CG aware" to fully leverage the data (whatever happens, tools will 
have to become CG aware to really leverage the data, so there is no I think 0 cost 
scheme)

Complete genomics has two levels of reads:

   The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)

   The "read fragments" inside of the paired ends, being a structured 
set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
below) and a 5 base pair "gap".

The scheme below looks at an aligment at the read-fragment level.

An alternative scheme is to look at the alignment of the paired ends,
with the sequence being the called CG reads with "Ns" in the gap positions
(if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
"CG mask" being an encoding of the characteristic gap structure in this CG read.
This is, in effect, a cigar string, and as this will be CG specific, one could
at this point allow a "B" operation to indicate overlap. Either the whole readset
could be declared as "CG" or the presence of this "CG mask" would indicate it
on per read basis (I think better to have an attribute at the readset level).

This "mask" is highly structured, and so even in sam representation probably
best represented as a huffman encoded dictionary - certainly in more compressed
forms, this would be a huffman encode variable length, and in playing around
with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
pretty compressible, being ~3.5 bits per mask.

In many  ways this "CG mask" to represent the CG-level indels allows also for
small indels to be represented as a traditional "read level" CIGAR string, 
(standard sam) to be mapped post the CG level mask.

Just to look at this scheme in terms of legacy code vs non legacy code:

    For legacy code/non-CG aware code:

        There is a long read with "Ns".
        The Ns imply no knowledge of the bases - so SNP callers will not make
a terrible mistake here

    For CG aware code:

        You read in the mask, look up the "CG level" string in the dictionary

        Apply the CG level string, including "B" operations

        You can recreate the actual fragments now internally.

The only small wrinkle is that inside the B() operations there are two quality
values, not 1. As I think B operations are rare, it is actually fine to perhaps
have a second, rare tag that encode this (might be a bit fiddly but doable). 
It is also the case that although there are two "chemistry reads" at this point,
there are other error processes (eg, template construction) in which the two
bases are not independent. So - for any machinery which is sophisticated enough
to want to think about 2 quality values will almost certainly do something
sophisticated about (say) the recalibration (I would be thinking about calibrating
the errors of these B regions separately from the quality codes elsewhere)

Does this make sense? Comments?

On 28 Nov 2011, at 22:18, Heng Li wrote:

> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
> 
> The backward operation is designed to describe such an alignment:
> 
> REF:: GCATACGATCGACTAGTCACGT  
> READ: --ATACGATCGA----------
> READ: ---------CGACTAGTCAC--
> 
> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
> 
> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
> 
> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
> 
> Do you think it is worth having a backward operation?
> 
> regards,
> 
> Heng
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Ewan B. <bi...@eb...> - 2011-11-29 08:52:31

On 29 Nov 2011, at 08:48, Ewan Birney wrote:

> 
> I'd just to express an alternative way of storing CG data in sam/bam/cram
> which captures nearly all the data, can be efficiently compressed and
> will have little disruption on downstream tools though tools would have
> to become "CG aware" to fully leverage the data (whatever happens, tools will 
> have to become CG aware to really leverage the data, so there is no I think 0 cost 
> scheme)
> 


Just to add to this (only in writing this email did I realise it): if one has
some additional tag to store the second quality values in the B() regions one
can achieve complete representation of CG data in this scheme. 


> 
> Complete genomics has two levels of reads:
> 
>   The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
> 
>   The "read fragments" inside of the paired ends, being a structured 
> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
> below) and a 5 base pair "gap".
> 
> 
> The scheme below looks at an aligment at the read-fragment level.
> 
> 
> An alternative scheme is to look at the alignment of the paired ends,
> with the sequence being the called CG reads with "Ns" in the gap positions
> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
> "CG mask" being an encoding of the characteristic gap structure in this CG read.
> This is, in effect, a cigar string, and as this will be CG specific, one could
> at this point allow a "B" operation to indicate overlap. Either the whole readset
> could be declared as "CG" or the presence of this "CG mask" would indicate it
> on per read basis (I think better to have an attribute at the readset level).
> 
> 
> 
> This "mask" is highly structured, and so even in sam representation probably
> best represented as a huffman encoded dictionary - certainly in more compressed
> forms, this would be a huffman encode variable length, and in playing around
> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
> pretty compressible, being ~3.5 bits per mask.
> 
> 
> In many  ways this "CG mask" to represent the CG-level indels allows also for
> small indels to be represented as a traditional "read level" CIGAR string, 
> (standard sam) to be mapped post the CG level mask.
> 
> 
> 
> Just to look at this scheme in terms of legacy code vs non legacy code:
> 
>    For legacy code/non-CG aware code:
> 
>        There is a long read with "Ns".
>        The Ns imply no knowledge of the bases - so SNP callers will not make
> a terrible mistake here
> 
> 
>    For CG aware code:
> 
>        You read in the mask, look up the "CG level" string in the dictionary
> 
>        Apply the CG level string, including "B" operations
> 
>        You can recreate the actual fragments now internally.
> 
> 
> The only small wrinkle is that inside the B() operations there are two quality
> values, not 1. As I think B operations are rare, it is actually fine to perhaps
> have a second, rare tag that encode this (might be a bit fiddly but doable). 
> It is also the case that although there are two "chemistry reads" at this point,
> there are other error processes (eg, template construction) in which the two
> bases are not independent. So - for any machinery which is sophisticated enough
> to want to think about 2 quality values will almost certainly do something
> sophisticated about (say) the recalibration (I would be thinking about calibrating
> the errors of these B regions separately from the quality codes elsewhere)
> 
> 
> 
> 
> 
> 
> Does this make sense? Comments?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 28 Nov 2011, at 22:18, Heng Li wrote:
> 
>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>> 
>> The backward operation is designed to describe such an alignment:
>> 
>> REF:: GCATACGATCGACTAGTCACGT  
>> READ: --ATACGATCGA----------
>> READ: ---------CGACTAGTCAC--
>> 
>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>> 
>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>> 
>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>> 
>> Do you think it is worth having a backward operation?
>> 
>> regards,
>> 
>> Heng
>> 
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>> Limited, a charity registered in England with number 1021457 and a 
>> company registered in England with number 2742969, whose registered 
>> office is 215 Euston Road, London, NW1 2BE. 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@li...> - 2011-11-29 15:11:47

Hi Ewan,

Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?

> REF:: GCATACGATCGACTAGTCACGT  
> READ: --ATACGATCGA----------
> READ: ---------CGACTAGTCAC--


Thanks,

Heng


On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:

> 
> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
> 
>> 
>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>> which captures nearly all the data, can be efficiently compressed and
>> will have little disruption on downstream tools though tools would have
>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>> scheme)
>> 
> 
> 
> Just to add to this (only in writing this email did I realise it): if one has
> some additional tag to store the second quality values in the B() regions one
> can achieve complete representation of CG data in this scheme. 
> 
> 
>> 
>> Complete genomics has two levels of reads:
>> 
>>  The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>> 
>>  The "read fragments" inside of the paired ends, being a structured 
>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>> below) and a 5 base pair "gap".
>> 
>> 
>> The scheme below looks at an aligment at the read-fragment level.
>> 
>> 
>> An alternative scheme is to look at the alignment of the paired ends,
>> with the sequence being the called CG reads with "Ns" in the gap positions
>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>> This is, in effect, a cigar string, and as this will be CG specific, one could
>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>> on per read basis (I think better to have an attribute at the readset level).
>> 
>> 
>> 
>> This "mask" is highly structured, and so even in sam representation probably
>> best represented as a huffman encoded dictionary - certainly in more compressed
>> forms, this would be a huffman encode variable length, and in playing around
>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>> pretty compressible, being ~3.5 bits per mask.
>> 
>> 
>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>> small indels to be represented as a traditional "read level" CIGAR string, 
>> (standard sam) to be mapped post the CG level mask.
>> 
>> 
>> 
>> Just to look at this scheme in terms of legacy code vs non legacy code:
>> 
>>   For legacy code/non-CG aware code:
>> 
>>       There is a long read with "Ns".
>>       The Ns imply no knowledge of the bases - so SNP callers will not make
>> a terrible mistake here
>> 
>> 
>>   For CG aware code:
>> 
>>       You read in the mask, look up the "CG level" string in the dictionary
>> 
>>       Apply the CG level string, including "B" operations
>> 
>>       You can recreate the actual fragments now internally.
>> 
>> 
>> The only small wrinkle is that inside the B() operations there are two quality
>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>> It is also the case that although there are two "chemistry reads" at this point,
>> there are other error processes (eg, template construction) in which the two
>> bases are not independent. So - for any machinery which is sophisticated enough
>> to want to think about 2 quality values will almost certainly do something
>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>> the errors of these B regions separately from the quality codes elsewhere)
>> 
>> 
>> 
>> 
>> 
>> 
>> Does this make sense? Comments?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>> 
>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>> 
>>> The backward operation is designed to describe such an alignment:
>>> 
>>> REF:: GCATACGATCGACTAGTCACGT  
>>> READ: --ATACGATCGA----------
>>> READ: ---------CGACTAGTCAC--
>>> 
>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>> 
>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>> 
>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>> 
>>> Do you think it is worth having a backward operation?
>>> 
>>> regards,
>>> 
>>> Heng
>>> 
>>> -- 
>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>> Limited, a charity registered in England with number 1021457 and a 
>>> company registered in England with number 2742969, whose registered 
>>> office is 215 Euston Road, London, NW1 2BE. 
>>> 
>>> ------------------------------------------------------------------------------
>>> All the data continuously generated in your IT infrastructure 
>>> contains a definitive record of customers, application performance, 
>>> security threats, fraudulent activity, and more. Splunk takes this 
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Ewan B. <bi...@eb...> - 2011-11-29 15:54:38

On 29 Nov 2011, at 15:11, Heng Li wrote:

> Hi Ewan,
> 
> Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?
> 
>> REF:: GCATACGATCGACTAGTCACGT  
>> READ: --ATACGATCGA----------
>> READ: ---------CGACTAGTCAC--
> 



So - I think the full "paired end" has 3 fragments (i think - it might be 4 - does not matter)

Say:


Ref:  GCATACGATCGACTAGTCACGTTTGTTTGGATGCCCTGATG
FR1   --ATACGATCGA-----------------------------
FR2   ---------CGACTAGTCAC---------------------
FR3   --------------------------TTGGATGCC------



*all three* of these fragments would be one record in the SAM:


ArbitaryReadName1	1187	Ref	3	60	33M	=	345	516	ATACGATCGACTAGTCACNNNNNNTTGGATGCC	BBBB??BBBBBBBBBBBBBNNNNNNNBBB?=BBBB	XC:i:23 XB:Z:B?B



(I might have gotten some of the formatting wrong here, and problaby the quality for the N bases would not be N but
be some appropriate set character - perhaps whatever 0 is...).


The XC (say) tag is the key thing. This says look up in the dictionary for entry 23 for the "CG Mask"

Entry 23 will be something like:

    10M3B11M6I9M

(not sure if it should be I or D in the middle there - whichever one is correct!)


The XB tag is to provide the quality values of the overlapping bases, in this case 3; I think convention
would be something like you use hte leftmost set in the main string, and the XB flag would tell you the
overlap qualities.


So - if you "understand" the CG XC and XB flags, you can now recreate the original 3 fragments correctly.


if you don't you make the reasonably assumption that the bases are called as above. 








>> 







> 
> Thanks,
> 
> Heng
> 
> 
> On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:
> 
>> 
>> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
>> 
>>> 
>>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>>> which captures nearly all the data, can be efficiently compressed and
>>> will have little disruption on downstream tools though tools would have
>>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>>> scheme)
>>> 
>> 
>> 
>> Just to add to this (only in writing this email did I realise it): if one has
>> some additional tag to store the second quality values in the B() regions one
>> can achieve complete representation of CG data in this scheme. 
>> 
>> 
>>> 
>>> Complete genomics has two levels of reads:
>>> 
>>> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>>> 
>>> The "read fragments" inside of the paired ends, being a structured 
>>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>>> below) and a 5 base pair "gap".
>>> 
>>> 
>>> The scheme below looks at an aligment at the read-fragment level.
>>> 
>>> 
>>> An alternative scheme is to look at the alignment of the paired ends,
>>> with the sequence being the called CG reads with "Ns" in the gap positions
>>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>>> This is, in effect, a cigar string, and as this will be CG specific, one could
>>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>>> on per read basis (I think better to have an attribute at the readset level).
>>> 
>>> 
>>> 
>>> This "mask" is highly structured, and so even in sam representation probably
>>> best represented as a huffman encoded dictionary - certainly in more compressed
>>> forms, this would be a huffman encode variable length, and in playing around
>>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>>> pretty compressible, being ~3.5 bits per mask.
>>> 
>>> 
>>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>>> small indels to be represented as a traditional "read level" CIGAR string, 
>>> (standard sam) to be mapped post the CG level mask.
>>> 
>>> 
>>> 
>>> Just to look at this scheme in terms of legacy code vs non legacy code:
>>> 
>>>  For legacy code/non-CG aware code:
>>> 
>>>      There is a long read with "Ns".
>>>      The Ns imply no knowledge of the bases - so SNP callers will not make
>>> a terrible mistake here
>>> 
>>> 
>>>  For CG aware code:
>>> 
>>>      You read in the mask, look up the "CG level" string in the dictionary
>>> 
>>>      Apply the CG level string, including "B" operations
>>> 
>>>      You can recreate the actual fragments now internally.
>>> 
>>> 
>>> The only small wrinkle is that inside the B() operations there are two quality
>>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>>> It is also the case that although there are two "chemistry reads" at this point,
>>> there are other error processes (eg, template construction) in which the two
>>> bases are not independent. So - for any machinery which is sophisticated enough
>>> to want to think about 2 quality values will almost certainly do something
>>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>>> the errors of these B regions separately from the quality codes elsewhere)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Does this make sense? Comments?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>>> 
>>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>>> 
>>>> The backward operation is designed to describe such an alignment:
>>>> 
>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>> READ: --ATACGATCGA----------
>>>> READ: ---------CGACTAGTCAC--
>>>> 
>>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>>> 
>>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>>> 
>>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>>> 
>>>> Do you think it is worth having a backward operation?
>>>> 
>>>> regards,
>>>> 
>>>> Heng
>>>> 
>>>> -- 
>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>> Limited, a charity registered in England with number 1021457 and a 
>>>> company registered in England with number 2742969, whose registered 
>>>> office is 215 Euston Road, London, NW1 2BE. 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> All the data continuously generated in your IT infrastructure 
>>>> contains a definitive record of customers, application performance, 
>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>> data and makes sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-novd2d
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Sam...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> All the data continuously generated in your IT infrastructure 
>>> contains a definitive record of customers, application performance, 
>>> security threats, fraudulent activity, and more. Splunk takes this 
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-11-29 17:12:02

Thanks for the explanation, Ewan. Your proposal is sort of like the way CG is using, which I proposed about two years ago. At present, a CG SAM record looks like:

GS21184-FS3-L07-7:8218888 67 chr20 8049 41 23M6N10M = 8335 286 GAGTGCCAGGACCAGATCCTGGCCAGGTGGTTA 987::767777778:=:;:;:;*88999988/- GC:Z:3S2G28S GS:Z:TGTG GQ:Z:++26

The GS tag keeps the raw sequence in the overlap and GQ keeps the raw quality. GC describes where to put the overlaps (3bp from 5'-end), which can be replaced by a CIGAR with 'B' if desired. The SEQ and QUAL keep the _consensus_ sequence and quality. Note that a difference from your proposal is that the sequence in the overlap can be different (GS may be "TGAG" when there is a sequencing error). This representation is more verbose (and thus less compressible) but more generic.


So, the situation is like what happened two years ago. We still have two options for the CG data:

1) We add a backward CIGAR operation "B". This is the cleanest solution for CG developers, but comes as a pain to other users. In addition, 'B' mainly makes a CG SAM cleaner and easier to compress. I guess most end-users would like to use a SAM with 'B' removed for the purpose of analyses.

2) We do not modify CIGAR in SAM, but instead ask cSRA/CRAM developers to come up with creative ways to compress the overlap.


It seems that people in this samtools-devel list have not expressed strong interest in solution 1) so far. I will wait for a few days to see if further comments come up. I will also convey the opinions of SAM developers to CG if I am involved in related offline discussions again.

Thank you all,

Heng

On Nov 29, 2011, at 10:54 AM, Ewan Birney wrote:

> 
> On 29 Nov 2011, at 15:11, Heng Li wrote:
> 
>> Hi Ewan,
>> 
>> Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?
>> 
>>> REF:: GCATACGATCGACTAGTCACGT  
>>> READ: --ATACGATCGA----------
>>> READ: ---------CGACTAGTCAC--
>> 
> 
> 
> 
> So - I think the full "paired end" has 3 fragments (i think - it might be 4 - does not matter)
> 
> Say:
> 
> 
> Ref:  GCATACGATCGACTAGTCACGTTTGTTTGGATGCCCTGATG
> FR1   --ATACGATCGA-----------------------------
> FR2   ---------CGACTAGTCAC---------------------
> FR3   --------------------------TTGGATGCC------
> 
> 
> 
> *all three* of these fragments would be one record in the SAM:
> 
> 
> ArbitaryReadName1	1187	Ref	3	60	33M	=	345	516	ATACGATCGACTAGTCACNNNNNNTTGGATGCC	BBBB??BBBBBBBBBBBBBNNNNNNNBBB?=BBBB	XC:i:23 XB:Z:B?B
> 
> 
> 
> (I might have gotten some of the formatting wrong here, and problaby the quality for the N bases would not be N but
> be some appropriate set character - perhaps whatever 0 is...).
> 
> 
> The XC (say) tag is the key thing. This says look up in the dictionary for entry 23 for the "CG Mask"
> 
> Entry 23 will be something like:
> 
>    10M3B11M6I9M
> 
> (not sure if it should be I or D in the middle there - whichever one is correct!)
> 
> 
> The XB tag is to provide the quality values of the overlapping bases, in this case 3; I think convention
> would be something like you use hte leftmost set in the main string, and the XB flag would tell you the
> overlap qualities.
> 
> 
> So - if you "understand" the CG XC and XB flags, you can now recreate the original 3 fragments correctly.
> 
> 
> if you don't you make the reasonably assumption that the bases are called as above. 
> 
> 
> 
> 
> 
> 
> 
> 
>>> 
> 
> 
> 
> 
> 
> 
> 
>> 
>> Thanks,
>> 
>> Heng
>> 
>> 
>> On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:
>> 
>>> 
>>> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
>>> 
>>>> 
>>>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>>>> which captures nearly all the data, can be efficiently compressed and
>>>> will have little disruption on downstream tools though tools would have
>>>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>>>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>>>> scheme)
>>>> 
>>> 
>>> 
>>> Just to add to this (only in writing this email did I realise it): if one has
>>> some additional tag to store the second quality values in the B() regions one
>>> can achieve complete representation of CG data in this scheme. 
>>> 
>>> 
>>>> 
>>>> Complete genomics has two levels of reads:
>>>> 
>>>> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>>>> 
>>>> The "read fragments" inside of the paired ends, being a structured 
>>>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>>>> below) and a 5 base pair "gap".
>>>> 
>>>> 
>>>> The scheme below looks at an aligment at the read-fragment level.
>>>> 
>>>> 
>>>> An alternative scheme is to look at the alignment of the paired ends,
>>>> with the sequence being the called CG reads with "Ns" in the gap positions
>>>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>>>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>>>> This is, in effect, a cigar string, and as this will be CG specific, one could
>>>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>>>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>>>> on per read basis (I think better to have an attribute at the readset level).
>>>> 
>>>> 
>>>> 
>>>> This "mask" is highly structured, and so even in sam representation probably
>>>> best represented as a huffman encoded dictionary - certainly in more compressed
>>>> forms, this would be a huffman encode variable length, and in playing around
>>>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>>>> pretty compressible, being ~3.5 bits per mask.
>>>> 
>>>> 
>>>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>>>> small indels to be represented as a traditional "read level" CIGAR string, 
>>>> (standard sam) to be mapped post the CG level mask.
>>>> 
>>>> 
>>>> 
>>>> Just to look at this scheme in terms of legacy code vs non legacy code:
>>>> 
>>>> For legacy code/non-CG aware code:
>>>> 
>>>>     There is a long read with "Ns".
>>>>     The Ns imply no knowledge of the bases - so SNP callers will not make
>>>> a terrible mistake here
>>>> 
>>>> 
>>>> For CG aware code:
>>>> 
>>>>     You read in the mask, look up the "CG level" string in the dictionary
>>>> 
>>>>     Apply the CG level string, including "B" operations
>>>> 
>>>>     You can recreate the actual fragments now internally.
>>>> 
>>>> 
>>>> The only small wrinkle is that inside the B() operations there are two quality
>>>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>>>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>>>> It is also the case that although there are two "chemistry reads" at this point,
>>>> there are other error processes (eg, template construction) in which the two
>>>> bases are not independent. So - for any machinery which is sophisticated enough
>>>> to want to think about 2 quality values will almost certainly do something
>>>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>>>> the errors of these B regions separately from the quality codes elsewhere)
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Does this make sense? Comments?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>>>> 
>>>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>>>> 
>>>>> The backward operation is designed to describe such an alignment:
>>>>> 
>>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>>> READ: --ATACGATCGA----------
>>>>> READ: ---------CGACTAGTCAC--
>>>>> 
>>>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>>>> 
>>>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>>>> 
>>>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>>>> 
>>>>> Do you think it is worth having a backward operation?
>>>>> 
>>>>> regards,
>>>>> 
>>>>> Heng
>>>>> 
>>>>> -- 
>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>>> Limited, a charity registered in England with number 1021457 and a 
>>>>> company registered in England with number 2742969, whose registered 
>>>>> office is 215 Euston Road, London, NW1 2BE. 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> All the data continuously generated in your IT infrastructure 
>>>>> contains a definitive record of customers, application performance, 
>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>> data and makes sense of it. IT sense. And common sense.
>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Sam...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> All the data continuously generated in your IT infrastructure 
>>>> contains a definitive record of customers, application performance, 
>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>> data and makes sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-novd2d
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Sam...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> All the data continuously generated in your IT infrastructure 
>>> contains a definitive record of customers, application performance, 
>>> security threats, fraudulent activity, and more. Splunk takes this 
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> 
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>> Limited, a charity registered in England with number 1021457 and a 
>> company registered in England with number 2742969, whose registered 
>> office is 215 Euston Road, London, NW1 2BE. 
> 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Len T. <len...@re...> - 2011-11-29 21:00:37

At Tue, 29 Nov 2011 12:11:53 -0500,
Heng Li wrote:
> So, the situation is like what happened two years ago. We still have two options for the CG data:
> 
> 1) We add a backward CIGAR operation "B". This is the cleanest solution for CG developers, but comes as a pain to other users. In addition, 'B' mainly makes a CG SAM cleaner and easier to compress. I guess most end-users would like to use a SAM with 'B' removed for the purpose of analyses.
> 
> 2) We do not modify CIGAR in SAM, but instead ask cSRA/CRAM developers to come up with creative ways to compress the overlap.
> 
> 
> It seems that people in this samtools-devel list have not expressed strong interest in solution 1) so far. I will wait for a few days to see if further comments come up. I will also convey the opinions of SAM developers to CG if I am involved in related offline discussions again.

At Realtime Genomics we would prefer the addition of the 'B'
operator. Our cgmap tool currently outputs the flattened/consensus
read in the primary SAM records for compatibility with non-CG aware
downstream tools, plus CG specific additional attributes (effectively
a backstep-capable CIGAR plus other missing information) that are
utilized by our variant caller.  Adding support for 'B' to the SAM
spec would greatly simplify our SAM files; as you note, it is
relatively easy to provide an API to give a backwards-compatible view
and yet is fairly simple for other tools to make use of the more
specific information if they want.

Cheers,
Len.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Richard D. <rd...@sa...> - 2011-11-29 23:49:09

I think my view would now be to add the B cigar operator.   This has the very big benefit to have the original read and qualities in the natural place.
I think Tim Fennel's point makes sense, i.e. it is reasonable to require that B does not take you back beyond the beginning of the read.

However, I am pretty sure it would be wrong to treat two overlapping bases from the same read as independent.  Certainly they aren't from a
haplotype sampling point of view (they come from the same haplotype) and also their qualities probably aren't independent because any
mutation prior to sequencing would be shared.  So I think it would make sense in mpileup output, and likelihood calculations, to calculate a
single combined base.  Primary implementers such as samtools and GATK should agree on this calculation, probably with CG, probably 
using whatever CG uses now to calculate the current derived read.  This means that or most end users there will not be an increase in complication.

Importantly, I suggest that a tool is developed to transform BAM with native CG reads and the B operator into something like the current version
with collapsed reads and no B operator.  This is quite trivial once the operations in the preceding paragraph exist.  Given this tool exists, anyone
using code that reads cigar strings but doesn't yet recognise the B operator can run files through this conversion first.  I guess the exception is
browsers who show the actual aligned reads - they will need to recognise B.

A major reason to have the true read is to compare the with a proposed haplotype in an local assembly type method as in dindel and
other new indel methods.  For this, they will need a way to calculate the likelihood of a sequence given a native CG read.  It would be great 
if CG would be prepared to share their calculation/code for this.  A major part of it is the overlapping base calculation from the preceding paragraph.

So, there is clearly some work to do if we adopt B, but I think it is reasonably limited.  I think it would be worth while if CG are prepared to 
share their corresponding key bits of code (note this is not giving away the whole mapping/alignment package).

Richard

On 29 Nov 2011, at 17:11, Heng Li wrote:

> Thanks for the explanation, Ewan. Your proposal is sort of like the way CG is using, which I proposed about two years ago. At present, a CG SAM record looks like:
> 
> GS21184-FS3-L07-7:8218888 67 chr20 8049 41 23M6N10M = 8335 286 GAGTGCCAGGACCAGATCCTGGCCAGGTGGTTA 987::767777778:=:;:;:;*88999988/- GC:Z:3S2G28S GS:Z:TGTG GQ:Z:++26
> 
> The GS tag keeps the raw sequence in the overlap and GQ keeps the raw quality. GC describes where to put the overlaps (3bp from 5'-end), which can be replaced by a CIGAR with 'B' if desired. The SEQ and QUAL keep the _consensus_ sequence and quality. Note that a difference from your proposal is that the sequence in the overlap can be different (GS may be "TGAG" when there is a sequencing error). This representation is more verbose (and thus less compressible) but more generic.
> 
> 
> So, the situation is like what happened two years ago. We still have two options for the CG data:
> 
> 1) We add a backward CIGAR operation "B". This is the cleanest solution for CG developers, but comes as a pain to other users. In addition, 'B' mainly makes a CG SAM cleaner and easier to compress. I guess most end-users would like to use a SAM with 'B' removed for the purpose of analyses.
> 
> 2) We do not modify CIGAR in SAM, but instead ask cSRA/CRAM developers to come up with creative ways to compress the overlap.
> 
> 
> It seems that people in this samtools-devel list have not expressed strong interest in solution 1) so far. I will wait for a few days to see if further comments come up. I will also convey the opinions of SAM developers to CG if I am involved in related offline discussions again.
> 
> Thank you all,
> 
> Heng
> 
> On Nov 29, 2011, at 10:54 AM, Ewan Birney wrote:
> 
>> 
>> On 29 Nov 2011, at 15:11, Heng Li wrote:
>> 
>>> Hi Ewan,
>>> 
>>> Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?
>>> 
>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>> READ: --ATACGATCGA----------
>>>> READ: ---------CGACTAGTCAC--
>>> 
>> 
>> 
>> 
>> So - I think the full "paired end" has 3 fragments (i think - it might be 4 - does not matter)
>> 
>> Say:
>> 
>> 
>> Ref:  GCATACGATCGACTAGTCACGTTTGTTTGGATGCCCTGATG
>> FR1   --ATACGATCGA-----------------------------
>> FR2   ---------CGACTAGTCAC---------------------
>> FR3   --------------------------TTGGATGCC------
>> 
>> 
>> 
>> *all three* of these fragments would be one record in the SAM:
>> 
>> 
>> ArbitaryReadName1	1187	Ref	3	60	33M	=	345	516	ATACGATCGACTAGTCACNNNNNNTTGGATGCC	BBBB??BBBBBBBBBBBBBNNNNNNNBBB?=BBBB	XC:i:23 XB:Z:B?B
>> 
>> 
>> 
>> (I might have gotten some of the formatting wrong here, and problaby the quality for the N bases would not be N but
>> be some appropriate set character - perhaps whatever 0 is...).
>> 
>> 
>> The XC (say) tag is the key thing. This says look up in the dictionary for entry 23 for the "CG Mask"
>> 
>> Entry 23 will be something like:
>> 
>>  10M3B11M6I9M
>> 
>> (not sure if it should be I or D in the middle there - whichever one is correct!)
>> 
>> 
>> The XB tag is to provide the quality values of the overlapping bases, in this case 3; I think convention
>> would be something like you use hte leftmost set in the main string, and the XB flag would tell you the
>> overlap qualities.
>> 
>> 
>> So - if you "understand" the CG XC and XB flags, you can now recreate the original 3 fragments correctly.
>> 
>> 
>> if you don't you make the reasonably assumption that the bases are called as above. 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>> Thanks,
>>> 
>>> Heng
>>> 
>>> 
>>> On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:
>>> 
>>>> 
>>>> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
>>>> 
>>>>> 
>>>>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>>>>> which captures nearly all the data, can be efficiently compressed and
>>>>> will have little disruption on downstream tools though tools would have
>>>>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>>>>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>>>>> scheme)
>>>>> 
>>>> 
>>>> 
>>>> Just to add to this (only in writing this email did I realise it): if one has
>>>> some additional tag to store the second quality values in the B() regions one
>>>> can achieve complete representation of CG data in this scheme. 
>>>> 
>>>> 
>>>>> 
>>>>> Complete genomics has two levels of reads:
>>>>> 
>>>>> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>>>>> 
>>>>> The "read fragments" inside of the paired ends, being a structured 
>>>>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>>>>> below) and a 5 base pair "gap".
>>>>> 
>>>>> 
>>>>> The scheme below looks at an aligment at the read-fragment level.
>>>>> 
>>>>> 
>>>>> An alternative scheme is to look at the alignment of the paired ends,
>>>>> with the sequence being the called CG reads with "Ns" in the gap positions
>>>>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>>>>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>>>>> This is, in effect, a cigar string, and as this will be CG specific, one could
>>>>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>>>>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>>>>> on per read basis (I think better to have an attribute at the readset level).
>>>>> 
>>>>> 
>>>>> 
>>>>> This "mask" is highly structured, and so even in sam representation probably
>>>>> best represented as a huffman encoded dictionary - certainly in more compressed
>>>>> forms, this would be a huffman encode variable length, and in playing around
>>>>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>>>>> pretty compressible, being ~3.5 bits per mask.
>>>>> 
>>>>> 
>>>>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>>>>> small indels to be represented as a traditional "read level" CIGAR string, 
>>>>> (standard sam) to be mapped post the CG level mask.
>>>>> 
>>>>> 
>>>>> 
>>>>> Just to look at this scheme in terms of legacy code vs non legacy code:
>>>>> 
>>>>> For legacy code/non-CG aware code:
>>>>> 
>>>>>   There is a long read with "Ns".
>>>>>   The Ns imply no knowledge of the bases - so SNP callers will not make
>>>>> a terrible mistake here
>>>>> 
>>>>> 
>>>>> For CG aware code:
>>>>> 
>>>>>   You read in the mask, look up the "CG level" string in the dictionary
>>>>> 
>>>>>   Apply the CG level string, including "B" operations
>>>>> 
>>>>>   You can recreate the actual fragments now internally.
>>>>> 
>>>>> 
>>>>> The only small wrinkle is that inside the B() operations there are two quality
>>>>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>>>>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>>>>> It is also the case that although there are two "chemistry reads" at this point,
>>>>> there are other error processes (eg, template construction) in which the two
>>>>> bases are not independent. So - for any machinery which is sophisticated enough
>>>>> to want to think about 2 quality values will almost certainly do something
>>>>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>>>>> the errors of these B regions separately from the quality codes elsewhere)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Does this make sense? Comments?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>>>>> 
>>>>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>>>>> 
>>>>>> The backward operation is designed to describe such an alignment:
>>>>>> 
>>>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>>>> READ: --ATACGATCGA----------
>>>>>> READ: ---------CGACTAGTCAC--
>>>>>> 
>>>>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>>>>> 
>>>>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>>>>> 
>>>>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>>>>> 
>>>>>> Do you think it is worth having a backward operation?
>>>>>> 
>>>>>> regards,
>>>>>> 
>>>>>> Heng
>>>>>> 
>>>>>> -- 
>>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>>>> Limited, a charity registered in England with number 1021457 and a 
>>>>>> company registered in England with number 2742969, whose registered 
>>>>>> office is 215 Euston Road, London, NW1 2BE. 
>>>>>> 
>>>>>> ------------------------------------------------------------------------------
>>>>>> All the data continuously generated in your IT infrastructure 
>>>>>> contains a definitive record of customers, application performance, 
>>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>>> data and makes sense of it. IT sense. And common sense.
>>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> All the data continuously generated in your IT infrastructure 
>>>>> contains a definitive record of customers, application performance, 
>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>> data and makes sense of it. IT sense. And common sense.
>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Sam...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> All the data continuously generated in your IT infrastructure 
>>>> contains a definitive record of customers, application performance, 
>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>> data and makes sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-novd2d
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Sam...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>> 
>>> 
>>> 
>>> -- 
>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>> Limited, a charity registered in England with number 1021457 and a 
>>> company registered in England with number 2742969, whose registered 
>>> office is 215 Euston Road, London, NW1 2BE. 
>> 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-11-30 04:52:20

On Nov 29, 2011, at 6:54 PM, Richard Durbin wrote:

> I think my view would now be to add the B cigar operator.   This has the very big benefit to have the original read and qualities in the natural place.
> I think Tim Fennel's point makes sense, i.e. it is reasonable to require that B does not take you back beyond the beginning of the read.
> 
> However, I am pretty sure it would be wrong to treat two overlapping bases from the same read as independent.  Certainly they aren't from a
> haplotype sampling point of view (they come from the same haplotype) and also their qualities probably aren't independent because any
> mutation prior to sequencing would be shared.  So I think it would make sense in mpileup output, and likelihood calculations, to calculate a
> single combined base.  Primary implementers such as samtools and GATK should agree on this calculation, probably with CG, probably 
> using whatever CG uses now to calculate the current derived read.  This means that or most end users there will not be an increase in complication.

As I wrote the original email, I have gone over the complication of having 'B'. Requiring 'B' not to take back beyond the start is only a minor one. More complex examples include 20M10B5M1D15M (inconsistent CIGAR in the overlap and I think we should allow this) and 20M5B10M7B20M (three sequences overlap but I think we should disallow this). Even in the simplest case, we are dealing with a consensus construction problem, which is not that trivial.

> Importantly, I suggest that a tool is developed to transform BAM with native CG reads and the B operator into something like the current version
> with collapsed reads and no B operator.  This is quite trivial once the operations in the preceding paragraph exist.  Given this tool exists, anyone
> using code that reads cigar strings but doesn't yet recognise the B operator can run files through this conversion first.  I guess the exception is
> browsers who show the actual aligned reads - they will need to recognise B.

I also suggested adding an option to "samtools view" and an API "remove_B()". When we call SNPs, samtools will call "remove_B()" immediately after the alignment is loaded into memory. The mpileup engine will not see any 'B' operations. I actually have no plan to properly support 'B' in mpileup. That will be very complicated in the current framework.

> A major reason to have the true read is to compare the with a proposed haplotype in an local assembly type method as in dindel and
> other new indel methods.  For this, they will need a way to calculate the likelihood of a sequence given a native CG read.  It would be great 
> if CG would be prepared to share their calculation/code for this.  A major part of it is the overlapping base calculation from the preceding paragraph.

BAQ will not work properly without knowing the read structure, either. However, I have been convinced that given deep coverage, de novo local assembly is able to deliver satisfactory results on indel calling. We may not need to go back to the raw reads to do statistical alignment for CG data because CG does de novo local assembly for every indel. In my view, all the benefit of 'B' is a cleaner and more compressible representation.

> So, there is clearly some work to do if we adopt B, but I think it is reasonably limited.  I think it would be worth while if CG are prepared to 
> share their corresponding key bits of code (note this is not giving away the whole mapping/alignment package).

That said, I see the possibility of having the B operation. All I need to do for samtools is to implement remove_B() that probably takes a few hundred lines of code to deal with the meticulous corner cases.

Heng

> Richard
> 
> On 29 Nov 2011, at 17:11, Heng Li wrote:
> 
>> Thanks for the explanation, Ewan. Your proposal is sort of like the way CG is using, which I proposed about two years ago. At present, a CG SAM record looks like:
>> 
>> GS21184-FS3-L07-7:8218888 67 chr20 8049 41 23M6N10M = 8335 286 GAGTGCCAGGACCAGATCCTGGCCAGGTGGTTA 987::767777778:=:;:;:;*88999988/- GC:Z:3S2G28S GS:Z:TGTG GQ:Z:++26
>> 
>> The GS tag keeps the raw sequence in the overlap and GQ keeps the raw quality. GC describes where to put the overlaps (3bp from 5'-end), which can be replaced by a CIGAR with 'B' if desired. The SEQ and QUAL keep the _consensus_ sequence and quality. Note that a difference from your proposal is that the sequence in the overlap can be different (GS may be "TGAG" when there is a sequencing error). This representation is more verbose (and thus less compressible) but more generic.
>> 
>> 
>> So, the situation is like what happened two years ago. We still have two options for the CG data:
>> 
>> 1) We add a backward CIGAR operation "B". This is the cleanest solution for CG developers, but comes as a pain to other users. In addition, 'B' mainly makes a CG SAM cleaner and easier to compress. I guess most end-users would like to use a SAM with 'B' removed for the purpose of analyses.
>> 
>> 2) We do not modify CIGAR in SAM, but instead ask cSRA/CRAM developers to come up with creative ways to compress the overlap.
>> 
>> 
>> It seems that people in this samtools-devel list have not expressed strong interest in solution 1) so far. I will wait for a few days to see if further comments come up. I will also convey the opinions of SAM developers to CG if I am involved in related offline discussions again.
>> 
>> Thank you all,
>> 
>> Heng
>> 
>> On Nov 29, 2011, at 10:54 AM, Ewan Birney wrote:
>> 
>>> 
>>> On 29 Nov 2011, at 15:11, Heng Li wrote:
>>> 
>>>> Hi Ewan,
>>>> 
>>>> Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?
>>>> 
>>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>>> READ: --ATACGATCGA----------
>>>>> READ: ---------CGACTAGTCAC--
>>>> 
>>> 
>>> 
>>> 
>>> So - I think the full "paired end" has 3 fragments (i think - it might be 4 - does not matter)
>>> 
>>> Say:
>>> 
>>> 
>>> Ref:  GCATACGATCGACTAGTCACGTTTGTTTGGATGCCCTGATG
>>> FR1   --ATACGATCGA-----------------------------
>>> FR2   ---------CGACTAGTCAC---------------------
>>> FR3   --------------------------TTGGATGCC------
>>> 
>>> 
>>> 
>>> *all three* of these fragments would be one record in the SAM:
>>> 
>>> 
>>> ArbitaryReadName1	1187	Ref	3	60	33M	=	345	516	ATACGATCGACTAGTCACNNNNNNTTGGATGCC	BBBB??BBBBBBBBBBBBBNNNNNNNBBB?=BBBB	XC:i:23 XB:Z:B?B
>>> 
>>> 
>>> 
>>> (I might have gotten some of the formatting wrong here, and problaby the quality for the N bases would not be N but
>>> be some appropriate set character - perhaps whatever 0 is...).
>>> 
>>> 
>>> The XC (say) tag is the key thing. This says look up in the dictionary for entry 23 for the "CG Mask"
>>> 
>>> Entry 23 will be something like:
>>> 
>>> 10M3B11M6I9M
>>> 
>>> (not sure if it should be I or D in the middle there - whichever one is correct!)
>>> 
>>> 
>>> The XB tag is to provide the quality values of the overlapping bases, in this case 3; I think convention
>>> would be something like you use hte leftmost set in the main string, and the XB flag would tell you the
>>> overlap qualities.
>>> 
>>> 
>>> So - if you "understand" the CG XC and XB flags, you can now recreate the original 3 fragments correctly.
>>> 
>>> 
>>> if you don't you make the reasonably assumption that the bases are called as above. 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Heng
>>>> 
>>>> 
>>>> On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:
>>>> 
>>>>> 
>>>>> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
>>>>> 
>>>>>> 
>>>>>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>>>>>> which captures nearly all the data, can be efficiently compressed and
>>>>>> will have little disruption on downstream tools though tools would have
>>>>>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>>>>>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>>>>>> scheme)
>>>>>> 
>>>>> 
>>>>> 
>>>>> Just to add to this (only in writing this email did I realise it): if one has
>>>>> some additional tag to store the second quality values in the B() regions one
>>>>> can achieve complete representation of CG data in this scheme. 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Complete genomics has two levels of reads:
>>>>>> 
>>>>>> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>>>>>> 
>>>>>> The "read fragments" inside of the paired ends, being a structured 
>>>>>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>>>>>> below) and a 5 base pair "gap".
>>>>>> 
>>>>>> 
>>>>>> The scheme below looks at an aligment at the read-fragment level.
>>>>>> 
>>>>>> 
>>>>>> An alternative scheme is to look at the alignment of the paired ends,
>>>>>> with the sequence being the called CG reads with "Ns" in the gap positions
>>>>>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>>>>>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>>>>>> This is, in effect, a cigar string, and as this will be CG specific, one could
>>>>>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>>>>>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>>>>>> on per read basis (I think better to have an attribute at the readset level).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> This "mask" is highly structured, and so even in sam representation probably
>>>>>> best represented as a huffman encoded dictionary - certainly in more compressed
>>>>>> forms, this would be a huffman encode variable length, and in playing around
>>>>>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>>>>>> pretty compressible, being ~3.5 bits per mask.
>>>>>> 
>>>>>> 
>>>>>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>>>>>> small indels to be represented as a traditional "read level" CIGAR string, 
>>>>>> (standard sam) to be mapped post the CG level mask.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Just to look at this scheme in terms of legacy code vs non legacy code:
>>>>>> 
>>>>>> For legacy code/non-CG aware code:
>>>>>> 
>>>>>>  There is a long read with "Ns".
>>>>>>  The Ns imply no knowledge of the bases - so SNP callers will not make
>>>>>> a terrible mistake here
>>>>>> 
>>>>>> 
>>>>>> For CG aware code:
>>>>>> 
>>>>>>  You read in the mask, look up the "CG level" string in the dictionary
>>>>>> 
>>>>>>  Apply the CG level string, including "B" operations
>>>>>> 
>>>>>>  You can recreate the actual fragments now internally.
>>>>>> 
>>>>>> 
>>>>>> The only small wrinkle is that inside the B() operations there are two quality
>>>>>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>>>>>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>>>>>> It is also the case that although there are two "chemistry reads" at this point,
>>>>>> there are other error processes (eg, template construction) in which the two
>>>>>> bases are not independent. So - for any machinery which is sophisticated enough
>>>>>> to want to think about 2 quality values will almost certainly do something
>>>>>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>>>>>> the errors of these B regions separately from the quality codes elsewhere)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Does this make sense? Comments?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>>>>>> 
>>>>>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>>>>>> 
>>>>>>> The backward operation is designed to describe such an alignment:
>>>>>>> 
>>>>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>>>>> READ: --ATACGATCGA----------
>>>>>>> READ: ---------CGACTAGTCAC--
>>>>>>> 
>>>>>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>>>>>> 
>>>>>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>>>>>> 
>>>>>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>>>>>> 
>>>>>>> Do you think it is worth having a backward operation?
>>>>>>> 
>>>>>>> regards,
>>>>>>> 
>>>>>>> Heng
>>>>>>> 
>>>>>>> -- 
>>>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>>>>> Limited, a charity registered in England with number 1021457 and a 
>>>>>>> company registered in England with number 2742969, whose registered 
>>>>>>> office is 215 Euston Road, London, NW1 2BE. 
>>>>>>> 
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> All the data continuously generated in your IT infrastructure 
>>>>>>> contains a definitive record of customers, application performance, 
>>>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>>>> data and makes sense of it. IT sense. And common sense.
>>>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>>>> _______________________________________________
>>>>>>> Samtools-devel mailing list
>>>>>>> Sam...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>> 
>>>>>> 
>>>>>> ------------------------------------------------------------------------------
>>>>>> All the data continuously generated in your IT infrastructure 
>>>>>> contains a definitive record of customers, application performance, 
>>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>>> data and makes sense of it. IT sense. And common sense.
>>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> All the data continuously generated in your IT infrastructure 
>>>>> contains a definitive record of customers, application performance, 
>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>> data and makes sense of it. IT sense. And common sense.
>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Sam...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>> Limited, a charity registered in England with number 1021457 and a 
>>>> company registered in England with number 2742969, whose registered 
>>>> office is 215 Euston Road, London, NW1 2BE. 
>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> All the data continuously generated in your IT infrastructure 
>>> contains a definitive record of customers, application performance, 
>>> security threats, fraudulent activity, and more. Splunk takes this 
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> 
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>> Limited, a charity registered in England with number 1021457 and a 
>> company registered in England with number 2742969, whose registered 
>> office is 215 Euston Road, London, NW1 2BE. 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Goncalo A. <go...@um...> - 2011-11-30 08:44:24

> As I wrote the original email, I have gone over the complication of having
'B'.
> Requiring 'B' not to take back beyond the start is only a minor one. More
> complex examples include 20M10B5M1D15M (inconsistent CIGAR in the
> overlap and I think we should allow this) and 20M5B10M7B20M (three
> sequences overlap but I think we should disallow this). Even in the
simplest
> case, we are dealing with a consensus construction problem, which is not
> that trivial.

Heng,

Don't these complex examples (e.g. particularly the one that produces
inconsistent CIGAR strings) suggest that we really would prefer CG to
"flatten" their alignment and record that into the SEQ and QUAL fields?
Presumably, they'll have generated a multiple alignment prior to generating
the BAM and will have a much better idea of how to resolve these complex
(but hopefully rare) cases.

It seems that accounting for all these nuances could make a lot of BAM
operations into complicated things that require calculation of alignments.

Gonçalo

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Fred L. <fl...@sd...> - 2011-11-30 01:40:33

Misc. comments:

   1. I don't think the addition of the "B" operator alone is sufficient
      to represent the alignment of multiple fragments/segments.  It
      seems like we also need another operator, such as "F", to denote
      the start of a new fragment/segment.  Also, I'm wondering what
      happens if one of the CG fragments is reversed; how do we
      represent that?
   2. This is yet another special case of the more general problem: "How
      do we represent complex multi-fragment alignments in SAM?"  There
      have been other special case requests before this, and there will
      be more requests after this.  Instead of dealing with one special
      case after another, why not solve the problem in a way that is
      general enough to solve all problems?
   3. Richard and I have both suggested that if new operators are added
      that break compatibility, then we should also create an official
      conversion tool that will convert the new format back to the
      standard format.  This gives people the freedom to create, for
      example, a "SAM Archive Format" that takes up less disk space, or
      a "SAM Complex Alignment" format that allows arbitrarily complex
      alignments to be stored in a single record.

FL

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Goncalo A. <go...@um...> - 2011-11-30 02:53:12


If we add a B operator, my take on how to do it is quite similar to what
Tim Fennel, Richard, and others have expressed in this thread. 

* We should require that the B operator doesn't skip past the beginning of
the read; if we don't impose this restriction all sorts of operations could
become quite complicated (for example, when piling up bases at a particular
point one could never be sure that future reads wouldn't skip backwards and
further add to the pile, even in a coordinate sorted BAM). 

* For tools that wish to fully support CG reads (which will be much easier
to do with some guidance from CG on how to calculate the likelihood of one
of their "reads" given a particular template sequence), this will be very
useful. For tools that don't need to support CG in depth, a very simple
implementation of the "B" operator would be to discard the corresponding
bases and qualities as they are read. 

An alternative to this B operator, which I originally advocated, but that
hasn't
gained much traction, would be to store a projection of the CG read
and qualities in the SEQ and QUAL fields (this definitely wouldn't break
any tools, but would provide less encouragement to developers to extract
the additional information in a CG read), would be to define three
(optional) additional tags, one of these would look similar to a CIGAR
string and store the edits required to transform a the SEQ field into a CG
"read", the next would list the bases sequenced "in duplicate" (and which
presumably would be omitted from the projection, which would include only
the consensus), and finally a third which would store the corresponding
qualities. This format probably would be a little less space efficient than
the B operator (although the precise tradeoffs are not clear to me - the B
operator presumably could also reduce compression ratios for the SEQ field,
because it could make overlapping reads more different -- this could be
mitigated in part by using "=" in portions of the
read that match the
reference). 

Goncalo 

On Tue, 29 Nov 2011 17:40:16 -0800, Fred Long  wrote: Misc. comments:

 	* I don't think the addition of the "B" operator alone is sufficient to
represent the alignment of multiple fragments/segments. It seems like we
also need another operator, such as "F", to denote the start of a new
fragment/segment. Also, I'm wondering what happens if one of the CG
fragments is reversed; how do we represent that?
 	* This is yet another special case of the more general problem: "How do
we represent complex multi-fragment alignments in SAM?" There have been
other special case requests before this, and there will be more requests
after this. Instead of dealing with one special case after another, why not
solve the problem in a way that is general enough to solve all problems?
 	* Richard and I have both suggested that if new operators are added that
break compatibility, then we should also create an official conversion tool
that will convert the new
format back to the standard format. This gives
people the freedom to create, for example, a "SAM Archive Format" that
takes up less disk space, or a "SAM Complex Alignment" format that allows
arbitrarily complex alignments to be stored in a single record.

 FL

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-11-30 04:52:46

On Nov 29, 2011, at 8:40 PM, Fred Long wrote:

> Misc. comments:
> 	• I don't think the addition of the "B" operator alone is sufficient to represent the alignment of multiple fragments/segments.  It seems like we also need another operator, such as "F", to denote the start of a new fragment/segment.  Also, I'm wondering what happens if one of the CG fragments is reversed; how do we represent that?

This does not happen to CG data.

> 	• This is yet another special case of the more general problem: "How do we represent complex multi-fragment alignments in SAM?"  There have been other special case requests before this, and         there will be more requests after this.  Instead of dealing with one special case after another, why not solve the problem in a way that is general enough to solve all problems?

The CG case is simpler than a generic multi-segment alignment in our lengthy discussion months ago. We should avoid going into that complexity again.

Heng



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Ewan B. <bi...@eb...> - 2011-11-30 08:40:24

On 29 Nov 2011, at 17:11, Heng Li wrote:

> Thanks for the explanation, Ewan. Your proposal is sort of like the way CG is using, which I proposed about two years ago. At present, a CG SAM record looks like:
> 
> GS21184-FS3-L07-7:8218888 67 chr20 8049 41 23M6N10M = 8335 286 GAGTGCCAGGACCAGATCCTGGCCAGGTGGTTA 987::767777778:=:;:;:;*88999988/- GC:Z:3S2G28S GS:Z:TGTG GQ:Z:++26
> 
> The GS tag keeps the raw sequence in the overlap and GQ keeps the raw quality. GC describes where to put the overlaps (3bp from 5'-end), which can be replaced by a CIGAR with 'B' if desired. The SEQ and QUAL keep the _consensus_ sequence and quality. Note that a difference from your proposal is that the sequence in the overlap can be different (GS may be "TGAG" when there is a sequencing error). This representation is more verbose (and thus less compressible) but more generic.
> 

Just to respond to this the current system is notionally similar due to the representation of multiple
"fragment" by one read, but still places most of the "gap" structure in the traditional CIGAR position. 



By factoring out the highly repetitive "gap" (including 'B' operations) which are inherent to the CG 
data one can build a more compressed structure, which would approach similar compression rates
to Illumina reads for a given error rate (the bigger problem than the gap structure in CG data
in my understanding is the error rate, which is appreciably higher than illumina, and inherently
a more error-prone dataset is harder to compress. In the big scheme of things quite possibly
the saving on indel structure is minor compared to the number mismatched bases to reference 
stored). 

I think adding the 'B' operator might make the SAM more compressible to BAM due to dropping the additional
tags used for this, but as BAM is 3-5 (or even more, depending how many tags one has of course) 
fold bigger than (lossless) CRAM/cSRA, by far a bigger win overall is moving to CRAM/cSRA. So,
I think the argument for adding in B operation is really to have a cleaner, more true to the data
way of describing CG data, rather than a compression win.



Compression priorities have to be driven very much by the practical issues on the ground,
and at this moment this means dealing with Illumina data (>85% of the current bases in the archives); 
it is far more important that we shift the community to using/submitting/downloading (lossless) CRAM/cSRA 
over the next 6 months for illumina data than we optimise the marginal compressability of CG 
data (though discussion never hurts, and it is good to get ahead of the curve!).




Take home message - if the SAM/BAM community wants to put in a 'B' operator, there is no objection
from me (or I think the ENA - Guy and Rasko to confirm). What we are more keen on is continued
engagement on the more compressed data forms the archives are developing.






> 
> So, the situation is like what happened two years ago. We still have two options for the CG data:
> 
> 1) We add a backward CIGAR operation "B". This is the cleanest solution for CG developers, but comes as a pain to other users. In addition, 'B' mainly makes a CG SAM cleaner and easier to compress. I guess most end-users would like to use a SAM with 'B' removed for the purpose of analyses.
> 
> 2) We do not modify CIGAR in SAM, but instead ask cSRA/CRAM developers to come up with creative ways to compress the overlap.
> 
> 
> It seems that people in this samtools-devel list have not expressed strong interest in solution 1) so far. I will wait for a few days to see if further comments come up. I will also convey the opinions of SAM developers to CG if I am involved in related offline discussions again.
> 
> Thank you all,
> 
> Heng
> 
> On Nov 29, 2011, at 10:54 AM, Ewan Birney wrote:
> 
>> 
>> On 29 Nov 2011, at 15:11, Heng Li wrote:
>> 
>>> Hi Ewan,
>>> 
>>> Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example?
>>> 
>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>> READ: --ATACGATCGA----------
>>>> READ: ---------CGACTAGTCAC--
>>> 
>> 
>> 
>> 
>> So - I think the full "paired end" has 3 fragments (i think - it might be 4 - does not matter)
>> 
>> Say:
>> 
>> 
>> Ref:  GCATACGATCGACTAGTCACGTTTGTTTGGATGCCCTGATG
>> FR1   --ATACGATCGA-----------------------------
>> FR2   ---------CGACTAGTCAC---------------------
>> FR3   --------------------------TTGGATGCC------
>> 
>> 
>> 
>> *all three* of these fragments would be one record in the SAM:
>> 
>> 
>> ArbitaryReadName1	1187	Ref	3	60	33M	=	345	516	ATACGATCGACTAGTCACNNNNNNTTGGATGCC	BBBB??BBBBBBBBBBBBBNNNNNNNBBB?=BBBB	XC:i:23 XB:Z:B?B
>> 
>> 
>> 
>> (I might have gotten some of the formatting wrong here, and problaby the quality for the N bases would not be N but
>> be some appropriate set character - perhaps whatever 0 is...).
>> 
>> 
>> The XC (say) tag is the key thing. This says look up in the dictionary for entry 23 for the "CG Mask"
>> 
>> Entry 23 will be something like:
>> 
>>   10M3B11M6I9M
>> 
>> (not sure if it should be I or D in the middle there - whichever one is correct!)
>> 
>> 
>> The XB tag is to provide the quality values of the overlapping bases, in this case 3; I think convention
>> would be something like you use hte leftmost set in the main string, and the XB flag would tell you the
>> overlap qualities.
>> 
>> 
>> So - if you "understand" the CG XC and XB flags, you can now recreate the original 3 fragments correctly.
>> 
>> 
>> if you don't you make the reasonably assumption that the bases are called as above. 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> 
>>> Thanks,
>>> 
>>> Heng
>>> 
>>> 
>>> On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote:
>>> 
>>>> 
>>>> On 29 Nov 2011, at 08:48, Ewan Birney wrote:
>>>> 
>>>>> 
>>>>> I'd just to express an alternative way of storing CG data in sam/bam/cram
>>>>> which captures nearly all the data, can be efficiently compressed and
>>>>> will have little disruption on downstream tools though tools would have
>>>>> to become "CG aware" to fully leverage the data (whatever happens, tools will 
>>>>> have to become CG aware to really leverage the data, so there is no I think 0 cost 
>>>>> scheme)
>>>>> 
>>>> 
>>>> 
>>>> Just to add to this (only in writing this email did I realise it): if one has
>>>> some additional tag to store the second quality values in the B() regions one
>>>> can achieve complete representation of CG data in this scheme. 
>>>> 
>>>> 
>>>>> 
>>>>> Complete genomics has two levels of reads:
>>>>> 
>>>>> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp)
>>>>> 
>>>>> The "read fragments" inside of the paired ends, being a structured 
>>>>> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem
>>>>> below) and a 5 base pair "gap".
>>>>> 
>>>>> 
>>>>> The scheme below looks at an aligment at the read-fragment level.
>>>>> 
>>>>> 
>>>>> An alternative scheme is to look at the alignment of the paired ends,
>>>>> with the sequence being the called CG reads with "Ns" in the gap positions
>>>>> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length
>>>>> "CG mask" being an encoding of the characteristic gap structure in this CG read.
>>>>> This is, in effect, a cigar string, and as this will be CG specific, one could
>>>>> at this point allow a "B" operation to indicate overlap. Either the whole readset
>>>>> could be declared as "CG" or the presence of this "CG mask" would indicate it
>>>>> on per read basis (I think better to have an attribute at the readset level).
>>>>> 
>>>>> 
>>>>> 
>>>>> This "mask" is highly structured, and so even in sam representation probably
>>>>> best represented as a huffman encoded dictionary - certainly in more compressed
>>>>> forms, this would be a huffman encode variable length, and in playing around
>>>>> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually 
>>>>> pretty compressible, being ~3.5 bits per mask.
>>>>> 
>>>>> 
>>>>> In many  ways this "CG mask" to represent the CG-level indels allows also for
>>>>> small indels to be represented as a traditional "read level" CIGAR string, 
>>>>> (standard sam) to be mapped post the CG level mask.
>>>>> 
>>>>> 
>>>>> 
>>>>> Just to look at this scheme in terms of legacy code vs non legacy code:
>>>>> 
>>>>> For legacy code/non-CG aware code:
>>>>> 
>>>>>    There is a long read with "Ns".
>>>>>    The Ns imply no knowledge of the bases - so SNP callers will not make
>>>>> a terrible mistake here
>>>>> 
>>>>> 
>>>>> For CG aware code:
>>>>> 
>>>>>    You read in the mask, look up the "CG level" string in the dictionary
>>>>> 
>>>>>    Apply the CG level string, including "B" operations
>>>>> 
>>>>>    You can recreate the actual fragments now internally.
>>>>> 
>>>>> 
>>>>> The only small wrinkle is that inside the B() operations there are two quality
>>>>> values, not 1. As I think B operations are rare, it is actually fine to perhaps
>>>>> have a second, rare tag that encode this (might be a bit fiddly but doable). 
>>>>> It is also the case that although there are two "chemistry reads" at this point,
>>>>> there are other error processes (eg, template construction) in which the two
>>>>> bases are not independent. So - for any machinery which is sophisticated enough
>>>>> to want to think about 2 quality values will almost certainly do something
>>>>> sophisticated about (say) the recalibration (I would be thinking about calibrating
>>>>> the errors of these B regions separately from the quality codes elsewhere)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Does this make sense? Comments?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 28 Nov 2011, at 22:18, Heng Li wrote:
>>>>> 
>>>>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>>>>>> 
>>>>>> The backward operation is designed to describe such an alignment:
>>>>>> 
>>>>>> REF:: GCATACGATCGACTAGTCACGT  
>>>>>> READ: --ATACGATCGA----------
>>>>>> READ: ---------CGACTAGTCAC--
>>>>>> 
>>>>>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>>>>>> 
>>>>>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>>>>>> 
>>>>>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>>>>>> 
>>>>>> Do you think it is worth having a backward operation?
>>>>>> 
>>>>>> regards,
>>>>>> 
>>>>>> Heng
>>>>>> 
>>>>>> -- 
>>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>>>>> Limited, a charity registered in England with number 1021457 and a 
>>>>>> company registered in England with number 2742969, whose registered 
>>>>>> office is 215 Euston Road, London, NW1 2BE. 
>>>>>> 
>>>>>> ------------------------------------------------------------------------------
>>>>>> All the data continuously generated in your IT infrastructure 
>>>>>> contains a definitive record of customers, application performance, 
>>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>>> data and makes sense of it. IT sense. And common sense.
>>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> All the data continuously generated in your IT infrastructure 
>>>>> contains a definitive record of customers, application performance, 
>>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>>> data and makes sense of it. IT sense. And common sense.
>>>>> http://p.sf.net/sfu/splunk-novd2d
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Sam...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> All the data continuously generated in your IT infrastructure 
>>>> contains a definitive record of customers, application performance, 
>>>> security threats, fraudulent activity, and more. Splunk takes this 
>>>> data and makes sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-novd2d
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Sam...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>> 
>>> 
>>> 
>>> -- 
>>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>>> Limited, a charity registered in England with number 1021457 and a 
>>> company registered in England with number 2742969, whose registered 
>>> office is 215 Euston Road, London, NW1 2BE. 
>> 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Tim F. <tfe...@br...> - 2011-11-29 14:28:35

One thought is that if B is really intended only to provide the ability to store CG and similar data where the read "skips" back a few bases now and again vs. the reference one thing that would make this much easier on those parsing SAM/BAM would be to limit the use of the B operator so that it cannot skip backwards past the beginning of the read.

So something like 20M5B20M would be valid, but 50M5000B20M would not be.

While tools would still need to explicitly support the B operator it would allow any existing assumptions about not needing to move backwards in the reference from the read start point to still hold.

-t

On Nov 28, 2011, at 5:18 PM, Heng Li wrote:

> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
> 
> The backward operation is designed to describe such an alignment:
> 
> REF:: GCATACGATCGACTAGTCACGT  
> READ: --ATACGATCGA----------
> READ: ---------CGACTAGTCAC--
> 
> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
> 
> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
> 
> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
> 
> Do you think it is worth having a backward operation?
> 
> regards,
> 
> Heng
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-12-05 17:26:27

Hi Tim and Alec,

After brief email exchanges with someone from CG, I realized that they are defining backward with respect to the _query_ sequence, instead of to the reference as I was originally proposing. And the more I think about it, the more I feel CG is taking the right route. When 'B' is for moving backward on the query, we naturally disallow a CIGAR like 50M1000B20M. In addition, CG should always produce CIGAR consistent in a backward overlap because it is taking the 4 subreads as a single unit when doing the alignment. Demanding consistent CIGAR in backward overlaps is a reasonable requirement which also simplifies implementation.

I have played with 'B' in samtools. The implementation turns out to be simple especially when 'B' is defined w.r.t the query sequence. The support of 'B' can be done in ~120 coding lines without much interference with other part of the code. For Picard, I am thinking to have an API:

SAMReader.setRemoveB(bool toRemove);

When "toRemove" is true, the SAMReader iterator collapses overlaps on the fly and returns an alignment without 'B'. This way most existing tools depending on Picard such as GATK and IGV can work with BAMs containing 'B' without modifications.

If you also think implementing the backward operation in Picard can be done without much difficulty, I will probably go ahead to add 'B' in the spec. If you think adding this feature leads to unnecessary complexity or implementation headache in the Picard framework, I will reconsider the entire discussion. As Picard developers, your opinions are important to making the final decision.

Many thanks,

Heng

On Nov 29, 2011, at 9:28 AM, Tim Fennell wrote:

> One thought is that if B is really intended only to provide the ability to store CG and similar data where the read "skips" back a few bases now and again vs. the reference one thing that would make this much easier on those parsing SAM/BAM would be to limit the use of the B operator so that it cannot skip backwards past the beginning of the read.
> 
> So something like 20M5B20M would be valid, but 50M5000B20M would not be.
> 
> While tools would still need to explicitly support the B operator it would allow any existing assumptions about not needing to move backwards in the reference from the read start point to still hold.
> 
> -t
> 
> On Nov 28, 2011, at 5:18 PM, Heng Li wrote:
> 
>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation.
>> 
>> The backward operation is designed to describe such an alignment:
>> 
>> REF:: GCATACGATCGACTAGTCACGT  
>> READ: --ATACGATCGA----------
>> READ: ---------CGACTAGTCAC--
>> 
>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags.
>> 
>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line.
>> 
>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented.
>> 
>> Do you think it is worth having a backward operation?
>> 
>> regards,
>> 
>> Heng
>> 
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research 
>> Limited, a charity registered in England with number 1021457 and a 
>> company registered in England with number 2742969, whose registered 
>> office is 215 Euston Road, London, NW1 2BE. 
>> 
>> ------------------------------------------------------------------------------
>> All the data continuously generated in your IT infrastructure 
>> contains a definitive record of customers, application performance, 
>> security threats, fraudulent activity, and more. Splunk takes this 
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> 
> 
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure 
> contains a definitive record of customers, application performance, 
> security threats, fraudulent activity, and more. Splunk takes this 
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Srinka G. <sg...@co...> - 2011-12-05 20:23:45


On 12/5/11 9:26 AM, "Heng Li" <lh...@sa...> wrote:

>Hi Tim and Alec,
>
>After brief email exchanges with someone from CG, I realized that they
>are defining backward with respect to the _query_ sequence,
This is correct. For Complete Genomics the query sequence is the
coordinate of the DNA nanoball(DNB).
> instead of to the reference as I was originally proposing. And the more
>I think about it, the more I feel CG is taking the right route. When 'B'
>is for moving backward on the query, we naturally disallow a CIGAR like
>50M1000B20M. In addition, CG should always produce CIGAR consistent in a
>backward overlap because it is taking the 4 subreads as a single unit
>when doing the alignment.
Yes.
>Demanding consistent CIGAR in backward overlaps is a reasonable
>requirement which also simplifies implementation.
>
>I have played with 'B' in samtools. The implementation turns out to be
>simple especially when 'B' is defined w.r.t the query sequence. The
>support of 'B' can be done in ~120 coding lines without much interference
>with other part of the code. For Picard, I am thinking to have an API:
>
>SAMReader.setRemoveB(bool toRemove);
>
>When "toRemove" is true, the SAMReader iterator collapses overlaps on the
>fly and returns an alignment without 'B'. This way most existing tools
>depending on Picard such as GATK and IGV can work with BAMs containing
>'B' without modifications.
We do the collapse operation in going from the evidence(local de-novo
assembly) to published BAM and it does work w/ IGV and GATK snp calling.
There are issues w/ GATK indel calling. I'm curious as to how you are
dealing with quality sores of the overlapping bases?
Thanks,
Srinka
>
>If you also think implementing the backward operation in Picard can be
>done without much difficulty, I will probably go ahead to add 'B' in the
>spec. If you think adding this feature leads to unnecessary complexity or
>implementation headache in the Picard framework, I will reconsider the
>entire discussion. As Picard developers, your opinions are important to
>making the final decision.
>
>Many thanks,
>
>Heng
>
>On Nov 29, 2011, at 9:28 AM, Tim Fennell wrote:
>
>> One thought is that if B is really intended only to provide the ability
>>to store CG and similar data where the read "skips" back a few bases now
>>and again vs. the reference one thing that would make this much easier
>>on those parsing SAM/BAM would be to limit the use of the B operator so
>>that it cannot skip backwards past the beginning of the read.
>>
>> So something like 20M5B20M would be valid, but 50M5000B20M would not be.
>>
>> While tools would still need to explicitly support the B operator it
>>would allow any existing assumptions about not needing to move backwards
>>in the reference from the read start point to still hold.
>>
>> -t
>>
>> On Nov 28, 2011, at 5:18 PM, Heng Li wrote:
>>
>>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR
>>>operator 'B' for an operation of moving backward along the reference
>>>genome. It is the opposite of the 'N', the reference skip. In a later
>>>discussion on a separate issue, Fred expressed his preference to a
>>>negative reference skip which is equivalent to a positive 'B'
>>>operation. Now the SRA group from NCBI intends to archive the CG
>>>alignment in the SAM format and raises this request again. I think it
>>>may be the time to add this backward operation.
>>>
>>> The backward operation is designed to describe such an alignment:
>>>
>>> REF:: GCATACGATCGACTAGTCACGT
>>> READ: --ATACGATCGA----------
>>> READ: ---------CGACTAGTCAC--
>>>
>>> i.e. there is an overlap between two segments of a read, which is
>>>quite frequent in CG data. We are unable to fully describe such an
>>>alignment with the original CIGAR. In the current spec, we suggest
>>>using a CIGAR 18M and storing the overlap information in optional tags.
>>>This is a little clumsy and is not compressed well for the purpose of
>>>archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional
>>>tags.
>>>
>>> Using "B" in this case is cleaner, but the major concern is that it
>>>breaks the compatibility and is also likely to complicate SNP calling
>>>and many other applications. As I think now, the solution is to
>>>implement a "remove_B()" routine in samtools. This routine collapses
>>>overlapping sequences, recalculates base quality in the overlap and
>>>gives a CIGAR without 'B'. For the example above, remove_B() gives
>>>CIGAR 18M. For SNP calling, we may call remove_B() immediately after
>>>the alignment loaded into memory. The downstream pileup engine does not
>>>need any changes. Other SNP callers can do the same. A new option will
>>>be added to "samtools view" as a way to remove 'B' operations on the
>>>command-line.
>>>
>>> The implementation of remove_B() may be quite complicated in the
>>>generic case - we may be dealing with a multiple-sequence alignment,
>>>but it should be straightforward in the simple cases such as the
>>>example above. Users may not need to care too much about how remove_B()
>>>is implemented.
>>>
>>> Do you think it is worth having a backward operation?
>>>
>>> regards,
>>>
>>> Heng
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>>------------------------------------------------------------------------
>>>------
>>> All the data continuously generated in your IT infrastructure
>>> contains a definitive record of customers, application performance,
>>> security threats, fraudulent activity, and more. Splunk takes this
>>> data and makes sense of it. IT sense. And common sense.
>>> http://p.sf.net/sfu/splunk-novd2d
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>
>>
>>
>>-------------------------------------------------------------------------
>>-----
>> All the data continuously generated in your IT infrastructure
>> contains a definitive record of customers, application performance,
>> security threats, fraudulent activity, and more. Splunk takes this
>> data and makes sense of it. IT sense. And common sense.
>> http://p.sf.net/sfu/splunk-novd2d
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>
>
>
>--
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
>--------------------------------------------------------------------------
>----
>All the data continuously generated in your IT infrastructure
>contains a definitive record of customers, application performance,
>security threats, fraudulent activity, and more. Splunk takes this
>data and makes sense of it. IT sense. And common sense.
>http://p.sf.net/sfu/splunk-novd2d
>_______________________________________________
>Samtools-devel mailing list
>Sam...@li...
>https://lists.sourceforge.net/lists/listinfo/samtools-devel


________________________________

The contents of this e-mail and any attachments are confidential and only for use by the intended recipient. Any unauthorized use, distribution or copying of this message is strictly prohibited. If you are not the intended recipient please inform the sender immediately by reply e-mail and delete this message from your system. Thank you for your co-operation.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Heng Li <lh...@sa...> - 2011-12-05 21:01:13

Hi Srinka,

On Dec 5, 2011, at 3:01 PM, Srinka Ghosh wrote:
> There are issues w/ GATK indel calling. I'm curious as to how you are
> dealing with quality sores of the overlapping bases?

Currently samtools does this way: if two overlapping bases agree, the base quality (baseQ) equals the higher one of the two; if the two disagree, the consensus base is taken as the one having the higher baseQ and the consensus baseQ equals the subtraction. I think how to compute baseQ does not matter too much especially given deep coverage.

As to SNP/INDEL calling, could you provide more details about how CG are doing this? I could think of three ways:

1) Call SNPs and INDELs from the read-to-evidence alignment and then map the coordinates of called variants back to the reference using the evidence-to-reference alignment.

2) Call variants from the evidence-to-reference alignment without looking back at the supporting reads.

3) Project the read-to-evidence alignments to the reference coordinate and then call variants from the resulting read-to-reference alignments. (I guess this is less likely to be happening?)

My question is actually more related to the other topic we have not discussed in samtools-devel. But let's hijack this thread for a minute.

Thanks,

Heng

> Thanks,
> Srinka

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Srinka G. <sg...@co...> - 2011-12-06 23:46:25

Hi Heng,

On 12/5/11 1:01 PM, "Heng Li" <lh...@sa...> wrote:

>Hi Srinka,
>
>On Dec 5, 2011, at 3:01 PM, Srinka Ghosh wrote:
>> There are issues w/ GATK indel calling. I'm curious as to how you are
>> dealing with quality sores of the overlapping bases?
>
>Currently samtools does this way: if two overlapping bases agree, the
>base quality (baseQ) equals the higher one of the two; if the two
>disagree, the consensus base is taken as the one having the higher baseQ
>and the consensus baseQ equals the subtraction. I think how to compute
>baseQ does not matter too much especially given deep coverage.
Our treatment is somewhat similar: if two overlapping bases agree, the
base quality (baseQ) equals the higher one of the two; if the two
disagree, the consensus base is taken as the one having the higher
baseQand the consensus baseQ equals the higher one of the two.


>
>As to SNP/INDEL calling, could you provide more details about how CG are
>doing this? I could think of three ways:
The small variant calling is based on a 2 step process: 1) fast initial
alignment to the reference genome, followed by 2)local de-novo assembly of
those regions of the genome which are putatively different from the
reference. The initial alignment serves as a scaffold to the LDN.
Detection of regions of putative variation is based on a bayesian model
which tests against the null that the base is a reference. The de-novo
sequence optimization takes into account the mate-pair structure. The
evidence score for the variance also invokes a bayesian model where the
best hypothesis is compared against the next best homozygous hypothesis to
generate an evidence score. We are in last bits of a paper acceptance
process, I can forward the details onto you when the e-pub becomes
available.
Hope this helps,
Srinka
>
>1) Call SNPs and INDELs from the read-to-evidence alignment and then map
>the coordinates of called variants back to the reference using the
>evidence-to-reference alignment.
>
>2) Call variants from the evidence-to-reference alignment without looking
>back at the supporting reads.
>
>3) Project the read-to-evidence alignments to the reference coordinate
>and then call variants from the resulting read-to-reference alignments.
>(I guess this is less likely to be happening?)
>
>My question is actually more related to the other topic we have not
>discussed in samtools-devel. But let's hijack this thread for a minute.
>
>Thanks,
>
>Heng
>
>> Thanks,
>> Srinka
>
>
>
>--
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.


________________________________

The contents of this e-mail and any attachments are confidential and only for use by the intended recipient. Any unauthorized use, distribution or copying of this message is strictly prohibited. If you are not the intended recipient please inform the sender immediately by reply e-mail and delete this message from your system. Thank you for your co-operation.

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Fred L. <fl...@sd...> - 2011-12-14 08:57:09

How about something like this for Picard:

    * Bump the SAM spec version number to 1.5 when the semantics of the
      "B" operator are finalized.
    * Add method "boolean SAMRecord.convertToVersion(String version)",
      which will convert the current SAM record to the specified
      version, if necessary.  If information is lost (e.g., QUAL values
      are merged), the return value is true, otherwise the return value
      is false.  If it is impossible to do the conversion, or a bad
      version number is specified, an exception is thrown.
    * For reading SAM files:
          o Add a method SAMFileReader.setAcceptedVersion(String
            accepted_version, boolean allow_conversion).
          o By default, setAcceptedVersion("1.4", true) is automatically
            called when opening a SAM/BAM file, which converts all
            records to version 1.4.
          o For filtering programs such as SortSam,
            setAcceptedVersion(null, false) should be called (no
            conversion necessary or allowed).
          o If the header's version number is greater than accepted_version:
                + A warning is emitted when the header is read, unless
                  ValidationStringency is set to SILENT.
                + Incoming SAM records are scanned for incompatible
                  operators.  If incompatible operators are found:
                      # If allow_conversion is false, then a fatal
                        exception is generated.
                      # Otherwise, convertToVersion(accepted_version) is
                        called on the current SAM record, and a warning
                        is emitted if information is lost, unless
                        ValidationStringency is set to SILENT.
          o (Alternatively, we could check all SAM records for illegal
            operators, no matter what the header's version number is,
            but this would be more expensive.)
    * For writing SAM files:
          o Add method SAMFileHeader.setVersion(String version),
            allowing the user to set the VN tag in the header.
          o If a CIGAR string is created that has illegal operator's wrt
            the version number in the header, a fatal exception is
            generated.
          o For writing version 1.5 records (that may contain the "B"
            operator), SAMFileHeader.setVersion("1.5") needs to be called.

The goals behind this are:

    * Allow the user to specify precisely how to handle unexpected
      operators and different SAM versions.
    * Minimize the risk of silent data loss (lossy conversion with no
      user notification).
    * Be a little bit more general than "setRemoveB" so that fewer
      changes will need to be made in the future.

FL

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Alec W. <al...@br...> - 2011-12-19 23:02:21

I hope you all will forgive these questions if they've been addressed 
previously.  I must confess that I have not followed this discussion all 
that closely.

    * I know it was decided that it would not be allowed to have a B
      operator that backed up more than the number of read-consuming
      bases prior to the B argument.  It is also a requirement that
      there be at least enough read-consuming bases after the B operator
      so that the bases after the B extends at least as far as the bases
      before.  E.g., would this be allowed: 10M3B2M ?
    * What operators are allowed in the portion of the fragment that is
      backed up over.  E.g. is it required that a B operator be
      surrounded by M operators that are at least as long as the B, or
      could other operators (I, D, etc) that can be in that part of the
      cigar?

Thanks, Alec

On 12/14/11 3:56 AM, Fred Long wrote:
> How about something like this for Picard:
>
>     * Bump the SAM spec version number to 1.5 when the semantics of
>       the "B" operator are finalized.
>     * Add method "boolean SAMRecord.convertToVersion(String version)",
>       which will convert the current SAM record to the specified
>       version, if necessary.  If information is lost (e.g., QUAL
>       values are merged), the return value is true, otherwise the
>       return value is false.  If it is impossible to do the
>       conversion, or a bad version number is specified, an exception
>       is thrown.
>     * For reading SAM files:
>           o Add a method SAMFileReader.setAcceptedVersion(String
>             accepted_version, boolean allow_conversion).
>           o By default, setAcceptedVersion("1.4", true) is
>             automatically called when opening a SAM/BAM file, which
>             converts all records to version 1.4.
>           o For filtering programs such as SortSam,
>             setAcceptedVersion(null, false) should be called (no
>             conversion necessary or allowed).
>           o If the header's version number is greater than
>             accepted_version:
>                 + A warning is emitted when the header is read, unless
>                   ValidationStringency is set to SILENT.
>                 + Incoming SAM records are scanned for incompatible
>                   operators.  If incompatible operators are found:
>                       # If allow_conversion is false, then a fatal
>                         exception is generated.
>                       # Otherwise, convertToVersion(accepted_version)
>                         is called on the current SAM record, and a
>                         warning is emitted if information is lost,
>                         unless ValidationStringency is set to SILENT.
>           o (Alternatively, we could check all SAM records for illegal
>             operators, no matter what the header's version number is,
>             but this would be more expensive.)
>     * For writing SAM files:
>           o Add method SAMFileHeader.setVersion(String version),
>             allowing the user to set the VN tag in the header.
>           o If a CIGAR string is created that has illegal operator's
>             wrt the version number in the header, a fatal exception is
>             generated.
>           o For writing version 1.5 records (that may contain the "B"
>             operator), SAMFileHeader.setVersion("1.5") needs to be called.
>
> The goals behind this are:
>
>     * Allow the user to specify precisely how to handle unexpected
>       operators and different SAM versions.
>     * Minimize the risk of silent data loss (lossy conversion with no
>       user notification).
>     * Be a little bit more general than "setRemoveB" so that fewer
>       changes will need to be made in the future.
>
> FLs
>
>
> ------------------------------------------------------------------------------
> Cloud Computing - Latest Buzzword or a Glimpse of the Future?
> This paper surveys cloud computing today: What are the benefits?
> Why are businesses embracing it? What are its payoffs and pitfalls?
> http://www.accelacomm.com/jaw/sdnl/114/51425149/
>
>
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Peter C. <p.j...@go...> - 2011-12-20 10:41:24

On Mon, Dec 19, 2011 at 11:02 PM, Alec Wysoker <al...@br...> wrote:
> I hope you all will forgive these questions if they've been addressed
> previously.  I must confess that I have not followed this discussion all
> that closely.
>
> I know it was decided that it would not be allowed to have a B operator that
> backed up more than the number of read-consuming bases prior to the B
> argument.  It is also a requirement that there be at least enough
> read-consuming bases after the B operator so that the bases after the B
> extends at least as far as the bases before.  E.g., would this be allowed:
> 10M3B2M ?

I raised that before, e.g. 20M10B5M, and I *think* Heng Li agrees
that too should be banned. The problem I was flagging here was
calculating the limits of the read for determining the index bin.

> What operators are allowed in the portion of the fragment that is backed up
> over.  E.g. is it required that a B operator be surrounded by M operators
> that are at least as long as the B, or could other operators (I, D, etc)
> that can be in that part of the cigar?

I don't recall that being discussed specifically, although James did come
up with some examples using a P next to the B.

Peter

Re: [Samtools-devel] Propose to add the backward CIGAR operation 'B'

From: Fred L. <fl...@sd...> - 2011-12-06 02:51:49

I'm confused about how the newly defined "B" operator works.  If it only 
backtracks in the query, then how do we get any overlapping bases or 
qual scores in the alignment?  Can you show an example alignment?

Thanks,

FL
> Hi Tim and Alec,
>
> After brief email exchanges with someone from CG, I realized that they are defining backward with respect to the _query_ sequence, instead of to the reference as I was originally proposing. And the more I think about it, the more I feel CG is taking the right route. When 'B' is for moving backward on the query, we naturally disallow a CIGAR like 50M1000B20M. In addition, CG should always produce CIGAR consistent in a backward overlap because it is taking the 4 subreads as a single unit when doing the alignment. Demanding consistent CIGAR in backward overlaps is a reasonable requirement which also simplifies implementation.

1 2 3 > >> (Page 1 of 3)