From: Heng Li <lh...@li...> - 2011-11-29 15:11:47
|
Hi Ewan, Thanks for the suggestion. To make sure I understand your proposal, could you show me the CIGAR, SEQ, QUAL and the additional tag for the following example? > REF:: GCATACGATCGACTAGTCACGT > READ: --ATACGATCGA---------- > READ: ---------CGACTAGTCAC-- Thanks, Heng On Nov 29, 2011, at 3:52 AM, Ewan Birney wrote: > > On 29 Nov 2011, at 08:48, Ewan Birney wrote: > >> >> I'd just to express an alternative way of storing CG data in sam/bam/cram >> which captures nearly all the data, can be efficiently compressed and >> will have little disruption on downstream tools though tools would have >> to become "CG aware" to fully leverage the data (whatever happens, tools will >> have to become CG aware to really leverage the data, so there is no I think 0 cost >> scheme) >> > > > Just to add to this (only in writing this email did I realise it): if one has > some additional tag to store the second quality values in the B() regions one > can achieve complete representation of CG data in this scheme. > > >> >> Complete genomics has two levels of reads: >> >> The paired ends - being fragments of 35bp separated by a variable gap (often 500bp) >> >> The "read fragments" inside of the paired ends, being a structured >> set of 3 fragments with a 2 base pair "gap" (that can be "negative" - the problem >> below) and a 5 base pair "gap". >> >> >> The scheme below looks at an aligment at the read-fragment level. >> >> >> An alternative scheme is to look at the alignment of the paired ends, >> with the sequence being the called CG reads with "Ns" in the gap positions >> (if they abut, or overlap, the gap contains no Ns), and, in addition, a variable length >> "CG mask" being an encoding of the characteristic gap structure in this CG read. >> This is, in effect, a cigar string, and as this will be CG specific, one could >> at this point allow a "B" operation to indicate overlap. Either the whole readset >> could be declared as "CG" or the presence of this "CG mask" would indicate it >> on per read basis (I think better to have an attribute at the readset level). >> >> >> >> This "mask" is highly structured, and so even in sam representation probably >> best represented as a huffman encoded dictionary - certainly in more compressed >> forms, this would be a huffman encode variable length, and in playing around >> with some chromosome 20 CG data (thanks Shrinka and Steve!), this actually >> pretty compressible, being ~3.5 bits per mask. >> >> >> In many ways this "CG mask" to represent the CG-level indels allows also for >> small indels to be represented as a traditional "read level" CIGAR string, >> (standard sam) to be mapped post the CG level mask. >> >> >> >> Just to look at this scheme in terms of legacy code vs non legacy code: >> >> For legacy code/non-CG aware code: >> >> There is a long read with "Ns". >> The Ns imply no knowledge of the bases - so SNP callers will not make >> a terrible mistake here >> >> >> For CG aware code: >> >> You read in the mask, look up the "CG level" string in the dictionary >> >> Apply the CG level string, including "B" operations >> >> You can recreate the actual fragments now internally. >> >> >> The only small wrinkle is that inside the B() operations there are two quality >> values, not 1. As I think B operations are rare, it is actually fine to perhaps >> have a second, rare tag that encode this (might be a bit fiddly but doable). >> It is also the case that although there are two "chemistry reads" at this point, >> there are other error processes (eg, template construction) in which the two >> bases are not independent. So - for any machinery which is sophisticated enough >> to want to think about 2 quality values will almost certainly do something >> sophisticated about (say) the recalibration (I would be thinking about calibrating >> the errors of these B regions separately from the quality codes elsewhere) >> >> >> >> >> >> >> Does this make sense? Comments? >> >> >> >> >> >> >> >> >> >> >> >> >> On 28 Nov 2011, at 22:18, Heng Li wrote: >> >>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR operator 'B' for an operation of moving backward along the reference genome. It is the opposite of the 'N', the reference skip. In a later discussion on a separate issue, Fred expressed his preference to a negative reference skip which is equivalent to a positive 'B' operation. Now the SRA group from NCBI intends to archive the CG alignment in the SAM format and raises this request again. I think it may be the time to add this backward operation. >>> >>> The backward operation is designed to describe such an alignment: >>> >>> REF:: GCATACGATCGACTAGTCACGT >>> READ: --ATACGATCGA---------- >>> READ: ---------CGACTAGTCAC-- >>> >>> i.e. there is an overlap between two segments of a read, which is quite frequent in CG data. We are unable to fully describe such an alignment with the original CIGAR. In the current spec, we suggest using a CIGAR 18M and storing the overlap information in optional tags. This is a little clumsy and is not compressed well for the purpose of archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional tags. >>> >>> Using "B" in this case is cleaner, but the major concern is that it breaks the compatibility and is also likely to complicate SNP calling and many other applications. As I think now, the solution is to implement a "remove_B()" routine in samtools. This routine collapses overlapping sequences, recalculates base quality in the overlap and gives a CIGAR without 'B'. For the example above, remove_B() gives CIGAR 18M. For SNP calling, we may call remove_B() immediately after the alignment loaded into memory. The downstream pileup engine does not need any changes. Other SNP callers can do the same. A new option will be added to "samtools view" as a way to remove 'B' operations on the command-line. >>> >>> The implementation of remove_B() may be quite complicated in the generic case - we may be dealing with a multiple-sequence alignment, but it should be straightforward in the simple cases such as the example above. Users may not need to care too much about how remove_B() is implemented. >>> >>> Do you think it is worth having a backward operation? >>> >>> regards, >>> >>> Heng >>> >>> -- >>> The Wellcome Trust Sanger Institute is operated by Genome Research >>> Limited, a charity registered in England with number 1021457 and a >>> company registered in England with number 2742969, whose registered >>> office is 215 Euston Road, London, NW1 2BE. >>> >>> ------------------------------------------------------------------------------ >>> All the data continuously generated in your IT infrastructure >>> contains a definitive record of customers, application performance, >>> security threats, fraudulent activity, and more. Splunk takes this >>> data and makes sense of it. IT sense. And common sense. >>> http://p.sf.net/sfu/splunk-novd2d >>> _______________________________________________ >>> Samtools-devel mailing list >>> Sam...@li... >>> https://lists.sourceforge.net/lists/listinfo/samtools-devel >> >> >> ------------------------------------------------------------------------------ >> All the data continuously generated in your IT infrastructure >> contains a definitive record of customers, application performance, >> security threats, fraudulent activity, and more. Splunk takes this >> data and makes sense of it. IT sense. And common sense. >> http://p.sf.net/sfu/splunk-novd2d >> _______________________________________________ >> Samtools-devel mailing list >> Sam...@li... >> https://lists.sourceforge.net/lists/listinfo/samtools-devel > > > ------------------------------------------------------------------------------ > All the data continuously generated in your IT infrastructure > contains a definitive record of customers, application performance, > security threats, fraudulent activity, and more. Splunk takes this > data and makes sense of it. IT sense. And common sense. > http://p.sf.net/sfu/splunk-novd2d > _______________________________________________ > Samtools-devel mailing list > Sam...@li... > https://lists.sourceforge.net/lists/listinfo/samtools-devel -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |