From: Srinka G. <sg...@co...> - 2011-12-05 20:23:45
|
On 12/5/11 9:26 AM, "Heng Li" <lh...@sa...> wrote: >Hi Tim and Alec, > >After brief email exchanges with someone from CG, I realized that they >are defining backward with respect to the _query_ sequence, This is correct. For Complete Genomics the query sequence is the coordinate of the DNA nanoball(DNB). > instead of to the reference as I was originally proposing. And the more >I think about it, the more I feel CG is taking the right route. When 'B' >is for moving backward on the query, we naturally disallow a CIGAR like >50M1000B20M. In addition, CG should always produce CIGAR consistent in a >backward overlap because it is taking the 4 subreads as a single unit >when doing the alignment. Yes. >Demanding consistent CIGAR in backward overlaps is a reasonable >requirement which also simplifies implementation. > >I have played with 'B' in samtools. The implementation turns out to be >simple especially when 'B' is defined w.r.t the query sequence. The >support of 'B' can be done in ~120 coding lines without much interference >with other part of the code. For Picard, I am thinking to have an API: > >SAMReader.setRemoveB(bool toRemove); > >When "toRemove" is true, the SAMReader iterator collapses overlaps on the >fly and returns an alignment without 'B'. This way most existing tools >depending on Picard such as GATK and IGV can work with BAMs containing >'B' without modifications. We do the collapse operation in going from the evidence(local de-novo assembly) to published BAM and it does work w/ IGV and GATK snp calling. There are issues w/ GATK indel calling. I'm curious as to how you are dealing with quality sores of the overlapping bases? Thanks, Srinka > >If you also think implementing the backward operation in Picard can be >done without much difficulty, I will probably go ahead to add 'B' in the >spec. If you think adding this feature leads to unnecessary complexity or >implementation headache in the Picard framework, I will reconsider the >entire discussion. As Picard developers, your opinions are important to >making the final decision. > >Many thanks, > >Heng > >On Nov 29, 2011, at 9:28 AM, Tim Fennell wrote: > >> One thought is that if B is really intended only to provide the ability >>to store CG and similar data where the read "skips" back a few bases now >>and again vs. the reference one thing that would make this much easier >>on those parsing SAM/BAM would be to limit the use of the B operator so >>that it cannot skip backwards past the beginning of the read. >> >> So something like 20M5B20M would be valid, but 50M5000B20M would not be. >> >> While tools would still need to explicitly support the B operator it >>would allow any existing assumptions about not needing to move backwards >>in the reference from the read start point to still hold. >> >> -t >> >> On Nov 28, 2011, at 5:18 PM, Heng Li wrote: >> >>> A few years ago, Complete Genomics (CG) proposed to add a new CIGAR >>>operator 'B' for an operation of moving backward along the reference >>>genome. It is the opposite of the 'N', the reference skip. In a later >>>discussion on a separate issue, Fred expressed his preference to a >>>negative reference skip which is equivalent to a positive 'B' >>>operation. Now the SRA group from NCBI intends to archive the CG >>>alignment in the SAM format and raises this request again. I think it >>>may be the time to add this backward operation. >>> >>> The backward operation is designed to describe such an alignment: >>> >>> REF:: GCATACGATCGACTAGTCACGT >>> READ: --ATACGATCGA---------- >>> READ: ---------CGACTAGTCAC-- >>> >>> i.e. there is an overlap between two segments of a read, which is >>>quite frequent in CG data. We are unable to fully describe such an >>>alignment with the original CIGAR. In the current spec, we suggest >>>using a CIGAR 18M and storing the overlap information in optional tags. >>>This is a little clumsy and is not compressed well for the purpose of >>>archiving. With 'B', the new CIGAR is "10M3B11M" with no other optional >>>tags. >>> >>> Using "B" in this case is cleaner, but the major concern is that it >>>breaks the compatibility and is also likely to complicate SNP calling >>>and many other applications. As I think now, the solution is to >>>implement a "remove_B()" routine in samtools. This routine collapses >>>overlapping sequences, recalculates base quality in the overlap and >>>gives a CIGAR without 'B'. For the example above, remove_B() gives >>>CIGAR 18M. For SNP calling, we may call remove_B() immediately after >>>the alignment loaded into memory. The downstream pileup engine does not >>>need any changes. Other SNP callers can do the same. A new option will >>>be added to "samtools view" as a way to remove 'B' operations on the >>>command-line. >>> >>> The implementation of remove_B() may be quite complicated in the >>>generic case - we may be dealing with a multiple-sequence alignment, >>>but it should be straightforward in the simple cases such as the >>>example above. Users may not need to care too much about how remove_B() >>>is implemented. >>> >>> Do you think it is worth having a backward operation? >>> >>> regards, >>> >>> Heng >>> >>> -- >>> The Wellcome Trust Sanger Institute is operated by Genome Research >>> Limited, a charity registered in England with number 1021457 and a >>> company registered in England with number 2742969, whose registered >>> office is 215 Euston Road, London, NW1 2BE. >>> >>> >>>------------------------------------------------------------------------ >>>------ >>> All the data continuously generated in your IT infrastructure >>> contains a definitive record of customers, application performance, >>> security threats, fraudulent activity, and more. Splunk takes this >>> data and makes sense of it. IT sense. And common sense. >>> http://p.sf.net/sfu/splunk-novd2d >>> _______________________________________________ >>> Samtools-devel mailing list >>> Sam...@li... >>> https://lists.sourceforge.net/lists/listinfo/samtools-devel >> >> >> >>------------------------------------------------------------------------- >>----- >> All the data continuously generated in your IT infrastructure >> contains a definitive record of customers, application performance, >> security threats, fraudulent activity, and more. Splunk takes this >> data and makes sense of it. IT sense. And common sense. >> http://p.sf.net/sfu/splunk-novd2d >> _______________________________________________ >> Samtools-devel mailing list >> Sam...@li... >> https://lists.sourceforge.net/lists/listinfo/samtools-devel > > > >-- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > >-------------------------------------------------------------------------- >---- >All the data continuously generated in your IT infrastructure >contains a definitive record of customers, application performance, >security threats, fraudulent activity, and more. Splunk takes this >data and makes sense of it. IT sense. And common sense. >http://p.sf.net/sfu/splunk-novd2d >_______________________________________________ >Samtools-devel mailing list >Sam...@li... >https://lists.sourceforge.net/lists/listinfo/samtools-devel ________________________________ The contents of this e-mail and any attachments are confidential and only for use by the intended recipient. Any unauthorized use, distribution or copying of this message is strictly prohibited. If you are not the intended recipient please inform the sender immediately by reply e-mail and delete this message from your system. Thank you for your co-operation. |