|
From: Tim F. <tfe...@br...> - 2009-11-03 20:20:46
|
I agree. My main concern is that most people using the Java API will
then also have to react to the existence of these new operators. I
suppose that we could offer a backward compatibility mode that
translates =/X to M for clients that don't know how to deal with those
operators yet.

We haven't implemented the new operators yet, though it shouldn't take
long. I'll respond once that's done and we can figure out how to move
on from there.

-t

On Nov 3, 2009, at 3:17 PM, Heng Li wrote:

> Hello Tim,
>
> Using =/X has been written in Appendix B. Actually samtools already
> parses =/X as M, which means a CIGAR with =/X will not break
> samtools. However, =/X information is currently lost in samtools. I do
> not know how picard deals with =/X, either. That is why I did not
> formally put =/X in the spec. I think we should be ready before we do
> this. What do you think?
>
> Heng
>
> On Tue, Nov 03, 2009 at 11:23:11AM -0500, Tim Fennell wrote:
>> Hi Nathan,
>>
>> That's a good point. I feel like we came to a resolution on what
>> should be implemented, but then never agreed on how to integrate it
>> into the specification etc. Heng: do you think that this is a small
>> enough change that we can just integrate it into the current spec,
>> or do you think it will break enough things that it should wait for
>> a larger spec revision?
>>
>> -t
>>
>> On Nov 2, 2009, at 9:26 AM, Nathan Johnson wrote:
>>
>>> Also, is there an up-to-date reference for this?
>>>
>>> The following doc does not contain the newer symbols:
>>>
>>> http://samtools.sourceforge.net/SAM1.pdf
>>>
>>> Thanks
>>>
>>> On 2 Nov 2009, at 14:23, Nathan Johnson wrote:
>>>
>>>> Hi
>>>>
>>>> Just thought I'd ping this as I saw a presentation from Thomas
>>>> Keane at the recent on-site Next Gen Sequencing workshop. The
>>>> extended cigar line shown did not contain V or X as agreed.
>>>>
>>>> I assume this is just an old slide, but I thought I'd just check
>>>> to make sure my information isn't out of date.
>>>>
>>>> Thanks
>>>>
>>>> Nath
>>>>
>>>> On 28 Jul 2009, at 17:15, Tim Fennell wrote:
>>>>
>>>>> I'm happy to support whatever letters are chosen in the
>>>>> java/picard tools. If V/X are what we've come up with that's
>>>>> fine by me.
>>>>>
>>>>> -t
>>>>>
>>>>> On Jul 28, 2009, at 9:25 AM, Heng Li wrote:
>>>>>
>>>>>> I think we samtools/picard group will support V/X in the long
>>>>>> run. Is this the consensus?
>>>>>>
>>>>>> Heng
>>>>>>
>>>>>> On Mon, Jul 27, 2009 at 01:01:26PM +0100, Nathan Johnson wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> I'm following up on this as we are approaching our internal
>>>>>>> deadline for the next release of Ensembl. Has a consensus been
>>>>>>> found for "alignment match with sequence mismatch"?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Nath
>>>>>>>
>>>>>>> On 2 Jul 2009, at 17:04, Richard Durbin wrote:
>>>>>>>
>>>>>>>> We already allow V and X information in a separate tag field:
>>>>>>>> MD. In a way this proposal is to extend CIGAR towards the MD
>>>>>>>> format, while not quite getting there because in the case of
>>>>>>>> a mismatch it does not say what the reference base was.
>>>>>>>>
>>>>>>>> I agree we should not reuse 'M'. I don't really like using
>>>>>>>> lower case 'm' either. V and X would be OK. We could also use
>>>>>>>> '=' for V, which fits with it being allowed as a match to the
>>>>>>>> reference in the read. It would be very good for Nathan and
>>>>>>>> us to agree the meanings of codes, so we don't have two CIGAR
>>>>>>>> variants that use the same letter codes for different things.
>>>>>>>>
>>>>>>>> Richard
>>>>>>>>
>>>>>>>> Goncalo Abecasis wrote:
>>>>>>>>> I agree we shouldn't change the meaning of 'M'.
>>>>>>>>>
>>>>>>>>> Although the current CIGAR string isn't perfect, it has the
>>>>>>>>> virtue of being simple. If we add the "V" and "X" extension,
>>>>>>>>> should we store that in a separate field?
>>>>>>>>>
>>>>>>>>> Gonçalo
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Heng Li [mailto:lh...@sa...]
>>>>>>>>>> Sent: Thursday, July 02, 2009 10:19 AM
>>>>>>>>>> To: Nathan Johnson
>>>>>>>>>> Cc: samtools-devel
>>>>>>>>>> Subject: Re: [Samtools-devel] SAM extended Cigar line format
>>>>>>>>>>
>>>>>>>>>> Hi Nathan,
>>>>>>>>>>
>>>>>>>>>> I am copying to the samtools mailing list.
>>>>>>>>>>
>>>>>>>>>> On 2 Jul 2009, at 14:08, Nathan Johnson wrote:
>>>>>>>>>>
>>>>>>>>>>>> Hi Heng
>>>>>>>>>>>>
>>>>>>>>>>>> I have just read the recent paper you published regarding
>>>>>>>>>>>> the SAM format, more specifically the extended cigar line
>>>>>>>>>>>> format used in SAM.
>>>>>>>>>>>>
>>>>>>>>>>>> I am responsible for the array mapping pipeline in
>>>>>>>>>>>> Ensembl and have also been working on an extended format
>>>>>>>>>>>> to more accurately represent probe alignments to both
>>>>>>>>>>>> genomic and transcript sequences. I was wondering whether
>>>>>>>>>>>> it would be possible to converge our definitions to
>>>>>>>>>>>> provide a standard.
>>>>>>>>>>>>
>>>>>>>>>>>> The extensions I have added are as follows:
>>>>>>>>>>>>
>>>>>>>>>>>> M - Sequence match & Alignment match
>>>>>>>>>>>> m - Alignment match & Sequence mismatch
>>>>>>>>>>>>
>>>>>>>>>> I am concerned about reusing 'M' in the standard CIGAR. If
>>>>>>>>>> you want to distinguish sequence match and mismatch, I
>>>>>>>>>> would recommend using "V" for match (standing for tick) and
>>>>>>>>>> "X" for mismatch (standing for cross). Aligners that cannot
>>>>>>>>>> generate V/X can simply write M. What do you think? And
>>>>>>>>>> what do other people in the mailing list think?
>>>>>>>>>>
>>>>>>>>>> If people agree with this idea, I may implement it in
>>>>>>>>>> samtools-C as an implementation-specific extension.
>>>>>>>>>> However, I will not make it into the specification until
>>>>>>>>>> Java people are ready for this.
>>>>>>>>>>
>>>>>>>>>>>> U - Unknown (used when an alignment overhangs the end of
>>>>>>>>>>>> a transcript sequence)
>>>>>>>>>>>>
>>>>>>>>>> To me, U looks more like "S", soft clipping. Padding is
>>>>>>>>>> used in SAM as a place holder to align inserted sequences.
>>>>>>>>>> I am not sure this is similar to your case.
>>>>>>>>>>
>>>>>>>>>> Heng
>>>>>>>>>>
>>>>>>>>>>>> I will most likely change U to P(added) as used in SAM. I
>>>>>>>>>>>> appreciate the change in the definition of M may cause
>>>>>>>>>>>> some problems, and may be too restrictive in terms of how
>>>>>>>>>>>> it is already implemented. Do you think this would be a
>>>>>>>>>>>> useful addition to the SAM format?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Nathan Johnson
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> European Bioinformatics Institute
>>>>>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>>>>>> Hinxton
>>>>>>>>>>>> Cambridge CB10 1SD
>>>>>>>>>>>> Email: njo...@eb...
>>>>>>>>>>>> TelNo: (+44)1223 492629
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> The Wellcome Trust Sanger Institute is operated by Genome
>>>>>>>>>> Research Limited, a charity registered in England with
>>>>>>>>>> number 1021457 and a company registered in England with
>>>>>>>>>> number 2742969, whose registered office is 215 Euston Road,
>>>>>>>>>> London, NW1 2BE.
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Samtools-devel mailing list
>>>>>>>>>> Sam...@li...
>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
|