Re: [Samtools-help] SAM specification questions


Hello Nils,

Thanks for your enthusiasm and constructive ideas.  However, some of the
things below would fundamentally change the current decisions, and I
suggest this is the wrong time to do that.

There was quite a lot of discussion before the current version of SAM,
and it is beginning to be used in at least one project with diverse data,
and also for various other projects and purposes.  We need to use it
seriously, and get experience of how it behaves.  My vote would be to
stick with essentially the current structure for the time being, meaning
probably at least all of 2009, focusing primarily on implementation, but
also clarifying usage in the specification and adding tags/features where
new platforms, methods or uses require them.  We should certainly work to
support serious platforms, including CGI, and serious use-cases.

I think the idea of an RFC process makes sense, but perhaps for a 
possible major revision in a year or so.

I have a few more specific comments below.

Richard

Nils Homer wrote:
> Dear SAM users and developers,
>
> I would love to share my vision and help develop (i.e. help code) the next
> version of the SAM.  For the next version of SAM I would envision removing
> some ambiguities and helping the format be a little bit more general for
> various sequencing platforms.  Also, I would suggest having a wider audience
> critique and make suggestions for this format by having an RFC for the next
> version. To contribute:
>
> Firstly, I would propose the following principle.  Require the minimal
> amount of information to store an alignment of a QUERY to the reference,
> with all other information optional.
>
> This includes:
>
> 0. Any field that has a dummy value becomes an optional field.  Then any
> analysis tool won't make the mistake of using the "dummy" value.  To the
> extreme, an unmapped read would not require a "dummy" cigar value ("*")
> etc.  Most fields in the FLAG field would also be optional.
>   
Almost any field can be blank in one circumstance or another.  Experience
is that doing the whole thing with tag:value notation is cumbersome - e.g.
we don't want to fall back on XML.
> 1. Paired end meta data: we need only store if it is paired end, and if it
> is paired end (or triple end) we can have some required paired end
> information.  Paired information like "proper end", "status of the mate"
> would be optional, with paired end information limited to the minimum
> information to be able to "seek" to the other ends alignments (i.e. QNAME,
> MRNM, MPOS), with all other information being optional.  In this manner,
> single end data would store the answer "0" for the number of other ends etc.
> and no other dummy values for paired end info.
>
> 2. make MAPQ to be optional: there is no dummy value for this field
> actually, since what if an aligner does not specifically choose an alignment
> for a read but returns all alignments it found (it doesn't make the
> uniqueness assumption listed in the note)?  Like Heng Li said, it could also
> be intractable to compute (gulp), or unreliable given small amounts of data
> from which to empirically estimate (Nils, you only aligned four reads!).
> What MAPQ attempts to measure is well-defined, but again we could move this
> to optional. 
>
>   
I don't really think this is optional.  Once you have mapped something,
if you are going to use the mapping you need some idea of how much
confidence should be placed in it.  I would encourage all methods to
think about this and give a value (and aligners other than MAQ do so).
If they don't give one explicitly they still imply something, and it is
good to force it to be explicit.
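
To make "how much confidence" concrete, here is a minimal sketch assuming the usual Phred-scaled reading of MAPQ, i.e. -10 log10 of the probability that the mapping position is wrong.  The function names and the cap of 254 are illustrative choices, not part of the specification; 255 is left free because it is the "not available" value mentioned elsewhere in this thread.

```python
import math

def mapq_from_error_prob(p_wrong, cap=254):
    """Phred-scale the probability that a mapping position is wrong.

    The cap is a hypothetical choice that keeps 255 free to mean
    "mapping quality not available".
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong == 0.0:
        return cap  # log10(0) is undefined, so saturate at the cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))

def error_prob_from_mapq(mapq):
    """Invert the Phred scale: MAPQ 20 corresponds to a 1% error rate."""
    return 10.0 ** (-mapq / 10.0)

q_confident = mapq_from_error_prob(0.001)  # 1 in 1000 chance of being wrong
q_random = mapq_from_error_prob(1.0)       # certainly wrong, so MAPQ 0
```

However an aligner arrives at its error estimate, reporting it on this scale gives downstream tools one program-independent number to filter on.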
> 3. move QUAL to be optional: an aligner may only care about the raw sequence
> and not its qualities.  This is extreme, but would follow our design
> principle.
>
> 4. explicit support for CGI (Complete Genomics) data, as well as for data
> where a base is sequenced more than once from the same DNA fragment.  This
> is probably the most difficult request.
>   
I certainly agree that we should support CGI data.  In the first instance
we need a good pragmatic solution that allows the full value of CGI data
to be used in the SAM framework.
> 5. for ABI SOLiD data: require the original read (color sequence and
> adaptor), CIGAR, and color-corrected NT-space representation.
> This would allow anyone to determine which colors were called color errors
> (or "corrected"), which is necessary to unambiguously specify the alignment.
>
>   
Fine.  This is effectively the standard.
> 6. for ABI SOLiD data: require that the QUAL field be the original color
> quality values.  This way any program can convert these color qualities to
> nt qualities using their own formula (if they so desire) as well as using
> the alignment (CIGAR, original read, and nt read).  A per-BASE quality
> differs between aligners and would not be analogous to the QUAL field for
> Illumina data, since for Illumina it comes straight from the base-caller or
> previous analysis not the aligner (I assume).
>   
That is not consistent with other data types.  The QUAL field should be
in base space.  I think it is fine for the colour qualities to be in
tag:value fields.
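
As a rough sketch of that split, the snippet below reads a tab-separated SAM line, takes the base-space qualities from the QUAL column, and takes the colour-space data from optional TAG:TYPE:VALUE fields.  It assumes the colour read and colour qualities are carried in string tags named CS and CQ; the record itself and the helper name are invented for illustration.

```python
def parse_optional_fields(fields):
    """Turn TAG:TYPE:VALUE strings into a dict, converting i/f types."""
    tags = {}
    for field in fields:
        # Split at most twice so a ':' inside the value is preserved.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    return tags

# Invented example record: 11 mandatory columns, then optional tags.
line = ("read1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tIIII"
        "\tCS:Z:T3131\tCQ:Z:5555\tNM:i:0")

columns = line.split("\t")
base_qualities = columns[10]                 # QUAL stays in base space
tags = parse_optional_fields(columns[11:])
colour_read = tags.get("CS")                 # original colour sequence
colour_qualities = tags.get("CQ")            # original colour qualities
```

A tool that only understands base space can ignore the tags entirely, while a SOLiD-aware tool still has the original colour calls to work from.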
> 7. Multiple alignments: we need a way to move between multiple alignments
> for the same end of a read.  Currently it is only an optional field that
> moves in one direction (next hit) although we could require the link to be a
> circular directed linked list or other data structure that makes the graph
> of all alignments from all ends of one read connected.
>   
Interesting to explore the circular list idea here.
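
One way to picture the circular list, as a sketch only: each alignment keeps a single "next hit" pointer and the last hit points back to the first, so every hit is reachable from any starting record.  The `Hit` class and its fields are hypothetical and are not SAM fields.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

@dataclass
class Hit:
    """Hypothetical in-memory stand-in for one alignment of a read."""
    rname: str
    pos: int
    # Excluded from repr/compare so the cycle is never traversed implicitly.
    next_hit: Optional["Hit"] = field(default=None, repr=False, compare=False)

def link_circular(hits: List[Hit]) -> None:
    """Point each hit at the following one, and the last back at the first."""
    for i, hit in enumerate(hits):
        hit.next_hit = hits[(i + 1) % len(hits)]

def walk(start: Hit) -> Iterator[Hit]:
    """Visit every hit exactly once, beginning from any record in the ring."""
    hit = start
    while True:
        yield hit
        hit = hit.next_hit
        if hit is start:
            break

hits = [Hit("chr1", 100), Hit("chr2", 200), Hit("chr3", 300)]
link_circular(hits)
positions_from_middle = [h.pos for h in walk(hits[1])]
```

Starting from the middle hit still reaches the other two, which a one-directional "next hit" chain without the closing link would not allow.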
> 8. Multiple ends: we could support an arbitrary # of ends for a read.  These
> data exist.
>   
For 2009, we should think about how to do this without rewriting what we
have done.
> 9. ISIZE could be made optional (it is paired end metadata too).  Also, if
> it is 5'->5' end, the alignment itself could vary the ISIZE due to indels.
> If this is required paired end data, we could make this the minimum distance
> from any base sequenced on end E1 to any base sequenced on end E2.
>   
There has been quite a lot of discussion about this; there are
compromises however it is done.
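
To illustrate the compromise, the sketch below computes two candidate definitions from the same pair of alignments: the outer span (leftmost to rightmost aligned reference base) and the minimum distance between the two ends, which is one reading of the proposal in point 9.  It assumes 1-based POS and that CIGAR operations M, D, N, = and X consume reference bases; all function names are hypothetical.

```python
import re

def ref_length(cigar):
    """Number of reference bases covered by an alignment's CIGAR."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

def ref_span(pos, cigar):
    """Inclusive 1-based (start, end) of the alignment on the reference."""
    return pos, pos + ref_length(cigar) - 1

def outer_distance(span1, span2):
    """Bases from the leftmost to the rightmost aligned position."""
    return max(span1[1], span2[1]) - min(span1[0], span2[0]) + 1

def min_distance(span1, span2):
    """Smallest distance between a base of one end and a base of the
    other; 0 when the two ends overlap."""
    (_, end1), (start2, _) = sorted([span1, span2])
    return max(0, start2 - end1)

end1 = ref_span(100, "50M")        # covers 100..149
end2 = ref_span(300, "20M2D30M")   # the deletion widens it to 300..351
outer = outer_distance(end1, end2)
gap = min_distance(end1, end2)
```

Here the deletion in the second end changes the outer span but not the minimum distance, which is the kind of alignment dependence point 9 raises.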
> Thank you for being the ones to burden yourself with the responsibility of
> developing a standard.  Speaking from the community of users outside your
> various groups, we offer our sincere congratulations.
>
> Nils Homer
>
>
> On 3/22/09 5:24 PM, "Heng Li" <lh...@sa...> wrote:
>
>   
>> Hello Nils,
>>
>> Thanks for the comments. Here are the replies to 2 and 4. A possible
>> solution to multi-end reads was given in a previous email.
>>
>> On 21 Mar 2009, at 21:25, Nils Homer wrote:
>>
>>     
>>> 2. INS TAG: this seems redundant given we store the alignment for
>>> each end (in MRNM and MPOS).  If the file is indexed based on chr/pos
>>> then it would simply be a quick lookup (that's a big IF).  Also, this
>>> TAG does not seem flexible enough for multi-end reads if you do keep
>>> the INS TAG.
>>>       
>> We mainly use ISIZE to detect potential PCR duplicates when we want to
>> know the external coordinates of a read pair without looking at the
>> mate. However, we realized that a better replacement of ISIZE would be
>> MLEN (Mate alignment LENgth) which is the length of the reference
>> sequence in the mate alignment. It is a bit late to make the change as
>> a lot of 1000G data have been generated.
>>
>>     
>>> 4. It seems to me that the MAPQ field should be optional, since this
>>> information is specific to an alignment algorithm and not to the
>>> actual
>>> alignment.  I think this field can be severely abused and misleading,
>>> especially since it is set to 255 if it is not present.  I can give
>>> specific
>>> examples upon request why this is precisely algorithm-dependent
>>> (seems
>>> MAQ-centric).
>>>       
>> Mapping quality is clearly defined, although how to calculate it may
>> be subject to different algorithms/models. We need a program-
>> independent way to measure how reliable the alignment is, and I see
>> mapping quality as the best solution; it works for a lot of existing
>> aligners. I would be interested to see examples of why it is not.
>>
>> Thanks,
>>
>> Heng
>>
>>
>>     

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.