Re: [Samtools-devel] Differentiating uniquely mapped reads from "best score" reads


On Mon, Dec 14, 2009 at 11:00:54AM -0800, Nils Homer wrote:
> Could we set the "NH" tag to the # of hits found during the alignment?

Yes, you can set this tag.

> If
> NH is one, then it would be "unique" (conditioned on the alignment
> sensitivity), otherwise multiple alignments were found.

However, I do not think NH is useful in some cases. For example, bwa
reports the number of optimal hits it found and, separately, the number
of suboptimal hits. Summing them into NH loses important information.
One could add new tags to differentiate optimal from suboptimal hits,
but few aligners report this distinction.
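
To illustrate: bwa writes the two counts as separate optional tags, X0
(number of best hits) and X1 (number of suboptimal hits found by bwa).
The following is only a minimal Python sketch, assuming a tab-separated
SAM record; the helper name parse_tags and the record itself are made up
for illustration. It shows that the two counts are distinguishable on
their own, while their sum is not.

```python
def parse_tags(sam_line):
    """Return the optional fields of one SAM record as a dict (hypothetical helper)."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:              # optional fields start at column 12
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags

# Made-up record, for illustration only.
record = "read1\t0\tchr1\t100\t10\t4M\t*\t0\t0\tACGT\tIIII\tX0:i:1\tX1:i:7\tNM:i:0"

tags = parse_tags(record)
optimal = tags["X0"]        # one best hit
suboptimal = tags["X1"]     # seven suboptimal hits
summed = optimal + suboptimal

# A summed count of 8 cannot distinguish "1 best + 7 suboptimal"
# from "8 equally good hits".
print(optimal, suboptimal, summed)
```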

> Am I correct that if NH is greater than one it is still valid to only have
> one alignment reported per read in the SAM file?  I think "IH" indicates the
> # of alignments per read in the SAM file.

Yes.
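
As a rough illustration of that distinction, the sketch below sets the
NH value of each read beside the number of records actually stored for
it in the file. It is Python, the function name and the records are
made up, and it assumes single-end reads so that one read name means one
read.

```python
from collections import Counter

def hits_reported_vs_stored(sam_lines):
    """Map read name -> (NH value or None, number of records in the file)."""
    stored = Counter()
    nh = {}
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[0]
        stored[name] += 1
        for field in fields[11:]:          # optional fields start at column 12
            if field.startswith("NH:i:"):
                nh[name] = int(field[5:])
    return {name: (nh.get(name), stored[name]) for name in stored}

# Made-up records, for illustration only.
example = [
    "@HD\tVN:1.0",
    "readA\t0\tchr1\t100\t0\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "readB\t0\tchr2\t500\t37\t4M\t*\t0\t0\tTTGA\tIIII\tNH:i:1",
]
summary = hits_reported_vs_stored(example)
# readA was found at 3 places but only 1 alignment is stored, which is valid.
print(summary)
```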

Heng

> 
> Nils
> 
> On 12/14/09 8:07 AM, "Heng Li" <lh...@sa...> wrote:
> 
> > In the probabilistic framework, the first measurement is the probability
> > of data given the alignment, while the second is the posterior of the
> > alignment given data. Mapping quality belongs to the second category,
> > and the AS/NM/UQ/PQ tags to the first category. In read mapping, we are
> > mostly interested in the posterior and only occasionally look at NM/UQ
> > tags. That is why mapping quality is a mandatory field while others are
> > tags.
> > 
> > I entirely agree that mapping uniqueness is not clearly defined in
> > general cases especially for long reads; the only way to clearly define
> > uniqueness is to consider mapping quality. In SAM, we encourage aligners
> > to compute mapping quality. See also this link on FAQ:
> > 
> > > http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ#Why_mapping_quality.3F
> > 
> > Heng
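
To make the likelihood/posterior distinction concrete, here is a minimal
Python sketch. It assumes a uniform prior over candidate positions and
made-up likelihood values; the function name and the cap of 60 are
assumptions, and this is not how any particular aligner computes mapping
quality. It uses the SAM definition of mapping quality,
-10 log10 Pr{mapping position is wrong}.

```python
import math

def posterior_and_mapq(likelihoods, cap=60):
    """Posterior of the best candidate and a Phred-scaled mapping quality.

    likelihoods: P(read | alignment) for each candidate position.
    cap: arbitrary upper bound used when the error probability is zero.
    """
    best = max(likelihoods)
    posterior = best / sum(likelihoods)    # uniform prior over candidates
    p_wrong = 1.0 - posterior
    if p_wrong <= 0.0:
        return posterior, cap
    return posterior, min(cap, round(-10 * math.log10(p_wrong)))

# Same likelihood for the best alignment, different competition:
alone = posterior_and_mapq([1e-3])          # no competing hit
tied = posterior_and_mapq([1e-3, 1e-3])     # one equally good hit
print(alone, tied)
```

The best hit fits the read equally well in both cases (the first
measurement), yet the posterior, and hence the mapping quality, drops
once an equally good second hit exists (the second measurement).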
> > 
> > On Mon, Dec 14, 2009 at 09:18:38AM -0500, Alec Wysoker wrote:
> >> Hi Folks,
> >> 
> >> Just to clarify the direction I think this discussion is taking...
> >> 
> >> I find the term "unique" misleading, since virtually any alignment
> >> could be non-unique if the alignment stringency were loose enough.
> >> Rather, it sounds like there are two somewhat independent measures
> >> of alignment goodness that people want: 1) how well a read matches
> >> the best alignment; and 2) the relative goodness of the best
> >> alignment to the next best alignment.  Do we need to store both
> >> these scores in SAM2?
> >> 
> >> -Alec
> >> 
> >> Benjamin Berman wrote:
> >>> It's true that it might be a bit aligner-specific, but that does not negate
> >>> its primary use case, which would be to determine simply whether or not
> >>> *any* other strong matches exist. You could try to give it a more concrete
> >>> interpretation (any matches with a 5% or more probability of being correct,
> >>> any matches within an edit distance of N from the best match), but it would
> >>> be difficult or more likely impossible to get aligners to comply with this.
> >>> In practice, any aligner capable of returning multiple hits has to have some
> >>> cutoff, and this would be the same cutoff used for the proposed "numHits"
> >>> field.
> >>> 
> >>> ben.
> >>> 
> >>> 
> >>> On Dec 11, 2009, at 2:03 PM, Paul Anderson wrote:
> >>> 
> >>>> On Fri, Dec 11, 2009 at 3:13 PM, David Rio <dri...@gm...> wrote:
> >>>> 
> >>>>> I would like to suggest an extra tag in the SAM spec to differentiate
> >>>>> uniquely mapped reads from "best score" reads. What I would suggest is
> >>>>> adding an extra tag that is either 0 for uniquely mapped reads or #,
> >>>>> where # is the number of other hits for that read.
> >>>>> 
> >>>>> What do you guys think?
> >>>> I think it is misleading, since depending on the aligner, or the
> >>>> settings of the aligner, you are going to get different answers.
> >>>> 
> >>>> In many index-based aligners (e.g. KARMA), if you use a smaller
> >>>> index, you will tend to get more possible matches.
> >>>> 
> >>>> That said, KARMA will write the number of locations it evaluated for a
> >>>> single-end read, but as Goncalo suggested, I think the quality
> >>>> score is really what you're looking for.
> >>>> 
> >>>> The mapping score, if aligners are doing a good job computing it,
> >>>> should be moderately comparable across aligners, since it has a
> >>>> mathematical basis in reality.
> >>>> 
> >>>> Paul
> >>>> 
> >>>> _______________________________________________
> >>>> Samtools-devel mailing list
> >>>> Sam...@li...
> >>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
