Re: [Samtools-help] [Samtools-devel] Flag bits 0x0004, 0x0008, 0x0100 and handling ambiguous mappi

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Sean A. Irvine wrote:
> Thanks Bob and Alec for your input, it has helped resolved a couple of 
> our
> issues, but we are still confused about some of the answers.
>
> If SAM is to support aligners that output multiple mappings, it seems
> that MRNM/MPOS should only be required if 0x0002 flag is set, since
> otherwise selection of any particular MRNM/MPOS would be arbitrary.
> The spec doesn't currently indicate under what conditions MRNM/MPOS
> are required, but the Java samtools requires MRNM/MPOS whenever 0x0008
> indicates that a mapping exists for the mate.
>
> To help clarify what we are trying to achieve, consider the following
> contrived example with the following reference and a single read pair:
>
> > myref      (reference sequence)
> TCCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACCTGTTGGAGCGTACAGATCCGGGCTT 
>
> TTCTGAGTGTACTTTATTATATGAGCATATAATCTTCTGAGTGTACTTTATTATATGAGGAGAAGTCTTAACGAACAGTG 
>
> TTCTGAGTGTACTTTATTATATGAGGCTGGACTATGATTAGGTCCTTACTCCCACGTCGAGCCGGTGGGTGCCTACCCAA 
>
> TTTTTTTTTTTTTTTTTTTTTTTTTACCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACC 
>
> TTTTTTTTTTTTTTTTTTTTTTTTTACCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACC 
>
>
> > read-first  (first read of the pair)
> TTCTGAGTGTACTTTATTATATGAG
>
> > read-second (second read of the pair)
> AAAAAAAAAAAAAAAAAAAAAAAAA
>
> Might produce individual read mappings of:
> #TEMPLATE  FRAME   READ-NAME       TEMPLATE-START
> myref      F       read-first      81
> myref      F       read-first      115
> myref      F       read-first      161
> myref      R       read-second     321
> myref      R       read-second     241
>
> Assuming no additional filtering of mappings, we might currently
> output the following SAM records (after identifying potential proper
> pairing for each mapping):
>
> read-first      73      myref   81      255     25M     *       
> 0       0       TTCTGAGTGTACTTTATTATATGAG       *
> read-first      99      myref   115     255     25M     =       
> 241     151     TTCTGAGTGTACTTTATTATATGAG       *
> read-first      99      myref   161     255     25M     =       
> 241     105     TTCTGAGTGTACTTTATTATATGAG       *
> read-second      147     myref   241     255     25M     =       
> 161     -105    TTTTTTTTTTTTTTTTTTTTTTTTT       *
> read-second      153     myref   321     255     25M     *       
> 0       0       TTTTTTTTTTTTTTTTTTTTTTTTT       *
>
> The first mapping for read-first doesn't meet criteria for a proper
> pair with any of the read-second mappings. The alternatives that allow
> parsing by the JDK sam are: output a single record (as we have done
> above) with 0x0008 set to indicate the mate is unmapped (simply
> because we have no basis to prefer any of the other mappings);
> alternatively take the cross-product approach and output two SAM
> records, each one selecting a different one of the read-second
> mappings as its MRNM/MPOS (and allowing 0x0008 to be unset). This
> latter option seems to introduce a lot of redundancy for no real gain.
Good example. I retract what I wrote before.
I think 0x0008 should be interpreted to apply to the current record, and 
you are correct to set it on the first and last records above.
However, I think there are a couple of small problems with the sam 
records in this example:
1) Although I don't think the spec says this, and maybe it's technically 
ok without, most applications expect records for paired end alignments 
to come in pairs.
Thus, there should be another record corresponding to the second line 
but with the reads reversed.
2) I'm sure you did it just for clarity, but the read names must match 
for the paired records.
The two ends are told apart by the 0x40 and 0x80 flags.
Taking these together, I think the sam should look like:

read-name      73      myref   81      255     25M     *       0       
0       TTCTGAGTGTACTTTATTATATGAG       *
read-name      99      myref   115     255     25M     =       241     
151     TTCTGAGTGTACTTTATTATATGAG       *
read-name      99      myref   161     255     25M     =       241     
105     TTCTGAGTGTACTTTATTATATGAG       *
read-name      147     myref   241     255     25M     =       161     
-105    TTTTTTTTTTTTTTTTTTTTTTTTT       *
read-name      147     myref   241     255     25M     =       115     
-151    TTTTTTTTTTTTTTTTTTTTTTTTT       *
read-name      153     myref   321     255     25M     *       0       
0       TTTTTTTTTTTTTTTTTTTTTTTTT       *

Does this sound right to you?
>
> >> 3. Sometimes reads can map with the 5'-end off the left of the 
> reference;
> >> that is, with a nonpositive start position.  The POS field requires a
> >> positive value for the start position.  At present we are dealing with
> >> this by applying a hard clip to the left of the read.  If we do this,
> >> then
> >> which start position should be used for the computation of the ISIZE
> >> field?  (A similar problem can occur at the end of the reference, but
> >> it is less noticeable because it passes the current validation tests).
> >>
> > I think it would help me to know why the reads are aligning off the end
> > of the reference.
> > It is because the extra bases are vector or something?  Or are they
> > garbage generated by the instrument?
>
> The reference used for mapping might not be an accurate representation
> of the sequenced genome.  For example, the reference might consist of
> contigs, represent just coding regions, or be from a different species.
> In such cases it is not unreasonable that some reads will map overlap the
> ends of a reference sequence.
Sounds to me like this should be represented in the cigar string as soft 
clipping (or possibly I).
But if we define a formula for ISIZE based on the aligned positions, 
then it wouldn't affect ISIZE.
I don't know if others will weigh in, but I tend to think a precise 
definition for ISIZE is better than
an alternative that says "the producer of the sam file should make the 
best estimate for insert size".
>
> >> 4. We would like to be able to report a count of the number of 
> mappings
> >> in the case where there is a large number of mappings.  That is, in 
> the
> >> situation where a mapping is ambiguous we want to sometimes just 
> report
> >> the number of mappings rather than any specific mapping.  What is the
> >> recommended way of doing this?
> >>
> > There are the H0, H1 and H2 tags, although these are somewhat more
> > specific - would them do?
> > If you don't want to record any of the ambiguous mappings, then I think
> > you should flag the read as unmapped
> > (i.e. set flags 0x0004 and 0x0008 on the appropriate records).
> > Thus, to be more precise in this case, "is mapped" indicates whether 
> the
> > aligner generated and stored in the sam file
> > at least one mapping for this read.
>
> We are pleased that this answer is broadly in line with the approach we
> have taken.  Unfortunately, we cannot use H0, H1, and H2, because while
> we have a score for the ambiguous records we do not bother generating
> specific alignments for the ambiguous results and therefore do not know
> the precise count of differences.  At present we are just
> using the tag "XC" to record the count, but perhaps a case could be
> made for inclusion of a tag like "HN" as a generalization of the H0,
> H1, and H2 tags.
>
> Is a revised draft of the SAM format specification going to be produced
> soon?  It would seem a lot of questions in this forum could be answered
> if the specification were a little clearer.
Yes, we're hoping to produce a revised draft soon with community input.
So please send comments on other problems you find or suggestions for 
improvement.
-Bob
>
> Regards,
> Sean
>

Re: [Samtools-help] [Samtools-devel] Flag bits 0x0004, 0x0008, 0x0100 and handling ambiguous mappi

Re: [Samtools-help] [Samtools-devel] Flag bits 0x0004, 0x0008, 0x0100 and handling ambiguous mappings