|
From: Alec W. <al...@br...> - 2009-07-07 16:43:34
|
Hi Bob & Sean, I'm confused by question #1. I would think that if there are 1 or more alignments for a read, then there would not be any SAMRecord for the read with 0x0004 flag. Why would one want to create an unaligned SAMRecord in addition to the aligned ones for the same read? -Alec Bob Handsaker wrote: > Hi, Sean, > > These are excellent questions. I'll take a stab at some answers below. > > Sean A. Irvine wrote: > >> We would like some clarification on some aspects of the SAM specification. >> >> 1. What is the correct interpretation of 0x0004 and 0x0008 bits in the >> <flag> field? >> >> The specification is rather terse on the meaning of these flags. In >> particular, do they apply to just the current record or to all records >> for the current read? >> >> > I think the answer to this question is that it applies to both. > The spec was written based on the idea that "is mapped" is a property of > the read (and the aligner and reference sequence set). > The "is mapped" property is true if the aligner generates one (or more) > mappings for this read to the reference sequence. > > The other use/interpretation of these flags is the following: For reads > that do not map, you can still specify a reference name and position, > but these are > "for sorting and indexing only". In this case, the "is unmapped" flag > bits indicate that the corresponding reference name plus position is not > a true mapping. > This is commonly done, for example, to sort/index an unmapped mate at > the same location as the mapped end. > > I believe these two interpretations are consistent with each other for > all useful cases. > Although we tried to anticipate storing multiple mappings in the sam > spec, I'm not aware of anyone actually using that today, so there may be > subtle issues. > >> Supposing for the moment that they refer to the read as a whole, i.e. >> 0x0008 means that the mate is not mapped for ANY record, we have two >> further questions: >> >> 1(a). If 0x0008 is set, then Java Samtools requires both the MRNM and MPOS >> fields to be set, but in general there can be multiple such references and >> positions with no grounds for picking any particular one. It would seem >> that MRNM and MPOS should only be required if the record is part of a proper >> pairing (i.e. 0x0002 is set). >> >> > If the mate has more than one mapping, then there should be an alignment > record for each mate mapping. > The 0x0008 flag would be set on every record. > > The spec is (purposely?) ambiguous about the situation where both ends > map multiple times: Should the cross product of all mappings be recorded? > I think this is up to the aligner. The aligner can choose how many of > the mappings it chooses to store for each end and in what combinations. > >> 1(b). How should these values be set when the information is not available >> at the time the record is constructed? >> >> > Depending on how the aligner is implemented, it may be necessary to make > a second pass to "fix up" the mate information on each record. > The samtools fixmate command does this, for example. > Until the mate information is fixed up, however, the sam file is in an > intermediate state that I think isn't technically valid. > >> 2. What is the meaning of "the alignment is not primary" in the flag >> value of 0x0100? >> >> > If there are multiple alignments for the read, you can use this bit to > mark one of the alignments as primary (i.e. the "best" alignment or if there > are multiple "best" alignments you could choose one arbitrarily, etc.). > The idea is for the aligner to be able to record multiple alignments but > also communicate that if you want it to pick only one, this is the one > it would pick. > >> 3. Sometimes reads can map with the 5'-end off the left of the reference; >> that is, with a nonpositive start position. The POS field requires a >> positive value for the start position. At present we are dealing with >> this by applying a hard clip to the left of the read. If we do this, then >> which start position should be used for the computation of the ISIZE >> field? (A similar problem can occur at the end of the reference, but >> it is less noticeable because it passes the current validation tests). >> >> > I think it would help me to know why the reads are aligning off the end > of the reference. > It is because the extra bases are vector or something? Or are they > garbage generated by the instrument? > > We should clarify ISIZE in the spec. > The right approach, I think, is to have a definite formula for > calculating ISIZE. > I believe the intent of the current spec is that the formula is always: > last alignment position of mate minus first alignment position of this > read plus one. > This is if the read and its mate align to the same reference sequence > and ISIZE is zero otherwise (although perhaps in hindsight we should > have indicated this case by *). > >> 4. We would like to be able to report a count of the number of mappings >> in the case where there is a large number of mappings. That is, in the >> situation where a mapping is ambiguous we want to sometimes just report >> the number of mappings rather than any specific mapping. What is the >> recommended way of doing this? >> >> > There are the H0, H1 and H2 tags, although these are somewhat more > specific - would them do? > If you don't want to record any of the ambiguous mappings, then I think > you should flag the read as unmapped > (i.e. set flags 0x0004 and 0x0008 on the appropriate records). > Thus, to be more precise in this case, "is mapped" indicates whether the > aligner generated and stored in the sam file > at least one mapping for this read. > > Hope this helps, > -Bob > >> Regards, >> Sean Irvine >> NetValue NZ Ltd >> >> > |