|
From: Bob H. <han...@br...> - 2009-07-08 13:05:21
|
Sean A. Irvine wrote: > Thanks Bob and Alec for your input, it has helped resolved a couple of > our > issues, but we are still confused about some of the answers. > > If SAM is to support aligners that output multiple mappings, it seems > that MRNM/MPOS should only be required if 0x0002 flag is set, since > otherwise selection of any particular MRNM/MPOS would be arbitrary. > The spec doesn't currently indicate under what conditions MRNM/MPOS > are required, but the Java samtools requires MRNM/MPOS whenever 0x0008 > indicates that a mapping exists for the mate. > > To help clarify what we are trying to achieve, consider the following > contrived example with the following reference and a single read pair: > > > myref (reference sequence) > TCCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACCTGTTGGAGCGTACAGATCCGGGCTT > > TTCTGAGTGTACTTTATTATATGAGCATATAATCTTCTGAGTGTACTTTATTATATGAGGAGAAGTCTTAACGAACAGTG > > TTCTGAGTGTACTTTATTATATGAGGCTGGACTATGATTAGGTCCTTACTCCCACGTCGAGCCGGTGGGTGCCTACCCAA > > TTTTTTTTTTTTTTTTTTTTTTTTTACCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACC > > TTTTTTTTTTTTTTTTTTTTTTTTTACCAGCTAAGGCTGCCTCACCACCGATTTTCGTATGGGGCCAGATATAGTGGACC > > > > read-first (first read of the pair) > TTCTGAGTGTACTTTATTATATGAG > > > read-second (second read of the pair) > AAAAAAAAAAAAAAAAAAAAAAAAA > > Might produce individual read mappings of: > #TEMPLATE FRAME READ-NAME TEMPLATE-START > myref F read-first 81 > myref F read-first 115 > myref F read-first 161 > myref R read-second 321 > myref R read-second 241 > > Assuming no additional filtering of mappings, we might currently > output the following SAM records (after identifying potential proper > pairing for each mapping): > > read-first 73 myref 81 255 25M * > 0 0 TTCTGAGTGTACTTTATTATATGAG * > read-first 99 myref 115 255 25M = > 241 151 TTCTGAGTGTACTTTATTATATGAG * > read-first 99 myref 161 255 25M = > 241 105 TTCTGAGTGTACTTTATTATATGAG * > read-second 147 myref 241 255 25M = > 161 -105 TTTTTTTTTTTTTTTTTTTTTTTTT * > read-second 153 myref 321 255 25M * > 0 0 TTTTTTTTTTTTTTTTTTTTTTTTT * > > The first mapping for read-first doesn't meet criteria for a proper > pair with any of the read-second mappings. The alternatives that allow > parsing by the JDK sam are: output a single record (as we have done > above) with 0x0008 set to indicate the mate is unmapped (simply > because we have no basis to prefer any of the other mappings); > alternatively take the cross-product approach and output two SAM > records, each one selecting a different one of the read-second > mappings as its MRNM/MPOS (and allowing 0x0008 to be unset). This > latter option seems to introduce a lot of redundancy for no real gain. Good example. I retract what I wrote before. I think 0x0008 should be interpreted to apply to the current record, and you are correct to set it on the first and last records above. However, I think there are a couple of small problems with the sam records in this example: 1) Although I don't think the spec says this, and maybe it's technically ok without, most applications expect records for paired end alignments to come in pairs. Thus, there should be another record corresponding to the second line but with the reads reversed. 2) I'm sure you did it just for clarity, but the read names must match for the paired records. The two ends are told apart by the 0x40 and 0x80 flags. Taking these together, I think the sam should look like: read-name 73 myref 81 255 25M * 0 0 TTCTGAGTGTACTTTATTATATGAG * read-name 99 myref 115 255 25M = 241 151 TTCTGAGTGTACTTTATTATATGAG * read-name 99 myref 161 255 25M = 241 105 TTCTGAGTGTACTTTATTATATGAG * read-name 147 myref 241 255 25M = 161 -105 TTTTTTTTTTTTTTTTTTTTTTTTT * read-name 147 myref 241 255 25M = 115 -151 TTTTTTTTTTTTTTTTTTTTTTTTT * read-name 153 myref 321 255 25M * 0 0 TTTTTTTTTTTTTTTTTTTTTTTTT * Does this sound right to you? > > >> 3. Sometimes reads can map with the 5'-end off the left of the > reference; > >> that is, with a nonpositive start position. The POS field requires a > >> positive value for the start position. At present we are dealing with > >> this by applying a hard clip to the left of the read. If we do this, > >> then > >> which start position should be used for the computation of the ISIZE > >> field? (A similar problem can occur at the end of the reference, but > >> it is less noticeable because it passes the current validation tests). > >> > > I think it would help me to know why the reads are aligning off the end > > of the reference. > > It is because the extra bases are vector or something? Or are they > > garbage generated by the instrument? > > The reference used for mapping might not be an accurate representation > of the sequenced genome. For example, the reference might consist of > contigs, represent just coding regions, or be from a different species. > In such cases it is not unreasonable that some reads will map overlap the > ends of a reference sequence. Sounds to me like this should be represented in the cigar string as soft clipping (or possibly I). But if we define a formula for ISIZE based on the aligned positions, then it wouldn't affect ISIZE. I don't know if others will weigh in, but I tend to think a precise definition for ISIZE is better than an alternative that says "the producer of the sam file should make the best estimate for insert size". > > >> 4. We would like to be able to report a count of the number of > mappings > >> in the case where there is a large number of mappings. That is, in > the > >> situation where a mapping is ambiguous we want to sometimes just > report > >> the number of mappings rather than any specific mapping. What is the > >> recommended way of doing this? > >> > > There are the H0, H1 and H2 tags, although these are somewhat more > > specific - would them do? > > If you don't want to record any of the ambiguous mappings, then I think > > you should flag the read as unmapped > > (i.e. set flags 0x0004 and 0x0008 on the appropriate records). > > Thus, to be more precise in this case, "is mapped" indicates whether > the > > aligner generated and stored in the sam file > > at least one mapping for this read. > > We are pleased that this answer is broadly in line with the approach we > have taken. Unfortunately, we cannot use H0, H1, and H2, because while > we have a score for the ambiguous records we do not bother generating > specific alignments for the ambiguous results and therefore do not know > the precise count of differences. At present we are just > using the tag "XC" to record the count, but perhaps a case could be > made for inclusion of a tag like "HN" as a generalization of the H0, > H1, and H2 tags. > > Is a revised draft of the SAM format specification going to be produced > soon? It would seem a lot of questions in this forum could be answered > if the specification were a little clearer. Yes, we're hoping to produce a revised draft soon with community input. So please send comments on other problems you find or suggestions for improvement. -Bob > > Regards, > Sean > |