From: Sean A. I. <sa...@xt...> - 2009-07-09 01:52:18
Thanks again. We now have a satisfactory resolution for the situation of reads mapping off the left end of a reference using soft clipping.

Bob Handsaker wrote:
> > I think 0x0008 should be interpreted to apply to the current record, and
> > you are correct to set it on the first and last records above.
[...]
> > read-name 73 myref 81 255 25M * 0 0 TTCTGAGTGTACTTTATTATATGAG *

Regarding this record, we are assuming the purpose of the flags field is to allow quick selection/filtering of SAM records based on flag settings. The reason we are not particularly happy with the current solution of marking the mate as unmapped is that it makes it difficult to separate records worth processing for structural variance analysis (where there are good mappings for the mate) from those where there are actually no good mate mappings at all.

We will probably go with Alec's suggestion of selecting an arbitrary MRNM/MPOS in order for us to allow 0x0008 to be unset (as we don't want to disable other validation). However, in general this seems rather unsatisfactory to us, since we can see no rational grounds for picking one set of MRNM/MPOS values over any of the others.

> > However, I think there are a couple of small problems with the sam
> > records in this example:

We agree with your suggestions (and yes, we did name the reads that way for clarity).

Alec Wysoker wrote:
> > * set MRNM and MPOS to refer to the "first" SAMRecord for the mate. When I say first, I am
> > referring to the ordered list of alignments for a read, as defined by IH, HI, CC and CP tags.
> > You can then locate all the candidate alignments for the mate.

Regarding these optional tags, could you please clarify:

HI - Are the HI index values for a read assumed to be in increasing order in the SAM file? We are using the Java samtools and its built-in sorting capability, thus any HI values we set would reflect the order in which the records were added, not the order in which they appear in the file.
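
To make the flag-based selection discussed above concrete, here is a minimal Python sketch that decodes the FLAG of the quoted record (73) and tests the 0x0008 "mate unmapped" bit. The bit meanings follow the SAM specification; the helper names `flag_names` and `mate_unmapped` are made up for this illustration and are not part of samtools or its Java API. The record is split on whitespace because that is how it appears in this message, whereas real SAM lines are tab-delimited.

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_BITS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse_strand",
    0x0020: "mate_reverse_strand",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
}


def flag_names(flag):
    """Return the names of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mate_unmapped(sam_line):
    """Return True if the record's FLAG (second field) has 0x0008 set."""
    fields = sam_line.split()  # assumption: whitespace-separated, as quoted here
    return bool(int(fields[1]) & 0x0008)


record = "read-name 73 myref 81 255 25M * 0 0 TTCTGAGTGTACTTTATTATATGAG *"
print(flag_names(73))         # ['paired', 'mate_unmapped', 'first_in_pair']
print(mate_unmapped(record))  # True
```

A filter of this kind cannot tell a mate with no mapping at all from a mate with several equally good mappings, which is the difficulty described above.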
Bob Handsaker wrote:
> Yes, we're hoping to produce a revised draft soon with community input.
> So please send comments on other problems you find or suggestions for improvement.

For what it is worth, here are a few other points on the spec that I have noted at various times. Opinions may vary on these and perhaps some of them have already been addressed. In no particular order:

* There is no specified length restriction on reads. It seems they should be restricted to be at most 2^29 (as per the reference). Given that in the future read sequences might be considerably longer, thought should be given as to how longer sequences could be split over multiple lines.

* Similarly, BAM has implicit limits on the length of sequence names (2^31-2 due to l_name field) but this is not clearly documented and not mentioned at all in the SAM part of the specification.

* It would be nice to have a way of adding comment lines to SAM files.

* Having the flag specified in ASCII rather than as a binary string makes it hard to visually determine which flags are set for a particular record. It would also be easier to select groups of records with grep-like utilities if the flags were given as binary strings.

* On page 5, footnote 7, the language is very unclear. In particular, the phrase "The bases are reverse complemented from the unmapped read" is not satisfactory, since the read is mapped (just in the reverse frame).

* Line ending conventions are not documented (that is, are they Linux style, Windows style, or either), and should conversion to/from BAM preserve the line endings of the original or convert to the local standard?

Sean.
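
As a small illustration of the binary-string point in the list above, the following Python sketch renders a FLAG value as a fixed-width bit string and checks one bit by position. The function names and the width of 11 bits are assumptions chosen for this example; nothing here is proposed spec syntax.

```python
def flag_as_binary(flag, width=11):
    """Render a SAM FLAG as a zero-padded binary string (width is an assumption)."""
    return format(flag, "0{}b".format(width))


def bit_set_in_string(bits, bit_index):
    """Check bit `bit_index` (0 = least significant) in a binary string."""
    return bits[-(bit_index + 1)] == "1"


bits = flag_as_binary(73)
print(bits)                        # 00001001001
print(bit_set_in_string(bits, 3))  # True: bit 3 is 0x0008, mate unmapped
```

With a fixed-width rendering like this, a pattern that pins a single character position is enough to select every record with a given bit set.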