From: Bob H. <han...@br...> - 2009-07-11 00:48:12
Sean A. Irvine wrote:
> Thanks again. We now have a satisfactory resolution for the situation
> of reads mapping off the left end of a reference using soft clipping.
>
> Bob Handsaker wrote:
> > > I think 0x0008 should be interpreted to apply to the current record,
> > > and you are correct to set it on the first and last records above.
> [...]
> > > read-name 73 myref 81 255 25M * 0 0 TTCTGAGTGTACTTTATTATATGAG *
>
> Regarding this record, we are assuming the purpose of the flags field
> is to allow quick selection/filtering of SAM records based on flag
> settings. The reason we are not particularly happy with the current
> solution of marking the mate as unmapped is that makes it difficult to
> separate records worth processing for structural variance analysis
> (where there are good mappings for the mate), from those where there
> are actually no good mate mappings at all.
>
> We will probably go with Alec's suggestion of selecting an arbitrary
> MRNM/MPOS in order for us to allow 0x0008 to be unset (as we don't
> want to disable other validation). However, in general this seems
> rather unsatisfactory to us, since we can see no rational grounds
> for picking one set of MRNM/MPOS values over any of the others.

It seems to me that if you want to analyze structural variations and you
are going to the trouble to keep multiple alignments, then you will want
to "see" all of the possible mappings for both ends. In the example you
sent, you might have one set of alignments for this pair with aberrant
spacing but another set of alignments for this pair with very plausible
spacing. It's just my two cents, but I'm not sure trying to do filtering
on the flags is the best approach.

>
> > > However, I think there are a couple of small problems with the sam
> > > records in this example:
>
> We agree with your suggestions (and yes, we did name the reads that
> way for clarity).
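
To make the flag discussion concrete: the FLAG value 73 in the quoted
record decomposes into 0x0001 + 0x0008 + 0x0040. Below is a minimal
Python sketch of that kind of bitwise selection. The bit values are the
ones listed in the SAM specification; the short labels and the helper
names decode_flag and mate_unmapped are made up for illustration and are
not part of samtools or Picard.

```python
# SAM FLAG bits (values from the SAM specification; labels are illustrative).
FLAG_BITS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse_strand",
    0x0020: "mate_reverse_strand",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
    0x0100: "not_primary",
    0x0200: "qc_fail",
    0x0400: "duplicate",
}


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value, lowest bit first."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if flag & bit]


def mate_unmapped(flag):
    """True when bit 0x0008 is set on this record."""
    return bool(flag & 0x0008)


if __name__ == "__main__":
    flag = 73  # FLAG of the record quoted above
    print(decode_flag(flag))      # labels of the set bits
    print(mate_unmapped(flag))    # is 0x0008 set?
    print(format(flag, "011b"))   # the same value written as a bit string
```

A filter that only tests 0x0008 cannot tell "mate has several good
mappings" apart from "mate has none", which is the limitation Sean
describes above.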
>
> Alec Wysoker wrote:
> > > * set MRNM and MPOS to refer to the "first" SAMRecord for the mate.
> > > When I say first, I am referring to the ordered list of alignments
> > > for a read, as defined by IH, HI, CC and CP tags.
> > > You can then locate all the candidate alignments for the mate.
>
> Regarding these optional tags, could you please clarify:
>
> HI - Are the HI index values for a read assumed to be in increasing
> order in the SAM file? We are using the Java samtools and its built-in
> sorting capability, thus any HI values we set would be the order the
> records were added, not the order they appear in the file.

I don't believe there was any intent to require that HI has to follow
the order of records in the file. Sorting the file with a different sort
order would reorder these records and we certainly didn't intend to make
more work for sorting. The intent was that HI/IH in conjunction with
CC/CP would allow you to create (and navigate, using the bam index) a
linked list of alignments for the same read when the file is sorted in
coordinate order. I don't know of anyone who is actually using these
tags in this way, however.

>
> Bob Handsaker wrote:
> > Yes, we're hoping to produce a revised draft soon with community input.
> > So please send comments on other problems you find or suggestions
> > for improvement.
>
> For what it is worth here are a few other points on the spec that I have
> noted at various times. Opinions may vary on these and perhaps some of
> them have already been addressed. In no particular order:
>
> * There is no specified length restriction on reads. It seems they
>   should be restricted to be at most 2^29 (as per the reference). Given
>   that in the future read sequences might be considerably longer,
>   thought should be given as to how longer sequences could be split
>   over multiple lines.
>
> * Similarly, BAM has implicit limits on the length of sequence names
>   (2^31-2 due to l_name field) but this is not clearly documented and
>   not mentioned at all in the SAM part of the specification.
>
> * It would be nice to have a way of adding comment lines to SAM files.
>
> * Having the flag specified in ASCII rather than as a binary string makes
>   it hard to visually determine which flags are set for a particular
>   record. It would also be easier to select groups of records with grep
>   like utilities if the flags were given as binary strings.
>
> * On page 5, footnote 7, the language is very unclear, in particular, the
>   phrase "The bases are reverse complemented from the unmapped read" is
>   not satisfactory, since the read is mapped (just in the reverse frame).
>
> * Line ending conventions are not documented (that is, are they Linux
>   style, Windows style, or either) and should conversion to/from BAM
>   preserve the line-endings of the original or convert to local standard.

Thanks!
-Bob

>
> Sean.
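
As an illustration of the linked-list idea for HI/IH with CC/CP
mentioned earlier in this message, here is a rough Python sketch. It
assumes CC holds the reference name of the next hit ("=" meaning the
same reference as the current record) and CP holds its leftmost
coordinate, as the spec describes. The record layout (plain dicts), the
function name follow_hits, and the HI/IH values are hypothetical, and an
in-memory dictionary stands in for the lookup that a BAM index would
provide on a coordinate-sorted file.

```python
def follow_hits(records, qname, start_rname, start_pos):
    """Walk the CC/CP chain for one read, starting at a given alignment.

    `records` is a list of dicts with keys qname, rname, pos and,
    optionally, CC and CP. File order does not matter.
    """
    # Stand-in for an index lookup by (read name, reference, position).
    by_locus = {(r["qname"], r["rname"], r["pos"]): r for r in records}

    chain = []
    seen = set()  # guards against a malformed, cyclic chain
    key = (qname, start_rname, start_pos)
    while key in by_locus and key not in seen:
        seen.add(key)
        rec = by_locus[key]
        chain.append(rec)
        if "CC" not in rec or "CP" not in rec:
            break  # last alignment in the chain
        next_rname = rec["rname"] if rec["CC"] == "=" else rec["CC"]
        key = (qname, next_rname, rec["CP"])
    return chain


if __name__ == "__main__":
    # Made-up alignments for a single read, deliberately out of chain order.
    records = [
        {"qname": "r1", "rname": "chr2", "pos": 500, "HI": 2, "IH": 3,
         "CC": "=", "CP": 900},
        {"qname": "r1", "rname": "chr1", "pos": 100, "HI": 1, "IH": 3,
         "CC": "chr2", "CP": 500},
        {"qname": "r1", "rname": "chr2", "pos": 900, "HI": 3, "IH": 3},
    ]
    for rec in follow_hits(records, "r1", "chr1", 100):
        print(rec["rname"], rec["pos"], rec["HI"])
```

Because the walk follows CC/CP rather than file order, the HI values do
not need to increase with position in the file.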