From: James B. <jk...@sa...> - 2013-05-24 09:25:15
Hello Heng,

On Thu, May 23, 2013 at 12:55:08PM -0400, Heng Li wrote:
> As no one has raised major concerns about the proposal of the new
> 0x800 flag and the SA tag, I am [now] writing these in the SAM spec. The
> attached is a draft version. The description is probably
> inaccurate. Please let me [know] any improvements you can suggest.

Thanks for updating the spec, but I have a few minor issues.

Section 1.1:

(typo) The example alignment and corresponding SAM file differ. You
updated the sequence in the alignment diagram for r001/2 so NM:i:1 can
be added, but the SAM entry for r001/2 is still the original.

Section 1.2:

I'm a bit unsure about the wording around Segments, Reads, Subreads,
indexing and ordering. I think you're saying that the Read is raw
unaligned data and *may* be split up into multiple Segments, and that
Segments are the DNA pieces we have per SAM line. I don't think
introducing indexing and ordering makes that point clear though.

However Segment is defined as "A contiguous (sub)sequence on a template
which is sequenced or assembled". The key word here is "template": it
implies that a single contiguous fragment of sequenced template which
aligns in a chimeric/split fashion against the reference is still one
segment, because you have defined segment to be contiguous with respect
to the template rather than with respect to the reference. Is that
intentional?

It's also perhaps confusing for some technologies that sequence both the
forward and reverse segments in one single read. Eg when circularising a
template:

    [template-end][adapter][template-start]
    ------------------------------->

Giving:

    1st ---------> 
    2nd                    ------------>

According to the wording, this is just one read, but I believe that is
not the intention.

The explanation of a read consisting of multiple segments, which are
sometimes called subreads, is confusing. Is this the same use of segment
as just defined above? I think not, and suggest "fragment" instead if
you want to make a minimal change.
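For what it's worth, the circularised-template case above can be sketched in a few lines of Python. This is only an illustration of one instrument read yielding two template-contiguous pieces: the adapter sequence and the function name `split_at_adapter` are made up for the example and do not come from the spec or from any tool, and real adapter detection would need to tolerate sequencing errors.

```python
from typing import List

ADAPTER = "AGGTTC"  # hypothetical adapter sequence, for illustration only


def split_at_adapter(instrument_read: str, adapter: str = ADAPTER) -> List[str]:
    """Split one instrument read into template-contiguous pieces.

    Only the first exact occurrence of the adapter is considered.
    If the adapter is absent, the read is returned unchanged as one piece.
    Empty pieces (adapter at either end of the read) are dropped.
    """
    idx = instrument_read.find(adapter)
    if idx < 0:
        return [instrument_read]
    pieces = [instrument_read[:idx], instrument_read[idx + len(adapter):]]
    return [p for p in pieces if p]


# One read sequenced through: [template-end][adapter][template-start]
raw = "TTGACA" + ADAPTER + "ACGTAC"
print(split_at_adapter(raw))  # ['TTGACA', 'ACGTAC']
```

Under the draft wording the input string is a single read, whereas the two returned pieces are what I believe most people would want to treat as the 1st and 2nd reads of the template.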
Maybe we should define:

Raw read: (previously "read")
    A contiguous piece of DNA sequenced in one experiment by a
    sequencing instrument.

Read: (previously "subread" in some cases, or "read" in others)
    A contiguous piece of template DNA as sequenced.
    [Footnote: often this is the same as the Raw read, but in some
    protocols, such as sequencing through a circularised template, we
    may have one Raw read being split into two Reads.]

Segment:
    A portion of a Read, aligned against the reference. Each Read may
    produce 1 or more possibly overlapping Segments. Each SAM line
    represents one Segment.

Canonical alignment: (etc)
    An alignment [of a read] consisting of ...

This avoids confusion over using "read" elsewhere to sometimes mean
"sub-read". It also changes the meaning of Segment to be reference
based instead of purely template based.

Section 1.4:

PNEXT/RNEXT now refer to the next read rather than the next segment.
[See above for the importance of understanding this.]

Assuming you mean sub-read rather than read (with my revised terms,
"read" suffices), are you saying that you cannot determine the
locations of the next segments without explicitly parsing the SA tag?
If so it makes sense, but I perhaps haven't fully considered all the
nuances yet.

Also I assume it's still permitted to have 3 reads on a template,
forming a circular linked list in PNEXT.

-- 
James Bonfield (jk...@sa...)    | Hora aderat briligi. Nunc et Slythia Tova
                                | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:   | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.