Re: [Samtools-devel] Proposing a new bit flag 0x800

Hello Heng,

On Thu, May 23, 2013 at 12:55:08PM -0400, Heng Li wrote:
> As no one has raised major concerns about the proposal of the new
> 0x800 flag and the SA tag, I am [now] writing these in the SAM spec. The
> attached is a draft version. The description is probably
> inaccurate. Please let me know of any improvements you can suggest.

Thanks for updating the spec, but I have a few minor issues.

Section 1.1: (typo)

The example alignment and corresponding SAM file differ. You updated
the sequence in the alignment diagram for r001/2 so NM:i:1 can be
added, but the SAM entry for r001/2 is the original.

Section 1.2: I'm a bit unsure about the wording around Segments,
Reads, Subreads, indexing and ordering.

I think you're saying that the Read is raw unaligned data and *may* be
split up into multiple Segments, and that Segments are the DNA pieces
we have per SAM line.  I don't think introducing indexing and
ordering makes that point clear though.

However, Segment is defined as "A contiguous (sub)sequence on a
template which is sequenced or assembled". The key word here is
"template": it implies that a single contiguous fragment of sequenced
template which aligns against the reference in a chimeric/split
fashion is still one segment, because you have defined segment to be
contiguous with respect to the template rather than with respect to
the reference. Is that intentional?

It's also perhaps confusing for some technologies that sequence both
the forward and reverse segments in one single read, e.g. when
circularising a template:

[template-end][adapter][template-start]
    ------------------------------->

Giving:
1st --------->
2nd                    ------------> 

According to the wording, this is just one read, but I believe that is
not the intention.
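
The circularised case above can be sketched in Python as a raw read
that is cut at the adapter into two pieces. This is only an
illustration: split_at_adapter is a made-up name, the sequences are
invented, and exact string matching of the adapter is an assumption
(real data would need error-tolerant matching).

```python
from typing import List


def split_at_adapter(raw_read: str, adapter: str) -> List[str]:
    """Split one raw read into the pieces on either side of the adapter.

    Assumes the adapter occurs at most once and matches exactly.
    Returns the raw read unchanged (as a one-item list) if no adapter is found.
    """
    idx = raw_read.find(adapter)
    if idx < 0:
        return [raw_read]
    first = raw_read[:idx]                  # [template-end]
    second = raw_read[idx + len(adapter):]  # [template-start]
    # Drop empty pieces, e.g. when the adapter sits at the very start or end.
    return [piece for piece in (first, second) if piece]


# Invented example: template-end "AAAA", adapter "CCGG", template-start "TTTT".
pieces = split_at_adapter("AAAACCGGTTTT", "CCGG")
```

Here the single instrument read yields two pieces, which is the
distinction the current wording does not seem to capture.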

The explanation of a read consisting of multiple segments, which are
sometimes called subreads, is confusing. Is this the same use of
segment as just defined above? I think not and suggest "fragment"
instead if you want to make a minimal change.

Maybe we should define:

  Raw read: (previously "read")
      A contiguous piece of DNA sequenced in one experiment by a
      sequencing instrument.

  Read: (previously "subread" in some cases, or "read" in others)
      A contiguous piece of template DNA as sequenced. [Footnote:
      often this is the same as the Raw read, but in some protocols,
      such as sequencing through a circularised template, we may have
      one Raw read being split into two Reads.]

  Segment:
      A portion of a Read, aligned against the reference. Each Read
      may produce 1 or more possibly overlapping Segments. Each SAM
      line represents one Segment.

  Canonical alignment: (etc)
      An alignment [of a read] consisting of ...

This avoids confusion over using "read" elsewhere to sometimes mean
"sub-read". It also changes the meaning of Segment to be reference
based instead of purely template based.
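
To make the proposed hierarchy concrete, here is a rough Python
sketch of Raw read -> Read -> Segment. The class names, field names
and example values are all invented for illustration and are not
taken from the spec.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """A portion of a Read aligned against the reference: one SAM line."""
    rname: str
    pos: int         # 1-based leftmost reference position
    read_start: int  # 0-based start offset within the Read
    read_end: int    # exclusive end offset within the Read


@dataclass
class Read:
    """A contiguous piece of template DNA as sequenced."""
    sequence: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class RawRead:
    """Everything the instrument produced in one experiment."""
    sequence: str
    reads: List[Read] = field(default_factory=list)


def sam_line_count(raw: RawRead) -> int:
    """Each Segment is one SAM line, so count Segments over all Reads."""
    return sum(len(read.segments) for read in raw.reads)


# Invented example: a circularised raw read split into two Reads.
# The first Read aligns in two overlapping Segments (a split alignment),
# the second in a single Segment.
first = Read("ACGTACGT", [Segment("chr1", 100, 0, 5), Segment("chr2", 900, 3, 8)])
second = Read("TTGGCCAA", [Segment("chr1", 400, 0, 8)])
raw = RawRead("ACGTACGT" + "NNNN" + "TTGGCCAA", [first, second])
```

With these definitions the example raw read gives two Reads and three
SAM lines in total.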

Section 1.4:
PNEXT/RNEXT now refer to next read rather than next segment.
[See above for the importance of understanding this.]

Assuming you mean sub-read rather than read (or with my revised terms
"read" suffices), then you're saying that you cannot determine the
locations of the next segments without explicitly parsing the SA tag?

If so it makes sense, but I perhaps haven't fully considered all the
nuances yet.
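
For illustration, recovering the other locations from an SA value
might look like the sketch below. It assumes the semicolon-separated
rname,pos,strand,CIGAR,mapQ,NM layout used in the published SAM spec,
which may not match the draft attached to this thread. parse_sa_tag
and SupplementaryHit are made-up names, and the example value is
invented.

```python
from typing import List, NamedTuple


class SupplementaryHit(NamedTuple):
    rname: str
    pos: int     # 1-based leftmost position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int


def parse_sa_tag(value: str) -> List[SupplementaryHit]:
    """Parse the value part of an SA:Z: tag (everything after 'SA:Z:').

    Assumed layout: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated once per
    other alignment of the same read.
    """
    hits = []
    for entry in value.split(";"):
        if not entry:  # skip the empty string after the trailing ';'
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        hits.append(
            SupplementaryHit(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return hits


# Invented example value with two other alignments.
hits = parse_sa_tag("chr1,100,+,50M50S,60,0;chr2,200,-,50S50M,30,1;")
```

A tool that only looks at RNEXT/PNEXT would see none of these
locations, which is the point of the question above.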

Also I assume it's still permitted to have 3 reads on a template,
forming a circular linked list in PNEXT.
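
As a sketch of what the circular list means in practice, the Python
below follows RNEXT/PNEXT from one record until the chain comes back
to where it started. The dictionary-based records and the
walk_template name are invented, and the lookup assumes that
(RNAME, POS) identifies a record uniquely within the template, which
real data does not guarantee.

```python
from typing import Dict, List, Tuple

# Hypothetical minimal record with keys RNAME, POS, RNEXT, PNEXT.
Record = Dict[str, object]


def walk_template(records: List[Record], start: int = 0) -> List[int]:
    """Return record indices in RNEXT/PNEXT order, starting at `start`.

    Stops when the chain returns to the starting record. Raises
    ValueError if a link points nowhere or loops back elsewhere.
    """
    by_location: Dict[Tuple[object, object], int] = {
        (rec["RNAME"], rec["POS"]): i for i, rec in enumerate(records)
    }
    order = [start]
    current = records[start]
    while True:
        # "=" in RNEXT means "same reference as RNAME".
        rnext = current["RNAME"] if current["RNEXT"] == "=" else current["RNEXT"]
        nxt = by_location.get((rnext, current["PNEXT"]))
        if nxt is None:
            raise ValueError("RNEXT/PNEXT points at no known record")
        if nxt == start:
            return order
        if nxt in order:
            raise ValueError("chain loops without returning to the start")
        order.append(nxt)
        current = records[nxt]


# Invented example: three reads on one template, linked in a cycle.
template = [
    {"RNAME": "chr1", "POS": 100, "RNEXT": "=", "PNEXT": 300},
    {"RNAME": "chr1", "POS": 300, "RNEXT": "chr2", "PNEXT": 50},
    {"RNAME": "chr2", "POS": 50, "RNEXT": "chr1", "PNEXT": 100},
]
order = walk_template(template)
```

Starting from any of the three records should visit all three and
then return to the start.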

-- 
James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova
                                  | Plurima gyrabant gymbolitare vabo;
  A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.