Re: [Samtools-devel] non-spec-conformant @SQ lines somewhere in 1000 genomes stack


On 3 Sep 2013, at 01:04, Dan Kortschak wrote:
> I'm going through my pure Go BAM parsing code to get it up to reasonable
> functionality (this involves writing tests and I'm finding some
> surprising things - both in my own code [good] and in the BAM corpus
> [not so good]).
> 
> It's not clear where the following come from, but if you look at the first few
> header lines of e.g. alignments from HG00096 from the 1000 genomes
> project you see:
> 
>> @HD	VN:1.0	SO:coordinate
>> @SQ	SN:1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128	UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human
>> [snip]
> 
> This all looks fine until you realise that header lines are tab
> delimited and the AS and SP tags are space delimited. This breaks
> parsing (unless you relax it to be non-spec-conformant).

These headers can be parsed just fine, though some fields end up with values different from those the file's author surely intended. Parsed as the spec says and as the existing implementations do, these headers have no AS or SP field, and the UR field has an "interesting" value that would give file-not-found if you tried looking up the URL.
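
To make that concrete, here is a minimal sketch of spec-style splitting applied to the @SQ line quoted above. The function name parse_header_line is hypothetical, and this is not the code of any particular SAM/BAM implementation; it only assumes that header fields are tab-delimited TAG:VALUE pairs with two-character tags.

```python
# The first @SQ line from the thread, with AS and SP separated by runs of
# spaces instead of tabs, as in the affected files.
SQ_LINE = (
    "@SQ\tSN:1\tLN:249250621\tM5:1b22b98cdeb4a9304cb5d48026a85128\t"
    "UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/"
    "phase2_reference_assembly_sequence/hs37d5.fa.gz"
    "        AS:NCBI37       SP:Human"
)


def parse_header_line(line):
    """Split a SAM header line into (record_type, {tag: value}).

    Hypothetical helper: fields are split on tabs only, and each field is
    split at its first colon into a two-character tag and a value.
    """
    record_type, *fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields:
        tag, sep, value = field.partition(":")
        if not sep or len(tag) != 2:
            raise ValueError(f"malformed header field: {field!r}")
        tags[tag] = value
    return record_type, tags


record_type, tags = parse_header_line(SQ_LINE)
print(record_type, sorted(tags))  # no AS or SP among the tags
print(repr(tags["UR"]))           # the URL plus the swallowed " AS:... SP:..." text
```

Nothing fails here: the line parses, but the AS and SP text ends up inside the UR value.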

Because they can be parsed successfully and processed successfully (until your application wants to use the AS/SP/UR fields!), these erroneous headers in these few files were not noticed for a while.  This was an error (since corrected) in the construction of these few early 1000 genomes files rather than an endemic bug in one of the tools in the @PG headers you listed, so no SAM/BAM implementations should be relaxing their parsers and breaking on other files for the sake of these few.

> The spec doesn't define what the order of addition to @PG lines should
> be (it looks like bottom addition),

The spec doesn't define the order of @PG header lines in the file, because it is immaterial.  What is meaningful is the chains of history produced by following PP fields in the @PG headers pointed to by PG fields in alignment records or @RG headers.
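
As an illustration of following such a chain, here is a small sketch. The pg_chain function, the dict layout, and the program IDs are all made up for the example; the only thing taken from the spec is that a PP value names the ID of another @PG header.

```python
def pg_chain(pg_headers, start_id):
    """Return program IDs from start_id back to the earliest program.

    pg_headers maps each @PG ID to a dict of that header's tags
    (a hypothetical in-memory layout). The chain is built by following
    PP ("previous program") links; file order plays no part.
    """
    chain = []
    seen = set()
    current = start_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in PP chain at {current!r}")
        if current not in pg_headers:
            raise KeyError(f"no @PG header with ID {current!r}")
        seen.add(current)
        chain.append(current)
        current = pg_headers[current].get("PP")
    return chain


# Made-up @PG headers, deliberately not in chronological order.
pg_headers = {
    "sorter": {"ID": "sorter", "PP": "aligner"},
    "aligner": {"ID": "aligner"},
    "dedup": {"ID": "dedup", "PP": "sorter"},
}

# Start from the ID a PG field in an alignment record or @RG header points to.
print(pg_chain(pg_headers, "dedup"))  # ['dedup', 'sorter', 'aligner']
```

Shuffling the entries of pg_headers gives the same chain, which is why the order of the lines in the file does not matter.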

> it's not obvious where these non-conformant lines are added.
> 
> Any ideas who is adding this and how prevalent it is?

This was a manual error back when these particular 1000 genomes files were constructed.  (A .dict file had some tabs munted by a text editor, or something similar.)  It was subsequently fixed (e.g. HG00097 is fine).  Thus it's not prevalent at all, though I suppose it would be possible for other groups to rediscover the same user mistake independently :-).
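
If you want to catch this mistake in your own files without relaxing the parser, one option is a separate sanity check on the parsed values. The sketch below is only a heuristic of my own (the name swallowed_tags and the pattern are assumptions, not part of any tool): it reports things inside a value that look like a space-separated TAG: and so probably should have been tab-separated fields.

```python
import re

# Heuristic pattern: a run of spaces followed by something shaped like a
# two-character header tag and a colon. Free-text values could legitimately
# contain this, so treat hits as warnings rather than errors.
SWALLOWED_TAG = re.compile(r" +([A-Za-z][A-Za-z0-9]):")


def swallowed_tags(value):
    """Return tag-like names that appear space-separated inside a value."""
    return SWALLOWED_TAG.findall(value)


bad_ur = (
    "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/"
    "phase2_reference_assembly_sequence/hs37d5.fa.gz"
    "        AS:NCBI37       SP:Human"
)
good_ur = (
    "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/"
    "phase2_reference_assembly_sequence/hs37d5.fa.gz"
)

print(swallowed_tags(bad_ur))   # ['AS', 'SP']
print(swallowed_tags(good_ur))  # []
```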

    John

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.