From: James B. <jk...@sa...> - 2013-10-28 16:47:03
Hello all,

Nothing much has moved on this, but I have now found the offending code (see after the quote below). Summary of proposal: formalise the padded cigar tag as a new PC tag.

On Thu, May 16, 2013 at 05:33:36PM +0100, James Bonfield wrote:
> I tried testing fixmate on a template with 3 sequences. Two forward
> and one reverse (the last). All 3 have PNEXT and RNEXT set correctly
> and the flags are correct (I think). Shrunk for brevity (real thing
> attached):
>
> @SQ SN:xx LN:20
> @SQ SN:yy LN:20
> a1 67 xx 1 1 10M = 6 20 AAAAAAAAAA ********** TC:i:3
> a1 35 xx 6 1 10M = 11 -20 AAAAATTTTT ********** TC:i:3
> a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** TC:i:3
>
> samtools view -S -b a.sam |samtools fixmate - - | samtools view -
>
> a1 67 xx 1 1 10M = 6 5 AAAAAAAAAA ********** TC:i:3 CT:Z:1F10M-5T2F10M
> a1 3 xx 6 1 10M = 1 -5 AAAAATTTTT ********** TC:i:3
> a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** TC:i:3
>
> So several issues.
>
> 1) Read 1 has mysteriously gained a CT aux field, claimed to be
> consensus tag in the spec. It really isn't. What is CT being used
> for here? We shouldn't be repurposing already defined tags for
> another use.

So here the CT:Z: tag comes from the bam_template_cigar() function in bam_mate.c, added by Heng Li in this commit in September 2011:

https://github.com/samtools/samtools/commit/df50415f5380510f5b44bee2361413fc39c8bbcf

Also, in December we (Peter Cock, Bastien Chevreux, Heng Li and myself) had a discussion on annotations within SAM, including sequence and consensus "tags". This gave us the CT:Z: tag, added by Peter in 2012:

https://github.com/samtools/hts-specs/commit/6d94b25cebfa5770f189b5477b3e812cca3cf52e

These CT:Z: tags are definitely in use and have been since the end of 2011. However, now discovering the alternate undocumented CT tag raises questions. Does anyone out there have any insight into where this code originated and what purpose it has? Any users?
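For anyone wanting to check whether their own files carry the fixmate-style CT value, here is a rough sketch in Python. It is not taken from samtools. The function name `find_ct_tags` and the labels it returns are made up for illustration, and the test "contains a semicolon" is only a heuristic, based on the assumption that the documented annotation CT value is a semicolon-separated string while the fixmate one looks like the 1F10M-5T2F10M example above.

```python
def find_ct_tags(sam_text):
    """Return (qname, flag, value, guess) for each record carrying a CT:Z: field.

    Hypothetical helper for illustration only. The "guess" is a heuristic:
    a value containing ';' is assumed to be the documented annotation form,
    anything else is assumed to be the fixmate-style template cigar.
    """
    hits = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        for aux in fields[11:]:  # optional fields follow the 11 mandatory columns
            if aux.startswith("CT:Z:"):
                value = aux[len("CT:Z:"):]
                guess = "annotation-like" if ";" in value else "template-cigar-like"
                hits.append((fields[0], int(fields[1]), value, guess))
    return hits


# The fixmate output quoted in this thread, re-joined with tabs as in real SAM.
EXAMPLE = "\n".join(
    "\t".join(rec.split(" "))
    for rec in [
        "a1 67 xx 1 1 10M = 6 5 AAAAAAAAAA ********** TC:i:3 CT:Z:1F10M-5T2F10M",
        "a1 3 xx 6 1 10M = 1 -5 AAAAATTTTT ********** TC:i:3",
        "a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** TC:i:3",
    ]
)

for hit in find_ct_tags(EXAMPLE):
    print(hit)
```

To run it over a real file, the text from `samtools view` can be passed in as a string in place of `EXAMPLE`.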
Anyone care to own up to other "generally accepted" tags which should be in the specification and are not? (I know 1000 Genomes and/or GATK generate several undocumented ones that don't adhere to the private namespace rules; care to document them?)

I can see CT being written in samtools, but there is no code there that utilises it. If it's still deemed to be a useful string then we could pick something else. I'd like to propose PC (padded cigar, if that is really what it's doing).

James

--
James Bonfield (jk...@sa...)    | Hora aderat briligi. Nunc et Slythia Tova
                                | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.