Re: [Samtools-help] @PG header entries

Hi Dave,

I think what we're doing is slightly different from what you've
described, but close.  We are, in essence, treating PG in much the same
way as we treat the RG tag.  More specifically, the header block can
contain any number of @PG records, and every read then carries a PG
attribute identifying which @PG header record applies to it.

Here's an example from a recent file:
....
@PG	ID:0	VN:0.7.1-9	CL:/seq/software/picard/current/3rd_party/maq/maq map -D -s 0 -a 1500 -e 150 ...
....
42DFUAAXX090605:1:115:430:822#0	163	chrM	7	99	101M	=	169	262	AGGTCT...	ABB?BB...	PG:Z:0
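
As a rough sketch of how a consumer of such a file might resolve a
read's PG attribute back to its @PG header record, assuming plain
tab-delimited SAM text: the function names are made up for
illustration, and the header and read below are shortened, hypothetical
stand-ins for the example above, not real data.

```python
def parse_pg_headers(header_lines):
    """Map each @PG record's ID to a dict of its tags."""
    programs = {}
    for line in header_lines:
        if not line.startswith("@PG"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        # Split on the first colon only, since CL values may contain colons.
        tags = dict(field.split(":", 1) for field in fields)
        programs[tags["ID"]] = tags
    return programs


def read_program_id(sam_line):
    """Return the PG:Z value of an alignment line, or None if absent."""
    # The first 11 columns are mandatory; optional tags follow them.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("PG:Z:"):
            return field[len("PG:Z:"):]
    return None


# Hypothetical, shortened records (CL, sequence and qualities are placeholders).
header = ["@PG\tID:0\tVN:0.7.1-9\tCL:maq map"]
read = "read1\t163\tchrM\t7\t99\t6M\t=\t169\t262\tAGGTCT\tABB?BB\tPG:Z:0"

programs = parse_pg_headers(header)
program = programs.get(read_program_id(read))
```

Here `program` ends up holding the tags of the @PG record with ID 0,
so the version and command line used for that read can be looked up
directly.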

Given that we currently align all reads within a read group with the
same command, this does seem a little redundant, but it definitely
lets us track exactly how each read was aligned (or rather, by which
aligner with which options).  If we decide at a later date to do
something like align all reads with bwa and then take all the reads
that didn't align and run them through a second aligner, it would
allow us to track that too.
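
To illustrate that two-aligner scenario, here is a minimal sketch that
tallies reads per program in a file with more than one @PG record.  It
assumes plain tab-delimited SAM text; the function name, the CL values
"first_aligner" and "second_aligner", and the three reads are all
hypothetical placeholders.

```python
from collections import Counter


def count_reads_per_program(sam_lines):
    """Count alignment lines per @PG command line, keyed by the CL tag."""
    names = {}          # @PG ID -> label (CL if present, else the ID itself)
    counts = Counter()
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@PG":
                tags = dict(f.split(":", 1) for f in fields[1:])
                names[tags["ID"]] = tags.get("CL", tags["ID"])
            continue
        # Optional tags start after the 11 mandatory columns.
        pg = next((f[5:] for f in fields[11:] if f.startswith("PG:Z:")), None)
        counts[names.get(pg, "unknown")] += 1
    return counts


# Hypothetical file: two reads from the first pass, one rescued by the second.
sam = [
    "@PG\tID:0\tCL:first_aligner",
    "@PG\tID:1\tCL:second_aligner",
    "r1\t0\tchrM\t7\t99\t6M\t*\t0\t0\tAGGTCT\tABB?BB\tPG:Z:0",
    "r2\t0\tchrM\t9\t99\t6M\t*\t0\t0\tAGGTCT\tABB?BB\tPG:Z:0",
    "r3\t0\tchrM\t20\t30\t6M\t*\t0\t0\tAGGTCT\tABB?BB\tPG:Z:1",
]
counts = count_reads_per_program(sam)
```

Reads with no PG attribute, or with one that matches no @PG record,
are counted under "unknown" rather than dropped.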

-t

On Jun 18, 2009, at 10:21 AM, Dave Larson wrote:

>   I had a question regarding the usage of the @PG header that I  
> hadn't found addressed in the archives. @PG seems to have some  
> weaknesses when storing multiple programs in the file. The biggest  
> one of these being that, if you have multiple programs generating  
> alignment records, there is currently no way to tie which reads were  
> aligned with which program. I've seen some forwarded emails  
> indicating that The Broad is circumventing this shortcoming by  
> replacing the ID field of @PG with the read group ID the program  
> entry is tied to. I've cc'd Tim directly so he can clarify if I am  
> mistaken and provide some additional input as they've clearly been  
> thinking about this issue much longer than I.
>   It seems to me that a community agreed upon convention for  
> creating files of alignment records generated from multiple programs  
> would be useful, especially for consortium type projects. I'm sure  
> there are many ways of accomplishing this. Personally, I think the  
> clearest would be two additional header fields. One optional tag  
> (PI?) for @RG indicating the program ID that generated the records  
> and a second non-optional tag (NM?) in @PG to store the program name  
> and free up ID to be a unique identifier for the program entry.  
> Alternatively, I suppose this sort of thing could be resolved with  
> user defined header tags, but it seemed of such general utility that  
> I thought I would query the list.
>
> Thanks,
>
> Dave