From: Michael G. R. <mg...@br...> - 2010-07-23 19:39:20
I think it's a good suggestion to keep the standard fixed, but have an option for samtools view to produce a human-readable representation. However, I think the default "view" mode should be the old-school integer representation, because a number of existing tools rely on "samtools view" to read BAM files and expect the integer flags.

----
Michael Ross
mg...@br...

On Jul 21, 2010, at 2:14 PM, Heng Li wrote:

> My concern is that if this is not part of the spec, one may dump 'samtools view -X' to a file and find Picard rejects it. In addition, I use "zcat in.sam.gz | less -S" more often than "samtools view -X in.sam.gz | less -S". I guess this also happens to others. It would anyway be good to see readable flags directly in SAM when this is backward compatible.
>
> Heng
>
> On Jul 21, 2010, at 12:48 PM, Tim Fennell wrote:
>
>> I forget who said it, but I'd really prefer to see this as a "viewing" tool as opposed to an alternative format that is allowed in the SAM file itself. If we were to stick with only allowing the numeric flag field in SAM files (and clearly in BAM files) but have tools like "samtools view" be able to essentially cat SAM files but provide a more human-readable string on demand, I think that would be useful.
>>
>> -t
>>
>> On Jul 21, 2010, at 12:24 PM, Heng Li wrote:
>>
>>>
>>> On Jul 21, 2010, at 11:54 AM, Lincoln Stein wrote:
>>>
>>>> Where can I get the 1.3 draft? The SVN version seems to be very short.
>>>>
>>>> Without reading the draft, I would agree with Paul's assessment. It has always annoyed me that the SAM flags are not easily human readable, but the deed is now done. I would suggest that the current FLAG column be maintained as is, but that it be made possible for the same information to be inserted into the optional last-column flags field in human-readable form. Redundant, obviously, but helpful.
>>>>
>>>
>>> To clarify, a FLAG string is converted to integer when SAM is read.
>>> In BAM, there is only one integer representation and the BAM format is not changed at all. In addition, it is trivial for a program to tell whether the string or the integer representation is in use. In this sense, the new mixed representation is fully backward compatible with v1.2 and earlier.
>>>
>>> Personally, I think the integer flag is hurting. I would like to change this sooner rather than later. As we are updating the spec, we might as well improve the FLAG field. But if people all think having a mixed representation is confusing, I am happy to drop the string representation.
>>>
>>> The SAMv1.3 draft is available here:
>>>
>>> svn co https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/sam-spec
>>>
>>> You may compile with "pdflatex SAMv1". I am attaching the compiled PDF.
>>>
>>> Heng
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>> <SAMv1.pdf>
>>>> Lincoln
>>>>
>>>> On Wed, Jul 21, 2010 at 11:25 AM, Paul Anderson <ph...@um...> wrote:
>>>> My co-worker, Mary Kate Trost, and I are looking over the 1.3 draft carefully,
>>>> and the FLAG field has several issues regarding the inclusion of human
>>>> readable characters as an alternate representation of the flag bits.
>>>>
>>>> In general, giving the FLAG field two possible representations
>>>> makes programming harder, not easier.
>>>>
>>>> Given the number of different sources for SAM and BAM files, even
>>>> small environments won't readily be able to assume a single format for
>>>> the FLAG field. In larger ones, it is almost guaranteed to be a mix of both
>>>> (e.g. in merged BAM files from different sequencer runs for one
>>>> individual).
>>>>
>>>> This means that even script writers using awk or perl will have to
>>>> write messy code to correctly filter a record given either type of
>>>> flag. The benefit, then, of simple perl and awk code is largely lost.
>>>>
>>>> Further, it increases work on the part of validation and testing to
>>>> have two formats that say the same thing, but have ambiguous syntax
>>>> (it appears not to be semantically ambiguous, but the syntax ambiguity does
>>>> add complexity to some code paths).
>>>>
>>>> From a standards document perspective, combining recommended human
>>>> readable versions of internal data is already a slippery slope - they
>>>> really need to be two different things - one is data storage, the
>>>> other is recommended data presentation (not just human readable
>>>> display, but also command line arguments for example).
>>>>
>>>> I suggest that it would be better to remove the notion of symbolic
>>>> characters as an alternative form of the FLAGS value from the
>>>> specification of SAM/BAM and instead add it to a recommended practice
>>>> guide, where if people choose to input or output the flag values via
>>>> other program command line options, for example, that they allow use of those
>>>> specific character encodings. This will allow their use to be
>>>> consistent across applications.
>>>>
>>>> An alternative way to express this is: please simplify the standard
>>>> (i.e. allow only one representation of FLAG for SAM files), but
>>>> explicitly state that tools that want to handle the complex case of
>>>> both are no longer SAM compliant when used to handle two
>>>> representations. So, for example, if a BAM to SAM viewer optionally
>>>> can choose to output one or the other, it should be made clear in
>>>> documentation of that tool that the result is no longer a valid SAM
>>>> file.
>>>>
>>>> Thanks,
>>>>
>>>> Paul Anderson
>>>> University of Michigan
>>>>
>>>> ------------------------------------------------------------------------------
>>>> This SF.net email is sponsored by Sprint
>>>> What will you do first with EVO, the first 4G phone?
>>>> Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
>>>> _______________________________________________
>>>> Samtools-devel mailing list
>>>> Sam...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>
>>>> --
>>>> Lincoln D. Stein
>>>> Director, Informatics and Biocomputing Platform
>>>> Ontario Institute for Cancer Research
>>>> 101 College St., Suite 800
>>>> Toronto, ON, Canada M5G0A3
>>>> 416 673-8514
>>>> Assistant: Renata Musa <Ren...@oi...>
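
As a rough illustration of the point raised in the thread, that a reader can tell the two FLAG representations apart and normalize both to the integer form, here is a minimal Python sketch. The bit values follow the SAM specification. The one-letter codes and the names FLAG_CODES, flag_to_string, and parse_flag are assumptions made for this example, not taken from the 1.3 draft. The dispatch rule (a leading digit means integer) is also an assumption, and it would misread a letter string that begins with 1 or 2.

```python
# Bit values are from the SAM specification. The one-letter codes are an
# ASSUMPTION for illustration (loosely modelled on "samtools view -X"),
# not a quotation of the SAM 1.3 draft.
FLAG_CODES = [
    (0x1, "p"),    # read is paired
    (0x2, "P"),    # mapped in a proper pair
    (0x4, "u"),    # read is unmapped
    (0x8, "U"),    # mate is unmapped
    (0x10, "r"),   # read is on the reverse strand
    (0x20, "R"),   # mate is on the reverse strand
    (0x40, "1"),   # first read in the pair
    (0x80, "2"),   # second read in the pair
    (0x100, "s"),  # secondary alignment
    (0x200, "f"),  # failed quality checks
    (0x400, "d"),  # PCR or optical duplicate
]

CODE_TO_BIT = {code: bit for bit, code in FLAG_CODES}


def flag_to_string(flag: int) -> str:
    """Render an integer FLAG as a string of one-letter codes."""
    return "".join(code for bit, code in FLAG_CODES if flag & bit)


def parse_flag(field: str) -> int:
    """Return the integer FLAG for either representation.

    ASSUMPTION: a field whose first character is a digit is a decimal
    integer; anything else is a string of one-letter codes.
    """
    if field[:1].isdigit():
        return int(field)
    value = 0
    for code in field:
        value |= CODE_TO_BIT[code]  # KeyError on an unknown code
    return value


if __name__ == "__main__":
    for field in ("99", "pPR1", "4"):
        flag = parse_flag(field)
        print(field, flag, flag_to_string(flag))
```

A viewer built along these lines could keep the integer as the default output and call flag_to_string only when a human-readable option is requested, so that tools reading the default output still see integer flags.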