|
From: James B. <jk...@sa...> - 2014-09-17 08:46:54
|
Sorry to resurrect an ancient thread, but it was never resolved at the time and it's now starting to become a real issue. I recently diagnosed a bug in our production code that turned out to be due to hitting this limit. Admittedly that was a consensus sequence, but it won't be long before the era of very long reads with relatively high indel rates makes this fail on raw reads. There were no objections about this at the time, but equally so nothing actually changed and the spec is still worded the same. I see 3 options. 1) Do nothing (not really viable IMO), even though SAM and CRAM resolve it BAM is still here to stay for a long time. 2) Change as proposed, utilising the top bit of the FLAG as a marker for an alternate meaning of the BIN field. 3) Deprecate BIN entirely (with no changes to FLAG) given that bins are basically broken and have been for years and reuse the bits for cigar. CSI invalidates bins and with the launch of htslib 1.0 it is not read by that code either (and presumably won't be by picard if they are supporting CSI indices). Note we cannot simply shuffle bits around to claim cigar is 32-bit. Although cigar and bin are adjacent in memory, the byte order would be 2143 (if 1234 is big endian and 4321 is little endian). Not quite PDP, but equally whacky. For now, I just patched Staden io_lib to deal with case 2 as I needed something to work around my internal representation being BAM (thus blocking > 64k ops from SAM and CRAM too), but I need an official view on this before things simply become standard through common usage. James On Mon, Apr 01, 2013 at 12:35:07PM -0400, Heng Li wrote: > I have thought about this when I implemented the CSI/BAI2 index. Another solution is to reuse the 16-bit 'bin' field. The key observation is that 'bin' is redundant as it can be computed directly from pos and cigar. In addition, CSI/BAI2 requires more bits for 'bin'. This field is entirely ignored in htslib. > > With 16 additional bits, we can do the following. If n_cigar is smaller than 0x10000, everything is exactly the same as the current BAM for the backward compatibility. If n_cigar>=0x10000, we set a hidden bit 0x8000 at the 'flag' field and write the higher 16 bits of n_cigar at 'bin'. This way older versions of samtools/picard can still parse BAMs without long CIGARs. > > This solution is more complicated than squeezing 5 bits from flag, but I am concerned with using up all the bits in the flag field. > > Heng > > > On Apr 1, 2013, at 11:33 AM, Peter Cock wrote: > > > Dear all, > > > > I have filed a pull request to fix an overflow bug in the samtools > > SAM parsing code which was causing crashes with core dumps: > > https://github.com/samtools/samtools/pull/39 > > > > The problem stems for the use of only an unsigned 16 bit int > > in BAM and the samtools data struct for the number of CIGAR > > operators (field n_cigar in the samtools source, n_cigar_op > > in the BAM definition). > > > > This means that as the SAM/BAM specification stands, > > BAM has an implicit limit of 65535 CIGAR operators, while > > SAM has no limit. > > > > Could someone please review and merge that simple fix? > > > > This is proving a problem in converting valid SAM files from > > MIRA into BAM. The SAM files in question are from reference > > guided assembly, where the reference genome is presented in > > the output as a very long mapped read. Using a traditional > > unpadded SAM/BAM reference this is less likely to be an > > issue, but with a padded SAM/BAM reference with gaps in > > it inevitably gives a more complex CIGAR string in this case. > > In this specific example, 428303 CIGAR operations. > > > > http://www.freelists.org/post/mira_talk/Parse-error-at-line-84-unmatched-CIGAR-operation,6 > > > > * Can we fix the CIGAR limit in BAM? > > > > Looking at the BAM specification, n_cigar_op and flag share > > the space of an unsigned 32 bit integer, getting 16 bits each. > > Currently only 11 bits are needed for the FLAG, so in theory > > we could split this to give the spare 5 bits to n_cigar_op > > and thus have an unsigned 21 bit integer. That would give > > up to 2**21 - 1 = 2097151 CIGAR operations, which would > > appear to be ample for the likely assemblies from MIRA. > > > > Of course, there are downsides - for instance this would > > also limit future additions to the FLAG field. > > > > * Can we fix the CIGAR limit in samtools? > > > > Even if we don't change the BAM definition, we can still > > fix samtools to increase this limit. This would allow users > > to parse and work with SAM files containing larger CIGAR > > strings, but they still wouldn't be able to represent tham > > as BAM files. This could be used in this particular case > > to run 'samtools depad' on the MIRA SAM file, which as > > a side effect would simplify the CIGAR string enough for > > it to be written as BAM under the current 16 bit limit. > > > > The simplest way to do this is to modify the struct to allow > > more space for n_cigar, but that would require more > > memory per read. Alternatively, this looks promising > > (stealing 5 bits from the flag memory allocation): > > > > typedef struct { > > int32_t tid; > > int32_t pos; > > uint32_t bin:16, qual:8, l_qname:8; > > uint32_t flag:11, n_cigar:21; > > int32_t l_qseq; > > int32_t mtid; > > int32_t mpos; > > int32_t isize; > > } bam1_core_t; > > > > This would of course require further care and attention in > > the BAM parsing and writing code, but seems to have > > potential. > > > > Thoughts? > > > > Peter > > > > ------------------------------------------------------------------------------ > > Own the Future-Intel® Level Up Game Demo Contest 2013 > > Rise to greatness in Intel's independent game demo contest. > > Compete for recognition, cash, and the chance to get your game > > on Steam. $5K grand prize plus 10 genre and skill prizes. > > Submit your demo by 6/6/13. http://p.sf.net/sfu/intel_levelupd2d > > _______________________________________________ > > Samtools-devel mailing list > > Sam...@li... > > https://lists.sourceforge.net/lists/listinfo/samtools-devel > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ------------------------------------------------------------------------------ > Own the Future-Intel® Level Up Game Demo Contest 2013 > Rise to greatness in Intel's independent game demo contest. > Compete for recognition, cash, and the chance to get your game > on Steam. $5K grand prize plus 10 genre and skill prizes. > Submit your demo by 6/6/13. http://p.sf.net/sfu/intel_levelupd2d > _______________________________________________ > Samtools-devel mailing list > Sam...@li... > https://lists.sourceforge.net/lists/listinfo/samtools-devel -- James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |