From: Heng Li <lh...@sa...> - 2013-04-01 16:35:16
|
I have thought about this when I implemented the CSI/BAI2 index. Another solution is to reuse the 16-bit 'bin' field. The key observation is that 'bin' is redundant as it can be computed directly from pos and cigar. In addition, CSI/BAI2 requires more bits for 'bin'. This field is entirely ignored in htslib. With 16 additional bits, we can do the following. If n_cigar is smaller than 0x10000, everything is exactly the same as the current BAM for the backward compatibility. If n_cigar>=0x10000, we set a hidden bit 0x8000 at the 'flag' field and write the higher 16 bits of n_cigar at 'bin'. This way older versions of samtools/picard can still parse BAMs without long CIGARs. This solution is more complicated than squeezing 5 bits from flag, but I am concerned with using up all the bits in the flag field. Heng On Apr 1, 2013, at 11:33 AM, Peter Cock wrote: > Dear all, > > I have filed a pull request to fix an overflow bug in the samtools > SAM parsing code which was causing crashes with core dumps: > https://github.com/samtools/samtools/pull/39 > > The problem stems for the use of only an unsigned 16 bit int > in BAM and the samtools data struct for the number of CIGAR > operators (field n_cigar in the samtools source, n_cigar_op > in the BAM definition). > > This means that as the SAM/BAM specification stands, > BAM has an implicit limit of 65535 CIGAR operators, while > SAM has no limit. > > Could someone please review and merge that simple fix? > > This is proving a problem in converting valid SAM files from > MIRA into BAM. The SAM files in question are from reference > guided assembly, where the reference genome is presented in > the output as a very long mapped read. Using a traditional > unpadded SAM/BAM reference this is less likely to be an > issue, but with a padded SAM/BAM reference with gaps in > it inevitably gives a more complex CIGAR string in this case. > In this specific example, 428303 CIGAR operations. > > http://www.freelists.org/post/mira_talk/Parse-error-at-line-84-unmatched-CIGAR-operation,6 > > * Can we fix the CIGAR limit in BAM? > > Looking at the BAM specification, n_cigar_op and flag share > the space of an unsigned 32 bit integer, getting 16 bits each. > Currently only 11 bits are needed for the FLAG, so in theory > we could split this to give the spare 5 bits to n_cigar_op > and thus have an unsigned 21 bit integer. That would give > up to 2**21 - 1 = 2097151 CIGAR operations, which would > appear to be ample for the likely assemblies from MIRA. > > Of course, there are downsides - for instance this would > also limit future additions to the FLAG field. > > * Can we fix the CIGAR limit in samtools? > > Even if we don't change the BAM definition, we can still > fix samtools to increase this limit. This would allow users > to parse and work with SAM files containing larger CIGAR > strings, but they still wouldn't be able to represent tham > as BAM files. This could be used in this particular case > to run 'samtools depad' on the MIRA SAM file, which as > a side effect would simplify the CIGAR string enough for > it to be written as BAM under the current 16 bit limit. > > The simplest way to do this is to modify the struct to allow > more space for n_cigar, but that would require more > memory per read. Alternatively, this looks promising > (stealing 5 bits from the flag memory allocation): > > typedef struct { > int32_t tid; > int32_t pos; > uint32_t bin:16, qual:8, l_qname:8; > uint32_t flag:11, n_cigar:21; > int32_t l_qseq; > int32_t mtid; > int32_t mpos; > int32_t isize; > } bam1_core_t; > > This would of course require further care and attention in > the BAM parsing and writing code, but seems to have > potential. > > Thoughts? > > Peter > > ------------------------------------------------------------------------------ > Own the Future-Intel® Level Up Game Demo Contest 2013 > Rise to greatness in Intel's independent game demo contest. > Compete for recognition, cash, and the chance to get your game > on Steam. $5K grand prize plus 10 genre and skill prizes. > Submit your demo by 6/6/13. http://p.sf.net/sfu/intel_levelupd2d > _______________________________________________ > Samtools-devel mailing list > Sam...@li... > https://lists.sourceforge.net/lists/listinfo/samtools-devel -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |