This extension is really a hack on top of the current index format, which may make some people unhappy, and I am a little wary of writing hacks into the spec. But if we all think it is a good idea, we can add it to the spec.
In addition, we may also write other stats in bin 37450 (alignments themselves only use bins 0 to 37448, i.e. 37449 bins in total). Besides the number of mapped and unmapped reads, do we want to collect more information when we index the alignment?
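
To make the layout concrete, here is a minimal Python sketch of a reader for the extra information, following the description in the quoted message below (bin 37450 with two chunk records per chr, plus a trailing 8-byte count). It is only an illustration under those assumptions, not the samtools or Picard code: the function names (reg2bin aside, which mirrors the spec's binning routine), the dictionary keys and the synthetic index built by build_example are all made up for this example.

```python
import struct

META_BIN = 37450  # pseudo-bin used by the extension discussed in this thread


def reg2bin(beg, end):
    """Bin for a zero-based, half-open interval [beg, end), per the spec's scheme."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def read_index_stats(data):
    """Return (per-chr stats list, n_no_coor) from the raw bytes of a .bai file.

    A chr without the pseudo-bin yields None; n_no_coor is None if the
    trailing 8-byte integer is absent (index written without the extension).
    """
    if data[:4] != b"BAI\x01":
        raise ValueError("not a BAI index")
    pos = 4
    (n_ref,) = struct.unpack_from("<i", data, pos)
    pos += 4
    stats = []
    for _ref in range(n_ref):
        entry = None
        (n_bin,) = struct.unpack_from("<i", data, pos)
        pos += 4
        for _bin in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", data, pos)
            pos += 8
            # Each chunk is a (chunk_beg, chunk_end) pair of 64-bit integers.
            chunks = struct.unpack_from("<%dQ" % (2 * n_chunk), data, pos)
            pos += 16 * n_chunk
            if bin_id == META_BIN and n_chunk == 2:
                entry = {
                    "chr_start_off": chunks[0],  # chunk_beg[0]
                    "chr_end_off": chunks[1],    # chunk_end[0]
                    "mapped": chunks[2],         # chunk_beg[1]
                    "unmapped": chunks[3],       # chunk_end[1]
                }
        # Skip the linear index: n_intv followed by n_intv 64-bit offsets.
        (n_intv,) = struct.unpack_from("<i", data, pos)
        pos += 4 + 8 * n_intv
        stats.append(entry)
    n_no_coor = None
    if len(data) - pos >= 8:
        (n_no_coor,) = struct.unpack_from("<Q", data, pos)
    return stats, n_no_coor


def build_example():
    """Synthetic one-chr index with made-up offsets and counts (illustration only)."""
    out = b"BAI\x01" + struct.pack("<i", 1)                # one chr
    out += struct.pack("<i", 2)                             # two bins
    out += struct.pack("<Ii", 4681, 1) + struct.pack("<2Q", 100, 200)
    out += struct.pack("<Ii", META_BIN, 2) + struct.pack("<4Q", 100, 200, 5, 3)
    out += struct.pack("<i", 1) + struct.pack("<Q", 100)   # linear index
    out += struct.pack("<Q", 7)                             # reads without coordinates
    return out


stats, n_no_coor = read_index_stats(build_example())
```

A reader that does not know about the extension would simply see one more bin whose number reg2bin can never return, which is why the extra records do not interfere with ordinary region queries.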
Heng
On Jul 7, 2010, at 10:58 AM, Tim Fennell wrote:
> Agreed - we're happy to add this information as it's clearly useful. And as you know, we tend to like things like this in the spec to make sure we're all doing the same thing and others can understand it.
>
> -t
>
> On Jul 7, 2010, at 10:57 AM, Richard Durbin wrote:
>
>> I suggest that if Picard are happy with writing this then you add this
>> into the spec and both samtools and Picard support it.
>> i.e. sort it out between Heng/samtools and Tim/Alex/Picard.
>>
>> Richard
>>
>> On 07/07/2010 15:14, Heng Li wrote:
>>> On Jul 7, 2010, at 9:56 AM, Alec Wysoker wrote:
>>>
>>>
>>>> Hi Folks,
>>>>
>>>> I'm very concerned that there has been no response to Martha's
>>>> question. It appears that 'samtools index' is putting more information
>>>> in the index than is described in the SAM spec. The Picard team is
>>>> creating index writing code so that we can generate an index on the fly
>>>> while writing a BAM. What are we to make of the fact that samtools is
>>>> writing some undocumented info to the index? Should the Picard indexer
>>>> also write this stuff? How are we supposed to know what to write?
>>>>
>>> Firstly, what samtools index writes is an extension that is fully compatible with the spec. If you do not implement the extension, everything but the new "samtools idxstats" will work properly.
>>>
>>> If you want to read/write the additional information, here is how it gets stored. In the extension, each chr has a bin 37450 which currently has two records (n_chunk=2). The first record keeps the start and end file offsets of the entire chr (chunk_beg[0]=chr_start_off, chunk_end[0]=chr_end_off). The second record keeps the number of mapped reads and unmapped reads in the chr (chunk_beg[1]=#mapped, chunk_end[1]=#unmapped). The number of unmapped reads without coordinates is written at the end of the index file as an 8-byte integer. This information allows us to have quick stats on how many reads are aligned to each chr.
>>>
>>> Heng
>>>
>>>
>>>> Thanks, Alec
>>>>
>>>> Martha Borkan wrote:
>>>>
>>>>> I'm trying to understand the bam index format, and have the following
>>>>> question.
>>>>>
>>>>> When a bam alignment record has no start it can presumably be ignored in
>>>>> the index
>>>>> (except, as I see the bam_index.c code does, to count them as n_no_coor,
>>>>> and record this count in some extra information at the end of the file).
>>>>>
>>>>> However, what is done with records that have a start position but no end
>>>>> position (i.e. an unmapped mate stored at the mapped mate's position)?
>>>>> How are these alignments recorded in the index, and what end position is
>>>>> passed to reg2bin?
>>>>> Are they also ignored, or is an end position fabricated somehow, or is
>>>>> something else done with them?
>>>>>
>>>>> Could the spec be updated to clarify this? Thanks,
>>>>>
>>>>> Martha Borkan
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> This SF.net email is sponsored by Sprint
>>>>> What will you do first with EVO, the first 4G phone?
>>>>> Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
>>>>> _______________________________________________
>>>>> Samtools-devel mailing list
>>>>> Samtools-devel@...
>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>
>