Re: [Samtools-help] Possible inconsistency between bam_parse_region and SAM specification (v1.3-r88

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

On 21/01/11 22:42, Peter wrote:
> On Fri, Jan 21, 2011 at 7:43 PM, Jonathan Crabtree  wrote:
>>
>> Hi,
>>
>> I think there may be a slight bug/inconsistency in the implementation
>> of bam_parse_region in samtools-0.1.12a.  Page 2 of the SAM
>> specification (http://samtools.sourceforge.net/SAM-1.3.pdf) has the
>> following to say about the syntax of valid reference sequence names:
>>
>> Reference sequence name. Each @SQ line must have a unique SN tag. The
>> value of this
>> fi eld is used in the alignment records in RNAME and PNEXT fi elds.
>> Regular expression: [!-)+-<>-~][!-~]*
>>
>> I believe that this particular regular expression is using ASCII
>> character ranges, which would mean that space characters are _not_
>> allowed in reference sequence names, but commas (which appear in the
>> range "!-)") and colons _are_ both allowed in reference sequence
>> names.  ...
>
> Certainly colons are very commonly used in sequence
> identifiers so they should be expected (while spaces
> should not be) IMHO.
>
>> ...
>>
>> Parse a region in the format: "chr2:100,000-200,000".
>>
>> ...
>>
>> ...  Another is that str is split on the _first_ colon instead of the
>> _last_ colon.  ...
>>
>
> That seems like a clear bug. I was chatting about this with
> a colleague who had Ensembl reference sequences (IIRC)
> and had been manually renaming them to avoid colons
> precisely because of this. I urged him to report his as a
> bug, but your timely email seems to have found the same
> issue.

That might be me.

The problem stems from the Ensembl Genomes project. All their fasta 
formatted genome sequences have this type of header:
 >EG:1 dna:chromosome chromosome:TAIR10:1:1:30427671:1

And thus bowtie searches of these genomes results in SAM file like thus:
@HD     VN:1.0  SO:unsorted
@SQ     SN:EG:1 LN:30427671
@SQ     SN:EG:2 LN:19698289
@SQ     SN:EG:3 LN:23459830
@SQ     SN:EG:4 LN:18585056
@SQ     SN:EG:5 LN:26975502
@SQ     SN:EG:Mt        LN:366924
@SQ     SN:EG:Pt        LN:154478
@PG     ID=Bowtie       VN=0.11.3       CL=<snip>
HWI-B5-690_0051_FC:2:1:11361:1076#ACTCCC/2      16      EG:3    22801277 
        255     101M    *       0       0 
AAGAGTTTGCACTACTAGCTTTTTANCCTCAACTTCCACAAACAACAGCAAAATCTAANNAACAAAAAT
CTTGGAGCTTNNCGTTTTTTTCTCTAACTCTA 
A@@@@@=56=55@@@@:667;8868%22039@@@@@AA<<AAAAAAAAAAA=<8;=79%$3<AAAAAAAAA??99?<9?%%B<????AAAAAAAAAB@@@B 
   XA:i:2  MD:Z:25T32T0T19T0A20    NM:i:5
HWI-B5-690_0051_FC:2:1:11699:1078#ACATAC/2      16      EG:1    7146317 
255     97M     *       0       0 
TAATAATGCTTGTAATCCCAAAAACTCATGTTTCCTTCCTTTCTACTCTCACTGTCTTNNTAATCTCTTTTTACAAT
AANNCAATGTTTCTTGAAGC 
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA@@@@@@@@@@@B@@@?9?<>=?%%?AAA?<@@@@@<9;;;=<?%%B@@@??AAAAAAAAAA 
       XA:i:2  MD:Z:58T0G19A0T16       NM:i:4
HWI-B5-690_0051_FC:2:1:13779:1077#TCCCCT/2      0       EG:2    2026960 
255     101M    *       0       0 
GTAATGAACTGCCTCAAGTTNNTGAGGCACTATAGATTAAGNNGAGAACTATGGGAATAAAGTTTGATTATGTAATG
TTTTATGTGGTTTGAGTTGTAATA 
@@@B@>9?><@@@B@1;919%%>;05451<@@B@9<>>9:2%%997/1</@@@@5@@@@@/86==@5@@@=<885@@@@@<5<<@5(/:;555=<5@@@@@ 
   XA:i:2  MD:Z:20T0T19A0A58       NM:i:4

After creating a sorted and indexed bam, I get this error when trying to 
retrieve a chromosomal region:
[main_samview] region "EG:5:3,173,497-3,179,448" specifies an unknown 
reference name. Continue anyway.

But, returns nothing...

Seeing as Ensembl is a huge resource for genomic sequences I'd say this 
issue needs fixing. Especially as it's probably a very simple fix to use 
the last colon rather than the first.

Re: [Samtools-help] Possible inconsistency between bam_parse_region and SAM specification (v1.3-r88

Re: [Samtools-help] Possible inconsistency between bam_parse_region and SAM specification (v1.3-r882)?