From: Chris C. <ch...@co...> - 2011-01-27 15:58:11
|
On 21/01/11 22:42, Peter wrote: > On Fri, Jan 21, 2011 at 7:43 PM, Jonathan Crabtree wrote: >> >> Hi, >> >> I think there may be a slight bug/inconsistency in the implementation >> of bam_parse_region in samtools-0.1.12a. Page 2 of the SAM >> specification (http://samtools.sourceforge.net/SAM-1.3.pdf) has the >> following to say about the syntax of valid reference sequence names: >> >> Reference sequence name. Each @SQ line must have a unique SN tag. The >> value of this >> fi eld is used in the alignment records in RNAME and PNEXT fi elds. >> Regular expression: [!-)+-<>-~][!-~]* >> >> I believe that this particular regular expression is using ASCII >> character ranges, which would mean that space characters are _not_ >> allowed in reference sequence names, but commas (which appear in the >> range "!-)") and colons _are_ both allowed in reference sequence >> names. ... > > Certainly colons are very commonly used in sequence > identifiers so they should be expected (while spaces > should not be) IMHO. > >> ... >> >> Parse a region in the format: "chr2:100,000-200,000". >> >> ... >> >> ... Another is that str is split on the _first_ colon instead of the >> _last_ colon. ... >> > > That seems like a clear bug. I was chatting about this with > a colleague who had Ensembl reference sequences (IIRC) > and had been manually renaming them to avoid colons > precisely because of this. I urged him to report his as a > bug, but your timely email seems to have found the same > issue. That might be me. The problem stems from the Ensembl Genomes project. All their fasta formatted genome sequences have this type of header: >EG:1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 And thus bowtie searches of these genomes results in SAM file like thus: @HD VN:1.0 SO:unsorted @SQ SN:EG:1 LN:30427671 @SQ SN:EG:2 LN:19698289 @SQ SN:EG:3 LN:23459830 @SQ SN:EG:4 LN:18585056 @SQ SN:EG:5 LN:26975502 @SQ SN:EG:Mt LN:366924 @SQ SN:EG:Pt LN:154478 @PG ID=Bowtie VN=0.11.3 CL=<snip> HWI-B5-690_0051_FC:2:1:11361:1076#ACTCCC/2 16 EG:3 22801277 255 101M * 0 0 AAGAGTTTGCACTACTAGCTTTTTANCCTCAACTTCCACAAACAACAGCAAAATCTAANNAACAAAAAT CTTGGAGCTTNNCGTTTTTTTCTCTAACTCTA A@@@@@=56=55@@@@:667;8868%22039@@@@@AA<<AAAAAAAAAAA=<8;=79%$3<AAAAAAAAA??99?<9?%%B<????AAAAAAAAAB@@@B XA:i:2 MD:Z:25T32T0T19T0A20 NM:i:5 HWI-B5-690_0051_FC:2:1:11699:1078#ACATAC/2 16 EG:1 7146317 255 97M * 0 0 TAATAATGCTTGTAATCCCAAAAACTCATGTTTCCTTCCTTTCTACTCTCACTGTCTTNNTAATCTCTTTTTACAAT AANNCAATGTTTCTTGAAGC AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA@@@@@@@@@@@B@@@?9?<>=?%%?AAA?<@@@@@<9;;;=<?%%B@@@??AAAAAAAAAA XA:i:2 MD:Z:58T0G19A0T16 NM:i:4 HWI-B5-690_0051_FC:2:1:13779:1077#TCCCCT/2 0 EG:2 2026960 255 101M * 0 0 GTAATGAACTGCCTCAAGTTNNTGAGGCACTATAGATTAAGNNGAGAACTATGGGAATAAAGTTTGATTATGTAATG TTTTATGTGGTTTGAGTTGTAATA @@@B@>9?><@@@B@1;919%%>;05451<@@B@9<>>9:2%%997/1</@@@@5@@@@@/86==@5@@@=<885@@@@@<5<<@5(/:;555=<5@@@@@ XA:i:2 MD:Z:20T0T19A0A58 NM:i:4 After creating a sorted and indexed bam, I get this error when trying to retrieve a chromosomal region: [main_samview] region "EG:5:3,173,497-3,179,448" specifies an unknown reference name. Continue anyway. But, returns nothing... Seeing as Ensembl is a huge resource for genomic sequences I'd say this issue needs fixing. Especially as it's probably a very simple fix to use the last colon rather than the first. |