#298 -a option reports different read matches for two identical repeats in reference

Milestone: v0.9.0
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-01-08
Created: 2013-12-26
Private: No

Using this version of bowtie2:

$ bowtie2 --version
/Users/jbarrick/local/bin/bowtie2-align version 2.1.0
32-bit
Built on ifx5.ebalto.jhmi.edu
Wed Feb 20 11:41:54 EST 2013
Compiler: gcc version 4.2.1 (Apple Inc. build 5666) (dot 3)
Options: -O3 -m32 -msse2 -funroll-loops -g3 
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 4, 8, 4, 4, 8}

 
The reference contains two copies of an exactly duplicated sequence (16S rRNA). In -a mode, a read that matches in the middle of this long repeat is reported with a different alignment to each copy (see the CIGAR strings below), even though the two alignments should be equivalent in terms of which parts of the read align. The two copies are on opposite strands of the reference genome; I don't know whether that is related to the discrepancy.

Input read file (test.fastq):

@test
TTGTGCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCTTCCTCTCAGAACAGCTAGAGATCGTCGCC
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJGIFGIIJJJIJIIHIJJJJJJJJJJIJHHGHFFFFFEEEEEEEDDDDDDDDDDDCCDDDDBDDDDD

 
Input reference file (reference.fasta): ATTACHED

 
Command line and terminal output:

$ bowtie2 -t -p 2 --local --ma 1 --mp 3 --np 0 --rdg 2,3 --rfg 2,3 -a -i S,1,0.25 --score-min L,6,0.2 --reorder -x reference -U test.fastq -S test_matched.sam --un test_unaligned.fastq
Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:00:00
1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    1 (100.00%) aligned >1 times
100.00% overall alignment rate
Time searching: 00:00:00
Overall time: 00:00:00
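
For readers unfamiliar with the function-style options in the command above (-i S,1,0.25 and --score-min L,6,0.2), here is a minimal sketch of how such a spec evaluates for this 100-base read. It assumes the documented forms f(x) = a + b*x for L and f(x) = a + b*sqrt(x) for S, where x is the read length. The helper name eval_function_option is hypothetical and not part of bowtie2, and the sketch says nothing about how bowtie2 rounds or clamps the result internally.

```python
import math

def eval_function_option(spec, read_len):
    """Evaluate a spec such as 'L,6,0.2' or 'S,1,0.25' at a given read length.

    Hypothetical helper: only the linear (L) and square-root (S) forms
    used in the command line above are handled.
    """
    kind, a, b = spec.split(",")
    a, b = float(a), float(b)
    if kind == "L":
        return a + b * read_len
    if kind == "S":
        return a + b * math.sqrt(read_len)
    raise ValueError("unsupported function type: " + kind)

min_score = eval_function_option("L,6,0.2", 100)       # --score-min
seed_interval = eval_function_option("S,1,0.25", 100)  # -i
print(min_score, seed_interval)
```

Under these assumptions the minimum score for a 100-base read is 26, well below the AS:i:74 reported for both alignments, so the score threshold alone does not separate the two matches.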

 
SAM output (test_matched.sam):

@HD VN:1.0  SO:unsorted
@SQ SN:NC_009481    LN:2366980
@PG ID:bowtie2  PN:bowtie2  VN:2.1.0
test    16  NC_009481   534787  1   5S95M   *   0   0   GGCGACGATCTCTAGCTGTTCTGAGAGGAAGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAA    DDDDDBDDDDCCDDDDDDDDDDDEEEEEEEFFFFFHGHHJIJJJJJJJJJJIHIIJIJJJIIGFIGJJJIJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC    AS:i:74 XS:i:74 XN:i:0  XM:i:7  XO:i:0  XG:i:0  NM:i:7  MD:Z:5A0G6G10T61T2C1G3  YT:Z:UU
test    256 NC_009481   2020604 1   9S86M5S *   0   0   TTGTGCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCTTCCTCTCAGAACAGCTAGAGATCGTCGCC    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJIJJJGIFGIIJJJIJIIHIJJJJJJJJJJIJHHGHFFFFFEEEEEEEDDDDDDDDDDDCCDDDDBDDDDD    AS:i:74 XS:i:74 XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:61A10C6C0T5    YT:Z:UU
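
To make the two records above easier to compare, here is a small sketch that converts POS, CIGAR and FLAG into a reference interval and a read interval expressed in the orientation of the original FASTQ read. The function name aligned_spans is hypothetical, and the sketch assumes no hard clipping (none is present in these records).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_spans(pos, cigar, flag, read_len):
    """Return ((ref_start, ref_end), (read_start, read_end)), 1-based inclusive.

    Read coordinates are reported relative to the original FASTQ read,
    so reverse-strand records (FLAG bit 16) are flipped back.
    Assumes the CIGAR contains no hard clips.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume reference bases and read bases, respectively.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    query_len = sum(n for n, op in ops if op in "MI=X")
    # A leading soft clip shifts where the aligned part of the read begins.
    lead = ops[0][0] if ops[0][1] == "S" else 0
    start, end = lead + 1, lead + query_len
    if flag & 16:
        # SAM stores reverse-strand reads reverse-complemented.
        start, end = read_len - end + 1, read_len - start + 1
    return (pos, pos + ref_len - 1), (start, end)

print(aligned_spans(534787, "5S95M", 16, 100))
print(aligned_spans(2020604, "9S86M5S", 256, 100))
```

For the first record this gives reference 534787-534881 and read bases 1-95; for the second it gives reference 2020604-2020689 and read bases 10-95. The second alignment therefore leaves read bases 1-9 unaligned, while the first aligns them.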

 
A BLASTN alignment shows that the first match is correct, but that the 9S end of the second match should be aligned, not soft-clipped. That is, the first 95 bases of the read match coordinates 534787-534881 and 2020689-2020595 (reverse strand). Really, I would just like consistent behavior from bowtie2, so leaving that end unaligned in both matches would also be fine if that is how things work out under the scoring scheme.
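
As a rough check on the scoring-scheme question, the sketch below back-calculates the per-mismatch penalty implied by the reported tags, using only values from the SAM records (aligned length from the CIGAR, XM, AS) and the --ma 1 match bonus. It assumes every mismatch in an alignment was charged the same penalty and that no gap or N penalties apply (XO, XG and XN are all 0). The function name is hypothetical, and this is arithmetic on the output, not a description of bowtie2's internals.

```python
def implied_mismatch_penalty(aligned_len, mismatches, score, match_bonus=1):
    """Per-mismatch penalty implied by an ungapped local alignment score."""
    matches = aligned_len - mismatches
    return (matches * match_bonus - score) / mismatches

# Record 1: 5S95M, XM:i:7, AS:i:74
p1 = implied_mismatch_penalty(95, 7, 74)
# Record 2: 9S86M5S, XM:i:4, AS:i:74
p2 = implied_mismatch_penalty(86, 4, 74)

# Extending record 2 over the 9 clipped bases would add 9 aligned
# positions. If the two repeat copies are identical, those positions
# hold the 7 - 4 = 3 extra mismatches seen in record 1.
extension_gain = (9 - 3) * 1 - 3 * p1

print(p1, p2, extension_gain)  # 2.0 2.0 0.0
```

Under these assumptions both records imply a penalty of 2 per mismatch, and aligning the extra 9 bases changes the score by 0. The 95M and 86M alignments would then be exact ties at 74, which may be why either form can be reported; it does not explain why the two copies are treated differently.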

1 Attachments
