I am getting a strange assembly artifact with bowtie2 beta 6.
I am assembling HiSeq and MiSeq reads of avian influenza A virus samples.
The avian influenza genome consists of 8 segments that total 13 kb. My reference file contains:
segment 1: 2 sequences.
segment 2: 2 sequences.
segment 3: 3 sequences.
segment 5: 3 sequences.
segment 7: 2 sequences.
segment 8: 3 sequences.
segment 4: 20 sequences.
segment 6: 9 sequences.
The last two segments, segment 4 and segment 6, are highly divergent and thus require a large number of reference sequences. The sequences are present in the reference file in the order listed above.
When I assemble against this set, I get a correct assembly for the first 6 segments (1, 2, 3, 5, 7, 8), sometimes a correct assembly for segment 4, and most often an erroneous assembly for segment 6. Bowtie2 maps a significant number of reads to one of the segment 6 references, but it is not the correct one.
If I assemble against just the segment 6 references, the same reads assemble correctly. The sequences in segments 4 and 6 are highly divergent (40% amino acid identity, no nucleotide-level similarity), so reads should not get pulled into a merely similar-looking reference.
At first I thought the program was running out of physical resources, but this strange behaviour happens with the MiSeq data as well.
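
To pin down which reads change reference between the two runs described above (full reference set versus segment 6 references only), the SAM outputs can be compared directly. The following is only a minimal sketch, not part of bowtie2: the function names are made up for illustration, and it assumes standard SAM output with the usual FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary, 0x40/0x80 for mate 1/mate 2).

```python
from collections import Counter

UNMAPPED = 0x4
SECONDARY_OR_SUPPLEMENTARY = 0x100 | 0x800
MATE_BITS = 0x40 | 0x80  # first / second read of a pair


def read_assignments(sam_lines):
    """Map (read name, mate bits) -> reference name for primary alignments.

    Hypothetical helper: expects an iterable of SAM-format text lines.
    """
    assigned = {}
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        if flag & UNMAPPED or flag & SECONDARY_OR_SUPPLEMENTARY:
            continue
        assigned[(fields[0], flag & MATE_BITS)] = fields[2]
    return assigned


def reads_per_reference(assigned):
    """Count how many reads landed on each reference sequence."""
    return Counter(assigned.values())


def moved_reads(full_run, subset_run):
    """Reads mapped in both runs but to different references.

    Returns {read key: (reference in full run, reference in subset run)}.
    """
    return {
        key: (ref, subset_run[key])
        for key, ref in full_run.items()
        if key in subset_run and subset_run[key] != ref
    }
```

For example, with two hypothetical output files, `moved_reads(read_assignments(open("all_segments.sam")), read_assignments(open("segment6_only.sam")))` would list the reads that land on a different segment 6 reference once the other segments are present, and `reads_per_reference` shows how the reads are distributed in each run.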