Hi Guys,
We are trying to use bowtie 1.0.0 to evaluate the uniqueness of the genome, by extracting sequences, mapping them back to the full genome with -m1, and checking to see if there is a unique alignment. Overall, this works extremely well, but there are a small number of cases where the sequence does not map back to where it came, especially when the sequence contains 'N's.
For example:
Extract sequences from chr21
$ cat weird.bed
chr21 9775338 9775439
chr21 9775338 9775437
$ fastaFromBed -fi /isilon/seq/schatz/tgarvin/genomes/human/hg19/hg19.fa -bed weird.bed -fo weird.fa
$ cat weird.fa
chr21:9775338-9775439
tataagtgagaagatgcagtgtttggttttttcttcctgcattagtttgctgaagatatcagctttgggttcatccatatccctgcaaagagcatgatcNN
chr21:9775338-9775437
tataagtgagaagatgcagtgtttggttttttcttcctgcattagtttgctgaagatatcagctttgggttcatccatatccctgcaaagagcatgatc
(Notice the second is 2 bp shorter excluding the Ns)
$ /bluearc/data/schatz/software/packages/bowtie-1.0.0/bowtie --chunkmbs=200 /seq/schatz/tgarvin/genomes/human/hg19/hg19.fa -f -S -t -m1 --best --strata -p4 weird.fa
...
chr21:9775338-9775437 4 * 0 0 * * 0 0 TATAAGTGAGAAGATGCAGTGTTTGGTTTTTTCTTCCTGCATTAGTTTGCTGAAGATATCAGCTTTGGGTTCATCCATATCCCTGCAAAGAGCATGATC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:1
chr21:9775338-9775439 0 chr1 143384489 255 101M * 0 0 TATAAGTGAGAAGATGCAGTGTTTGGTTTTTTCTTCCTGCATTAGTTTGCTGAAGATATCAGCTTTGGGTTCATCCATATCCCTGCAAAGAGCATGATCNN IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:99T0C0 NM:i:2
Notice the read without Ns is correctly flagged as non-unique, but with Ns is flagged as unique with edit distance of 2 to chromosome 1!
For our application we know to look for this now, but seems like this could be a general problem.
Let us know if you need any more details or examples.
Thank you,
Mike
Hi Michael,
I am investigating now another N related bug. So good to know about this issue as well. If there is not much trouble for you I could use one more example.
thanks again,
Val
Hi Val,
Thanks for working on this. Here is another example where a sequence on
chr21 is "uniquely" mapped to chr17. It looks like it does that right thing
when there is 1 N or none, the error only occurs if there are 2 Ns. I
havent tried any more than this though.
$ cat weird.bed
chr21 33157053 33157154
chr21 33157054 33157154
chr21 33157055 33157154
$ fastaFromBed -fi /isilon/seq/schatz/tgarvin/genomes/human/hg19/hg19.fa
-bed weird.bed -fo weird.fa
$ cat weird.fa
$ /bluearc/data/schatz/software/packages/bowtie-1.0.0/bowtie --chunkmbs=200
/seq/schatz/tgarvin/genomes/human/hg19/hg19.fa -f -S -t -m1 --best --strata
-p4 weird.fa
...
chr21:33157055-33157154 4 * 0 0 * 0 0
ACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCATCACCACCACCACCACCACCACCACCACCACCACCACCACCACC
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
XM:i:1
chr21:33157054-33157154 4 * 0 0 * 0 0
NACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCACCATCACCACCACCACCACCACCACCACCACCACCACCACCACCACC
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
XM:i:1
chr21:33157053-33157154 16 chr17 64795043 255 101M * 0 0
GGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGATGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTGGTNN
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
XA:i:2 MD:Z:99G0G0 NM:i:2
Good luck!
Mike
On Wed, Mar 19, 2014 at 10:16 AM, Val valduboisvert@users.sf.net wrote:
Related
Bugs: #304
Thank you Michael. I will look into this.
Val