I'm also seeing some reads with invalid header lines in the FASTQ output of unaligned reads.
Mine is a 2x300 MiSeq run that has been trimmed with cutadapt.
I ran bowtie2 like this:
bowtie2 -p 16 -x /mnt/galaxy/data/genome/eco_hg19/bowtie2_index/eco_hg19 -1 test_r1.fastq -2 test_r2.fastq -I 0 -X 1000 --un-conc unmapped_reads.fastq --local --sensitive --gbar 4 > mapped_reads.sam
and I see this invalid read (note the missing @ symbol):
grep -A 3 M00532:58:000000000-AA5N2:1:1105:11203:24523 unmapped_reads.1.fastq
M00532:58:000000000-AA5N2:1:1105:11203:24523 1:N:0:1
ACGGCAGCACCGTCACGCAGCACGAAATAAGCATCTGATTTTTCGCACGGCAGCTCAGGTAATGGCACCGGATCTTCTTTCGGTGGTGCCACTTCGCCGTTACGTAAAATCTTACGTGTGATTTTACACTCTTCGTTGGTGCAGGCCATGTATTTACCGAAACGCACCATTTTCAGGTGCATTTCAGAGCCACATTTTTCACACTCAACGATCGGGCCGTCATAACCTTTAATGCGGAATTCGCCCTCTTCGCTCTAGTAACCGTC
+
8BCC<FFFFCFFFFEEFFFGGGGGGGE?FGGAFGGGGGCCDF,@F:@@FD@FC,EFGAFGGGGDFFGGGGGDCCFA<<C<,BBC:>BEFEF9,?EFC:C>CC:B,CFGGFFGEGF?F,4B,<B<CFFFGGGGGGGF:BFGGGGGGFGGGG<D,3<FFFFBFGC@F+@FFFF;B9DFG9,@FGG9;BC@@D>FFGGF@FFC9<FFGFFG7:4*,<::*1<1*=FC<<F9F9C;9C9C8=?8++/;AE>8EFF/21*2<:CC?A+:C:
However, it's valid in the input R1 file:
@M00532:58:000000000-AA5N2:1:1105:11203:24523 1:N:0:1
ACGGCAGCACCGTCACGCAGCACGAAATAAGCATCTGATTTTTCGCACGGCAGCTCAGGTAATGGCACCGGATCTTCTTTCGGTGGTGCCACTTCGCCGTTACGTAAAATCTTACGTGTGATTTTACACTCTTCGTTGGTGCAGGCCATGTATTTACCGAAACGCACCATTTTCAGGTGCATTTCAGAGCCACATTTTTCACACTCAACGATCGGGCCGTCATAACCTTTAATGCGGAATTCGCCCTCTTCGCTCTAGTAACCGTC
+
8BCC<FFFFCFFFFEEFFFGGGGGGGE?FGGAFGGGGGCCDF,@F:@@FD@FC,EFGAFGGGGDFFGGGGGDCCFA<<C<,BBC:>BEFEF9,?EFC:C>CC:B,CFGGFFGEGF?F,4B,<B<CFFFGGGGGGGF:BFGGGGGGFGGGG<D,3<FFFFBFGC@F+@FFFF;B9DFG9,@FGG9;BC@@D>FFGGF@FFC9<FFGFFG7:4*,<::*1<1*=FC<<F9F9C;9C9C8=?8++/;AE>8EFF/21*2<:CC?A+:C:
~~~~
This read immediately follows a read that was completely trimmed away by cutadapt (probably an adapter dimer). cutadapt is set to keep all sequences in the input FASTQ file instead of throwing away short reads, because it's much easier to keep R1 and R2 the same length for downstream processing.
@M00532:58:000000000-AA5N2:1:1105:19868:24518 1:N:0:1
+
@M00532:58:000000000-AA5N2:1:1105:11203:24523 1:N:0:1
ACGGCAGCACCGTCACGCAGCACGAAATAAGCATCTGATTTTTCGCACGGCAGCTCAGGTAATGGCACCGGATCTTCTTTCGGTGGTGCCACTTCGCCGTTACGTAAAATCTTACGTGTGATTTTACACTCTTCGTTGGTGCAGGCCATGTATTTACCGAAACGCACCATTTTCAGGTGCATTTCAGAGCCACATTTTTCACACTCAACGATCGGGCCGTCATAACCTTTAATGCGGAATTCGCCCTCTTCGCTCTAGTAACCGTC
+
8BCC<FFFFCFFFFEEFFFGGGGGGGE?FGGAFGGGGGCCDF,@F:@@FD@FC,EFGAFGGGGDFFGGGGGDCCFA<<C<,BBC:>BEFEF9,?EFC:C>CC:B,CFGGFFGEGF?F,4B,<B<CFFFGGGGGGGF:BFGGGGGGFGGGG<D,3<FFFFBFGC@F+@FFFF;B9DFG9,@FGG9;BC@@D>FFGGF@FFC9<FFGFFG7:4*,<::*1<1*=FC<<F9F9C;9C9C8=?8++/;AE>8EFF/21*2<:CC?A+:C:
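A quick way to find records like the one above is to walk the FASTQ file four lines at a time and flag anything odd. The following is only a rough sketch, not part of bowtie2 or cutadapt: the function name `check_fastq` is made up for illustration, and it assumes the file uses strictly four lines per record, with a fully trimmed read written as an empty sequence line and an empty quality line.

```python
from typing import Iterable, Iterator, List, Tuple


def check_fastq(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (record_number, header, problem) for each suspicious record.

    Hypothetical helper: assumes exactly four lines per record, so a
    zero-length read is an '@' header, an empty line, '+', an empty line.
    """
    record: List[str] = []
    number = 0
    for raw in lines:
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue
        number += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            yield number, header, "header does not start with '@'"
        if not plus.startswith("+"):
            yield number, header, "third line does not start with '+'"
        if len(seq) == 0:
            yield number, header, "empty sequence"
        if len(seq) != len(qual):
            yield number, header, "sequence and quality lengths differ"
    if record:
        # Leftover lines that do not make up a full four-line record.
        yield number + 1, record[0], "truncated record"
```

Running it over both the cutadapt output and `unmapped_reads.1.fastq`, for example with `for n, h, p in check_fastq(open("unmapped_reads.1.fastq")): print(n, h, p)`, should show the empty record in the input and the header without `@` in the bowtie2 output.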
If you map these to an E. coli K-12 reference, you can reproduce the problem:
read1:
@M00532:58:000000000-AA5N2:1:1105:14588:24517 1:N:0:1
GGGATTTGGTGTACCGAGACGGGACGTAAAATCTGCAGGCATTATAGTGATCCACGCCACATTTTGTCAACGTTTATTGCTAATCATGTGAATGAATATCCAGTTCACTTTCATTTGTTGAATACTTTTGCCTTCTCCTGCTCTCCCTTAAGCGCATTATTTTACAAAAAACACACTAAACTCTTCCTGTCTCCGATAAAAGATGATTAAATGAAAACTCATTTATTTTGCATAAAAATTCAGTGAGAGCGGAAATCCAGGCTCATCATCAGTTAATTAAGCAGGGTGTTATTTTATGAC
+
CCCCCGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGFGGGGGGGGFFGGGFGGGGGGGGFGGGGEGGGGGGGGGGGGGGGGGGGGGGFFGGFGGGG9FGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFFGGGGGGFGGGGGFGDGEGGGGGGGGGGGFGGGGGGGGGGGFFGGAFFDGFGGGGGGGGGDFGGAFFEGEGGGGGGGGGGGGGGGGGGGGGFGGGGGGCFGGGCFGFGGGCFGGGGGCEGG8F9CCFFGGGCEFGGGGGGGC?FFFFG=DGGF3CG>CCFGGF0
@M00532:58:000000000-AA5N2:1:1105:23508:24517 1:N:0:1
AAATGGAATCACTGGCCTCGCTCTATAAAAATCATATAGCTACCTTACAAGAACGGACTCGCGATGCGCTGGCGCGCTTCAAGCTGGATGCGTTACTTATTCACTCCGGCGAGCTGTTCAATGTTTTTCTCGACGATCATCCCTATCCGTTTAAAGTGAACCCGCAATTCAAAGCGTGGGTGCCGGTAACTCA
+
CCCCCFGGGGFGGGGGGGGGGGGGGGFGGGGGGGGGFGGGGGGGGFGGGGGGAFGGGGGGGGGEEGDGGGGGGGGGCGGGGGGGGGGG?EGEGGGGGCFGGGGGGFGGG@@FGGGGGGGFGG?FGGFGGGFFGGB8BFFGGGGDEFG@BBF<FDCCDGGGGGEG@FGGGGGCFBCCC7FFEGEGFGGGCF9
@M00532:58:000000000-AA5N2:1:1105:19868:24518 1:N:0:1
+
@M00532:58:000000000-AA5N2:1:1105:11203:24523 1:N:0:1
ACGGCAGCACCGTCACGCAGCACGAAATAAGCATCTGATTTTTCGCACGGCAGCTCAGGTAATGGCACCGGATCTTCTTTCGGTGGTGCCACTTCGCCGTTACGTAAAATCTTACGTGTGATTTTACACTCTTCGTTGGTGCAGGCCATGTATTTACCGAAACGCACCATTTTCAGGTGCATTTCAGAGCCACATTTTTCACACTCAACGATCGGGCCGTCATAACCTTTAATGCGGAATTCGCCCTCTTCGCTCTAGTAACCGTC
+
8BCC<FFFFCFFFFEEFFFGGGGGGGE?FGGAFGGGGGCCDF,@F:@@FD@FC,EFGAFGGGGDFFGGGGGDCCFA<<C<,BBC:>BEFEF9,?EFC:C>CC:B,CFGGFFGEGF?F,4B,<B<CFFFGGGGGGGF:BFGGGGGGFGGGG<D,3<FFFFBFGC@F+@FFFF;B9DFG9,@FGG9;BC@@D>FFGGF@FFC9<FFGFFG7:4*,<::*1<1*=FC<<F9F9C;9C9C8=?8++/;AE>8EFF/21*2<:CC?A+:C:
@M00532:58:000000000-AA5N2:1:1105:15427:24524 1:N:0:1
AATGCGGTCAGGCAATCGGAGGTTCAATTCCTGCCTTTATTTTGGGGTTAAGCGGATATATCGCCAATCAGGTGCAAACGCCGGAAGTTATTATGGGCATCCGCACATCAATTGCCTTAGTACCTTGCGGATTTATGCTACTGGCATTCGTTATTATCTGGTTTTATCCGCTCACGGATAAAAAATTCAAAGAAATCGTGGGTGAAATTGATAATAGTAAAAAAGTGCAGCAGCAATTAATAAGCGATATCACTAATTAATATTCAATAAAAATAATCAGAACATCAAAGGTGAAACTAT
+
CCCCCECFBBEFGGFGGGGGGGFFGGGGCGGFGFDGGFFGA@EEFDGGG@F9CCFGECFGE::BC+CEFGFG@8FAFFEGFFGG+@5FF,C<E?D?=4F8B4+@4C?FE9FD5B,EFFFFGG,EFCGGGGCF<FFGGGFFGFF@FEGA9B<D<DF<=,F<CDFC,DEFGG+@3FFGCCGCFFCCGFGF,3:>F@EG,BFEGG?FFGF,@CA;,@7@C9@FBB<:C@CF+2+?B:C,?<ECF5CC5?+03<7@F<9CE90<<FC:F7<9FFGGGGFFF:4*+F98:C6C2:729
@M00532:58:000000000-AA5N2:1:1105:14858:24525 1:N:0:1
CCACTAACTCTATGTGAAATAAATCAAAATTTCACGCCGAAATACTCCTTAGGATGTATAGCGAAAAGAGAAAAAGATATACCTCGATCACCCCCTTTCTCCCAAGTGAAAATAAAAGGTTATCAGTTTGCAACATTGAACAACATTCGTTGCAAATCGATAACAACATGCACCTTCAGGATACTATTTATTATGTTCGGCAATGATATTTTCACCCGCGTAGAACGTTCAGAAAATACAAAAATGGCGGAAATCGCCCAATTCCTGCATGAAAATGATTTGAGCGTTAACACCACAGTC
+
CCCCCFFFEGCFGF-C<E9<FGFCEFGDFGG9FF,FFFGGGGCFGGGGGFFG,CFAFGCFGGGGF+C,@CEGG8,C<CFCF,@FF7CFECECEFDG,,EFGF,CAE<5,,CFFGGG97,CF?ED9EE9FCGGGGG,EGGCFGGGGGF,4B<7D<F,FFG,BFFGGFEGCFGFFFFAA,ADBFGDCGCFF9=AFFFGGCCC<+<>DFFCEGFEF@D:FEE@7,>>:DFF9;DEGG7B2CCF:F?C8:E5C?:=/:>E+<C+AA<C+<F7CFGGGGFC+9C>)>:7C9<1/*
@M00532:58:000000000-AA5N2:1:1105:15530:24529 1:N:0:1
read2:
@M00532:58:000000000-AA5N2:1:1105:14588:24517 2:N:0:1
ACGAGGGATCGCATCATAATCCTCTTCGTCTGGCTGGCCCAGGTTTGCAGTATATGCATAAGGAACCGCTCCCTTTTGTCGCATCCACAGCAGTGCGGCACTGGTGTCCAGACCGCCAGAAAAAGCGATACCAATACGTTGACCTACCGGGAGATGCTTGAGAATCGTCGTCATAAAATAACACCCTGCTTAATTAACTGATGATGAGCCTGGATTTCAGCTCTCACTGACTTTTTATGCAAACTAAATGAGTTTTCATTTAATCCTCTTTTATCGGAGACAGGGAGAGTTTAGTGTGTT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGFGGDFGGGGGGGGGGGGGGFGGGGGEGGGGGGGGGGGGGGGGDGGGFGGGGFGGGGG7F:FFEE@FCFFFCFGFGGGGFDGGGGGGGCFFFDGGGGGGDFF,,@?7@GGGGGGGGGFDEGGGCAEGGCFE>FE89EFCCFFGGFBEFGGG787C)/88E8<87DE@<CG@:C?F(:@;=A67CFEG))+/9?E+)+9;5.5?5EDAFA)644-6)--:(431)(.(2(40(())).5)))1,,
@M00532:58:000000000-AA5N2:1:1105:23508:24517 2:N:0:1
TGAGTTACCGGCACCCACGCTTTGAATTGCGGGTTCACTTTAAACGGATAGGGATGATCGTCGAGAAAAACATTGAACAGCTCGCCGGAGTGAATAAGTAACGCATCCAGCTTGAAGCGCGCCAGCGCATCGCGAGTCCGTTCTTGTAAGGTAGCTATATGATTTTTATAGAGCGAGGCCAGTGATTCCATTT
+
C@CCCFEGGGDGGGGGGGGGGGFFGGGGDF;@:FCFGGGGGGGFFGE:FCF8F@BEFG?FGD@FF@GGGGGGGGGGGFGG<FEFGEF@F+FAFAF9FF,FGGGECGGAFFFF8F5?ECFEFGGGGEG:FGGGC<@FFGGGDFFGG=C9EFFGGGGFF;A@D9;D?9DGGG9FGB:CGGG56<,B,=C@;E9
@M00532:58:000000000-AA5N2:1:1105:19868:24518 2:N:0:1
+
@M00532:58:000000000-AA5N2:1:1105:11203:24523 2:N:0:1
GACGGTTACGAGATCGAAGAGGGCGAATTCCGCATTAAAGGTTATGACGGCCCGATCGTTGAGTGTGAAAAATGTGGATCTGAAATGCACCTGAAAATGGGGCGATTCGGTAAATACATGGCCTGCACCAACGAAGAGTGTAAAAACACACGTAAGAATTAACGTCACGGCGAAGTGGCACCACCGAAAGAAGATCCGGTGACATTACCTGAGCGGACGTGAGAAAAATCAGGAGTGTATTGTGTGCTGCGTGAGGGTGCCGCCGGAGCACGGAATGGCAGCAGGATGGGAAAGGGGGTG
+
BACC@FFGGGGCECD@FGGGGGDGGGFGGDFGGGGGFGACFGGGFGFEG@6C@F@6@C,CDECE<FCFG,CEFFFG,:E,CE<DF<EDAEGGDFGG,CFFF:C+4+C=E7ACF,BFFCFGD<<A8EEDFGCFE===,EDFEGD8+@BF+@,@FGF,>B,7EB:+3:>BE=C9<5,21=?EGGCCCFC>+<570)<77C2:<B)08*:*-C14 .*(2="">(/57/(>)(62/:-+...)/-)((.(-(((-(-,4.4((.32(,().()(-((((),-:(,((((-.((
@M00532:58:000000000-AA5N2:1:1105:15427:24524 2:N:0:1
GGCCACTATTTTTCTCATAGTTGCACCTTTGATGTTCTGATTATTTTTATTGAATATTAATTAGTGATATCGCTGATTAATTGCTGCTGCACTTTTTTACGATTATCAATTTCAACCACGATTTCTTTGAATTTTTTATCCGTGAGCGGATACAACCAGATAATAACGAATGCCAGTAGCATAAATCCGCAAGGTACTAAGGCAATTGATGTTCGGATGCCCATAATAACTTCCGGCGTTTGCACCTGCTTGGCCATATATCCGCTTACCCCCCAAATAACGGCAGGGAGTGGACCCCCC
+
C@BCCGFFFGGGGGAFFGFGGGGGGGGGGGGFGCGFGGGFGGGGGGGGGGGDGGG9FGG9FF9,CEFFA<EGDGDFFF9EEECGGGGFGDFCGGGGFGFGGFFGCFF9,AFFE,EE<CBFBF8DECFC,EFGGFEACFFFDED:FGGG+>@E,ECEEGGGGFGCEFFGCGFFA,EF@8,,@CAFGGFE6@E638@EGGGDFFFFFFC6@2+1+=C5@1@F)9)33+5037,=@>:@9?9<CCBC2?(043:() (="" 9)="" 1)(63.1;13="0((-1)-/)((((,(((,-((.4(3)47"
@M00532:58:000000000-AA5N2:1:1105:14858:24525 2:N:0:1
CTAATATTTCCGGCAATTCCACCGCACGCGATAAGCTTTTCATCGCGGGTTACGGTAATCAATACTTCGACTGTGGTGTCAACGCTCAAATCATTTTCATGCAGGACTTGGGCGATTTCCGCCATTTTTTTATTTTCTGAACGTTTTACGCGGGTGAAAATATCATTGCCGAACATAATAACTAGTATCCTGCAGGTGCATGTTGTTAACGATTTGCAACGACTGTTGTTAACTGTTGAAAACGTATAACCTTTTATTTTCACTTGGGGAAAAGGGGGGTGACTAGGGCAATACTATTTT
+
-AB8CGGGGGGD7CFGCGGGGGGGGGGDCEFGGGGGGFF<CE9FD="">FGG+FDFGGGGGG,EC@FFG<FCC,FGDCC,E<FFEGGGFGGDDDG<EFGGGGDFDFF7EFGD8FFDCD7EEC7?CDC<FGEC8FCFDAF9CFG?FD8BDA+:>:8BCC,@D9<DCF9,,2?544:9FCEBC4;;,=E@=9<:,761=0B6;ACCDGC++.:5;C6=5@<(221)-:0.7-:7<44).((-/)67)/-5)6944596)))/4-(((.((-(((),(.))(((((-.)).).*))
@M00532:58:000000000-AA5N2:1:1105:15530:24529 2:N:0:1
~~~~~~
PS
I could have sworn that I already reported this on a different dataset, but I can't find any evidence that I actually did ... sorry if this is a duplicate.
I should have mentioned versions...
I see this in bowtie2 2.2.3 and in 2.1.0.
Hi Brad,
One thing I can infer from the case you described is that the FASTQ file is not valid, and bowtie2 fails to stop and print an error message at that point. A FASTQ record with no sequence is invalid. Given your use case this may seem debatable, but if we follow the FASTQ file format specification, the natural outcome would be to trim the other mate as well when one mate gets fully trimmed.
However, this bowtie2 issue has to be fixed regardless. I will let you know next week how we decide to proceed. Until then, let me know if I misunderstood anything about this case or failed to take something into consideration.
thanks,
Val
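For anyone who wants to keep using a trimmer that leaves zero-length reads in place, the suggestion above (dropping the other mate as well when one mate is fully trimmed) can be applied as a post-processing step before running bowtie2. This is a minimal sketch under stated assumptions, not something bowtie2 or cutadapt provides: `read_records` and `drop_empty_pairs` are hypothetical names, both files are assumed to have exactly four lines per record, and the mates are assumed to appear in the same order in R1 and R2.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, '+' line, quality


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into four-line FASTQ records (hypothetical helper)."""
    block: List[str] = []
    for raw in lines:
        block.append(raw.rstrip("\r\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []
    if block:
        raise ValueError("truncated FASTQ record at end of input")


def drop_empty_pairs(
    r1: Iterable[str], r2: Iterable[str], min_len: int = 1
) -> Iterator[Tuple[Record, Record]]:
    """Yield mate pairs in which both reads have at least min_len bases."""
    for rec1, rec2 in zip(read_records(r1), read_records(r2)):
        # Compare read names up to the first space, e.g. '@M00532:...:24523'.
        if rec1[0].split(" ", 1)[0] != rec2[0].split(" ", 1)[0]:
            raise ValueError(f"mates out of sync: {rec1[0]!r} vs {rec2[0]!r}")
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            yield rec1, rec2
```

Writing each kept pair back out with `"\n".join(rec) + "\n"` keeps R1 and R2 in step while removing the empty records. Note that `zip` stops at the shorter input, so a length mismatch between the two files is not reported by this sketch.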
I think you have it clear...
I've since switched to a PE-aware adapter remover instead of cutadapt.
I don't know the FASTQ spec... but I don't consider a 0-length read to be totally crazy ;)
Brad
On Fri, Jul 25, 2014 at 5:52 PM, Val valduboisvert@users.sf.net wrote:
Related
Bugs: #317
Hi Brad,
Although bowtie2 should stop with an error when given an invalid FASTQ file, we do not want to break any pipelines that currently use bowtie2 and did not take this behavior into account. Therefore we decided that bowtie2 should print the invalid, empty records instead of aborting. A patch for this behavior is currently on GitHub and we will include it in the next release. You can download the source code from here: https://github.com/BenLangmead/bowtie2/archive/master.zip
Let me know if this solves your issue.
thanks,
Val