Hi all,
I am using BBMap to remove contaminant sequences from a transcriptome assembly (assembled with Trinity). After annotating the assembly, I noticed several contigs associated with bacterial genes, likely due to contamination when the intestine samples were sequenced. I built a reference file containing the genomes of these contaminants and ran BBMap on my assembled contigs against that reference. With default parameters it seemed to remove a reasonable number of reads, but when I looked at the sequence stats (using the seqkit stats command), I noticed that BBMap has chopped up my contigs: the number of sequences rose from 1,399,089 to 3,055,543 and the maximum length dropped from 40,508 to 500, while the total length barely changed. See below:
    ## BBmap with default parameters
    bbmap.sh -Xmx490g in=trinity_allSH_filter200.fasta ref=contaminated_genomes.fna outm=contaminants.fq outu=clean.fq

    # result
    Read 1 data:        pct reads    num reads    pct bases    num bases
    mapped:               0.9877%        30481      0.1361%      1680458
    unambiguous:          0.0946%         2918      0.0494%       610196
    ambiguous:            0.8932%        27563      0.0867%      1070262
    low-Q discards:       0.0000%            0      0.0000%            0
    Match Rate:                NA           NA     85.6736%      1448217
    Error Rate:          23.2457%        26392     13.5243%       228613

    # view stats
    /home/strickba/software/seqkit stats trinity_allSF_filter200.fasta trinity_allSF_filter200_decontam.fasta
    #file                                    format  type   num_seqs        sum_len  min_len  avg_len  max_len
    #trinity_allSF_filter200.fasta           FASTA   DNA   1,399,089  1,235,212,438      200    882.9   40,508
    #trinity_allSF_filter200_decontam.fasta  FASTA   DNA   3,055,543  1,233,323,386       15    403.6      500
Any idea why this is happening? I also tried another command that you posted for removing contaminants, and the cleaned output is again capped at a maximum length of 500:
    bbmap.sh -Xmx490g in=../trinity_allSF_filter200.fasta ref=contaminated_genomes.fna \
        outm=contamination_SF.fq outu=clean_SF.fq minid=0.9 maxindel=20 fast qtrim=rl trimq=15 untrim

    # results
    mapped:               0.3022%         9098      0.0201%       243521
    unambiguous:          0.0321%          965      0.0049%        59148
    ambiguous:            0.2702%         8133      0.0152%       184373
    low-Q discards:       0.0000%            0      0.0000%            0
    Match Rate:                NA           NA     95.4188%       232405
    Error Rate:          11.8530%         5317      4.5155%        10998

    #view stats
    /home/strickba/software/seqkit stats clean_SF.fq clean_SH.fq
    #file           format  type   num_seqs        sum_len  min_len  avg_len  max_len
    #clean_SF.fq.1  FASTQ   DNA   3,001,299  1,212,624,206       15      404      500
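In case it helps anyone reproduce the check without seqkit, here is a minimal sketch of how the same length summary can be recomputed with awk. The function name `fasta_len_stats` is my own hypothetical helper (it is not part of BBMap or seqkit), and it assumes plain FASTA input, so it would not work as-is on the `.fq` outputs.

```shell
# Hypothetical helper: summarize sequence lengths in a FASTA file.
# Assumes plain (uncompressed) FASTA; multi-line records are supported.
fasta_len_stats() {
  awk '
    # Fold one finished record of length l into the running summary.
    function record(l) {
      n++
      sum += l
      if (n == 1 || l < min) min = l
      if (l > max) max = l
    }
    # A header line closes the previous record (if any) and starts a new one.
    /^>/ { if (seen) record(len); seen = 1; len = 0; next }
    # Any other line is sequence; add its length to the current record.
         { len += length($0) }
    END  {
      if (seen) record(len)
      if (n) printf "num_seqs=%d sum_len=%d min_len=%d max_len=%d\n", n, sum, min, max
    }
  ' "$1"
}

# Example call (file name taken from the commands above):
#   fasta_len_stats trinity_allSF_filter200.fasta
```

Running it on the input and on the cleaned output should show whether the maximum length really changes between the two files, independent of seqkit.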
Thanks all