
#50 BBMap is unintentionally chopping up my sequences

1.0
open
None
2021-10-26
2021-10-26
Britton
No

Hi all,

I am using BBMap to remove contaminant reads from a transcriptome assembly (assembled with Trinity). After annotating my assembly, I noticed several reads that were associated with bacterial genes (likely due to contamination during sequencing of intestines). I created a reference file containing the genomes of these contaminants and ran BBMap on my assembled contigs against that contaminant reference. With default parameters it seemed to remove about the right number of reads, but after looking at the sequence stats (using the seqkit stats command), I noticed that BBMap has chopped up my contigs: the number of sequences more than doubled, the maximum length dropped from 40,508 to 500, and the minimum length dropped from 200 to 15. See below:

## BBmap with default parameters
bbmap.sh -Xmx490g in=trinity_allSH_filter200.fasta ref=contaminated_genomes.fna outm=contaminants.fq outu=clean.fq

# result
Read 1 data:            pct reads       num reads       pct bases          num bases
mapped:                   0.9877%           30481         0.1361%            1680458
unambiguous:              0.0946%            2918         0.0494%             610196
ambiguous:                0.8932%           27563         0.0867%            1070262
low-Q discards:           0.0000%               0         0.0000%                  0
Match Rate:                   NA               NA        85.6736%            1448217
Error Rate:              23.2457%           26392        13.5243%             228613

# view stats
/home/strickba/software/seqkit stats trinity_allSF_filter200.fasta trinity_allSF_filter200_decontam.fasta
#file                                    format  type   num_seqs        sum_len  min_len  avg_len  max_len
#trinity_allSF_filter200.fasta           FASTA   DNA   1,399,089  1,235,212,438      200    882.9   40,508
#trinity_allSF_filter200_decontam.fasta  FASTA   DNA   3,055,543  1,233,323,386       15    403.6      500
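
The max_len of exactly 500 in the second row suggests the contigs were split at a fixed length, though nothing above confirms that. One way to check is to count how many input sequences exceed 500 bases and compare with the output. The sketch below is only an illustration: `fasta_len_summary` is a hypothetical helper name (not part of BBMap or seqkit), and the 500-base cutoff is an assumption taken from the max_len column.

```shell
# Hypothetical helper (not part of BBMap or seqkit): summarize a FASTA file
# by number of records, number of records longer than a cutoff, and the
# longest record. Handles sequences wrapped over several lines.
fasta_len_summary() {
    # $1 = FASTA path, $2 = length cutoff (assumed default: 500)
    awk -v cutoff="${2:-500}" '
        # Record the sequence that just ended, if there was one.
        function tally() {
            if (seen) {
                n++
                if (len > cutoff) over++
                if (len > max) max = len
            }
        }
        /^>/ { tally(); seen = 1; len = 0; next }   # header starts a new record
        { sub(/\r$/, ""); len += length($0) }       # add this line of sequence
        END {
            tally()
            printf "num_seqs=%d over_cutoff=%d max_len=%d\n", n, over, max
        }
    ' "$1"
}
```

Running `fasta_len_summary trinity_allSF_filter200.fasta 500` and then the same on the decontaminated file would show whether many input contigs are longer than 500 while none of the output sequences are, which would be consistent with splitting rather than with read removal.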

Any idea why this is happening? I also tried another command that you posted for removing contaminants, and the output shows the same pattern:

bbmap.sh -Xmx490g in=../trinity_allSF_filter200.fasta ref=contaminated_genomes.fna \
outm=contamination_SF.fq outu=clean_SF.fq minid=0.9 maxindel=20 fast qtrim=rl trimq=15 untrim
# results
mapped:                   0.3022%            9098         0.0201%             243521
unambiguous:              0.0321%             965         0.0049%              59148
ambiguous:                0.2702%            8133         0.0152%             184373
low-Q discards:           0.0000%               0         0.0000%                  0
Match Rate:                   NA               NA        95.4188%             232405
Error Rate:              11.8530%            5317         4.5155%              10998

# view stats
/home/strickba/software/seqkit stats clean_SF.fq clean_SH.fq
#file           format  type   num_seqs        sum_len  min_len  avg_len  max_len
#clean_SF.fq.1  FASTQ   DNA   3,001,299  1,212,624,206       15      404      500

Thanks all
