I was looking for a Java-based alternative to either cutadapt or NGmerge, and bbmerge seems to be able to do much of what both tools can do (primarily I was looking for fixed read trimming, quality trimming, and paired-read dovetail trimming). A few suggested improvements: * Allow simultaneous fixed trimming and q-trimming. Right now it seems like q-trimming supersedes fixed trimming if both are specified, regardless of which end you want to trim in each case; but there are situations where...
I was looking for a Java-based global short-read aligner and this software performs excellently, with lots of neat features that many other aligners lack. In descending order of priority, here are some ideas for improvement: * Update the SAM spec version to v1.6 (currently 1.3/1.4), primarily to produce an updated @HD line to indicate query-grouped output (GO:query). Many downstream tools (e.g. some fgbio tools) rely on this and won't work unless it is explicitly specified, and resorting just to add this annotation...
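For reference, the requested header line under SAM v1.6 would look something like this (tab-separated; GO:query declares query-grouped output):

```
@HD	VN:1.6	GO:query
```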
When I wrote that, PacBio did not have paired reads. They have a new sequencing machine now for short reads that I think does produce pairs but I have not seen any data for it so I'm not sure of the header structure.
I will check in with the lab, but my understanding is that these came from a NextSeq or NovaSeq and didn't have any modifications. Thanks for the quick response. I took a peek at FASTQ.java and saw the following code block:

```java
// Here we try to weed out PacBio, which will differ after the last slash:
for (int i = idxSlash1 + 2; i < len1; i++) {
    if (id1.charAt(i) != id2.charAt(i)) {
        return false;
    }
}
```

I am using reformat.sh to do the following: - make sure reads are paired - count the number of reads/bases...
The problem here is that the read headers differ in two places. Normally, Illumina uses one of these two formats: @stuff/1 and @stuff/2, or @stuff 1:morestuff and @stuff 2:morestuff. Of these, the /1 and /2 style is obsolete for Illumina as far as I know, though Complete Genomics / BGI are adopting it. My effort to determine pairing is based on observation of Illumina data, since there is no formal FASTQ specification regarding pair-naming conventions, and they usually put the read identifier in the "optional description"...
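The two naming conventions described above can be sketched as follows (an illustrative helper only, not BBTools' actual pairing logic):

```python
def headers_look_paired(id1: str, id2: str) -> bool:
    """Check two FASTQ header lines against the two common Illumina
    pairing conventions: a trailing /1 and /2, or a Casava 1.8+ style
    description beginning with "1:" and "2:" after the first space.
    Illustrative sketch only, not BBTools code."""
    # Old convention: identical names up to a trailing /1 and /2.
    if id1.endswith("/1") and id2.endswith("/2"):
        return id1[:-2] == id2[:-2]
    # Casava 1.8+ convention: "@name 1:..." vs "@name 2:..." with
    # identical names before the space.
    name1, _, desc1 = id1.partition(" ")
    name2, _, desc2 = id2.partition(" ")
    return name1 == name2 and desc1.startswith("1:") and desc2.startswith("2:")

# Examples (hypothetical headers):
headers_look_paired("@M001:1:AB/1", "@M001:1:AB/2")               # True
headers_look_paired("@M001:1:AB 1:N:0:1", "@M001:1:AB 2:N:0:1")   # True
headers_look_paired("@M001:1:AB/1", "@M001:1:AC/2")               # False
```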
I decided to subset my FASTQ to a single read so the files are more than manageable to demonstrate the issue:
- bad_R* -> this FASTQ pair is the read as shown in the post above. This fails with vpair enabled.
- bad-no-desc_R* -> this FASTQ pair is the same read where the optional description (text after the space) has been trimmed. This succeeds with vpair enabled.
- bad-no-trail_R* -> this FASTQ pair is the same read except the /1 and /2 have been removed from the sequence identifier. This succeeds...
reformat.sh vpair fails matching reads.
clarifying that I missed cq=f. This can be closed.
Deprecate basecov files in pileup.sh
sendsketch does not obey proxy settings
Stitching read pair R1 and R2 sequences based on alignment from SAM/BAM
`reformat.sh` mode that does not change base quality scores
Hi, I recently read on biostars [https://www.biostars.org/p/9546248/] that I could use your tools to determine the GC content of my reads. My reads are paired reads though, and I wanted to adjust this to determine the GC content for each chromosome. I was able to manually split my bam file into the various chromosomes but am unsure how best to use reformat and stats on these files. When I tried to tell reformat that it was paired, and provide two output file names, it didn't work. It runs properly if I say they are unpaired but I'm...
Good suggestion; I'm opening a new process for bgzip and piping the input. Shouldn't be too hard to catch the error code.
Another alternative solution for me might just be to run gzip -dct ${FASTQ_PATH} on each gzipped FASTQ I'd like to analyze; this will catch gzip corruption. However, it might be useful to propagate errors like these directly through the bbmap suite.
Propagation of internal error codes
I was able to replicate this behavior and it is fixed for the next release (39.05, probably this week). Sometimes I don't notice this kind of issue because I always keep paired reads interleaved in a single file.
The issue above still stands, though. I realize that running repair.sh on paired FASTQs will toss the singletons and write the reads to out and out2 in matching order (though that order appears to be random relative to the two input FASTQs).
sortbyname.sh not creating out2 even when specified
bbduk.sh with maq option and outs keeps low-quality mates
filtering all non-microbial reads
Hi Pierre, Sorry about that, those two and a couple other versions of CrisprFinder made it into the release accidentally. Only CrisprFinder.java should be there. I'll delete them for the next release. Thanks for notifying me! -Brian
Errors when compiling all Java classes in version 39.03
bbduk.sh fails with error "java.lang.NoSuchMethodError: java.lang.Process.isAlive()Z"
Hello, I am currently trying to make a more reproducible workflow (removing weird paths to files mostly) and I am having trouble figuring out the shortcut that can be used for phiX174 as a reference for BBDuk. It looks like the adapters file can be called like this: "ref=adapters", eliminating the extra path before it. Is there a similar way to write the phiX174_ill.ref.fa.gz file? Thanks!
Error from `taxtree.sh`: No level found for strain
taxtree.sh usage guide outdated?
dedupe.sh tries to access Index Integer.MAX_VALUE
Seems like it is done with -10*math.log(sum([10**(-(ord(c)-33)/10) for c in line4])/len(line4), 10) rather than just sum([ord(c)-33 for c in line4])/len(line4). So, it's the quality score of the average error probability.
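Spelled out, the two definitions from the expressions above (assuming Phred+33 encoding):

```python
import math

def avg_phred_arithmetic(qual: str) -> float:
    """Plain arithmetic mean of the Phred scores -- what most people
    compute by hand."""
    scores = [ord(c) - 33 for c in qual]
    return sum(scores) / len(scores)

def avg_phred_probability(qual: str) -> float:
    """Phred score of the mean error probability: convert each score to
    an error probability, average those, convert back. This is always
    <= the arithmetic mean and is dominated by the worst bases."""
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))

# Ten Q40 bases plus a single Q2 base: one bad base drags the
# probability-based average far below the arithmetic one.
q = "IIIIIIIIII#"
round(avg_phred_arithmetic(q), 1)   # 36.5
round(avg_phred_probability(q), 1)  # 12.4
```

This difference explains why a threshold like minavequality=30 can discard reads whose hand-computed arithmetic average is above 30.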
I've been using bbduk for quality filtering and I pretty much just took it for granted. However, I've been looking at it more closely, as I usually use minavequality=15, which removes about 15-20% of our reads; minavequality=30 removes almost 100% of them. Yet when I do the quality averaging by hand, almost all of my reads have an average quality above 30. I've been taking the ASCII values of the quality scores, subtracting 33, summing them up, and dividing by the length. How is bbduk computing the average...
Hi, I am having issues with the Java version as well. Did you figure it out? I am using an HPC, with no luck. I have also tried installing via a conda environment; still no luck. Let me know if you have figured out the Java version issue. Thank you.
java.lang.ClassNotFoundException: align2.BBSplitter
I am trying to run bbsplit:

bbsplit.sh build=1 threads=12 ref_x=$ref_Human ref_y=$ref_Mouse path=$indexded_ref

I am running into the following error:

Error: Could not find or load main class align2.BBSplitter
Caused by: java.lang.ClassNotFoundException: align2.BBSplitter
srun: error: b001: task 0: Exited with exit code 1

Is there a way to solve this issue? Thank you.
info about compatible java versions/benchmarks
Hi! Is this still the current chromosome size limit? I'm trying to use BBSplit to do a dual RNA seq analysis (human contamination) and the indexing part fails with a “The reference file appears to be empty” error too: BBSplit Error: Creating merged reference file ~/ref/genome/1/merged_ref_6286725898541.fa.gz Ref merge time: 133.394 seconds. Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, in1=R1_001.fastq.gz, in2=R2_001.fastq.gz,...
pileup crashing
I forgot to mention: we are trying to have the alignment run as quickly as possible, and we want to feed the alignment file into another function via a pipeline, which is why we need the tool to finish successfully.
bbmap taking large amounts of time // freezing?
Have you retried this recently? There was a problem with the configuration of the servers for a few weeks which should have been resolved on ~Jan 6th, but I'm not 100% sure it's fixed for everyone. Also, can you tell me which version of Java and BBTools you are using?
sendsketch error writing to server
I am currently using pileup.sh to determine the coverage of genes in my metagenomes by comparing my read-mapping alignment file to the gene predictions of the contigs identified by Prodigal. I was wondering if anyone could tell me the output format of the gene_coverage.tmp file, and whether it has the same headers as the contig_coverage.tmp files? I am not sure if I should use the avgDepth value or the depthSum for my coverage value when I go to normalize these counts to counts per million (CPM). I am adding...
BBMap is unintentionally chopping up my sequences
bbduk: 'rename' doesn't work if 'mincovfraction' is set
I see the same problem when running only bbmap.sh ref=Human/SAMEA103958167/SAMEA103958167_contigs.fasta in1=Human/SAMEA103958167/sequence_quality_control/SAMEA103958167_QC_R1.fastq.gz in2=Human/SAMEA103958167/sequence_quality_control/SAMEA103958167_QC_R2.fastq.gz minid=0.9 nodisk out=mapped.sam -Xmx50g threads=5
I set minid=0.9 and minratio was 0.56
Minid is not respected in bbwrap
I have a probably related "issue" where mismatches to real bases are preferred over matches to NNNN, something like:

ACGTNNNNNNGGC (reference)
ACGT------AT- (observed)
ACGTAT------- (expected)

So, I would need to change the gap extension penalty or some other parameters, which don't seem to be accessible in the user interface. (I have worked around it by deleting the 3' end for the moment.) However, I'm not sure if these parameters are accessible at all, since BBMap uses a "convex gap penalty"...
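For intuition only (this is not BBMap's actual scoring function, whose constants and tiers I don't know): a convex gap penalty charges less per base as a gap grows, so one long gap costs less than the same number of gap bases split across several gaps, which is why the aligner can prefer extending a single deletion over opening a second one near an N run.

```python
def convex_gap_cost(length: int, open_cost: int = 10,
                    base_cost: int = 2, decay_at: int = 5,
                    long_cost: int = 1) -> int:
    """Toy convex gap penalty: a fixed opening cost plus a per-base cost
    that drops from base_cost to long_cost after decay_at bases.
    All constants here are made up purely for illustration."""
    if length <= 0:
        return 0
    short = min(length, decay_at)
    return open_cost + short * base_cost + (length - short) * long_cost

# One 10-base gap is cheaper than two 5-base gaps under a convex scheme:
convex_gap_cost(10)      # 10 + 5*2 + 5*1 = 25
convex_gap_cost(5) * 2   # (10 + 5*2) * 2 = 40
```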
Chastity filter and flags
I'm having the same issue with currently latest version (38.90).
calcmem.sh: invalid use of [ -v
I would like to give an update to this. I found the issue: it occurs when you pause the job and then try to resume it in the background using the "bg" command on a Unix system.
bbmap.sh identifying "broken reads" that are not broken
reformat.sh
reformat.sh extin flag not working for bash process substitutions
callvariants.sh & marked duplicated reads
To be clear, the values are swapped in all stats.sh output formats, not just format=7.
stats.sh: N50 and L50 values are switched
Hey all, First of all I have to say that I really appreciate all the work that has been done on BBmap, thanks! I wondered if there is any option of adding a flag to allow mismatches in the index sequences while demultiplexing. When looking into it I noticed that I can retrieve many reads that have only one mismatch in one of the indices, and it's a shame to throw them away. When designing and choosing my indices I make sure the Hamming distance is large enough, so I am not concerned by accidentally classifying...
Just discovered the extin=fq parameter. But having reformat.sh in1=<(preprocess "$fq1") in2=<(preprocess "$fq2") extin=fq qin=33 ... does not solve the issue either: $ reformat.sh qin=33 extin=fq in1=<(<test_1.fastq) in2=<(<test_2.fastq) out=test.sam java -ea -Xms300m -cp /apps/BBMAP/38.57/bbmap/current/ jgi.ReformatReads qin=33 extin=fq in1=/dev/fd/63 in2=/dev/fd/62 out=test.sam Executing jgi.ReformatReads [qin=33, extin=fq, in1=/dev/fd/63, in2=/dev/fd/62, out=test.sam] Set INTERLEAVED to false...
Multiple streams input
Understanding BBMAP output
Change gap opening/extension penalty
Sequences lost in dedupe.sh
Warning against using phylip2fasta.sh
Internal Evolutionary Model for BBMap(?)