clumpify.sh error
Hello,
clumpify.sh is failing on a few of my samples (paired reads).
This is the command I ran:
clumpify.sh in1=/data/autoScratch/weekly/bt17686/2018-09-Dewayne40ArgentinaMales_clean/results/uncompress_input/batch_1_run_1/SRR7028280.R1.fastq in2=/data/autoScratch/weekly/bt17686/2018-09-Dewayne40ArgentinaMales_clean/results/uncompress_input/batch_1_run_1/SRR7028280.R2.fastq out1=batch_1_run_1/SRR7028280.R1.fastq out2=batch_1_run_1/SRR7028280.R2.fastq dedupe optical dist=100
And below is the output. Is it weird that out2 is set to null below?
[...]
Clumpify version 36.99
Memory Estimate: 53034 MB
Memory Available: 74538 MB
Set groups to 11
Executing clump.KmerSplit [in1=/data/autoScratch/weekly/bt17686/2018-09-Dewayne40ArgentinaMales_clean/results/uncompress_input/batch_1_run_1/SRR7028280.R1.fastq, in2=/data/autoScratch/weekly/bt17686/2018-09-Dewayne40ArgentinaMales_clean/results/uncompress_input/batch_1_run_1/SRR7028280.R2.fastq, out=SRR7028280.R1_clumpify_p1_temp%_55db1c9ffe70a6d7.fastq, out2=null, groups=11, ecco=false, addname=f, shortname=f, unpair=false, repair=f, namesort=f, ow=true, dedupe, optical, dist=100]
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
Time: 515.391 seconds.
Reads Processed: 73221k 142.07k reads/sec
Bases Processed: 7395m 14.35m bases/sec
Executing clump.KmerSort3 [in1=SRR7028280.R1_clumpify_p1_temp%_55db1c9ffe70a6d7.fastq, in2=null, out=batch_1_run_1/SRR7028280.R1.fastq, out2=batch_1_run_1/SRR7028280.R2.fastq, groups=11, ecco=f, addname=false, shortname=f, unpair=f, repair=false, namesort=false, ow=true, dedupe, optical, dist=100]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Making 2 fetch threads.
Starting threads.
Fetching reads.
Fetch time: 64.768 seconds.
Making clumps.
Clump time: 1.706 seconds.
Deduping.
Exception in thread "Thread-32" java.lang.AssertionError: SRR7028280.14591116.2 HISEQ:121:C1TKCACXX:1:1307:11367:83421 length=101
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.Clump.nearby(Clump.java:257)
at clump.Clump.removeDuplicates_inner(Clump.java:206)
at clump.Clump.removeDuplicates(Clump.java:170)
at clump.ClumpList$ProcessThread.run(ClumpList.java:358)
Dedupe time: 0.075 seconds.
Writing.
Exception in thread "Thread-20" java.lang.AssertionError:
SRR7028280.23153433.2 HISEQ:121:C1TKCACXX:1:2114:6321:8032 length=101 4276489 0 + 0 0 1000000000000000000 1 0 0 CTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT ##################################################################################################### . 2 . .
null
at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:459)
at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:96)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
Fetching reads.
Fetch time: 0.000 seconds.
Making clumps.
Clump time: 2.352 seconds.
Deduping.
Exception in thread "Thread-36" java.lang.AssertionError: SRR7028280.11723735.2 HISEQ:121:C1TKCACXX:1:1215:5641:75715 length=101
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.Clump.nearby(Clump.java:257)
at clump.Clump.removeDuplicates_inner(Clump.java:206)
at clump.Clump.removeDuplicates(Clump.java:170)
at clump.ClumpList$ProcessThread.run(ClumpList.java:358)
Dedupe time: 0.026 seconds.
Writing.
Fetching reads.
Fetch time: 64.091 seconds.
Making clumps.
Clump time: 2.614 seconds.
Deduping.
Exception in thread "Thread-40" java.lang.AssertionError: SRR7028280.2226138.2 HISEQ:121:C1TKCACXX:1:1107:19335:19032 length=101
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:53)
at clump.Clump.nearby(Clump.java:257)
at clump.Clump.removeDuplicates_inner(Clump.java:206)
at clump.Clump.removeDuplicates(Clump.java:170)
at clump.ClumpList$ProcessThread.run(ClumpList.java:358)
Dedupe time: 0.014 seconds.
Writing.
I spent about the last hour dealing with nearly the same cryptic error message, and I think it comes from how the FASTQ headers are parsed when removing optical duplicates.
from the manual...
I was trying to remove duplicates from an NCBI SRA download that didn't have the original Illumina header names/formatting. Once I tried this on raw reads with Illumina formatting that I'd gotten directly from the sequencing center, the error message went away.
I didn't really do any other testing, so I could be wrong here, but the explanation makes sense: Illumina headers contain the tile and x/y coordinates, and the stack trace above fails in hiseq.FlowcellCoordinate.setFrom, which is called from the dedupe step. Hopefully this helps someone else.
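In case it helps with checking this, here is a rough Python sketch that tests whether a FASTQ header starts with an Illumina-style read name. It also moves such a name to the front when it is buried behind an SRA accession, as in the headers from the log above. The regular expression and both function names are my own assumptions for illustration, not anything taken from BBTools, and I have not confirmed that Clumpify accepts the rewritten headers.

```python
import re

# Assumed shape of an Illumina-style read name:
# instrument:run:flowcell:lane:tile:x:y
ILLUMINA_NAME = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+")


def has_illumina_coordinates(header: str) -> bool:
    """Return True if the first token of the header looks like an Illumina read name."""
    tokens = header.lstrip("@").split()
    return bool(tokens) and bool(ILLUMINA_NAME.match(tokens[0]))


def restore_illumina_name(header: str) -> str:
    """Move an embedded Illumina-style token to the front of the header.

    Headers without such a token are returned unchanged.
    """
    tokens = header.lstrip("@").split()
    for i, token in enumerate(tokens):
        if ILLUMINA_NAME.match(token):
            rest = tokens[:i] + tokens[i + 1:]
            return "@" + " ".join([token] + rest)
    return header


if __name__ == "__main__":
    sra_header = "@SRR7028280.14591116.2 HISEQ:121:C1TKCACXX:1:1307:11367:83421 length=101"
    print(has_illumina_coordinates(sra_header))
    print(restore_illumina_name(sra_header))
```

For paired files, the same rewrite would have to be applied to both R1 and R2 so the read names still match between the two files.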
See also: https://github.com/BioInfoTools/BBMap/issues/15#issuecomment-472574978