Issues when reading IDs with UMIs
When processing Illumina reads whose IDs contain UMIs, parsing of the read ID fails because of the UMI.
Let's say we have a FastQ file with contents:
@LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
GNTGGTGTGTGGTTTGGTGTGTTTCAAGGTCAGAACAGGTTTTTTTGTTTTTGTTTTTTGTTCTTTGTTTTTTTT
+
9#IIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
The execution of the command
clumpify.sh -Xmx22g in=test.umi.fastq out=test.umi.dup.fastq addcount=t optical=t
leads to an error like
Executing clump.Clumpify [-Xmx22g, in=test.umi.fastq, out=test.umi.dup.fastq, addcount=t, optical=t]
Version 39.13
Read Estimate: 3
Memory Estimate: 0 MB
Memory Available: 18434 MB
Set groups to 1
Executing clump.KmerSort1 [in1=test.umi.fastq, in2=, out1=test.umi.dup.fastq, out2=, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true]
Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Exception in thread "Thread-4" java.lang.AssertionError: LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:94)
at clump.ReadKey.<init>(ReadKey.java:46)
at clump.ReadKey.<init>(ReadKey.java:33)
at clump.ReadKey.makeKey(ReadKey.java:23)
at clump.KmerComparator.hash_inner(KmerComparator.java:78)
at clump.KmerComparator.hash(KmerComparator.java:69)
at clump.KmerComparator.hash(KmerComparator.java:65)
at clump.KmerSort$FetchThread1.run(KmerSort.java:429)
Fetch time: 0.017 seconds.
Closing input stream.
Combining thread output.
Combine time: 0.000 seconds.
Exception in thread "main" java.lang.AssertionError: 0, 1, 1, 1, false
at clump.KmerSort.fetchReads1(KmerSort.java:327)
at clump.KmerSort1.processInner(KmerSort1.java:323)
at clump.KmerSort1.process(KmerSort1.java:286)
at clump.KmerSort1.main(KmerSort1.java:44)
at clump.Clumpify.process(Clumpify.java:263)
at clump.Clumpify.main(Clumpify.java:47)
The error goes away as soon as the UMI is removed, so everything points to the presence of the UMI as the cause of the issue.
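To make the header layout concrete, here is a minimal Python sketch of a parser that accepts the read name with or without the trailing UMI field. It assumes the usual Illumina layout of seven colon-separated fields (instrument, run, flowcell, lane, tile, x, y), optionally followed by the UMI as an eighth field. The names `IlluminaHeader`, `parse_illumina_header` and `strip_umi` are made up for this illustration and have nothing to do with the BBTools code; `strip_umi` just mirrors the manual workaround of removing the UMI.

```python
from typing import NamedTuple, Optional


class IlluminaHeader(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    umi: Optional[str]  # None when the read name has no UMI field
    comment: str        # text after the first space, e.g. "1:N:0:NTTGCTGT+NAATGCGA"


def parse_illumina_header(header: str) -> IlluminaHeader:
    """Split a read header into flowcell coordinates and an optional UMI."""
    name, _, comment = header.lstrip("@").partition(" ")
    fields = name.split(":")
    # Assumption: 7 fields means no UMI, 8 fields means the last one is the UMI.
    if len(fields) not in (7, 8):
        raise ValueError(f"unexpected number of fields in {name!r}")
    umi = fields[7] if len(fields) == 8 else None
    return IlluminaHeader(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), int(fields[5]), int(fields[6]), umi, comment,
    )


def strip_umi(header: str) -> str:
    """Return the header without the UMI field (hypothetical workaround helper)."""
    name, sep, comment = header.partition(" ")
    fields = name.split(":")
    if len(fields) == 8:
        name = ":".join(fields[:7])
    return name + sep + comment


if __name__ == "__main__":
    h = "@LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA"
    print(parse_illumina_header(h))
    print(strip_umi(h))
```

Applied to the read above, the parser reports the UMI as TCATGAACT and the x/y coordinates as 49845 and 1056, and `strip_umi` returns the same header with the eighth field dropped.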
Thanks for this report... JGI doesn't use UMIs, so I haven't seen them before in Illumina reads. I've reproduced the error and am modifying my header parsers to support UMIs, so this will work correctly in the next release.
All fixed; this will be released in BBTools 39.15, along with the new flags "umi" and "umisubs", so that you can require reads to be classified as duplicates only if their UMIs match.
:-O Blazing fast!
Thanks a lot! We will typically run this with paired-end data. Let's see how it goes. The flag to classify reads as duplicates only if their UMIs match will be useful too. There are a lot of ways to decide whether two different UMIs are the same, but this basic method (exact identity) will be more than enough for a quick classification of non-aligned data.
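For illustration, exact identity and a simple relaxation of it (tolerating a fixed number of substitutions) can be sketched in Python as below. This is not how BBTools implements it. The function names and the `max_subs` parameter are hypothetical, loosely inspired by the name of the "umisubs" flag, and the sketch assumes that UMIs of different lengths never match and that N is compared like any other character.

```python
def umi_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length UMIs differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def umis_match(a: str, b: str, max_subs: int = 0) -> bool:
    """True if the UMIs are identical or differ by at most max_subs substitutions.

    max_subs=0 is the exact-identity rule; larger values tolerate
    sequencing errors inside the UMI. Both the name and the semantics
    are assumptions made for this sketch.
    """
    if len(a) != len(b):
        return False
    return umi_mismatches(a, b) <= max_subs


if __name__ == "__main__":
    print(umis_match("TCATGAACT", "TCATGAACT"))              # identical UMIs
    print(umis_match("TCATGAACT", "TCATGAACA"))              # one substitution, exact rule
    print(umis_match("TCATGAACT", "TCATGAACA", max_subs=1))  # one substitution tolerated
```

With exact identity, a single sequencing error in the UMI is enough to keep two otherwise identical reads from being called duplicates, which is the trade-off a substitution tolerance is meant to address.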
39.15 is out now.