#71 Issues when reading IDs with UMIs

Milestone: 1.0

Status: closed

Owner: nobody

Labels: None

Updated: 2025-01-15

Created: 2025-01-10

Creator: Jordi Camps

Private: No

When processing Illumina reads whose IDs contain UMIs, parsing of the read ID fails because of the UMI.

Let's say we have a FASTQ file with the following contents:

@LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
GNTGGTGTGTGGTTTGGTGTGTTTCAAGGTCAGAACAGGTTTTTTTGTTTTTGTTTTTTGTTCTTTGTTTTTTTT
+
9#IIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Running the command

clumpify.sh -Xmx22g in=test.umi.fastq out=test.umi.dup.fastq addcount=t optical=t

fails with the following error:

Executing clump.Clumpify [-Xmx22g, in=test.umi.fastq, out=test.umi.dup.fastq, addcount=t, optical=t]
Version 39.13

Read Estimate:          3
Memory Estimate:        0 MB
Memory Available:       18434 MB
Set groups to 1
Executing clump.KmerSort1 [in1=test.umi.fastq, in2=, out1=test.umi.dup.fastq, out2=, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Exception in thread "Thread-4" java.lang.AssertionError: LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
        at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:94)
        at clump.ReadKey.<init>(ReadKey.java:46)
        at clump.ReadKey.<init>(ReadKey.java:33)
        at clump.ReadKey.makeKey(ReadKey.java:23)
        at clump.KmerComparator.hash_inner(KmerComparator.java:78)
        at clump.KmerComparator.hash(KmerComparator.java:69)
        at clump.KmerComparator.hash(KmerComparator.java:65)
        at clump.KmerSort$FetchThread1.run(KmerSort.java:429)
Fetch time:     0.017 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.000 seconds.
Exception in thread "main" java.lang.AssertionError: 0, 1, 1, 1, false
        at clump.KmerSort.fetchReads1(KmerSort.java:327)
        at clump.KmerSort1.processInner(KmerSort1.java:323)
        at clump.KmerSort1.process(KmerSort1.java:286)
        at clump.KmerSort1.main(KmerSort1.java:44)
        at clump.Clumpify.process(Clumpify.java:263)
        at clump.Clumpify.main(Clumpify.java:47)

The error goes away as soon as the UMI is removed, so everything points to the UMI as the cause of the issue.
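
Since removing the UMI makes the error disappear, one stopgap on a version that shows this error is to strip the UMI field from each read ID before running Clumpify. The following is only a minimal Python sketch under stated assumptions: records are exactly four lines each, and the UMI consists of A/C/G/T/N (optionally two parts joined by "+") appended as an eighth colon-separated field. The names `strip_umi` and `strip_umis_from_fastq` are hypothetical helpers, not part of BBTools. Note that this discards the UMI from the output.

```python
import re

# Read ID with eight colon-separated fields, the last one being the UMI.
# Assumption: the UMI contains only A/C/G/T/N, optionally as two parts joined by "+".
_UMI_FIELD = re.compile(r"^(@(?:[^:\s]+:){6}[^:\s]+):([ACGTN]+(?:\+[ACGTN]+)?)$")


def strip_umi(header: str) -> str:
    """Return a FASTQ header line (without newline) with the UMI field removed."""
    read_id, sep, comment = header.rstrip("\n").partition(" ")
    match = _UMI_FIELD.match(read_id)
    if match:
        read_id = match.group(1)  # keep only the first seven fields
    return read_id + sep + comment


def strip_umis_from_fastq(lines):
    """Yield FASTQ lines with the UMI removed from every header line.

    Assumes four-line records, so headers are lines 0, 4, 8, ...
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield strip_umi(line) + "\n"
        else:
            yield line
```

For the record above, `strip_umi` would turn the header into `@LH00000:5:11JJJ3LT4:8:1101:49845:1056 1:N:0:NTTGCTGT+NAATGCGA`; headers without a UMI are left unchanged.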

Discussion

Brian Bushnell - 2025-01-10

Thanks for this report. JGI doesn't use UMIs, so I hadn't seen them in Illumina reads before. I've reproduced the error and am modifying my header parsers to support UMIs, so this will work correctly in the next release.
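
For readers wondering what UMI support in a header parser involves: the stack trace points at hiseq.FlowcellCoordinate.setFrom, which suggests the failure happens while extracting flowcell coordinates from the read ID. The sketch below is not the BBTools code. It is a small Python illustration that accepts a read ID with either seven colon-separated fields or eight (the eighth being the UMI). The type `IlluminaReadId` and the function `parse_read_id` are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    umi: Optional[str]  # None when the read ID carries no UMI field


def parse_read_id(header: str) -> IlluminaReadId:
    """Parse the read ID part of a FASTQ header, with or without a UMI."""
    # Drop the leading "@" and ignore the comment after the first whitespace.
    read_id = header.lstrip("@").split()[0]
    fields = read_id.split(":")
    if len(fields) not in (7, 8):
        raise ValueError(f"unexpected number of fields in read ID: {header!r}")
    instrument, run, flowcell = fields[:3]
    lane, tile, x, y = (int(f) for f in fields[3:7])
    # An eighth field, when present, is treated as the UMI.
    umi = fields[7] if len(fields) == 8 else None
    return IlluminaReadId(instrument, run, flowcell, lane, tile, x, y, umi)
```

With the header from the report, this yields lane 8, tile 1101, x 49845, y 1056 and the UMI TCATGAACT. The same header without the UMI parses to identical coordinates with `umi` set to `None`.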

Brian Bushnell - 2025-01-10

All fixed; this will be released in BBTools 39.15, along with the new flags "umi" and "umisubs", so that you can require reads to be classified as duplicates only if their UMIs match.

Brian Bushnell - 2025-01-10

status: open --> closed

Jordi Camps - 2025-01-10

:-O Blazing fast!
Thanks a lot! We will typically run this with paired-end data. Let's see how it goes. The flag to classify reads as duplicates only if their UMIs match will be useful too. There are many ways to decide whether two different UMIs are the same, but this basic method (exact identity) will be more than enough for a quick classification of non-aligned data.
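
To make the difference between exact identity and a more tolerant rule concrete, here is a small Python sketch. It is an illustration only, not how BBTools compares UMIs. `umi_mismatches` and `umis_match` are hypothetical names, and `max_subs` is a hypothetical parameter; whether it behaves like the "umisubs" flag is an assumption. With `max_subs=0` the rule reduces to exact identity.

```python
def umi_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length UMIs differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def umis_match(a: str, b: str, max_subs: int = 0) -> bool:
    """Treat two UMIs as the same if they differ by at most max_subs substitutions.

    UMIs of different lengths never match in this sketch.
    """
    return len(a) == len(b) and umi_mismatches(a, b) <= max_subs
```

For example, TCATGAACT and TCATGAACA differ at one position, so they match only when `max_subs` is at least 1.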

Brian Bushnell - 2025-01-15

39.15 is out now.
