#71 Issues when reading IDs with UMIs

Milestone: 1.0
Status: closed
Owner: nobody
Labels: None
Updated: 2025-01-15
Created: 2025-01-10
Creator: Jordi Camps
Private: No

When processing Illumina reads whose IDs contain UMIs, parsing of the read ID fails because of the UMI field.

Let's say we have a FASTQ file with the following contents:

@LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
GNTGGTGTGTGGTTTGGTGTGTTTCAAGGTCAGAACAGGTTTTTTTGTTTTTGTTTTTTGTTCTTTGTTTTTTTT
+
9#IIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Running the command

clumpify.sh -Xmx22g in=test.umi.fastq out=test.umi.dup.fastq addcount=t optical=t

leads to an error like this:

Executing clump.Clumpify [-Xmx22g, in=test.umi.fastq, out=test.umi.dup.fastq, addcount=t, optical=t]
Version 39.13

Read Estimate:          3
Memory Estimate:        0 MB
Memory Available:       18434 MB
Set groups to 1
Executing clump.KmerSort1 [in1=test.umi.fastq, in2=, out1=test.umi.dup.fastq, out2=, groups=1, ecco=false, rename=false, shortname=f, unpair=false, repair=false, namesort=false, ow=true]

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Starting cris 0.
Fetching reads.
Making fetch threads.
Starting threads.
Waiting for threads.
Exception in thread "Thread-4" java.lang.AssertionError: LH00000:5:11JJJ3LT4:8:1101:49845:1056:TCATGAACT 1:N:0:NTTGCTGT+NAATGCGA
        at hiseq.FlowcellCoordinate.setFrom(FlowcellCoordinate.java:94)
        at clump.ReadKey.<init>(ReadKey.java:46)
        at clump.ReadKey.<init>(ReadKey.java:33)
        at clump.ReadKey.makeKey(ReadKey.java:23)
        at clump.KmerComparator.hash_inner(KmerComparator.java:78)
        at clump.KmerComparator.hash(KmerComparator.java:69)
        at clump.KmerComparator.hash(KmerComparator.java:65)
        at clump.KmerSort$FetchThread1.run(KmerSort.java:429)
Fetch time:     0.017 seconds.
Closing input stream.
Combining thread output.
Combine time:   0.000 seconds.
Exception in thread "main" java.lang.AssertionError: 0, 1, 1, 1, false
        at clump.KmerSort.fetchReads1(KmerSort.java:327)
        at clump.KmerSort1.processInner(KmerSort1.java:323)
        at clump.KmerSort1.process(KmerSort1.java:286)
        at clump.KmerSort1.main(KmerSort1.java:44)
        at clump.Clumpify.process(Clumpify.java:263)
        at clump.Clumpify.main(Clumpify.java:47)

The error goes away as soon as the UMI is removed, so everything points to the presence of the UMI as the cause of the issue.
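
For anyone stuck on a version that still fails, the workaround described above (removing the UMI) can be sketched in Python. This is only an illustration, not part of BBTools: the function names `split_umi` and `strip_umis` are hypothetical, and the sketch assumes the read name has the layout instrument:run:flowcell:lane:tile:x:y with the UMI as an optional eighth colon-separated field, as in the example record.

```python
import re

# Assumption: the UMI is the 8th colon-separated field of the read name and
# consists of bases only (optionally two UMIs joined by '+').
UMI_FIELD = re.compile(r"^[ACGTN]+(\+[ACGTN]+)?$")


def split_umi(header):
    """Return (header_without_umi, umi_or_None) for one FASTQ header line."""
    # The read name ends at the first space; the rest is the comment part.
    name, sep, comment = header.partition(" ")
    fields = name.split(":")
    umi = None
    if len(fields) == 8 and UMI_FIELD.match(fields[7]):
        umi = fields.pop()
    return ":".join(fields) + sep + comment, umi


def strip_umis(lines):
    """Yield FASTQ lines with the UMI removed from each header.

    Assumes strict 4-line records, so headers are identified by position
    rather than by a leading '@' (quality lines may also start with '@').
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            line, _ = split_umi(line)
        yield line
```

Applied to the header above, `split_umi` would return the name ending in `:1056` plus the unchanged `1:N:0:NTTGCTGT+NAATGCGA` part, together with the UMI `TCATGAACT`. Note that stripping discards the UMI, so it cannot be used later for duplicate classification.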

Discussion

  • Brian Bushnell

    Brian Bushnell - 2025-01-10

    Thanks for this report... JGI doesn't use UMIs, so I haven't seen them before in Illumina reads. I've reproduced the error and am modifying my header parsers to support UMIs, so this will work correctly in the next release.

     
  • Brian Bushnell

    Brian Bushnell - 2025-01-10

    All fixed; this will be released in BBTools 39.15, along with the new flags "umi" and "umisubs", so that you can require reads to be classified as duplicates only if their UMIs match.

     
  • Brian Bushnell

    Brian Bushnell - 2025-01-10
    • status: open --> closed
     
  • Jordi Camps

    Jordi Camps - 2025-01-10

    :-O Blazing fast!
    Thanks a lot! We will typically run this with paired-end data. Let's see how it goes. The flag to classify reads as duplicates only if their UMIs match will be useful too. There are many ways to decide whether two different UMIs are the same, but this basic method (exact identity) will be more than enough for a quick classification of non-aligned data.

     
  • Brian Bushnell

    Brian Bushnell - 2025-01-15

    39.15 is out now.
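
The exact-identity comparison mentioned in the discussion above, and one common relaxation of it (tolerating a few substitutions), can be sketched in Python as follows. This is an illustration only and not a description of how the "umi" or "umisubs" flags are implemented in BBTools; the names `umi_mismatches`, `umis_match` and `max_subs` are hypothetical.

```python
def umi_mismatches(a, b):
    """Count positions where two equal-length UMIs differ.

    Returns None when the lengths differ, since this sketch only models
    substitutions, not insertions or deletions.
    """
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def umis_match(a, b, max_subs=0):
    """True if the UMIs differ by at most max_subs substitutions.

    max_subs=0 is the exact-identity rule; larger values tolerate
    sequencing errors in the UMI at the cost of merging more distinct UMIs.
    """
    d = umi_mismatches(a, b)
    return d is not None and d <= max_subs
```

With the default `max_subs=0`, two reads would count as sharing a UMI only if the UMI strings are identical, which is the basic method described in the thread.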

     
