Menu

#69 reformat.sh vpair fails matching reads.

1.0
open
nobody
None
2024-05-15
2024-05-15
No

Hi there.
I have been using bbmap for some time and love it. I am encountering an issue in my FASTQ qc pipeline where vpair raises that the following reads on my R1 and R2 file do not match.

Input is being processed as paired
Names do not appear to be correctly paired.
VH01092:63:AAFLNLCM5:1:1101:23003:1000:NTAATA:GTAATA/1 1:N:0:CCGATAGTCG+ACGTGAACAT
VH01092:63:AAFLNLCM5:1:1101:23003:1000:NTAATA:GTAATA/2 2:N:0:CCGATAGTCG+ACGTGAACAT

As far as I understand this is valid FASTQ format. "Field 1 begins with a '@' character and is followed by a sequence identifier and an optional description." I believe this optional description to be the issue.

While debugging I tried removing the optional description entirely and vpair raised no errors. Is there any way we could investigate this behavior? Is it possible that /1 and /2 are special delimiters and they are partially at fault for this error being raised? I would like to avoid trimming the optional description for bbmap usage and then needing to subsequently add it again.

Happy to discuss further and provide FASTQs (each is ~ 0.5 Gb). Thank you!

Discussion

  • Max Rozenblum

    Max Rozenblum - 2024-05-15

    I decided to subset my FASTQ to a single read so the files are more than manageable to demonstrate the issue:
    - bad_R* -> this FASTQ pair is the read as shown in the post above. This fails with vpair enabled.
    - bad-no-desc_R* -> this FASTQ pair is the same read where the optional description (text after the space) has been trimmed. This succeeds with vpair enabled.
    - bad-no-trail_R* -> this FASTQ pair is the same read except the /1 and /2 has been removed from the sequence identifier. This succeeds with vpair enabled.
    - bad-no-trail-no-desc_R* -> this has both the /1 and /2, and the optional description removed. As we would expect, this also succeeds with vpair enabled.

     

    Last edit: Max Rozenblum 2024-05-15
  • Brian Bushnell

    Brian Bushnell - 2024-05-15

    The problem here is that the read headers differ in two places. Normally, Illumina uses one of these two formats:

    @stuff/1
    @stuff/2

    or

    @stuff 1:morestuff
    @stuff 2:morestuff

    Of these, the /1 and /2 is obsolete for Illumina as far as I know, though Complete Genomics / BGI are adopting it. My effort to determine pairing is based on observation of Illumina data since there is no formal fastq specification regarding pair naming conventions, and they usually put the read identifier in the "optional description". So, I require one of those two formats where "stuff"="stuff" and "morestuff"="morestuff", and I've never observed "/1" "/2" and " 1:" " 2:" both used in the same headers. I guess the best thing to do would be to ignore everything following the whitespace if /1/2 are detected, so I'll modify the program to do that.

    I'm curious where these headers are coming from, though. Is this output from Illumina software, or modified in some way?

     

    Last edit: Brian Bushnell 2024-05-15
    • Max Rozenblum

      Max Rozenblum - 2024-05-15

      I will check in with the lab, but my understanding is that these came from a NextSeq or NovaSeq and didn't have any modifications. Thanks for the quick response. I took a peak at FASTQ.java and saw the following code block:

      // Here we try to weed out PacBio, which will differ after the last slash:
                  for (int i = idxSlash1 + 2; i < len1; i++) {
                      if (id1.charAt(i) != id2.charAt(i)) {
                          return false;
                      }
                  }
      

      I am using reformat.sh to do the following:
      - make sure reads are paired
      - count the number of reads/bases
      - downsample the reads using samplebasestarget

      with this in mind, is there any reason I couldn't use PacBio reads? (Not a common occurance but just looking to clarify)

       
      • Brian Bushnell

        Brian Bushnell - 2024-05-15

        When I wrote that, PacBio did not have paired reads. They have a new sequencing machine now for short reads that I think does produce pairs but I have not seen any data for it so I'm not sure of the header structure.

         

Log in to post a comment.