BBMap / Tickets / #69 reformat.sh vpair fails matching reads.

Max Rozenblum - 2024-05-15

I decided to subset my FASTQ to a single read so the files are more than manageable to demonstrate the issue:
- bad_R* -> this FASTQ pair is the read as shown in the post above. This fails with vpair enabled.
- bad-no-desc_R* -> this FASTQ pair is the same read where the optional description (text after the space) has been trimmed. This succeeds with vpair enabled.
- bad-no-trail_R* -> this FASTQ pair is the same read except that the /1 and /2 have been removed from the sequence identifier. This succeeds with vpair enabled.
- bad-no-trail-no-desc_R* -> this FASTQ pair has both the /1 and /2 and the optional description removed. As we would expect, this also succeeds with vpair enabled.

Last edit: Max Rozenblum 2024-05-15

bad-no-desc_R1.fastq.gz

bad-no-desc_R2.fastq.gz

bad-no-trail-no-desc_R1.fastq.gz

bad-no-trail-no-desc_R2.fastq.gz

bad-no-trail_R1.fastq.gz

bad-no-trail_R2.fastq.gz

bad_R1.fastq.gz

bad_R2.fastq.gz

Brian Bushnell - 2024-05-15

The problem here is that the read headers differ in two places. Normally, Illumina uses one of these two formats:

@stuff/1
@stuff/2

or

@stuff 1:morestuff
@stuff 2:morestuff

Of these, the /1 and /2 convention is obsolete for Illumina as far as I know, though Complete Genomics / BGI are adopting it. My pairing detection is based on observation of Illumina data, since there is no formal FASTQ specification for pair naming conventions, and Illumina usually puts the read identifier in the "optional description". So, I require one of those two formats where "stuff"="stuff" and "morestuff"="morestuff", and I've never observed "/1" "/2" and " 1:" " 2:" both used in the same headers. I guess the best thing to do would be to ignore everything following the whitespace if /1 and /2 are detected, so I'll modify the program to do that.
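
The rule described above, including the proposed change of ignoring the description when /1 and /2 are present, could be sketched roughly as follows. This is an illustrative sketch only, not the actual FASTQ.java code: the class name, method names, and the example headers in `main` are hypothetical, and the fallback that accepts identical headers is an assumption based on the bad-no-trail-no-desc pair passing.

```java
final class HeaderPairSketch {

    /** Returns the part of a header before the first space (the identifier). */
    static String beforeSpace(String s) {
        int i = s.indexOf(' ');
        return i < 0 ? s : s.substring(0, i);
    }

    /** Returns the part of a header after the first space (the optional description). */
    static String afterSpace(String s) {
        int i = s.indexOf(' ');
        return i < 0 ? "" : s.substring(i + 1);
    }

    /** Hypothetical check: do two FASTQ header lines look like mates? */
    static boolean looksPaired(String h1, String h2) {
        String id1 = beforeSpace(h1), id2 = beforeSpace(h2);
        String desc1 = afterSpace(h1), desc2 = afterSpace(h2);

        // Format "@stuff/1" and "@stuff/2": compare only the part before the
        // slash and ignore everything after the whitespace.
        if (id1.endsWith("/1") && id2.endsWith("/2")) {
            String stem1 = id1.substring(0, id1.length() - 2);
            String stem2 = id2.substring(0, id2.length() - 2);
            return stem1.equals(stem2);
        }

        // Format "@stuff 1:morestuff" and "@stuff 2:morestuff":
        // "stuff" must equal "stuff" and "morestuff" must equal "morestuff".
        if (desc1.startsWith("1:") && desc2.startsWith("2:")) {
            return id1.equals(id2)
                && desc1.substring(2).equals(desc2.substring(2));
        }

        // Assumption: with neither convention present, identical headers match.
        return h1.equals(h2);
    }

    public static void main(String[] args) {
        // Hypothetical headers that mix both conventions, like the failing pair.
        System.out.println(looksPaired("@read7/1 1:N:0:ACGT", "@read7/2 2:N:0:ACGT"));
        // A pair whose names differ before the slash.
        System.out.println(looksPaired("@read7/1", "@read8/2"));
    }
}
```

Under these assumptions, all four variants of the attached read (with or without /1 and /2, with or without the description) would be accepted as pairs, while headers whose "stuff" or "morestuff" parts differ would still be rejected.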

I'm curious where these headers are coming from, though. Is this output from Illumina software, or modified in some way?

Last edit: Brian Bushnell 2024-05-15

- Max Rozenblum - 2024-05-15
  
  I will check in with the lab, but my understanding is that these came from a NextSeq or NovaSeq and didn't have any modifications. Thanks for the quick response. I took a peek at FASTQ.java and saw the following code block:
  
  // Here we try to weed out PacBio, which will differ after the last slash:
  for (int i = idxSlash1 + 2; i < len1; i++) {
      if (id1.charAt(i) != id2.charAt(i)) {
          return false;
      }
  }
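
  To see why a loop like the one quoted above rejects headers that carry both /1 and /2 and a " 1:" / " 2:" description, here is a small standalone sketch. It assumes that `idxSlash1` is the index of the last slash in the first header and that `len1` is that header's length; the surrounding class, the method name, and the equal-length check are hypothetical and are not taken from FASTQ.java.

  ```java
  final class TrailingCompareSketch {

      /** Hypothetical wrapper around the quoted loop. */
      static boolean sameAfterSlash(String id1, String id2) {
          int idxSlash1 = id1.lastIndexOf('/'); // assumed meaning of idxSlash1
          int len1 = id1.length();              // assumed meaning of len1
          // Assumption: headers without a slash or of unequal length do not match.
          if (idxSlash1 < 0 || len1 != id2.length()) {
              return false;
          }
          // Skip the slash and the read number, then require identical text.
          for (int i = idxSlash1 + 2; i < len1; i++) {
              if (id1.charAt(i) != id2.charAt(i)) {
                  return false;
              }
          }
          return true;
      }

      public static void main(String[] args) {
          // Nothing follows the read number, so the loop never finds a difference.
          System.out.println(sameAfterSlash("@read7/1", "@read7/2"));
          // The descriptions start with "1:" and "2:", so the loop finds a difference.
          System.out.println(sameAfterSlash("@read7/1 1:N:0:ACGT", "@read7/2 2:N:0:ACGT"));
      }
  }
  ```

  In this sketch the first pair passes and the second fails, because the text after the read number differs at the "1:" versus "2:" position.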
  
  I am using reformat.sh to do the following:
  - make sure reads are paired
  - count the number of reads/bases
  - downsample the reads using samplebasestarget
  
  With this in mind, is there any reason I couldn't use PacBio reads? (Not a common occurrence, but just looking to clarify.)
  
  - Brian Bushnell - 2024-05-15
    
    When I wrote that, PacBio did not have paired reads. They have a new sequencing machine now for short reads that I think does produce pairs but I have not seen any data for it so I'm not sure of the header structure.
    