Good morning,
I use reformat.sh in my bioinformatics workflow to count reads and bases, as well as to verify FASTQ format. I encountered a case where the checksums of the input FASTQ and the output FASTQ did not match. I confirmed that the input FASTQ had a zlib compression level of 6, which I was already passing as zl=6 in my reformat.sh parameters. After examining the diff between the input and output FASTQs, I noticed that the sequences were identical but the quality scores were not. I stumbled across mincalledquality and maxcalledquality and figured these were the issue, so I set mincalledquality=0 and maxcalledquality=80. I reran reformat.sh on my input FASTQ, but the checksums still did not match. After looking at the diff again (much smaller this time), I found that ambiguous bases (N), regardless of their input base quality score, get squashed to 0 (!, ASCII 33). Even if I set mincalledquality to a nonzero value, ambiguous base quality scores are forced to 0.
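For anyone who wants to reproduce the comparison, here is a minimal Python sketch of how the differing quality scores can be located. It assumes strict 4-line FASTQ records and that both files list reads in the same order. The function names (`content_md5`, `read_records`, `quality_differences`) are my own and are not part of BBTools.

```python
import gzip
import hashlib


def open_text(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def content_md5(path):
    """MD5 of the decompressed bytes, so gzip headers and levels do not matter."""
    opener = gzip.open if str(path).endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_records(path):
    """Yield (header, sequence, quality), assuming 4 lines per record."""
    with open_text(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header, seq, qual


def quality_differences(path_a, path_b):
    """Yield (header, position, base, qual_a, qual_b) wherever qualities differ."""
    for rec_a, rec_b in zip(read_records(path_a), read_records(path_b)):
        header, seq, qual_a = rec_a
        _, _, qual_b = rec_b
        for pos, (base, qa, qb) in enumerate(zip(seq, qual_a, qual_b)):
            if qa != qb:
                yield header, pos, base, qa, qb


# Example usage (hypothetical file names):
# for diff in quality_differences("input.fastq.gz", "output.fastq"):
#     print(diff)
```

Hashing the decompressed content rather than the .gz file itself separates differences in the reads from differences introduced by compression settings.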
My goal is for the input and output FASTQs to have identical checksums. To accomplish this, I'd like to ask whether there is a way to tell reformat.sh not to touch the base quality scores at all. Quality scores for ambiguous bases (N) can vary between sequencing facilities, so setting a default quality score for ambiguous bases would not accomplish my goal.
Is this feasible? Let me know what you think. I am attaching a snippet of a synthetic read below to demonstrate:
$ zcat fake-read_R1.fastq.gz
@0/1
TNTAT
+
F#FFF
And here is what reformat.sh outputs:
$ reformat.sh in=HG00101-singleread_R1.fastq.gz mincalledquality=0 maxcalledquality=80 zl=6 out=stdout.fastq silent
@0/1
TNTAT
+
F!FFF
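To make the observed change concrete, here is a small Python sketch that mimics it on the synthetic read. This only illustrates the behavior I saw in the output, not how reformat.sh is actually implemented. It assumes the standard Phred+33 encoding, and `squash_n_qualities` is a hypothetical helper name.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred(char):
    """Convert one quality character to its numeric Phred score."""
    return ord(char) - PHRED_OFFSET


def squash_n_qualities(seq, qual):
    """Return qual with the score at every N position replaced by Phred 0 ('!')."""
    zero = chr(PHRED_OFFSET)
    return "".join(zero if base == "N" else q for base, q in zip(seq, qual))


seq, qual = "TNTAT", "F#FFF"
print(squash_n_qualities(seq, qual))  # F!FFF
print(phred("#"), phred("!"))         # 2 0
```

In the example read, the N originally carries a score of 2 (#), which is what gets replaced by 0 (!).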
Clarifying that I missed cq=f. This can be closed.