The attached patch solves two issues in generating fastq files from Solid SRF files.
1) Added -es and -eq flags. To properly generate fastq from Solid, one needs to include explicit regions (representing the Primer) in the Sequence only. By default the -e flag prepends the explicit region to both the Sequence and Quality values in the fastq. Since we only want it in the Sequence, I added -es (explicit sequence) and -eq (explicit quality) to give the user full control. One can specify neither, either or both, or use the -e option, which is equivalent to specifying both -es and -eq. For robustness, a user can specify -e alongside -eq and/or -es and things will still work as though they had only specified -e. The functionality of the -e flag remains unmodified, to preserve backward compatibility.
2) Support for negative qualities. The Solid qual file contains qualities of -1. Previously, these would be incremented by 33 (the ASCII code of '!'), giving 32, which is a space (' '). The problem is that a space is not a valid Sanger quality character. As such, I am setting all negative scores to zero before adding 33. A Solid score of -1 therefore becomes ASCII 33 ('!'), which is what we want, since '!' in Sanger encoding represents a quality score of 0 (the lowest possible quality).
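To make the intended flag behaviour concrete, here is a minimal Python sketch (the patch itself is against the C source of srf2fastq, and the function names below are hypothetical, chosen only for illustration). It assumes -e simply implies both -es and -eq, as described above.

```python
def resolve_explicit_flags(e=False, es=False, eq=False):
    """Return (prepend_to_sequence, prepend_to_quality).

    -e implies both -es and -eq, so combining -e with either of the
    finer-grained flags behaves exactly like -e alone.
    """
    if e:
        return True, True
    return es, eq


def format_record(name, explicit_seq, explicit_qual, seq, qual,
                  e=False, es=False, eq=False):
    """Build one fastq record, prepending the explicit region as requested."""
    to_seq, to_qual = resolve_explicit_flags(e, es, eq)
    out_seq = (explicit_seq if to_seq else "") + seq
    out_qual = (explicit_qual if to_qual else "") + qual
    return f"@{name}\n{out_seq}\n+\n{out_qual}\n"
```

With only -es set, the primer appears on the sequence line but nothing is added to the quality line, which is the Solid-style layout I am after.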
In doing #2, I also moved the calculation of quality scores to a separate function called getQual(). This will clean things up a little and make it easier to modify quality score conversions as needed in the future.
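For reference, the clamping logic amounts to something like the following Python sketch. This is only an illustration of the idea, not the C code in the patch; get_qual here mirrors the name getQual() but its signature is an assumption.

```python
SANGER_OFFSET = 33  # ASCII code of '!', the lowest Sanger quality character


def get_qual(score):
    """Convert one numeric quality to its Sanger character.

    Negative scores (e.g. the -1 found in Solid qual files) are clamped
    to zero first, so they map to '!' rather than to a space.
    """
    clamped = max(score, 0)
    return chr(clamped + SANGER_OFFSET)


def encode_quals(scores):
    """Encode a list of numeric qualities as a Sanger quality string."""
    return "".join(get_qual(s) for s in scores)
```

Keeping the conversion in one place means a different scale or offset only needs a change in get_qual.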
For question 1, I can only see this leading to complete breakage. Do I understand correctly that you wish to allow the explicit sequence to only be added to one line and not the other, leading to different line lengths?
Fastq files have the same number of base calls as quality values. Allowing for these to differ by the addition of explicit sequence to one and not the other will break all the fastq parsers I know of (some silently, some not so).
Furthermore, due to the line-wrapping permitted in fastq, it's not technically possible to even parse the format when the number of confidence values differs from the number of bases, unless the data is changed to remove the possibility of @ appearing in the quality strings. (Fortunately this isn't a problem for existing short-read technologies, as the sequences are short and people aren't line-wrapping now, but it was commonplace when fastq was initially invented for capillary data.)
Fix 2 sounds like good protection against buggy input, but I'd argue the input is indeed broken. The CNF1 and CNF4 chunks in SRF/ZTR are documented to be phred scale by default unless the SCALE meta-data tag is used to define them to be log-odds format. Negative values are impossible in phred scale as they would imply that the probability of the base being in error is higher than 1. For log-odds we naturally expect both positives and negatives, although srf2fastq should do the conversion appropriately between scales before outputting the data.
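To illustrate the parsing problem, here is a rough Python sketch of how a reader of line-wrapped fastq has to work (parse_wrapped_fastq is a hypothetical helper, not code from any particular parser). Because a quality line may legitimately begin with @, the only reliable way to know where the quality string ends is to read until it is as long as the sequence.

```python
def parse_wrapped_fastq(lines):
    """Parse fastq records whose sequence and quality may span several lines.

    Returns a list of (name, sequence, quality) tuples and raises
    ValueError when the quality length cannot be matched to the sequence.
    """
    records = []
    it = iter(lines)
    line = next(it, None)
    while line is not None:
        line = line.rstrip("\n")
        if not line:  # skip blank lines between records
            line = next(it, None)
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got {line!r}")
        name = line[1:]

        # Sequence lines run until the '+' separator.
        seq_parts = []
        line = next(it, None)
        while line is not None and not line.startswith("+"):
            seq_parts.append(line.rstrip("\n"))
            line = next(it, None)
        if line is None:
            raise ValueError(f"record {name!r} has no '+' line")
        seq = "".join(seq_parts)

        # Quality lines may start with '@', so the sequence length is the
        # only terminator we can trust.
        qual = ""
        while len(qual) < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError(f"record {name!r} has truncated quality")
            qual += line.rstrip("\n")
        if len(qual) != len(seq):
            raise ValueError(f"record {name!r}: quality/sequence length mismatch")

        records.append((name, seq, qual))
        line = next(it, None)
    return records
```

If the quality string is one character shorter than the sequence, the loop swallows the next record's header as quality data, which is exactly the kind of breakage I'm worried about.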
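As a sketch of what "converting appropriately between scales" means, assuming the usual definitions (phred Q = -10 log10(p), log-odds Q = -10 log10(p / (1 - p)), with p the error probability), the relationship can be written in Python as below. The function names are hypothetical and this is not the srf2fastq implementation.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred-scaled quality."""
    return 10 ** (-q / 10)


def log_odds_to_phred(q_lo):
    """Convert a log-odds quality to the phred scale.

    From q_lo = -10*log10(p / (1 - p)) we get p = 1 / (1 + 10**(q_lo/10)),
    hence phred = -10*log10(p) = 10*log10(10**(q_lo/10) + 1).
    """
    return 10 * math.log10(10 ** (q_lo / 10) + 1)
```

A phred value of -1 gives phred_to_error_prob(-1) > 1, which is why it cannot be a genuine phred score, whereas a negative log-odds value converts to a small but still positive phred value.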
Is this solid srf file a public one I can download from NCBI or EBI to test with?
With regard to #1, this was to accommodate the format of Solid output. It is not to say it makes sense, but rather to follow their pattern. For a 50-base run, Solid will output a 51-character line for each sequence entry in the csfasta (Primer + 50 color-space calls). In the qual file, it does not output anything for the Primer, only the 50 qual values.
If anything, it seems more sensible to make the first qual entry '!', rather than the primer, to imply it is 0 quality.
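Roughly what I have in mind, as a Python sketch (solid_to_csfastq and its parameters are made up for illustration and are not part of the patch): the csfasta entry carries the primer plus the color calls, the qual entry carries one value per color call, and the quality line is optionally padded with '!' for the primer position.

```python
def solid_to_csfastq(name, csfasta_seq, qual_values, pad_primer_quality=True):
    """Combine one csfasta entry and its qual entry into a fastq record.

    csfasta_seq is the primer base followed by the color calls;
    qual_values holds one integer per color call (no primer value).
    """
    colours = csfasta_seq[1:]  # everything after the primer base
    if len(colours) != len(qual_values):
        raise ValueError("expected one quality value per color call")
    # Clamp negatives (Solid emits -1) before applying the Sanger offset.
    qual = "".join(chr(max(q, 0) + 33) for q in qual_values)
    if pad_primer_quality:
        qual = "!" + qual  # quality 0 stands in for the primer position
    return f"@{name}\n{csfasta_seq}\n+\n{qual}\n"
```

With padding on, the sequence and quality lines come out the same length; with it off, the quality line is one character shorter, matching the raw Solid files.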
I will test your original -e option, and see if it makes either alignment or variant calling choke.
This makes a lot more sense now, plus I've had confirmation that bfast accepts N for Illumina, but presumably not for SOLiD.
I believe the correct output for srf2fastq is to "do the right thing", outputting N for DNA sequences and . for colour-space data. It can determine this by looking at the CSET meta-data in the SRF file.
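In sketch form (Python, purely illustrative), the choice looks something like this. The colour_space argument stands in for whatever is derived from the CSET meta-data, and treating '.' and '-' as the ambiguity characters to be rewritten is an assumption of this example rather than a statement about the SRF specification.

```python
def fix_ambiguity(calls, colour_space):
    """Pick the ambiguity character appropriate to the data type.

    Colour-space calls are passed through untouched so '.' survives;
    base-space calls have '.' and '-' rewritten to 'N'.
    """
    if colour_space:
        return calls
    return "".join("N" if c in ".-" else c for c in calls)
```

So fix_ambiguity("AC.T", False) yields an N in the third position, while a colour-space read such as "G0113.23" is returned unchanged.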
The primer base in quality data is an intriguing point as I believe it was added there specifically because that's what ABI were doing at the time. Odd. If there are multiple variants around with and without this base (it wouldn't surprise me) then I guess we need to support producing them. Mutter... fastq used to be so simple at one time!
See the changes in sourceforge revision 1999:
-----------------------------------------
Fixed negative quality values - these shouldn't happen when specifying
SCALE as PH(red), but -1 appears in ABI SOLiD files.
Fixed N vs . character. We no longer force ambiguity bases to N,
keeping . in use for SOLiD data.
Improved the use of the "-c" option. The program will now
automatically find the appropriate CNF chunk, so -c is only necessary
when both CNF1 and CNF4 are present and you wish to request data comes
from CNF1 instead.
-----------------------------------------
The last of these 3 isn't to do with this patch, but I added it out of frustration while testing the changes.
So essentially for now I'm ignoring the 1st part of the supplied patch while we work out what the official csfastq format requires (NCBI seem to have a different view of this). Obviously we'll incorporate the changes if needed so I'm not resolving this yet.
James,
It looks like in SVN head, for Solid quality scores, the first character is being converted to a !. Previously, this was supposed to be the primer.
Example:
@VAB_germcase55eva_20090817secondslide_germcase55081709secondslide1_52_424
T03112310320100310212302221210321121021031010222031
+
!00/).2/))4//,&3+3,0,(4068-7.'7'9$-)&16&.%5+''01(&0
@VAB_germcase55eva_20090817secondslide_germcase55081709secondslide1_52_424
G011311031310.23022111311002221200222121112.0311222
+
!,1).554*'+8/!2*74'/.57%.%4:#)#0$%####'$&&$!%*((#2*
At the very least it is a bug, since that should instead be the primer. Have you given any more thought to adding an option to allow me to strip the primer from quality? I am emailing you separately about that.
This is the confidence of the first base. I changed it to ! as that appears to be the format NCBI support, i.e. the SOLiD data files they export look like this.
I'll try and get this sorted as it's silly to have multiple competing formats. Time to email all the parties I think.
James