Staden Package / Patches / #8 srf2fastq.c: Fix support for Solid fastq generation

Jordan Mendler - 2010-02-03

srf2fastq.c.svn-trunk-2010-02-03.patch

James Bonfield - 2010-02-04

For question 1, I can only see this leading to complete breakage. Do I understand correctly that you wish to allow the explicit sequence to only be added to one line and not the other, leading to different line lengths?

Fastq files have the same number of base calls as quality values. Allowing for these to differ by the addition of explicit sequence to one and not the other will break all the fastq parsers I know of (some silently, some not so).

Furthermore, due to the line-wrapping permitted in fastq, it's not technically possible to even parse the format when the number of confidence values differs from the number of bases, unless the data is changed to remove the possibility of @ appearing in the quality strings. (Fortunately this isn't a problem for existing short-read technologies, as the sequences are short and people aren't line-wrapping now, but wrapping was commonplace when fastq was initially invented for capillary data.)
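
To illustrate the parsing point, here is a minimal sketch of a reader for line-wrapped fastq. It is not taken from srf2fastq; the function name `read_wrapped_fastq` is hypothetical. The reader can only tell where the quality block ends by counting characters against the sequence length, which is why the two counts have to agree.

```python
def read_wrapped_fastq(lines):
    """Yield (name, seq, qual) from fastq text whose records may be line-wrapped."""
    it = (line.rstrip("\n") for line in lines)
    header = next(it, None)
    while header is not None:
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header)
        name = header[1:]

        # Sequence lines run until the '+' separator.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' separator for " + name)
        seq = "".join(seq_parts)

        # A quality line may itself start with '@', so the only safe
        # stopping rule is "as many quality characters as bases".
        qual = ""
        while len(qual) < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError("truncated quality for " + name)
            qual += line
        if len(qual) != len(seq):
            raise ValueError("quality/sequence length mismatch for " + name)

        yield name, seq, qual
        header = next(it, None)
```

If the quality string were allowed to be one character shorter than the sequence, the loop above would swallow the next record's header while looking for the missing character.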

Fix 2 sounds like good protection against buggy input, but I'd argue the input is indeed broken. The CNF1 and CNF4 chunks in SRF/ZTR are documented to be phred scale by default unless the SCALE meta-data tag is used to define them to be log-odds format. Negative values are impossible in phred scale, as they would imply that the probability of the base being in error is higher than 1. For log-odds we naturally expect both positives and negatives, although srf2fastq should convert appropriately between scales before outputting the data.
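
The arithmetic behind this can be sketched as follows. The phred definition Q = -10 log10(p) is standard, and the log-odds conversion shown is the usual Solexa-style formula; whether srf2fastq applies exactly this conversion is an assumption, and both function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred score: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def logodds_to_phred(q_lo):
    """Convert a log-odds score, Q_lo = -10 * log10(p / (1 - p)), to phred."""
    return 10 * math.log10(10 ** (q_lo / 10) + 1)


# A negative phred value implies an error probability above 1.
print(phred_to_error_prob(-1))
# A negative log-odds value is legitimate and maps to a positive phred value.
print(logodds_to_phred(-5))
```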

Is this solid srf file a public one I can download from NCBI or EBI to test with?

Jordan Mendler - 2010-02-04

With regard to #1, this was to accommodate the format of Solid output. The point is not that the format makes sense, but to follow their pattern. For a 50-base run, Solid will output 51-character lines for each sequence entry in the csfasta (primer + 50-mer of colour-space sequence). The qual file omits the primer and contains only the 50 quality values.

If anything, it seems more sensible to make the first qual entry '!', rather than the primer, to imply it is 0 quality.
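
As a rough sketch of the two layouts under discussion, the hypothetical helper below builds one fastq record from a csfasta read and its already-encoded quality string, optionally padding the primer position with '!'. The function name and the `pad_primer` parameter are illustrative and are not part of srf2fastq.

```python
def solid_fastq_record(name, csfasta_seq, qual, pad_primer=True):
    """Build one fastq record from a csfasta read and its encoded qualities.

    csfasta_seq: primer base followed by colour calls, e.g. "T0311"
    qual: one encoded quality character per colour call (no primer entry)
    """
    if len(qual) != len(csfasta_seq) - 1:
        raise ValueError("expected one quality per colour call")
    if pad_primer:
        # Assumption: the primer gets a placeholder quality of 0 ('!'),
        # so that both lines end up the same length.
        qual = "!" + qual
    return f"@{name}\n{csfasta_seq}\n+\n{qual}\n"
```

With `pad_primer=False` the quality line is one character shorter than the sequence line, which is the layout that conventional fastq parsers would reject.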

I will test your original -e option, and see if it makes either alignment or variant calling choke.

James Bonfield - 2010-02-05

This makes a lot more sense now, plus I've had confirmation that bfast accepts N for Illumina, but presumably not for Solid.

I believe the correct output for srf2fastq is to "do the right thing", outputting N for DNA sequences and . for colour-space data. It can determine this by looking at the CSET meta-data in the SRF file.
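
A minimal sketch of that rule follows. The `colour_space` flag stands in for whatever srf2fastq derives from the CSET meta-data; how that value is read is not shown in this thread, so the flag and the function name are assumptions.

```python
def fix_ambiguity(seq, colour_space):
    """Use '.' for unknown calls in colour-space reads and 'N' in base-space reads."""
    if colour_space:
        # Leave the leading primer base alone; only colour calls are rewritten.
        return seq[:1] + seq[1:].replace("N", ".")
    return seq.replace(".", "N")
```

For example, `fix_ambiguity("AC.GT", False)` gives `"ACNGT"`, while a colour-space read such as `"G0113.2"` is returned unchanged.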

The primer base in quality data is an intriguing point as I believe it was added there specifically because that's what ABI were doing at the time. Odd. If there are multiple variants around with and without this base (it wouldn't surprise me) then I guess we need to support producing them. Mutter... fastq used to be so simple at one time!

James Bonfield - 2010-02-05

See the changes in sourceforge revision 1999:

-----------------------------------------
Fixed negative quality values - these shouldn't happen when specifying
SCALE as PH(red), but -1 appears in ABI SOLiD files.

Fixed N vs . character. We no longer force ambiguity bases to N,
keeping . in use for SOLiD data.

Improved the use of the "-c" option. The program will now
automatically find the appropriate CNF chunk, so -c is only necessary
when both CNF1 and CNF4 are present and you wish to request data comes
from CNF1 instead.
-----------------------------------------
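
The negative-quality fix described in the commit message can be sketched like this. It is an illustration, not the code from revision 1999: the function name is hypothetical, and the +33 offset and the upper clamp at 93 (so that output stays within printable ASCII) are assumptions.

```python
def encode_phred33(values):
    """Encode integer qualities as a fastq quality string, clamping bad values."""
    chars = []
    for q in values:
        # -1 appears in ABI SOLiD files; treat anything negative as quality 0.
        q = max(q, 0)
        # Assumption: cap at 93 so that the character never goes past '~'.
        q = min(q, 93)
        chars.append(chr(33 + q))
    return "".join(chars)


print(encode_phred33([-1, 0, 15, 40]))  # prints "!!0I"
```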

The last of these 3 isn't to do with this patch, but I added it out of frustration while testing the changes.

So essentially for now I'm ignoring the 1st part of the supplied patch while we work out what the official csfastq format requires (NCBI seem to have a different view of this). Obviously we'll incorporate the changes if needed so I'm not resolving this yet.

James Bonfield - 2010-02-05

assigned_to: nobody --> jkbonfield

Jordan Mendler - 2010-02-19

James,

It looks like in SVN head, for Solid quality scores, the first character is being converted to a !. Previously, this was supposed to be the primer.

Example:
@VAB_germcase55eva_20090817secondslide_germcase55081709secondslide1_52_424
T03112310320100310212302221210321121021031010222031
+
!00/).2/))4//,&3+3,0,(4068-7.'7'9$-)&16&.%5+''01(&0
@VAB_germcase55eva_20090817secondslide_germcase55081709secondslide1_52_424
G011311031310.23022111311002221200222121112.0311222
+
!,1).554*'+8/!2*74'/.57%.%4:#)#0$%####'$&&$!%*((#2*

At the very least it is a bug, since that should instead be the primer. Have you given any more thought to adding an option to allow me to strip the primer from quality? I am emailing you separately about that.
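
For anyone wanting to check their own output, here is a small sketch using the first record above. It confirms that the sequence and quality lines are the same length and that the first quality character decodes to 0, then drops that leading character to get the csfasta/qual-style pairing. Both helper names are hypothetical, and the +33 quality offset is assumed.

```python
def primer_quality(qual):
    """Decode the first quality character, assuming an offset of 33."""
    return ord(qual[0]) - 33


def strip_primer_quality(seq, qual):
    """Return the quality string without the entry that lines up with the primer."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return qual[1:]


seq = "T03112310320100310212302221210321121021031010222031"
qual = "!00/).2/))4//,&3+3,0,(4068-7.'7'9$-)&16&.%5+''01(&0"

print(len(seq), len(qual), primer_quality(qual))
print(len(strip_primer_quality(seq, qual)))
```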

James Bonfield - 2010-02-23

This is the confidence of the first base. I changed it to ! as it appears to be the format that NCBI support, i.e. the Solid data files they export look like this.

I'll try and get this sorted as it's silly to have multiple competing formats. Time to email all the parties I think.

James
