I'm interested in familiarising myself with the io_lib package, so I installed io_lib-1.12.4 and decided to convert some SOLiD SRF files to fasta format. (I first tried converting them to fastq, but the error message "No CNF chunks found" made me think the quality information might be encoded differently in these files?)
Anyway, I picked this file for starters: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM512597
When I run srf2fasta on it, all the reads in the resulting fasta file look like this:
Do you have any thoughts on what is happening here? Why all the Ns?
I see, it's a misplaced feature of srf2fasta. Technically fasta doesn't support anything other than legal DNA bases, and the code rigidly enforces this. However, it should be checking the character-set field in the SRF, which clearly states that this data is in colourspace. (srf2fastq does this, but unfortunately cannot work on this data because the SRFs are "broken".)
I think this is overly aggressive behaviour (and clearly wrong for SOLiD) though, so I've changed it to let anything through (implying garbage in, garbage out :-)). I do change dot (.) to N, but that's the only thing now. The latest code is now in subversion - see http://staden.svn.sourceforge.net/viewvc/staden?revision=2182&view=revision to download a copy of the fixed source file.
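To illustrate the difference in behaviour, here is a rough Python sketch. It is not the actual io_lib source (which is C); the function names, the exact set of characters the old filter accepted, and the sample read are all assumptions made up for illustration.

```python
# Hypothetical illustration only; not taken from the io_lib code.

# Assumed set of characters the old, strict filter treated as legal bases.
DNA_BASES = set("ACGTNacgtn")


def strict_fasta_seq(seq: str) -> str:
    """Old behaviour as described: anything that is not a legal base becomes N."""
    return "".join(c if c in DNA_BASES else "N" for c in seq)


def permissive_fasta_seq(seq: str) -> str:
    """New behaviour as described: pass everything through, only '.' becomes N."""
    return seq.replace(".", "N")


# Made-up colourspace-style read: a primer base, then colour calls 0-3,
# with '.' standing in for a missing call.
colourspace_read = "T0120.3321"

print(strict_fasta_seq(colourspace_read))      # TNNNNNNNNN
print(permissive_fasta_seq(colourspace_read))  # T0120N3321
```

With the strict filter every colour call is replaced, which would explain reads that are almost entirely Ns; the permissive version keeps the colour calls and only maps the dot.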
Incidentally, who on earth thought it was a good idea to encode fasta in SRF format? That's the *only* information in those srfs: no traces, not even any qualities. Just sequence + name…
Thank you very much for that.
And oh yes, sometimes the "whateverness" of public datasets out there is noteworthy.