- Submit an mgf file with CRLF line terminators (e.g. SUB6787).
- Replace CRLF with LF and submit the edited mgf as a new submission (e.g. SUB6816).
- Search both files. The number of peptides found differs (3672 vs 2795).
A diff on the two mgfs in their respective submission folders shows that the only change made after submission is to the CRLF file: its TITLEs are rewritten so that the scan numbers run 1, 2, 3, etc. rather than reflecting the actual scan numbers (the file is still CRLF-terminated).
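The CRLF-to-LF step in the reproduction above can be done with any editor or tool. As a minimal sketch in Python (the function name `normalise_crlf` is hypothetical and not part of CPFP), working on raw bytes so that no other characters are touched:

```python
def normalise_crlf(data: bytes) -> bytes:
    """Return a copy of the data with every CRLF terminator replaced by LF."""
    return data.replace(b"\r\n", b"\n")


def has_crlf(data: bytes) -> bool:
    """Report whether the data contains at least one CRLF terminator."""
    return b"\r\n" in data
```

To apply it to a file, read and write in binary mode, e.g. `open(path, "rb").read()`, so that Python does not translate line endings itself.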
UPDATE:
Search results import needs to check that the TITLE attributes of mgfs (where the TITLEs are not re-written) are unique.
Submission records in the mysql db are identical apart from sub_id and sub_title:
select * from submissions where sub_id in ("6787","6816");
Last edit: Phil Charles 2013-10-16
The whole MGF processing needs to be reviewed. It is currently badly broken for submission of MGFs from low-res instruments, which won't have charge states written into the MGF for each spectrum: every spectrum will be assigned 2+. Because of this we have to search LTQ data from mzML only, but I haven't got around to fixing it yet.
I actually favor ditching MGF submission and providing recipes for raw file conversion to different mzML formats (e.g. MS1/MS2, MS2 only). There are so many TITLE formats for different MGF-generating tools that I can't hope to have scan numbers imported correctly for all situations. I'll only keep up with making it work for the MGFs we use, and we may not use MGFs at all soon.
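As a rough illustration of the missing charge state problem described above, the sketch below lists the spectra in an MGF that carry no `CHARGE=` line of their own, so they could be flagged rather than silently assigned 2+. The function name is hypothetical, this is not CPFP's importer code, and it only looks inside `BEGIN IONS` / `END IONS` blocks (it ignores any file-level default).

```python
from typing import Iterable, List, Optional


def spectra_missing_charge(lines: Iterable[str]) -> List[Optional[str]]:
    """Return the TITLE of every spectrum block that has no CHARGE= line.

    A spectrum without a TITLE is reported as None.
    """
    missing: List[Optional[str]] = []
    in_spectrum = False
    title: Optional[str] = None
    has_charge = False
    for raw in lines:
        line = raw.strip()  # also drops a trailing CR from CRLF files
        if line == "BEGIN IONS":
            in_spectrum, title, has_charge = True, None, False
        elif line == "END IONS":
            if in_spectrum and not has_charge:
                missing.append(title)
            in_spectrum = False
        elif in_spectrum:
            if line.startswith("TITLE="):
                title = line[len("TITLE="):]
            elif line.startswith("CHARGE="):
                has_charge = True
    return missing
```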
Replaced the mgf containing CRs with a CR-stripped version in the submission folder, so the only difference now is the TITLE attribute for each spectrum in the peaklists. Both are in the format 20130603_yeast_3rd_acid.x.x.charge. In the file uploaded with CRs included (which, in a separate bug, causes the importer to misrecognise the title), x = 1 to 6445. In the original file (with original scan numbers!), which was stripped of CRs and thus did NOT cause the bug during import, x = scan number.
HOWEVER, as before, the file with x = 1-6445 actually gets MORE peptide hits than the file with x = original scan numbers.
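For reference, the name.x.x.charge TITLE layout described above could be pulled apart roughly as follows. This is a sketch under assumptions: the regular expression, the `SpectrumTitle` type and the `parse_title` function are hypothetical, the scan number in the usage line is made up, and other MGF-generating tools use different TITLE layouts that this will not match.

```python
import re
from typing import NamedTuple, Optional


class SpectrumTitle(NamedTuple):
    run_name: str
    first_scan: int
    last_scan: int
    charge: int


# Assumed layout: <run name>.<first scan>.<last scan>.<charge>
TITLE_RE = re.compile(r"^(?P<run>.+)\.(?P<first>\d+)\.(?P<last>\d+)\.(?P<charge>\d+)$")


def parse_title(title: str) -> Optional[SpectrumTitle]:
    """Split a TITLE value into its parts, or return None if it does not match."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return SpectrumTitle(
        run_name=match.group("run"),
        first_scan=int(match.group("first")),
        last_scan=int(match.group("last")),
        charge=int(match.group("charge")),
    )
```

For example, `parse_title("20130603_yeast_3rd_acid.1234.1234.2")` yields a run name of `20130603_yeast_3rd_acid`, scans 1234 to 1234 and charge 2.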
Last edit: Phil Charles 2013-10-16
Aha!:
rewritten scan numbers:
original scan numbers:
So
a) there's a bug in the mgf generation
b) the mgf import script needs to check that the TITLE attributes of imported spectra are unique
Diff:
Phil, what created the MGF? According to the Matrix Science MGF spec, a TITLE line applies to a single spectrum, i.e. titles must be unique.
http://www.matrixscience.com/help/data_file_help.html
Thinking back, I saw this problem before when Ben was trying some peaklist creation software (I forget the name), and I advised that it was a bug in that software.
In my view, if there are non-unique titles CPFP needs to refuse the file with a sensible error, not re-write them.
Yes, CPFP should definitely reject files with non-unique TITLE headers, as such a file is no longer valid MGF. I believe the files were created by Progenesis ClearSpec; however, I'll have to check that with the user tomorrow.
It seems the files were created by Progenesis LC-MS, which implies the problem may not be restricted to ClearSpec output but may lie in a core Progenesis mgf-handling routine :(
Immediate issue has not cropped up again, but a TITLE uniqueness validator remains a good idea
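A minimal sketch of such a validator, assuming the check runs over the lines of the uploaded MGF before import and that the file should be rejected with an error rather than having its titles rewritten. The names `DuplicateTitleError`, `find_duplicate_titles` and `validate_unique_titles` are hypothetical and do not exist in CPFP.

```python
from collections import Counter
from typing import Iterable, List


class DuplicateTitleError(ValueError):
    """Raised when an MGF contains the same TITLE for more than one spectrum."""


def find_duplicate_titles(lines: Iterable[str]) -> List[str]:
    """Return each TITLE value that occurs more than once, in first-seen order."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.strip()  # tolerate CRLF-terminated files
        if line.startswith("TITLE="):
            counts[line[len("TITLE="):]] += 1
    return [title for title, n in counts.items() if n > 1]


def validate_unique_titles(lines: Iterable[str]) -> None:
    """Raise DuplicateTitleError if any TITLE is repeated; otherwise return None."""
    duplicates = find_duplicate_titles(lines)
    if duplicates:
        shown = ", ".join(duplicates[:5])
        raise DuplicateTitleError(
            f"{len(duplicates)} TITLE value(s) are not unique, e.g. {shown}"
        )
```

Calling `validate_unique_titles(open(path))` at upload time would let the submission fail early with a message naming a few offending titles.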