
#14 Discrepancy in search results if line endings are different

Oxford
accepted
Bug (24)
2014-10-23
2013-10-16
No
  1. Submit an MGF file with CRLF line terminators (e.g. SUB6787).
  2. Replace CRLF with LF and submit the edited MGF as a new submission (e.g. SUB6816).
  3. Search both files. Different numbers of peptides will be found (3672 vs 2795).

A diff of the two MGFs in their respective submission folders shows that the only change made after submission is to the CRLF file: its TITLEs are rewritten so that the scan numbers run 1, 2, 3, etc. instead of reflecting the actual scan numbers (the file is still CRLF-terminated).
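
To confirm which terminators a given MGF actually uses before and after step 2, something like the following could be run on the raw bytes. It is a minimal sketch, not part of CPFP; the function names `count_line_endings` and `normalise_to_lf` and the sample TITLE are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
        "cr": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
    }


def normalise_to_lf(data: bytes) -> bytes:
    """Convert CRLF (and any bare CR) terminators to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical three-line fragment with CRLF terminators.
sample = b"BEGIN IONS\r\nTITLE=example.1.1.2\r\nEND IONS\r\n"
print(count_line_endings(sample))
print(count_line_endings(normalise_to_lf(sample)))
```

Reading the file in binary mode (`open(path, "rb")`) matters here, because text mode would silently translate the terminators.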

UPDATE:
Search results import needs to check that the TITLE attributes of MGFs (where the TITLEs are not rewritten) are unique.

Discussion

  • Phil Charles

    Phil Charles - 2013-10-16

    Submission records in mysql db are identical apart from sub_id,sub_title

    select * from submissions where sub_id in ("6787","6816");

     
  • Phil Charles

    Phil Charles - 2013-10-16
    select fsp_sub_id,count(fsp_id) from fragment_spectra where fsp_sub_id in ("6787","6816") group by fsp_sub_id;
    +------------+---------------+
    | fsp_sub_id | count(fsp_id) |
    +------------+---------------+
    |       6787 |          6445 |
    |       6816 |          6445 |
    +------------+---------------+
    
     

    Last edit: Phil Charles 2013-10-16
  • David Trudgian

    David Trudgian - 2013-10-16

    The whole MGF processing needs to be reviewed. It is currently badly broken for submission of MGFs from low-res instruments that won't have charge states written into the MGF for each spectrum: they will all be assigned 2+. We have to search LTQ data from mzML only because of this, but I haven't got around to fixing it yet.

    I actually favor ditching MGF submission and providing recipes for raw file conversion to different mzML formats (e.g. MS1/MS2, MS2 only). There are so many TITLE formats for different MGF generating tools that I can't hope to have scan numbers imported correctly for all situations. I'll only keep up with making it work for the MGFs we use, and we may not use MGFs at all soon.
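
    For anyone wanting to check whether a given MGF is affected by the charge-state problem, a rough sketch along these lines lists the spectra that carry no CHARGE line of their own. It is not CPFP code: the function name `spectra_missing_charge` and the sample TITLEs are invented, and it only looks inside BEGIN IONS / END IONS blocks, so a file-level CHARGE default in the header is deliberately ignored.

```python
def spectra_missing_charge(lines):
    """Return the TITLE (or an ordinal placeholder) of each spectrum with no CHARGE line."""
    missing = []
    in_spectrum = False
    title = None
    has_charge = False
    index = 0
    for raw in lines:
        line = raw.strip()  # also drops a trailing CR left over from CRLF files
        if line == "BEGIN IONS":
            in_spectrum, title, has_charge = True, None, False
            index += 1
        elif line == "END IONS":
            if in_spectrum and not has_charge:
                missing.append(title if title is not None else f"<spectrum {index}>")
            in_spectrum = False
        elif in_spectrum and line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif in_spectrum and line.startswith("CHARGE="):
            has_charge = True
    return missing


# Hypothetical two-spectrum file: the second spectrum has no CHARGE line.
mgf_text = """BEGIN IONS
TITLE=highres.10.10.2
CHARGE=2+
PEPMASS=500.25
100.0 10
END IONS
BEGIN IONS
TITLE=lowres.11.11
PEPMASS=612.30
100.0 10
END IONS
"""
print(spectra_missing_charge(mgf_text.splitlines()))
```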

     
  • Phil Charles

    Phil Charles - 2013-10-16

    Replaced the MGF containing CRs with a CR-stripped version in the submission folder, so the only difference now is the TITLE attribute of each spectrum in the peaklists. Both use the format 20130603_yeast_3rd_acid.x.x.charge. In the file uploaded with CRs included (which, in a separate bug, causes the importer to misrecognise the title), x = 1 to 6445. In the original file (with original scan numbers!), which was stripped of CRs and thus did NOT trigger that bug during import, x = scan number.
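
    To make the title format concrete, here is a small sketch that splits a TITLE of the form basename.x.x.charge into its parts. The regular expression, the name `parse_title`, and the scan number 1234 are assumptions for illustration only; this is not how the CPFP importer parses titles, and it covers just this one TITLE convention.

```python
import re

# Assumed layout: <basename>.<first scan>.<last scan>.<charge>
TITLE_RE = re.compile(
    r"^(?P<base>.+)\.(?P<first>\d+)\.(?P<last>\d+)\.(?P<charge>\d+)$"
)


def parse_title(title):
    """Return (basename, first_scan, last_scan, charge), or None if the title does not match."""
    match = TITLE_RE.match(title.strip())  # strip() removes a stray CR from CRLF files
    if match is None:
        return None
    return (
        match.group("base"),
        int(match.group("first")),
        int(match.group("last")),
        int(match.group("charge")),
    )


print(parse_title("20130603_yeast_3rd_acid.1234.1234.2"))
print(parse_title("20130603_yeast_3rd_acid.1234.1234.2\r"))
```

    Stripping the title before matching means a trailing CR does not change the result, which is one way to keep CRLF and LF files from being treated differently.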

    HOWEVER, as before, the file with x = 1-6445 actually gets MORE peptide hits than the file with x = original scan numbers.

     

    Last edit: Phil Charles 2013-10-16
  • Phil Charles

    Phil Charles - 2013-10-16

    Aha!:

    rewritten scan numbers:

    grep "TITLE=" 20130603_yeast_3rd_acid.mgf | sort | uniq | wc -l
    6445
    

    original scan numbers:

    grep "TITLE=" 20130603_yeast_3rd_acid.mgf | sort | uniq | wc -l
    5768
    

    So
    a) there's a bug in the mgf generation
    b) the mgf import script needs to check that the TITLE attributes of imported spectra are unique
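
    The grep | sort | uniq comparison above can also be done in a way that names the offending titles. A minimal sketch, with made-up function names and sample titles; unlike the grep, it only counts lines that start with TITLE=.

```python
from collections import Counter


def title_counts(lines):
    """Count how often each TITLE line occurs (whitespace and CRs stripped)."""
    return Counter(line.strip() for line in lines if line.startswith("TITLE="))


def duplicated_titles(lines):
    """Return {title_line: occurrences} for every TITLE that appears more than once."""
    return {title: n for title, n in title_counts(lines).items() if n > 1}


# Hypothetical peaklist fragment with one repeated title.
example = [
    "TITLE=run.100.100.2",
    "TITLE=run.101.101.2",
    "TITLE=run.100.100.2",
]
counts = title_counts(example)
print(sum(counts.values()), "TITLE lines,", len(counts), "unique")
print(duplicated_titles(example))
```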

     
  • Phil Charles

    Phil Charles - 2013-10-16
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -3,3 +3,6 @@
     3.  Search both files.  Different numbers of peptides will be found (3672 vs 2795).
    
     A diff on the two mgfs in respective submissions folders reveals that the only change after submission is that the CRLF file titles are rewritten so that the scan numbers are 1,2,3 etc rather than reflecting actual scan numbers (they are still CRLF-terminated).
    +
    +UPDATE:
    +Search results import needs to check that the TITLE attributes of mgfs (where the TITLEs are not re-written) are unique
    
     
  • David Trudgian

    David Trudgian - 2013-10-16

    Phil, what created the MGF? According to the Matrix Science MGF spec, a TITLE line applies to a single spectrum, i.e. titles must be unique.

    http://www.matrixscience.com/help/data_file_help.html

    Thinking back, when Ben was trying some peaklist creation software (forget the name) I saw this problem before.... and advised that this was a bug in that software.

    In my view, if there are non-unique titles CPFP needs to refuse the file with a sensible error, not re-write them.
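
    A refuse-with-an-error check could look roughly like this. It is only a sketch of the idea, not the CPFP importer: `DuplicateTitleError` and `check_unique_titles` are hypothetical names, and the error wording is a placeholder.

```python
class DuplicateTitleError(ValueError):
    """Raised when an MGF contains the same TITLE for more than one spectrum."""


def check_unique_titles(lines):
    """Return the number of distinct TITLEs, or raise on the first repeated one."""
    seen = {}  # title -> line number of first occurrence
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # tolerate both CRLF and LF terminators
        if not line.startswith("TITLE="):
            continue
        title = line[len("TITLE="):]
        if title in seen:
            raise DuplicateTitleError(
                f"duplicate TITLE {title!r} on line {lineno} "
                f"(first seen on line {seen[title]})"
            )
        seen[title] = lineno
    return len(seen)


# Hypothetical input with a repeated title on line 3.
try:
    check_unique_titles(["TITLE=run.7.7.2\n", "100.0 10\n", "TITLE=run.7.7.2\n"])
except DuplicateTitleError as err:
    print("rejected:", err)
```

    Stopping at the first duplicate keeps the message short; collecting all duplicates before raising would be an equally reasonable choice.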

     
  • Phil Charles

    Phil Charles - 2013-10-16

    Yes, CPFP should definitely reject files with non-unique TITLE headers, as such a file is no longer valid MGF. I believe the files were created by Progenesis ClearSpec; however, I'll have to check that with the user tomorrow.

     
  • Phil Charles

    Phil Charles - 2013-10-17

    It seems the files were created by Progenesis LC-MS, which implies the problem may not be restricted to ClearSpec output but may lie in a core Progenesis MGF-handling routine :(

     
  • Phil Charles

    Phil Charles - 2014-10-23
    • status: open --> accepted
     
  • Phil Charles

    Phil Charles - 2014-10-23

    Immediate issue has not cropped up again, but a TITLE uniqueness validator remains a good idea

     
