From: Jones, A. <And...@li...> - 2017-01-31 15:22:00
Hi all,

The validation of proteogenomics example files has revealed a minor flaw in the proteogenomics specs. We use a MUST rule that says every peptide and protein that is not a decoy must state its genomic location. This doesn’t make sense for two reasons:

- Some peptides may have been identified but not mappable to a chromosome, due to the approach taken i.e. the protein database and gene models are not consistent for sensible reasons
- As we have done, we have a merged result file from hits to Ensembl and Uniprot (Uniprot hits are not mapped).

Solutions:

1. Relax the MUST rule, to say that CV terms should be added for all mapped peptides/proteins.
   a. Downside: hard to encode this logic in the validator
2. Introduce another CV term for “unmapped peptide” and “unmapped protein” to cater for this case explicitly.

Option 2 seems more formally sensible, but makes more work for data exporters to add CV terms to every peptide/protein, even if they only intended to map one subset.

If possible, can people give opinions fairly quickly. We now have a time pressure to get MCP paper resubmitted within around 2 weeks otherwise it is a new submission.

Best wishes
Andy

From: mayerg97 [mailto:ger...@ru...]
Sent: 26 January 2017 13:33
To: Jones, Andy <jo...@li...>; Ghali, Fawaz <fg...@li...>
Subject: Re: ProteoAnnotator_1_2.mzid

Hi Fawaz and Andy,

e.g.
<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>

is referenced by

<PeptideEvidence dBSequence_ref="dbseq_generic|B_GENSCAN00000004093|" peptide_ref="FAALDNEEEDK_" start="241" end="251" pre="K" post="E" isDecoy="false" id="FAALDNEEEDK_generic|B_GENSCAN00000004093|_241_251"></PeptideEvidence>

This PeptideEvidence is not a decoy and has no genome mapping information defined, but the specification document defines in Figure 5 that the CV terms must be present on every PeptideEvidence unless isDecoy="true".

Best wishes,
Gerhard

On 26.01.2017 at 13:52, mayerg97 wrote:

Hi Fawaz and Andy,

it's because the file contains DBSequences that have no genome mapping defined, e.g.

<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>
<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000027223|" id="dbseq_generic|B_GENSCAN00000027223|"></DBSequence>
<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000009965|" id="dbseq_generic|B_GENSCAN00000009965|"></DBSequence>
<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000034417|" id="dbseq_generic|B_GENSCAN00000034417|"></DBSequence>

If we want to allow both sequences with and without genome mapping here, we can change the ProteogenomicsDBSequence_must_rule into a SHOULD rule instead.

Best wishes,
Gerhard

On 26.01.2017 at 13:14, Jones, Andy wrote:

Hi Fawaz,

I don't see anything wrong with this - Gerhard, do you have any ideas?
Thanks
Andy

-----Original Message-----
From: Ghali, Fawaz
Sent: 26 January 2017 11:55
To: Jones, Andy <jo...@li...>
Subject: ProteoAnnotator_1_2.mzid

Hi Andy,

ProteoAnnotator_1_2.mzid has an error:

Message 1:
Rule ID: ProteogenomicsDBSequence_must_rule
Level: ERROR
Context(/cvParam/@accession ) in 380 locations
--> None of the given CvTerms were found at '/MzIdentML/SequenceCollection/DBSequence/cvParam/@accession' because no values were found:
- The sole term MS:1002637 (chromosome name) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
- The sole term MS:1002638 (chromosome strand) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
- The sole term MS:1002644 (genome reference version) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

Example:

<DBSequence searchDatabase_ref="SearchDB_1" accession="generic|A_ENSP00000395953|" id="dbseq_generic|A_ENSP00000395953|">
  <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="11"></cvParam>
  <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
  <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"></cvParam>
</DBSequence>

Why is it complaining about the name?

Best wishes,
Fawaz

--
--------------------------------------------------------------------
Dipl. Inform. med., Dipl. Wirtsch. Inf.
GERHARD MAYER
PhD student
Medizinisches Proteom-Center
DEPARTMENT Medical Bioinformatics
Building ZKF E.049a | Universitätsstraße 150 | D-44801 Bochum
Fon +49 (0)234 32-21006 | Fax +49 (0)234 32-14554
E-mail ger...@ru...
www.medizinisches-proteom-center.de
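
For anyone following the thread, here is a rough sketch of what the relaxed (SHOULD-style) check from option 1 might look like. It is not the PSI validator's implementation: the names check_db_sequences, REQUIRED and SAMPLE are made up for illustration, and the sample fragment omits the XML namespace that a real mzIdentML file declares, so tag names would need qualifying before running this on real data. It only reports which DBSequence elements lack the three genome-mapping cvParams quoted in the validator message, at a configurable level.

```python
import xml.etree.ElementTree as ET

# Accessions quoted in the validator message in this thread.
REQUIRED = {
    "MS:1002637": "chromosome name",
    "MS:1002638": "chromosome strand",
    "MS:1002644": "genome reference version",
}


def check_db_sequences(xml_text, level="WARN"):
    """Return one (level, DBSequence id, missing accessions) tuple for each
    DBSequence that lacks any of the three genome-mapping cvParams.

    Hypothetical helper: level="WARN" mimics a SHOULD rule,
    level="ERROR" mimics the current MUST rule.
    """
    root = ET.fromstring(xml_text)
    findings = []
    for seq in root.iter("DBSequence"):
        present = {cv.get("accession") for cv in seq.findall("cvParam")}
        missing = sorted(set(REQUIRED) - present)
        if missing:
            findings.append((level, seq.get("id"), missing))
    return findings


# Namespace-free fragment built from the two kinds of entry in the thread:
# one mapped Ensembl sequence and one unmapped GENSCAN sequence.
SAMPLE = """<SequenceCollection>
  <DBSequence searchDatabase_ref="SearchDB_1"
              accession="generic|A_ENSP00000395953|"
              id="dbseq_generic|A_ENSP00000395953|">
    <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="11"/>
    <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"/>
    <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version"
             value="Homo_sapiens.GRCh38.77.gff3"/>
  </DBSequence>
  <DBSequence searchDatabase_ref="SearchDB_1"
              accession="generic|B_GENSCAN00000004093|"
              id="dbseq_generic|B_GENSCAN00000004093|"/>
</SequenceCollection>"""

for finding in check_db_sequences(SAMPLE):
    print(finding)
```

With level="WARN" the unmapped GENSCAN entry is flagged but would not fail the file. Option 2 would instead need one extra accepted accession per element, which is not sketched here because no such CV term has been defined in this thread.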