[Psidev-pi-dev] FW: ProteoAnnotator_1_2.mzid


Hi all,

The validation of proteogenomics example files has revealed a minor flaw in the proteogenomics specs. We use a MUST rule that says every peptide and protein that is not a decoy must state its genomic location. This doesn’t make sense for two reasons:

-          Some peptides may have been identified but cannot be mapped to a chromosome because of the approach taken, i.e. the protein database and the gene models are inconsistent, for sensible reasons.

-          A result file may be merged from hits to Ensembl and UniProt, as ours is (the UniProt hits are not mapped).

Solutions:

1.       Relax the MUST rule, to say that CV terms should be added for all mapped peptides/proteins.

a.       Downside: hard to encode this logic in the validator

2.       Introduce another CV term for “unmapped peptide” and “unmapped protein” to cater for this case explicitly.

Option 2 seems more formally sensible, but it makes more work for data exporters, who would have to add CV terms to every peptide/protein even if they only intended to map a subset.
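To make option 2 concrete, here is a minimal Python sketch of the check a validator could apply to each DBSequence. It is an illustration only, not the actual validator logic. The three mapping accessions are the ones named in the validator message quoted further down; the "unmapped" accession is a made-up placeholder, since no such CV term exists yet, and the element ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Accessions taken from the validator message quoted later in this thread.
MAPPING_TERMS = {"MS:1002637", "MS:1002638", "MS:1002644"}
# Hypothetical placeholder: an "unmapped protein" term has not been defined.
UNMAPPED_TERM = "MS:UNMAPPED_PLACEHOLDER"


def classify(db_sequence):
    """Return 'mapped', 'unmapped' or 'invalid' for one DBSequence element."""
    found = {cv.get("accession") for cv in db_sequence.findall("cvParam")}
    if MAPPING_TERMS <= found:
        return "mapped"      # all three genome mapping terms are present
    if UNMAPPED_TERM in found:
        return "unmapped"    # explicitly declared as not mapped
    return "invalid"         # neither mapped nor declared unmapped


# Hypothetical example data (ids are invented for this sketch).
EXAMPLE = """
<SequenceCollection>
  <DBSequence id="a">
    <cvParam accession="MS:1002637" name="chromosome name" value="11"/>
    <cvParam accession="MS:1002638" name="chromosome strand" value="+"/>
    <cvParam accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
  </DBSequence>
  <DBSequence id="b">
    <cvParam accession="MS:UNMAPPED_PLACEHOLDER" name="unmapped protein"/>
  </DBSequence>
  <DBSequence id="c"/>
</SequenceCollection>
"""

results = {
    seq.get("id"): classify(seq)
    for seq in ET.fromstring(EXAMPLE).iter("DBSequence")
}
print(results)
```

With an explicit "unmapped" term the rule stays a simple MUST (every entry carries one of the two alternatives), which is why option 2 is easier to encode than a conditional rule under option 1.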

If possible, could people give their opinions fairly quickly? We are under time pressure to resubmit the MCP paper within about 2 weeks; otherwise it counts as a new submission.

Best wishes
Andy

From: mayerg97 [mailto:ger...@ru...]
Sent: 26 January 2017 13:33
To: Jones, Andy <jo...@li...>; Ghali, Fawaz <fg...@li...>
Subject: Re: ProteoAnnotator_1_2.mzid

Hi Fawaz and Andy,

e.g.
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>

is referenced by

  <PeptideEvidence dBSequence_ref="dbseq_generic|B_GENSCAN00000004093|" peptide_ref="FAALDNEEEDK_" start="241" end="251" pre="K" post="E" isDecoy="false" id="FAALDNEEEDK_generic|B_GENSCAN00000004093|_241_251"></PeptideEvidence>

but this PeptideEvidence is not a decoy and has no genome mapping information defined,
whereas Figure 5 of the specification document states that the CV terms must be present on every PeptideEvidence unless isDecoy="true".
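A rough way to list the affected entries is sketched below in Python. It assumes that "has genome mapping" can be approximated as "has at least one cvParam child", which is a simplification; the function names, the ids and the placeholder accession in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Tag name without an XML namespace prefix, if one is present."""
    return element.tag.rsplit("}", 1)[-1]


def unmapped_non_decoys(root):
    """Ids of PeptideEvidence elements that are not decoys and carry no cvParam."""
    ids = []
    for element in root.iter():
        if local_name(element) != "PeptideEvidence":
            continue
        # isDecoy is an XML boolean; treat a missing attribute as false.
        if element.get("isDecoy", "false") in ("true", "1"):
            continue
        if not any(local_name(child) == "cvParam" for child in element):
            ids.append(element.get("id"))
    return ids


# Hypothetical sample: one mapped hit, one unmapped hit, one decoy.
SAMPLE = """
<SequenceCollection>
  <PeptideEvidence id="pe_mapped" isDecoy="false">
    <cvParam accession="MS:PLACEHOLDER" name="placeholder mapping term"/>
  </PeptideEvidence>
  <PeptideEvidence id="pe_unmapped" isDecoy="false"/>
  <PeptideEvidence id="pe_decoy" isDecoy="true"/>
</SequenceCollection>
"""

flagged = unmapped_non_decoys(ET.fromstring(SAMPLE))
print(flagged)
```

Matching on the local tag name means the same function should also work on a real file, where the elements carry the mzIdentML namespace.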

Best wishes,
Gerhard

On 26.01.2017 at 13:52, mayerg97 wrote:

Hi Fawaz and Andy,

it's because the file contains DBSequences that have no genome mapping defined, e.g.

  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000027223|" id="dbseq_generic|B_GENSCAN00000027223|"></DBSequence>
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000009965|" id="dbseq_generic|B_GENSCAN00000009965|"></DBSequence>
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000034417|" id="dbseq_generic|B_GENSCAN00000034417|"></DBSequence>

If we want to allow sequences both with and without genome mapping here, we can change the

ProteogenomicsDBSequence_must_rule
into a SHOULD rule instead.
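The effect of that change can be sketched in Python as follows. This is not the validator's own code: the mapping of MUST to ERROR follows the message quoted below, while reporting a SHOULD violation as WARN is an assumption, and the `check` function and the sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Accessions and names as listed in the validator message in this thread.
REQUIRED = {
    "MS:1002637": "chromosome name",
    "MS:1002638": "chromosome strand",
    "MS:1002644": "genome reference version",
}
# Assumption: a SHOULD rule is reported as a warning rather than an error.
LEVELS = {"MUST": "ERROR", "SHOULD": "WARN"}


def check(root, requirement="MUST"):
    """Report DBSequence elements that lack any of the required CV terms."""
    messages = []
    for seq in root.iter("DBSequence"):
        present = {cv.get("accession") for cv in seq.findall("cvParam")}
        missing = sorted(set(REQUIRED) - present)
        if missing:
            messages.append((LEVELS[requirement], seq.get("id"), missing))
    return messages


# Hypothetical sample: one fully mapped sequence, one without mapping.
SAMPLE = """
<SequenceCollection>
  <DBSequence id="mapped">
    <cvParam accession="MS:1002637" name="chromosome name" value="11"/>
    <cvParam accession="MS:1002638" name="chromosome strand" value="+"/>
    <cvParam accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
  </DBSequence>
  <DBSequence id="not_mapped"/>
</SequenceCollection>
"""

root = ET.fromstring(SAMPLE)
as_must = check(root, "MUST")
as_should = check(root, "SHOULD")
print(as_must)
print(as_should)
```

The same sequences are reported either way; only the level changes, so a merged file with unmapped hits would no longer fail validation.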

Best wishes,
Gerhard

On 26.01.2017 at 13:14, Jones, Andy wrote:

Hi Fawaz,

I don't see anything wrong with this - Gerhard, do you have any ideas?

Thanks

Andy

-----Original Message-----

From: Ghali, Fawaz

Sent: 26 January 2017 11:55

To: Jones, Andy <jo...@li...>

Subject: ProteoAnnotator_1_2.mzid

Hi Andy,

ProteoAnnotator_1_2.mzid has an error:

Message 1:

    Rule ID: ProteogenomicsDBSequence_must_rule

    Level: ERROR

    Context(/cvParam/@accession ) in 380 locations

    --> None of the given CvTerms were found at '/MzIdentML/SequenceCollection/DBSequence/cvParam/@accession' because no values were found:

  - The sole term MS:1002637 (chromosome name) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

  - The sole term MS:1002638 (chromosome strand) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

  - The sole term MS:1002644 (genome reference version) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

Example:

  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|A_ENSP00000395953|" id="dbseq_generic|A_ENSP00000395953|">

    <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="11"></cvParam>

    <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>

    <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"></cvParam>

  </DBSequence>

Why is it complaining about the name?

Best wishes,

Fawaz

--

--------------------------------------------------------------------

Dipl. Inform. med., Dipl. Wirtsch. Inf. GERHARD MAYER

PhD student

Medizinisches Proteom-Center

DEPARTMENT Medical Bioinformatics

Building ZKF E.049a | Universitätsstraße 150 | D-44801 Bochum

Fon +49 (0)234 32-21006 | Fax +49 (0)234 32-14554

E-mail ger...@ru...

www.medizinisches-proteom-center.de
