Re: [Psidev-pi-dev] ProteoAnnotator_1_2.mzid


Hi all,

I would also choose option 2. Identified peptides can be unmappable for
several reasons (mutations, indels, differences between the search space
used and the genomic annotation, etc.).
An "unmapped" feature is also used in the proteogenomics formats (proBAM
and proBed). See the descriptions on the PSI website
(http://www.psidev.info/proBAM and http://www.psidev.info/probed).
These CV terms could be introduced there as well.
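
To make option 2 concrete, here is a minimal sketch of the check a validator could apply to each DBSequence: either all three mapping terms are present, or a single "unmapped" term is present, but not a mixture. The three mapping accessions are the ones quoted in the validator message further down the thread. The "unmapped" accession is a hypothetical placeholder, since no such term has been assigned yet, and the function name `classify` is my own. Decoy handling is left out.

```python
import xml.etree.ElementTree as ET

# Accessions quoted in the validator message in this thread.
MAPPING_TERMS = {"MS:1002637", "MS:1002638", "MS:1002644"}

# HYPOTHETICAL placeholder: no accession exists yet for an "unmapped" term.
UNMAPPED_TERM = "MS:XXXXXXX"


def classify(dbsequence):
    """Return 'mapped', 'unmapped' or 'invalid' for one DBSequence element."""
    found = {
        child.get("accession")
        for child in dbsequence.iter()
        # Compare local names so the check works with or without a namespace.
        if child.tag.rsplit("}", 1)[-1] == "cvParam"
    }
    has_all = MAPPING_TERMS <= found
    has_any = bool(MAPPING_TERMS & found)
    is_unmapped = UNMAPPED_TERM in found
    if has_all and not is_unmapped:
        return "mapped"
    if is_unmapped and not has_any:
        return "unmapped"
    # Neither fully mapped nor explicitly unmapped (or a contradictory mix).
    return "invalid"


if __name__ == "__main__":
    bare = ET.fromstring('<DBSequence accession="generic|B_GENSCAN00000004093|"/>')
    print(classify(bare))
```

With this shape the rule can stay a MUST: an entry without any CV term is still reported, but an exporter can silence the error by stating explicitly that the entry is unmapped.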

Cheers,
Gerben

On Tue, Jan 31, 2017 at 6:26 PM, Juan Antonio Vizcaino <ju...@eb...>
wrote:

> Hi all,
>
> On 31 Jan 2017, at 15:20, Jones, Andy <And...@li...>
> wrote:
>
> Hi all,
>
> The validation of proteogenomics example files has revealed a minor flaw
> in the proteogenomics specs. We use a MUST rule that says every peptide and
> protein that is not a decoy must state its genomic location. This doesn’t
> make sense for two reasons:
>
> - Some peptides may have been identified but are not mappable to a
> chromosome because of the approach taken, i.e. the protein database and
> gene models are inconsistent for sensible reasons.
> - As in our case, the result file may merge hits to Ensembl and UniProt
> (UniProt hits are not mapped).
>
> Solutions:
>
> 1. Relax the MUST rule, to say that CV terms should be added for
> all mapped peptides/proteins.
>    a. Downside: hard to encode this logic in the validator
> 2. Introduce another CV term for “unmapped peptide” and “unmapped
> protein” to cater for this case explicitly.
>
> Option 2 seems more formally sensible, but makes more work for data
> exporters to add CV terms to every peptide/protein, even if they only
> intended to map one subset.
>
> If possible, could people give opinions fairly quickly? We are under time
> pressure to get the MCP paper resubmitted within around 2 weeks; otherwise
> it counts as a new submission.
>
>
> I would support option 2. It is more consistent and makes life easier for
> readers. Also, in most usual use cases (at least for the best annotated
> genomes), the number of unmapped peptides should be low.
>
> Cheers,
>
> Juan
>
>
>
>
>
> Best wishes
> Andy
>
>
>
>
> *From:* mayerg97 [mailto:ger...@ru... <ger...@ru...>]
> *Sent:* 26 January 2017 13:33
> *To:* Jones, Andy <jo...@li...>; Ghali, Fawaz <
> fg...@li...>
> *Subject:* Re: ProteoAnnotator_1_2.mzid
>
>
> Hi Fawaz and Andy,
>
> e.g.
>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>
>
> is referenced by
>
>   <PeptideEvidence dBSequence_ref="dbseq_generic|B_GENSCAN00000004093|" peptide_ref="FAALDNEEEDK_" start="241" end="251" pre="K" post="E" isDecoy="false" id="FAALDNEEEDK_generic|B_GENSCAN00000004093|_241_251"></PeptideEvidence>
>
> but this PeptideEvidence is not a decoy and also has no genome mapping
> information defined, whereas Figure 5 of the specification document
> requires the CV terms to be present on every PeptideEvidence unless
> isDecoy="true".
>
> Best wishes,
> Gerhard
>
> Am 26.01.2017 um 13:52 schrieb mayerg97:
>
> Hi Fawaz and Andy,
>
> it's because the file contains DBSequences that have no genome mapping
> defined, e.g.
>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000027223|" id="dbseq_generic|B_GENSCAN00000027223|"></DBSequence>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000009965|" id="dbseq_generic|B_GENSCAN00000009965|"></DBSequence>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000034417|" id="dbseq_generic|B_GENSCAN00000034417|"></DBSequence>
>
> If we want to allow here both sequences with and without genome mapping,
> we can change the
>
> ProteogenomicsDBSequence_must_rule
>
> into a SHOULD rule instead.
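
For anyone who wants to see which entries trigger the rule before deciding between MUST and SHOULD, the following is a rough sketch that lists the DBSequence accessions lacking any of the three mapping terms. It only uses the Python standard library. The function names and the small inline sample are illustrative assumptions (the sample is trimmed from the snippets quoted in this thread), and this is not how the validator itself is implemented.

```python
import xml.etree.ElementTree as ET

# Accessions quoted in the validator message in this thread.
MAPPING_TERMS = {"MS:1002637", "MS:1002638", "MS:1002644"}


def local_name(tag):
    """Strip an XML namespace such as '{http://...}DBSequence'."""
    return tag.rsplit("}", 1)[-1]


def unmapped_dbsequences(xml_text):
    """Return accessions of DBSequence elements missing any mapping term."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        if local_name(elem.tag) != "DBSequence":
            continue
        found = {
            child.get("accession")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        if not MAPPING_TERMS <= found:
            missing.append(elem.get("accession"))
    return missing


# Illustrative sample only, trimmed from the snippets quoted in this thread.
SAMPLE = """<SequenceCollection>
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|A_ENSP00000395953|" id="dbseq_generic|A_ENSP00000395953|">
    <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="11"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"></cvParam>
  </DBSequence>
  <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|B_GENSCAN00000004093|" id="dbseq_generic|B_GENSCAN00000004093|"></DBSequence>
</SequenceCollection>"""

if __name__ == "__main__":
    for accession in unmapped_dbsequences(SAMPLE):
        print(accession)
```

On a full file, read the text from disk and pass it to `unmapped_dbsequences`; the length of the returned list should correspond to the number of entries the rule reports.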
>
> Best wishes,
> Gerhard
>
> Am 26.01.2017 um 13:14 schrieb Jones, Andy:
>
> Hi Fawaz,
>
>
>
> I don't see anything wrong with this - Gerhard, do you have any ideas?
>
> Thanks
>
> Andy
>
>
>
> -----Original Message-----
> From: Ghali, Fawaz
> Sent: 26 January 2017 11:55
> To: Jones, Andy <jo...@li...> <jo...@li...>
> Subject: ProteoAnnotator_1_2.mzid
>
> Hi Andy,
>
> ProteoAnnotator_1_2.mzid has an error:
>
> Message 1:
>     Rule ID: ProteogenomicsDBSequence_must_rule
>     Level: ERROR
>     Context(/cvParam/@accession ) in 380 locations
>     --> None of the given CvTerms were found at '/MzIdentML/SequenceCollection/DBSequence/cvParam/@accession' because no values were found:
>   - The sole term MS:1002637 (chromosome name) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
>   - The sole term MS:1002638 (chromosome strand) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
>   - The sole term MS:1002644 (genome reference version) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.
>
> Example:
>
>   <DBSequence searchDatabase_ref="SearchDB_1" accession="generic|A_ENSP00000395953|" id="dbseq_generic|A_ENSP00000395953|">
>     <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="11"></cvParam>
>     <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
>     <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"></cvParam>
>   </DBSequence>
>
> Why is it complaining about the name?
>
> Best wishes,
> Fawaz
>
>
> --
>
> *--------------------------------------------------------------------*
>
> *Dipl. Inform. med., Dipl. Wirtsch. **Inf. GERHARD MAYER*
>
> *PhD student*
>
> *Medizinisches Proteom-Center*
>
> *DEPARTMENT Medical Bioinformatics*
>
> *Building *ZKF E.049a | Universitätsstraße 150 | D-44801 Bochum
>
> *Fon *+49 (0)234 32-21006 <+49%20234%203221006> | *Fax *+49 (0)234
> 32-14554 <+49%20234%203214554>
>
> *E-mail *ger...@ru...
>
> www.medizinisches-proteom-center.de
>
>
> Psidev-pi-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-pi-dev