Re: [Psidev-pi-dev] Results schema critical design question from Friday afternoon in Toledo

>But start, end, post and pre would now be CV?
>btw, Luisa recommends that we don't make too many things like this CV...
>Having been enthusiastic about the change, I think I'm now going off it - partly because with all the extra CV, file sizes may well explode. Please persuade me otherwise!
>(btw, I've 'read but ignored' the quantitation suggestions based on decisions in Toledo.)

I would favour keeping things as attributes where there is a common understanding across all search engines of what they mean, and where they will regularly or always be required.

“start, end, post and pre” – these all look like good candidates for being attributes. 

“calculatedMassToCharge="670.86261" chargeState="2" experimentalMassToCharge="671.9"” – I would say the same for these. Every additional thing in CV bloats the instance documents and makes more work for implementers.
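
To put a rough number on the bloat, here is a minimal Python sketch that serialises the same three values once as attributes and once as child cvParam elements, then compares the lengths. The function names are made up for illustration, the cvParam is written without the pf: prefix, and "PI:99999" is only the placeholder accession used elsewhere in this thread, not a real CV term.

```python
import xml.etree.ElementTree as ET

# Values taken from the PolypeptideResultItem example in this thread.
FIELDS = {
    "calculatedMassToCharge": "670.86261",
    "chargeState": "2",
    "experimentalMassToCharge": "671.9",
}


def as_attributes(fields):
    """Encode every field as a plain XML attribute on the result item."""
    item = ET.Element("PolypeptideResultItem", {"identifier": "1_1"})
    for name, value in fields.items():
        item.set(name, value)
    return ET.tostring(item, encoding="unicode")


def as_cv_params(fields, accession="PI:99999"):
    """Encode every field as a child cvParam (placeholder accession, an assumption)."""
    item = ET.Element("PolypeptideResultItem", {"identifier": "1_1"})
    for name, value in fields.items():
        ET.SubElement(
            item,
            "cvParam",
            {"accession": accession, "name": name, "value": value},
        )
    return ET.tostring(item, encoding="unicode")


attr_xml = as_attributes(FIELDS)
cv_xml = as_cv_params(FIELDS)
print(len(attr_xml), "characters as attributes")
print(len(cv_xml), "characters as cvParams")
```

The per-item difference is multiplied by every peptide match in a file, which is where the size concern comes from. Real accessions and names would change the exact figures.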

Cheers

Andy

From: psi...@li... [mailto:psi...@li...] On Behalf Of David Creasy
Sent: 30 April 2008 18:12
To: Sean L Seymour
Cc: psi...@li...
Subject: Re: [Psidev-pi-dev] Results schema critical design question from Friday afternoon in Toledo

Hi Sean,

Thanks very much - must have taken quite a while and is very useful. One thing that may not be obvious to others is where the <SpectrumIdentificationResultSet> comes from. I believe that this was just a 'rename' of PolypeptideResultSet made by the sub group that you were in at Toledo.

As we've usefully discussed, finding a way to communicate effectively is an issue. So, to make 100% sure I've understood I'll talk back to you in XML :)

This is a cut down of an example for an ms-ms search of a single spectrum with peptide results and protein inferencing. The protein inferencing (impossibly - 'cos just one peptide!) has a couple of similar proteins in the first group, and one in the second group.

<pf:DataCollection>
  <AnalyteDetectionResultSet type="MS_MS_peptide_matches">
    <AnalyteDetectionResult>
      <IdentificationResult>
        <SpectrumElement spectrumID="9" spectraDataInputRef_ref="file.1"/>
        <IdentificationHypothesis id="pep_match_x1" ref="peptide1_in_molecule_table">
          <pf:cvParam accession="PI:99999" name="score" value="62" />
        </IdentificationHypothesis>
        <IdentificationHypothesis id="pep_match_x2" ref="peptide2_in_molecule_table">
          <!-- A poorer match to same spectrum as "pep_match_x1" -->
          <pf:cvParam accession="PI:99999" name="score" value="12" />
        </IdentificationHypothesis>
      </IdentificationResult>
    </AnalyteDetectionResult>
  </AnalyteDetectionResultSet>

  <AnalyteDetectionResultSet type="Protein_inferencing">
    <AnalyteDetectionResult id="protein_group_1">
      <IdentificationResult>
        <SomeTagTBD id="PP" ref="pep_match_x">
          <pf:cvParam startpos="23" />
          <pf:cvParam endpos="29" />
        </SomeTagTBD>
        <IdentificationHypothesis id="TRYP_PIG" ref="protein1_in_molecule_table">
          <pf:cvParam accession="PI:99999" name="score" value="162" />
        </IdentificationHypothesis>
        <IdentificationHypothesis id="TRYP_BOV" ref="protein2_in_molecule_table">
          <pf:cvParam accession="PI:99999" name="score" value="162" />
        </IdentificationHypothesis>
      </IdentificationResult>
      <IdentificationResult>
      </IdentificationResult>
    </AnalyteDetectionResult>
    <AnalyteDetectionResult id="protein_group_2">
      <IdentificationResult>
        <SomeTagTBD id="PP" ref="pep_match_y">
          <pf:cvParam startpos="123" />
          <pf:cvParam endpos="129" />
        </SomeTagTBD>
        <IdentificationHypothesis id="DODGY" ref="protein99_in_molecule_table">
          <pf:cvParam accession="PI:99999" name="score" value="1" />
        </IdentificationHypothesis>
      </IdentificationResult>
      <IdentificationResult>
      </IdentificationResult>
    </AnalyteDetectionResult>
  </AnalyteDetectionResultSet>
</pf:DataCollection>

Please correct where I haven't understood.

Before, we had in peptide ID:
<PolypeptideResultItem identifier="1_1"  calculatedMassToCharge="670.86261" chargeState="2" experimentalMassToCharge="671.9" polypeptideReference_ref="xxx">
New proposal is that calculatedMassToCharge, chargeState and experimentalMassToCharge are all just CV?

Likewise, for protein inferencing, we had:
          <_resultItems>
            <RelationResultItem identifier="" start="160" end="171" polypeptideReference_ref="1_1" post="K" pre="I">
            </RelationResultItem>
            <RelationResultItem identifier="" start="57" end="71" polypeptideReference_ref="3_1" post="K" pre="R">
            </RelationResultItem>

But start, end, post and pre would now be CV?
btw, Luisa recommends that we don't make too many things like this CV...
Having been enthusiastic about the change, I think I'm now going off it - partly because with all the extra CV, file sizes may well explode. Please persuade me otherwise!
(btw, I've 'read but ignored' the quantitation suggestions based on decisions in Toledo.)

One minor comment:

Slide 6: ..., but the results are always about the result from the user’s perspective – “What did I find and/or measure?”, rather than “How did I account for all of the spectra?” 
 - Many users do want to try and account for all their spectra because they believe that they are missing something useful.

David

Sean L Seymour wrote: 

Hi all, 

After the wrap up Friday afternoon, the few remaining people in the PI group had a short meeting where we discussed a potential generalization to the results portion of the schema. The big question that came out of this was whether or not we should keep the result description for the ID of peptides from MS/MS spectra as it was by midday Friday, or whether it made sense to restructure this so that it followed the more general structure for results that we would use for many other things, including protein inference from peptide IDs. I agreed to outline the various use cases and try to lay out the issues. I had hoped to send this out by Monday, but it's taken a lot longer than planned. Apologies for being a day late, but I hope you'll see that a lot of thought went into this. 

There are two documents. Please look at "AnalysisXML Results Design Question.ppt" first. This lays out the specific schema change question we face. One of the biggest concerns about this proposed change was that it was not immediately obvious to any of us last Friday whether this was a substantial restructuring or essentially a renaming process. As you'll see in the slide showing the alignment, I now believe that the change is largely a renaming process and not a large change. The only real change is the insertion of one additional level, but I can imagine a way around doing this. In fact, I think that the reason for inserting this level is not specific to the question of the schema change; rather, it's simply making up for something that was missing in the original model. There needs to be a way of having things that are attributes of the overall identification rather than an individual identification hypothesis - for example, the probability that at least one of the identification hypotheses (hits/matches) is correct for the spectrum. Assuming we agree that this is true, I think there is zero difference in the schema other than using more generic names, and my opinion is that we should really make this change. 
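
As a rough illustration of that extra level, here is a minimal Python sketch of the shape being described: parameters that belong to the identification as a whole sit on the result, and per-match parameters sit on each hypothesis. The class and field names mirror the pseudo instance documents in this thread but are otherwise hypothetical, and the probability value is invented purely for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IdentificationHypothesis:
    """One candidate match (hit) for a spectrum, with its own parameters."""
    id: str
    ref: str
    params: dict = field(default_factory=dict)


@dataclass
class IdentificationResult:
    """The extra level: holds parameters about the identification as a whole."""
    spectrum_id: str
    params: dict = field(default_factory=dict)       # result-level values
    hypotheses: list = field(default_factory=list)   # individual matches


result = IdentificationResult(
    spectrum_id="9",
    # Hypothetical result-level value: chance that at least one hypothesis is right.
    params={"prob_any_hypothesis_correct": 0.97},
    hypotheses=[
        IdentificationHypothesis("pep_match_x1", "peptide1_in_molecule_table", {"score": 62}),
        IdentificationHypothesis("pep_match_x2", "peptide2_in_molecule_table", {"score": 12}),
    ],
)

# Per-hypothesis values stay with the hypothesis they describe.
best = max(result.hypotheses, key=lambda h: h.params["score"])
print(best.id, result.params)
```

Without the intermediate level there is nowhere natural to hang the result-level value, since it is not a property of either individual hypothesis.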

The second document, "AnalysisXML Results Use Cases.ppt" tries to capture a lot of more specific use cases that demonstrate why the proposed schema change may be the right thing to do. I've done this using 'pseudo instance documents' which are explained in the slides. I hope this is a useful communication mechanism, and may have some use for documentation as well. If no one finds them useful, no big deal - I was just trying to find a way to communicate clearly. Please excuse inaccuracies in the details of some of the use cases. I was trying to assess whether or not the constant AnalysisResult frame was robust to a large number of variations. I think you'll see that it is, and it's really not clear to me why we should have a special case of element names for the ID of peptides from MS/MS spectra. The only good reason I can see for it is that it's what we already had drawn up in the schema. 

Please feel free to add, modify, or correct any of this as you see fit! 

Sean 

-- 
David Creasy
Matrix Science
64 Baker Street
London W1U 7GB, UK
Tel: +44 (0)20 7486 1050
Fax: +44 (0)20 7224 1344

dc...@ma...
http://www.matrixscience.com

Matrix Science Ltd. is registered in England and Wales
Company number 3533898