From: David C. <dc...@ma...> - 2006-07-11 17:58:44
Hi,

Philip Jones wrote:
> Hi,
>
> Based upon the draft analysisXML UML model that Angel circulated a month
> or so ago, here are a few requirements that may need to be addressed.
>
> It is very likely that many of these requirements are already addressed
> in the draft model or by the FuGE model that it extends, especially
> taking into account that the majority of the classes in the diagram do
> not have fields included, but hopefully this list may provoke some
> discussion...
>
> The current draft UML model can be retrieved from:
>
> http://psidev.sourceforge.net/proteomics-informatics/documents/analysisUML.ppt
>
> Information that may need to be captured:
>
> 1. Polypeptide (mostly obvious stuff, probably just fields that are
>    present but have not been included in the Powerpoint diagram):
>      * Sequence (esp. for peptides) [0..1]
I don't think we should have this for proteins - it would be huge when you
are searching the human genome...
For a denovo search this may be an ambiguous sequence - e.g. ACQ[I|L]K
We need to define a syntax for that I guess.

>      * database name / version [0..1] - /Appears to be addressed by
>        SequenceDatabase class, The current model suggests that all
>        proteins identified in one XML need to be identified from
>        the same database - presumably this is a safe assumption?/
I think that we have to allow multiple databases. See the whole "database
searched" section in the spreadsheet.

>      * accession / accession version [0..1], mandatory for proteins?
Yes, I think so. It would be hard to justify making it optional?

>      * database cross references [0..*]
0 is certainly possible for a denovo search.

>      * Start and end coordinates of peptides in relation to any
>        proteins that they identify [0..1] or [0..*] ??
Guess it depends on the structure of the XML...

>      * Upstream / Downstream flanking regions for peptides
Remember that this won't be available for denovo sequencing.
Not all the search engines have this (see the spreadsheet)

>
>    Will polypeptide be sub-classed to allow (for example)
>    accession to be enforced for protein identifications?
>
It's also worth reading Brian Searle's comments on the comments tab in the
spreadsheet:
"... Is it possible to add in the specifications a recommendation for
reporting at least one peptide score for every spectrum?..."

> 2. Presumably the 'SearchProtocol' class will handle details of:
>
>      * search engine identity
>      * search engine version
>      * search input parameters / settings (Is this CV parameterised
>        stuff with CVs from individual search engine vendors?)
We've agreed to use CVs as little as possible. The spreadsheet will help us
get the largest possible number of parameters defined in the schema.

>
> 3. Protein modifications - CustomModification class:
>
>      * Are the monoisotopicMass & averageMass values expected or
>        observed? Does analysisXML need to store both? Are expected
>        values [1..*]?
The UML doesn't look nearly complete to me - I'm certainly not very clear
what it is supposed to be...

>
> 4. Protein modifications - Modification class
>
>      * 'position' field declared as mandatory, however the position
>        of the PTM is often unknown. Ability to handle 'fuzzy' or
>        approximate location? (Does this last requirement even exist?)
Yes, it mustn't be mandatory - for PMF searches you certainly don't know the
location. I guess approximate location may be useful. Do any search engines
support it?

>
> 5. SearchHypothesis / Search result:
>
>      * As mentioned at the last PSI-PI teleconference, just to flag
>        up the need for a clear plan of how to (for example) connect
>        to a gel spot recorded in GelML format etc.
>
> Best regards,
>
> Phil.
>

--
David Creasy
Matrix Science
8 Wyndham Place
London W1H 1PP
Tel +44 (0)20 7723 2142
Fax +44 (0)20 7725 9360
dc...@ma...
http://www.matrixscience.com
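
On the ambiguous denovo sequence mentioned above (e.g. ACQ[I|L]K): purely as an illustration of what a syntax might look like, here is a minimal Python sketch. It assumes that alternatives at one position are written in square brackets separated by "|", and that everything else is a single upper-case residue letter. That convention, and the name expand_ambiguous, are assumptions made for this example and not an agreed analysisXML syntax.

```python
import itertools
import re

# Hypothetical syntax (an assumption for this sketch): a bracketed group
# such as [I|L] lists the alternative residues at one position; any other
# upper-case letter is a fixed residue.
_TOKEN = re.compile(r"\[([A-Z](?:\|[A-Z])*)\]|([A-Z])")


def expand_ambiguous(sequence):
    """Return every concrete sequence matched by an ambiguous sequence."""
    choices = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"unexpected character at position {pos}: {sequence[pos]!r}"
            )
        group, single = match.groups()
        # A bracketed group contributes several options, a plain letter one.
        choices.append(group.split("|") if group else [single])
        pos = match.end()
    # Cartesian product over the per-position options.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    print(expand_ambiguous("ACQ[I|L]K"))
```

Listing every expansion is only practical for a handful of ambiguous positions; a real format would more likely store the ambiguous string itself and leave expansion to the reader.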