Re: [Psidev-pi-dev] analysis tree


OK, this discussion is difficult to track.

Summary:
1. longer chain of analyses (as
discussed by Andy): not modeled, all
okay
2. data filtering issue (my use case:
quality assessment): its parameters and
results are currently modeled "next to"
peptide and protein detection (I will
start a discussion on that later, when
my use case is assembled ;-) )
3. tree-like structure of (protein)
analyses: is currently possible, makes
sense (Angel) / no sense (Andy)

My opinion:
I (of course) do not insist on an
"actual analysis" attribute.
I can imagine two possibilities:
i) the problem could be ignored in the
schema and flagged later as "wrong" by
a "semantic validator";
ii) if we generally want to forbid it,
we could allow zero or one ProteinDetect
under AnalysisCollection.

With ii) we can have EITHER one spectrum
ident OR many spectrum idents OR one
spectrum ident and one protein detect OR
many spectrum idents and one protein
detect.
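
A minimal sketch of the rule in ii), only to make the four allowed cases concrete. The function name and the counting of elements are hypothetical and not part of the schema; the lower bound of one spectrum ident is an assumption read off the enumerated cases above.

```python
def allowed_under_option_ii(n_spectrum_ident: int, n_protein_detect: int) -> bool:
    """Hypothetical check for option ii).

    Assumption: an AnalysisCollection holds at least one spectrum ident
    and zero or one protein detect, as in the four cases listed above.
    """
    return n_spectrum_ident >= 1 and n_protein_detect in (0, 1)


# The four allowed combinations from the list above.
cases = [(1, 0), (5, 0), (1, 1), (5, 1)]
all_allowed = all(allowed_under_option_ii(s, p) for s, p in cases)

# Two protein detects on the same spectrum idents (the tree-like case) is rejected.
tree_like_allowed = allowed_under_option_ii(1, 2)
```

In a real file this would presumably be expressed as a cardinality constraint in the schema rather than in code; the sketch only shows which combinations survive.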

That is "a bit" of workflow, but without
tree-like protein analyses. I would
prefer that to the file solution
suggested by Angel,
because ontologies, databases, samples,
... are not duplicated and we have fewer
problems with moving results
(partial file copy or uploading into a
database).

My intuition is that quantification
fits into that suggestion, but that is
no argument at the moment. ;-)

Bye
   Martin

From: psi...@li...urceforge.net
[mailto:psi...@li...urceforge.net]
On Behalf Of Angel Pizarro
Sent: Wednesday, May 28, 2008 2:43 PM
To: Jones, Andy
Cc: psi...@li...
Subject: Re: [Psidev-pi-dev] analysis
tree

On Wed, May 28, 2008 at 7:33 AM, Jones,
Andy <And...@li...>
wrote:
No I mean that there should be a generic
DataFiltering protocol to define
arbitrary data filtering operations. And
yes, we need some examples of this for
the CV and schema.

But then this gets back to the problem
that Martin highlighted of having
multiple peptide and protein sets within
one file...

PeptideSet1 --> DataFiltering -->
PeptideSet2
PeptideSet2 --> ProteinDetection -->
ProteinSet1
ProteinSet1 --> DataFiltering -->
ProteinSet2 

Where ProteinSet2 is the "final"
results...

Simply reconstructing this graph shows
that what most would call "final" is
ProteinSet2. Martin's example was more
ambiguous:

PeptideSet1 -> ProteinDetection1 ->
ProteinSet1
PeptideSet1 -> ProteinDetection2 ->
ProteinSet2

Reconstructing this graph would in no
way tell you what the author of the file
meant to be their canonical result.
There is no amount of schema or CV
encoding that will automatically
disambiguate this for all cases, other
than a simple label, which Martin
proposes. Frankly, I think that while it
would "work", it is not such a good idea
to create a schema element that
essentially encourages bad encoding of
results.
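
A small sketch of what "reconstructing the graph" can and cannot tell you, using the two examples above. The function name and the (input, analysis, output) triple representation are assumptions for illustration, not anything defined in the schema.

```python
def terminal_sets(edges):
    """Return result sets that no later analysis consumes.

    edges: list of (input_set, analysis, output_set) triples.
    """
    inputs = {src for src, _, _ in edges}
    outputs = {dst for _, _, dst in edges}
    return sorted(outputs - inputs)


# Linear chain: filtering, detection, filtering.
chain = [
    ("PeptideSet1", "DataFiltering", "PeptideSet2"),
    ("PeptideSet2", "ProteinDetection", "ProteinSet1"),
    ("ProteinSet1", "DataFiltering", "ProteinSet2"),
]

# Martin's example: two protein detections on the same peptide set.
branching = [
    ("PeptideSet1", "ProteinDetection1", "ProteinSet1"),
    ("PeptideSet1", "ProteinDetection2", "ProteinSet2"),
]

chain_final = terminal_sets(chain)          # a single terminal set
branching_final = terminal_sets(branching)  # two terminal sets, no way to pick one
```

The chain has exactly one terminal set, so "final" is recoverable; the branching case has two, and nothing in the graph itself ranks them.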

BTW, FuGE has the same issue. All the
results are valid results and can stand
on their own. Ultimately it is the
consumer that will determine what to
label as "final".

A way to get rid of all of these issues
from axml is to move it even closer to
mzML and not encode any workflow. A file
would restrict itself to just a single
node of a workflow that references into
some input file. E.g. (nodes are
individual files):

mzML1 -> axml1 (peptide ids : mascot)
mzML1 -> axml2 (peptide ids : sequest)
axml1 -> axml3 (protein determination :
mascot)
[axml1, axml2] -> axml5 (peptide
determination : peptideprophet)
axml5 -> axml6 (protein determination :
proteinprophet)
 etc. etc.

In this case the final result is a file
and as such unambiguously encoded. 
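
To illustrate the one-node-per-file idea, here is a rough sketch that records only each file's direct inputs and walks back from a chosen result file. The dictionary layout and the function name are hypothetical; the file names are the ones from the example above, reading "axm2" as axml2.

```python
# Each file lists only the files it directly references as input.
file_inputs = {
    "axml1": ["mzML1"],           # peptide ids : mascot
    "axml2": ["mzML1"],           # peptide ids : sequest
    "axml3": ["axml1"],           # protein determination : mascot
    "axml5": ["axml1", "axml2"],  # peptide determination : peptideprophet
    "axml6": ["axml5"],           # protein determination : proteinprophet
}


def upstream(name, inputs):
    """Collect every file the given file depends on, directly or indirectly."""
    seen = []
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in inputs.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return sorted(seen)


lineage_axml6 = upstream("axml6", file_inputs)
lineage_axml3 = upstream("axml3", file_inputs)
```

Whichever file the author hands over is the result; its provenance follows from the input references, and no "final" label is needed inside any file.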

-angel