Re: [Psidev-qc-dev] Upcoming teleconference 15/07

I'll try to break the problem down to see which solutions are possible 
and what their implications are (i.e. us imposing certain rules or 
creating control mechanisms). This is also for the benefit of those who 
are not familiar with the issue we're facing; everyone else can jump 
directly to the alternatives.

So we have (at the very end of an mzQC document) an element that 
indicates which particular controlled vocabularies are used:

```
'controlledVocabularies': [
    {'name': 'Proteomics Standards Initiative Quality Control Ontology',
     'ref': 'QC',
     'uri': 'https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo',
     'version': '0.1.0'},
    {'name': 'Proteomics Standards Initiative Mass Spectrometry Ontology',
     'ref': 'MS',
     'uri': 'https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo',
     'version': '4.1.7'}
]
```

Then, preceding that, we have a direct 'instantiation' of a CV term as a 
metric in an mzQC document. That is, we pick an accession from one of 
the controlled vocabularies listed, add a reference to that vocabulary, 
and fit it with a 'value'.

```
{'cvRef': 'QC', 'accession': 'QC:4000160', 'name': 'ID ratio',
 'value': 0.4644882527680259}
```

 1. This means one could potentially use multiple versions of a
    controlled vocabulary without any clashes.

 2. Consuming such an mzQC document means that retrieving terms is a
    two-step process (1. get the ref of the controlled vocabulary with
    the terms (accessions) of interest, 2. then get the metrics that
    match reference and accession).

 3. A JSON-path query 'might' yield terms from different versions (or
    completely wrong ones if the namespaces, i.e. the part in front of
    the colon and the accession number, conflict; but that should not
    happen with established controlled vocabularies. Rather, that danger
    would come from homebrew CVs, which puts it out of our scope).

 4. A DB query would require a look-up table.

 5. Controlled vocabulary versions wouldn't mean much anymore; we'd have
    to bump the version with each addition.
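
To make the two-step retrieval from point 2 concrete, here is a minimal Python sketch. The flat 'qualityMetrics' key and the helper name find_metrics are assumptions made for illustration only; the actual mzQC schema nests metrics differently.

```python
QC_URI = 'https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo'
MS_URI = 'https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo'

# Simplified, hypothetical document layout (not the real mzQC nesting).
doc = {
    'qualityMetrics': [
        {'cvRef': 'QC', 'accession': 'QC:4000160', 'name': 'ID ratio',
         'value': 0.4644882527680259},
    ],
    'controlledVocabularies': [
        {'ref': 'QC', 'uri': QC_URI, 'version': '0.1.0'},
        {'ref': 'MS', 'uri': MS_URI, 'version': '4.1.7'},
    ],
}


def find_metrics(doc, cv_uri, accession):
    """Return the metrics for one accession of one specific CV."""
    # Step 1: resolve which 'ref' keys point at the CV of interest.
    refs = {cv['ref'] for cv in doc['controlledVocabularies']
            if cv['uri'] == cv_uri}
    # Step 2: keep only metrics that match both the ref and the accession.
    return [m for m in doc['qualityMetrics']
            if m['cvRef'] in refs and m['accession'] == accession]


hits = find_metrics(doc, QC_URI, 'QC:4000160')
print(hits[0]['value'])
```

Step 1 is the look-up that a DB query would need a table for (point 4).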

There are some alternatives:

Remove the 'ref' and make the 'accession' a PURL. Implications are:

 1. We would need to have our controlled vocabulary (and all 3rd-party
    ones!) registered and updated in a timely manner for new entries.

 2. Retrieval would require a network connection.

 3. The retrieval response for different 3rd-party controlled
    vocabularies is far from homogeneous and I could not find any
    documentation on the structure of the response (compare
    http://purl.obolibrary.org/obo/MS_1002358 with
    http://purl.obolibrary.org/obo/STATO_0000237).
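
For reference, the mapping between a regular accession and a term PURL of the shape shown in the two URLs above could look like the sketch below. The function names are hypothetical, nothing is fetched over the network, and whether a given CV is actually registered under that base URL is outside what the code can check.

```python
OBO_PURL_BASE = 'http://purl.obolibrary.org/obo/'


def accession_to_purl(accession):
    """'MS:1002358' -> 'http://purl.obolibrary.org/obo/MS_1002358'."""
    namespace, sep, local_id = accession.partition(':')
    if not (sep and namespace and local_id):
        raise ValueError(f'not a namespaced accession: {accession!r}')
    return f'{OBO_PURL_BASE}{namespace}_{local_id}'


def purl_to_accession(purl):
    """Inverse mapping; assumes the namespace itself contains no '_'."""
    if not purl.startswith(OBO_PURL_BASE):
        raise ValueError(f'not a PURL under {OBO_PURL_BASE}: {purl!r}')
    namespace, sep, local_id = purl[len(OBO_PURL_BASE):].partition('_')
    if not (sep and namespace and local_id):
        raise ValueError(f'cannot split namespace and id: {purl!r}')
    return f'{namespace}:{local_id}'


print(accession_to_purl('MS:1002358'))
print(purl_to_accession('http://purl.obolibrary.org/obo/STATO_0000237'))
```

Parsing the accession back out only works offline if every CV in use follows this URL pattern, which we cannot guarantee for 3rd-party CVs.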

Remove the 'ref' and reform the accession into the OBO PURL (as opposed 
to the term PURL), a separator ('#', which in URL-speak denotes a 
'fragment') and the regular accession as found in the controlled 
vocabulary. Implications are:

 1. We'd still need all the controlled vocabularies registered.

 2. Each metric would get lengthy.
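
A rough sketch of what such a composite accession would look like and how it could be split again with the standard library. Since I don't want to guess at the registered OBO PURL here, the GitHub uri from the 'controlledVocabularies' example stands in for it; the function names are hypothetical.

```python
from urllib.parse import urldefrag

# Stand-in for the OBO PURL (assumption: any URL locating the .obo file
# would be used the same way).
QC_OBO = 'https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo'


def join_accession(cv_location, accession):
    """Build '<obo location>#<accession>'."""
    return f'{cv_location}#{accession}'


def split_accession(qualified):
    """Split a composite accession back into (obo location, accession)."""
    cv_location, accession = urldefrag(qualified)
    if not accession:
        raise ValueError(f'no fragment in {qualified!r}')
    return cv_location, accession


qualified = join_accession(QC_OBO, 'QC:4000160')
print(qualified)
print(split_accession(qualified))
print(len('QC:4000160'), 'vs', len(qualified), 'characters per metric')
```

The last line illustrates implication 2: every single metric carries the full CV location.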

Remove the 'ref' and forget about the purl altogether. Implications are:

 1. We allow the use of only one version of a controlled vocabulary
    within any given mzQC document.

 2. Any 3rd-party controlled vocabulary mustn't overlap in its
    accessions with any other controlled vocabulary in use, and the
    term namespace becomes an essential part.

 3. A JSON query for a particular metric can be made in one step. This
    only leaves open the issue of terms that are unavailable in a
    certain version, which we can avoid by not deleting any.
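
Sketched in Python, the one-step query and the namespace rules from points 1 and 2 might look as follows. This assumes the CV list keeps a key (here still called 'ref') that names the accession namespace; the 'XX' metric and all function names are made up for illustration.

```python
def metrics_by_accession(metrics, accession):
    """One-step query: the accession alone identifies the term."""
    return [m for m in metrics if m['accession'] == accession]


def namespace_of(accession):
    """The part in front of the colon, e.g. 'QC' for 'QC:4000160'."""
    return accession.split(':', 1)[0]


def check_namespaces(cvs, metrics):
    """Return (namespaces declared more than once, namespaces never declared)."""
    declared = [cv['ref'] for cv in cvs]
    duplicates = sorted({ns for ns in declared if declared.count(ns) > 1})
    used = {namespace_of(m['accession']) for m in metrics}
    undeclared = sorted(used - set(declared))
    return duplicates, undeclared


cvs = [{'ref': 'QC', 'version': '0.1.0'}, {'ref': 'MS', 'version': '4.1.7'}]
metrics = [
    {'accession': 'QC:4000160', 'name': 'ID ratio',
     'value': 0.4644882527680259},
    # Hypothetical homebrew metric whose CV was never declared.
    {'accession': 'XX:0000001', 'name': 'homebrew metric', 'value': 1},
]

print(metrics_by_accession(metrics, 'QC:4000160'))
print(check_namespaces(cvs, metrics))
```

A second 'QC' entry with another version would show up in the first list, an undeclared homebrew namespace in the second.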

Now some deliberations:

  * Do we want to allow anyone to use multiple versions of a controlled
    vocabulary? Assuming no definitions change between versions, this
    leaves only deprecated/deleted terms. There would then be a problem
    in the future only if a term were to 'disappear', implying that we
    should only deprecate terms but leave them in, which I think is in
    accordance with how controlled vocabulary terms are supposed to
    work anyway.

  * How would we deal with definition amendments (e.g. adding a unit
    formalisation where there was none before)? We should probably
    strive for adding 'fully defined' metrics only. That means the
    version given in an mzQC document is to be interpreted as 'this
    version or above'.

  * Do we know how OBO Library PURLs work? The retrieval seems messy,
    and are there supposed to be versions? A 'guaranteed' own namespace
    would be beneficial in any case.

  * We cannot control which controlled vocabularies are going to be used
    and therefore have no control over:

      o The PURL having a common structure (to parse the accession and
        retrieve the term offline)

      o There even being a PURL

      o Online retrieval of a term is a nice-to-have, but involves a
        considerable technical maintenance burden.

  * We could also build some integration tests for our .obo PRs to make
    a best effort to keep backwards compatibility and compatibility with
    the usual-suspect 3rd-party controlled vocabularies.
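
As a starting point for such an integration test, the sketch below checks the 'deprecate, never delete' rule between two versions of an .obo file and the 'this version or above' reading of a version string. It uses a deliberately minimal reading of the OBO format (only [Term] stanzas and their id: and is_obsolete: tags), the QC:0000001 term is invented for the example, and purely numeric dotted versions are assumed.

```python
def parse_terms(obo_text):
    """Map term id -> is_obsolete for every [Term] stanza."""
    terms = {}
    in_term = False
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith('['):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = (line == '[Term]')
            current = None
        elif in_term and line.startswith('id:'):
            current = line[len('id:'):].strip()
            terms[current] = False
        elif in_term and current and line.startswith('is_obsolete:'):
            terms[current] = line[len('is_obsolete:'):].strip() == 'true'
    return terms


def removed_terms(old_obo, new_obo):
    """Accessions present in the old version but missing from the new one."""
    return sorted(set(parse_terms(old_obo)) - set(parse_terms(new_obo)))


def version_at_least(available, declared):
    """'This version or above' for dotted numeric versions like '4.1.7'."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split('.'))
    return as_tuple(available) >= as_tuple(declared)


OLD = """
[Term]
id: QC:4000160
name: ID ratio

[Term]
id: QC:0000001
name: hypothetical term
"""

# Acceptable change: the hypothetical term is deprecated but kept.
NEW_OK = OLD + "is_obsolete: true\n"
# Unacceptable change: the hypothetical term has been deleted.
NEW_BAD = """
[Term]
id: QC:4000160
name: ID ratio
"""

print(removed_terms(OLD, NEW_OK))
print(removed_terms(OLD, NEW_BAD))
print(version_at_least('4.1.10', '4.1.7'))
```

A PR check would fail whenever removed_terms returns a non-empty list for the previous release versus the proposed file.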

Overall, I am in agreement with Wout and think we would not lose much by 
getting rid of the 'ref', and we would gain clarity about how our 
controlled vocabulary accessions are supposed to work, namely as unique 
keys within the space of all OBO Library controlled vocabularies. We'd 
still need to make sure we get our own accession namespace ('QC:…'). 
Making the 'used controlled vocabulary' section of the mzQC schema a 
list is a technically very valid point, which, if I remember correctly, 
we agreed upon already, and I'll make sure this is reflected in the 
schema JSON in the repository. Maybe we can get the accession issue 
sorted in a timely manner if everyone who has doubts about, or 
(dis)agrees with, the possible solutions mails in response, so that we 
can come to an agreement before the next call.

best,

mths

On 15/07/2020 03:26, Wout Bittremieux via Psidev-qc-dev wrote:
> Unfortunately I won't be able to join the call because I'm co-chairing 
> the CompMS session at ISMB the whole day. (*Maybe* I'll be able to 
> join the first 10 minutes.) Final preparations for this have taken up 
> quite some time in the past two weeks as well, so unfortunately I 
> haven't yet had the time to contribute further to the specification 
> document.
>
> I've been thinking about using PURLs or CV references to link quality 
> metrics to CVs, and I'm starting to favor something that was suggested 
> two weeks ago: to stick to the old format, but reserve CV namespaces 
> for the most commonly used CVs. So that would be "QC" for our CV, "MS" 
> for the MS CV, and "UO" for the unit ontology. These cover the 
> majority of use cases I think, and if we identify other commonly used 
> CVs we can still add those as well.
>
> The advantage of this is that it's possible to efficiently query terms 
> in these CVs, because the CV keys are fixed. So they essentially 
> function the same way as the PURLs, without the need of actually 
> having to use PURLs. For other CVs that aren't reserved a two-query 
> solution will still be needed, but considering that such queries 
> should be rare I don't consider that too much of a problem.
> A further advantage is that we can avoid the somewhat clunky PURL 
> specification, don't depend on an external service, and (very 
> importantly imo!) don't require web lookups to the PURL service (i.e. 
> essential for doing stuff in firewalled compute environments).
>
> A small disadvantage would be that to validate these reserved CV keys 
> we'd need to add explicit functionality in the mzQC Python library to 
> do so. But this is hardly a showstopper.
>
> This solution seems to somewhat give us the best of both worlds I 
> think. It's also nice that we don't need to change the JSON schema. 
> (Although we should probably still change the CV references in the 
> JSON schema from a dictionary to a list.)
> The only thing would be to clearly document this behavior and the 
> reserved CV keys in the specification document.
>
> I think we should also adapt and explicitly document some best 
> practices that were discussed in function of adapting PURLs: that CV 
> terms are final and can only ever be deprecated (i.e. an accession 
> will always point to the same CV term) and that we should document an 
> official CV versioning scheme.
>
> Let me know if I've overlooked something here.
>
> Best,
> Wout
>
> On 14/07/2020 19:05, Wout Bittremieux wrote:
>> Dear colleagues,
>>
>> This is a reminder that our next teleconference is scheduled for
>> Wednesday, July 15, at 14h00 GMT (15h00 London, 16h00 Western
>> Europe, 16h00 Cape Town, 17h00 Turkey, 7h00 San Diego).
>>
>> You can connect to our teleconference on Zoom through the
>> following link: 
>> https://uchealth.zoom.us/j/92419363577?pwd=WVp5Q3FXNU9vaVdJT0ZNRllXWlN3Zz09
>> (Password: 012575)
>>
>> I'd like to propose the following agenda items:
>>
>> - Update on finalization of specification document 
>> (https://docs.google.com/document/d/132F3MBgDJgtFlXxDZhpJ1oHGbKL8pT6dk9fvL55L5_M/edit). 
>>
>> - New CV requests for PTXQC via mailing list 
>> (https://sourceforge.net/p/psidev/mailman/message/37059772/). @Chris: 
>> Is this not covered yet?
>> - Continue discussion on CV references / PURLs in mzQC schema 
>> (https://github.com/HUPO-PSI/mzQC/pull/103).
>>
>> Thanks,
>> Wout

-- 
Mathias Walzer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Office: +44 (0)1223 494 2610
E-mail: wa...@eb...