From: Walzer <wa...@eb...> - 2020-07-17 09:13:59
I'll try to break down the problem in order to see which solutions are possible and what their implications are (i.e. which rules we would impose or which control mechanisms we would have to create). This is also for the benefit of those who are not familiar with the issue we're facing; everyone else can jump directly to the alternatives.

So we have (at the very end of an mzQC document) an element that indicates which particular controlled vocabularies are used:

```
'controlledVocabularies': [
    {'name': 'Proteomics Standards Initiative Quality Control Ontology',
     'ref': 'QC',
     'uri': 'https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo',
     'version': '0.1.0'},
    {'name': 'Proteomics Standards Initiative '
             'Mass Spectrometry Ontology',
     'ref': 'MS',
     'uri': 'https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo',
     'version': '4.1.7'}
]
```

Then, preceding that, we have a direct 'instantiation' of a CV term as a metric in an mzQC document. That is, we pick an accession from one of the controlled vocabularies mentioned, together with a reference to that vocabulary (and fit it with a 'value'):

```
{'cvRef': 'QC',
 'accession': 'QC:4000160',
 'name': 'ID ratio',
 'value': 0.4644882527680259}
```

1. This means one could potentially use multiple versions of a controlled vocabulary without any clashes.
2. Consuming such an mzQC document means that retrieving terms is a two-step process (1. get the ref to the controlled vocabularies with terms (accessions) of interest, 2. then get the metrics that match reference and accession).
3. A JSONPath query 'might' yield terms from different versions (or completely wrong ones if the namespaces, i.e. the part in front of the colon, and the accession number conflict, but that should not happen with established controlled vocabularies. Rather, that danger would come from homebrew CVs, which puts it out of our scope).
4. A DB query would require a look-up table.
5. Controlled vocabulary versions wouldn't mean much anymore; we'd have to bump the version with each addition.
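To make the two-step retrieval from point 2 concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the flat 'qualityMetrics' key, the second 'QC_old' reference with its uri, version and value, and the helper name find_metrics are all hypothetical and not taken from the schema.

```python
# Hypothetical, heavily trimmed mzQC-like document. The flat 'qualityMetrics'
# key and the 'QC_old' entry are assumptions made for illustration only.
QC_URI = "https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo"
OLD_QC_URI = "https://example.org/hypothetical/older-qc-cv.obo"  # made-up uri

doc = {
    "qualityMetrics": [
        {"cvRef": "QC", "accession": "QC:4000160",
         "name": "ID ratio", "value": 0.4644882527680259},
        # Same accession, but instantiated against another CV version.
        {"cvRef": "QC_old", "accession": "QC:4000160",
         "name": "ID ratio", "value": 0.5},
    ],
    "controlledVocabularies": [
        {"name": "Proteomics Standards Initiative Quality Control Ontology",
         "ref": "QC", "uri": QC_URI, "version": "0.1.0"},
        {"name": "Hypothetical older QC ontology release",
         "ref": "QC_old", "uri": OLD_QC_URI, "version": "0.0.9"},
    ],
}


def find_metrics(document, cv_uri, accession):
    """Return all metrics for `accession` that reference the CV at `cv_uri`."""
    # Step 1: resolve which 'ref' keys point at the controlled vocabulary of interest.
    refs = {cv["ref"] for cv in document["controlledVocabularies"]
            if cv["uri"] == cv_uri}
    # Step 2: keep only the metrics that match both the reference and the accession.
    return [metric for metric in document["qualityMetrics"]
            if metric["cvRef"] in refs and metric["accession"] == accession]


current = find_metrics(doc, QC_URI, "QC:4000160")
older = find_metrics(doc, OLD_QC_URI, "QC:4000160")
```

Here `current` and `older` each hold exactly one metric although both share the accession 'QC:4000160'. A query on the accession alone would return both, which is the situation described in point 3.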
There are some alternatives:

Remove the 'ref' and make the 'accession' a PURL. Implications are:
1. We would need to have our (and all 3rd-party ones!) controlled vocabulary registered and updated in a timely manner for new entries.
2. Retrieval would require a network connection.
3. The retrieval response for different 3rd-party controlled vocabularies is far from homogeneous and I could not find any documentation on the structure of the response (compare http://purl.obolibrary.org/obo/MS_1002358 with http://purl.obolibrary.org/obo/STATO_0000237).

Remove the 'ref' and reform the accession into the OBO PURL (as opposed to the term PURL), a separator ('#', which in URL-speak denotes a 'fragment') and the regular accession as found in the controlled vocabulary. Implications are:
1. We'd still need all the controlled vocabularies registered.
2. Each metric would get lengthy.

Remove the 'ref' and forget about the PURL altogether. Implications are:
1. We allow the use of only one version of a controlled vocabulary within any given mzQC document.
2. Any 3rd-party controlled vocabulary mustn't overlap in its accessions with any other controlled vocabulary in use, and the term namespace becomes an essential part.
3. A JSON query for a particular metric can be made in one step; the only open issue is terms that are unavailable in a certain version, which we can avoid by never deleting any.

Now some deliberations:
* Do we want to allow anyone to use multiple versions of a controlled vocabulary? Assuming no definitions change between versions, the only difference left is deprecated/deleted terms. That means there would be a problem in the future only if a term were to 'disappear', implying that we should only deprecate terms but leave them in, which I think is in accordance with how controlled vocabulary terms are supposed to work anyway.
* How would we deal with definition amendments (e.g. adding a unit formalisation where there was none before)? We should probably strive for adding 'fully defined' metrics only.
  That means the version given in an mzQC document is to be interpreted as 'this version or above'.
* Do we know how obolibrary PURLs work? The retrieval seems messy, and are there supposed to be versions? A 'guaranteed' namespace of our own would be beneficial in any case.
* We cannot control which controlled vocabularies are going to be used and therefore have no control over:
  o The PURL having a common structure (to parse the accession and retrieve the term offline)
  o There even being a PURL
* Online retrieval of terms is a nice-to-have, but involves a considerable technical maintenance burden.
* We could also build some integration tests for our .obo PRs to make a best effort to keep downwards compatibility and compatibility with the usual-suspect 3rd-party controlled vocabularies.

Overall, I am in agreement with Wout and think we would not lose much by getting rid of the 'ref', and we would gain clarity about how our controlled vocabulary accessions are supposed to work, namely as unique keys within the space of all obolibrary controlled vocabularies. Still, we'd need to make sure we get our own accession namespace ('QC:…').

Making the 'used controlled vocabulary' section of the mzQC schema a list is a technically very valid point, which, if I remember correctly, we agreed upon already, and I'll make sure this is reflected in the schema JSON in the repository.

Maybe we can get the accession issue sorted in a timely manner if everyone who has doubts or (dis)agrees with the possible solutions mails in response, so we can come to an agreement before the next call.

best,
mths

On 15/07/2020 03:26, Wout Bittremieux via Psidev-qc-dev wrote:
> Unfortunately I won't be able to join the call because I'm co-chairing
> the CompMS session at ISMB the whole day. (*Maybe* I'll be able to
> join the first 10 minutes.)
> Final preparations for this have taken up
> quite some time in the past two weeks as well, so unfortunately I
> haven't yet had the time to contribute further to the specification
> document.
>
> I've been thinking about using PURLs or CV references to link quality
> metrics to CVs, and I'm starting to favor something that was suggested
> two weeks ago: to stick to the old format, but reserve CV namespaces
> for the most commonly used CVs. So that would be "QC" for our CV, "MS"
> for the MS CV, and "UO" for the unit ontology. These cover the
> majority of use cases I think, and if we identify other commonly used
> CVs we can still add those as well.
>
> The advantage of this is that it's possible to efficiently query terms
> in these CVs, because the CV keys are fixed. So they essentially
> function the same way as the PURLs, without the need of actually
> having to use PURLs. For other CVs that aren't reserved a two-query
> solution will still be needed, but considering that such queries
> should be rare I don't consider that too much of a problem.
> A further advantage is that we can avoid the somewhat clunky PURL
> specification, don't depend on an external service, and (very
> importantly imo!) don't require web lookups to the PURL service (i.e.
> essential for doing stuff in firewalled compute environments).
>
> A small disadvantage would be that to validate these reserved CV keys
> we'd need to add explicit functionality in the mzQC Python library to
> do so. But this is hardly a showstopper.
>
> This solution seems to somewhat give us the best of both worlds I
> think. It's also nice that we don't need to change the JSON schema.
> (Although we should probably still change the CV references in the
> JSON schema from a dictionary to a list.)
> The only thing would be to clearly document this behavior and the
> reserved CV keys in the specification document.
>
> I think we should also adopt and explicitly document some best
> practices that were discussed in the context of adopting PURLs: that CV
> terms are final and can only ever be deprecated (i.e. an accession
> will always point to the same CV term) and that we should document an
> official CV versioning scheme.
>
> Let me know if I've overlooked something here.
>
> Best,
> Wout
>
> On 14/07/2020 19:05, Wout Bittremieux wrote:
>> Dear colleagues,
>>
>> This is a reminder that our next teleconference is scheduled for
>> Wednesday, July 15, at 14h00 GMT (15h00 London, 16h00 Western
>> Europe, 16h00 Cape Town, 17h00 Turkey, 7h00 San Diego).
>>
>> You can connect to our teleconference on Zoom through the
>> following link:
>> https://uchealth.zoom.us/j/92419363577?pwd=WVp5Q3FXNU9vaVdJT0ZNRllXWlN3Zz09
>> (Password: 012575)
>>
>> I'd like to propose the following agenda items:
>>
>> - Update on finalization of specification document
>> (https://docs.google.com/document/d/132F3MBgDJgtFlXxDZhpJ1oHGbKL8pT6dk9fvL55L5_M/edit).
>>
>> - New CV requests for PTXQC via mailing list
>> (https://sourceforge.net/p/psidev/mailman/message/37059772/). @Chris:
>> Is this not covered yet?
>> - Continue discussion on CV references / PURLs in mzQC schema
>> (https://github.com/HUPO-PSI/mzQC/pull/103).
>>
>> Thanks,
>> Wout
>
> _______________________________________________
> Psidev-qc-dev mailing list
> Psi...@li...
> https://lists.sourceforge.net/lists/listinfo/psidev-qc-dev

--
Mathias Walzer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Office: +44 (0)1223 494 2610
E-mail: wa...@eb...