From: Walzer <wa...@eb...> - 2020-07-30 09:00:56
Dear all,

if no one objects before the weekend, I will go ahead and merge the PR from branch v0.1.1 into our main branch, which includes the proposed changes (in short: remove the cvRef). I have already made the changes to a version of the specification document and to mzqc-pylib, and it's looking good (https://github.com/bigbio/mzqc-pylib/tree/v0.1.1).

all the best, stay safe everyone,
mths

On 17/07/2020 10:13, Walzer wrote:
> I'll try to break down the problem in order to know what possible
> solutions there are and what their implications (i.e. us imposing
> certain rules or creating control mechanisms) are. This is also for the
> benefit of those who are not familiar with the issue we're facing;
> everyone else can jump directly to the alternatives.
>
> So we have (at the very end of an mzQC document) an element that
> indicates which particular controlled vocabularies are used:
>
> ```
> 'controlledVocabularies': [
>   { 'name': 'Proteomics Standards Initiative Quality Control Ontology',
>     'ref': 'QC',
>     'uri': 'https://github.com/HUPO-PSI/qcML-development/blob/master/cv/v0_1_0/qc-cv.obo',
>     'version': '0.1.0'},
>   { 'name': 'Proteomics Standards Initiative '
>             'Mass Spectrometry Ontology',
>     'ref': 'MS',
>     'uri': 'https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo',
>     'version': '4.1.7'}
> ]
> ```
>
> Then, preceding that, we have a direct 'instantiation' of a CV term as a
> metric in an mzQC document. That is, picking an accession of one of the
> controlled vocabularies mentioned, with a reference to the latter (and
> fitting it with a 'value').
>
> ```
> {'cvRef': 'QC', 'accession': 'QC:4000160', 'name': 'ID ratio',
>  'value': 0.4644882527680259}
> ```
>
> 1. This means one could potentially use multiple versions of a
>    controlled vocabulary without any clashes.
> 2. Consuming such an mzQC means retrieving terms is a two-step process
>    (1. get the ref to the controlled vocabularies with terms
>    (accessions) of interest, 2.
>    then get the metrics that match reference and accession).
> 3. A json-path query 'might' yield terms from different versions (or
>    completely wrong ones if the namespaces, i.e. the part in front of
>    the colon and the accession number, conflict; but that should not
>    happen with established controlled vocabularies. Rather, that danger
>    would come from homebrew CVs, which puts it out of our scope).
> 4. A DB query would require a look-up table.
> 5. Controlled vocabulary versions wouldn't mean much anymore; we'd have
>    to bump the version with each addition.
>
> There are some alternatives:
>
> Remove the 'ref' and make the 'accession' a PURL. Implications are:
>
> 1. We would need to have our (and all 3rd-party!) controlled
>    vocabularies registered and updated in a timely manner for new
>    entries.
> 2. Retrieval would require a network connection.
> 3. The retrieval response for different 3rd-party controlled
>    vocabularies is far from homogeneous, and I could not find any
>    documentation on the structure of the response (compare
>    http://purl.obolibrary.org/obo/MS_1002358 with
>    http://purl.obolibrary.org/obo/STATO_0000237).
>
> Remove the 'ref' and reform the accession to the obo PURL (as opposed to
> the term PURL), a separator (#; in URL-speak this denotes a 'fragment')
> and the regular accession as found in the controlled vocabulary.
> Implications are:
>
> 1. We'd still need all the controlled vocabularies registered.
> 2. Each metric would get lengthy.
>
> Remove the 'ref' and forget about the PURL altogether. Implications are:
>
> 1. We allow the use of only one version of a controlled vocabulary
>    within any given mzQC document.
> 2. Any 3rd-party controlled vocabulary mustn't overlap in its accessions
>    with any other used controlled vocabulary, and the term namespace
>    becomes an essential part.
> 3.
>    A json query for a particular metric can be made in one step and
>    only leaves open the issue of terms unavailable in a certain version,
>    which we can avoid by not deleting any.
>
> Now some deliberations:
>
> * Do we want to allow anyone to use multiple versions of a controlled
>   vocabulary? Assuming no definitions change between versions, this
>   leaves only deprecated/deleted terms. This would mean that there
>   would be a problem in the future only if a term were to 'disappear',
>   implying we should only deprecate terms but leave them in, which
>   I think is in accordance with how controlled vocabulary terms are
>   supposed to work anyway.
> * How would we deal with definition amendments (e.g. adding a unit
>   formalisation where there was none before)? We should probably
>   strive for adding 'fully defined' metrics only. That means the
>   version given in an mzQC document is to be interpreted as 'this
>   version or above'.
> * Do we know how obolibrary PURLs work? The retrieval seems messy, and
>   are there supposed to be versions? A 'guaranteed' own namespace
>   would be beneficial in any case.
> * We cannot control which controlled vocabularies are going to be used
>   and therefore have no control over:
>   o The PURL having a common structure (to parse the accession and
>     retrieve the term offline)
>   o There even being a PURL
>   o Online retrieval of a term is a nice-to-have, but involves a
>     considerable technical maintenance burden.
> * We could also build some integration tests for our .obo PRs to make
>   a best effort to keep downwards compatibility and compatibility with
>   the usual-suspect 3rd-party controlled vocabularies.
>
> Overall, I am in agreement with Wout and think we would not lose much
> by getting rid of the 'ref', and we would gain clarity on how our
> controlled vocabulary accessions are supposed to work: as unique keys
> within the space of all obolibrary controlled vocabularies.
> Still we'd need to make sure we get our own accession
> namespace ('QC:…'). Making the 'used controlled vocabulary' section
> of the mzQC schema a list is a technically very valid point, which if
> I remember correctly we agreed upon already, and I'll make sure this
> is reflected in the schema json in the repository. Maybe we can get
> the accession issue sorted in a timely manner if everyone who has
> doubts or (dis)agrees with the possible solutions mails in response
> and we can come to an agreement before the next call.
>
> best,
> mths
>
> On 15/07/2020 03:26, Wout Bittremieux via Psidev-qc-dev wrote:
>> Unfortunately I won't be able to join the call because I'm
>> co-chairing the CompMS session at ISMB the whole day. (*Maybe* I'll
>> be able to join the first 10 minutes.) Final preparations for this
>> have taken up quite some time in the past two weeks as well, so
>> unfortunately I haven't yet had the time to contribute further to the
>> specification document.
>>
>> I've been thinking about using PURLs or CV references to link quality
>> metrics to CVs, and I'm starting to favor something that was
>> suggested two weeks ago: to stick to the old format, but reserve CV
>> namespaces for the most commonly used CVs. So that would be "QC" for
>> our CV, "MS" for the MS CV, and "UO" for the unit ontology. These
>> cover the majority of use cases I think, and if we identify other
>> commonly used CVs we can still add those as well.
>>
>> The advantage of this is that it's possible to efficiently query
>> terms in these CVs, because the CV keys are fixed. So they
>> essentially function the same way as the PURLs, without the need of
>> actually having to use PURLs. For other CVs that aren't reserved a
>> two-query solution will still be needed, but considering that such
>> queries should be rare I don't consider that too much of a problem.
>> A further advantage is that we can avoid the somewhat clunky PURL
>> specification, don't depend on an external service, and (very
>> importantly imo!) don't require web lookups to the PURL service (i.e.
>> essential for doing stuff in firewalled compute environments).
>>
>> A small disadvantage would be that to validate these reserved CV keys
>> we'd need to add explicit functionality in the mzQC Python library to
>> do so. But this is hardly a showstopper.
>>
>> This solution seems to somewhat give us the best of both worlds I
>> think. It's also nice that we don't need to change the JSON schema.
>> (Although we should probably still change the CV references in the
>> JSON schema from a dictionary to a list.)
>> The only thing would be to clearly document this behavior and the
>> reserved CV keys in the specification document.
>>
>> I think we should also adopt and explicitly document some best
>> practices that were discussed in the context of adopting PURLs: that CV
>> terms are final and can only ever be deprecated (i.e. an accession
>> will always point to the same CV term) and that we should document an
>> official CV versioning scheme.
>>
>> Let me know if I've overlooked something here.
>>
>> Best,
>> Wout
>>
>> On 14/07/2020 19:05, Wout Bittremieux wrote:
>>> Dear colleagues,
>>>
>>> This is a reminder that our next teleconference is scheduled for
>>> Wednesday, July 15, at 14h00 GMT (15h00 London, 16h00 Western
>>> Europe, 16h00 Cape Town, 17h00 Turkey, 7h00 San Diego).
>>>
>>> You can connect to our teleconference on Zoom through the
>>> following link:
>>> https://uchealth.zoom.us/j/92419363577?pwd=WVp5Q3FXNU9vaVdJT0ZNRllXWlN3Zz09
>>> (Password: 012575)
>>>
>>> I'd like to propose the following agenda items:
>>>
>>> - Update on finalization of specification document
>>>   (https://docs.google.com/document/d/132F3MBgDJgtFlXxDZhpJ1oHGbKL8pT6dk9fvL55L5_M/edit).
>>> - New CV requests for PTXQC via mailing list
>>>   (https://sourceforge.net/p/psidev/mailman/message/37059772/).
>>>   @Chris: Is this not covered yet?
>>> - Continue discussion on CV references / PURLs in mzQC schema
>>>   (https://github.com/HUPO-PSI/mzQC/pull/103).
>>>
>>> Thanks,
>>> Wout
>>
>> _______________________________________________
>> Psidev-qc-dev mailing list
>> Psi...@li...
>> https://lists.sourceforge.net/lists/listinfo/psidev-qc-dev

--
Mathias Walzer
European Bioinformatics Institute (EMBL-EBI)
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Office: +44 (0)1223 494 2610
E-mail: wa...@eb...
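
For readers new to the thread, the difference between the two-step lookup (with 'cvRef') and the one-step lookup (accession only) discussed above can be sketched roughly as follows. This is a minimal illustration on plain Python dicts modelled on the snippets quoted in the thread; it is not the mzqc-pylib API. The function names and the sample lists are hypothetical, and it assumes the CV entries still carry a 'ref' key in the old layout.

```python
# Hypothetical, simplified stand-ins for parsed mzQC content (assumption:
# plain dicts shaped like the snippets quoted in the thread).
controlled_vocabularies = [
    {"name": "Proteomics Standards Initiative Quality Control Ontology",
     "ref": "QC", "version": "0.1.0"},
    {"name": "Proteomics Standards Initiative Mass Spectrometry Ontology",
     "ref": "MS", "version": "4.1.7"},
]

# Old layout: every metric points back to a CV entry via 'cvRef'.
metrics_with_cvref = [
    {"cvRef": "QC", "accession": "QC:4000160", "name": "ID ratio",
     "value": 0.4644882527680259},
]

# Proposed layout: the accession alone identifies the term.
metrics_without_cvref = [
    {"accession": "QC:4000160", "name": "ID ratio",
     "value": 0.4644882527680259},
]


def find_two_step(cvs, metrics, cv_name, accession):
    """Old layout: resolve the CV's ref first, then match ref and accession."""
    refs = {cv["ref"] for cv in cvs if cv["name"] == cv_name}
    return [m for m in metrics
            if m["cvRef"] in refs and m["accession"] == accession]


def find_one_step(metrics, accession):
    """Proposed layout: a single pass matching on the accession only."""
    return [m for m in metrics if m["accession"] == accession]


old_hits = find_two_step(
    controlled_vocabularies, metrics_with_cvref,
    "Proteomics Standards Initiative Quality Control Ontology",
    "QC:4000160")
new_hits = find_one_step(metrics_without_cvref, "QC:4000160")
```

The one-step variant only works if accessions are unique across all controlled vocabularies used in a document, which is the constraint the thread settles on.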