From: Dave V. <vie...@KU...> - 2004-05-11 14:44:48
dh...@gb... wrote:
>
> This message was sent to: "Subcommittee for Data Access and Data
> Interoperability" <gbi...@ro...> with copy to: "GBIF
> Scientific Staff" <sci...@ai...>, "Hideaki Sugawara"
> <hsu...@ge...>, "Gregor Hagedorn" <G.H...@bb...>,
> "Robert Morris" <ra...@cs...>, "Guy Baillargeon"
> <bai...@ag...>, "Lawrence Way" <law...@jn...>
> Remember that the messages sent to the list(s) go to all recipients.
>
>
> Recent discussion in the DiGIR-developers list on the intended use of
> the InstitutionCode, CollectionCode and CatalogNumber identifiers in
> Darwin Core has made me realise that there is even more work required to
> get the "collection id" element of an LSID well-defined (or at least
> that LSIDs could not be implicitly deduced from DiGIR/Darwin Core
> records as they stand right now).  This issue has some aspects which we
> will also have to face when we work on standardising metadata for Darwin
> Core and other provider types.
>
> Darwin Core records are only guaranteed to be unique within a given
> resource if the InstitutionCode (IC), CollectionCode (CC) and
> CatalogNumber (CN) are all provided.  (You understand the qualified
> sense in which I use the word "guaranteed".)  It would be possible and
> valid (even if very unfriendly) for a single DiGIR resource to include
> distinct records with the following combinations:
>
> IC=InstitutionA, CC=Fishes, CN=1234
>
> IC=InstitutionA, CC=Mollusca, CN=1234
>
> IC=InstitutionB, CC=Fishes, CN=1234
>

There is nothing unfriendly about this.  It is entirely within the
definition of the terms of the Darwin Core since each record has a
unique identification based on the combination of IC, CC, and CN.
I don't think it has ever been stated that a resource should map directly
to a collection; this is something that is commonly done, but it is not
necessary.  A resource should only be considered a collection of records
that conform to a common schema.

Also, it has always been recommended, indeed it is *required*, that each
Darwin Core record has the IC, CC and CN included.  It has always been
intended, and the definitions of IC, CC, and CN have stated, that the
combination of these elements should uniquely identify a record.

>
> In this case InstitutionA:Fishes is assumed to be a different data set
> from InstitutionB:Fishes.
>
> My assumption in my previous messages was that we could assign a
> "collection id" (fourth element in an LSID) to data sets such that the
> combination of the "collection id" and a CatalogNumber/UnitId would be
> unique.  In the above cases my LSIDs could be:
>
> URN:LSID:gbif.net:fishes.InstitutionA.org:1234
>
> URN:LSID:gbif.net:mollusca.InstitutionA.org:1234
>
> URN:LSID:gbif.net:fishes.InstitutionB.org:1234
>

Remember that LSIDs are resolved by the LSID service.  This means that
the form of the LSID past the domain name section is defined *entirely*
by the LSID resolution service.  It may be that a particular LSID service
would follow the form outlined above, but there is absolutely no reason
why a particular service instance couldn't provide a different value
entirely, perhaps a key to an index table that provides a mapping between
the LSID value and a particular record.

So an LSID is meant to act only as a globally unique identifier system.
Uniqueness down to the level of the LSID resolution service is
guaranteed by the DNS.  Below that, it is up to the resolution service
instance to resolve the remainder of the LSID to an object.
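
As a rough illustration of the point above, here is a minimal sketch in
Python.  It is an assumption-laden example only: the names split_lsid,
Lsid, INDEX_TABLE and resolve_with_index, and the key "a91f", are made up
for this message and come from neither the LSID specification nor the
DiGIR code.  It shows a client that only needs to find the authority
(the DNS part) and hands everything after it to the service untouched.

```python
from typing import NamedTuple, Tuple


class Lsid(NamedTuple):
    authority: str  # DNS name that identifies the resolution service
    remainder: str  # opaque to clients; only the service interprets it


def split_lsid(lsid: str) -> Lsid:
    """Split an LSID into its authority and the service-defined remainder."""
    parts = lsid.split(":", 3)  # at most: urn, lsid, authority, remainder
    if len(parts) != 4 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    authority, remainder = parts[2], parts[3]
    if not authority or not remainder:
        raise ValueError(f"incomplete LSID: {lsid!r}")
    # DNS names are case-insensitive, so normalise the authority only.
    return Lsid(authority.lower(), remainder)


# Hypothetical index table kept by one service instance: the remainder is
# just a key, and the service maps it to an (IC, CC, CN) record key.
INDEX_TABLE = {
    "a91f": ("InstitutionA", "Fishes", "1234"),
}


def resolve_with_index(lsid: str) -> Tuple[str, str, str]:
    """Resolve an LSID to a record key by looking up the opaque remainder."""
    return INDEX_TABLE[split_lsid(lsid).remainder]
```

With this sketch, split_lsid("URN:LSID:gbif.net:fishes.InstitutionA.org:1234")
gives the authority "gbif.net" and leaves "fishes.InstitutionA.org:1234"
unparsed, while a different service is free to issue something like
urn:lsid:ku.digir.net:a91f and resolve it through its own table.
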
>
> This is still more or less what I would like to see in a case like this,
> but it has some implications.  I have been proposing a "lax" model for
> LSID management.  A provider could identify the "collection id" (e.g.
> "fishes.InstitutionA.org") in the metadata for the resource and
> implicitly assert that every record in the resource should be treated as
> having an LSID of the form
> "URN:LSID:gbif.net:<collection_id_from_metadata>:<CatalogNumber_from_record>".
> This will not work as it stands.  There are a number of options (in no
> particular order):
>
>    1. We follow a fully explicit model where records only have LSIDs if
>       such an id has been explicitly assigned to it (e.g. via an <lsid>
>       element in the Darwin Core record).  This establishes a rather
>       higher threshold for specimen/observation data providers who wish
>       for their data to have universal identifiers.  I would like to
>       avoid this if possible, since it would be good for us to be able
>       to refer to all records uniquely provided that those records meet
>       the current recommendations on use of elements in Darwin Core or
>       ABCD.
>    2. We require DiGIR resources (and similar electronic data resources)
>       to map to one and only one "collection id" which can then be
>       identified in the associated metadata.  This deviates from the
>       structural freedom currently given to data providers by DiGIR.
>       (It may however in any case be a good general recommendation that
>       where possible DiGIR providers should assign different collections
>       to different DiGIR resources.)
>    3. We use the "collection id" as recommended by Gregor to represent
>       the database, whatever it may contain, and find some way to
>       include all relevant fields within the "object id" (LSID element
>       5), i.e. including some combination of the IC, CC and CN.
>       A major
>       problem here is that we would seem to be reintroducing a
>       lower-level syntax inside one of the LSID elements.  This
>       inevitably leads to extra problems with encoding subelements and
>       parsing them.
>    4. We do something with DiGIR to allow different metadata to be
>       associated with different subsets of the records in the resource.
>       At this point the problem intersects with general questions of
>       metadata management for DiGIR/Darwin Core.  In some ways this may
>       just be a different perspective on option 2 above.
>
> A possible hybrid solution would be to allow providers to use implicit
> LSIDs if and only if the CatalogNumber/UnitId is guaranteed to be unique
> (by itself) within the resource.  In other cases the LSID must be
> provided explicitly for each record.  This could encourage the most
> appropriate choice of what should form a primary collection resource
> (i.e. one resource for one collection), while at the same time providing
> a clear signal to aggregators of how they should handle the propagation
> of LSIDs (explicit LSIDs preserved from primary resources and served
> explicitly as an extra field).  Of course this would become a slightly
> difficult thing to police: GBIF could refuse to resolve LSIDs for any
> resource which does not provide them explicitly and which has multiple
> records with the same CatalogNumber/UnitId.  Maybe this is enough.
> After all there are countless other ways in which supplied identifiers
> may be invalid.
>
> Does anyone have any thoughts on these options?
>

LSIDs are used to provide a unique identifier for an entity, generally a
digital object.  In the case of GBIF, the digital objects are instances
of specimen (and observation) records.
The use of LSIDs in the context
of GBIF appears to be primarily to provide a globally unique identifier
for these records, as opposed to the more complete use of the LSID system
which also involves the resolution of the LSID value to an instance of
the object.

So, in the case of DiGIR, what is the use of LSIDs?  From my (DiGIR and
other projects that use DiGIR) point of view, LSIDs are to provide a
convenient unique key for each record instance.  The value of the LSID
shall be generated by the DiGIR data service, either as an amalgamation
of existing key values or through the provision of an index table that
relates LSID values to unique records.  In addition, we would like to
move the LSID resolution service as close as possible to the DiGIR data
provider.  This means that each DiGIR provider will also act as an LSID
resolution service.  Resolving an LSID would ideally return an instance
of the specific XML record that the LSID points to.  It may instead
return a pointer to the actual data instance, which may be a DiGIR query
that returns a DiGIR resultset containing only that one record (this
latter scenario is less desirable for a variety of reasons).

The primary difficulty of implementing an LSID service for each DiGIR
provider is the availability of suitable DNS records.  In many cases,
the institutions may not be able to make suitable modifications to
their DNS entries (either because they have no control over them, or
because they do not own, or wish to own, a domain name).  In this case,
the simple solution is for a domain name (DN) owner to provide
sub-domains that can be used to identify LSID services at each DiGIR
provider.  So for example, the DNS _lsid._ku.digir.net SRV record might
resolve to the KU instance of an LSID service.
Then records being served up through the DiGIR
provider associated with that DN would have LSIDs of the form
URN:LSID:ku.digir.net:something where "something" is defined by the LSID
service, but could be something like:

something = DiGIR_instance_ID + ":" + DiGIR_resource_ID + ":" + value

or even:

something = value

So anyway, it seems that the discussion about LSIDs has been confused
with the issue of GUID provision, and has pretty much ignored the
implementation issues of LSID resolution services, and how these are to
be implemented in DiGIR and used by communities outside of GBIF.

regards,
  Dave V.