[Digir-dev] Re: LSIDs and GBIF

dh...@gb... wrote:

>
> This message was sent to: "Subcommittee for Data Access and Data
> Interoperability" <gbi...@ro...> with copy to: "GBIF
> Scientific Staff" <sci...@ai...>, "Hideaki Sugawara"
> <hsu...@ge...>, "Gregor Hagedorn" <G.H...@bb...>,
> "Robert Morris" <ra...@cs...>, "Guy Baillargeon"
> <bai...@ag...>, "Lawrence Way" <law...@jn...>
> Remember that the messages sent to the list(s) go to all recipients.
>
> Recent discussion in the DiGIR-developers list on the intended use of
> the InstitutionCode, CollectionCode and CatalogNumber identifiers in
> Darwin Core has made me realise that there is even more work required to
> get the "collection id" element of an LSID well-defined (or at least
> that LSIDs could not be implicitly deduced from DiGIR/Darwin Core
> records as they stand right now).  This issue has some aspects which we
> will also have to face when we work on standardising metadata for Darwin
> Core and other provider types.
>
> Darwin Core records are only guaranteed to be unique within a given
> resource if the InstitutionCode (IC), CollectionCode (CC) and
> CatalogNumber (CN) are all provided.  (You understand the qualified
> sense in which I use the word "guaranteed".)  It would be possible and
> valid (even if very unfriendly) for a single DiGIR resource to include
> distinct records with the following combinations:
>
> IC=InstitutionA, CC=Fishes, CN=1234
> IC=InstitutionA, CC=Mollusca, CN=1234
> IC=InstitutionB, CC=Fishes, CN=1234
>

There is nothing unfriendly about this.  It is entirely within the
definition of the terms of the Darwin Core, since each record has a
unique identification based on the combination of IC, CC, and CN.  I
don't think it has ever been stated that a resource should map directly
to a collection; this is commonly done, but it is not necessary.  A
resource should only be considered a collection of records that conform
to a common schema.

Also, it has always been recommended, indeed it is *required*, that each
Darwin Core record has the IC, CC and CN included.  It has always been
intended, and the definitions of IC, CC, and CN have stated, that the
combination of these elements should uniquely identify a record.
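
As a minimal sketch of that uniqueness rule, assuming records are plain
dictionaries with the hypothetical keys "IC", "CC" and "CN" (this is not
a DiGIR API), the following checks that the combined key is unique even
when the catalog number alone is not:

```python
from collections import Counter

# Hypothetical records mirroring the three combinations quoted above.
records = [
    {"IC": "InstitutionA", "CC": "Fishes", "CN": "1234"},
    {"IC": "InstitutionA", "CC": "Mollusca", "CN": "1234"},
    {"IC": "InstitutionB", "CC": "Fishes", "CN": "1234"},
]


def duplicate_keys(records, fields):
    """Return the key tuples (built from `fields`) that occur more than once."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


# No duplicates when all three elements form the key.
triple_dupes = duplicate_keys(records, ("IC", "CC", "CN"))
# The catalog number on its own collides across collections.
cn_dupes = duplicate_keys(records, ("CN",))
```

A provider could run a check of this kind over a resource before
publishing it, whatever the resource happens to contain.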

>
> In this case InstitutionA:Fishes is assumed to be a different data set
> from InstitutionB:Fishes.
>
> My assumption in my previous messages was that we could assign a
> "collection id" (fourth element in an LSID) to data sets such that the
> combination of the "collection id" and a CatalogNumber/UnitId would be
> unique.  In the above cases my LSIDs could be:
>
> URN:LSID:gbif.net:fishes.InstitutionA.org:1234
> URN:LSID:gbif.net:mollusca.InstitutionA.org:1234
> URN:LSID:gbif.net:fishes.InstitutionB.org:1234
>

Remember that LSIDs are resolved by the LSID service.  This means that
the form of the LSID past the domain name section is *entirely* defined
by the LSID resolution service.  It may be that a particular LSID
service would follow the form outlined above, but there is absolutely no
reason why a particular service instance couldn't provide a different
value entirely, perhaps a key to an index table that provides a mapping
between the LSID value and a particular record.

So an LSID is meant to act only as a globally unique identifier system.
Uniqueness down to the level of the LSID resolution service is
guaranteed by the DNS.  Below that, it is up to the resolution service
instance to resolve the remainder of the LSID to an object.
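
To illustrate that division of responsibility, here is a small sketch
(the function name split_lsid is made up for this example) that
interprets only the "URN:LSID:" prefix and the authority, and hands
everything after the authority back untouched for the resolution
service to interpret:

```python
def split_lsid(lsid):
    """Split an LSID into (authority, remainder).

    Only the 'URN:LSID:' prefix and the authority (a DNS name) are
    interpreted here; the remainder is returned exactly as given,
    because its form is up to the resolution service.
    """
    parts = lsid.split(":", 3)
    if len(parts) != 4 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % (lsid,))
    authority, remainder = parts[2], parts[3]
    if not authority or not remainder:
        raise ValueError("LSID is missing an authority or a remainder: %r" % (lsid,))
    # DNS names are case-insensitive, so normalise the authority only.
    return authority.lower(), remainder


authority, remainder = split_lsid("URN:LSID:gbif.net:fishes.InstitutionA.org:1234")
```

A client would use the authority to locate the service and pass the
remainder along without trying to parse it.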

>
> This is still more or less what I would like to see in a case like this,
> but it has some implications.  I have been proposing a "lax" model for
> LSID management.  A provider could identify the "collection id" (e.g.
> "fishes.InstitutionA.org") in the metadata for the resource and
> implicitly assert that every record in the resource should be treated as
> having an LSID of the form
> "URN:LSID:gbif.net:<collection_id_from_metadata>:<CatalogNumber_from_record>".
> This will not work as it stands.  There are a number of options (in no
> particular order):
>
>    1. We follow a fully explicit model where records only have LSIDs if
>       such an id has been explicitly assigned to it (e.g. via an <lsid>
>       element in the Darwin Core record).  This establishes a rather
>       higher threshold for specimen/observation data providers who wish
>       for their data to have universal identifiers.  I would like to
>       avoid this if possible, since it would be good for us to be able
>       to refer to all records uniquely provided that those records meet
>       the current recommendations on use of elements in Darwin Core or
>       ABCD.
>    2. We require DiGIR resources (and similar electronic data resources)
>       to map to one and only one "collection id" which can then be
>       identified in the associated metadata.  This deviates from the
>       structural freedom currently given to data providers by DiGIR.
>       (It may however in any case be a good general recommendation that
>       where possible DiGIR providers should assign different collections
>       to different DiGIR resources.)
>    3. We use the "collection id" as recommended by Gregor to represent
>       the database, whatever it may contain, and find some way to
>       include all relevant fields within the "object id" (LSID element
>       5), i.e. including some combination of the IC, CC and CN.  A major
>       problem here is that we would seem to be reintroducing a
>       lower-level syntax inside one of the LSID elements.  This
>       inevitably leads to extra problems with encoding subelements and
>       parsing them.
>    4. We do something with DiGIR to allow different metadata to be
>       associated with different subsets of the records in the resource.
>       At this point the problem intersects with general questions of
>       metadata management for DiGIR/Darwin Core.  In some ways this may
>       just be a different perspective on option 2 above.
>
> A possible hybrid solution would be to allow providers to use implicit
> LSIDs if and only if the CatalogNumber/UnitId is guaranteed to be unique
> (by itself) within the resource.  In other cases the LSID must be
> provided explicitly for each record.  This could encourage the most
> appropriate choice of what should form a primary collection resource
> (i.e. one resource for one collection), while at the same time providing
> a clear signal to aggregators of how they should handle the propagation
> of LSIDs (explicit LSIDs preserved from primary resources and served
> explicitly as an extra field).  Of course this would become a slightly
> difficult thing to police: GBIF could refuse to resolve LSIDs for any
> resource which does not provide them explicitly and which has multiple
> records with the same CatalogNumber/UnitId.  Maybe this is enough.
> After all there are countless other ways in which supplied identifiers
> may be invalid.
>
> Does anyone have any thoughts on these options?
>
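
The hybrid rule proposed in the quoted text can be sketched as follows.
This is only an illustration under stated assumptions: records are
dictionaries with a "CN" key and an optional "lsid" key, and the
function name assign_lsids and its parameters are hypothetical, not
part of DiGIR or any GBIF service.

```python
def assign_lsids(records, collection_id, authority="gbif.net"):
    """Apply the hybrid rule to the records of one resource.

    An explicit LSID on a record is always preserved.  An implicit LSID
    is built from the collection id and CatalogNumber only when every
    CatalogNumber in the resource is unique by itself; otherwise a
    record without an explicit LSID is rejected.
    """
    catalog_numbers = [r["CN"] for r in records]
    cn_unique = len(set(catalog_numbers)) == len(catalog_numbers)

    lsids = []
    for record in records:
        if "lsid" in record:
            lsids.append(record["lsid"])
        elif cn_unique:
            lsids.append("URN:LSID:%s:%s:%s" % (authority, collection_id, record["CN"]))
        else:
            raise ValueError(
                "CatalogNumber %r is not unique in this resource; "
                "an explicit LSID is required" % (record["CN"],)
            )
    return lsids


implicit = assign_lsids([{"CN": "1"}, {"CN": "2"}], "fishes.InstitutionA.org")
```

Raising an error corresponds to the "refuse to resolve" policing option
mentioned above; an aggregator might prefer to skip such records
instead.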

LSIDs are used to provide a unique identifier for an entity, generally a
digital object.  In the case of GBIF, the digital objects are instances
of specimen (and observation) records.  The use of LSIDs in the context
of GBIF appears primarily to provide a globally unique identifier for
these records, as opposed to the more complete use of the LSID system
which also involves the resolution of the LSID value to an instance of
the object.

So, in the case of DiGIR, what is the use of LSIDs?  From my (DiGIR and
other projects that use DiGIR) point of view, LSIDs are to provide a
convenient unique key for each record instance.  The value of the LSID
shall be generated by the DiGIR data service, either as an amalgamation
of existing key values or through the provision of an index table that
relates LSID values to unique records.  In addition, we would like to
move the LSID resolution service as close as possible to the DiGIR data
provider.  This means that each DiGIR provider will also act as an LSID
resolution service.  The result of resolving an LSID would ideally
provide an instance of the specific XML record that the LSID points to.
It may also provide a pointer to the actual data instance, which may
be a DiGIR query that returns a DiGIR resultset containing only that one
record (this latter scenario is less desirable for a variety of
reasons).
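
The index-table variant might look something like the sketch below.
The class LsidIndex, its methods and the sample record are all made up
for illustration; a real provider would back the table with its
database rather than an in-memory dictionary, and the LSID form used
here follows the loose notation of this message rather than the full
LSID syntax.

```python
import itertools


class LsidIndex:
    """Toy index table relating opaque LSID values to record XML."""

    def __init__(self, authority):
        self.authority = authority
        self._next_key = itertools.count(1)  # opaque keys: "1", "2", ...
        self._table = {}

    def register(self, record_xml):
        """Store a record and return the LSID assigned to it."""
        key = str(next(self._next_key))
        self._table[key] = record_xml
        return "URN:LSID:%s:%s" % (self.authority, key)

    def resolve(self, lsid):
        """Return the record XML for an LSID issued by this index."""
        prefix = "URN:LSID:%s:" % self.authority
        if not lsid.startswith(prefix):
            raise KeyError(lsid)
        return self._table[lsid[len(prefix):]]


index = LsidIndex("ku.digir.net")
lsid = index.register("<record><CatalogNumber>1234</CatalogNumber></record>")
record = index.resolve(lsid)
```

Because the key carries no meaning, the provider is free to reorganise
its collections without invalidating LSIDs that have already been
issued.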

The primary difficulty of implementing an LSID service for each DiGIR
provider is the availability of suitable DNS records.  In many cases,
the institutions may not be able to make suitable modifications to
their DNS entries (either because they have no control over them, or
because they do not own, or wish to own, a domain name).  In this case,
the simple solution is for a DN owner to provide sub-domains that can be
used to identify LSID services at each DiGIR provider.  So for example,
the DNS _lsid._ku.digir.net SRV record might resolve to the KU instance
of an LSID service.  Then records being served up through the DiGIR
provider associated with that DN would be of the form
URN:LSID:ku.digir.net:something where "something" is defined by the LSID
service, but could be something like:

something = DiGIR_instance_ID + ":" + DiGIR_resource_ID + ":" + value

or even:

something = value
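
Written out as a small function, the two forms above could be sketched
like this.  The function name make_lsid, its parameters and the sample
identifiers "prov1" and "fishes" are hypothetical; only the
URN:LSID:ku.digir.net prefix comes from the example in this message.

```python
def make_lsid(authority, value, instance_id=None, resource_id=None):
    """Build an LSID whose trailing part is chosen by the service.

    With both an instance id and a resource id the long form is used;
    otherwise the bare value is used on its own.
    """
    if instance_id is not None and resource_id is not None:
        something = instance_id + ":" + resource_id + ":" + value
    else:
        something = value
    return "URN:LSID:" + authority + ":" + something


short_form = make_lsid("ku.digir.net", "1234")
long_form = make_lsid("ku.digir.net", "1234", instance_id="prov1", resource_id="fishes")
```

Either form is acceptable as long as the service that issued the LSID
is also the one that resolves it.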

So anyway, it seems that the discussion about LSIDs has been confused
with the issue of GUID provision, and has pretty much ignored the
implementation issues of LSID resolution services, and how these are to
be implemented in DiGIR and used by communities outside of GBIF.

regards,
   Dave V.