Hello all -

We've gotten the go-ahead to start assigning DOIs [1] to data sets! More precisely, Syracuse University is now an authorized DOI publication agent via TIB [2], the registration agency for scientific primary and secondary data, which is operated by the German National Library of Science and Technology. Assigning DOIs gives each data set a permanent identifier and makes the data easier to cite.

The process of assigning DOIs to data sets involves resolving a few metadata ambiguities, so I wanted to check with the FLOSSmole user community for suggestions before we finalize the details. Please feel free to comment on any of the nine fields [3] that are up for consideration; the rest are either fixed or completely unambiguous. Below, I've pasted in an example XML file for a FLOSSmole dataset DOI definition [4].

There aren't necessarily any right answers here, so any feedback on the proposed DOI naming convention and metadata values would be very helpful.

Thanks for your feedback!


[1] www.doi.org

[2] http://www.std-doi.de/front_content.php

[3] Fields that are up for discussion:
1) <DOI> - everything after the 10.4118/ is up to us. The naming convention I've suggested below uses the repository name as a prefix (flossmole) followed by the filename, separated by a dot. This is by no means the best DOI convention, just what occurred to me as reasonably logical. We can also just assign ascending numeric values, e.g. 10.4118/floss.000001, as they do not have to be human-readable.
2) <resourceIdentifier> I've included the eprints record and data set download URLs as additional identifiers. They are shown as "ProprietaryIdentifiers" because URL is not a legal type for the field (URN would be, but these are not URNs, strictly speaking).
3) <creator> Currently Kevin, Megan, and James are credited as creators, as they are the PIs for the FLOSSmole project.
4) <contributor> The contributor is currently defined as the source repository from which FLOSSmole spidered the data.
5) <publisher> The publisher is currently defined as the entity that published the dataset. If we were to issue DOIs for datasets from FLOSSmetrics, for example, then the publisher would be FLOSSmetrics.
6) <creationDate> and <publicationDate> are the same in this case; as I understand it, the FLOSSmole data are created and then promptly published, so there's not much point of differentiation. We can choose to use one or the other, or both.
7) <description> The description shown here is the short abstract on the eprints record.
8) <publicationPlace> is currently listed as SourceForge.net but could just as appropriately be listed as Elon, NC.
9) <discipline> is currently listed as softwareEngineering. The field is intended to identify a scientific discipline in the most traditional sense, and this seemed to us the closest fit.
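
To make the two naming options in item 1 concrete, here is a rough Python sketch of how the DOI strings might be built. The function names are hypothetical, and whether the filename should keep its extension is an assumption on my part; only the 10.4118 prefix, the "flossmole" repository prefix, and the 10.4118/floss.000001 example come from the proposal itself.

```python
# Illustrative only: neither function is part of any TIB or FLOSSmole tool.
DOI_PREFIX = "10.4118"  # the part of the DOI that is fixed for us


def filename_doi(filename, repository="flossmole"):
    """Option A: repository name, a dot, then the data set filename.

    Assumption: the filename is used exactly as given, extension included.
    """
    return f"{DOI_PREFIX}/{repository}.{filename}"


def sequential_doi(number, label="floss", width=6):
    """Option B: a zero-padded ascending number after a short label."""
    return f"{DOI_PREFIX}/{label}.{number:0{width}d}"


print(filename_doi("fmProjectAuthors2008-May.txt.bz2"))
# prints 10.4118/flossmole.fmProjectAuthors2008-May.txt.bz2
print(sequential_doi(1))
# prints 10.4118/floss.000001
```

Option A keeps the DOI human-readable at the cost of tying it to a filename; option B is opaque but never needs to change if a file is renamed.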

[4] Sample XML record
<?xml version="1.0"?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="std-doi.xsd">
<resourceIdentifier type="ProprietaryIdentifier">http://flosspapers.org/1585/</resourceIdentifier>
<resourceIdentifier type="ProprietaryIdentifier">http://downloads.sourceforge.net/ossmole/fmProjectAuthors2008-May.txt.bz2</resourceIdentifier>
<description>Freshmeat project authorship data.</description>

Andrea Wiggins
PhD Student, School of Information Studies
Syracuse University

337 Hinds Hall
Syracuse, NY 13244