Filled in a good deal of explanation about the design
Editor: JimDowning <ojd20@...>
Date: 2004/12/06 09:13:40
- * Allow atomic CRUD of a stream of metadata and optionally a stream of data, keyed by a string identifier
+ === Allow atomic CRUD of a stream of metadata and optionally a stream of data, keyed by a string identifier ===
- * Data format / serialization agnostic
- * Versioning done in metadata (and hence not by lowest level)
- * Shouldn't assume any semantics in the identifier other than uniqueness (although this doesn't exclude semantic identifiers)
- * Notify higher layers of changes
- * Should it do checksumming at this level?
- Other features I've put in / will put in
+ The storage layer need not be OAIS specific (N.B. it '''must''' be compliant)
+ I'm aiming for an API that can be implemented by a range of storage technologies, so that implementations can be based on file systems, distributed storage and so on. Placing high requirements on transaction support is one of the things that could raise the bar prohibitively, so I've gone for the simplest atom of transaction as the base level of support. It may turn out that we need more transaction support as we go through the design. Better to under-engineer now.
- * Change storage hints to a formal definition of a metadata subprofile that the asset storage layer gets access to.
- * Base implementation of asset store that implements asynchronous notification of listeners through durable queues. Done.
- * Transactions are described as command objects. Good way of defining and limiting the transaction support an asset store must implement.
- == Layering and config management ==
+ The atomic transactions are described as Command objects, which I think is a good way of defining and limiting the transaction support an asset store must implement.
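As a rough illustration of the Command idea, here is a minimal sketch in which each command is the whole transaction and the store only promises that one command runs at a time. All names here (`AssetCommand`, `PutAsset`, `DeleteAsset`, `InMemoryAssetStore`, `StoredAsset`) are hypothetical and are not taken from the proto; the in-memory map merely stands in for a real storage technology.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Metadata plus optional data, both held as opaque bytes (hypothetical type). */
final class StoredAsset {
    final byte[] metadata;
    final byte[] data; // null when the asset has no data stream

    StoredAsset(byte[] metadata, byte[] data) {
        this.metadata = metadata;
        this.data = data;
    }
}

/** One atomic unit of work; the set of command types defines the transaction support. */
interface AssetCommand {
    void applyTo(Map<String, StoredAsset> assets);
}

final class PutAsset implements AssetCommand {
    private final String id;
    private final InputStream metadata;
    private final InputStream data; // may be null

    PutAsset(String id, InputStream metadata, InputStream data) {
        this.id = id;
        this.metadata = metadata;
        this.data = data;
    }

    @Override
    public void applyTo(Map<String, StoredAsset> assets) {
        // Drain both streams before touching the map, so a failed read stores nothing.
        byte[] meta = readFully(metadata);
        byte[] body = (data == null) ? null : readFully(data);
        assets.put(id, new StoredAsset(meta, body));
    }

    private static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class DeleteAsset implements AssetCommand {
    private final String id;

    DeleteAsset(String id) {
        this.id = id;
    }

    @Override
    public void applyTo(Map<String, StoredAsset> assets) {
        assets.remove(id);
    }
}

/** Toy store: the identifier is only ever used as a unique key. */
final class InMemoryAssetStore {
    private final Map<String, StoredAsset> assets = new HashMap<>();

    /** Runs one command under the store's lock, so commands never interleave. */
    synchronized void perform(AssetCommand command) {
        command.applyTo(assets);
    }

    synchronized StoredAsset read(String id) {
        return assets.get(id);
    }
}
```

A store built on a different technology would only need to honour the same guarantee for each command type it accepts, which keeps the required transaction support small and explicit.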
- Strongly layered interfaces are good because they allow minimal implementations.
- Sometimes it's necessary to integrate vertically for efficiency reasons. For example, if you wish to lay out your asset store with the OAIS containers as directories you need to know the id of a package's container to know where to place it. If you enforce strict '''implementation''' layering you would have to deserialize the metadata in order to discover this (bad).
+ === Data format / serialization agnostic ===
- Instead, I think it's better to allow components to implement several interfaces on separate layers, and have a software configuration manager that ensures that the active components fulfil all the required interfaces and that there are no conflicts or gaps.
+ The storage layer should be completely agnostic of the data structures and serialization of the metadata stored. We (the digital preservation community) will change our minds about metadata standards in time, as our thinking progresses. This is entirely predictable and we should design tools that don't need a ground-up rewrite when it happens.
- I've implemented this in the proto by allowing a storage hints object (a java.util.Properties) to pass extra information to the asset store layer. Still unsure what the best thing to do with this is - whether to have a standard set of hint keys (and if so, should they be required?)?
+ === Versioning done in metadata (and hence not by lowest level) ===
+ Versioning is another area we might well change policy on in the future. The only reason I can see for holding the versioning data at the storage level is to achieve diff-based storage, which I don't feel is appropriate for a digital preservation system.
- Todo: -
- * Work up an example of a naive layered implementation, and a more sophisticated implementation that requires input to the asset store client and implementation
- * Implement basic config management approach that allows each of these to be configured and alerts if an error exists.
+ === Shouldn't assume any semantics in the identifier other than uniqueness ===
+ N.B. this doesn't exclude semantic identifiers. The whole debate on identifiers is far from finished. I don't see a reason for the asset storage layer to care as long as uniqueness is guaranteed.
+ === Notify higher layers of changes ===
+ N.B. that the asset layer must regard everything else as a 'higher layer' - it must not depend on any of them.
+ I've implemented this using event listeners (the Observer pattern). With synchronous delivery this can lead to poor round-trip performance when compared to a polling solution (what Rob calls 'pull'), although the throughput is generally comparable. To remove this performance bottleneck you can make the notifications asynchronous. I've achieved this using the excellent ActiveMQ framework, which is an extremely lightweight JMS (Java Message Service) implementation. The implementation takes fewer than 300 lines of code.
+ The alternative to this is for every asset store implementation to support an index of items against time, and probably a scheduling manager in the application layer to choreograph access to that index to prevent spike loads.
+ In case anyone missed the discussion on this subject in August: Compared to an asynchronous Observer (push) solution I think the query (pull) solution will: -
+ * be harder to implement
+ * be less efficient (Locked update of a large index vs serializing a message)
+ * handle heavy load less well (blocking threads more expensive than maintaining a queue)
+ * handle large asset stores less easily (the central index is a size bottleneck)
+ That's why I favour the use of the Observer pattern with asynchronous event delivery in this situation.
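To make the push argument concrete, here is a minimal sketch of asynchronous Observer delivery. It deliberately swaps the durable JMS queue for a plain in-process `ExecutorService` queue, so events are lost if the process dies; the names `AssetListener` and `AsyncNotifier` are hypothetical and are not the proto's classes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Implemented by higher layers; the asset layer knows nothing else about them. */
interface AssetListener {
    void assetChanged(String assetId, String changeType);
}

final class AsyncNotifier {
    private final List<AssetListener> listeners = new CopyOnWriteArrayList<>();

    // A single worker thread with an unbounded queue: events are delivered
    // in the order they were published, off the caller's thread.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    void addListener(AssetListener listener) {
        listeners.add(listener);
    }

    /** Called by the store after a change; returns as soon as the event is queued. */
    void publish(String assetId, String changeType) {
        dispatcher.execute(() -> {
            for (AssetListener listener : listeners) {
                try {
                    listener.assetChanged(assetId, changeType);
                } catch (RuntimeException e) {
                    // A failing listener must not stop delivery to the others.
                }
            }
        });
    }

    /** Stops accepting events and waits for the queued ones to be delivered. */
    boolean shutdownAndDrain(long timeoutMillis) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The store's write path pays only for enqueuing a small task, which is the property the comparison above relies on; a JMS-backed version would additionally make the queue durable.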
+ === Checksumming ===
+ I haven't put checksumming into this proto. I should have and will.
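One possible shape for this, sketched with the standard `java.security.MessageDigest` API: the checksum is computed over the raw stream, so it stays agnostic of the data format. The class name `Checksums` and the choice of SHA-256 are assumptions for illustration, not decisions recorded on this page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Checksums {
    private Checksums() {
    }

    /** Reads the stream to its end and returns its SHA-256 digest as lowercase hex. */
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required of every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Whether the store computes this itself or only verifies a value supplied by a higher layer is still the open question raised above.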
+ === Enabling layout strategies ===
+ Suppose you want your archive to be laid out on a file system such that there is a directory per community, one per collection, one per AIP and so on, or that you want to insert a meta storage layer that stores one community's assets on a LOCKSS system and another's on an SRB. The storage layer then needs "some" semantics about the item stored. I enable this by creating a data structure of the relevant information just before the metadata is serialised. At the moment I have this as a Properties object, but I think that a defined structure would be more appropriate.
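A defined structure in place of the Properties object might look something like the sketch below. The field names (`community`, `collection`, `packageId`) and the types `StorageHints`, `LayoutStrategy` and `DirectoryPerLevelLayout` are hypothetical, chosen only to mirror the directory example above, and the sketch assumes each value is already safe to use as a directory name.

```java
import java.nio.file.Path;
import java.util.Objects;

/** The small, fixed set of facts the storage layer is allowed to see. */
final class StorageHints {
    final String community;
    final String collection;
    final String packageId;

    StorageHints(String community, String collection, String packageId) {
        this.community = Objects.requireNonNull(community, "community");
        this.collection = Objects.requireNonNull(collection, "collection");
        this.packageId = Objects.requireNonNull(packageId, "packageId");
    }
}

/** Decides where an item lives without ever deserializing its metadata. */
interface LayoutStrategy {
    Path locate(Path root, StorageHints hints);
}

/** One directory per community, then per collection, then per package. */
final class DirectoryPerLevelLayout implements LayoutStrategy {
    @Override
    public Path locate(Path root, StorageHints hints) {
        // Assumes every hint value is a valid single path segment.
        return root.resolve(hints.community)
                   .resolve(hints.collection)
                   .resolve(hints.packageId);
    }
}
```

A meta storage layer could implement the same interface idea by switching on `community` to pick a backend, which is why a typed structure with required fields seems easier to validate than free-form keys.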