Following on from the database thread, but slightly shifting the topic...
RDBMS referential integrity checks are just one validation tool amongst a whole lot of checking that will need to happen. They help, but are by no means a guarantee that data is correct. Since digital data fails 'silently', i.e. you don't know it's corrupted until you try to read it, we will in any case need periodic auditing processes which go through the system verifying that everything is present and valid. Of course, unless you're doing regular mirroring/caching/backups, catching the corruption is rather a case of closing the stable door after the horse has bolted, but managing what happens when one copy of data becomes corrupt is a big part of what preservation is all about.
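To make that a little more concrete, a periodic fixity audit might look roughly like the sketch below. This is just an illustration: the loadExpectedChecksums() helper stands in for however the archive records checksums at ingest time, and none of this is actual DSpace code.

```java
// Sketch of a periodic fixity audit: re-read every stored AIP file,
// recompute its checksum and compare it with the value recorded at ingest.
// Silent corruption only shows up when the bits are actually re-read.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Map;

public class FixityAudit {

    public static void main(String[] args) throws Exception {
        // AIP file path -> checksum recorded when the AIP was stored
        Map<Path, String> expected = loadExpectedChecksums();

        for (Map.Entry<Path, String> e : expected.entrySet()) {
            if (!Files.exists(e.getKey())) {
                System.err.println("MISSING: " + e.getKey());
            } else if (!digest(e.getKey()).equals(e.getValue())) {
                System.err.println("CORRUPT: " + e.getKey());
            }
        }
    }

    static String digest(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    static Map<Path, String> loadExpectedChecksums() {
        // Placeholder: in practice this would come from the archive's metadata store
        return Map.of();
    }
}
```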
I guess this is at the root of one fundamental idea I have about how a scalable preservation archive needs to function. One of the key things you need to do to preserve things is (IMO) to 'liberate' them from any particular hardware/software platform. The AIPs that DSpace uses have to be usable outside a DSpace context, and you shouldn't need the DSpace software and compatible hardware to access them. By this I don't mean that the files should actually be accessed from other places all the time, just that they should be stored in such a fashion that, as long as you have the AIPs, it doesn't matter whether the DSpace software stack and compatible hardware are still around; you can still read the AIPs and load them into some other system.
In order to preserve the AIPs, one important strategy will be to make lots of copies of them in lots of places; hence mirroring/sharing of AIPs between DSpaces will be an important feature in the future. Indeed, the original vision of DSpace was to have a network of these systems which together would form a distributed preservation archive of institutions' research output; the very name 'DSpace' was meant to convey the notion that these things together form a big collective space rather than lots of 'silos'. In a system like this, I'm uncomfortable with the idea that everything distributed-system-wide has to be in the same, synchronised state all the time. Over time, things will change and go corrupt, and from time to time you will have to step back and consider, is everything I have in my index/cache/local copy etc. really correct? I think the whole idea of trying to keep everything in this wide system in sync, referentially integral and so forth simply won't scale. I'm
much more comfortable with the idea that you have an authoritative AIP, perhaps with a designated owner, and if that's changed by some appropriate process, you accept that for a time some indices and caches will be out of sync and plan around that.
In other words, in this scenario, I think a pull model will work better than a push model; it should be the responsibility of all the caches/indices/mirrors to make sure their contents are up to date wrt the 'authoritative' AIP, rather than vice versa. There are complications in both cases, of course. In the pull model the problem is flow control: making sure that stores aren't getting hammered by lots of others at once. In the push model there's the messaging overhead, and questions of whether you have guaranteed delivery, what to do if a listener is down, what that listener does later if it comes back up having missed some messages, and, in the case of large-scale changes to a store (e.g. when the media filter runs over it), big spikes where lots of listeners pounce at once and you have the flow control problem again.
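As a rough sketch of what I mean by the pull model (the AipStore interface and its changedSince() method are made-up names for illustration, not an existing API): each consumer keeps its own watermark and periodically asks the authoritative store what has changed, so the store never has to push anything or even know who its consumers are.

```java
// Minimal sketch of the pull model. The consumer (index/cache/mirror)
// remembers when it last polled and asks the authoritative store what
// changed since then; flow control lives entirely on the consumer side.
import java.time.Instant;
import java.util.List;

interface AipStore {
    // Identifiers of AIPs created or modified since the given instant (illustrative)
    List<String> changedSince(Instant since);

    byte[] retrieve(String aipId);
}

class PollingIndexer {
    private final AipStore store;
    private Instant lastPoll = Instant.EPOCH;

    PollingIndexer(AipStore store) {
        this.store = store;
    }

    // Called on whatever schedule the consumer chooses
    void poll() {
        Instant now = Instant.now();
        for (String id : store.changedSince(lastPoll)) {
            reindex(id, store.retrieve(id));
        }
        lastPoll = now;
    }

    private void reindex(String id, byte[] aip) {
        System.out.println("Re-indexing " + id);
    }
}
```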
This is where I'm coming from on the polling methodology for the asset store. (This, and the fact that of the three repository-type systems I've built, the messaging-based one was the one that scaled the worst.) The above is talking about a widely-distributed, networked environment, so maybe the logic doesn't necessarily transfer to a single-instance, localised environment. However, maybe we shouldn't assume that single instances of DSpace won't involve networked aspects: LDAP servers for e-people etc.; clustered servers; grid-based storage. Perhaps Richard R or someone else from the SDSC project that's planning to integrate SRB with DSpace can chime in here, but I believe the plan is to put DSpace on top of an SRB (grid)-based asset store that offers huge storage and replicates AIPs etc. Other DSpaces elsewhere would also be plugged into this grid and accessing the same AIPs, so you have the networking element right there.
In any case, I think (as we're all finding already) that periodic re-indexing of the asset store for some reason or other will be necessary. Here you have the flow control problem again -- imagine re-indexing a huge asset store of 10 million items. So I think the asset store API will need the sort of functionality that a polling methodology would require in any case.
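The kind of functionality I have in mind is something like the sketch below: enumeration of the store in bounded batches that the caller can pace, rather than one giant result set. Again, the interface and names here are assumptions for illustration, not an actual DSpace API.

```java
// Sketch of an asset store API suited to polling/re-indexing consumers:
// identifiers are listed in bounded batches so a full re-index of millions
// of AIPs can be paced by the caller rather than hammering the store.
import java.util.List;

interface EnumerableAipStore {
    // Returns up to 'limit' AIP identifiers starting after 'afterId'
    // (null to start from the beginning); an empty list means we're done.
    List<String> listIdentifiers(String afterId, int limit);
}

class FullReindexJob {
    void run(EnumerableAipStore store, int batchSize, long pauseMillis)
            throws InterruptedException {
        String cursor = null;
        List<String> batch;
        while (!(batch = store.listIdentifiers(cursor, batchSize)).isEmpty()) {
            for (String id : batch) {
                // index(id) ...
            }
            cursor = batch.get(batch.size() - 1);
            // Crude flow control: pause between batches so the store
            // isn't swamped by the re-index (or by many consumers at once).
            Thread.sleep(pauseMillis);
        }
    }
}
```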
So, there is the cost to instant gratification (which only happens when there's no gatekeeping/review submission workflow in any case). I don't think this is designing out usability; I think it's a case of weighing up costs and benefits. You can't just write down a list of every feature and characteristic you'd like and expect to be able to come up with something that meets them all; you always have to prioritise. Any approach will have pros and cons.
Anyway, that's a big 'brain dump' (many may feel that an appropriate term) that hopefully illuminates where I'm coming from. Of course there are other ways of dealing with the various problems, but polling really feels to me to be the most robust and simple way of dealing with them all.
Let me in any case throw out another idea: an ingest pipeline. This could be considered an expansion of the submission workflow. Imagine a pipeline where you can plug in components that get invoked in sequence when something is put in one end. This is pretty much how Cocoon works. Then, modules which want to index things could plug a component into this pipeline which would get invoked whenever something is ingested. Conceptually, the effect is similar to a messaging system, but you don't have a lot of the complexity. My thinking here isn't complete -- there are still server load issues (does every single item go through the pipeline one by one?), and questions about what to do when mirrored AIPs come in from other DSpaces; but there are probably ways round them (you could probably have a 'mirror ingest' pipeline or something). But it's an idea I thought I'd throw out.
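In code, the basic shape might be something like this (IngestComponent, Item and IngestPipeline are illustrative names only, not a proposed API):

```java
// Minimal sketch of the ingest pipeline idea: components registered with the
// pipeline are invoked in sequence whenever something is ingested, much like
// a Cocoon pipeline.
import java.util.ArrayList;
import java.util.List;

interface IngestComponent {
    void onIngest(Item item);
}

class Item {
    final String id;
    Item(String id) { this.id = id; }
}

class IngestPipeline {
    private final List<IngestComponent> components = new ArrayList<>();

    // e.g. a search module plugs in an indexing component here
    void addComponent(IngestComponent c) {
        components.add(c);
    }

    void ingest(Item item) {
        for (IngestComponent c : components) {
            c.onIngest(item);
        }
    }
}
```

A 'mirror ingest' pipeline for AIPs arriving from other DSpaces would then just be a second pipeline instance with a different set of components plugged in.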
Credit for this last idea has to go to Bill Cattey at MIT: I have to doff my cap to Bill at this point -- I think the DSpace 2.0 design is coming perilously close to what was on that infamous napkin 4 years ago ;-)
Robert Tansley / Digital Media Systems Programme / HP Labs