oh ho! We've hit the inevitable "put up or shut up" moment! ;-)
My sleeves are rolled up, I'm back from my travels. We have an early Sept. deadline. I should be contributing soon!
On Jun 30, 2008, at 8:46 AM, Andrew Nagy wrote:
Naomi - I would say that this is for the vufind-unicorn list. Each ILS works differently in this respect. With Voyager, I can simply request from the command line all bib records that have been touched in the past 24 hours - same with holdings, etc. I say build the REST-based system for Unicorn - and then we can port it to Jangle so that it will work with all ILSs and we would have a standard way of getting bibs and holdings no matter which ILS we use.
Here's another set of thorny questions and ideas from yours truly. As always, I apologize for the really long posting.
Like everyone else's, our bib and holdings data is changing all the time. We need nightly updates to our solr/lucene index to keep vufind synched with the latest, greatest information from our ILS. We have too much data to do a full reload nightly, but we get a significant number of new records every day that need to be added. We also dread, yet expect, to reload the index from scratch occasionally.
We believe a web service is the best way to facilitate these loads.
We have 5 million records, so just doing one giant request / response is NOT going to work. As it is we've "chunked" our *static* marc21 bib load into several 500K sections of marc21 records. I'm not sure how big the nightly update loads will be yet - perhaps after an initial load, we'll be able to manage with a few separate HTTP requests. Lots I don't know yet.
What if there was a ReST service that we could use to pull bib AND relevant holdings information from Unicorn? What if we could get updates OR a whole load this way? This would be a way to deal with 1) getting batch loads of data to create/update documents in the index and 2) getting updates from an ILS for both bib and holdings rec info.
Such a service would potentially have utility for all the ILS systems and all the Next Generation Discovery systems. Certainly for any large collection of bib/holdings data.
The ReST requests could ask for all records, or they could ask for records updated since a certain date, or records last updated between two dates (just like OAI-PMH requests).
I'm thinking the responses would look something like this:
[bib record in marcxml]
<item> [item1 holdings info] </item>
<item> [item2 holdings info] </item>
If you don't like xml, we could have the same service functionality with marc21 records as the responses (rather than xml). We could even have the holdings info stuck in the marc fields, just as it is currently.
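For the xml variant, here is a minimal Python sketch of how one combined "record" might be assembled. The wrapper element name (bibWithItems) and the item field names are my own placeholders, not anything Unicorn or vufind defines; only the MARCXML namespace is the real one.

```python
import xml.etree.ElementTree as ET

# Real MARCXML namespace; everything else below is a placeholder.
MARC_NS = "http://www.loc.gov/MARC21/slim"


def build_combined_record(marcxml, items):
    """Wrap one MARCXML bib record and its item dicts in a single element.

    marcxml: string holding one <record> in the MARCXML namespace.
    items:   list of dicts, one per item (field names are hypothetical).
    """
    wrapper = ET.Element("bibWithItems")  # hypothetical wrapper name
    wrapper.append(ET.fromstring(marcxml))
    for item in items:
        item_el = ET.SubElement(wrapper, "item")
        for field, value in item.items():
            ET.SubElement(item_el, field).text = value
    return ET.tostring(wrapper, encoding="unicode")
```

The same idea would work with marc21 output: the items would go into local MARC fields instead of item elements.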
Requests would be something like this:
getRecords (no dates)
getRecords (beg date)
getRecords (beg date, end date)
getRecords (end date)
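On the client side, the four request shapes above could map to one URL builder. This is only a sketch: the /getRecords path is hypothetical, and the "from" / "until" parameter names are borrowed from OAI-PMH rather than from any existing Unicorn service.

```python
from urllib.parse import urlencode


def get_records_url(base_url, beg_date=None, end_date=None):
    """Build a getRecords URL; both dates are optional datetime.date values."""
    params = {}
    if beg_date is not None:
        params["from"] = beg_date.isoformat()   # name borrowed from OAI-PMH
    if end_date is not None:
        params["until"] = end_date.isoformat()  # name borrowed from OAI-PMH
    query = urlencode(params)
    url = base_url.rstrip("/") + "/getRecords"  # hypothetical path
    return url + ("?" + query if query else "")
```

With no dates the URL asks for everything; with only a beginning date it asks for everything updated since then.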
The way this could work is that there would be ILS specific code to do the following:
- select appropriate bib records
- for each selected bib record, convert it to marcxml, then get/create xml for the related holdings/item/call number records, and combine these into a "record" per above.
- "chunk" responses into reasonable lengths and serve out each piece as requested (OAI-PMH has a resumptionToken mechanism to facilitate this).
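The chunking step could be sketched like this in Python. It is a simplification: the token here is just a stringified offset into an in-memory list, whereas a real OAI-PMH resumptionToken is opaque to the client, and the function names are made up for illustration.

```python
def get_chunk(record_ids, token=None, chunk_size=1000):
    """Return (chunk, next_token); next_token is None on the last chunk."""
    start = int(token) if token else 0
    chunk = record_ids[start:start + chunk_size]
    next_start = start + chunk_size
    next_token = str(next_start) if next_start < len(record_ids) else None
    return chunk, next_token


def fetch_all(record_ids, chunk_size=1000):
    """Client loop: keep asking with the last token until none comes back."""
    token = None
    while True:
        chunk, token = get_chunk(record_ids, token, chunk_size)
        yield chunk
        if token is None:
            break
```

In a real service get_chunk would sit behind the HTTP request, and the token would need to stay valid even if records change between requests.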
Here's something Wayne wrote in response to a similar post of mine on the vufind-unicorn list:
On Jun 23, 2008, at 7:26 AM, Wayne Graham wrote:
I think the only thing I worry about is scaling may be an issue. This could be a possibility though once we finish the code to actually post a full marc record to Solr directly (the same way you post a CSV file). I would think that instead of passing XML back (at least for these methods), a marc file that could be indexed would make more sense.
I think we're actually talking about the same thing. The main difference is that you expose these scripts to a web service, correct? Instead of your Sirsi server pushing these updates, you pull them from Solr.
Again, I would argue that the overhead of doing this with large datasets may be a little prohibitive. However, it doesn't require setting up rsync or scp routines, which means this may be very attractive for a lot of folks without big (or cooperative) IT teams.
So, would you like to add something like
records.get (start_date, end_date, format) where none of the elements are required; if no start and end date, it returns everything, and the formats are marcxml and marc (default is marc)?
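A rough Python sketch of that signature on the server side, with every argument optional and marc as the default format. The in-memory record layout (dicts with "updated", "marc" and "marcxml" keys) and the inclusive date bounds are assumptions for illustration; a real implementation would query the ILS.

```python
def get_records(records, start_date=None, end_date=None, fmt="marc"):
    """Return records updated within [start_date, end_date], in one format.

    records: list of dicts with hypothetical keys
             "updated" (datetime.date), "marc" and "marcxml".
    With no dates, everything is returned.
    """
    if fmt not in ("marc", "marcxml"):
        raise ValueError("format must be 'marc' or 'marcxml'")
    selected = [
        r for r in records
        if (start_date is None or r["updated"] >= start_date)
        and (end_date is None or r["updated"] <= end_date)
    ]
    return [r[fmt] for r in selected]
```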
Leaving the xml vs. marc21 format issue aside, the key point is to use a web service, with the NGDE requesting the pull.
One of the benefits to web services is that they can be well defined but not be implementation specific ... so it's another way to unify practice.
I'm guessing there are a bunch of possible performance bottlenecks:
1. getting item recs for each individual bib rec: ameliorate by requesting items for a list of bib recs instead of one at a time?
2. combining item recs and bib recs into a single "response" or single "indexing time" unit - this shouldn't be that slow, right?
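To make points 1 and 2 concrete, here is a Python sketch that asks for items one batch of bib IDs at a time and then groups them per bib. The fetch_items callable and the "bib_id" key are hypothetical stand-ins for whatever the ILS-specific code would provide.

```python
from collections import defaultdict


def batches(seq, size):
    """Yield consecutive slices of seq, each at most `size` long."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def attach_items(bib_ids, fetch_items, batch_size=500):
    """Map each bib ID to its list of items, using one lookup per batch.

    fetch_items(list_of_bib_ids) is a hypothetical ILS call returning
    item dicts that each carry a "bib_id" key.
    """
    items_by_bib = defaultdict(list)
    for batch in batches(bib_ids, batch_size):
        for item in fetch_items(batch):
            items_by_bib[item["bib_id"]].append(item)
    # Bibs with no items still get an (empty) entry.
    return {bib_id: items_by_bib.get(bib_id, []) for bib_id in bib_ids}
```

The grouping itself is a single pass over the items in memory, so the lookup round trips are likely to dominate the cost.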
These are "solved" performance problems:
3. http responses too big: break them into chunks, a la OAI-PMH
4. http requests/responses slowing down indexing: pull all the responses first, then index them all as a lump.
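Point 4 could look roughly like this in Python: spool every chunk to disk first, then hand the files to the indexer in a second pass. The fetch_chunk and index_file callables are hypothetical; fetch_chunk(token) is assumed to return (bytes, next_token), with next_token set to None on the last chunk.

```python
from pathlib import Path


def pull_all(fetch_chunk, spool_dir):
    """Download every chunk to spool_dir before any indexing starts."""
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    token = None
    while True:
        data, token = fetch_chunk(token)
        path = spool_dir / f"chunk_{len(paths):05d}.mrc"
        path.write_bytes(data)
        paths.append(path)
        if token is None:
            break
    return paths


def index_all(paths, index_file):
    """Second pass: index the spooled files with no HTTP in the loop."""
    for path in paths:
        index_file(path)
```

A side benefit of spooling is that a failed indexing run can be restarted without pulling everything from the ILS again.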
Other issues / ideas / comments?