There is really no single "best" way to accomplish a task like this.  It is an integration task, and it depends on every peculiarity of the various systems involved.  

In the context of Evergreen ILS, I wrote code to manage "remote file" accounts (FTP, SFTP, SSH/SCP), so that an action could produce files to be transferred to remote accounts (optionally retried a configurable number of times after *transfer* failure), and files could be retrieved from remote accounts, producing events.  This underlies the EDI and Asterisk functionality, among others, but it also depends on EG's complex action/trigger mechanism, to the extent that it is not something we could just port over to VuFind.
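
That code isn't portable, but the retry-after-transfer-failure idea is simple enough to sketch.  Here is a minimal version using Python's stdlib ftplib (the names and the retry policy are my own illustration, not Evergreen's):

    import time
    from ftplib import FTP, all_errors

    def push_with_retries(host, user, password, local_path, remote_name,
                          max_retries=3, wait_seconds=60):
        """Upload one file, retrying a configurable number of times
        after *transfer* failure (hypothetical names and policy)."""
        for attempt in range(1, max_retries + 1):
            try:
                with FTP(host) as ftp:   # SFTP/SCP would need another library (e.g. paramiko)
                    ftp.login(user, password)
                    with open(local_path, 'rb') as f:
                        ftp.storbinary('STOR ' + remote_name, f)
                return True              # success: stop retrying
            except all_errors as exc:
                print('attempt %d/%d failed: %s' % (attempt, max_retries, exc))
                time.sleep(wait_seconds)
        return False                     # caller records a failure event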

There are different ways to think about this problem:
  • publication/subscription - abstract each source into a queue.  The consuming end starts to look a lot more like the OAI-PMH case, but that seems like too much work at the outset, and again for each additional source.
  • Solr-centric - trigger processing by defining an appropriate listener to receive each source type.  Then, instead of FTP'ing to you, your sources could post right to Solr (after you properly add an authentication layer).  But that seems unlikely to integrate with the tools you would want, and it depends too much on capabilities in your source systems.
  • ad hoc scripting - maybe easier to implement, but harder to maintain.
So basically, I'd advise you to do whatever you think you can do.  You might end up with a combination of these.
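
To make the queue idea concrete, here is a rough sketch (all names, like INBOX and the Source class, are hypothetical; the point is just "every source gets normalized into one common queue"):

    import shutil
    from pathlib import Path

    INBOX = Path('/var/spool/vufind-import')     # hypothetical common queue dir

    class Source:
        """One instance per partner library / transport type."""
        def __init__(self, name):
            self.name = name

        def fetch(self):
            """Retrieve new raw files; return their local Paths.
            Subclasses implement OAI-PMH harvesting, FTP pulls, etc."""
            raise NotImplementedError

        def enqueue(self):
            for raw in self.fetch():
                # normalization (charset, format fixes) would happen here
                shutil.copy(raw, INBOX / ('%s-%s' % (self.name, raw.name)))

    # The consumer side then only ever watches INBOX, no matter how
    # the data originally arrived.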

I agree that the statistics produced during the various phases of the load processing would be the right way to track it, but that requires building common logging conventions into each phase.  At a certain complexity and scale, that gets pretty hard.  For example, if you are piping a large set through yaz for a conversion, you won't necessarily know which of the 200,000 records caused it to crash.  Maybe you get output and maybe you don't.
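
One mitigation is to bisect: split the batch and re-run the converter on halves until the crashing record is isolated.  A rough sketch, assuming one bad record, deterministic failures, and records pre-split into one file each (the yaz-marcdump flags are just illustrative):

    import subprocess

    def converts_ok(paths):
        """Run the converter over a set of record files; True if it survives.
        Converted output goes to stdout and is discarded here."""
        cmd = ['yaz-marcdump', '-i', 'marc', '-o', 'marcxml'] + [str(p) for p in paths]
        return subprocess.run(cmd, capture_output=True).returncode == 0

    def find_bad_record(paths):
        """Binary-search a batch (known to crash) for the offending record."""
        if len(paths) == 1:
            return paths[0]
        mid = len(paths) // 2
        left, right = paths[:mid], paths[mid:]
        return find_bad_record(left if not converts_ok(left) else right)

Logging per-chunk record counts as you go also gives you exactly the kind of phase statistics Thomas describes below.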

Major conceptual problems:
  • data dependencies, chronology - for the most part, your records will be independent of one another, but sometimes order is important.  If one of your sources provides separate holdings data, for example, it will depend on bibliographic data from another "source" already having been loaded.  But the holdings file might arrive *before* the bib data.  Or the bib data transfer might fail, or be blocked somewhere in the processing, while the holdings data is still usable (but referencing bibs that don't exist yet in VuFind).  One way to park such records is sketched after this list.
  • deduplication - how to avoid making the same changes twice, re-retrieving an already processed file, etc. (a cheap guard is sketched in the second example below).
  • deletion - handling record deletion in formats that lack a common representation for it.  That is, how does the ILS/source tell you "delete from biblio where id=1234"?  Note that deletion also affects chronology: a delete can arrive before the record it refers to has even been loaded.
  • cron/scheduling - having retrieval, processing and load jobs in crontab is the most reliable way to make sure jobs are attempted, but it is the least transparent way for VuFind as an application to know about them.  It also presents all kinds of security, execution-environment and coordination problems.  For example, maybe you schedule a file retrieval job and then, X minutes later, a job to do step 1 of the processing.  That works until your data gets bigger, or the provider *at their discretion* decides to throttle the transfer, and the retrieval now takes longer than X.  Starting processing on the incomplete file produces incomplete data and data corruption.  A cheap guard against this is also sketched below.
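For the chronology problem, one common pattern is to park orphaned holdings and retry them on later runs.  A sketch using a hypothetical SQLite table (bib_exists and index_holding stand in for your real lookup and loader):

    import sqlite3

    db = sqlite3.connect('import_state.db')      # hypothetical state database
    db.execute('CREATE TABLE IF NOT EXISTS pending_holdings'
               ' (bib_id TEXT, payload TEXT)')

    def load_holding(bib_id, payload, bib_exists, index_holding):
        """Load now if the bib is already indexed, otherwise park it."""
        if bib_exists(bib_id):
            index_holding(bib_id, payload)
        else:
            db.execute('INSERT INTO pending_holdings VALUES (?, ?)',
                       (bib_id, payload))
            db.commit()

    def retry_pending(bib_exists, index_holding):
        """Run after each bib load; drain whatever has become loadable."""
        rows = db.execute(
            'SELECT rowid, bib_id, payload FROM pending_holdings').fetchall()
        for rowid, bib_id, payload in rows:
            if bib_exists(bib_id):
                index_holding(bib_id, payload)
                db.execute('DELETE FROM pending_holdings WHERE rowid = ?',
                           (rowid,))
        db.commit()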
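
And for the deduplication and incomplete-transfer cases, a cheap gate in front of the processing step helps.  Another sketch (the size-stability heuristic and the "processed" ledger are my assumptions, not a recipe):

    import hashlib
    import time
    from pathlib import Path

    LEDGER = Path('/var/lib/import/processed.sha256')   # hypothetical ledger

    def transfer_complete(path, wait=30):
        """Crude completeness check: size unchanged over `wait` seconds.
        (A sender-written 'done' marker file is more reliable, if you can
        get your providers to write one.)"""
        size = path.stat().st_size
        time.sleep(wait)
        return path.stat().st_size == size

    def digest(path):
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def already_processed(path):
        seen = LEDGER.read_text().split() if LEDGER.exists() else []
        return digest(path) in seen

    def mark_processed(path):
        """Call only after the whole pipeline succeeded for this file."""
        with LEDGER.open('a') as f:
            f.write(digest(path) + '\n')

Whether you key the ledger on file contents or on (source, filename, mtime) depends on whether your providers re-send corrected files under the same name.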

I would be very interested to see how well RecordManager handles these problems, in particular with the newest versions of VuFind.  Good luck!

--joe


On Mon, Sep 23, 2013 at 11:22 AM, thomas schwaerzler <thomas.schwaerzler@uibk.ac.at> wrote:

hi,

we have a VuFind instance running with over 20 different sources from a
growing number of partner libraries, and those sources differ a lot.
our preferred interface for grabbing catalogue data from our partner
libraries is OAI-PMH, but there are also libraries offering the Aleph
X-Server, file uploads to our FTP server, or HTTP/FTP downloads.

the formats range from MARC21, MARCXML, MABxml, UNIMARC, danMARC and
so on, so most of those formats need preprocessing; in some cases the
XML has to be brought to a valid state before XSLT can be used.

since handling so many sources takes a lot of effort, i am thinking
about writing some update software that keeps all sources in a database
table, along with import statistics for each source (date of import,
number of records, maybe some characteristics of the records), so that
i can estimate whether a new import is OK.

so my main questions are:

1) what approaches already exist that fulfill these tasks or some
similar ones?
2) what do you think is the best way to do something like this?




--
Thomas Schwaerzler
Department of Digital Services
University Innsbruck Library
6020 Innsbruck - Innrain 52 - Austria
Phone: ++43-(0)512-507-2489
Fax: ++43-(0)512-507-94908451
Email: <Thomas.Schwaerzler@uibk.ac.at>
URL: http://www.uibk.ac.at/ulb/ds
