There is really no single "best" way to accomplish a task like this. It is an integration task, and it depends on every peculiarity of the various systems involved.
In the context of Evergreen ILS, I wrote code to manage "remote file" accounts (FTP, SFTP, SSH/SCP), so that an action could produce files to be transferred to remote accounts (optionally retried a configurable number of times after *transfer* failure), and files could be retrieved from remote accounts, producing events. This underlies EDI and Asterisk functionality, among others, but it also depends on EG's complex action/trigger mechanism, to the extent that it is not something we could just port over to VuFind.
There are different ways to think about this problem. One is the publish/subscribe model, which suggests you would abstract each source into a queue. The consuming end then starts to look a lot like the OAI-PMH case, but that seems like too much work at the outset, and again for each additional source. From the Solr point of view, one could also imagine processing being triggered by defining an appropriate listener for each source type. Then instead of FTP'ing files to you, your sources could post directly to Solr (once you add a proper authentication layer). But that seems unlikely to integrate with the tools you would want, and it depends too much on capabilities in your source systems. Finally, there is the ad hoc scripting model: probably easier to implement, but harder to maintain. So basically, I'd advise you to do whatever you think you can actually do. You might end up with a combination of these.
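To make the queue idea concrete, here is a minimal sketch of abstracting each source into its own queue, using a plain filesystem-backed inbox per source. The directory layout and source names are illustrative assumptions, not anything VuFind provides:

```python
import shutil
from pathlib import Path

# Hypothetical layout: each source gets its own inbox directory that
# acts as a simple filesystem-backed queue. The root path is an
# assumption for illustration.
QUEUE_ROOT = Path("/var/lib/vufind/queues")

def enqueue(source: str, incoming_file: Path) -> Path:
    """Drop a retrieved file into the per-source queue directory."""
    queue_dir = QUEUE_ROOT / source
    queue_dir.mkdir(parents=True, exist_ok=True)
    dest = queue_dir / incoming_file.name
    shutil.copy2(incoming_file, dest)
    return dest

def dequeue(source: str):
    """Yield queued files oldest-first; the consumer decides how
    (and whether) to process each one."""
    queue_dir = QUEUE_ROOT / source
    if not queue_dir.is_dir():
        return
    for path in sorted(queue_dir.iterdir(), key=lambda p: p.stat().st_mtime):
        yield path
```

The point of the abstraction is that retrieval (FTP, SFTP, HTTP POST, whatever each source supports) only has to end in an `enqueue`, and all downstream processing only has to start from a `dequeue`, regardless of source.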
I agree the statistics produced during the various phases of load processing would be the right way to track it, but that requires building common logging conventions into each phase. At a certain complexity and scale, that gets pretty hard. For example, if you are piping a large set through yaz for a conversion, you won't necessarily know which of the 200,000 records caused it to crash. Maybe you get output and maybe you don't.
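One workaround when an opaque conversion step crashes on a big batch is to bisect the batch and retry the halves until the offending record is isolated. A sketch, where `convert` is a hypothetical stand-in for whatever external step (a yaz-based conversion, say) raises on bad input:

```python
def find_bad_records(records, convert):
    """Bisect a batch to isolate records that make `convert` fail.

    `convert` is any callable that raises an exception on bad input.
    Returns the list of individual records that fail on their own.
    """
    try:
        convert(records)
        return []                   # whole batch converts cleanly
    except Exception:
        if len(records) == 1:
            return list(records)    # isolated a culprit
        mid = len(records) // 2
        return (find_bad_records(records[:mid], convert)
                + find_bad_records(records[mid:], convert))
```

This costs O(log n) extra conversion runs per bad record, which is usually acceptable next to re-running the whole 200,000 by hand.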
Major conceptual problems:
- data dependencies, chronology - for the most part, your records will be independent of one another, but sometimes order is important. If one of your sources provides separate holdings data, for example, it will depend on bibliographic data from another "source" already having been loaded. But the holdings file might arrive *before* the bib data. Or the bib data transfer might fail, or be blocked somewhere in the processing, while the holdings data is still usable (but referencing bibs that don't exist yet in VuFind).
- deduplication - how to avoid applying the same changes twice, re-retrieving an already-processed file, and so on.
- deletion - handling record deletion in formats that lack a common representation for it. That is, how does the ILS/source tell you "delete from biblio where id=1234"?
- cron/scheduling - putting retrieval, processing and load jobs in crontab is the most reliable way to make sure jobs are attempted, but it is the least transparent way for VuFind as an application to know about them. It also presents all kinds of security, execution-environment and coordination problems. For example, maybe you schedule a file retrieval job and then, X minutes later, a job to do step 1 of the processing. That works until your data gets bigger, or the provider *at their discretion* decides to throttle the transfer, and it now takes longer than X. Starting to process the incomplete file produces incomplete data and data corruption.
Note that deletion also affects chronology: a delete can arrive, and be processed, before the record it refers to has been loaded.
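Two of the problems above - processing a file a throttled transfer is still writing, and loading the same file twice - can be guarded against cheaply. A sketch, assuming a size-stability heuristic for completeness and a checksum ledger for dedup (the ledger here is an in-memory set standing in for a persistent table):

```python
import hashlib
import time
from pathlib import Path

def is_transfer_complete(path: Path, settle_seconds: float = 5.0) -> bool:
    """Heuristic: treat a file as fully transferred once its size has
    stopped changing for `settle_seconds`. Guards against a cron job
    starting to process a file a slow transfer is still writing."""
    before = path.stat().st_size
    time.sleep(settle_seconds)
    return path.stat().st_size == before

def already_processed(path: Path, ledger: set) -> bool:
    """Deduplicate by content checksum so a re-retrieved or
    re-delivered file is not loaded twice. `ledger` stands in for
    persistent storage (a database table, in practice)."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in ledger:
        return True
    ledger.add(digest)
    return False
```

The size-stability check is only a heuristic (a paused transfer can fool it); where the provider supports it, a sentinel "done" file or a manifest with expected sizes is more robust.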
I would be very interested to see how well RecordManager handles these problems, in particular with the newest versions of VuFind. Good luck!