Re: [Queue-developers] Feedback requested on detailed plans and code for contrib project
From: Monica L. <ml...@al...> - 2000-09-16 00:15:16
Thank you for the feedback!

Currently, if the queue_manager crashes, all the jobs that have been submitted are lost, because the queue_manager keeps track of this information only in its internal data structures. The queue daemons, however, are unaware that the queue_manager has crashed and continue to run the jobs they are currently running; they simply can't connect to the queue_manager. When the queue_manager starts back up on the same machine, everything resumes its usual course, except that users have to resubmit their jobs.

On the other hand, if you want to start the queue_manager on another machine, you would have to reconfigure the whole system. We can fix this problem by putting the name of the server that runs the queue_manager in a file. Right before the queue daemons connect to the queue_manager, they read this file and connect to the server it names.

Someone also brought up the idea of a back-up queue_manager. There would be a master queue_manager and a slave queue_manager. The slave has all the information that the master has, so if the slave ever detects that the master has failed, it takes over the master's role until the master starts back up again.

Is this similar to what you have in mind?

Monica

Tavis Barr wrote:
>
> The code has got some neat features, I just have one concern. As it
> stands, the machine running queue_manager cannot be taken down for
> service. With the old implementation, machines attempting to submit a
> job would eventually give up on a dead machine, so no machine was
> essential. The old code certainly wasn't perfect; if a machine was down
> or even busy, it could take quite a long time for a job to get
> submitted. But it seemed like it could be improved upon by allowing the
> queue client package to fail out more quickly when trying to reach an
> unresponsive host. (I understand this was in the works for the new
> version; I haven't had a chance to try it yet since it doesn't seem to
> work on Digital Unix, maybe if I have time I'll try to do some
> debugging.)
>
> With the new implementation, as I understand it you can't take the machine
> running queue_manager out of service without reconfiguring the whole
> system. This might be correctable by allowing for failover on the
> queue_manager daemon (e.g., sharing the process database, having
> queue_manager store a lock flag every second that is valid for ten seconds,
> and having queued on a secondary machine to grab the lock and spawn its
> own queue_manager if the primary machine stops renewing its own locks or
> something). I don't know if this would cause problems interacting with
> license managers.
>
> Anyway, I don't want to disparage what looks overall like some great
> work, I'm just trying to suggest how it might be improved if I understand
> it correctly.
>
> Cheers,
> Tavis
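
To make the two suggestions in this thread a bit more concrete, here is a rough Python sketch of (a) a daemon reading the queue_manager host name from a file just before it connects, and (b) the "lock flag renewed every second, valid for ten seconds" check a secondary machine could use to decide when to take over. This is not code from the Queue sources. The function names, the file layout, and the `localhost` fallback are all made up for illustration, and the sketch assumes the lock file lives on storage both machines can see and that their clocks roughly agree.

```python
import os
import time

LEASE_SECONDS = 10.0  # the lock flag is "valid for ten seconds"


def read_manager_host(path, default="localhost"):
    """Return the first non-blank, non-comment line of the host file.

    Hypothetical helper: the daemons would call this right before each
    connection attempt, so editing the file redirects them.
    """
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#"):
                    return line
    except FileNotFoundError:
        pass
    return default


def renew_lease(path, owner, now=None):
    """Record who holds the lock and when it was last renewed.

    Writes a temporary file and renames it over the lock file so a
    reader never sees a half-written lock.
    """
    now = time.time() if now is None else now
    tmp = f"{path}.{owner}.tmp"
    with open(tmp, "w") as fh:
        fh.write(f"{owner} {now}\n")
    os.replace(tmp, path)


def read_lease(path):
    """Return (owner, timestamp), or None if the lock is missing or garbled."""
    try:
        with open(path) as fh:
            owner, stamp = fh.read().split()
        return owner, float(stamp)
    except (FileNotFoundError, ValueError):
        return None


def should_take_over(path, me, now=None):
    """True if `me` should grab the lock and spawn its own queue_manager."""
    now = time.time() if now is None else now
    lease = read_lease(path)
    if lease is None:
        return True
    owner, stamp = lease
    return owner != me and now - stamp > LEASE_SECONDS
```

In this sketch the primary would call `renew_lease()` about once a second, and the secondary would poll `should_take_over()` and, when it returns true, call `renew_lease()` under its own name before starting a queue_manager. The check and the grab are two separate steps here, so with more than one secondary some atomic locking step would still be needed, and none of this addresses replicating the job list itself.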