Re: [Queue-developers] Feedback requested on detailed plans and code for contrib project

Thank you for the feedback!  Currently, if the queue_manager crashes,
all the jobs that have been submitted are lost, because the
queue_manager keeps track of them only in its internal data
structures.  The queue daemons, however, are unaware that the
queue_manager has crashed and keep running the jobs they are currently
running; they simply cannot connect to the queue_manager.  When the
queue_manager starts back up on the same machine, everything resumes
its usual course, though users would have to resubmit their jobs.  If,
on the other hand, you want to start the queue_manager on another
machine, you have to reconfigure the whole system.  We can fix this
problem by putting the name of the server that runs the queue_manager
in a file; right before the queue daemons connect to the queue_manager,
they will read this file and connect to the server it names.
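
To make the file idea concrete, here is a rough sketch in Python (not
the project's actual code).  The file path, the port number, and the
function names are all made up for illustration; the only point is
that the file is re-read on every connection attempt, so moving the
queue_manager only requires editing one file.

```python
import socket
from pathlib import Path

# Hypothetical location and port; nothing in this thread fixes either one.
MANAGER_HOST_FILE = Path("/etc/queue/manager_host")
MANAGER_PORT = 9000


def read_manager_host(path=MANAGER_HOST_FILE):
    """Return the first line of the file that is not blank or a '#' comment."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return line
    raise ValueError(f"no host name found in {path}")


def connect_to_manager(path=MANAGER_HOST_FILE, port=MANAGER_PORT, timeout=5.0):
    """Look up the current manager host, then open a TCP connection to it."""
    # The file is read on every call, so a relocated queue_manager is
    # picked up the next time a daemon tries to connect.
    host = read_manager_host(path)
    return socket.create_connection((host, port), timeout=timeout)
```

The timeout is there so that a daemon does not hang for long on a dead
host; its value here is arbitrary.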

Someone also brought up the idea of a back-up queue_manager.  There
would be a master queue_manager and a slave queue_manager.  The slave
queue_manager has all the information that the master queue_manager has,
so if the slave ever detects that the master has failed, it will take
over the master's role until the master starts back up again.  Is this
similar to what you have in mind?
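
For discussion, here is a small Python sketch of the lock-renewal
scheme from the message quoted below: the master renews a flag every
second, the flag is valid for ten seconds, and the slave takes over
once the flag has expired.  The class and method names are
hypothetical, and the lock lives in memory here only to keep the
example short; a real version would need shared storage with an atomic
update and a clock both machines agree on.

```python
import time

RENEW_INTERVAL = 1.0   # from the quoted message: flag stored every second
LEASE_SECONDS = 10.0   # from the quoted message: flag valid for ten seconds


class LeaseLock:
    """In-memory stand-in for a lock flag kept in shared storage."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock        # injectable so the logic can be tested
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, owner):
        """Take or renew the lock; return True if 'owner' now holds it."""
        now = self.clock()
        # The lock can be taken if nobody holds it, if the caller already
        # holds it (a renewal), or if the holder's lease has run out.
        if self.holder in (None, owner) or now >= self.expires_at:
            self.holder = owner
            self.expires_at = now + LEASE_SECONDS
            return True
        return False
```

The master would call try_acquire("master") every RENEW_INTERVAL
seconds.  The slave would call try_acquire("slave") on the same
schedule and start its own queue_manager the first time the call
returns True.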

Monica

Tavis Barr wrote:
> 
> The code has got some neat features; I just have one concern.  As it
> stands, the machine running queue_manager cannot be taken down for
> service.  With the old implementation, machines attempting to submit a
> job would eventually give up on a dead machine, so no machine was
> essential. The old code certainly wasn't perfect; if a machine was down
> or even busy, it could take quite a long time for a job to get
> submitted.  But it seemed like it could be improved upon by allowing the
> queue client package to fail out more quickly when trying to reach an
> unresponsive host.  (I understand this was in the works for the new
> version; I haven't had a chance to try it yet since it doesn't seem to
> work on Digital Unix.  Maybe if I have time I'll try to do some
> debugging.)
> 
> With the new implementation, as I understand it you can't take the machine
> running queue_manager out of service without reconfiguring the whole
> system.  This might be correctable by allowing for failover on the
> queue_manager daemon (e.g., sharing the process database, having
> queue_manager store a lock flag every second that is valid for ten seconds,
> and having queued on a secondary machine grab the lock and spawn its
> own queue_manager if the primary machine stops renewing its own locks or
> something).  I don't know if this would cause problems interacting with
> license managers.
> 
> Anyway, I don't want to disparage what looks overall like some great
> work; I'm just trying to suggest how it might be improved if I understand
> it correctly.
> 
> Cheers,
> Tavis