Thread: [Queue-developers] [patch] Same job launched on multiple hosts, unlink errors
From: Cyril B. <cyr...@ya...> - 2001-05-04 06:11:23
Hi,

I tried Queue with 2 machines, both running RH 6.2. I downloaded the
source from CVS on 05/01 and set it up in non-root mode. When I launched
a couple of jobs (with queue -i -w -n or qsh), I noticed some strange
behavior. A job starts on machine1 and runs for a while, but then
machine2 tries to run it as well (same cfmXXX in supervisor.log).
Machine2 then immediately stops running that job, but also removes the
efmXXX and CFDIR/cfmXXX files in the "now" queue directory. The job also
gets terminated on machine1 (signal 9). When machine1's queued daemon
then tries to remove the job's CFDIR/cfmXXX file, it's no longer there,
and I get errors like "Can't unlink cfmXXX".

I came up with the patch to queued.c below. In startjob() I check
whether there's already an efmXXX file for the job passed to startjob().
If there is, it means the job is already running on some host and we'd
better not start it again, so I set the job's pid accordingly and return
ALREADY_LOCKED.

Let me know if it's the right approach.

Regards,
Cyril
bo...@us...

Index: queued.c
===================================================================
RCS file: /cvsroot/queue/queue-development/queued.c,v
retrieving revision 1.46
diff -u -r1.46 queued.c
--- queued.c    2001/04/11 20:46:10    1.46
+++ queued.c    2001/05/04 02:02:45
@@ -3525,6 +3525,17 @@
   checkpoint = qp->q_checkpointmode;
   restart = NO_RESTART;

+  /* Check if there's already an "ef" file, meaning the job
+   * is already running on some host. borto 2001/05/03 */
+  sprintf(fname, "%s/e%s", qp->q_name, jp->j_cfname+1);
+  if(access(fname, F_OK)==0) {
+    mdebug1("queued queued.c startjob():\n"\
+            "\t%s is already running somewhere, skip it.\n",
+            jp->j_cfname);
+    jp->j_pid = ANOTHER_HOST;
+    return(ALREADY_LOCKED);
+  }
+
 #ifdef ENABLE_CHECKPOINT
   /*Migrator code. WGK 1999/3/6. If there's a corresponding mf file,
     only consider starting the job if we are allowed to restart jobs.*/
From: Mike C. <da...@ix...> - 2001-05-07 23:26:50
On Thu, May 03, 2001 at 11:10:10PM -0700, Cyril Bortolato wrote:
> I came up with the patch to queued.c below. In startjob() I check if
> there's already an efmXXX file for the job passed to startjob(). If
> there is, it means that job is already running on some host and we'd
> better not start it again. So I set the job's pid accordingly and

There is still a race condition here, unfortunately. The efm file could
still show up after you look for it but before you create it. You've
reduced the window, but not eliminated it.

One solution might be from the linux open(2) man page:

       O_EXCL When used with O_CREAT, if the file already exists it is
              an error and the open will fail. O_EXCL is broken on NFS
              file systems; programs which rely on it for performing
              locking tasks will contain a race condition. The solution
              for performing atomic file locking using a lockfile is to
              create a unique file on the same fs (e.g., incorporating
              hostname and pid), use link(2) to make a link to the
              lockfile. If link() returns 0, the lock is successful.
              Otherwise, use stat(2) on the unique file to check if its
              link count has increased to 2, in which case the lock is
              also successful.

We can't necessarily rely on flock working (not everyone has a working
lockd). We could build a lock protocol into queue, but there are so many
other things broken with queue right now it's not even funny. (I've
pretty much given up on queue for now and wrote a few cheesy shell
scripts that work much better.)

> return ALREADY_LOCKED.

mrc
--
Mike Castle                Life is like a clock: You can work constantly
da...@ix...                and be right all the time, or not work at all
www.netcom.com/~dalgoda/   and be right at least twice a day.  -- mrc
We are all of us living in the shadow of Manhattan.  -- Watchmen
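For reference, a minimal sketch of the man page's lockfile protocol
applied to Queue's "ef" files (untested; the helper name lock_job(), the
".lck" naming scheme, and the buffer sizes are illustrative assumptions,
not code from queued.c):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

/* Try to take the per-job lock atomically, even over NFS.
 * qdir is the queue directory, cfname a name like "cfmXXX".
 * Returns 0 if we got the lock, -1 if another host holds it. */
static int lock_job(const char *qdir, const char *cfname)
{
    char uniq[1024], lockf[1024], host[256];
    struct stat st;
    int fd;

    gethostname(host, sizeof(host));

    /* Unique file incorporating hostname and pid, created on the
     * same filesystem as the lockfile. */
    snprintf(uniq, sizeof(uniq), "%s/.lck.%s.%s.%ld",
             qdir, cfname, host, (long)getpid());
    /* The lockfile is the "efmXXX" file itself. */
    snprintf(lockf, sizeof(lockf), "%s/e%s", qdir, cfname + 1);

    fd = open(uniq, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* link(2) is atomic on the NFS server. */
    if (link(uniq, lockf) == 0) {
        unlink(uniq);
        return 0;
    }

    /* link() can fail spuriously over NFS when the server's reply is
     * lost; a link count of 2 on the unique file still means we won. */
    if (stat(uniq, &st) == 0 && st.st_nlink == 2) {
        unlink(uniq);
        return 0;
    }

    unlink(uniq);
    return -1;
}

startjob() would then create the "ef" file through something like
lock_job() instead of testing for it with access(), which closes the
check-then-create window.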
From: Christian P. <cp...@el...> - 2001-05-08 07:52:30
Hi Mike!

> ... but there are so
> many other things broken with queue right now it's not even funny. (I've
> pretty much given up on queue for now and wrote a few cheesy shell scripts
> that work much better.)

My first approach to load balancing was a simple ruptime request for the
load of several machines. The disadvantage was that the dead time of the
calculated load was too long; the balancing was only OK if I waited 5
minutes before starting the next job.

How do you determine the load?

Best regards,
Christian
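On sampling the load: one alternative to ruptime's gossiped figures is
to ask each host directly, e.g. with getloadavg(3) (a BSD extension also
available in glibc). A minimal sketch, purely illustrative and not code
from queued.c:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double loads[3]; /* 1-, 5- and 15-minute averages */

    if (getloadavg(loads, 3) < 0) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("load: %.2f (1m)  %.2f (5m)  %.2f (15m)\n",
           loads[0], loads[1], loads[2]);
    return 0;
}

A small daemon reporting this value on request would give the scheduler
a load figure at most seconds old, instead of the several-minute dead
time of ruptime broadcasts.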