Thread: [Queue-developers] [patch] Same job launched on multiple hosts, unlink errors
From: Cyril B. <cyr...@ya...> - 2001-05-04 06:11:23
Hi,

I tried Queue with 2 machines, both running RH 6.2. I downloaded the
source from CVS on 05/01 and set it up in non-root mode. When I launched
a couple of jobs (with queue -i -w -n or qsh), I noticed some strange
behavior. A job starts on machine1 and runs for a while, but then
machine2 tries to run it as well (same cfmXXX in supervisor.log).
Machine2 then immediately stops running that job, but also removes the
efmXXX and CFDIR/cfmXXX files in the "now" queue directory. The job also
gets terminated on machine1 (signal 9). When machine1's queued daemon
then tries to remove the job's CFDIR/cfmXXX file, it's no longer there,
and I get errors like "Can't unlink cfmXXX".

I came up with the patch to queued.c below. In startjob() I check
whether there's already an efmXXX file for the job passed to startjob().
If there is, it means the job is already running on some host and we'd
better not start it again, so I set the job's pid accordingly and return
ALREADY_LOCKED.

Let me know if it's the right approach.

Regards,
Cyril
bo...@us...

Index: queued.c
===================================================================
RCS file: /cvsroot/queue/queue-development/queued.c,v
retrieving revision 1.46
diff -u -r1.46 queued.c
--- queued.c    2001/04/11 20:46:10    1.46
+++ queued.c    2001/05/04 02:02:45
@@ -3525,6 +3525,17 @@
   checkpoint = qp->q_checkpointmode;
   restart = NO_RESTART;

+  /* Check if there's already an "ef" file, meaning the job
+   * is already running on some host. borto 2001/05/03 */
+  sprintf(fname, "%s/e%s", qp->q_name, jp->j_cfname+1);
+  if(access(fname, F_OK)==0) {
+    mdebug1("queued queued.c startjob():\n"\
+            "\t%s is already running somewhere, skip it.\n",
+            jp->j_cfname);
+    jp->j_pid = ANOTHER_HOST;
+    return(ALREADY_LOCKED);
+  }
+
 #ifdef ENABLE_CHECKPOINT
   /*Migrator code. WGK 1999/3/6. If there's a corresponding mf file,
     only consider starting the job if we are allowed to restart jobs.*/
From: Mike C. <da...@ix...> - 2001-05-07 23:26:50
On Thu, May 03, 2001 at 11:10:10PM -0700, Cyril Bortolato wrote:
> I came up with the patch to queued.c below. In startjob() I check if
> there's already an efmXXX file for the job passed to startjob(). If
> there is, it means that job is already running on some host and we'd
> better not start it again. So I set the job's pid accordingly and

There is still a race condition here, unfortunately. The efm file could
still show up after you look for it but before you create it. You've
reduced the window, but not eliminated it.

One solution might be from the linux open(2) man page:

       O_EXCL When used with O_CREAT, if the file already exists it is
              an error and the open will fail. O_EXCL is broken on NFS
              file systems; programs which rely on it for performing
              locking tasks will contain a race condition. The solution
              for performing atomic file locking using a lockfile is to
              create a unique file on the same fs (e.g., incorporating
              hostname and pid), use link(2) to make a link to the
              lockfile. If link() returns 0, the lock is successful.
              Otherwise, use stat(2) on the unique file to check if its
              link count has increased to 2, in which case the lock is
              also successful.

We can't necessarily rely on flock working (not everyone has a working
lockd). We could build a lock protocol into queue, but there are so many
other things broken with queue right now it's not even funny. (I've
pretty much given up on queue for now and wrote a few cheesy shell
scripts that work much better.)

> return ALREADY_LOCKED.

mrc
--
Mike Castle                Life is like a clock: You can work constantly
da...@ix...                and be right all the time, or not work at all
www.netcom.com/~dalgoda/   and be right at least twice a day.  -- mrc
We are all of us living in the shadow of Manhattan.  -- Watchmen
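For reference, a minimal sketch of the man page's lockfile protocol
applied to Queue's "ef" files (untested; the helper name lock_job(), the
".lck" naming scheme, and the buffer sizes are illustrative assumptions,
not code from queued.c):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

/* Try to take the per-job lock atomically, even over NFS.
 * qdir is the queue directory, cfname a name like "cfmXXX".
 * Returns 0 if we got the lock, -1 if another host holds it. */
static int lock_job(const char *qdir, const char *cfname)
{
    char uniq[1024], lockf[1024], host[256];
    struct stat st;
    int fd;

    gethostname(host, sizeof(host));

    /* Unique file incorporating hostname and pid, created on the
     * same filesystem as the lockfile. */
    snprintf(uniq, sizeof(uniq), "%s/.lck.%s.%s.%ld",
             qdir, cfname, host, (long)getpid());
    /* The lockfile is the "efmXXX" file itself. */
    snprintf(lockf, sizeof(lockf), "%s/e%s", qdir, cfname + 1);

    fd = open(uniq, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* link(2) is atomic on the NFS server. */
    if (link(uniq, lockf) == 0) {
        unlink(uniq);
        return 0;
    }

    /* link() can fail spuriously over NFS when the server's reply is
     * lost; a link count of 2 on the unique file still means we won. */
    if (stat(uniq, &st) == 0 && st.st_nlink == 2) {
        unlink(uniq);
        return 0;
    }

    unlink(uniq);
    return -1;
}

startjob() would then create the "ef" file through something like
lock_job() instead of testing for it with access(), which closes the
check-then-create window.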
From: Christian P. <cp...@el...> - 2001-05-08 07:52:30
Hi Mike!

> ... but there are so
> many other things broken with queue right now it's not even funny. (I've
> pretty much given up on queue for now and wrote a few cheesy shell scripts
> that work much better.)

My first approach to load balancing was a simple ruptime request for the
load of several machines. The disadvantage was that the dead time of the
calculated load was too long; the balancing was only OK if I waited 5
minutes before starting the next job.

How do you determine the load?

Best regards,
Christian
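On sampling the load: one alternative to ruptime's gossiped figures is
to ask each host directly, e.g. with getloadavg(3) (a BSD extension also
available in glibc). A minimal sketch, purely illustrative and not code
from queued.c:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double loads[3]; /* 1-, 5- and 15-minute averages */

    if (getloadavg(loads, 3) < 0) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("load: %.2f (1m)  %.2f (5m)  %.2f (15m)\n",
           loads[0], loads[1], loads[2]);
    return 0;
}

A small daemon reporting this value on request would give the scheduler
a load figure at most seconds old, instead of the several-minute dead
time of ruptime broadcasts.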