From: Malcolm T. <mt...@wu...> - 2009-11-04 15:41:22
I'm running Opal 2.0 and using the DRMAAJobManager to interface with Sun Grid Engine (SGE). I've got opal.hard_limit set to 3600 seconds:

    # specify in seconds the hard limit for how long a job can run
    # only applicable if either DRMAA or Globus is being used, and if
    # the scheduler supports it
    opal.hard_limit=3600

but I occasionally encounter 'runaway' jobs that are never killed:

    [mtobias@sccne ~]$ qstat -u opal
    job-ID prior name user state submit/start at queue slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
    3058 0.55500 pdb2pqr.py opal r 11/02/2009 14:34:19 all.q@compute-0-10.local 1

I've looked in the $TOMCAT/webapps/ROOT directory where the job data is stored, but I don't see any file that looks like a batch script I could examine to see whether a CPU time limit is actually being set. I've also looked at the compute node the job is running on and examined the 'trace' file, which appears to be where SGE sets up the job. It seems to be setting some ridiculous limits:

    11/02/2009 14:34:19 [400:22267]: setting limits
    11/02/2009 14:34:19 [400:22267]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)

Any ideas on what might be going wrong or how to debug this further?

Malcolm
--
Malcolm Tobias
314.362.1594
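P.S. A quick sanity check on that "ridiculous" limit value (a minimal sketch; reading it as RLIM_INFINITY is my interpretation, not something from the Opal docs):

```python
# The soft/hard value SGE logged in the 'trace' file on the compute node.
logged = 18446744073709551615

# On a 64-bit host this is the maximum unsigned 64-bit integer, which is
# the conventional encoding of RLIM_INFINITY, i.e. "no limit at all".
print(logged == 2**64 - 1)  # prints: True
print(hex(logged))          # prints: 0xffffffffffffffff
```

If that reading is right, SGE is applying no CPU limit to the job, which would suggest opal.hard_limit is never reaching the scheduler in the first place rather than being set to a bad value.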