From: Malcolm T. <mt...@wu...> - 2009-11-04 15:41:22
I'm running Opal 2.0 and using the DRMAAJobManager to interface with Sun Grid Engine (SGE). I've got opal.hard_limit set to 3600 seconds:

    # specify in seconds the hard limit for how long a job can run
    # only applicable if either DRMAA or Globus is being used, and if
    # the scheduler supports it
    opal.hard_limit=3600

but I occasionally encounter 'runaway' jobs that are never killed:

    [mtobias@sccne ~]$ qstat -u opal
    job-ID prior name user state submit/start at queue slots ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
    3058 0.55500 pdb2pqr.py opal r 11/02/2009 14:34:19 all.q@compute-0-10.local 1

I've looked in the $TOMCAT/webapps/ROOT directory where the job data is stored, but I don't see any file that looks like a batch script I could examine to see whether a CPU time limit is actually being set. I've also looked at the compute node the job is running on and examined the 'trace' file, which appears to be where SGE sets up the job. It seems to be setting some ridiculous limits:

    11/02/2009 14:34:19 [400:22267]: setting limits
    11/02/2009 14:34:19 [400:22267]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)

Any ideas on what might be going wrong or how to debug this further?

Malcolm
--
Malcolm Tobias
314.362.1594
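P.S. A quick sanity check on that "ridiculous" limit value (a minimal sketch; reading it as RLIM_INFINITY is my interpretation, not something from the Opal docs):

```python
# The soft/hard value SGE logged in the 'trace' file on the compute node.
logged = 18446744073709551615

# On a 64-bit host this is the maximum unsigned 64-bit integer, which is
# the conventional encoding of RLIM_INFINITY, i.e. "no limit at all".
print(logged == 2**64 - 1)  # prints: True
print(hex(logged))          # prints: 0xffffffffffffffff
```

If that reading is right, SGE is applying no CPU limit to the job, which would suggest opal.hard_limit is never reaching the scheduler in the first place rather than being set to a bad value.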