Jobs are not removed from LRMS on MiG job timeout if not using X-execution-leader

A grid middleware with minimal user and resource requirements

Brought to you by: jonasbardino, patchofscotland, ras, rehr

#64 Jobs are not removed from LRMS on MiG job timeout if not using X-execution-leader

Status: Accepted

Owner: nobody

Labels: Usability (20)

Priority: Medium

Component:

OpSys:

Type: Defect

Updated: 2011-03-19

Created: 2011-03-19

Creator: Anonymous

Private: No

Originally created by: jonas.ba...@gmail.com
Originally owned by: jonas.ba...@gmail.com

What steps will reproduce the problem?
1. Setting an LRMS resource to either of the non-X-execution-leader LRMS types
2. Having a real job time out

What is the expected output? What do you see instead?
The job may remain in the LRMS (seen with PBS) when the node is restarted after a MiG job timeout. AFAICT we handle the situation correctly in the X-execution-leader case, where cleanup takes place as part of the stop call during exe restart:
dummy_node_script.sh stop
In the non-leader case the default stop action is a raw kill, so the job is never removed from the LRMS.

Please use labels and text to provide additional information.
We should implement a similar stop command in master_node_script and use it during restart.
Please refer to the 'Sending multiple bulk jobs to a PBS resource' thread on http://groups.google.com/group/migrid for the background details.
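
A rough sketch of what such a stop action could look like, to make the proposal concrete. This is not taken from the MiG code base: the JOB_ID_FILE and LRMS_REMOVE_CMD variables and the stop_lrms_job function are hypothetical names, and the sketch assumes the submit step has recorded the LRMS job id in a file. qdel is the standard PBS command for removing a job; other LRMS types would need their own removal command.

```shell
#!/bin/sh
# Hypothetical sketch only: these names are NOT from the MiG code base.
# JOB_ID_FILE:     file where the submit step stored the LRMS job id (assumed).
# LRMS_REMOVE_CMD: command that removes a job from the LRMS (qdel for PBS).
JOB_ID_FILE="${JOB_ID_FILE:-./lrms_job.id}"
LRMS_REMOVE_CMD="${LRMS_REMOVE_CMD:-qdel}"

stop_lrms_job() {
    # Nothing to clean up if no job id was recorded.
    [ -s "$JOB_ID_FILE" ] || return 0
    job_id=$(cat "$JOB_ID_FILE")
    # Ask the LRMS to remove the job. Keep the id file on failure
    # so that a later stop call can retry the removal.
    if $LRMS_REMOVE_CMD "$job_id"; then
        rm -f "$JOB_ID_FILE"
    else
        echo "could not remove LRMS job $job_id" >&2
        return 1
    fi
}

# Only act when called with the stop argument, mirroring
# 'dummy_node_script.sh stop' in the leader case.
case "${1:-}" in
    stop) stop_lrms_job ;;
esac
```

The restart code would then call this stop action before falling back to the raw kill, so the LRMS job is removed even when the node script itself is killed afterwards.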
