Originally created by: jonas.ba...@gmail.com
Originally owned by: jonas.ba...@gmail.com
What steps will reproduce the problem?
1. Set an LRMS resource to either of the non-execution-leader LRMS types
2. Let a real job time out
What is the expected output? What do you see instead?
The job may remain in the LRMS (seen with PBS) when the node is restarted after a MiG job timeout. AFAICT we handle the situation correctly in the X-execution-leader case, where cleanup takes place as part of the stop call during exe restart:
dummy_node_script.sh stop
In the non-leader case the default stop action is a raw kill, so the job is never removed from the LRMS.
Please use labels and text to provide additional information.
We should implement a similar stop command in master_node_script and use it during restart.
Please refer to the 'Sending multiple bulk jobs to a PBS resource' thread on http://groups.google.com/group/migrid for the background details.
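
A rough sketch of what such a stop action could look like, assuming PBS and assuming the node script records the LRMS job ID and its own PID in files. The function name, both file arguments and the LRMS_REMOVE_CMD variable are hypothetical and not taken from the existing scripts; qdel is the standard PBS job removal command, and other LRMS types would need their own equivalent.

```shell
#!/bin/sh
# Hypothetical stop action: remove the job from the LRMS first, then stop the
# node script, instead of relying on a raw kill alone.
# LRMS_REMOVE_CMD and the two file arguments are assumptions for illustration.

LRMS_REMOVE_CMD="${LRMS_REMOVE_CMD:-qdel}"

stop_node() {
    job_id_file="$1"   # file holding the LRMS job ID (assumed layout)
    pid_file="$2"      # file holding the node script PID (assumed layout)

    # Remove the queued or running job from the LRMS, if one was recorded.
    if [ -s "$job_id_file" ]; then
        job_id=$(cat "$job_id_file")
        $LRMS_REMOVE_CMD "$job_id" || echo "warning: could not remove job $job_id" >&2
        rm -f "$job_id_file"
    fi

    # Only then do what the raw kill did before: stop the script process.
    if [ -s "$pid_file" ]; then
        kill "$(cat "$pid_file")" 2>/dev/null || true
        rm -f "$pid_file"
    fi
}
```

The restart path would then call something like `stop_node job.id node.pid` before starting the node again, so a timed-out job does not linger in the queue.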