From: hopesophite <hop...@gm...> - 2008-08-29 14:58:17
Hi all,

A discussion about job rerun has been added to the SourceForge forum. Here is the URL:
http://sourceforge.net/forum/message.php?msg_id=5206076

The content is as follows.

========================================
Here is a quick introduction to rerun. I would like to hear your opinions.

What’s rerun?
Job rerun is the re-execution of a previous job, within the same working directory, starting from a specified stage.

Why do we need rerun?
The job re-run feature request explains why:
http://sourceforge.net/tracker/index.php?func=detail&aid=2007784&group_id=141177&atid=748735

“In GridSAM, when something wrong happens to a job in the processing pipeline, then the job state will be immediately advanced to “Failed”, and the only way to retry is to resubmit the same job, then a new working directory will be generated. For example, after 2 days of hard computing, your job finally output a lot of result files, and GridSAM is ready to stage them out. Unfortunately, the FTP URL you specified for staging out is wrong. Then GridSAM will advance your job state to “Failed”, and you will never get the output results which have been successfully generated.”

What should be done to add this new feature?
1) A client command, gridsam-rerun, to rerun a job. It takes three parameters:
   parentJobID: the ID of the job to be rerun; this job is called the parent job.
   startJobState: the state from which the rerun should start. It can be pending, staged-in or executed.
   JSDL: the new JSDL for this rerun. If omitted, the JSDL of the parent job is used.
   On success, it starts a new job and returns its jobID.
2) The gridsam-rerun command passes the rerun request to the GridSAM server.
3) DRMConnector should support rerun.

What’s our solution?
1) This is simple and needs no discussion.
2) A new web service operation called rerun should be added. Adding such an operation is not easy, and a “new” client talking to an “old” server will fail to find the operation.
3) We add several new ProgressCriteria that decide whether a given DRMConnector should be executed or skipped for a rerun job. Any DRMConnector that wants to support rerun MUST use the newly provided ProgressCriteria.

What’s the current status and plan for rerun?
An initial version of rerun is now being tested with PBSDRMConnectors. This feature will be released with GridSAM 2.0.3.
============================================================

hopesophite
2008-08-29
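
As an illustration of point 3) above, the skip-or-execute decision for a rerun could be sketched roughly as follows. This is a minimal Python sketch, not GridSAM code: the three-stage pipeline, the stage names, the "staged-out" state and the functions should_skip and stages_to_run are all assumptions made for the example. Only the three start states (pending, staged-in, executed) come from the post.

```python
# Hypothetical sketch; names are illustrative and not GridSAM's API.

# States a rerun may start from, in pipeline order (taken from the post).
START_STATES = ["pending", "staged-in", "executed"]

# Simplified pipeline: each stage paired with the state it leaves the job in.
# The stage names and "staged-out" are assumptions made for this example.
PIPELINE = [
    ("stage-in", "staged-in"),
    ("execute", "executed"),
    ("stage-out", "staged-out"),
]


def should_skip(resulting_state, start_state):
    """Return True if the parent job already reached `resulting_state`."""
    if start_state not in START_STATES:
        raise ValueError(f"unsupported start state: {start_state!r}")
    if resulting_state not in START_STATES:
        # States beyond "executed" can never be a start state, so the stage
        # that produces them always runs.
        return False
    return START_STATES.index(resulting_state) <= START_STATES.index(start_state)


def stages_to_run(start_state):
    """List the stages a rerun would execute for the given start state."""
    return [stage for stage, result in PIPELINE
            if not should_skip(result, start_state)]
```

Under these assumptions, the wrong stage-out URL case from the feature request would be handled by a rerun starting from executed: stages_to_run("executed") lists only the stage-out step, so the two days of computation are not repeated.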