From: <bac...@li...> - 2007-11-27 20:00:46
A NOTE has been added to this issue.
======================================================================
http://bugs.bacula.org/view.php?id=1012
======================================================================
Reported By:                asimon
Assigned To:
======================================================================
Project:                    bacula
Issue ID:                   1012
Category:                   Director
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     feedback
======================================================================
Date Submitted:             11-19-2007 08:14 UTC
Last Modified:              11-27-2007 20:00 UTC
======================================================================
Summary:                    Jobs with "Max Run Time" canceled whereas not already started
Description:
=== description of the problem

When I start 2 jobs at the same time:
- as I use only one "File" storage, the second job waits for the first to be completed (this is normal)
- on this second job, I have set "Max Run Time = 2 minutes"
- the first job takes more than 2 minutes (no problem for it, because it has no "Max Run Time")
- **the problem is** that while the first job is running, the second is canceled with the message "Fatal error: Max run time exceeded. Job canceled." although it was not running (it was still waiting for the first to be completed).

=== explanation and workaround

After reading some of the C code, it seems that the function job_check_maxruntime in src/dird/job.c does not check whether the job is running or not. Since "jcr->start_time" is set when the job is created, the test

   if ((watchdog_time - jcr->start_time) < jcr->job->MaxRunTime) {
      Dmsg3(200, "Job %p (%s) with MaxRunTime %d not expired\n",
            jcr, jcr->Job, jcr->job->MaxRunTime);
      return false;
   }

fails even though the job was not running.
I am thinking of a patch to this function like:

--- src/dird/job.c.orig 2007-11-14 14:45:08.000000000 +0100
+++ src/dird/job.c      2007-11-14 14:46:31.000000000 +0100
@@ -556,7 +556,7 @@
  */
 static bool job_check_maxruntime(JCR *control_jcr, JCR *jcr)
 {
-   if (jcr->job->MaxRunTime == 0 || job_canceled(jcr)) {
+   if (jcr->job->MaxRunTime == 0 || job_canceled(jcr) || jcr->JobStatus != JS_Running) {
       return false;
    }
    if ((watchdog_time - jcr->start_time) < jcr->job->MaxRunTime) {

I have seen a bug that matches this problem (http://bugs.bacula.org/view.php?id=621), but it seems to be marked "resolved" for a reason that I don't understand, whereas the problem is still present.
======================================================================

----------------------------------------------------------------------
 kern - 11-19-07 19:14
----------------------------------------------------------------------
I haven't looked at the code in detail, so I can believe that there is a problem here. However, your proposed solution is not sufficient. A simple test on JS_Running is incorrect because once the job starts, it has *many* possible status codes (connecting to FD, waiting on resources, ...).

I'd be happy to see it fixed, but you will need to find a different way to ensure that the job has been started.

----------------------------------------------------------------------
 kern - 11-19-07 19:21
----------------------------------------------------------------------
It might be possible to simply test jcr->JobStatus != JS_Created to know when a job is "running". In the sense of Max Run Time, "running" means once the job has been scheduled and is waiting to be run or running.

----------------------------------------------------------------------
 asimon - 11-26-07 13:52
----------------------------------------------------------------------
Dear kern,

I understand your last note. Would you like me to test with "jcr->JobStatus != JS_Created" in the code?
I have not looked at the code in depth, so I hope that JS_Created is not set when the job is scheduled, because in that case the problem would be the same.

Please tell me how I can help with this problem.

----------------------------------------------------------------------
 kern - 11-26-07 22:46
----------------------------------------------------------------------
Please try applying the patch that I have attached to this bug report. The instructions are at the top of the patch file. Please report back.

----------------------------------------------------------------------
 asimon - 11-27-07 12:57
----------------------------------------------------------------------
kern,

I have tried the attached patch with bacula version 2.2.6. All is OK!

I used the same scenario as in the description, and the second job is no longer canceled while it has not yet started.

Of course "Max Run Time" still works: if a job runs for longer than "Max Run Time", it is canceled (the time is counted from when the job is started).

Would you like me to test "Max Run Time" in other scenarios (like jobs waiting on max concurrent jobs, ...)? Please let me know.

----------------------------------------------------------------------
 kern - 11-27-07 20:00
----------------------------------------------------------------------
Thanks for the feedback. Thanks also for the offer to test, which I would never refuse, so yes, please do test any other cases you think might fail.
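
For readers following the thread, the three checks discussed above (elapsed time only, the reporter's JS_Running test, and the JS_Created test suggested in the 19:21 note) can be compared in a small standalone sketch. This is not Bacula code and it does not reproduce the attached 2.2.6-maxruntime.patch: the `sketch_*` types, the status names, and the three `expired_*` functions are hypothetical stand-ins, and the sketch assumes, as the report states, that the start time is recorded when the job is created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-ins for job status codes (not Bacula's). */
typedef enum {
    SKETCH_JS_CREATED,       /* job object exists but has not been started */
    SKETCH_JS_CONNECTING_FD, /* started, but not yet in the "running" state */
    SKETCH_JS_RUNNING        /* started and running */
} sketch_status;

typedef struct {
    sketch_status status;
    time_t start_time;   /* assumed to be set at creation, as in the report */
    time_t max_run_time; /* seconds; 0 disables the limit */
} sketch_job;

/* Shared elapsed-time test: true once the limit has been reached. */
static bool limit_reached(const sketch_job *job, time_t now)
{
    assert(job != NULL);
    return (now - job->start_time) >= job->max_run_time;
}

/* Behaviour described in the report: the job status is never consulted. */
static bool expired_original(const sketch_job *job, time_t now)
{
    if (job->max_run_time == 0) {
        return false;
    }
    return limit_reached(job, now);
}

/* Reporter's proposal: skip every job whose status is not "running". */
static bool expired_running_only(const sketch_job *job, time_t now)
{
    if (job->max_run_time == 0 || job->status != SKETCH_JS_RUNNING) {
        return false;
    }
    return limit_reached(job, now);
}

/* Suggestion from the 19:21 note: skip only jobs still in "created". */
static bool expired_not_created(const sketch_job *job, time_t now)
{
    if (job->max_run_time == 0 || job->status == SKETCH_JS_CREATED) {
        return false;
    }
    return limit_reached(job, now);
}
```

With a 120-second limit checked at 300 seconds, a job still in the created state is reported as expired by `expired_original` but not by `expired_not_created`. A job in a started-but-not-running state is never expired by `expired_running_only`, which is the objection raised in the 19:14 note, while `expired_not_created` does expire it.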

Issue History
Date Modified   Username       Field                    Change
======================================================================
11-19-07 08:14  asimon         New Issue
11-19-07 10:15  asimon         Issue Monitored: asimon
11-19-07 10:15  asimon         Issue End Monitor: asimon
11-19-07 19:14  kern           Note Added: 0002956
11-19-07 19:14  kern           Status                   new => feedback
11-19-07 19:21  kern           Note Added: 0002957
11-26-07 13:52  asimon         Note Added: 0002992
11-26-07 22:45  kern           File Added: 2.2.6-maxruntime.patch
11-26-07 22:46  kern           Note Added: 0002993
11-27-07 12:57  asimon         Note Added: 0002996
11-27-07 20:00  kern           Note Added: 0002997
======================================================================