From: <bac...@li...> - 2006-05-24 12:40:12
|
The following bug has been RESOLVED. ====================================================================== http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000570 ====================================================================== Reported By: kodis Assigned To: ====================================================================== Project: bacula Bug ID: 570 Category: Director Reproducibility: always Severity: major Priority: normal Status: resolved Resolution: fixed ====================================================================== Date Submitted: 03-20-2006 08:44 PST Last Modified: 05-24-2006 05:40 PDT ====================================================================== Summary: Restarted job never completes Description: I set up a job with the "reschedule on error = yes" option, started a job, and forced the job to initially fail by disconnecting the client from the net. After ten minutes, the job gets rescheduled and begins running, but never completes. Half an hour after starting what would be a three minute backup job, the response from a status command shows the client and storage daemons looking in good shape, but the director status seems confused. The report of running jobs shows: Running Jobs: JobId Level Name Status ====================================================================== 3193 Increme royalscam.2006-03-20_10.30.00 has terminated An attempt to rerun the same job at this point gets put on hold: 3194 Increme royalscam.2006-03-20_11.18.28 is waiting execution Cancelling the initial job seems to succeed, but that isn't enough to get backups rolling again. I have to restart the director to recover from the error condition. ====================================================================== ---------------------------------------------------------------------- kern - 03-20-2006 10:09 PST ---------------------------------------------------------------------- I am unable to duplicate this. For me after the job initially fails then restarts, it runs with no problems. In order to get any further with this problem, you will need to provide some way I can duplicate the problem and many more details of what is going on -- e.g. exactly why the job is waiting. It is possible that in some cases the old job will not be completely flushed out of the SD for a minimum of 30 minutes, so any test where the old job is "stuck" needs to wait a minimum of 30 minutes. ---------------------------------------------------------------------- kodis - 03-21-2006 08:13 PST ---------------------------------------------------------------------- I've turned on debugging output, and the first indication I see that something has gone wrong is when the director writes: rubble-dir: backup.c:286 FDStatus=T Things on the file daemon side look okay. There are a series of heartbeat messages, and what looks like a normal completion: fd: heartbeat.c:82 wait_intr=0 stop=0 fd: backup.c:111 end blast_data ok=1 fd: job.c:208 Quit command loop. Cancelled=0 fd: job.c:289 Calling term_find_files fd: job.c:292 Done with term_find_files fd: job.c:294 Done with free_jcr I'll take a look in backup.c around line 286. It doesn't seem like the director should be getting an FDStatus of T under these conditions. ---------------------------------------------------------------------- enrique - 03-23-2006 04:14 PST ---------------------------------------------------------------------- I can duplicate it: Running Jobs: JobId Level Name Status ====================================================================== 412 Differe Pc-Fernando-Win.2006-03-23_04.10.00 has terminated 414 Full Pc-Silvia-Win.2006-03-23_04.10.02 is waiting for its start time 417 Increme Pcjose1.2006-03-23_04.10.05 has terminated 418 Increme Andrea.2006-03-23_08.00.00 has terminated 419 Increme Mendiaz.2006-03-23_08.00.01 has terminated ==== running beta6. The similar "has a fatal error" issue of bug 557 has been resolved on beta6. On 1.38.5 I couldn't duplicate this bug. I remember, I turned on the computer before last attempt, ths status was OK and was listed at "Terminated jobs". No problem. Right, Kern, you can see it on the mail I sent you about bug 557. Im not sure about the way to duplicate. More over, I think it worked right yesterday without configuration change, but Im not sure. ---------------------------------------------------------------------- kern - 03-29-2006 02:31 PST ---------------------------------------------------------------------- I see nothing wrong in the debug output below. From what you guys have written, I am a bit confused about whether or not the bug still exists. Please note that under certain conditions, it can take 30 minutes for the failed job to completely disappear from Bacula status output. Could you please upgrade to 1.38.6, where I have made some fixes that should resolve some race conditions. Then let me know whether or not the bug still exists, and if it does, could you please try to describe how to reproduce it. ---------------------------------------------------------------------- enrique - 03-30-2006 03:49 PST ---------------------------------------------------------------------- I'm using beta6 of 1.38.6. As I said on the other note, it didnt happen on 1.38.5. A client configuration example is: Rerun Failed Levels = yes Reschedule on Error = yes Reschedule Interval = 2 hours Reschedule Times = 8 Run Before Job = "nc -w5 -z pc-fernando 9102" Write Bootstrap = "/var/bootstraps/pc-fernando-fd.bsr" and after few failed attempts at night (poweroff, Run Before Job exit status -1) the computer was turned on and backup was done. No mail is sent and it is listed at "Running Jobs" instead of Terminated. Running Jobs: JobId Level Name Status ====================================================================== 652 Differe Pc-Fernando-Win.2006-03-30_04.10.00 has terminated 654 Full Pc-Silvia-Win.2006-03-30_04.10.02 is waiting for its start time 656 Differe pc-dsp.2006-03-30_04.10.04 is waiting for its start time 658 Increme Pcjose1.2006-03-30_04.10.06 is waiting for its start time 659 Increme Andrea.2006-03-30_08.00.00 has terminated 660 Increme Mendiaz.2006-03-30_08.00.01 has terminated 661 Increme Pc-roman.2006-03-30_08.00.02 has terminated ==== This bug is the opposite of 557, i mean: * on 557, if I turned on the computer, the job was done, mail was sent and It was listed at "Terminated Jobs". But if the LAST attempt also failed, it was listed at "Running Jobs" without mail sent. This was fixed at beta6 * on this bug, mail is not sent if It has OK status. 557 failed on ERROR status and this bug on OK status. Moreover, after some "has terminated" I have to restart bacula daemons to make It work. If I run more jobs manually it works, but If I wait until next morning, all night jobs are still waiting. ---------------------------------------------------------------------- kern - 03-30-2006 04:25 PST ---------------------------------------------------------------------- Please re-read my last bug note. I'm not very interested in what happened in 1.38.5 nor am I interested in what happened in 1.38.6 beta6. Both those things only complicate the problem as the code has changed. I need to know if there is still a problem in the released version of 1.38.6, and if there is a problem, the details of the current problem, particularly some *simple* steps to reproduce it, but without the previous history of older versions. ---------------------------------------------------------------------- enrique - 03-30-2006 07:56 PST ---------------------------------------------------------------------- Ive just installed new version. With the same configuration files, its not fixed: *status dir neptuno-dir Version: 1.38.6 (28 March 2006) x86_64-unknown-linux-gnu debian testing/unstable Daemon started 30-mar-06 15:37, 22 Jobs run since started. [...] Running Jobs: JobId Level Name Status ====================================================================== 663 Increme Sunipx1.2006-03-30_15.38.49 has terminated 667 Catalog Urbion-Verify.2006-03-30_15.38.53 is waiting for higher priority jobs to finish 678 Increme Pc-Fernando-Win.2006-03-30_15.39.04 is waiting for its start time 681 Full Pc-Silvia-Win.2006-03-30_15.39.07 is waiting for its start time 684 Differe pc-dsp.2006-03-30_15.39.10 is waiting for its start time 686 Increme Pcjose1.2006-03-30_15.39.12 is waiting for its start time 688 Increme Mendiaz.2006-03-30_15.39.14 is waiting for its start time ==== [...] ---------------------------------------------------------------------- enrique - 03-30-2006 08:05 PST ---------------------------------------------------------------------- The way to reproduce the last bug note: 1. stop my file daemon 2. run all jobs. approx 27. My job is the first one. 3. when third job is running (for example), I start my file daemon 4. wait till the end. ---------------------------------------------------------------------- kodis - 04-06-2006 12:59 PDT ---------------------------------------------------------------------- Unfortunately, I'm still seeing this problem. I've built and installed version 1.38.6 as a server on an AMD64 platform and as a client on both a PPC Mac running the current release of OS X and on an i386 PC running FC5, compiling with the "-O0" option, just to be sure that this isn't related to an optimization problem. As in previous tests, I've configured the job to reschedule on error after waiting 5 minutes, and forced the initial attempt to fail by disconnecting the client from the network. I've started the director with a -d200 debug level, and see the first attempt fail, get rescheduled, run to what seems like completion, and see the message: rumba-dir: backup.c:286 FDStatus=T as the last message written out by the director. Even after waiting several hours, there is no additional output, and a bconsole "status director" reports that the job has terminated. Both the Mac and Fedora clients display this behavior. Sadly, I have not idea what I'm doing that causes this to fail for me while it works perfectly for you. If there are any tests that I can run or any other information that I can provide to help sort this out I'll be happy to do so, or if you have any debugging hints I'd be grateful for the advice. ---------------------------------------------------------------------- enrique - 05-22-2006 07:15 PDT ---------------------------------------------------------------------- is this bug fixed in 1.38.9? ---------------------------------------------------------------------- jaimeventura - 05-23-2006 03:53 PDT ---------------------------------------------------------------------- Seems like the problem remains in 1.38.9. Here is my case: Im using 1.38.9 on a x86_64 rhel. I've rebuild from src.rpm usin the following command: rpmbuild --rebuild --define "build_rhel4 1" --define "build_mysql4 1" --define "build_x86_64 1" bacula-1.38.9-2.src.rpm A simple way to reproduce the bug: 1)From a clean install change passwords, a test connections between dir, sd and fd(all working on the same machine). 2)change Job for "Client1", adding the lines: Reschedule On Error = yes Reschedule Interval = 2 minutes Reschedule Times = 10 3) (re)start the dir with -d400, STOP the fd(so we can cause a connection error) 4) on the console run command "run Client1". type "yes" 5) quick result: a) job will be reported as "started" b) job will be reported as "finished/failed" c) job will be rescheduled d) job will wait execution e) job will be reported as "started" f) job will freese. In fact, job will fail, but dir waits for "message thread termination"... for ever. And so, that job will never be marked/reported as finished/failed Why? i'm trying to figure out, thought im not very experienced with threads. Developers: as you might seen, dir waits for "message thread termination" on file msgchan.c:368 I will try to upload my test configuration files and the debug result. I understant my note not doesnt refer exactly to the reported bug, but I beleive the issue causing problem is the same. I hope it helps, because im really need this working. If is there any other way I can help, please tell me. Thanks ---------------------------------------------------------------------- jaimeventura - 05-23-2006 12:43 PDT ---------------------------------------------------------------------- Hello Again After a bit of debug I've noticed that in msgchan.c:368 (pthread_cond_timedwait(&jcr->term_wait, &mutex, &timeout);) the termination condition variable(&jcr->term_wait) wasnt initialized(jcr->term_wait_inited=false) whenever the job has failed and was rescheduled. In fact, a "partial cleanup old stuff"(dird_free_jcr_pointers(jcr);) was cleaning the termination condition variable(jobq.c:483). And so the thread would be waiting for it on msgchan.c:368. Im not shure if i've made the correct changes, but the fact is that it works now. Besides that, I noticed that when the failed job write some data to sd, a new job would be created. Some values are copied, but not then jcr->sched_time. So, in this case, the job would be rescheduled to run immediatly. I believe this was not what was supposed to happen. Im uploading a patch (bacula-1.38.9-reschedule.patch) Please, send me some feedback on this. Thanks. ---------------------------------------------------------------------- kern - 05-24-2006 05:40 PDT ---------------------------------------------------------------------- I have never been able to reproduce this exact bug, so *many* thanks for your patch. It looks perfectly good to me and in any case, I cannot see that it will do any harm. I've uploaded your patch as I applied it (minor spacing difference). This patch, however, also has several fixes to ensure that the JobStatus of the restarted job is correctly set. I would appreciate it if the other users having this same problem would try my patch and report back if it corrects your problem. Bug History Date Modified Username Field Change ====================================================================== 03-20-06 08:44 kodis New Bug 03-20-06 10:09 kern Bugnote Added: 0001566 03-20-06 10:09 kern Status new => feedback 03-21-06 08:13 kodis Bugnote Added: 0001570 03-23-06 04:14 enrique Bugnote Added: 0001581 03-29-06 02:31 kern Bugnote Added: 0001601 03-30-06 03:49 enrique Bugnote Added: 0001606 03-30-06 04:25 kern Bugnote Added: 0001607 03-30-06 07:56 enrique Bugnote Added: 0001608 03-30-06 08:05 enrique Bugnote Added: 0001609 04-06-06 12:59 kodis Bugnote Added: 0001628 05-22-06 07:15 enrique Bugnote Added: 0001683 05-23-06 03:53 jaimeventura Bugnote Added: 0001684 05-23-06 03:56 jaimeventura File Added: ConsoleResult-BugNote0001684 05-23-06 03:57 jaimeventura File Added: debugDump-BugNote0001684 05-23-06 12:43 jaimeventura Bugnote Added: 0001685 05-23-06 12:44 jaimeventura File Added: bacula-1.38.9-reschedule.patch 05-24-06 05:37 kern File Added: 1.38.9-jobq.patch 05-24-06 05:40 kern Bugnote Added: 0001687 05-24-06 05:40 kern Resolution open => fixed 05-24-06 05:40 kern Status feedback => resolved ====================================================================== |