[Bacula-bugs] [bacula 0000570]: Restarted job never completes

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

The following bug has been RESOLVED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000570
======================================================================
Reported By:                kodis
Assigned To:                
======================================================================
Project:                    bacula
Bug ID:                     570
Category:                   Director
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     resolved
Resolution:                 fixed
======================================================================
Date Submitted:             03-20-2006 08:44 PST
Last Modified:              05-24-2006 05:40 PDT
======================================================================
Summary:                    Restarted job never completes
Description: 
I set up a job with the "reschedule on error = yes" option, started a job,
and forced the job to initially fail by disconnecting the client from the
net.  After ten minutes, the job gets rescheduled and begins running, but
never completes.  

Half an hour after starting what would be a three minute backup job, the
response from a status command shows the client and storage daemons
looking in good shape, but the director status seems confused.  The report
of running jobs shows:

Running Jobs:
 JobId Level   Name                       Status
======================================================================
  3193 Increme  royalscam.2006-03-20_10.30.00 has terminated

An attempt to rerun the same job at this point gets put on hold:

  3194 Increme  royalscam.2006-03-20_11.18.28 is waiting execution

Cancelling the initial job seems to succeed, but that isn't enough to get
backups rolling again.  I have to restart the director to recover from the
error condition.
======================================================================

----------------------------------------------------------------------
 kern - 03-20-2006 10:09 PST 
----------------------------------------------------------------------
I am unable to duplicate this. For me after the job initially fails then
restarts, it runs with no problems.  In order to get any further with this
problem, you will need to provide some way I can duplicate the problem and
many more details of what is going on -- e.g. exactly why the job is
waiting.

It is possible that in some cases the old job will not be completely
flushed out of the SD for a minimum of 30 minutes, so any test where the
old job is "stuck" needs to wait a minimum of 30 minutes.

----------------------------------------------------------------------
 kodis - 03-21-2006 08:13 PST 
----------------------------------------------------------------------
I've turned on debugging output, and the first indication I see that
something has gone wrong is when the director writes:

    rubble-dir: backup.c:286 FDStatus=T

Things on the file daemon side look okay.  There are a series of heartbeat
messages, and what looks like a normal completion:

    fd: heartbeat.c:82 wait_intr=0 stop=0
    fd: backup.c:111 end blast_data ok=1
    fd: job.c:208 Quit command loop.  Cancelled=0
    fd: job.c:289 Calling term_find_files
    fd: job.c:292 Done with term_find_files
    fd: job.c:294 Done with free_jcr

I'll take a look in backup.c around line 286.  It doesn't seem like the
director should be getting an FDStatus of T under these conditions.

----------------------------------------------------------------------
 enrique - 03-23-2006 04:14 PST 
----------------------------------------------------------------------
I can duplicate it:

Running Jobs:
 JobId Level   Name                       Status
======================================================================
   412 Differe  Pc-Fernando-Win.2006-03-23_04.10.00 has terminated
   414 Full    Pc-Silvia-Win.2006-03-23_04.10.02 is waiting for its start
time
   417 Increme  Pcjose1.2006-03-23_04.10.05 has terminated
   418 Increme  Andrea.2006-03-23_08.00.00 has terminated
   419 Increme  Mendiaz.2006-03-23_08.00.01 has terminated
====

running beta6. The similar "has a fatal error" issue of bug 557 has been
resolved on beta6. On 1.38.5 I couldn't duplicate this bug. I remember, I
turned on the computer before last attempt, ths status was OK and was
listed at "Terminated jobs". No problem. Right, Kern, you can see it on
the mail I sent you about bug 557. 

Im not sure about the way to duplicate. More over, I think it worked right
yesterday without configuration change, but Im not sure.

----------------------------------------------------------------------
 kern - 03-29-2006 02:31 PST 
----------------------------------------------------------------------
I see nothing wrong in the debug output below.

From what you guys have written, I am a bit confused about whether or not
the bug still exists.  Please note that under certain conditions, it can
take 30 minutes for the failed job to completely disappear from Bacula
status output.

Could you please upgrade to 1.38.6, where I have made some fixes that
should resolve some race conditions.  Then let me know whether or not the
bug still exists, and if it does, could you please try to describe how to
reproduce it.

----------------------------------------------------------------------
 enrique - 03-30-2006 03:49 PST 
----------------------------------------------------------------------
I'm using beta6 of 1.38.6.

As I said on the other note, it didnt happen on 1.38.5. A client
configuration example is:

        Rerun Failed Levels = yes
        Reschedule on Error = yes
        Reschedule Interval = 2 hours
        Reschedule Times = 8
        Run Before Job = "nc -w5 -z pc-fernando 9102"
        Write Bootstrap = "/var/bootstraps/pc-fernando-fd.bsr"

and after few failed attempts at night (poweroff, Run Before Job exit
status -1) the computer was turned on and backup was done. No mail is sent
and it is listed at "Running Jobs" instead of Terminated.

Running Jobs:
 JobId Level   Name                       Status
======================================================================
   652 Differe  Pc-Fernando-Win.2006-03-30_04.10.00 has terminated
   654 Full    Pc-Silvia-Win.2006-03-30_04.10.02 is waiting for its start
time
   656 Differe  pc-dsp.2006-03-30_04.10.04 is waiting for its start time
   658 Increme  Pcjose1.2006-03-30_04.10.06 is waiting for its start time
   659 Increme  Andrea.2006-03-30_08.00.00 has terminated
   660 Increme  Mendiaz.2006-03-30_08.00.01 has terminated
   661 Increme  Pc-roman.2006-03-30_08.00.02 has terminated
====

This bug is the opposite of 557, i mean:

* on 557, if I turned on the computer, the job was done, mail was sent and
It was listed at "Terminated Jobs". But if the LAST attempt also failed, it
was listed at "Running Jobs" without mail sent. This was fixed at beta6

* on this bug, mail is not sent if It has OK status.

557 failed on ERROR status and this bug on OK status.

Moreover, after some "has terminated" I have to restart bacula daemons to
make It work. If I run more jobs manually it works, but If I wait until
next morning, all night jobs are still waiting.

----------------------------------------------------------------------
 kern - 03-30-2006 04:25 PST 
----------------------------------------------------------------------
Please re-read my last bug note. 

I'm not very interested in what happened in 1.38.5 nor am I interested in
what happened in 1.38.6 beta6. Both those things only complicate the
problem as the code has changed. 

I need to know if there is still a problem in the released version of
1.38.6, and if there is a problem, the details of the current problem,
particularly some *simple* steps to reproduce it, but without the previous
history of older versions.

----------------------------------------------------------------------
 enrique - 03-30-2006 07:56 PST 
----------------------------------------------------------------------
Ive just installed new version. With the same configuration files, its not
fixed:

*status dir
neptuno-dir Version: 1.38.6 (28 March 2006) x86_64-unknown-linux-gnu
debian testing/unstable
Daemon started 30-mar-06 15:37, 22 Jobs run since started.
[...]
Running Jobs:
 JobId Level   Name                       Status
======================================================================
   663 Increme  Sunipx1.2006-03-30_15.38.49 has terminated
   667 Catalog  Urbion-Verify.2006-03-30_15.38.53 is waiting for higher
priority jobs to finish
   678 Increme  Pc-Fernando-Win.2006-03-30_15.39.04 is waiting for its
start time
   681 Full    Pc-Silvia-Win.2006-03-30_15.39.07 is waiting for its start
time
   684 Differe  pc-dsp.2006-03-30_15.39.10 is waiting for its start time
   686 Increme  Pcjose1.2006-03-30_15.39.12 is waiting for its start time
   688 Increme  Mendiaz.2006-03-30_15.39.14 is waiting for its start time
====

[...]

----------------------------------------------------------------------
 enrique - 03-30-2006 08:05 PST 
----------------------------------------------------------------------
The way to reproduce the last bug note:

1. stop my file daemon
2. run all jobs. approx 27. My job is the first one.
3. when third job is running (for example), I start my file daemon
4. wait till the end.

----------------------------------------------------------------------
 kodis - 04-06-2006 12:59 PDT 
----------------------------------------------------------------------
Unfortunately, I'm still seeing this problem.  I've built and installed
version 1.38.6 as a server on an AMD64 platform and as a client on both a
PPC Mac running the current release of OS X and on an i386 PC running FC5,
compiling with the "-O0" option, just to be sure that this isn't related to
an optimization problem.  As in previous tests, I've configured the job to 
reschedule on error after waiting 5 minutes, and forced the initial attempt
to fail by disconnecting the client from the network.

I've started the director with a -d200 debug level, and see the first
attempt fail, get rescheduled, run to what seems like completion, and see
the message:

    rumba-dir: backup.c:286 FDStatus=T

as the last message written out by the director.  Even after waiting
several hours, there is no additional output, and a bconsole "status
director" reports that the job has terminated.  Both the Mac and Fedora
clients display this behavior.

Sadly, I have not idea what I'm doing that causes this to fail for me
while it works perfectly for you.  If there are any tests that I can run
or any other information that I can provide to help sort this out I'll be
happy to do so, or if you have any debugging hints I'd be grateful for the
advice.

----------------------------------------------------------------------
 enrique - 05-22-2006 07:15 PDT 
----------------------------------------------------------------------
is this bug fixed in 1.38.9?

----------------------------------------------------------------------
 jaimeventura - 05-23-2006 03:53 PDT 
----------------------------------------------------------------------
Seems like the problem remains in 1.38.9.
Here is my case:
Im using 1.38.9 on a x86_64 rhel.
I've rebuild from src.rpm usin the following command:
rpmbuild --rebuild --define "build_rhel4 1" --define "build_mysql4 1"
--define "build_x86_64 1" bacula-1.38.9-2.src.rpm
A simple way to reproduce the bug:
1)From a clean install change passwords, a test connections between dir,
sd and fd(all working on the same machine).
2)change Job for "Client1", adding the lines:
  Reschedule On Error = yes
  Reschedule Interval = 2 minutes
  Reschedule Times = 10
3) (re)start the dir with -d400, STOP the fd(so we can cause a connection
error)
4) on the console run command "run Client1". type "yes"
5) quick result:
    a) job will be reported as "started"
    b) job will be reported as "finished/failed"
    c) job will be rescheduled
    d) job will wait execution
    e) job will be reported as "started"
    f) job will freese. In fact, job will fail, but dir waits for "message
thread termination"... for ever. And so, that job will never be
marked/reported as finished/failed
Why? i'm trying to figure out, thought im not very experienced with
threads.
Developers: as you might seen, dir waits for "message thread termination"
on file msgchan.c:368
I will try to upload my test configuration files and the debug result.
I understant my note not doesnt refer exactly to the reported bug, but I
beleive the issue causing problem is the same.
I hope it helps, because im really need this working.
If is there any other way I can help, please tell me.
Thanks

----------------------------------------------------------------------
 jaimeventura - 05-23-2006 12:43 PDT 
----------------------------------------------------------------------
Hello Again
After a bit of debug I've noticed that in msgchan.c:368
(pthread_cond_timedwait(&jcr->term_wait, &mutex, &timeout);) the
termination condition variable(&jcr->term_wait) wasnt
initialized(jcr->term_wait_inited=false) whenever the job has failed and
was rescheduled. In fact, a "partial cleanup old
stuff"(dird_free_jcr_pointers(jcr);) was cleaning the termination
condition variable(jobq.c:483). And so the thread would be waiting for it
on msgchan.c:368.
Im not shure if i've made the correct changes, but the fact is that it
works now.

Besides that, I noticed that when the failed job write some data to sd, a
new job would be created. Some values are copied, but not then
jcr->sched_time. So, in this case, the job would be rescheduled to run
immediatly. I believe this was not what was supposed to happen.

Im uploading a patch (bacula-1.38.9-reschedule.patch)
Please, send me some feedback on this.
Thanks.

----------------------------------------------------------------------
 kern - 05-24-2006 05:40 PDT 
----------------------------------------------------------------------
I have never been able to reproduce this exact bug, so *many* thanks for
your patch. It looks perfectly good to me and in any case, I cannot see
that it will do any harm.

I've uploaded your patch as I applied it (minor spacing difference). This
patch, however, also has several fixes to ensure that the JobStatus of the
restarted job is correctly set.

I would appreciate it if the other users having this same problem would
try my patch and report back if it corrects your problem.

Bug History
Date Modified  Username       Field                    Change              
======================================================================
03-20-06 08:44 kodis          New Bug                                      
03-20-06 10:09 kern           Bugnote Added: 0001566                       
03-20-06 10:09 kern           Status                   new => feedback     
03-21-06 08:13 kodis          Bugnote Added: 0001570                       
03-23-06 04:14 enrique        Bugnote Added: 0001581                       
03-29-06 02:31 kern           Bugnote Added: 0001601                       
03-30-06 03:49 enrique        Bugnote Added: 0001606                       
03-30-06 04:25 kern           Bugnote Added: 0001607                       
03-30-06 07:56 enrique        Bugnote Added: 0001608                       
03-30-06 08:05 enrique        Bugnote Added: 0001609                       
04-06-06 12:59 kodis          Bugnote Added: 0001628                       
05-22-06 07:15 enrique        Bugnote Added: 0001683                       
05-23-06 03:53 jaimeventura   Bugnote Added: 0001684                       
05-23-06 03:56 jaimeventura   File Added: ConsoleResult-BugNote0001684                    
05-23-06 03:57 jaimeventura   File Added: debugDump-BugNote0001684                    
05-23-06 12:43 jaimeventura   Bugnote Added: 0001685                       
05-23-06 12:44 jaimeventura   File Added: bacula-1.38.9-reschedule.patch                    
05-24-06 05:37 kern           File Added: 1.38.9-jobq.patch                    
05-24-06 05:40 kern           Bugnote Added: 0001687                       
05-24-06 05:40 kern           Resolution               open => fixed       
05-24-06 05:40 kern           Status                   feedback => resolved
======================================================================