From: Mantis B. T. <bac...@li...> - 2008-12-20 17:09:32
The following issue has been CLOSED
======================================================================
http://bugs.bacula.org/view.php?id=1166
======================================================================
Reported By:                hemantshah
Assigned To:                ebollengier
======================================================================
Project:                    bacula
Issue ID:                   1166
Category:                   Director
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     closed
Resolution:                 fixed
Fixed in Version:           2.4.3
======================================================================
Date Submitted:             2008-10-06 19:41 UTC
Last Modified:              2008-12-20 17:09 UTC
======================================================================
Summary:                    Problem canceling job if client looses connection while being backed up
Description:
If the director loses the connection to the client while it is being backed up, I cannot cancel the job. When I cancel the job, I get a message that the job has been canceled, and the "stat dir" command says that it has been canceled, but the next job in the queue does not get started. The rest of the jobs in the queue are waiting for the canceled job.

You can refer to the thread "Problem cancelling a job" in the bacula users mailing list. This has happened three times already.
======================================================================

----------------------------------------------------------------------
 (0003675) ebollengier (developer) - 2008-10-07 16:26
 http://bugs.bacula.org/view.php?id=1166#c3675
----------------------------------------------------------------------
Steps to reproduce:
- Run backup
- do "killall -STOP bacula-fd" after a few seconds
- try to cancel the job

The director will never finish the job because it waits for the EndJob message without a timeout.
Thread 4 (Thread 0x42d34950 (LWP 23556)):
#0  0x00007f79224b25cb in read () from /lib/libpthread.so.0
#1  0x00000000004542b6 in read_nbytes (bsock=0x6f1d88, ptr=0x42d33d24 "", nbytes=4) at bnet.c:82
#2  0x0000000000457d82 in BSOCK::recv (this=0x6f1d88) at bsock.c:418
#3  0x00000000004166fc in bget_dirmsg (bs=0x6f1d88) at getmsg.c:136
#4  0x000000000040bc33 in wait_for_job_termination (jcr=0x8b2ee8, timeout=0) at backup.c:365
#5  0x000000000040c603 in do_backup (jcr=0x8b2ee8) at backup.c:321
#6  0x000000000041a17f in job_thread (arg=<value optimized out>) at job.c:310
#7  0x000000000041caf3 in jobq_server (arg=0x6b2500) at jobq.c:466
#8  0x00007f79224ab3f7 in start_thread () from /lib/libpthread.so.0
#9  0x00007f79210a9b3d in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()

----------------------------------------------------------------------
 (0003676) ebollengier (developer) - 2008-10-07 16:39
 http://bugs.bacula.org/view.php?id=1166#c3676
----------------------------------------------------------------------
hemantshah,

Next time, could you simply check the bacula-fd status (with a simple ps) on your server and restart it to see if it helps?

In my tests, if I kill -9 the bacula-fd, the connection is broken and the director receives a network packet that ends the dialog.
(But it's on the same machine)

Thanks

----------------------------------------------------------------------
 (0003677) hemantshah (reporter) - 2008-10-09 01:29
 http://bugs.bacula.org/view.php?id=1166#c3677
----------------------------------------------------------------------
Once, the network connection to an offsite system was down while that system was being backed up. When I canceled the job, the network was still down, so I could not run a ps command.

Twice, the client system had hung due to a hardware problem, and I could not even log in to the client. I had to hard reset the client system. Even after the system was rebooted, the bacula dir did not start the next job in the queue. bacula starts automatically at boot, so it was running when I rebooted the system.

In my case the client and server are two different physical systems.

----------------------------------------------------------------------
 (0003684) ebollengier (developer) - 2008-10-13 10:50
 http://bugs.bacula.org/view.php?id=1166#c3684
----------------------------------------------------------------------
This problem can be fixed with the 2.4.3-cancel-after-network-outage.patch

Thanks for giving me feedback.

----------------------------------------------------------------------
 (0003852) kern (administrator) - 2008-12-20 17:09
 http://bugs.bacula.org/view.php?id=1166#c3852
----------------------------------------------------------------------
I believe that this is fixed now.
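
For readers following note 0003675: the backtrace there shows the director parked in read() under wait_for_job_termination(timeout=0), so a client that goes silent (SIGSTOP, hang, network outage) never lets the call return. The sketch below only illustrates that general failure mode and one common remedy, a bounded wait before reading. It is not the content of 2.4.3-cancel-after-network-outage.patch, which is not shown in this thread. The function name recv_with_timeout and the local socket pair are assumptions made up for the illustration.

```python
import select
import socket


def recv_with_timeout(sock, nbytes, timeout):
    """Read up to nbytes from sock, or return None if no data arrives in time.

    Hypothetical helper for illustration only. timeout is in seconds;
    timeout=None waits forever, which is the behavior that produces the hang.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # The peer is silent: the caller regains control and can act on a cancel.
        return None
    return sock.recv(nbytes)


# A connected local socket pair stands in for the director <-> client link.
director_side, client_side = socket.socketpair()

# The "client" sends nothing (as if stopped), so the bounded read times out.
silent = recv_with_timeout(director_side, 4, timeout=0.2)

# Once the client sends four bytes (matching nbytes=4 in the backtrace),
# the same call returns them.
client_side.sendall(b"\x00\x00\x00\x04")
header = recv_with_timeout(director_side, 4, timeout=0.2)

director_side.close()
client_side.close()
```

With a bounded wait, the waiting thread gets a chance to notice that the job was canceled instead of blocking until the peer speaks again.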

Issue History
Date Modified    Username       Field                    Change
======================================================================
2008-10-06 19:41 hemantshah     New Issue
2008-10-07 16:24 ebollengier    Status                   new => assigned
2008-10-07 16:24 ebollengier    Assigned To               => ebollengier
2008-10-07 16:26 ebollengier    Note Added: 0003675
2008-10-07 16:39 ebollengier    Note Added: 0003676
2008-10-07 16:39 ebollengier    Status                   assigned => feedback
2008-10-07 21:14 hemantshah     Note Added: 0003677
2008-10-07 21:16 hemantshah     Note Edited: 0003677
2008-10-08 02:53 hemantshah     Issue Monitored: hemantshah
2008-10-09 01:29 hemantshah     Note Edited: 0003677
2008-10-13 10:44 ebollengier    File Added: 2.4.3-cancel-after-network-outage.patch
2008-10-13 10:50 ebollengier    Status                   feedback => resolved
2008-10-13 10:50 ebollengier    Fixed in Version          => 2.4.3
2008-10-13 10:50 ebollengier    Resolution               open => fixed
2008-10-13 10:50 ebollengier    Note Added: 0003684
2008-12-20 17:09 kern           Note Added: 0003852
2008-12-20 17:09 kern           Status                   resolved => closed
======================================================================