From: Mantis B. T. <no...@bu...> - 2009-03-30 08:32:02
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1258 ====================================================================== Reported By: RalfGross Assigned To: ====================================================================== Project: bacula Issue ID: 1258 Category: Director Reproducibility: sometimes Severity: minor Priority: normal Status: feedback ====================================================================== Date Submitted: 2009-03-25 08:38 GMT Last Modified: 2009-03-30 09:31 BST ====================================================================== Summary: verify job with differences doesn't terminate Description: bacula 2.4.4 (fd,sd,dir), postgres 8.1, debian etch I often (maybe every time) see that verify jobs with differences don't terminate. eg: * st dir [...] Running Jobs: JobId Level Name Status ====================================================================== 10632 VolumeT VerifyVU0EF005-MPC-Volume1.2009-03-24_08.26.25.15 has verify differences ==== [...] the client where the verify job is running: *st client=VU0EF005-fd [...] Running Jobs: JobId 10632 Job VerifyVU0EF005-MPC-Volume1.2009-03-24_08.26.25.15 is running. Verify Job started: 24-Mar-09 08:26 Files=15,818 Bytes=0 Bytes/sec=0 Errors=0 Files Examined=15,818 Processing file: /data/blah/blah/foobar.xls SDReadSeqNo=110424772 fd=6 Director connected at: 25-Mar-09 09:03 [...] status of the storage device from which the verify job is reading: * st stor [...] Running Jobs: Reading: Verify Volume to Catalog Restore job VerifyVU0EF005-MPC-Volume1.2009-03-24_08 JobId=10632 Volume="A00136L4" pool="Nightview-Full" device="ULTRIUM-TD4-D3" (/dev/ULTRIUM-TD4-D3) Writing: Verify Volume to Catalog Restore job VerifyVU0EF005-MPC-Volume1.2009-03-24_08 JobId=10632 Volume="A00136L4" pool="Nightview-Full" device="ULTRIUM-TD4-D3" (/dev/ULTRIUM-TD4-D3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDReadSeqNo=74,138 in_msg=74137 out_msg=110425435 fd=5 [...] There is no activity anymore. The number of files in the st dir output doesn't grow and the error counter doesn't increase. There is also no tape activity anymore. I can't find anything related in the bacula log (not flushed because the job has not terminated yet?), and I can't scroll back all console messages. No error mail was sent yet. What can I do now? IIRC I had to cancel those jobs in the past. Right now I can't find out, which files have verify differences or what the problem is at all. ====================================================================== ---------------------------------------------------------------------- (0004069) kern (administrator) - 2009-03-25 09:37 http://bugs.bacula.org/view.php?id=1258#c4069 ---------------------------------------------------------------------- I am sorry, but with such little information we can do nothing for you. I cannot even verify that the Job is not running. Please read the Support page of the manual. In particular, you will need to prove that the it is really stuck by showing us that either nothing is happening in the SD, or giving some indication on how long the job has been running, then if it appears to be stuck in the SD, we need a traceback -- see the Kaboom manual of the chapter. Since we don't really know that it is a problem in the SD either submit tracebacks of the DIR and FD, or expect that we may ask for them. ---------------------------------------------------------------------- (0004070) RalfGross (reporter) - 2009-03-25 10:10 http://bugs.bacula.org/view.php?id=1258#c4070 ---------------------------------------------------------------------- Well, this is the information I get from bacula without raising debug level. >From the directors point of view, the job _is_ running, but I can't see what's going on and the dir tells me that there are verify differences. If I understand the kaboom chapter correctly, it only helps if a daemon crashes, that seems not to be the case here. The job is running 26 hours now. Alls daemons are responding (status query). My problem is exactly the same as yours, I don't see what's happening right now. The status client.... JobId 10632 Job VerifyVU0EF005-MPC-Volume1.2009-03-24_08.26.25.15 is running. Verify Job started: 24-Mar-09 08:26 Files=15,818 Bytes=0 Bytes/sec=0 Errors=0 Files Examined=15,818 Processing file: /data/--snip--- SDReadSeqNo=110424772 fd=6 Director connected at: 25-Mar-09 10:43 ...and storage output now: Reading: Verify Volume to Catalog Restore job VerifyVU0EF005-MPC-Volume1.2009-03-24_08 JobId=10632 Volume="A00136L4" pool="Nightview-Full" device="ULTRIUM-TD4-D3" (/dev/ULTRIUM-TD4-D3) Writing: Verify Volume to Catalog Restore job VerifyVU0EF005-MPC-Volume1.2009-03-24_08 JobId=10632 Volume="A00136L4" pool="Nightview-Full" device="ULTRIUM-TD4-D3" (/dev/ULTRIUM-TD4-D3) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDReadSeqNo=74,138 in_msg=74137 out_msg=110425435 fd=5 ==== The output is showing exactly the same amount of examined files as 2 hours ago. Is there any way to get more info about what's going on without canceling the jobs and starting the verify job with a higher debug level again? strace on the client where the verify job is running: root@VU0EF005:~$pgrep -lf bacula 4461 /usr/sbin/bacula-fd -c /etc/bacula/bacula-fd.conf -d10 root@VU0EF005:~$strace -s 1024 -Ff -p 4461 Process 21691 attached with 4 threads - interrupt to quit [pid 4461] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] select(5, [3 4], NULL, NULL, NULL <unfinished ...> [pid 21691] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 4461] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] select(5, [3 4], NULL, NULL, NULL <unfinished ...> [pid 21691] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 21691] write(5, "\377\377\377\372", 4 <unfinished ...> [pid 4468] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 21688] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 4461] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) [pid 4461] select(5, [3 4], NULL, NULL, NULL <unfinished ...> [pid 4468] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 21688] --- SIGSTOP (Stopped (signal)) @ 0 (0) --- [pid 21691] <... write resumed> ) = ? ERESTARTSYS (To be restarted) [pid 21691] write(5, "\377\377\377\372", 4 <unfinished ...> [pid 4468] restart_syscall(<... resuming interrupted call ...> <unfinished ...> [pid 21688] write(5, "nk ALDW/Continuous_2007.11.28_at_18.26.21.rec_4480643666.xls\0hg BBWOoA IGw B /0 II/ A DwA BAA g BIbqoA BIbqoA BJLRvk A A C\0\0", 124 <unfinished ...> [pid 4468] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 4468] futex(0x54ed20, FUTEX_WAKE, 1) = 0 [pid 4468] clock_gettime(CLOCK_REALTIME, {1237975460, 134865660}) = 0 [pid 4468] futex(0x54ed64, FUTEX_WAIT, 114415, {29, 999954340}) = -1 ETIMEDOUT (Connection timed out) [pid 4468] futex(0x54ed20, FUTEX_WAKE, 1) = 0 [pid 4468] clock_gettime(CLOCK_REALTIME, {1237975490, 135153026}) = 0 [pid 4468] futex(0x54ed64, FUTEX_WAIT, 114417, {29, 999956974}) = -1 ETIMEDOUT (Connection timed out) [pid 4468] futex(0x54ed20, FUTEX_WAKE, 1) = 0 [pid 4468] clock_gettime(CLOCK_REALTIME, {1237975520, 135336561}) = 0 [pid 4468] futex(0x54ed64, FUTEX_WAIT, 114419, {29, 999981439} Anything interesting? I've attached a more detailed strace output, including a status client command. ---------------------------------------------------------------------- (0004071) kern (administrator) - 2009-03-25 15:39 http://bugs.bacula.org/view.php?id=1258#c4071 ---------------------------------------------------------------------- Well, it is very difficult to know what is going on inside Bacula with an strace, but from what I see here, it looks to me like the SD is trying to send attribute data to the Director or to the FD, and it is getting a timeout. This would indicate that something is wrong with your comm line. If this is the case, normally in something like 2 hours the SD should terminate the job so it would not remain stalled very long. I am not sure where the SIGSTOPs are coming from but as far as I know, that is something that should never happen to Bacula. It also seems to me that we had a lot of problems at one large site on Etch due to the OS -- for example, we could not get good tracebacks. The solution was to move up to Lenny. I am not proposing this as a solution for you, but I will be surprised if you can get a good traceback, in which case we are out of luck. ---------------------------------------------------------------------- (0004072) RalfGross (reporter) - 2009-03-25 15:49 http://bugs.bacula.org/view.php?id=1258#c4072 ---------------------------------------------------------------------- The jobs was still running after 6 more hours, so I canceled it. If this had been a single event, I wouldn't care much. But I've seen this more often in the last weeks/months. I'm not sure if this started to show up after updating from 2.2.8 to 2.4.x. I'm not running the verify again on an other client, lets see what happens... ---------------------------------------------------------------------- (0004073) RalfGross (reporter) - 2009-03-25 15:56 http://bugs.bacula.org/view.php?id=1258#c4073 ---------------------------------------------------------------------- After canceling the job, I now have the job output in the log file (would be helpfull, if the output would also be written to the file while the job is running, not at the end only.) At 24-Mär 20:14 bacula detected verify differences. There are no more messages in the log file after that until I cancled the job at 25-Mär 16:45. [...] 24-Mär 18:46 VU0EA003-sd JobId 10632: Ready to read from volume "A00136L4" on device "ULTRIUM-TD4-D3" (/dev/ULTRIUM-TD4-D3). 24-Mär 18:46 VU0EA003-sd JobId 10632: Forward spacing Volume "A00136L4" to file:block 0:1. 24-Mär 20:14 VUMEM004-dir JobId 10632: File: /data/MPC/MPC-Volume1/2008-08-Snapshots Freigabedatenbank ALDW/Continuous_2007.11.26_at_17.45.23.rec_160417087.jpg 24-Mär 20:14 VUMEM004-dir JobId 10632: st_ino differ. Cat: 1096302373 File: 18392901092376613 25-Mär 16:45 VU0EA003-sd JobId 10632: Job VerifyVU0EF005-MPC-Volume1.2009-03-24_08.26.25.15 marked to be canceled. 25-Mär 16:45 VU0EA003-sd JobId 10632: Job VerifyVU0EF005-MPC-Volume1.2009-03-24_08.26.25.15 marked to be canceled. ---------------------------------------------------------------------- (0004076) RalfGross (reporter) - 2009-03-30 09:31 http://bugs.bacula.org/view.php?id=1258#c4076 ---------------------------------------------------------------------- The job finished without problems on an other client. The client where the job was first started is behind a firewall, maybe this is was a problem? I would use the heartbeat option, but this doesn't work with 2.4.x and verify jobs at the moment (remember my old ticket...). I still think that bacula's behavior is not optimal at this point, given the fact that the job was doing nothing for >12h and nothing interesting was to get from the various status commands + I had to cancel the job manually to get on. Issue History Date Modified Username Field Change ====================================================================== 2009-03-25 08:38 RalfGross New Issue 2009-03-25 09:37 kern Note Added: 0004069 2009-03-25 09:37 kern Status new => feedback 2009-03-25 10:10 RalfGross Note Added: 0004070 2009-03-25 10:10 RalfGross File Added: fd.log 2009-03-25 15:39 kern Note Added: 0004071 2009-03-25 15:49 RalfGross Note Added: 0004072 2009-03-25 15:56 RalfGross Note Added: 0004073 2009-03-30 09:31 RalfGross Note Added: 0004076 ====================================================================== |