From: <bac...@li...> - 2004-08-19 13:25:12
|
The following bug has been CLOSED ====================================================================== http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087 ====================================================================== Reported By: KlayVessel Assigned To: ====================================================================== Project: bacula Bug ID: 87 Category: Storage Daemon Reproducibility: have not tried Severity: minor Priority: normal Status: closed ====================================================================== Date Submitted: 08-19-2004 05:19 PDT Last Modified: 08-19-2004 06:25 PDT ====================================================================== Summary: Storage daemon goes into infinite loop during media change Description: The storage daemon went into an infinite loop after the "Daily02" tape (see additional info) filled and it was trying to switch to "Daily01" -- our Daily pool only has 2 tapes. Notice that the output reports a "permission error"; however as /dev/nst1 is set to rw for root and the bacula-sd process runs as root I don't see why that should have happened. There are no messages in /var/log/messages related to this problem. This may also be related to an earlier restore. Earlier today I did a test restore for one of our XP desktops (just to validate the backup process). I noticed once before in an earlier version of bacula (1.34.2 I believe) when I tested a very large full restore (>500K files) that the storage daemon also went into an infinite loop the next time it needed to write. But that is a different system and writing only to file volumes; not tape. I also see that only one job switch to the "Daily01" tape (in status storage output). Perhaps there is something related to doing this while also using simultaneous jobs? I have included in "Additonal Information" the console output at the time and some status information and a snapshot from top -- this was after about 1.75 hours; the backups start at 11pm. The file I've uploaded is a .tar.gz containing the three main bacula-**.conf files on the backup server. I've redacted the passwords. Tape drive: Exabyte VXA-2 with autoloader (10 bay; single drive); it appears that the tape change worked successful. Backups and restore have been working (btape works fine) Unrelated device /dev/nst0 is a HP DDS-3 unit but managed by the same SD. I cancelled the running jobs and, as of right now (1:13am) all but two completed the cancellation. I have left everything running in the hopes that the remaining two will clear/cancel and then I can restart the SD. If these don't clear in a few hours, I will add to this report. ====================================================================== ---------------------------------------------------------------------- kern - 08-19-2004 06:25 PDT ---------------------------------------------------------------------- I'm going to close this bug, because there is not enough information to fix it. I would appreciate it if you would do two things: 1. Make absolutely sure that: a. Your spool directory never fills. b. Make sure you never attempt to backup your spool directory. 2. If this happens again, please get a manual dump of the SD as described in the Kaboom chapter of the manual, and either re-open this bug or create a new one. ---------------------------------------------------------------------- KlayVessel - 08-19-2004 07:04 PDT ---------------------------------------------------------------------- Reporting back. It's now 3am EDT (about 2 hours since I cancelled the jobs). The two jobs are still in the director's queue but marked as "has been canceled". The SD is still running. *status dir ... Running Jobs: JobId Level Name Status ====================================================================== 831 Increme orca.2004-08-18_23.00.03 has been canceled 828 Increme hawk.2004-08-18_23.00.00 has been canceled ==== ... The SD is (probably) not responding to director's requests, current status still shows 4 jobs in the SD: *st storage=Autochanger Connecting to Storage daemon Autochanger at hawk.redklay2:9103 hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3 Daemon started 10-Aug-04 11:51, 96 Jobs run since started. Running Jobs: Full Backup job orca_db_redo JobId=832 Volume="Daily02" device="/dev/nst1" Files=8 Bytes=658,453 Bytes/sec=45 FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15 Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1" Files=233 Bytes=5,770,856 Bytes/sec=395 FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13 Incremental Backup job panda JobId=830 Volume="Daily01" device="/dev/nst1" Files=650 Bytes=46,141,511 Bytes/sec=3,164 FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11 Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1" Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546 FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9 ==== ... I will be terminating/restarting the SD and director and restarting these backup jobs. ---------------------------------------------------------------------- KlayVessel - 08-19-2004 07:07 PDT ---------------------------------------------------------------------- I should have included the client (FD) status as well, for these two jobs the clients still show they are running (probably waiting on SD): *st client=orca-fd Connecting to Client orca-fd at orca.redklay2:9102 orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3 Daemon started 15-Aug-04 09:10, 8 Jobs run since started. Terminated Jobs: JobId Level Files Bytes Status Finished Name ====================================================================== 758 Sinc 691 82,734,632 OK 12-Aug-04 23:15 orca 767 Full 514,961 7,839,492,211 OK 14-Aug-04 11:59 orca 792 Sinc 80 3,104,564 OK 15-Aug-04 09:15 orca 797 Full 8 5,692 OK 15-Aug-04 23:00 orca_db_redo 796 Sinc 26 3,355,004 OK 15-Aug-04 23:04 orca 808 Full 8 1,090,026 OK 16-Aug-04 23:00 orca_db_redo 807 Sinc 1,861 6,756,867 OK 16-Aug-04 23:05 orca 819 Full 8 878,011 OK 17-Aug-04 23:00 orca_db_redo 818 Sinc 222 6,957,873 OK 17-Aug-04 23:06 orca 832 Full 8 657,684 Cancel 19-Aug-04 01:07 orca_db_redo ==== Running Jobs: Director connected at: 19-Aug-04 03:06 JobId 831 Job orca.2004-08-18_23.00.03 is running. Backup Job started: 18-Aug-04 23:00 Files=235 Bytes=5,839,057 Bytes/sec=395 Files Examined=537,647 Processing file: /d01/app/oracle/product/8.1.5/network/log/listener.log SDReadSeqNo=5 fd=6 ==== *st client=hawk-fd Connecting to Client hawk-fd at hawk.redklay2:9102 hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3 Daemon started 10-Aug-04 11:51, 18 Jobs run since started. Terminated Jobs: JobId Level Files Bytes Status Finished Name ====================================================================== 779 Full 0 0 OK 14-Aug-04 12:01 BackupCatalog 780 Full 1 686,017,053 OK 14-Aug-04 12:14 BackupCatalog 781 Sinc 156 43,287,087 OK 15-Aug-04 09:05 hawk 782 Full 6 5,004 OK 15-Aug-04 09:05 hawk_db_redo 793 Sinc 132 42,706,198 OK 15-Aug-04 23:00 hawk 794 Full 6 4,288 OK 15-Aug-04 23:00 hawk_db_redo 804 Sinc 123 43,114,702 OK 16-Aug-04 23:00 hawk 805 Full 6 4,414 OK 16-Aug-04 23:00 hawk_db_redo 815 Sinc 133 43,314,887 OK 17-Aug-04 23:00 hawk 816 Full 6 4,422 OK 17-Aug-04 23:00 hawk_db_redo ==== Running Jobs: Director connected at: 19-Aug-04 03:06 JobId 828 Job hawk.2004-08-18_23.00.00 is running. Backup Job started: 18-Aug-04 23:00 Files=17,236 Bytes=4,233,229,628 Bytes/sec=286,532 Files Examined=71,869 Processing file: /web/public/Java/j2sdk-1_3_1_02-win.exe SDReadSeqNo=5 fd=7 ==== * All other clients report that there are no jobs running. ---------------------------------------------------------------------- KlayVessel - 08-19-2004 07:24 PDT ---------------------------------------------------------------------- Final update tonight. When I killed the SD, the clients were immediately freed and the hung cancelled jobs released from the director's queue. I did not have to kill any other processes. However, when I restarted the SD, it could not read the "Daily01" volume label; it had been erased (though it was properly labled and used previously; also the earlier messages show that the SD device status with "Daily01"). I used btape to relabel it and then it remounted fine. Rerunning the jobs worked fine as this started at BOT. By the way....thanks for such a great product!!! Even with the small problems, I've been very happy with Bacula. Bug History Date Modified Username Field Change ====================================================================== 08-19-04 05:19 KlayVessel New Bug 08-19-04 05:19 KlayVessel File Added: bacula-conf-files-20040819.tar.gz 08-19-04 06:25 kern Bugnote Added: 0000175 08-19-04 06:25 kern Resolution open => unable to duplicate 08-19-04 06:25 kern Status new => closed 08-19-04 07:04 KlayVessel Bugnote Added: 0000169 08-19-04 07:07 KlayVessel Bugnote Added: 0000170 08-19-04 07:24 KlayVessel Bugnote Added: 0000171 ====================================================================== |