From: <bac...@li...> - 2004-08-19 05:19:31
The following bug has been SUBMITTED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 05:19 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
Description:
The storage daemon went into an infinite loop after the "Daily02" tape (see
additional info) filled and it was trying to switch to "Daily01" -- our Daily
pool has only 2 tapes. Notice that the output reports a "permission error";
however, as /dev/nst1 is set to rw for root and the bacula-sd process runs as
root, I don't see why that should have happened. There are no messages in
/var/log/messages related to this problem.

This may also be related to an earlier restore. Earlier today I did a test
restore for one of our XP desktops (just to validate the backup process). I
noticed once before, in an earlier version of Bacula (1.34.2, I believe), that
when I tested a very large full restore (>500K files) the storage daemon also
went into an infinite loop the next time it needed to write. But that is a
different system, writing only to file volumes, not tape.

I also see that only one job switched to the "Daily01" tape (in the status
storage output). Perhaps there is something related to doing this while also
using simultaneous jobs?

I have included in "Additional Information" the console output at the time,
some status information, and a snapshot from top -- this was after about 1.75
hours; the backups start at 11pm. The file I've uploaded is a .tar.gz
containing the three main bacula-**.conf files on the backup server. I've
redacted the passwords.

Tape drive: Exabyte VXA-2 with autoloader (10 bay; single drive); it appears
that the tape change worked successfully. Backups and restores have been
working (btape works fine). The unrelated device /dev/nst0 is an HP DDS-3
unit, but it is managed by the same SD.

I cancelled the running jobs and, as of right now (1:13am), all but two
completed the cancellation. I have left everything running in the hope that
the remaining two will clear/cancel and then I can restart the SD. If these
don't clear in a few hours, I will add to this report.
======================================================================
Bug History
Date Modified    Username       Field/Change
======================================================================
08-19-04 05:19   KlayVessel     New Bug
08-19-04 05:19   KlayVessel     File Added: bacula-conf-files-20040819.tar.gz
======================================================================
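The "permission error" theory in the report can be sanity-checked from the shell. A minimal sketch, assuming the device path /dev/nst1 and the bacula-sd process name from the report (the `check_dev` helper name is illustrative); run it as the same user the SD runs as:

```shell
#!/bin/sh
# Sketch only: check write access to the tape device and show which user
# owns the running bacula-sd process. Device path is from the report above.

# Print whether the given path is writable by the current user.
check_dev() {
    if [ -w "$1" ]; then
        echo "writable: $1"
    else
        echo "NOT writable: $1"
        ls -l "$1" 2>/dev/null
    fi
}

check_dev /dev/nst1
# Show the owner of any running bacula-sd process (the report says root).
ps -eo user,comm 2>/dev/null | grep '[b]acula-sd' || echo "bacula-sd not running"
```

If this prints "writable" under the SD's user yet the SD still reports a permission error, the error string is likely masking a different I/O failure.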
From: <bac...@li...> - 2004-08-19 07:04:27
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:04 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:04 GMT
----------------------------------------------------------------------
Reporting back. It's now 3am EDT (about 2 hours since I cancelled the jobs).
The two jobs are still in the director's queue but marked as "has been
canceled". The SD is still running.

*status dir
...
Running Jobs:
 JobId Level   Name                        Status
======================================================================
   831 Increme orca.2004-08-18_23.00.03    has been canceled
   828 Increme hawk.2004-08-18_23.00.00    has been canceled
====
...

The SD is (probably) not responding to the director's requests; the current
status still shows 4 jobs in the SD:

*st storage=Autochanger
Connecting to Storage daemon Autochanger at hawk.redklay2:9103
hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 96 Jobs run since started.

Running Jobs:
Full Backup job orca_db_redo JobId=832 Volume="Daily02" device="/dev/nst1"
    Files=8 Bytes=658,453 Bytes/sec=45
    FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15
Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1"
    Files=233 Bytes=5,770,856 Bytes/sec=395
    FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13
Incremental Backup job panda JobId=830 Volume="Daily01" device="/dev/nst1"
    Files=650 Bytes=46,141,511 Bytes/sec=3,164
    FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11
Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1"
    Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546
    FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9
====
...

I will be terminating/restarting the SD and the director, and restarting
these backup jobs.
======================================================================
Bug History
08-19-04 07:04   KlayVessel     Bugnote Added: 0000169
======================================================================
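Jobs wedged in the "has been canceled" state can be picked out of the `status dir` output with a small filter. A minimal sketch (the sample lines are copied from the status output quoted above; the `count_stuck` function name is illustrative):

```shell
#!/bin/sh
# Sketch: count director jobs stuck in the "has been canceled" state.
# Feed it the Running Jobs section of bconsole's "status dir" output.
count_stuck() {
    grep -c 'has been canceled'
}

# Sample lines copied from the status output in this report; prints 2.
count_stuck <<'EOF'
  831 Increme orca.2004-08-18_23.00.03    has been canceled
  828 Increme hawk.2004-08-18_23.00.00    has been canceled
EOF
```

Piping a live `echo "status dir" | bconsole` through such a filter from cron would have flagged this hang well before the 2-hour mark.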
From: <bac...@li...> - 2004-08-19 07:07:47
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:07 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:07 GMT
----------------------------------------------------------------------
I should have included the client (FD) status as well; for these two jobs the
clients still show they are running (probably waiting on the SD):

*st client=orca-fd
Connecting to Client orca-fd at orca.redklay2:9102
orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3
Daemon started 15-Aug-04 09:10, 8 Jobs run since started.

Terminated Jobs:
 JobId  Level    Files          Bytes  Status  Finished         Name
======================================================================
   758  Sinc       691     82,734,632  OK      12-Aug-04 23:15  orca
   767  Full   514,961  7,839,492,211  OK      14-Aug-04 11:59  orca
   792  Sinc        80      3,104,564  OK      15-Aug-04 09:15  orca
   797  Full         8          5,692  OK      15-Aug-04 23:00  orca_db_redo
   796  Sinc        26      3,355,004  OK      15-Aug-04 23:04  orca
   808  Full         8      1,090,026  OK      16-Aug-04 23:00  orca_db_redo
   807  Sinc     1,861      6,756,867  OK      16-Aug-04 23:05  orca
   819  Full         8        878,011  OK      17-Aug-04 23:00  orca_db_redo
   818  Sinc       222      6,957,873  OK      17-Aug-04 23:06  orca
   832  Full         8        657,684  Cancel  19-Aug-04 01:07  orca_db_redo
====

Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 831 Job orca.2004-08-18_23.00.03 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=235 Bytes=5,839,057 Bytes/sec=395
    Files Examined=537,647
    Processing file: /d01/app/oracle/product/8.1.5/network/log/listener.log
    SDReadSeqNo=5 fd=6
====

*st client=hawk-fd
Connecting to Client hawk-fd at hawk.redklay2:9102
hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 18 Jobs run since started.

Terminated Jobs:
 JobId  Level    Files          Bytes  Status  Finished         Name
======================================================================
   779  Full         0              0  OK      14-Aug-04 12:01  BackupCatalog
   780  Full         1    686,017,053  OK      14-Aug-04 12:14  BackupCatalog
   781  Sinc       156     43,287,087  OK      15-Aug-04 09:05  hawk
   782  Full         6          5,004  OK      15-Aug-04 09:05  hawk_db_redo
   793  Sinc       132     42,706,198  OK      15-Aug-04 23:00  hawk
   794  Full         6          4,288  OK      15-Aug-04 23:00  hawk_db_redo
   804  Sinc       123     43,114,702  OK      16-Aug-04 23:00  hawk
   805  Full         6          4,414  OK      16-Aug-04 23:00  hawk_db_redo
   815  Sinc       133     43,314,887  OK      17-Aug-04 23:00  hawk
   816  Full         6          4,422  OK      17-Aug-04 23:00  hawk_db_redo
====

Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 828 Job hawk.2004-08-18_23.00.00 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=17,236 Bytes=4,233,229,628 Bytes/sec=286,532
    Files Examined=71,869
    Processing file: /web/public/Java/j2sdk-1_3_1_02-win.exe
    SDReadSeqNo=5 fd=7
====

All other clients report that there are no jobs running.
======================================================================
Bug History
08-19-04 07:07   KlayVessel     Bugnote Added: 0000170
======================================================================
From: <bac...@li...> - 2004-08-19 07:25:10
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:24 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:24 GMT
----------------------------------------------------------------------
Final update tonight. When I killed the SD, the clients were immediately
freed, and the hung cancelled jobs were released from the director's queue. I
did not have to kill any other processes.

However, when I restarted the SD, it could not read the "Daily01" volume
label; it had been erased (though it was properly labeled and used
previously; the earlier messages also show the SD device status with
"Daily01"). I used btape to relabel it, and then it remounted fine. Rerunning
the jobs worked fine, as this started at BOT.

By the way... thanks for such a great product! Even with the small problems,
I've been very happy with Bacula.
======================================================================
Bug History
08-19-04 07:24   KlayVessel     Bugnote Added: 0000171
======================================================================
From: <bac...@li...> - 2004-08-19 13:25:12
The following bug has been CLOSED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: closed
======================================================================
Date Submitted: 08-19-2004 05:19 PDT
Last Modified: 08-19-2004 06:25 PDT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
kern - 08-19-2004 06:25 PDT
----------------------------------------------------------------------
I'm going to close this bug, because there is not enough information to fix
it. I would appreciate it if you would do two things:

1. Make absolutely sure that:
   a. your spool directory never fills;
   b. you never attempt to back up your spool directory.

2. If this happens again, please get a manual dump of the SD as described in
   the Kaboom chapter of the manual, and either re-open this bug or create a
   new one.
======================================================================
Bug History
08-19-04 06:25   kern           Bugnote Added: 0000175
08-19-04 06:25   kern           Resolution: open => unable to duplicate
08-19-04 06:25   kern           Status: new => closed
======================================================================
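Kern's first point (never let the spool directory fill) can be automated with a cron-style free-space check. A minimal sketch; the spool path and the 1 GiB threshold are illustrative assumptions, not taken from the report:

```shell
#!/bin/sh
# Sketch: warn before the spool directory fills, per the advice above.
# Path and threshold below are hypothetical examples.
check_spool() {
    dir=$1
    min_kb=$2
    [ -d "$dir" ] || { echo "no such dir: $dir"; return 2; }
    # Free kilobytes on the filesystem holding the spool directory.
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "WARNING: only ${free_kb} KB free under $dir"
        return 1
    fi
    echo "OK: ${free_kb} KB free under $dir"
}

check_spool /var/spool/bacula 1048576   # hypothetical spool path, 1 GiB floor
```

Run from cron, the non-zero return status can be used to mail a warning before the SD ever hits a full spool.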
From: <bac...@li...> - 2004-09-01 17:40:49
|
The following bug has been REOPENED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             feedback
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 10:43 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 08-19-2004 06:25 PDT
----------------------------------------------------------------------
I'm going to close this bug, because there is not enough information to
fix it. I would appreciate it if you would do two things:

1. Make absolutely sure that:
   a. Your spool directory never fills.
   b. You never attempt to back up your spool directory.
2. If this happens again, please get a manual dump of the SD as described
   in the Kaboom chapter of the manual, and either re-open this bug or
   create a new one.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:04 PDT
----------------------------------------------------------------------
Reporting back. It's now 3am EDT (about 2 hours since I cancelled the
jobs). The two jobs are still in the director's queue but marked as "has
been canceled". The SD is still running.

*status dir
...
Running Jobs:
 JobId  Level    Name                       Status
======================================================================
   831  Increme  orca.2004-08-18_23.00.03   has been canceled
   828  Increme  hawk.2004-08-18_23.00.00   has been canceled
====
...

The SD is (probably) not responding to the director's requests; the
current status still shows 4 jobs in the SD:

*st storage=Autochanger
Connecting to Storage daemon Autochanger at hawk.redklay2:9103

hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 96 Jobs run since started.

Running Jobs:
Full Backup job orca_db_redo JobId=832 Volume="Daily02" device="/dev/nst1"
    Files=8 Bytes=658,453 Bytes/sec=45
    FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15
Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1"
    Files=233 Bytes=5,770,856 Bytes/sec=395
    FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13
Incremental Backup job panda JobId=830 Volume="Daily01" device="/dev/nst1"
    Files=650 Bytes=46,141,511 Bytes/sec=3,164
    FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11
Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1"
    Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546
    FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9
====
...

I will be terminating/restarting the SD and the director and restarting
these backup jobs.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:07 PDT
----------------------------------------------------------------------
I should have included the client (FD) status as well; for these two jobs
the clients still show they are running (probably waiting on the SD):

*st client=orca-fd
Connecting to Client orca-fd at orca.redklay2:9102

orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3
Daemon started 15-Aug-04 09:10, 8 Jobs run since started.
...

*st client=hawk-fd
Connecting to Client hawk-fd at hawk.redklay2:9102

hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 18 Jobs run since started.
...

All other clients report that there are no jobs running.

----------------------------------------------------------------------
 KlayVessel - 09-01-2004 10:43 PDT
----------------------------------------------------------------------
I am now able to reproduce this problem. It occurs after doing a large
restore -- in the case that happened last night, I had restored >450,000
files. As you will note, the original bug report also mentions that
particular condition.

Unfortunately, I had not noticed your additional note about the
btraceback, so I restarted bacula-sd before I could capture one. I will do
so the next time and, when I get a chance, recreate this situation so I
can get one for you.

Some additional info about this case: bacula-sd reported a full tape
almost immediately, even though only 10GB had been used on the tape (these
tapes support 80GB native); it switched to the second tape in the pool,
then reported a problem, and then asked for another tape (there are only
two in this pool). When I restarted bacula-sd, it could not mount the tape
because the label had been damaged; I had to use btape to rebuild the
label, after which the new instance could mount the tape and complete the
remaining, waiting jobs.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
======================================================================
|
From: <bac...@li...> - 2004-09-01 18:55:13
|
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             feedback
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 11:58 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 09-01-2004 11:58 PDT
----------------------------------------------------------------------
I think there are a number of problems occurring simultaneously:

1. You are getting a hard I/O error on a tape, then switching to another
   tape; immediately the first tape is recycled, and later it is
   rewritten. This obviously is not going to work.
2. The hard I/O error should have caused the job to terminate in error,
   but did not (this has been reported and corrected in another bug
   report). This is also quite possibly the reason your tape header was
   damaged.
3. You are dealing with such a large amount of data that I cannot
   duplicate this.

I suggest:
- Ensure you have enough tapes so you don't recycle a tape you have just
  written.
- Apply patch 1.34.6-block.patch (in another bug report; sorry, I forgot
  the number).
- Instead of the above item, you could upgrade to 1.35.2.
- If you can tie this down better or find a way for me to reproduce it,
  I'll fix it.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
09-01-04 11:58  kern         Bugnote Added: 0000199
======================================================================
|
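[Editorial aside: kern's first suggestion — never recycle a tape you have just written — maps onto the Director's Pool resource. A hedged sketch of a bacula-dir.conf fragment; the directive names (Recycle, AutoPrune, Volume Retention) are standard Bacula directives, but the values are illustrative assumptions, not taken from the reporter's uploaded configs:]

```conf
# Illustrative Pool resource for bacula-dir.conf -- values are assumed,
# not from the configs attached to this bug.
Pool {
  Name = Daily
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 days   # long enough that a tape written in the
                              # current cycle cannot be pruned and
                              # immediately recycled
}
```

With more than two volumes in the pool and a retention period longer than the write cycle, Bacula is far less likely to recycle the tape it has just filled.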
From: <bac...@li...> - 2004-09-01 18:56:30
|
The following bug has been CLOSED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             closed
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 11:59 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 09-01-2004 11:59 PDT
----------------------------------------------------------------------
For your convenience, I have uploaded the patch noted in the previous
bugnote.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
09-01-04 11:58  kern         Bugnote Added: 0000199
09-01-04 11:58  kern         File Added: 1.34.6-block.patch
09-01-04 11:59  kern         Bugnote Added: 0000200
09-01-04 11:59  kern         Resolution   reopened => unable to duplicate
09-01-04 11:59  kern         Status       feedback => closed
======================================================================
|