From: <bac...@li...> - 2004-08-19 05:19:31
The following bug has been SUBMITTED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 05:19 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
Description:
The storage daemon went into an infinite loop after the "Daily02" tape (see
additional info) filled and it was trying to switch to "Daily01" -- our Daily
pool has only 2 tapes. Notice that the output reports a "permission error";
however, as /dev/nst1 is set to rw for root and the bacula-sd process runs as
root, I don't see why that should have happened. There are no messages in
/var/log/messages related to this problem.

This may also be related to an earlier restore. Earlier today I did a test
restore for one of our XP desktops (just to validate the backup process). I
noticed once before, in an earlier version of Bacula (1.34.2, I believe), that
when I tested a very large full restore (>500K files) the storage daemon also
went into an infinite loop the next time it needed to write. But that is a
different system, writing only to file volumes, not tape.

I also see that only one job switched to the "Daily01" tape (in the status
storage output). Perhaps there is something related to doing this while also
using simultaneous jobs?

I have included in "Additional Information" the console output at the time,
some status information, and a snapshot from top -- this was after about 1.75
hours; the backups start at 11pm. The file I've uploaded is a .tar.gz
containing the three main bacula-**.conf files on the backup server. I've
redacted the passwords.

Tape drive: Exabyte VXA-2 with autoloader (10 bay; single drive); it appears
that the tape change worked successfully. Backups and restores have been
working (btape works fine). The unrelated device /dev/nst0 is an HP DDS-3
unit, but it is managed by the same SD.

I cancelled the running jobs and, as of right now (1:13am), all but two
completed the cancellation. I have left everything running in the hope that
the remaining two will clear/cancel and then I can restart the SD. If these
don't clear in a few hours, I will add to this report.
======================================================================
Bug History
Date Modified    Username       Field/Change
======================================================================
08-19-04 05:19   KlayVessel     New Bug
08-19-04 05:19   KlayVessel     File Added: bacula-conf-files-20040819.tar.gz
======================================================================
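The "permission error" theory in the report can be sanity-checked from the shell. A minimal sketch, assuming the device path /dev/nst1 and the bacula-sd process name from the report (the `check_dev` helper name is illustrative); run it as the same user the SD runs as:

```shell
#!/bin/sh
# Sketch only: check write access to the tape device and show which user
# owns the running bacula-sd process. Device path is from the report above.

# Print whether the given path is writable by the current user.
check_dev() {
    if [ -w "$1" ]; then
        echo "writable: $1"
    else
        echo "NOT writable: $1"
        ls -l "$1" 2>/dev/null
    fi
}

check_dev /dev/nst1
# Show the owner of any running bacula-sd process (the report says root).
ps -eo user,comm 2>/dev/null | grep '[b]acula-sd' || echo "bacula-sd not running"
```

If this prints "writable" under the SD's user yet the SD still reports a permission error, the error string is likely masking a different I/O failure.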
From: <bac...@li...> - 2004-08-19 07:04:27
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:04 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:04 GMT
----------------------------------------------------------------------
Reporting back. It's now 3am EDT (about 2 hours since I cancelled the jobs).
The two jobs are still in the director's queue but marked as "has been
canceled". The SD is still running.

*status dir
...
Running Jobs:
 JobId Level   Name                        Status
======================================================================
   831 Increme orca.2004-08-18_23.00.03    has been canceled
   828 Increme hawk.2004-08-18_23.00.00    has been canceled
====
...

The SD is (probably) not responding to the director's requests; the current
status still shows 4 jobs in the SD:

*st storage=Autochanger
Connecting to Storage daemon Autochanger at hawk.redklay2:9103
hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 96 Jobs run since started.

Running Jobs:
Full Backup job orca_db_redo JobId=832 Volume="Daily02" device="/dev/nst1"
    Files=8 Bytes=658,453 Bytes/sec=45
    FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15
Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1"
    Files=233 Bytes=5,770,856 Bytes/sec=395
    FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13
Incremental Backup job panda JobId=830 Volume="Daily01" device="/dev/nst1"
    Files=650 Bytes=46,141,511 Bytes/sec=3,164
    FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11
Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1"
    Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546
    FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9
====
...

I will be terminating/restarting the SD and the director, and restarting
these backup jobs.
======================================================================
Bug History
08-19-04 07:04   KlayVessel     Bugnote Added: 0000169
======================================================================
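Jobs wedged in the "has been canceled" state can be picked out of the `status dir` output with a small filter. A minimal sketch (the sample lines are copied from the status output quoted above; the `count_stuck` function name is illustrative):

```shell
#!/bin/sh
# Sketch: count director jobs stuck in the "has been canceled" state.
# Feed it the Running Jobs section of bconsole's "status dir" output.
count_stuck() {
    grep -c 'has been canceled'
}

# Sample lines copied from the status output in this report; prints 2.
count_stuck <<'EOF'
  831 Increme orca.2004-08-18_23.00.03    has been canceled
  828 Increme hawk.2004-08-18_23.00.00    has been canceled
EOF
```

Piping a live `echo "status dir" | bconsole` through such a filter from cron would have flagged this hang well before the 2-hour mark.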
From: <bac...@li...> - 2004-08-19 07:07:47
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:07 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:07 GMT
----------------------------------------------------------------------
I should have included the client (FD) status as well; for these two jobs the
clients still show they are running (probably waiting on the SD):

*st client=orca-fd
Connecting to Client orca-fd at orca.redklay2:9102
orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3
Daemon started 15-Aug-04 09:10, 8 Jobs run since started.

Terminated Jobs:
 JobId  Level    Files          Bytes  Status  Finished         Name
======================================================================
   758  Sinc       691     82,734,632  OK      12-Aug-04 23:15  orca
   767  Full   514,961  7,839,492,211  OK      14-Aug-04 11:59  orca
   792  Sinc        80      3,104,564  OK      15-Aug-04 09:15  orca
   797  Full         8          5,692  OK      15-Aug-04 23:00  orca_db_redo
   796  Sinc        26      3,355,004  OK      15-Aug-04 23:04  orca
   808  Full         8      1,090,026  OK      16-Aug-04 23:00  orca_db_redo
   807  Sinc     1,861      6,756,867  OK      16-Aug-04 23:05  orca
   819  Full         8        878,011  OK      17-Aug-04 23:00  orca_db_redo
   818  Sinc       222      6,957,873  OK      17-Aug-04 23:06  orca
   832  Full         8        657,684  Cancel  19-Aug-04 01:07  orca_db_redo
====

Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 831 Job orca.2004-08-18_23.00.03 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=235 Bytes=5,839,057 Bytes/sec=395
    Files Examined=537,647
    Processing file: /d01/app/oracle/product/8.1.5/network/log/listener.log
    SDReadSeqNo=5 fd=6
====

*st client=hawk-fd
Connecting to Client hawk-fd at hawk.redklay2:9102
hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 18 Jobs run since started.

Terminated Jobs:
 JobId  Level    Files          Bytes  Status  Finished         Name
======================================================================
   779  Full         0              0  OK      14-Aug-04 12:01  BackupCatalog
   780  Full         1    686,017,053  OK      14-Aug-04 12:14  BackupCatalog
   781  Sinc       156     43,287,087  OK      15-Aug-04 09:05  hawk
   782  Full         6          5,004  OK      15-Aug-04 09:05  hawk_db_redo
   793  Sinc       132     42,706,198  OK      15-Aug-04 23:00  hawk
   794  Full         6          4,288  OK      15-Aug-04 23:00  hawk_db_redo
   804  Sinc       123     43,114,702  OK      16-Aug-04 23:00  hawk
   805  Full         6          4,414  OK      16-Aug-04 23:00  hawk_db_redo
   815  Sinc       133     43,314,887  OK      17-Aug-04 23:00  hawk
   816  Full         6          4,422  OK      17-Aug-04 23:00  hawk_db_redo
====

Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 828 Job hawk.2004-08-18_23.00.00 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=17,236 Bytes=4,233,229,628 Bytes/sec=286,532
    Files Examined=71,869
    Processing file: /web/public/Java/j2sdk-1_3_1_02-win.exe
    SDReadSeqNo=5 fd=7
====

All other clients report that there are no jobs running.
======================================================================
Bug History
08-19-04 07:07   KlayVessel     Bugnote Added: 0000170
======================================================================
From: <bac...@li...> - 2004-08-19 07:25:10
A BUGNOTE has been added to this bug.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: new
======================================================================
Date Submitted: 08-19-2004 05:19 GMT
Last Modified: 08-19-2004 07:24 GMT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
KlayVessel - 08-19-2004 07:24 GMT
----------------------------------------------------------------------
Final update tonight. When I killed the SD, the clients were immediately
freed, and the hung cancelled jobs were released from the director's queue. I
did not have to kill any other processes.

However, when I restarted the SD, it could not read the "Daily01" volume
label; it had been erased (though it was properly labeled and used
previously; the earlier messages also show the SD device status with
"Daily01"). I used btape to relabel it, and then it remounted fine. Rerunning
the jobs worked fine, as this started at BOT.

By the way... thanks for such a great product! Even with the small problems,
I've been very happy with Bacula.
======================================================================
Bug History
08-19-04 07:24   KlayVessel     Bugnote Added: 0000171
======================================================================
From: <bac...@li...> - 2004-08-19 13:25:12
The following bug has been CLOSED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By: KlayVessel
Assigned To:
======================================================================
Project: bacula
Bug ID: 87
Category: Storage Daemon
Reproducibility: have not tried
Severity: minor
Priority: normal
Status: closed
======================================================================
Date Submitted: 08-19-2004 05:19 PDT
Last Modified: 08-19-2004 06:25 PDT
======================================================================
Summary: Storage daemon goes into infinite loop during media change
======================================================================
----------------------------------------------------------------------
kern - 08-19-2004 06:25 PDT
----------------------------------------------------------------------
I'm going to close this bug, because there is not enough information to fix
it. I would appreciate it if you would do two things:

1. Make absolutely sure that:
   a. your spool directory never fills;
   b. you never attempt to back up your spool directory.

2. If this happens again, please get a manual dump of the SD as described in
   the Kaboom chapter of the manual, and either re-open this bug or create a
   new one.
======================================================================
Bug History
08-19-04 06:25   kern           Bugnote Added: 0000175
08-19-04 06:25   kern           Resolution: open => unable to duplicate
08-19-04 06:25   kern           Status: new => closed
======================================================================
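Kern's first point (never let the spool directory fill) can be automated with a cron-style free-space check. A minimal sketch; the spool path and the 1 GiB threshold are illustrative assumptions, not taken from the report:

```shell
#!/bin/sh
# Sketch: warn before the spool directory fills, per the advice above.
# Path and threshold below are hypothetical examples.
check_spool() {
    dir=$1
    min_kb=$2
    [ -d "$dir" ] || { echo "no such dir: $dir"; return 2; }
    # Free kilobytes on the filesystem holding the spool directory.
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "WARNING: only ${free_kb} KB free under $dir"
        return 1
    fi
    echo "OK: ${free_kb} KB free under $dir"
}

check_spool /var/spool/bacula 1048576   # hypothetical spool path, 1 GiB floor
```

Run from cron, the non-zero return status can be used to mail a warning before the SD ever hits a full spool.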
From: <bac...@li...> - 2004-09-01 17:40:49
|
The following bug has been REOPENED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             feedback
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 10:43 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 08-19-2004 06:25 PDT
----------------------------------------------------------------------
I'm going to close this bug, because there is not enough information to
fix it. I would appreciate it if you would do two things:

1. Make absolutely sure that:
   a. Your spool directory never fills.
   b. You never attempt to back up your spool directory.
2. If this happens again, please get a manual dump of the SD as described
   in the Kaboom chapter of the manual, and either re-open this bug or
   create a new one.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:04 PDT
----------------------------------------------------------------------
Reporting back. It's now 3am EDT (about 2 hours since I cancelled the
jobs). The two jobs are still in the director's queue but marked as "has
been canceled". The SD is still running.

*status dir
...
Running Jobs:
 JobId  Level    Name                       Status
======================================================================
   831  Increme  orca.2004-08-18_23.00.03   has been canceled
   828  Increme  hawk.2004-08-18_23.00.00   has been canceled
====
...

The SD is (probably) not responding to the director's requests; the
current status still shows 4 jobs in the SD:

*st storage=Autochanger
Connecting to Storage daemon Autochanger at hawk.redklay2:9103

hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 96 Jobs run since started.

Running Jobs:
Full Backup job orca_db_redo JobId=832 Volume="Daily02" device="/dev/nst1"
    Files=8 Bytes=658,453 Bytes/sec=45
    FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15
Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1"
    Files=233 Bytes=5,770,856 Bytes/sec=395
    FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13
Incremental Backup job panda JobId=830 Volume="Daily01" device="/dev/nst1"
    Files=650 Bytes=46,141,511 Bytes/sec=3,164
    FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11
Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1"
    Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546
    FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9
====
...

I will be terminating/restarting the SD and the director and restarting
these backup jobs.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:07 PDT
----------------------------------------------------------------------
I should have included the client (FD) status as well; for these two jobs
the clients still show they are running (probably waiting on the SD):

*st client=orca-fd
Connecting to Client orca-fd at orca.redklay2:9102

orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3
Daemon started 15-Aug-04 09:10, 8 Jobs run since started.
...

*st client=hawk-fd
Connecting to Client hawk-fd at hawk.redklay2:9102

hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 18 Jobs run since started.
...

All other clients report that there are no jobs running.

----------------------------------------------------------------------
 KlayVessel - 09-01-2004 10:43 PDT
----------------------------------------------------------------------
I am now able to reproduce this problem. It occurs after doing a large
restore -- in the case that happened last night, I had restored >450,000
files. As you will note, the original bug report also mentions that
particular condition.

Unfortunately, I had not noticed your additional note about the
btraceback, so I restarted bacula-sd before I could capture one. I will do
so the next time and, when I get a chance, recreate this situation so I
can get one for you.

Some additional info about this case: bacula-sd reported a full tape
almost immediately, even though only 10GB had been used on the tape (these
tapes support 80GB native); it switched to the second tape in the pool,
then reported a problem, and then asked for another tape (there are only
two in this pool). When I restarted bacula-sd, it could not mount the tape
because the label had been damaged; I had to use btape to rebuild the
label, after which the new instance could mount the tape and complete the
remaining, waiting jobs.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
======================================================================
|
From: <bac...@li...> - 2004-09-01 18:55:13
|
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             feedback
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 11:58 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 09-01-2004 11:58 PDT
----------------------------------------------------------------------
I think there are a number of problems occurring simultaneously:

1. You are getting a hard I/O error on a tape, then switching to another
   tape; immediately the first tape is recycled, and later it is
   rewritten. This obviously is not going to work.
2. The hard I/O error should have caused the job to terminate in error,
   but did not (this has been reported and corrected in another bug
   report). This is also quite possibly the reason your tape header was
   damaged.
3. You are dealing with such a large amount of data that I cannot
   duplicate this.

I suggest:
- Ensure you have enough tapes so you don't recycle a tape you have just
  written.
- Apply patch 1.34.6-block.patch (in another bug report; sorry, I forgot
  the number).
- Instead of the above item, you could upgrade to 1.35.2.
- If you can tie this down better or find a way for me to reproduce it,
  I'll fix it.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
09-01-04 11:58  kern         Bugnote Added: 0000199
======================================================================
|
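[Editorial aside: kern's first suggestion — never recycle a tape you have just written — maps onto the Director's Pool resource. A hedged sketch of a bacula-dir.conf fragment; the directive names (Recycle, AutoPrune, Volume Retention) are standard Bacula directives, but the values are illustrative assumptions, not taken from the reporter's uploaded configs:]

```conf
# Illustrative Pool resource for bacula-dir.conf -- values are assumed,
# not from the configs attached to this bug.
Pool {
  Name = Daily
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 days   # long enough that a tape written in the
                              # current cycle cannot be pruned and
                              # immediately recycled
}
```

With more than two volumes in the pool and a retention period longer than the write cycle, Bacula is far less likely to recycle the tape it has just filled.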
From: <bac...@li...> - 2004-09-01 18:56:30
|
The following bug has been CLOSED.
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:        KlayVessel
Assigned To:
======================================================================
Project:            bacula
Bug ID:             87
Category:           Storage Daemon
Reproducibility:    have not tried
Severity:           minor
Priority:           normal
Status:             closed
======================================================================
Date Submitted:     08-19-2004 05:19 PDT
Last Modified:      09-01-2004 11:59 PDT
======================================================================
Summary:            Storage daemon goes into infinite loop during media change
======================================================================

----------------------------------------------------------------------
 kern - 09-01-2004 11:59 PDT
----------------------------------------------------------------------
For your convenience, I have uploaded the patch noted in the previous
bugnote.

Bug History
Date Modified   Username     Field        Change
======================================================================
08-19-04 05:19  KlayVessel   New Bug
08-19-04 05:19  KlayVessel   File Added: bacula-conf-files-20040819.tar.gz
08-19-04 06:25  kern         Bugnote Added: 0000175
08-19-04 06:25  kern         Resolution   open => unable to duplicate
08-19-04 06:25  kern         Status       new => closed
08-19-04 07:04  KlayVessel   Bugnote Added: 0000169
08-19-04 07:07  KlayVessel   Bugnote Added: 0000170
08-19-04 07:24  KlayVessel   Bugnote Added: 0000171
09-01-04 10:43  KlayVessel   Status       closed => feedback
09-01-04 10:43  KlayVessel   Bugnote Added: 0000198
09-01-04 10:43  KlayVessel   Resolution   unable to duplicate => reopened
09-01-04 11:58  kern         Bugnote Added: 0000199
09-01-04 11:58  kern         File Added: 1.34.6-block.patch
09-01-04 11:59  kern         Bugnote Added: 0000200
09-01-04 11:59  kern         Resolution   reopened => unable to duplicate
09-01-04 11:59  kern         Status       feedback => closed
======================================================================
|