[Bacula-bugs] [bacula 0000087]: Storage daemon goes into infinite loop during media change

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

The following bug has been CLOSED
======================================================================
http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000087
======================================================================
Reported By:                KlayVessel
Assigned To:                
======================================================================
Project:                    bacula
Bug ID:                     87
Category:                   Storage Daemon
Reproducibility:            have not tried
Severity:                   minor
Priority:                   normal
Status:                     closed
======================================================================
Date Submitted:             08-19-2004 05:19 PDT
Last Modified:              08-19-2004 06:25 PDT
======================================================================
Summary:                    Storage daemon goes into infinite loop during media change
Description: 
The storage daemon went into an infinite loop after the "Daily02" tape (see
additional info) filled and it was trying to switch to "Daily01" -- our
Daily pool only has 2 tapes.

Notice that the output reports a "permission error"; however as /dev/nst1
is set to rw for root and the bacula-sd process runs as root I don't see
why that should have happened.  There are no messages in /var/log/messages
related to this problem.

This may also be related to an earlier restore.  Earlier today I did a
test restore for one of our XP desktops (just to validate the backup
process).  I noticed once before in an earlier version of bacula (1.34.2 I
believe) when I tested a very large full restore (>500K files) that the
storage daemon also went into an infinite loop the next time it needed to
write.  But that is a different system and writing only to file volumes;
not tape.

I also see that only one job switch to the "Daily01" tape (in status
storage output).  Perhaps there is something related to doing this while
also using simultaneous jobs?

I have included in "Additonal Information" the console output at the time
and some status information and a snapshot from top -- this was after
about 1.75 hours; the backups start at 11pm.

The file I've uploaded is a .tar.gz containing the three main
bacula-**.conf files on the backup server.  I've redacted the passwords.

Tape drive: Exabyte VXA-2 with autoloader (10 bay; single drive); it
appears that the tape change worked successful.
Backups and restore have been working (btape works fine)

Unrelated device /dev/nst0 is a HP DDS-3 unit but managed by the same SD.

I cancelled the running jobs and, as of right now (1:13am) all but two
completed the cancellation.  I have left everything running in the hopes
that the remaining two will clear/cancel and then I can restart the SD. 
If these don't clear in a few hours, I will add to this report.

======================================================================

----------------------------------------------------------------------
 kern - 08-19-2004 06:25 PDT 
----------------------------------------------------------------------
I'm going to close this bug, because there is not enough information to fix
it. I would appreciate it if you would do two things:

1. Make absolutely sure that:
   a. Your spool directory never fills.
   b. Make sure you never attempt to backup your spool directory.

2. If this happens again, please get a manual dump of the SD as described

   in the Kaboom chapter of the manual, and either re-open this bug or
create
   a new one.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:04 PDT 
----------------------------------------------------------------------
Reporting back.  It's now 3am EDT (about 2 hours since I cancelled the
jobs).  The two jobs are still in the director's queue but marked as "has
been canceled".  The  SD is still running.

*status dir
...
Running Jobs:
 JobId Level   Name                       Status
======================================================================
   831 Increme  orca.2004-08-18_23.00.03 has been canceled
   828 Increme  hawk.2004-08-18_23.00.00 has been canceled
====
...

The SD is (probably) not responding to director's requests, current status
still shows 4 jobs in the SD:

*st storage=Autochanger
Connecting to Storage daemon Autochanger at hawk.redklay2:9103

hawk-sd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 96 Jobs run since started.

Running Jobs:
Full Backup job orca_db_redo JobId=832 Volume="Daily02"
device="/dev/nst1"
    Files=8 Bytes=658,453 Bytes/sec=45
    FDReadSeqNo=131 in_msg=114 out_msg=6 fd=15
Incremental Backup job orca JobId=831 Volume="Daily02" device="/dev/nst1"
    Files=233 Bytes=5,770,856 Bytes/sec=395
    FDReadSeqNo=3,762 in_msg=3133 out_msg=6 fd=13
Incremental Backup job panda JobId=830 Volume="Daily01"
device="/dev/nst1"
    Files=650 Bytes=46,141,511 Bytes/sec=3,164
    FDReadSeqNo=4,469 in_msg=3436 out_msg=6 fd=11
Incremental Backup job hawk JobId=828 Volume="Daily02" device="/dev/nst1"
    Files=17,236 Bytes=4,236,172,557 Bytes/sec=290,546
    FDReadSeqNo=268,189 in_msg=220601 out_msg=6 fd=9
====
...

I will be terminating/restarting the SD and director and restarting these
backup jobs.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:07 PDT 
----------------------------------------------------------------------
I should have included the client (FD) status as well, for these two jobs
the clients still show they are running (probably waiting on SD):

*st client=orca-fd
Connecting to Client orca-fd at orca.redklay2:9102

orca-fd Version: 1.34.6 (28 July 2004) powerpc-ibm-aix4.3.3.0 aix 3
Daemon started 15-Aug-04 09:10, 8 Jobs run since started.

Terminated Jobs:
 JobId  Level   Files          Bytes Status   Finished        Name
======================================================================
   758  Sinc        691     82,734,632 OK       12-Aug-04 23:15 orca
   767  Full    514,961  7,839,492,211 OK       14-Aug-04 11:59 orca
   792  Sinc         80      3,104,564 OK       15-Aug-04 09:15 orca
   797  Full          8          5,692 OK       15-Aug-04 23:00
orca_db_redo
   796  Sinc         26      3,355,004 OK       15-Aug-04 23:04 orca
   808  Full          8      1,090,026 OK       16-Aug-04 23:00
orca_db_redo
   807  Sinc      1,861      6,756,867 OK       16-Aug-04 23:05 orca
   819  Full          8        878,011 OK       17-Aug-04 23:00
orca_db_redo
   818  Sinc        222      6,957,873 OK       17-Aug-04 23:06 orca
   832  Full          8        657,684 Cancel   19-Aug-04 01:07
orca_db_redo
====
Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 831 Job orca.2004-08-18_23.00.03 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=235 Bytes=5,839,057 Bytes/sec=395
    Files Examined=537,647
    Processing file:
/d01/app/oracle/product/8.1.5/network/log/listener.log
    SDReadSeqNo=5 fd=6
====
*st client=hawk-fd
Connecting to Client hawk-fd at hawk.redklay2:9102

hawk-fd Version: 1.34.6 (28 July 2004) i386-redhat-linux-gnu redhat 7.3
Daemon started 10-Aug-04 11:51, 18 Jobs run since started.

Terminated Jobs:
 JobId  Level   Files          Bytes Status   Finished        Name
======================================================================
   779  Full          0              0 OK       14-Aug-04 12:01
BackupCatalog
   780  Full          1    686,017,053 OK       14-Aug-04 12:14
BackupCatalog
   781  Sinc        156     43,287,087 OK       15-Aug-04 09:05 hawk
   782  Full          6          5,004 OK       15-Aug-04 09:05
hawk_db_redo
   793  Sinc        132     42,706,198 OK       15-Aug-04 23:00 hawk
   794  Full          6          4,288 OK       15-Aug-04 23:00
hawk_db_redo
   804  Sinc        123     43,114,702 OK       16-Aug-04 23:00 hawk
   805  Full          6          4,414 OK       16-Aug-04 23:00
hawk_db_redo
   815  Sinc        133     43,314,887 OK       17-Aug-04 23:00 hawk
   816  Full          6          4,422 OK       17-Aug-04 23:00
hawk_db_redo
====
Running Jobs:
Director connected at: 19-Aug-04 03:06
JobId 828 Job hawk.2004-08-18_23.00.00 is running.
    Backup Job started: 18-Aug-04 23:00
    Files=17,236 Bytes=4,233,229,628 Bytes/sec=286,532
    Files Examined=71,869
    Processing file: /web/public/Java/j2sdk-1_3_1_02-win.exe
    SDReadSeqNo=5 fd=7
====
*

All other clients report that there are no jobs running.

----------------------------------------------------------------------
 KlayVessel - 08-19-2004 07:24 PDT 
----------------------------------------------------------------------
Final update tonight.  When I killed the SD, the clients were immediately
freed and the hung cancelled jobs released from the director's queue.  I
did not have to kill any other processes.  However, when I restarted the
SD, it could not read the "Daily01" volume label; it had been erased
(though it was properly labled and used previously; also the earlier
messages show that the SD device status with "Daily01").

I used btape to relabel it and then it remounted fine.  Rerunning the jobs
worked fine as this started at BOT.

By the way....thanks for such a great product!!!  Even with the small
problems, I've been very happy with Bacula.

Bug History
Date Modified  Username       Field                    Change              
======================================================================
08-19-04 05:19 KlayVessel     New Bug                                      
08-19-04 05:19 KlayVessel     File Added: bacula-conf-files-20040819.tar.gz                    
08-19-04 06:25 kern           Bugnote Added: 0000175                       
08-19-04 06:25 kern           Resolution               open => unable to duplicate
08-19-04 06:25 kern           Status                   new => closed       
08-19-04 07:04 KlayVessel     Bugnote Added: 0000169                       
08-19-04 07:07 KlayVessel     Bugnote Added: 0000170                       
08-19-04 07:24 KlayVessel     Bugnote Added: 0000171                       
======================================================================