From: Chris R. <ro...@ph...> - 2006-12-28 19:38:48
|
Dear all,

I have bacula (bacula-mysql-1.38.2-1.rpm) installed on a Redhat FC4 system (Linux lilac.home 2.6.17-1.2142_FC4 #1 Tue Jul 11 22:41:14 EDT 2006 i686 athlon i386 GNU/Linux). Bacula is set to back up various systems, including another PC on my home network. Backups are saved to hard disk storage in /var/spool/bacula and /var/spool/bacula2 (a second HDD).

When the other PC on the network is powered down, bacula cannot run its backups (obviously). Unfortunately, it then fails to back up the catalog and jams. The bacula console displays a message such as:

"27-Dec 21:05 lilac-sd: Job BackupCatalog.2006-12-27_21.05.00 waiting to reserve a device."

How can I fix this problem? Everything works quite happily when the other PC is powered on.

I have provided a copy of all my configuration files (with passwords removed) at http://laplace.chem.ox.ac.uk/b/ in case they're useful.

Here is some output on the bacula console from a backup session where the other PC (ngorongoro) was powered down.

1) The main jobs have now run. Here is some output from "status dir":

*status dir
lilac-dir Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 28-Dec-06 20:59, 6 Jobs run since started.

Scheduled Jobs:
Level        Type    Pri  Scheduled        Name                    Volume
===================================================================================
Full         Backup  12   28-Dec-06 21:05  BackupCatalog           Full-0069
Incremental  Backup  9    29-Dec-06 21:00  Ngorongoro-SystemState  Full-0069
Incremental  Backup  10   29-Dec-06 21:00  Lilac                   Full-0069
Incremental  Backup  10   29-Dec-06 21:00  Ngorongoro              Full-0069
Incremental  Backup  11   29-Dec-06 21:00  PlusNetWebspace         *unknown*
Incremental  Backup  11   29-Dec-06 21:00  PlusNetEmail            *unknown*
Incremental  Backup  11   29-Dec-06 21:00  RodgersOrgUkWebspace    *unknown*
====
Running Jobs:
No Jobs running.
====
Terminated Jobs:
JobId  Level  Files  Bytes       Status  Finished         Name
========================================================================
559    Incr   0      0           Error   27-Dec-06 21:01  Ngorongoro
560    Incr   0      0           Error   27-Dec-06 21:01  PlusNetWebspace
561    Incr   0      0           Error   27-Dec-06 21:01  PlusNetEmail
562    Incr   0      0           OK      27-Dec-06 21:01  RodgersOrgUkWebspace
564    Incr   0      0           Error   28-Dec-06 21:00  Ngorongoro-SystemState
565    Incr   108    52,577,981  OK      28-Dec-06 21:00  Lilac
566    Incr   0      0           Error   28-Dec-06 21:01  Ngorongoro
567    Incr   25     17,931      OK      28-Dec-06 21:01  PlusNetWebspace
568    Incr   10     11,486,488  OK      28-Dec-06 21:01  PlusNetEmail
569    Incr   0      0           OK      28-Dec-06 21:01  RodgersOrgUkWebspace
====

And "status sto":

*status sto
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 1
Connecting to Storage daemon FileDiskA at lilac:9103

lilac-sd Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 28-Dec-06 20:59, 4 Jobs run since started.

Running Jobs:
Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client connection.
    Incremental Backup job Ngorongoro-SystemState JobId=564 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
Backup Job Ngorongoro.2006-12-28_21.00.02 waiting for Client connection.
    Incremental Backup job Ngorongoro JobId=566 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
====
Terminated Jobs:
JobId  Level  Files  Bytes       Status  Finished         Name
======================================================================
558    Incr   45     22,287,855  OK      27-Dec-06 21:00  Lilac
560    Incr   0      0           OK      27-Dec-06 21:01  PlusNetWebspace
561    Incr   0      0           OK      27-Dec-06 21:01  PlusNetEmail
562    Incr   0      0           OK      27-Dec-06 21:01  RodgersOrgUkWebspace
557    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro-SystemState
559    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro
565    Incr   108    52,589,545  OK      28-Dec-06 21:00  Lilac
567    Incr   25     21,345      OK      28-Dec-06 21:01  PlusNetWebspace
568    Incr   10     11,487,596  OK      28-Dec-06 21:01  PlusNetEmail
569    Incr   0      0           OK      28-Dec-06 21:01  RodgersOrgUkWebspace
====
Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open or does not exist.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open or does not exist.
====
In Use Volume status:
====

*status sto
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
Connecting to Storage daemon FileDiskB at lilac:9103

lilac-sd Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 28-Dec-06 20:59, 4 Jobs run since started.

Running Jobs:
Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client connection.
    Incremental Backup job Ngorongoro-SystemState JobId=564 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
Backup Job Ngorongoro.2006-12-28_21.00.02 waiting for Client connection.
    Incremental Backup job Ngorongoro JobId=566 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
====
Terminated Jobs:
JobId  Level  Files  Bytes       Status  Finished         Name
======================================================================
558    Incr   45     22,287,855  OK      27-Dec-06 21:00  Lilac
560    Incr   0      0           OK      27-Dec-06 21:01  PlusNetWebspace
561    Incr   0      0           OK      27-Dec-06 21:01  PlusNetEmail
562    Incr   0      0           OK      27-Dec-06 21:01  RodgersOrgUkWebspace
557    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro-SystemState
559    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro
565    Incr   108    52,589,545  OK      28-Dec-06 21:00  Lilac
567    Incr   25     21,345      OK      28-Dec-06 21:01  PlusNetWebspace
568    Incr   10     11,487,596  OK      28-Dec-06 21:01  PlusNetEmail
569    Incr   0      0           OK      28-Dec-06 21:01  RodgersOrgUkWebspace
====
Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open or does not exist.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open or does not exist.
====
In Use Volume status:
====

2) Once the catalog backup starts, the middle of "status dir" now shows this text:

Running Jobs:
JobId  Level  Name                                Status
======================================================================
570    Full   BackupCatalog.2006-12-28_21.05.00   is waiting on Storage FileDiskB
====

3) "status storage" shows this for the jammed up catalog backup:

*status storage
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
Connecting to Storage daemon FileDiskB at lilac:9103

lilac-sd Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 28-Dec-06 20:59, 4 Jobs run since started.

Running Jobs:
Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client connection.
    Incremental Backup job Ngorongoro-SystemState JobId=564 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
Backup Job Ngorongoro.2006-12-28_21.00.02 waiting for Client connection.
    Incremental Backup job Ngorongoro JobId=566 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
====
Terminated Jobs:
JobId  Level  Files  Bytes       Status  Finished         Name
======================================================================
558    Incr   45     22,287,855  OK      27-Dec-06 21:00  Lilac
560    Incr   0      0           OK      27-Dec-06 21:01  PlusNetWebspace
561    Incr   0      0           OK      27-Dec-06 21:01  PlusNetEmail
562    Incr   0      0           OK      27-Dec-06 21:01  RodgersOrgUkWebspace
557    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro-SystemState
559    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro
565    Incr   108    52,589,545  OK      28-Dec-06 21:00  Lilac
567    Incr   25     21,345      OK      28-Dec-06 21:01  PlusNetWebspace
568    Incr   10     11,487,596  OK      28-Dec-06 21:01  PlusNetEmail
569    Incr   0      0           OK      28-Dec-06 21:01  RodgersOrgUkWebspace
====
Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open or does not exist.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open or does not exist.
====
In Use Volume status:
====

One thing I really don't understand is why the storage daemon reports status "Other" on the two Ngorongoro jobs, whilst the director reports status "Error" (which is correct since ngorongoro is powered down). Does anyone have any ideas about that?

Many thanks in advance for any tips, etc.

Chris.
|
From: Chris R. <ro...@ph...> - 2007-01-03 17:38:52
|
Does anyone have any ideas what is causing bacula to jam up like this? I'm sorry if I have provided too much information, but I don't have any real idea which parts of the configuration/etc. may be to blame. Thanks, Chris. |
From: Erich P. <ep...@sp...> - 2007-01-03 19:13:25
|
You might configure concurrent jobs as a solution. That way, if one client is off-line, it doesn't hold up the whole show for the other jobs. Otherwise, the jobs will queue and wait patiently for the device to become available.

Erich

On Jan 3, 2007, at 11:38 AM, Chris Rodgers wrote:
> Does anyone have any ideas what is causing bacula to jam up like this?
>
> I'm sorry if I have provided too much information, but I don't have any
> real idea which parts of the configuration/etc. may be to blame.
>
> Thanks,
>
> Chris.
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Bacula-users mailing list
> Bac...@li...
> https://lists.sourceforge.net/lists/listinfo/bacula-users
|
From: Alan B. <aj...@ms...> - 2007-01-04 10:40:02
|
On Wed, 3 Jan 2007, Erich Prinz wrote:

> You might configure concurrent jobs as a solution.

It won't help in this situation and the message tends to indicate concurrent jobs are set up.

"waiting to reserve a device" means that all available tape drives are in use by jobs using other Pools.

In the case of BackupCatalog it is POTENTIALLY DANGEROUS for it to be running when any other job is also running. This job should be run at either highest or lowest priority. (It's lower than normal jobs by default. I've moved it to "highest" because I want the catalog backed up every day, even if that means holding up other stuff - when at lower priority there have been incidents where the catalog wasn't backed up for 2+ weeks while other jobs were running.)

> That way, if one client is off-line, it doesn't hold up the whole
> show for the other jobs. Otherwise, the jobs will queue and wait
> patiently for the device to become available.

Because BackupCatalog should be running at a higher or lower priority than all other jobs, all jobs should queue before and after it anyway.

Coming back to "waiting to reserve a device":

What other jobs are running?
What does "status director" show?
What does "status storage" show?
|
From: Chris R. <ro...@ph...> - 2007-01-04 11:53:38
|
Alan Brown wrote:
>> You might configure concurrent jobs as a solution.
>
> It won't help in this situation and the message tends to indicate
> concurrent jobs are set up.

I don't think they are. My config file (http://laplace.chem.ox.ac.uk/b/bacula-dir.conf) contains this block:

Director {                            # define myself
  Name = lilac-dir
  DIRport = 9101                      # where we listen for UA connections
  QueryFile = "/etc/bacula/query.sql"
  WorkingDirectory = "/var/bacula"
  PidDirectory = "/var/run"
  Maximum Concurrent Jobs = 1
  Messages = Daemon
  FDConnectTimeout = 10
}

which should stop concurrent jobs, I think.

> "waiting to reserve a device" means that all available tape drives are
> in use by jobs using other Pools.

What does it mean when the "tape device" is actually a hard disk directory with different files in there, one for each "tape"?

> In the case of BackupCatalog it is POTENTIALLY DANGEROUS for it to be
> running when any other job is also running. This job should be run at
> either highest or lowest priority (It's lower than normal jobs by
> default. I've moved it to "highest" because I want the catalog backed up
> every day, even if that means holding up other stuff - when at lower
> priority there have been incidents where the catalog wasn't backed up
> for 2+ weeks while other jobs were running.)

# Backup the catalog database (after the nightly save)
Job {
  Name = "BackupCatalog"
  JobDefs = "DefaultJobDiskB"
  Client = lilac-fd
  Level = Full
  FileSet="Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/etc/bacula/make_catalog_backup bacula bacula"
  # This deletes the copy of the catalog
  RunAfterJob = "/etc/bacula/delete_catalog_backup"
  Write Bootstrap = "/var/bacula/BackupCatalog.bsr"
  Priority = 12                       # run after main backup
}

I have the catalog set to a lower priority and starting 5 min after all the other backup jobs.

> Coming back to "waiting to reserve a device"
>
> What other jobs are running?
Every night, at 9pm, I back up two PCs and also various IMAP accounts and FTP sites. Then at 9.05pm I back up the catalog. In practice, the backups usually take 10-15min to complete, except when there is a Full backup scheduled, which often takes up to 1 hour.

Here is a typical list of scheduled jobs:

Scheduled Jobs:
Level        Type    Pri  Scheduled        Name
===================================================================================
Incremental  Backup  9    29-Dec-06 21:00  Ngorongoro-SystemState
Incremental  Backup  10   29-Dec-06 21:00  Lilac
Incremental  Backup  10   29-Dec-06 21:00  Ngorongoro
Incremental  Backup  11   29-Dec-06 21:00  PlusNetWebspace
Incremental  Backup  11   29-Dec-06 21:00  PlusNetEmail
Incremental  Backup  11   29-Dec-06 21:00  RodgersOrgUkWebspace
Full         Backup  12   29-Dec-06 21:05  BackupCatalog

After the main (9pm) jobs have tried to run, which means that the Ngorongoro ones have failed, the status reports look like this:

> What does "status director" show?

Running Jobs:
JobId  Level  Name                                Status
======================================================================
570    Full   BackupCatalog.2006-12-28_21.05.00   is waiting on Storage FileDiskB
====

> What does "status storage" show?

3) "status storage" shows this for the jammed up catalog backup:

*status storage
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
Connecting to Storage daemon FileDiskB at lilac:9103

lilac-sd Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 28-Dec-06 20:59, 4 Jobs run since started.

Running Jobs:
Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client connection.
    Incremental Backup job Ngorongoro-SystemState JobId=564 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
Backup Job Ngorongoro.2006-12-28_21.00.02 waiting for Client connection.
    Incremental Backup job Ngorongoro JobId=566 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
====
Terminated Jobs:
JobId  Level  Files  Bytes       Status  Finished         Name
======================================================================
558    Incr   45     22,287,855  OK      27-Dec-06 21:00  Lilac
560    Incr   0      0           OK      27-Dec-06 21:01  PlusNetWebspace
561    Incr   0      0           OK      27-Dec-06 21:01  PlusNetEmail
562    Incr   0      0           OK      27-Dec-06 21:01  RodgersOrgUkWebspace
557    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro-SystemState
559    Incr   0      0           Other   27-Dec-06 21:30  Ngorongoro
565    Incr   108    52,589,545  OK      28-Dec-06 21:00  Lilac
567    Incr   25     21,345      OK      28-Dec-06 21:01  PlusNetWebspace
568    Incr   10     11,487,596  OK      28-Dec-06 21:01  PlusNetEmail
569    Incr   0      0           OK      28-Dec-06 21:01  RodgersOrgUkWebspace
====
Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open or does not exist.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open or does not exist.
====
In Use Volume status:
====

Does anyone know what the lines:

Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open or does not exist.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open or does not exist.

mean??? /var/spool/bacula definitely does exist!

Chris.
|
From: Chris R. <ro...@ph...> - 2007-01-04 15:10:02
|
Alan Brown wrote:
>>> "waiting to reserve a device" means that all available tape drives are
>>> in use by jobs using other Pools.
>>
>> What does it mean when the "tape device" is actually a hard disk
>> directory with different files in there, one for each "tape"?
>
> Effectively the same thing.

Do you mean that Bacula has opened a file with a particular name in that directory ("mounted a tape") and is stuck writing to that one file???

>> Running Jobs:
>> JobId Level Name Status
>> ======================================================================
>> 570 Full BackupCatalog.2006-12-28_21.05.00 is waiting on Storage
>> FileDiskB
>> ====
>>
>> > What does "status storage" show?
>>
>> 3) "status storage" shows this for the jammed up catalog backup:
>> *status storage
>> The defined Storage resources are:
>> 1: FileDiskA
>> 2: FileDiskB
>
>> Connecting to Storage daemon FileDiskB at lilac:9103
>> lilac-sd Version: 1.38.2 (20 November 2005) i686-redhat-linux-gnu redhat
>> (Stentz)
>> Daemon started 28-Dec-06 20:59, 4 Jobs run since started.
>> Running Jobs:
>> Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client
>> connection.
>> Incremental Backup job Ngorongoro-SystemState JobId=564 Volume=""
>> pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
>> Files=0 Bytes=0 Bytes/sec=0
>> FDSocket closed
>> Backup Job Ngorongoro.2006-12-28_21.00.02 waiting for Client connection.
>> Incremental Backup job Ngorongoro JobId=566 Volume=""
>> pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
>> Files=0 Bytes=0 Bytes/sec=0
>> FDSocket closed
>
> Why are these two still listed as running? BackupCatalog shouldn't start
> until these have finished.

That's a good question!

This backup scheme works fine when the ngorongoro machine is powered on. All the backup jobs run in order, and then the catalog backup runs last. The problems only arise when ngorongoro (a Windows XP client) is powered down at the time bacula runs.
> Can the clients see the -sd port on the -sd machine?
> Do the configured passwords match?
> Why do these 2 jobs have null volumes?

They don't when ngorongoro is powered on.

Could it be that bacula's autolabelling has a bug and that it only gets properly triggered when the client file daemon connects back to the storage daemon? Since this never happens, the storage daemon gets "left in limbo".

> Try mounting them and see if that fixes it.

Sorry to be silly, but what command should I use and when? i.e. should I mount the volume before the backup, or before the catalog job is scheduled to run, or should I wait until things have got stuck?

Chris.
|
From: Alan B. <aj...@ms...> - 2007-01-04 16:58:44
|
On Thu, 4 Jan 2007, Chris Rodgers wrote:

> > Effectively the same thing.
>
> Do you mean that Bacula has opened a file with a particular name in that
> directory ("mounted a tape") and is stuck writing to that one file???

Kind of.... The wedged jobs are effectively causing the same thing.

>>> Backup Job Ngorongoro-SystemState.2006-12-28_21.00.00 waiting for Client
>>> connection.
>> Why are these two still listed as running? BackupCatalog shouldn't start
>> until these have finished.
>
> That's a good question!
>
> This backup scheme works fine when the ngorongoro machine is powered on.
> All the backup jobs run in order, and then the catalog backup runs last.
> The problems only arise when ngorongoro (a Windows XP client) is powered
> down at the time bacula runs.

Ah........

You need to add a RunBeforeJob which will exit with an error if ngorongoro is not online. There was some discussion on this in the past relating to laptops and there may already be a script for it somewhere.

> Could it be that bacula's autolabelling has a bug and that it only gets
> properly triggered when the client file daemon connects back to the
> storage daemon.

Possibly, but more importantly these jobs should never have shown as finished on the Director status until the Storage daemon said they were finished - the fact that they're wedged means that the BackupCatalog job should never have started in the first place....

>> Try mounting them and see if that fixes it.
>
> Sorry to be silly, but what command should I use and when?

mount - but given what you've already described it won't fix things.

You need to cancel these wedged jobs - if they don't exit off the storage daemon you will need to restart it.

AB
|
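A minimal sketch of the kind of RunBeforeJob check suggested above, not a script taken from the thread or from the Bacula distribution. It assumes the client's file daemon listens on port 9102, Bacula's default FD port (the thread itself only shows ports 9101 and 9103), and the file name, function names and 5 second timeout are illustrative. It tries to open a TCP connection to the client and reports the result as an exit status, relying on Bacula cancelling a job whose RunBeforeJob program exits non-zero.

```python
#!/usr/bin/env python3
"""Sketch of a RunBeforeJob check: report whether a Bacula client answers on TCP.

Assumptions (not from the thread): the file daemon listens on port 9102,
and a 5 second connect timeout is acceptable.
"""
import socket
import sys

DEFAULT_FD_PORT = 9102   # Bacula's default file daemon port (assumed here)
CONNECT_TIMEOUT = 5.0    # seconds; illustrative value


def client_is_reachable(host, port=DEFAULT_FD_PORT, timeout=CONNECT_TIMEOUT):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def main(argv):
    """Return 0 if the client is reachable, 1 if not, 2 on bad usage."""
    if len(argv) < 2:
        print("usage: check_client_online.py HOST [PORT]", file=sys.stderr)
        return 2
    host = argv[1]
    port = int(argv[2]) if len(argv) > 2 else DEFAULT_FD_PORT
    if client_is_reachable(host, port):
        return 0
    print("%s:%d is not reachable, failing the job early" % (host, port),
          file=sys.stderr)
    return 1


# When saved as an executable script, end the file with:
#     sys.exit(main(sys.argv))
```

Saved under a hypothetical path such as /etc/bacula/check_client_online.py, it could be referenced from each Ngorongoro Job resource with a line like RunBeforeJob = "/etc/bacula/check_client_online.py ngorongoro 9102". RunBeforeJob runs on the Director machine, which is what is wanted here because the client itself is switched off.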
From: Chris R. <ro...@ph...> - 2007-01-04 17:25:38
|
Alan Brown wrote:
>> This backup scheme works fine when the ngorongoro machine is powered on.
>> All the backup jobs run in order, and then the catalog backup runs last.
>> The problems only arise when ngorongoro (a Windows XP client) is powered
>> down at the time bacula runs.
>
> Ah........
>
> You need to add a RunBeforeJob which will exit with an error if
> ngorongoro is not online. There was some discussion on this in the past
> relating to laptops and there may already be a script for it somewhere.

This sounds like a bacula bug to me. Surely bacula should be able to cope if one of the client machines is unavailable for some reason.

If it cannot, it makes it very easy to perform a denial of service attack. It could also mean that in a larger office / etc. where machines are occasionally unavailable for some reason, the machines last in the backup list are rather likely not to be backed up. That seems quite a significant flaw to me.

Should I report this problem somewhere?

>> Could it be that bacula's autolabelling has a bug and that it only
>> gets properly triggered when the client file daemon connects back to
>> the storage daemon.
>
> Possibly, but more importantly these jobs should never have shown as
> finished on the Director status until the Storage daemon said they were
> finished - the fact that they're wedged means that the BackupCatalog job
> should never have started in the first place....

To my mind, the correct behaviour would be for the director to terminate the job after FDConnectTimeout has elapsed and then to inform the storage daemon that the job has been cancelled / had an error. At the moment, it looks like the director cancels the job but leaves the storage daemon in an inconsistent state.

Do you think that would be easy to fix?

Chris.
|
From: Alan B. <aj...@ms...> - 2007-01-04 18:23:52
|
On Thu, 4 Jan 2007, Chris Rodgers wrote:

>> You need to add a RunBeforeJob which will exit with an error if
>> ngorongoro is not online. There was some discussion on this in the past
>> relating to laptops and there may already be a script for it somewhere.
>
> This sounds like a bacula bug to me. Surely bacula should be able to
> cope if one of the client machines is unavailable for some reason.

Bacula will give up and continue on other machines.... eventually.

However for a machine which is known to not always be there, it is better to test in the first place or shorten the "giveup" timeout from default settings.

> If it cannot, it makes it very easy to perform a denial of service
> attack. It could also mean that in a larger office / etc. where machines
> are occasionally unavailable for some reason that the machines last in
> the backup list are rather likely not to be backed up. That seems quite
> a significant flaw to me.

If you have concurrency set high enough and the max start delays set long enough, the backups will run... :-)

>> Possibly, but more importantly these jobs should never have shown as
>> finished on the Director status until the Storage daemon said they were
>> finished - the fact that they're wedged means that the BackupCatalog job
>> should never have started in the first place....
>
> To my mind, the correct behaviour would be for the director to terminate
> the job after FDConnectTimeout has elapsed and then to inform the
> storage daemon that the job has been cancelled / had an error. At the
> moment, it looks like the director cancels the job but leaves the
> storage daemon in an inconsistent state.

"This should not be happening" and if it is occurring with 2.0 then it's definitely worthy of a bug report.

As for older versions, "Your Mileage May Vary".

AB
|
From: Erich P. <ep...@sp...> - 2007-01-04 22:31:53
|
Alan is correct, you don't want the catalog backup occurring when other client jobs are running. You can set up the job (as Alan noted) to run after all other jobs have completed -- the stock configuration is set up this way out of the oven.

I get this same message if there are still jobs running or waiting to time out (this happens with a client that is off line) but it eventually fires and we get a good backup of the catalog.

Setting the Heartbeat option keeps the FD <--> SD communication link 'open' (this is my rudimentary understanding) to cope with congestion on the network. It won't solve your network problems but will allow bacula to get the job done if there are issues with the data link between the FD and SD. For a number of client machines we back up over the 'net, this configuration setting is a life saver.

We have our client jobs running concurrently and storing to disk. There is a significant risk (apparently) if attempting to do this to tape - if I recall, your setup was going to disk. Check your Maximum Concurrent Jobs settings in the Dir and SD configurations.

Unless I've missed something here, there isn't a bug to track down, just an appropriate configuration needs to be made.

On Jan 4, 2007, at 12:23 PM, Alan Brown wrote:
> On Thu, 4 Jan 2007, Chris Rodgers wrote:
>
>>> You need to add a RunBeforeJob which will exit with an error if
>>> ngorongoro is not online. There was some discussion on this in the past
>>> relating to laptops and there may already be a script for it somewhere.
>>
>> This sounds like a bacula bug to me. Surely bacula should be able to
>> cope if one of the client machines is unavailable for some reason.
>
> Bacula will give up and continue on other machines.... eventually.
>
> However for a machine which is known to not always be there, it is better
> to test in the first place or shorten the "giveup" timeout from default
> settings.
>
>> If it cannot, it makes it very easy to perform a denial of service
>> attack. It could also mean that in a larger office / etc. where machines
>> are occasionally unavailable for some reason that the machines last in
>> the backup list are rather likely not to be backed up. That seems quite
>> a significant flaw to me.
>
> If you have concurrency set high enough and the max start delays set long
> enough, the backups will run... :-)
>
>>> Possibly, but more importantly these jobs should never have shown as
>>> finished on the Director status until the Storage daemon said they were
>>> finished - the fact that they're wedged means that the BackupCatalog job
>>> should never have started in the first place....
>>
>> To my mind, the correct behaviour would be for the director to terminate
>> the job after FDConnectTimeout has elapsed and then to inform the
>> storage daemon that the job has been cancelled / had an error. At the
>> moment, it looks like the director cancels the job but leaves the
>> storage daemon in an inconsistent state.
>
> "This should not be happening" and if it is occurring with 2.0 then it's
> definitely worthy of a bug report.
>
> As for older versions, "Your Mileage May Vary".
>
> AB
|
From: Chris R. <ro...@ph...> - 2007-01-04 23:11:08
|
>> To my mind, the correct behaviour would be for the director to terminate
>> the job after FDConnectTimeout has elapsed and then to inform the
>> storage daemon that the job has been cancelled / had an error. At the
>> moment, it looks like the director cancels the job but leaves the
>> storage daemon in an inconsistent state.
>
> "This should not be happening" and if it is occurring with 2.0 then it's
> definitely worthy of a bug report.
>
> As for older versions, "Your Mileage May Vary".

Right. I will try upgrading sometime soon.

Is it an easy matter to upgrade my existing installation from v1.38.2 to v2.0?

I presume that I need to download, compile and install the new bacula binaries.

Do I need to do anything to upgrade the database / config files?

Will I need to delete my existing backup files?

Any pointers or a link to some instructions would be great!

Many thanks,

Chris.
|
From: Arno L. <al...@it...> - 2007-01-04 23:56:10
|
Hi,

On 1/5/2007 12:09 AM, Chris Rodgers wrote:
>>>To my mind, the correct behaviour would be for the director to terminate
>>>the job after FDConnectTimeout has elapsed and then to inform the
>>>storage daemon that the job has been cancelled / had an error. At the
>>>moment, it looks like the director cancels the job but leaves the
>>>storage daemon in an inconsistent state.
>>
>>"This should not be happening" and if it is occurring with 2.0 then it's
>>definitely worthy of a bug report.

This might all be a result of the timeouts being longer than you expect... anyway, I would not consider Bacula's default behaviour to be a bug; it's just one way of operating. If you prefer another mode of work, it's just a question of configuring what you want.

>>As for older versions, "Your Mileage May Vary".
>
> Right. I will try upgrading sometime soon.
>
> Is it an easy matter to upgrade my existing installation from v1.38.2 to
> v2.0?

Yes, but it's more than just...

> I presume that I need to download, compile and install the new bacula
> binaries.
>
> Do I need to do anything to upgrade the database / config files?

... the above. You have to upgrade the catalog.

> Will I need to delete my existing backup files?

No.

> Any pointers or a link to some instructions would be great!

Well, the ReleaseNotes file is a really good starting point.

> Many thanks,
>
> Chris.

Arno

--
IT-Service Lehmann al...@it...
Arno Lehmann http://www.its-lehmann.de
|
From: Chris R. <ro...@ph...> - 2007-01-05 01:25:11
|
>>>>To my mind, the correct behaviour would be for the director to terminate
>>>>the job after FDConnectTimeout has elapsed and then to inform the
>>>>storage daemon that the job has been cancelled / had an error. At the
>>>>moment, it looks like the director cancels the job but leaves the
>>>>storage daemon in an inconsistent state.
>>>
>>>"This should not be happening" and if it is occurring with 2.0 then it's
>>>definitely worthy of a bug report.

OK. This behaviour is still happening in version 2.0.

How can I arrange for the storage daemon to time out at the same rate as the director?

Alternatively, how can I arrange for the storage daemon to be informed when the director cancels a job so that a volume called "" doesn't end up mounted?

Thanks,

Chris.

Here is some output from bconsole with v2.0 of bacula:

*status sto
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
Connecting to Storage daemon FileDiskB at lilac:9103

lilac-sd Version: 2.0.0 (04 January 2007) i686-redhat-linux-gnu redhat (Stentz)
Daemon started 05-Jan-07 00:54, 4 Jobs run since started.
Heap: bytes=222,428 max_bytes=285,780 bufs=120 max_bufs=122

Running Jobs:
Backup Job Ngorongoro.2007-01-05_01.20.04 waiting for Client connection.
Writing: Incremental Backup job Ngorongoro JobId=633 Volume=""
    pool="PoolIncDiskB" device=""FileStorageDiskB" (/var/spool/bacula2)"
    Files=0 Bytes=0 Bytes/sec=0
    FDSocket closed
====

Jobs waiting to reserve a drive:
    3608 JobId=634 wants Pool="PoolFullDiskB" but have Pool="PoolIncDiskB" on drive "FileStorageDiskB" (/var/spool/bacula2).
    3607 JobId=634 wants Vol="Full-0065" drive has Vol="" on drive "FileStorageDiskB" (/var/spool/bacula2).
====

Terminated Jobs:
JobId  Level  Files  Bytes    Status  Finished         Name
===================================================================
629    Incr   0      0        Error   05-Jan-07 00:55  Ngorongoro
630    Full   0      0        Error   05-Jan-07 00:55  BackupCatalog
631    Incr   120    21.41 M  OK      05-Jan-07 01:00  Ngorongoro
632    Full   1      8.809 M  OK      05-Jan-07 01:03  BackupCatalog
====

Device status:
Device "FileStorageDiskA" (/var/spool/bacula) is not open.
Device "FileStorageDiskB" (/var/spool/bacula2) is not open.
====

In Use Volume status:
====

*unmount
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
3901 Device "FileStorageDiskB" (/var/spool/bacula2) is already unmounted.

*mount
The defined Storage resources are:
     1: FileDiskA
     2: FileDiskB
Select Storage resource (1-2): 2
3906 File device "FileStorageDiskB" (/var/spool/bacula2) is always mounted.
|
From: Chris R. <ro...@ph...> - 2007-01-04 08:24:01
|
Chris Rodgers wrote:
> Does anyone have any ideas what is causing bacula to jam up like this?
>
> I'm sorry if I have provided too much information, but I don't have any
> real idea which parts of the configuration/etc. may be to blame.

That's what used to happen before I added the line

  # Timeout after ten minutes connecting to ngorongoro
  FDConnectTimeout = 10

to my bacula-dir.conf file. (See
http://laplace.chem.ox.ac.uk/b/bacula-dir.conf)

I thought that that would make bacula move on to the next job after 10
min. It seems to _almost_ work --> the director realises that it needs
to move onto the new job, but the storage daemon ends up reporting job
status "Other" and gets jammed up. (See my first e-mail for details of
this.)

At least, that's how it seems to me... What do you think?

Chris.
|
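A note for readers trying the same setting: Bacula time directives accept an
explicit unit, so the ten-minute timeout described above can be written out
in full. The fragment below is only a minimal sketch under stated
assumptions, not the configuration from this thread. The Director name
lilac-dir is taken from the "status dir" output earlier in the thread, all
other directives the resource needs are left out, and how a bare "10" is
interpreted is something to confirm in the manual for the installed version.
Bacula ignores case and spaces in directive names, so FDConnectTimeout and
FD Connect Timeout refer to the same directive.

```conf
# bacula-dir.conf (illustrative fragment only; the other directives
# a Director resource requires are omitted here)
Director {
  Name = lilac-dir
  # Spell the unit out so the intended duration is unambiguous.
  FD Connect Timeout = 10 minutes
}
```

After editing bacula-dir.conf the director has to re-read its configuration
(for example with the console "reload" command) before the new value applies.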
From: Chris R. <ro...@ph...> - 2007-01-04 09:02:31
|
Chris Rodgers wrote:
> Chris Rodgers wrote:
>
>>Does anyone have any ideas what is causing bacula to jam up like this?
>>
>>I'm sorry if I have provided too much information, but I don't have any
>>real idea which parts of the configuration/etc. may be to blame.
>
>
> That's what used to happen before I added the line
>
>   # Timeout after ten minutes connecting to ngorongoro
>   FDConnectTimeout = 10
>
> to my bacula-dir.conf file. (See
> http://laplace.chem.ox.ac.uk/b/bacula-dir.conf)
>
> I thought that that would make bacula move on to the next job after 10
> min. It seems to _almost_ work --> the director realises that it needs
> to move onto the new job, but the storage daemon ends up reporting job
> status "Other" and gets jammed up. (See my first e-mail for details of
> this.)
>
> At least, that's how it seems to me...

P.S. This timeout worked (and didn't leave stuck jobs) until I added my
second hard disk (http://laplace.chem.ox.ac.uk/b/bacula-sd.conf):

Device {
  Name = FileStorageDiskB
  Media Type = FileDiskB
  Archive Device = /var/spool/bacula2
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
}

Could this be a bacula bug, or have I misunderstood the configuration
file format?

Chris.
|
From: Chris R. <ro...@ph...> - 2007-01-08 09:10:40
|
>>>>To my mind, the correct behaviour would be for the director to terminate
>>>>the job after FDConnectTimeout has elapsed and then to inform the
>>>>storage daemon that the job has been cancelled / had an error. At the
>>>>moment, it looks like the director cancels the job but leaves the
>>>>storage daemon in an inconsistent state.
>>>
>>>
>>>"This should not be happening" and if it is occurring with 2.0 then it's
>>>definitely worthy of a bug report.
>
>
> This might all be a result of the timeouts being longer than you
> expect... anyway, I would not consider Bacula's default behaviour to be a
> bug; it's just one way of operating. If you prefer another mode of work,
> it's just a question of configuring what you want.

I've lost interest in chasing this further, but before I go, I want to
say that I really do think this is a bug.

With the configuration that I have posted (i.e. two different disk
backup pools) bacula gets completely wedged if a client is offline. The
only way to recover is to stop and restart the daemons on the server
because otherwise the storage daemon gets out of sync with the director.

Do the Bacula authors really intend that a client being offline should
cause bacula to end up with an inconsistent state between the director
and the storage daemon that can only be resolved by manually restarting
both daemons?

Many thanks,

Chris.
|
From: Kern S. <ke...@si...> - 2007-01-08 09:24:42
|
On Monday 08 January 2007 10:10, Chris Rodgers wrote:
> >>>>To my mind, the correct behaviour would be for the director to terminate
> >>>>the job after FDConnectTimeout has elapsed and then to inform the
> >>>>storage daemon that the job has been cancelled / had an error. At the
> >>>>moment, it looks like the director cancels the job but leaves the
> >>>>storage daemon in an inconsistent state.
> >>>
> >>>
> >>>"This should not be happening" and if it is occurring with 2.0 then it's
> >>>definitely worthy of a bug report.
> >
> >
> > This might all be a result of the timeouts being longer than you
> > expect... anyway, I would not consider Bacula's default behaviour to be a
> > bug; it's just one way of operating. If you prefer another mode of work,
> > it's just a question of configuring what you want.
>
> I've lost interest in chasing this further, but before I go, I want to
> say that I really do think this is a bug.
>
> With the configuration that I have posted (i.e. two different disk
> backup pools) bacula gets completely wedged if a client is offline. The
> only way to recover is to stop and restart the daemons on the server
> because otherwise the storage daemon gets out of sync with the director.
>
> Do the Bacula authors really intend that a client being offline should
> cause bacula to end up with an inconsistent state between the director
> and the storage daemon that can only be resolved by manually restarting
> both daemons?

It appears that you have a rather unique case, since many other users,
including myself, rely on Bacula continuing when a client cannot be
contacted. As someone on this list previously showed, the times specified
in the Bacula timeout directives are not accurate, because some OSes
return immediately when a connection fails while others wait for some
undetermined time. In addition, the SD can typically take up to 30
minutes to cancel a job, depending on the exact state of the FD when it
hangs.

There are ways to work around each one of these problems, though.
|