From: Steve C. <st...@ch...> - 2011-07-10 19:18:47
-----Original Message-----
From: Dan Langille [mailto:da...@la...]
Sent: Sunday, July 10, 2011 12:58 PM
To: st...@ch...
Cc: bac...@li...
Subject: Re: [Bacula-users] Catastrophic error. Cannot write overflow block to device "LTO4"

>> 2) Since everything is spooled first, there should be NO error that
>> cancels a job. A tape drive could fail, a tape could burst into flame;
>> all that would be needed is for Bacula to know there was an issue and
>> give the admin a simple prompt ("do you want to fix the issue or
>> cancel?"), let the admin fix the problem, and then tell Bacula to
>> restart from the last block that was stored successfully, or if need
>> be from the beginning of the spooled data file.

> This I do know. Although at first glance it seems easy to do, it is
> not. If it were trivial to do, I assure you, it would already be in
> place.

>> Canceling jobs that run for days for TBs of data is just screwed up.

> I suggest running smaller jobs. I don't mean to sound trite, but that
> really is the solution. Given that the alternative is non-trivial, the
> sensible choice is, I'm afraid, to cancel the job.

I'm already kicking off 20+ jobs for a single system. This does not work
when we're talking over the 100 TB, nearly 200 TB mark. And when these
errors happen, it does not matter how many jobs you have, as /all/
outstanding jobs fail when you have concurrency (in this case, all jobs
that were queued, even ones not writing to the same tape, were
canceled). This does not happen with any other enterprise backup
software (not that they should be 100% mimicked). With the data sizes we
have today, I don't see why there are not better error-handling
checks/routines.
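[List-archive note: the "smaller jobs" approach Dan describes is usually done by splitting one large FileSet into several per-directory Job resources, each with data spooling enabled, so a tape failure cancels only one job's worth of despooled data. A minimal sketch follows; the resource names (bigserver-fd, LTO4, /data/part1) are illustrative placeholders, not taken from this thread.]

```
# Hypothetical sketch: one of N smaller jobs carved out of a single
# huge backup. Repeat per top-level directory (part2, part3, ...).

FileSet {
  Name = "BigServer-part1"
  Include {
    Options { signature = MD5 }
    File = /data/part1          # one slice of the full data set
  }
}

Job {
  Name = "BigServer-part1"
  Type = Backup
  Client = bigserver-fd         # illustrative client name
  FileSet = "BigServer-part1"
  Storage = LTO4
  Pool = Default
  Messages = Standard
  SpoolData = yes               # spool to disk before writing to tape
}
```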