From: Kern S. <ke...@si...> - 2006-08-18 20:00:01
Hello Wolfgang,

As I have mentioned before, I believe that this is an OS bug. I have been able to reproduce it here, and I believe it happens when you send too much data too fast to the OS. Please see below for more comments:

On Friday 18 August 2006 18:22, Wolfgang Denk wrote:
> Dear Kern,
>
> in message <200...@si...> you wrote:
> >
> > Thanks for the feedback. I'm happy to see that your problem is resolved.
>
> I start seeing this problem now, too, especially when running several
> jobs (with data spooling) in parallel, and when the tape gets full.
>
> > I guess that the next time I respond, I should not only mention the fact that
> > Bacula tends to expose a lot of driver bugs AND hardware problems, both of
> > which are difficult and time consuming to debug ...
>
> I don't think it's either of these in my case. When I run into this
> situation, I can reliably reproduce the problem (but I don't know
> which exact conditions trigger it in the first place). Here is what
> happens:
>
> 17-Aug 00:40 nyx-sd: Spooling data ...
> 17-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-3". Despooling 1,985,286 bytes ...
> 17-Aug 00:40 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538 Write error at 22:2740 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 17-Aug 00:40 nyx-sd: Re-read of last block succeeded.
> 17-Aug 00:40 nyx-sd: End of medium on Volume "K-V-DAT-3" Bytes=15,118,094,426 Blocks=234,370 at 17-Aug-2006 00:40.
> 17-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 3.
> 17-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 3, drive 0" command.
> 17-Aug 00:42 nyx-sd: 3304 Issuing autochanger "load slot 4, drive 0" command.
> 17-Aug 00:43 nyx-sd: 3305 Autochanger "load slot 4, drive 0", status is OK.
> 17-Aug 00:43 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 17-Aug 00:43 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 17-Aug 00:43 nyx-sd: Wrote label to prelabeled Volume "K-V-DAT-4" on device "DDS-3" (/dev/nst0)
> 17-Aug 00:43 nyx-sd: New volume "K-V-DAT-4" mounted on device "DDS-3" (/dev/nst0) at 17-Aug-2006 00:43.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: block.c:538 Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.

The above error "Device or resource busy" should never happen, from what I know about Linux. The write() system call should be a blocking call and it should never return busy, thus I consider this most likely an operating system bug. The proper behavior is for the OS to simply wait until the I/O is complete and then return.

At one point, I was able to reproduce this problem here, and in 1.39.x I have what I consider a really crude workaround. Basically Bacula will retry the I/O after waiting a bit. Using this technique, I was able to eliminate the problem here, but I am not really satisfied with the fix as the amount of time that Bacula needs to wait is indeterminate.

As far as I can tell, this problem occurs because Bacula is feeding data to the tape drive too fast. I can only guess that the OS memory is getting completely full and at some point, rather than making Bacula wait as it should, the OS returns an error code. After that point, everything for that particular job is more or less screwed up, as you can see in the output that follows, because Bacula gets an error during the rewind since the OS is probably still busy.

> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Error: Re-read of last block OK, but block numbers differ. Last block=0 Current block=0.
> 17-Aug 00:43 nyx-sd: Source-Other.2006-08-17_00.10.04 Fatal error: spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678 Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> ...
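
For illustration only, the retry idea described above could be sketched in C roughly as follows. This is not the code that is in Bacula 1.39.x; the function name write_with_retry, the retry count, and the delay are all hypothetical, and choosing the delay is exactly the part that cannot be determined reliably.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not Bacula's actual code: write one block and,
 * if the OS reports EBUSY, wait a while and try again, up to
 * max_retries additional attempts. */
ssize_t write_with_retry(int fd, const void *buf, size_t len,
                         int max_retries, long delay_ms)
{
    for (int attempt = 0; ; attempt++) {
        ssize_t n = write(fd, buf, len);
        if (n >= 0) {
            return n;   /* written; the caller still checks for a short write */
        }
        if (errno != EBUSY || attempt >= max_retries) {
            return -1;  /* a different error, or retries exhausted; errno is from write() */
        }
        /* The delay is a guess: there is no way to know how long the
         * OS needs before it will accept the block. */
        struct timespec ts;
        ts.tv_sec = delay_ms / 1000;
        ts.tv_nsec = (delay_ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);
    }
}
```

A caller would use this in place of a bare write(), for example write_with_retry(fd, block, block_len, 5, 500), and would still have to treat a return value of -1 as a hard error.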
>
> 18-Aug 00:40 nyx-sd: Committing spooled data to Volume "K-V-DAT-4". Despooling 948,750,016 bytes ...
> 18-Aug 00:40 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538 Write error at 2:6438 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 00:41 nyx-sd: Re-read of last block succeeded.
> 18-Aug 00:41 nyx-sd: End of medium on Volume "K-V-DAT-4" Bytes=691,688,327 Blocks=10,727 at 18-Aug-2006 00:41.
> 18-Aug 00:41 castor-dir: Recycled volume "K-V-DAT-1"
> 18-Aug 00:41 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:41 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 4.
> 18-Aug 00:41 nyx-sd: 3307 Issuing autochanger "unload slot 4, drive 0" command.
> 18-Aug 00:41 nyx-sd: 3304 Issuing autochanger "load slot 1, drive 0" command.
> 18-Aug 00:42 nyx-sd: 3305 Autochanger "load slot 1, drive 0", status is OK.
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Recycled volume "K-V-DAT-1" on device "DDS-3" (/dev/nst0), all previous data lost.
> 18-Aug 00:42 nyx-sd: New volume "K-V-DAT-1" mounted on device "DDS-3" (/dev/nst0) at 18-Aug-2006 00:42.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: block.c:538 Write error at 0:1 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Error: Re-read of last block OK, but block numbers differ. Last block=0 Current block=0.
> 18-Aug 00:42 nyx-sd: Castor-Home-1.2006-08-18_00.05.00 Fatal error: spool.c:249 Fatal append error on device "DDS-3" (/dev/nst0): ERR=dev.c:678 Rewind error on "DDS-3" (/dev/nst0). ERR=Input/output error.
> ...
> 18-Aug 00:42 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 00:42 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 00:42 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace record at EOT failed. ERR=Input/output error
> 18-Aug 00:43 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 00:43 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
>
>
> In this state, the newly mounted volumes ("K-V-DAT-4" and
> "K-V-DAT-1") are empty, i. e. they have NO data on it. Even the
> previously existing bacula labels are gone.

I suspect that this is because the OS has been so overloaded by data that it gets totally lost, which in turn confuses Bacula as it attempts to salvage what it can from the backup.

>
> I can now reproduce this failure mode; for example, I try to recycle
> the last corrupted tape (which has no data on it anyway): I unmount
> it, delete it from the catalog, label it again (using the same or
> another volume name), mount it again. The hanging backup job will
> attempt to write on it again, and corrupt it again:

At this point, everything is more or less screwed up for that particular job. The file descriptor that the job holds for the tape is probably no longer valid, and as you see below, it continues to immediately get the "Device or resource busy" error. Thus it is better to cancel the job, and if that does not resolve the problem, to restart the Storage daemon, because things have gone terribly wrong.

>
> 18-Aug 01:14 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:14 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace record at EOT failed. ERR=Input/output error
> 18-Aug 01:14 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 01:14 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:14 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:14 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:38 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 01:38 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace record at EOT failed. ERR=Input/output error
> 18-Aug 01:38 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 01:38 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 01:38 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 01:38 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 01:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 03:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 07:43 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:19 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:19 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace record at EOT failed. ERR=Input/output error
> 18-Aug 10:19 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 10:19 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:19 nyx-sd: Invalid slot=0 defined, cannot autoload Volume.
> 18-Aug 10:19 nyx-sd: Please mount Volume "K-T-DAT-2" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
> 18-Aug 10:29 nyx-sd: 3301 Issuing autochanger "loaded drive 0" command.
> 18-Aug 10:29 nyx-sd: 3302 Autochanger "loaded drive 0", result is Slot 1.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: block.c:538 Write error at 0:0 on device "DDS-3" (/dev/nst0). ERR=Device or resource busy.
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: Backspace record at EOT failed. ERR=Input/output error
> 18-Aug 10:29 nyx-sd: Castor-Home-2.2006-08-18_00.10.00 Error: label.c:441 Unable to write device "DDS-3" (/dev/nst0): ERR=dev.c:1430 ioctl MTBSR error on "DDS-3" (/dev/nst0). ERR=Input/output error.
>
> 18-Aug 10:29 nyx-sd: Marking Volume "K-V-DAT-1" in Error in Catalog.
> 18-Aug 10:29 nyx-sd: Please mount Volume "K-V-DAT-1" on Storage Device "DDS-3" (/dev/nst0) for Job Castor-Home-2.2006-08-18_00.10.00
>
>
> To me it seems as if bacula somewhat gets confused when several jobs
> are despooling data in parallel, and a medium gets full and is
> changed.
>
>
> All this is with Bacula 1.38.11 (28Jun06).
>
>
> Any ideas?

I think that it is the OS that is getting confused or overloaded.
It returns an error status that should never happen, and from there on Bacula is unable to recover.

>
> Best regards,
>
> Wolfgang Denk
>
> --
> Software Engineering: Embedded and Realtime Systems, Embedded Linux
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd...@de...
> When the ax entered the forest, the trees said, "The handle is one of
> us!" -- Turkish proverb