From: <bac...@li...> - 2004-09-20 17:25:02
|
====================================================================== http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000104 ====================================================================== Reported By: kardel Assigned To: ====================================================================== Project: bacula Bug ID: 104 Category: Storage Daemon Reproducibility: have not tried Severity: major Priority: normal Status: feedback ====================================================================== Date Submitted: 09-10-2004 23:17 PDT Last Modified: 09-20-2004 10:31 PDT ====================================================================== Summary: EOF handling stopped working? Description: tonights full backup failed to complete because it hits EOF on the tape. 1.35.2 was handNling the situation as expected. On 1.35.3 I see a "Network send error" and a backup job in error state after that. I will re-run the job to check whether it is reproducable and if it fails again i will downgrade to 1.35.2 a see what that does. ====================================================================== ---------------------------------------------------------------------- kardel - 09-11-2004 03:01 PDT ---------------------------------------------------------------------- Second attempt lead to a similar result. There are no core dumps. pip-dir: Start Backup JobId 95, Job=System.2004-09-11_08.14.08 pip-sd: Spooling data ... pip-sd: User specified spool size reached. pip-sd: Writting spooled data to Volume. Despooling 16,000,014,081 bytes ... pip-sd: System.2004-09-11_08.14.08 Fatal error: Cannot write block. Device at EOM. pip-sd: System.2004-09-11_08.14.08 Fatal error: Fatal device error: ERR=dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. pip-fd: System.2004-09-11_08.14.08 Fatal error: backup.c:470 Network send error. pip-dir: System.2004-09-11_08.14.08 Error: Bacula 1.35.3 (09Sep04): 11-Sep-2004 10:43 JobId: 95 Job: System.2004-09-11_08.14.08 Backup Level: Full Client: pip-fd FileSet: "all ffs" 2004-08-31 01:09:20 Pool: "Default" Start time: 11-Sep-2004 08:14 End time: 11-Sep-2004 10:43 FD Files Written: 589,625 SD Files Written: 589,625 FD Bytes Written: 15,872,576,269 SD Bytes Written: 15,961,142,780 Rate: 1766.0 KB/s Software Compression: 38.0 % Volume name(s): Volume Session Id: 3 Volume Session Time: 1094854872 Last Volume Bytes: 20,401,852,896 Non-fatal FD errors: 0 SD Errors: 0 FD termination status: Error SD termination status: Error Termination: *** Backup Error *** I now went back to 1.35.2 to see what happens now. Volume BACULA-0002 was marked Error as following error occured with 1.35.2 11-Sep-2004 11:51 pip-sd: System.2004-09-11_11.48.44 Error: Unable to position to end of data on device "/dev/nrst0". ERR=dev.c:477 ioctl MTEOM error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 11:51 pip-sd: Marking Volume "BACULA-0002" in Error in Catalog. I am no looking forward to see what happens at next EOM. ---------------------------------------------------------------------- kardel - 09-11-2004 06:24 PDT ---------------------------------------------------------------------- well, 1.35.2 won't work either... reverting back to 1.35.1 (stand by) 11-Sep-2004 13:18 pip-sd: User specified spool size reached. 11-Sep-2004 13:18 pip-sd: Writting spooled data to Volume. Despooling 16,000,013,563 bytes ... * * *pip-sd: spool.c:375 Bad return from despool in write_block. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Error: block.c:572 Write error at 5:9339 on device /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Fatal error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Fatal error: Fatal device error: ERR=dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-fd: System.2004-09-11_11.48.44 Fatal error: backup.c:470 Network send error. 11-Sep-2004 14:36 pip-dir: System.2004-09-11_11.48.44 Error: Bacula 1.35.2 (30Aug04): 11-Sep-2004 14:36 JobId: 96 Job: System.2004-09-11_11.48.44 Backup Level: Full Client: pip-fd FileSet: "all ffs" 2004-08-31 01:09:20 Pool: "Default" Start time: 11-Sep-2004 11:48 End time: 11-Sep-2004 14:36 FD Files Written: 589,632 SD Files Written: 589,632 FD Bytes Written: 15,872,581,513 SD Bytes Written: 15,961,151,963 Rate: 1577.9 KB/s Software Compression: 38.0 % Volume name(s): BACULA-0000 Volume Session Id: 1 Volume Session Time: 1094896095 Last Volume Bytes: 4,060,904,223 Non-fatal FD errors: 0 SD Errors: 0 FD termination status: Error SD termination status: Error Termination: *** Backup Error *** ---------------------------------------------------------------------- kardel - 09-11-2004 10:01 PDT ---------------------------------------------------------------------- bacula 1.35.1 works as expected: 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: block.c:566 Write error at 4:964 on device /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: Re-read of last block failed. Last block=62963 Current block=62963. 11-Sep-2004 18:22 pip-sd: End of medium on Volume "BACULA-0003" Bytes=4,061,932,394 Blocks=62,964 at 11-Sep-2004 18:22. 11-Sep-2004 18:23 pip-dir: Recycled volume "BACULA-0004" 11-Sep-2004 18:23 pip-sd: Please mount Volume "BACULA-0004" on Storage Device "SDT10000" for Job System.2004-09-11_15.20.14 So it seems EOM detection / EOF errorhandling changed between 1.35.1 to 1.35.2. ---------------------------------------------------------------------- kern - 09-18-2004 13:50 PDT ---------------------------------------------------------------------- I have changed the EOF handling between 1.35.1 and 1.35.3 to do exactly what 1.35.3 does -- fail the job if Bacula cannot write an EOF at the end of the tape. I believe that this is correct behavior and that your drive is either misconfigured (probably the same pthreads bug that existed in FreeBSD) or the driver is buggy. If Bacula cannot write an EOF at the end of the tape, it is *very* unlikely that you will be able to successfully read the tape to restore the files. I would like you to try a few things: 1. Run the Bacula btape "test" command. 2. Run the Bacula btape "fill" command (either one or two tapes). 3. Run the tapetest program (or perhaps it is testtape) It is described in the testing chapter and in the FreeBSD platform directory. 4. Try to recover the files that spanned two volumes. I will bet that several of the above do not work. Number 4 is, of course, the most important. ---------------------------------------------------------------------- kardel - 09-19-2004 09:53 PDT ---------------------------------------------------------------------- See bacula-log for the fill test log. Things I observed: 1) btape test run successful (so configuration/driver shouldn't be too bad) (I can provide a log wrt btape test also) 2) btape fill runs OK and acts sensible when hitting EOM in 1.35.1 - it also ignores ti MTWEOF error a EOM and keeps going on. The second tape is started correctly to the failed block that hit EOM on the first tape. 3) btape hits a SIGSEGV when getting ready to re-read the tapes. No traceback as btape was called via path and thus argv[0] contained just the releative path. Maybe the code for determining exepath should be checked as it didn't derive a correct path the the executables. My CWD was in the etc directory (/usr/pkg/etc/bacula) and btape was called and resided in /usr/pkg/sbin/btape which was reachable via PATH. It seems that lib/message.c is just concerned about the CWD. It should, if no absolute path is given scan the path for the executable - not just the CWD. 4) Thankfully there was the "unfill" command. This allowed for the test to finish. It show that files straddeling EOM and thus spanning multiple tapes can sucessfully be read again even when WEOF errors are ignored at EOM. I wouldn't even expect WEOF to succeed at EOM conditions. I would just expect a corrupt (runt) last block at worst. 5) The label operation for the second tape incurrs also an I/O error. This is strange. The tape itself is labelled correctly though as it works right after inserting the tape again after labelling. 6) I hope the logs and 1)-5) help in analysis. Regards, Frank ---------------------------------------------------------------------- kern - 09-20-2004 10:31 PDT ---------------------------------------------------------------------- Thanks for faithfully responding to each of the previous points. Here are my thoughts on each: 1. Btape test. OK, I expected it to work, but wanted to be 100% sure before continuing. 2. btape fill OK. Version 1.34.3 and above won't ignore the error on the WEOF (more on this later). 3. btape fill segfault. This is a constant problem I have because the silly thing takes so long to run. I'll look at it to see if I can fix the problem. Fortunately you discovered the unfill (written out of desperation). 4. unfill and log. Well, this is one of the more interesting things that I have seen in a long time. I did not expect the unfill to work. It did work because (I had forgotten) it specifically forward spaces to the last block and reads it. The unfill then stops reading. As a consequence, fill/unfill can perfectly handle running into the physical end of the tape (which should never happen). Unfortunately, as it is currently written, Bacula will not work because it simply reads until it gets an end of tape, which in your case is going to be an I/O error because it wrote off the end of the tape. I had never thought to make Bacula stop reading when it reaches what it thinks is the end of the tape rather than simply reading until the driver signals an end of tape. This is an interesting idea that I am going to look into. 5. The I/O error when labeling the second tape. This is very worrisome and is most likely because the previous error condition has not been properly cleared. Can you find out if there is some ioctl() call on your system that will clear all tape error conditions? On Linux it is called MTIOCERRSTAT and on FreeBSD it is called MTIOCERRSTAT It is most likely in /usr/include/sys/mtio.h 6. Yes, it all helped a lot. It is making me re-think how Bacula should handle I/O errors and when it should stop reading a tape. Bug History Date Modified Username Field Change ====================================================================== 09-10-04 23:17 kardel New Bug 09-11-04 03:01 kardel Bugnote Added: 0000226 09-11-04 06:24 kardel Bugnote Added: 0000227 09-11-04 10:01 kardel Bugnote Added: 0000228 09-15-04 03:48 Dan Langille Severity feature => major 09-18-04 13:50 kern Bugnote Added: 0000245 09-18-04 13:50 kern Status new => feedback 09-19-04 09:33 kardel File Added: bacula-log 09-19-04 09:53 kardel Bugnote Added: 0000248 09-20-04 10:31 kern Bugnote Added: 0000255 ====================================================================== |