From: <bac...@li...> - 2004-09-28 15:30:09
|
A BUGNOTE has been added to this bug. ====================================================================== http://bugs.bacula.org/bug_view_advanced_page.php?bug_id=0000104 ====================================================================== Reported By: kardel Assigned To: ====================================================================== Project: bacula Bug ID: 104 Category: Storage Daemon Reproducibility: have not tried Severity: major Priority: normal Status: feedback ====================================================================== Date Submitted: 09-10-2004 23:17 PDT Last Modified: 09-28-2004 08:33 PDT ====================================================================== Summary: EOF handling stopped working? Description: tonights full backup failed to complete because it hits EOF on the tape. 1.35.2 was handNling the situation as expected. On 1.35.3 I see a "Network send error" and a backup job in error state after that. I will re-run the job to check whether it is reproducable and if it fails again i will downgrade to 1.35.2 a see what that does. ====================================================================== ---------------------------------------------------------------------- kardel - 09-11-2004 03:01 PDT ---------------------------------------------------------------------- Second attempt lead to a similar result. There are no core dumps. pip-dir: Start Backup JobId 95, Job=System.2004-09-11_08.14.08 pip-sd: Spooling data ... pip-sd: User specified spool size reached. pip-sd: Writting spooled data to Volume. Despooling 16,000,014,081 bytes ... pip-sd: System.2004-09-11_08.14.08 Fatal error: Cannot write block. Device at EOM. pip-sd: System.2004-09-11_08.14.08 Fatal error: Fatal device error: ERR=dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. pip-fd: System.2004-09-11_08.14.08 Fatal error: backup.c:470 Network send error. pip-dir: System.2004-09-11_08.14.08 Error: Bacula 1.35.3 (09Sep04): 11-Sep-2004 10:43 JobId: 95 Job: System.2004-09-11_08.14.08 Backup Level: Full Client: pip-fd FileSet: "all ffs" 2004-08-31 01:09:20 Pool: "Default" Start time: 11-Sep-2004 08:14 End time: 11-Sep-2004 10:43 FD Files Written: 589,625 SD Files Written: 589,625 FD Bytes Written: 15,872,576,269 SD Bytes Written: 15,961,142,780 Rate: 1766.0 KB/s Software Compression: 38.0 % Volume name(s): Volume Session Id: 3 Volume Session Time: 1094854872 Last Volume Bytes: 20,401,852,896 Non-fatal FD errors: 0 SD Errors: 0 FD termination status: Error SD termination status: Error Termination: *** Backup Error *** I now went back to 1.35.2 to see what happens now. Volume BACULA-0002 was marked Error as following error occured with 1.35.2 11-Sep-2004 11:51 pip-sd: System.2004-09-11_11.48.44 Error: Unable to position to end of data on device "/dev/nrst0". ERR=dev.c:477 ioctl MTEOM error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 11:51 pip-sd: Marking Volume "BACULA-0002" in Error in Catalog. I am no looking forward to see what happens at next EOM. ---------------------------------------------------------------------- kardel - 09-11-2004 06:24 PDT ---------------------------------------------------------------------- well, 1.35.2 won't work either... reverting back to 1.35.1 (stand by) 11-Sep-2004 13:18 pip-sd: User specified spool size reached. 11-Sep-2004 13:18 pip-sd: Writting spooled data to Volume. Despooling 16,000,013,563 bytes ... * * *pip-sd: spool.c:375 Bad return from despool in write_block. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Error: block.c:572 Write error at 5:9339 on device /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Fatal error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-sd: System.2004-09-11_11.48.44 Fatal error: Fatal device error: ERR=dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 14:36 pip-fd: System.2004-09-11_11.48.44 Fatal error: backup.c:470 Network send error. 11-Sep-2004 14:36 pip-dir: System.2004-09-11_11.48.44 Error: Bacula 1.35.2 (30Aug04): 11-Sep-2004 14:36 JobId: 96 Job: System.2004-09-11_11.48.44 Backup Level: Full Client: pip-fd FileSet: "all ffs" 2004-08-31 01:09:20 Pool: "Default" Start time: 11-Sep-2004 11:48 End time: 11-Sep-2004 14:36 FD Files Written: 589,632 SD Files Written: 589,632 FD Bytes Written: 15,872,581,513 SD Bytes Written: 15,961,151,963 Rate: 1577.9 KB/s Software Compression: 38.0 % Volume name(s): BACULA-0000 Volume Session Id: 1 Volume Session Time: 1094896095 Last Volume Bytes: 4,060,904,223 Non-fatal FD errors: 0 SD Errors: 0 FD termination status: Error SD termination status: Error Termination: *** Backup Error *** ---------------------------------------------------------------------- kardel - 09-11-2004 10:01 PDT ---------------------------------------------------------------------- bacula 1.35.1 works as expected: 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: block.c:566 Write error at 4:964 on device /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: dev.c:1152 ioctl MTWEOF error on /dev/nrst0. ERR=Input/output error. 11-Sep-2004 18:22 pip-sd: System.2004-09-11_15.20.14 Error: Re-read of last block failed. Last block=62963 Current block=62963. 11-Sep-2004 18:22 pip-sd: End of medium on Volume "BACULA-0003" Bytes=4,061,932,394 Blocks=62,964 at 11-Sep-2004 18:22. 11-Sep-2004 18:23 pip-dir: Recycled volume "BACULA-0004" 11-Sep-2004 18:23 pip-sd: Please mount Volume "BACULA-0004" on Storage Device "SDT10000" for Job System.2004-09-11_15.20.14 So it seems EOM detection / EOF errorhandling changed between 1.35.1 to 1.35.2. ---------------------------------------------------------------------- kern - 09-18-2004 13:50 PDT ---------------------------------------------------------------------- I have changed the EOF handling between 1.35.1 and 1.35.3 to do exactly what 1.35.3 does -- fail the job if Bacula cannot write an EOF at the end of the tape. I believe that this is correct behavior and that your drive is either misconfigured (probably the same pthreads bug that existed in FreeBSD) or the driver is buggy. If Bacula cannot write an EOF at the end of the tape, it is *very* unlikely that you will be able to successfully read the tape to restore the files. I would like you to try a few things: 1. Run the Bacula btape "test" command. 2. Run the Bacula btape "fill" command (either one or two tapes). 3. Run the tapetest program (or perhaps it is testtape) It is described in the testing chapter and in the FreeBSD platform directory. 4. Try to recover the files that spanned two volumes. I will bet that several of the above do not work. Number 4 is, of course, the most important. ---------------------------------------------------------------------- kardel - 09-19-2004 09:53 PDT ---------------------------------------------------------------------- See bacula-log for the fill test log. Things I observed: 1) btape test run successful (so configuration/driver shouldn't be too bad) (I can provide a log wrt btape test also) 2) btape fill runs OK and acts sensible when hitting EOM in 1.35.1 - it also ignores ti MTWEOF error a EOM and keeps going on. The second tape is started correctly to the failed block that hit EOM on the first tape. 3) btape hits a SIGSEGV when getting ready to re-read the tapes. No traceback as btape was called via path and thus argv[0] contained just the releative path. Maybe the code for determining exepath should be checked as it didn't derive a correct path the the executables. My CWD was in the etc directory (/usr/pkg/etc/bacula) and btape was called and resided in /usr/pkg/sbin/btape which was reachable via PATH. It seems that lib/message.c is just concerned about the CWD. It should, if no absolute path is given scan the path for the executable - not just the CWD. 4) Thankfully there was the "unfill" command. This allowed for the test to finish. It show that files straddeling EOM and thus spanning multiple tapes can sucessfully be read again even when WEOF errors are ignored at EOM. I wouldn't even expect WEOF to succeed at EOM conditions. I would just expect a corrupt (runt) last block at worst. 5) The label operation for the second tape incurrs also an I/O error. This is strange. The tape itself is labelled correctly though as it works right after inserting the tape again after labelling. 6) I hope the logs and 1)-5) help in analysis. Regards, Frank ---------------------------------------------------------------------- kern - 09-20-2004 10:31 PDT ---------------------------------------------------------------------- Thanks for faithfully responding to each of the previous points. Here are my thoughts on each: 1. Btape test. OK, I expected it to work, but wanted to be 100% sure before continuing. 2. btape fill OK. Version 1.34.3 and above won't ignore the error on the WEOF (more on this later). 3. btape fill segfault. This is a constant problem I have because the silly thing takes so long to run. I'll look at it to see if I can fix the problem. Fortunately you discovered the unfill (written out of desperation). 4. unfill and log. Well, this is one of the more interesting things that I have seen in a long time. I did not expect the unfill to work. It did work because (I had forgotten) it specifically forward spaces to the last block and reads it. The unfill then stops reading. As a consequence, fill/unfill can perfectly handle running into the physical end of the tape (which should never happen). Unfortunately, as it is currently written, Bacula will not work because it simply reads until it gets an end of tape, which in your case is going to be an I/O error because it wrote off the end of the tape. I had never thought to make Bacula stop reading when it reaches what it thinks is the end of the tape rather than simply reading until the driver signals an end of tape. This is an interesting idea that I am going to look into. 5. The I/O error when labeling the second tape. This is very worrisome and is most likely because the previous error condition has not been properly cleared. Can you find out if there is some ioctl() call on your system that will clear all tape error conditions? On Linux it is called MTIOCERRSTAT and on FreeBSD it is called MTIOCERRSTAT It is most likely in /usr/include/sys/mtio.h 6. Yes, it all helped a lot. It is making me re-think how Bacula should handle I/O errors and when it should stop reading a tape. ---------------------------------------------------------------------- kardel - 09-21-2004 02:31 PDT ---------------------------------------------------------------------- As for clearing tape errors the documentation says that doing a MTIOCGET will read and clear the errors. === MTIOCGET (struct mtget) Retrieve the status and parameters of the tape. Error status and residual is unlatched and cleared by the driver when it receives this ioctl. === I will try (probably coming weekend) whether the early EOM warning option helps the current 1.35.3 implementation. See this snipplet from the st (scsi tape driver): === EOM HANDLING Attempts to write past EOM and how EOM is reported are handled slightly differently based upon whether EARLY WARNING recognition is enabled in the driver. If EARLY WARNING recognitions is not enabled, then detection of EOM (as reported in SCSI Sense Data with an EOM indicator) causes the write oper- ation to be flagged with I/O error (EIO). This has the effect for the user application of not knowing actually how many bytes were read (since the return of the read(2) system call is set to -1). If EARLY WARNING recognition is enabled, then detection of EOM (as reported in SCSI Sense Data with an EOM indicator) has no immediate effect except that the driver notes that EOM has been detected. If the write completing didn't transfer all data that was requested, then the residual count (counting bytes not written) is returned to the user application. In any event, the next attempt to write (if that is the next action the user application takes) is immediately completed with no data transferred, and a residual returned to the user application indicating that no data was transferred. This is the traditional UNIX EOF indica- tion. The state that EOM had been seen is then cleared. In either mode of operation, the driver does not prohibit the user appli- cation from writing more data, if it chooses to do so. This will continue up until the physical end of media, which is usually signalled internally to the driver as a CHECK CONDITION with the Sense Key set to VOLUME OVER- FLOW. When this or any otherwise unhandled error occurs, an error return of EIO will be transmitted to the user application. This does indeed mean that if EARLY WARNING is enables and the device continues to set EOM indicators prior to hitting physical end of media, that an indeterminate number of 'short write returns' as described in the previous paragraph will occur. However, the expected user application behaviour (in common with other systems) is to close the tape and rewind and request another tape upon the receipt of the first EOM indicator, possibly after writing one trailer record. === Regards, Frank ---------------------------------------------------------------------- kern - 09-21-2004 03:05 PDT ---------------------------------------------------------------------- Thanks for the info on MTIOCGET. I have now added an ioctl(MTIOCGET) in the routine that is called to clear errors. This *should* prevent future error conditions such as what you saw after changing tapes. Concerning the EARLY WARNING feature. Yes, this is exactly what should be turned on for Bacula to work properly. From their description, I am quite optimistic that it will resolve the problem -- providing the zero write status is returned to Bacula. On FreeBSD when running with pthreads, the original pthreads implementation prevented this "error" condition from being returned. Since the NetBSD pthreads code is perhaps derived from the FreeBSD, there is a good chance that there will still be problems. The tapetest program is designed to detect this problem. Can you tell me exactly how EARLY WARNING is turned on? Also, why is it not the default? ---------------------------------------------------------------------- kardel - 09-22-2004 05:45 PDT ---------------------------------------------------------------------- As to NetBSD pthreads() implementation: It is completely different from FreeBSD's and based on scheduler activations. The user level library is also completeley new code. So it seems unlikely that similar bugs crop up - different ones - definiteley. From what i experienced with the pthread library it seems pretty solid now and pretty conservative (see other consistence check bugs from pthreads here). As for turning on MTEWARN: it is a MTIOCTOP called with MTEWARN as operation and count being used a C style boolean value. It is not default, as *grin* this driver does not do early warning for variable sized record hardware/configuration as default. Seems to be an older standard / backward bug/feature compatibility. The man page for the st driver will be in in the attached file section. Regards, Frank edited on: 09-22-04 05:43 edited on: 09-22-04 05:45 ---------------------------------------------------------------------- kern - 09-23-2004 06:25 PDT ---------------------------------------------------------------------- Concerning the pthreads implementation. OK. For the moment, I'll assume that it is not the source of any problems, especially since your driver was running in fixed block mode. Thanks for the man page. It is *very* helpful, and I can see that the original driver writer knew his stuff since it is one of the few good explanations of the difference and the consequences of variable vs fixed block size tape devices. Though I have tried to make a generic tape driver, I now realize that it is time to give the Bacula driver specific knowledge of system dependent ioctl()s so that it can set the modes it wants when possible. Your testing with the MTEWARN mode turned on will be very important and interesting. If it works, I'll implement it internally in Bacula as well as set the variable block size mode when appropriate. ---------------------------------------------------------------------- kern - 09-26-2004 01:34 PDT ---------------------------------------------------------------------- Would you pull the current code in the CVS 1.35.5 (25Sep04) and give it a test. It should automatically set EWARN and variable block size mode. ---------------------------------------------------------------------- kardel - 09-26-2004 10:01 PDT ---------------------------------------------------------------------- I did add MTEWARN myself and tested 1.35.3 with it. MTEWARN did the trick and backup went smoothly again. I'll add two patches to make bacula compile in NetBSD-current (probably earlier versions too). patch-aa modifies configure (not configure.in!) to cope with the way include files of postgresql are installed in pkgsrc. patch-ab take care of re-defined macros. I will try code next time permitting. edited on: 09-26-04 10:00 edited on: 09-26-04 10:01 ---------------------------------------------------------------------- kern - 09-27-2004 05:33 PDT ---------------------------------------------------------------------- Your patches patch-aa and patch-ab are not applied. Note, the real patch-aa must go in autoconf/aclocal.m4 -- no problem though. ---------------------------------------------------------------------- kardel - 09-28-2004 08:33 PDT ---------------------------------------------------------------------- no problem I assume you take care of re-definitions of macros in another way. Compile output without path-ab: ... ../bc_types.h:178:1: warning: "uint16_t" redefined In file included from ../bacula.h:50, from base64.c:30: /usr/include/stdint.h:61:1: warning: this is the location of the previous defini tion In file included from ../bacula.h:115, from base64.c:30: ../bc_types.h:179:1: warning: "uint32_t" redefined In file included from ../bacula.h:50, from base64.c:30: /usr/include/stdint.h:71:1: warning: this is the location of the previous defini tion In file included from ../bacula.h:115, from base64.c:30: ../bc_types.h:180:1: warning: "uint64_t" redefined In file included from ../bacula.h:50, from base64.c:30: /usr/include/stdint.h:81:1: warning: this is the location of the previous defini tio ... Bug History Date Modified Username Field Change ====================================================================== 09-10-04 23:17 kardel New Bug 09-11-04 03:01 kardel Bugnote Added: 0000226 09-11-04 06:24 kardel Bugnote Added: 0000227 09-11-04 10:01 kardel Bugnote Added: 0000228 09-15-04 03:48 Dan Langille Severity feature => major 09-18-04 13:50 kern Bugnote Added: 0000245 09-18-04 13:50 kern Status new => feedback 09-19-04 09:33 kardel File Added: bacula-log 09-19-04 09:53 kardel Bugnote Added: 0000248 09-20-04 10:31 kern Bugnote Added: 0000255 09-21-04 02:31 kardel Bugnote Added: 0000256 09-21-04 03:05 kern Bugnote Added: 0000257 09-22-04 05:37 kardel Bugnote Added: 0000263 09-22-04 05:38 kardel File Added: st.4 09-22-04 05:43 kardel Bugnote Edited: 0000263 09-22-04 05:45 kardel Bugnote Edited: 0000263 09-23-04 06:25 kern Bugnote Added: 0000269 09-26-04 01:34 kern Bugnote Added: 0000298 09-26-04 09:45 kardel Bugnote Added: 0000300 09-26-04 09:58 kardel File Added: patch-aa 09-26-04 09:59 kardel File Added: patch-ab 09-26-04 10:00 kardel Bugnote Edited: 0000300 09-26-04 10:01 kardel Bugnote Edited: 0000300 09-27-04 05:33 kern Bugnote Added: 0000308 09-28-04 08:33 kardel Bugnote Added: 0000314 ====================================================================== |