From: Brent B. <br...@ke...> - 2005-11-25 18:52:10
I'm trying to solve a problem that I've been having for a *long* time: when I do multivolume dumps, running 'restore -C' on them almost always reveals corruption in files at and just after the point where the archive spans to the next tape.

I've learned from the man page for the st driver that this is to be expected if write buffering is on (which it is by default):

    MT_ST_BUFFER_WRITES (Default: true)
        Buffer all write operations in fixed block mode. If this
        option is false and the drive uses a fixed block size, then
        all write operations must be for a multiple of the block
        size. This option must be set false to write reliable
        multi-volume archives.

    MT_ST_ASYNC_WRITES (Default: true)
        When this option is true, write operations return
        immediately without waiting for the data to be transferred
        to the drive if the data fits into the driver's buffer. The
        write threshold determines how full the buffer must be
        before a new SCSI write command is issued. Any errors
        reported by the drive will be held until the next operation.
        This option must be set false to write reliable multi-volume
        archives.

So that would mean that buffered writes and async writes should be turned off if you expect to span tapes with dump (or tar, or probably anything else), right?

So I do 'mt stoptions 4', and I still get:

    BOT ONLINE IM_REP_EN

from 'mt status'. That's odd. The 'IM_REP_EN' at the end means it's still doing things asynchronously. Again, from the 'st' driver man page:

    GMT_IM_REP_EN(x): Immediate report mode. This bit is set if
        there are no guarantees that the data has been physically
        written to the tape when the write call returns. It is set
        zero only when the driver does not buffer data and the drive
        is set not to buffer data.

So apparently it ignored my 'mt stoptions 4' call, because IM_REP_EN is still on, thus write buffers are still on, thus I'll probably get a corrupted backup again if I try to span multiple tapes.
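For reference, the numeric argument to 'mt stoptions' is a bitmask of the MT_ST_* booleans described in the st man page. The following is a minimal sketch of how the value 4 breaks down, assuming the bit values found in /usr/include/linux/mtio.h (0x1 for MT_ST_BUFFER_WRITES, 0x2 for MT_ST_ASYNC_WRITES, 0x4 for MT_ST_READ_AHEAD). The function name st_options is made up for illustration, and nothing here touches the tape drive.

```shell
# Hypothetical helper: report which of the first three st boolean options
# a given 'mt stoptions' value would enable. The bit values are assumed
# from linux/mtio.h; check your own header before relying on them.
st_options() {
    mask=$1
    out=""
    if [ $(( $mask & 0x1 )) -ne 0 ]; then out="$out MT_ST_BUFFER_WRITES"; fi
    if [ $(( $mask & 0x2 )) -ne 0 ]; then out="$out MT_ST_ASYNC_WRITES"; fi
    if [ $(( $mask & 0x4 )) -ne 0 ]; then out="$out MT_ST_READ_AHEAD"; fi
    # Print "none" when no bit is set, otherwise drop the leading space.
    if [ -z "$out" ]; then
        echo "none"
    else
        echo "${out# }"
    fi
}

st_options 4    # prints: MT_ST_READ_AHEAD
st_options 7    # prints: MT_ST_BUFFER_WRITES MT_ST_ASYNC_WRITES MT_ST_READ_AHEAD
```

Under those assumptions, 'mt stoptions 4' leaves only read-ahead enabled and clears both write-buffering bits in the driver. The GMT_IM_REP_EN text quoted above says the bit also depends on the drive's own buffering, so seeing it still set may not by itself mean the call was ignored.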
Now I try another mt command, 'mt drvbuffer 0', to just shut off all buffers completely. Now 'mt status' gives me:

    BOT ONLINE

Good! No more IM_REP_EN now. I'm safe, right? We'll see. So I get ready to do a dump, and here's what that looks like:

    village:~# mt erase
    village:~# mt rewind
    village:~# mt datcompression 2
    Compression on.
    Compression capable.
    Decompression capable.
    village:~# mt drvbuffer 0
    village:~# mt stoptions 4
    village:~# mt status
    drive type = Generic SCSI-2 tape
    drive status = 637534720
    sense key error = 0
    residue count = 0
    file number = 0
    block number = 0
    Tape block size 512 bytes. Density code 0x26 (unknown).
    Soft error count since last status=0
    General status bits on (41000000):
     BOT ONLINE

So the tape table of contents is erased, it's at the beginning, hardware compression is on, all my nasty write-buffering is off, and 'mt status' is giving me only "BOT ONLINE" with no "IM_REP_EN" to prove it's okay. So I can dump? Here we go:

    village:~# dump -0au /srv
    DUMP: Date of this level 0 dump: Fri Nov 25 12:23:24 2005
    DUMP: Dumping /dev/sdb6 (/srv) to /dev/nst0
    DUMP: Label: /srv
    DUMP: Writing 10 Kilobyte records
    DUMP: mapping (Pass I) [regular files]
    DUMP: mapping (Pass II) [directories]
    DUMP: estimated 58195461 blocks.
    DUMP: Volume 1 started with block 1 at: Fri Nov 25 12:24:28 2005
    DUMP: 0.00% done at 3 kB/s, finished in 4931:10
    DUMP: 0.00% done at 3 kB/s, finished in 4947:55
    DUMP: Interrupt received.
    DUMP: Do you want to abort dump?: ("yes" or "no")
    DUMP: Interrupt received.
    DUMP: Interrupt received.
    DUMP: Do you want to abort dump?: ("yes" or "no")
    DUMP: Do you want to abort dump?: ("yes" or "no") yes
    DUMP: The ENTIRE dump is aborted.
    village:~#

And that's what it does. Everything looks fine through Pass II, then the tape drive starts, and it seems to be streaming just fine with no shoeshining of the tape at all...except...the hard drive is showing hardly any activity at all. What is it streaming?
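As a cross-check on the 'mt status' output above, the two raw numbers it prints can be unpacked by hand. This is only a sketch, and it assumes the layout given in linux/mtio.h: the "drive status" word holds the block size in its low 24 bits and the density code in its top 8 bits, and the general status word uses 0x40000000 for BOT, 0x01000000 for ONLINE and 0x00010000 for IM_REP_EN. The function names are hypothetical, and the value 0x41010000 is a made-up example with the IM_REP_EN bit added, not a number taken from the output above. Nothing here talks to the drive.

```shell
# Hypothetical helper: split the "drive status" word into block size and
# density code (field layout assumed from linux/mtio.h).
decode_dsreg() {
    dsreg=$1
    blksize=$(( $dsreg & 0xffffff ))          # low 24 bits
    density=$(( ($dsreg >> 24) & 0xff ))      # top 8 bits
    printf 'block size %d, density 0x%02x\n' "$blksize" "$density"
}

# Hypothetical helper: name the three general status bits discussed here.
decode_gstat() {
    gstat=$1
    out=""
    if [ $(( $gstat & 0x40000000 )) -ne 0 ]; then out="$out BOT"; fi
    if [ $(( $gstat & 0x01000000 )) -ne 0 ]; then out="$out ONLINE"; fi
    if [ $(( $gstat & 0x00010000 )) -ne 0 ]; then out="$out IM_REP_EN"; fi
    echo "${out# }"
}

decode_dsreg 637534720     # prints: block size 512, density 0x26
decode_gstat 0x41000000    # prints: BOT ONLINE
decode_gstat 0x41010000    # prints: BOT ONLINE IM_REP_EN
```

Under those assumptions the decoded values agree with what mt itself reports: 512-byte blocks, density code 0x26, and only BOT and ONLINE set.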
The tape motor is running, the tape drive activity light is on...but the hard drive is sending...nothing. Finally, as you can see above, dump ends up spending so much time waiting to start Pass III that it prints its percent-done progress messages before it even gets that far. (And the progress messages say it's doing 3 kB/s, and that it will finish in about five thousand hours.) All the while the tape drive motor is running, and it's happily streaming *something*, though with nothing coming from the hard drive, I can't imagine what. If I do 'mt tell' after aborting it, I get:

    village:~# mt tell
    At block 3980.

So I'm now almost two megabytes into the tape media (3980 blocks of 512 bytes), and I never even got to Pass III, so it hadn't even started archiving data yet. Very, very weird.

Sorry about the long detailed message. Usually if you post a brief message, nobody can help you, so I thought I'd post the whole thing.

What I'm hoping someone here can do is just tell me what *they* do for successful multivolume dumps, which can be verified by 'restore -C' as good. I've seen lots of posts on the dump mailing lists and elsewhere from people with this problem, but no solutions. It doesn't seem so much a dump problem as a general tape drive problem, since it's the SCSI tape driver that seems to create this scenario, and it would probably happen just the same with tar or anything else that tries to span data across tape volumes on Linux.

Can anyone help with this? If you're successfully writing (and verifying!) multivolume tape archives, how are you doing it? The problem only shows up on a 'restore -C'; during the write, it looks okay. So if you don't actually verify, you don't really know that you aren't having this problem, too.

--
+ Brent A. Busby, UNIX Systems Admin    + "It's like being   +
+ James Franck / Enrico Fermi Institute +  blindsided by a   +
+ The University of Chicago             +  flying dwarf..."  +