From: Brent B. <br...@ke...> - 2005-11-25 18:52:10
I'm trying to solve a problem that I've been having for a *long* time:
When I do multivolume dumps, doing 'restore -C' on them almost always
reveals corruption on files at and just after the point where the
archive spans to the next tape. I've learned from the man page for the
st driver that this is to be expected if write buffering is on (which it
is by default):
MT_ST_BUFFER_WRITES (Default: true)
Buffer all write operations in fixed block mode. If this option
is false and the drive uses a fixed block size, then all write
operations must be for a multiple of the block size. This
option must be set false to write reliable multi-volume
archives.
MT_ST_ASYNC_WRITES (Default: true)
When this option is true, write operations return immediately
without waiting for the data to be transferred to the drive if
the data fits into the driver's buffer. The write threshold
determines how full the buffer must be before a new SCSI write
command is issued. Any errors reported by the drive will be
held until the next operation. This option must be set false to
write reliable multi-volume archives.
So...that would mean that buffered writes and async writes should be
turned off if you expect to span tapes with dump (or tar, or anything
else probably), right? So I do 'mt stoptions 4', and I still get:
BOT ONLINE IM_REP_EN
from 'mt status'. That's odd. The 'IM_REP_EN' at the end means it's
still doing things asynchronously. Again, from the 'st' driver man page:
GMT_IM_REP_EN(x): Immediate report mode. This bit is set if
there are no guarantees that the data has been physically
written to the tape when the write call returns. It is set zero
only when the driver does not buffer data and the drive is set
not to buffer data.
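As a cross-check on what 'mt stoptions 4' actually asks for, here is a
small sketch that decodes a numeric options value into the three
options that matter here. The bit values are the ones I believe
<linux/mtio.h> defines for MT_ST_BUFFER_WRITES, MT_ST_ASYNC_WRITES and
MT_ST_READ_AHEAD; the function name describe_stoptions is made up for
illustration, and the other option bits are left out.

```shell
#!/bin/sh
# Option bits as (I believe) defined in <linux/mtio.h>.
# Only the three options relevant to this problem are listed.
MT_ST_BUFFER_WRITES=1   # 0x1
MT_ST_ASYNC_WRITES=2    # 0x2
MT_ST_READ_AHEAD=4      # 0x4

# describe_stoptions VALUE   (hypothetical helper)
# Prints one line per option, saying whether VALUE has that bit set.
describe_stoptions() {
    value=$1
    for pair in "buffer-writes:$MT_ST_BUFFER_WRITES" \
                "async-writes:$MT_ST_ASYNC_WRITES" \
                "read-ahead:$MT_ST_READ_AHEAD"; do
        name=${pair%%:*}    # text before the colon
        bit=${pair##*:}     # number after the colon
        if [ $(( value & bit )) -ne 0 ]; then
            echo "$name on"
        else
            echo "$name off"
        fi
    done
}

describe_stoptions 4
```

If those bit values are right, a value of 4 requests read-ahead only,
with buffered and async writes both off, so the request itself looks
correct.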
So apparently it ignored my 'mt stoptions 4' call, because IM_REP_EN is
still on, thus write buffers are still on, thus I'll probably get a
corrupted backup again if I try to span multiple tapes. Now I try
another mt command, 'mt drvbuffer 0' to just shut off all buffers
completely. Now 'mt status' gives me:
BOT ONLINE
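The by-eye check for IM_REP_EN can be scripted. This is only a sketch:
the function name immediate_report_enabled is hypothetical, and it
looks at nothing but the flag words, exactly as they appear in the two
'mt status' results quoted so far.

```shell
#!/bin/sh
# immediate_report_enabled FLAGS   (hypothetical helper)
# Exit status 0 if IM_REP_EN is one of the whitespace-separated
# status flags, 1 otherwise.
immediate_report_enabled() {
    case " $* " in
        *" IM_REP_EN "*) return 0 ;;
        *)               return 1 ;;
    esac
}

# The two flag lines quoted in this message:
for flags in "BOT ONLINE IM_REP_EN" "BOT ONLINE"; do
    if immediate_report_enabled "$flags"; then
        echo "$flags: writes may still be buffered"
    else
        echo "$flags: immediate report is off"
    fi
done

# Against a real drive, something like the following might work,
# assuming the flags are the last line of 'mt status' output:
#   immediate_report_enabled "$(mt status | tail -n 1)" && exit 1
```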
Good! No more IM_REP_EN now. I'm safe, right? We'll see. So I get
ready to do a dump, and here's what that looks like:
village:~# mt erase
village:~# mt rewind
village:~# mt datcompression 2
Compression on.
Compression capable.
Decompression capable.
village:~# mt drvbuffer 0
village:~# mt stoptions 4
village:~# mt status
drive type = Generic SCSI-2 tape
drive status = 637534720
sense key error = 0
residue count = 0
file number = 0
block number = 0
Tape block size 512 bytes. Density code 0x26 (unknown).
Soft error count since last status=0
General status bits on (41000000):
BOT ONLINE
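The raw numbers in that status output can be decoded by hand as a
double check. The sketch below assumes the layout I understand from
<linux/mtio.h>: the "drive status" word carries the density code in
its top 8 bits and the block size in its low 24 bits, and
GMT_IM_REP_EN is bit 0x00010000 of the general status word. The
function names are made up for illustration.

```shell
#!/bin/sh
# decode_drive_status DECIMAL   (hypothetical helper)
# Splits the "drive status" word into density code (top 8 bits)
# and block size (low 24 bits).
decode_drive_status() {
    printf 'density=0x%02x blocksize=%d\n' \
        $(( ($1 >> 24) & 0xff )) $(( $1 & 0xffffff ))
}

# im_rep_en_set HEXWORD   (hypothetical helper)
# Exit status 0 if the assumed GMT_IM_REP_EN bit (0x00010000) is set
# in the general status word, given as hex digits without "0x".
im_rep_en_set() {
    [ $(( 0x$1 & 0x00010000 )) -ne 0 ]
}

decode_drive_status 637534720      # from "drive status = 637534720"
if im_rep_en_set 41000000; then    # from "General status bits on (41000000)"
    echo "IM_REP_EN set"
else
    echo "IM_REP_EN clear"
fi
```

Under those assumptions the drive status word decodes to density 0x26
and a 512-byte block size, which agrees with the "Tape block size"
line, and the general status word has the immediate-report bit clear.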
So the tape table of contents is erased, it's at the beginning, hardware
compression is on, all my nasty write-buffering is off, and 'mt status'
is giving me only "BOT ONLINE" with no "IM_REP_EN" to prove it's okay.
So I can dump? Here we go:
village:~# dump -0au /srv
DUMP: Date of this level 0 dump: Fri Nov 25 12:23:24 2005
DUMP: Dumping /dev/sdb6 (/srv) to /dev/nst0
DUMP: Label: /srv
DUMP: Writing 10 Kilobyte records
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 58195461 blocks.
DUMP: Volume 1 started with block 1 at: Fri Nov 25 12:24:28 2005
DUMP: 0.00% done at 3 kB/s, finished in 4931:10
DUMP: 0.00% done at 3 kB/s, finished in 4947:55
DUMP: Interrupt received.
DUMP: Do you want to abort dump?: ("yes" or "no") DUMP: Interrupt
received.
DUMP: Interrupt received.
DUMP: Do you want to abort dump?: ("yes" or "no") DUMP: Do you want
to abort dump?: ("yes" or "no") yes
DUMP: The ENTIRE dump is aborted.
village:~#
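A rough sanity check on those progress lines, assuming the blocks in
dump's estimate are 1 kB each (that is my assumption, not something
shown in the output) and taking the printed 3 kB/s at face value. The
function name eta_hours is hypothetical.

```shell
#!/bin/sh
# eta_hours KILOBYTES RATE_KB_PER_S   (hypothetical helper)
# Whole hours needed to move KILOBYTES at the given rate,
# using integer arithmetic (fractions are truncated).
eta_hours() {
    echo $(( $1 / $2 / 3600 ))
}

estimated_kb=58195461   # "estimated 58195461 blocks", assumed 1 kB each
rate_kb_per_s=3         # "3 kB/s" from the progress lines

hours=$(eta_hours "$estimated_kb" "$rate_kb_per_s")
echo "about $hours hours ($(( hours / 24 )) days) at $rate_kb_per_s kB/s"
```

That comes out in the same range as the "finished in 4931:10" that dump
printed; the difference is presumably because dump works from an
unrounded rate.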
And that's what it does... Everything looks fine through Pass II, then
the tape drive starts, and it seems to be streaming just fine with no
shoeshining of the tape at all...except...the hard drive is showing
hardly any activity at all. What is it streaming? The tape motor is
running, the tape drive activity light is on...but the hard drive is
sending...nothing. Finally, as you can see above, dump ends up spending
so much time waiting to start Pass III that it prints its percent-done
progress messages before it even gets that far. (And the progress
messages say it's doing 3 kB/s, and that it will finish in about five
thousand hours.) All the while the tape drive motor is running, and
it's happily streaming *something*, though with nothing coming from the
hard drive, I can't imagine what. If I do 'mt tell' after aborting it,
I get:
village:~# mt tell
At block 3980.
So I'm now almost two megabytes into the tape, yet dump never even got
to Pass III, so it hadn't started archiving any data. Very, very
weird.
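The "almost two megabytes" figure is just the block position times the
block size, assuming 'mt tell' counts in the same 512-byte blocks that
'mt status' reported. The function name tape_offset_bytes is made up
for illustration.

```shell
#!/bin/sh
# tape_offset_bytes BLOCKS BLOCK_SIZE   (hypothetical helper)
# Byte offset corresponding to a block position on tape.
tape_offset_bytes() {
    echo $(( $1 * $2 ))
}

blocks=3980       # "At block 3980." from 'mt tell'
block_size=512    # "Tape block size 512 bytes" from 'mt status'

bytes=$(tape_offset_bytes "$blocks" "$block_size")
echo "$bytes bytes, about $(( bytes / 1024 )) KiB"
```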
Sorry about the long detailed message -- usually if you post a brief
message, nobody can help you, so I thought I'd post the whole thing.
What I'm hoping someone here can do is just tell me what *they* do for
successful multivolume dumps, which can be verified by 'restore -C' as
good. I've seen lots of posts in the dump mailing lists from people
with this problem, and elsewhere, but no solutions, and it doesn't seem
so much a dump problem as a general tape drive problem, since it's the
SCSI tape driver that seems to create this scenario, and it would
probably happen just the same with tar or anything that tries to span
data across tape volumes on Linux.
Can anyone help with this? If you're successfully writing (and
verifying!) multivolume tape archives, how are you doing it? The
problem only shows up on a 'restore -C' -- during the write, it looks
okay -- so if you don't actually verify, you don't really know that you
aren't having this problem, too.
--
+ Brent A. Busby, UNIX Systems Admin + "It's like being +
+ James Franck / Enrico Fermi Institute + blindsided by a +
+ The University of Chicago + flying dwarf..." +