From: Eric R. <ros...@ma...> - 2004-12-08 23:23:53
Quoting Christopher Blunck <ch...@wx...>:

> On Wed, Dec 08, 2004 at 02:10:57PM -0600, Eric Rostetter wrote:
> > > I have not fully tested the recovery of a system backed up in this
> > > way. That is, as ever, left as an exercise for the reader.
> >
> > *grin*
> >
> > Please folks, remember that no matter how you do your backups, you
> > need to test them to make sure they work, and that you can restore files
> > and/or filesystems from them. Backups are worthless if you can't restore
> > (recover) from them!
>
> The problem I've run into in the past is that I test my backup/restore
> scripts in a "clean room" (aka static filesystem).

That's really all I mean. You can't test everything. But you can test
whether your tape drive is physically bad, whether your file selection
criteria work, whether your documentation/instructions work, etc.

> When I tested them, I hadn't even considered that dump might have
> problems when the filesystem was being actively modified.

Well, I considered this, but I did the math and figured it was worth the
risk in most cases. In cases where it wasn't worth the risk, we did
mirror splitting or snapshots.

> Even when I realized this was a potential data consistency problem,
> it was difficult to simulate in my "clean room".

I can see that. Which is why I didn't try. I just did the math and picked
the "best" solution (based on risk, data importance, etc.).

> Given that you're working with some very low level I/O characteristics
> of the kernel, it becomes exceedingly difficult to emulate all
> environments under which your system will be loaded. Given that
> challenge, it is equally difficult to thoroughly test your
> backup/recovery system.

True. And people are lazy, so they won't *thoroughly* test it on a
regular basis anyway. I do a simple test every six months: I basically
pick something at random and make sure I can restore it. Lets me know
everything is working (or not). Of course, there could be fringe cases
(file corruption due to a file being written while being backed up), but
that isn't the type of testing I was talking about.

> I think the best you can shoot for is to evaluate the strategy and
> implementation as best as possible, and publicize its tradeoffs such that
> everyone knows the consequences. Right now we have 99% consistency in our
> backups, but "the potential exists" for corrupted files on our backup. If
> you want to go to 100%, either we need to umount the filesystem or we need
> to look into other hardware solutions (doubly mirrored RAID sets, one of
> which can be taken offline and dumped).

Exactly. And then revisit it from time to time. We used to do the split
mirror (because we didn't have, or couldn't afford, snapshots). Now we do
snapshots (we can afford them now). So the point is: be sure to revisit
things over time, as loads change, technologies change, costs go down,
etc.

Heck, I never, ever thought I'd move from tapes to disks for backups. Yet
I just did, about a month or two ago. The tape drive broke, and the cost
to replace it was too high; I could buy dozens of 250GB firewire/USB
drives for the cost of fixing/replacing the tape drive. So we rotate over
the disks now instead of rotating over the tapes. Same concept, but much
faster access times. Restoring a single file now takes seconds rather
than minutes or hours... Only downside to the disk drives is higher power
consumption... They put out about the same heat, take up about the same
space, and I don't have to change tapes any more.
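As an illustration only, here is a minimal sketch (in Python) of the kind of
six-month spot check described above. It assumes the backups are dump(8)
archives stored as plain files on whichever disk is currently in the
rotation, and that a standard restore(8) supporting the usual -t/-x/-f
options is on the PATH; the BACKUP_DIR and home.dump names are hypothetical
placeholders. It lists the archive, picks one entry at random, and extracts
it into a scratch directory to prove the archive is readable.

#!/usr/bin/env python3
# Sketch of a periodic spot-check restore test (assumptions as noted above).

import os
import random
import subprocess
import tempfile

BACKUP_DIR = "/backup/current"                   # hypothetical mount point
ARCHIVE = os.path.join(BACKUP_DIR, "home.dump")  # hypothetical archive name


def list_entries(archive):
    """List the paths in the archive via 'restore -t' (inode, then path)."""
    out = subprocess.run(["restore", "-t", "-f", archive],
                         capture_output=True, text=True, check=True).stdout
    entries = []
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and parts[1] not in (".", "./"):
            entries.append(parts[1])
    return entries


def spot_check(archive):
    """Pick one entry at random and prove it can actually be extracted."""
    victim = random.choice(list_entries(archive))
    with tempfile.TemporaryDirectory() as scratch:
        # 'restore -x' extracts relative to the cwd and asks whether to set
        # owner/mode on '.'; feed it "n" so the run stays non-interactive.
        subprocess.run(["restore", "-x", "-f", archive, victim],
                       cwd=scratch, input="n\n", text=True, check=True)
        restored = os.path.join(scratch, victim)
        if not os.path.exists(restored):
            raise RuntimeError("restore ran but %s did not come back" % victim)
        print("OK: restored %s" % victim)
        # Caveat: 'restore -t' does not distinguish files from directories, so
        # this can occasionally pull back a whole subtree; a real script would
        # filter against a file manifest recorded at dump time.


if __name__ == "__main__":
    spot_check(ARCHIVE)

Run something like this from cron a couple of times a year and mail yourself
the output; it only proves the archive is readable and restorable, not that
every file in it is consistent.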
But 6 months ago I was swearing I'd never do it, and telling people they
were crazy for doing it (which they were, actually, as they were backing up
to a single drive; lose the drive and you lose all your backups!)

> My personal experience has been that most managers do not consider 99% to
> be sufficient coverage from the get-go. However, when you explain to them
> that operational storage costs will triple in order to achieve that
> remaining 1%, and that the 200% increase in storage cost could be spent in
> more intelligent manners (hire another SA or developer to increase project
> efficiency), the 99% number doesn't seem so bad anymore. :)

Like everything in life, no?

> Of course, you could also make arguments like "you have a higher
> probability of dying in your car on the way to work in the morning than we
> have of data loss in our backups" but that's comparing apples to oranges...
>
> -c

--
Eric Rostetter