Re: [smartmontools-support]What are "offline_uncorrectables"?


On Mon, Nov 22, 2004 at 08:47:54AM -0800, Wright, Ryan P wrote:
> List,
> 
> I operate a large archive with close to 700 PATA disks. Most are 250GB
> Western Digital, some are 200GB (also WD). I just began using
> smartmontools to analyze the drives and have found the
> offline_uncorrectable errors to be a big problem.

Yes, they are a big problem.  The drive is telling you that it can't
read a sector that it was asked to.

> First, any drive with an offline_uncorrectable will not resync to the
> array (Linux, software RAID 5 on 3Ware 78xx series controllers). They
> seem to operate fine otherwise, but if another drive dies and the array
> begins resyncing, at some point during the resync the kernel will
> encounter errors on the drive with the offline_uncorrectable and will
> kick it out of the array (= double disk failure).

You need to replace drives that develop these errors if you care about
your data.

> Second, sometimes an offline_uncorrectable just fixes itself. It will be
> there one day and the next day the SMART data will show it with a value
> of 0.

It's possible that the data later became readable and the sector was
then relocated.  Did Reallocated_Sector_Ct go up as well?

Or maybe there's a firmware bug erasing the SMART error logs, if the UNC
errors really did "magically" disappear.
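
To tell those two cases apart, you can compare the attribute tables from
two consecutive days.  Below is a rough sketch, assuming the text was
saved from "smartctl -A" and uses its usual ten-column layout.  The
sample values in DAY1 and DAY2 are made up for illustration, and the
function names and result labels are my own, not anything smartmontools
provides.

```python
def parse_attributes(text):
    """Map attribute name -> raw value from `smartctl -A`-style table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header line and blank lines are skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def classify_change(before, after):
    """Label what happened to Offline_Uncorrectable between two snapshots."""
    unc_before = before.get("Offline_Uncorrectable", 0)
    unc_after = after.get("Offline_Uncorrectable", 0)
    if unc_after >= unc_before:
        return "not_cleared"
    # The count dropped: did the reallocation count rise at the same time?
    if after.get("Reallocated_Sector_Ct", 0) > before.get("Reallocated_Sector_Ct", 0):
        return "cleared_with_reallocation"
    return "cleared_without_reallocation"


# Illustrative (invented) snapshots, not output from a real drive.
DAY1 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       1
"""
DAY2 = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(classify_change(parse_attributes(DAY1), parse_attributes(DAY2)))
```

A "cleared_without_reallocation" result would be the case worth a closer
look, since nothing in the attributes explains where the error went.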

> Can anyone help me to understand more about the nature of these errors,
> how they occur, what I can do about them, etc?

They occur when the drive's internal retry logic is not sufficient to
read the data stored in a particular sector.  When the drive gives up,
it logs a SMART error, and if the read was requested from the bus
interface, it throws an error that the kernel IDE driver will also log
in the kernel message buffer.

Reasons for a sector becoming unreadable could be temperature, magnetic
drift, magnetic or physical disturbance, defective media, etc.  The
point is that the drive couldn't read a sector.  The question you need
to ask yourself is, is this really an isolated event?  What is the
likelihood that other sectors on the drive, especially those in close
proximity to the UNC event that already occurred, might become
unreadable?  Is the drive operating in a safe and stable environment
with respect to heat and its power source?

The bottom line is that when a UNC event happens on an allocated
sector, you have already lost data.  I would not trust such a drive to
be reliable in the future.  Either replace or RMA it if your application
is mission critical; if it is not, then you can try to get rid of the
error by following these steps, which will either stabilize that sector
or get it relocated:
http://smartmontools.sourceforge.net/BadBlockHowTo.txt

but remember that the drive has already lost data, and take appropriate
action based on that knowledge.

Scheduling periodic offline tests with email to the admin is an
extremely good idea if you have not already done so.  These tests will
not only inform you of status changes in the drives, but they can also
serve as canaries in the mine.  A sector that is read once a year by an
application might become unreadable after six months of non-use.  If you
have weekly tests scheduled, then the drive will notice that it has to
retry reads of that sector as it deteriorates, and the sector will then
be relocated.
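
smartd can do the scheduling and the mailing by itself (see the -s and
-m directives in smartd.conf), so you should not need to write anything.
Purely to illustrate the "tell the admin when something changes" logic,
here is a small sketch.  The watched attribute list, the build_alert
name, and the admin@example.com address are my own assumptions, and
actually sending the message is left to the caller.

```python
from email.message import EmailMessage

# Attributes whose raw value changing is worth a mail (an assumed selection).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def build_alert(device, previous, current, admin="admin@example.com"):
    """Return an EmailMessage describing changed raw values, or None if none changed.

    `previous` and `current` map attribute name -> raw value, taken after
    two successive test runs.
    """
    changes = []
    for name in WATCHED:
        old = previous.get(name, 0)
        new = current.get(name, 0)
        if old != new:
            changes.append(f"{name}: {old} -> {new}")
    if not changes:
        return None
    msg = EmailMessage()
    msg["To"] = admin
    msg["Subject"] = f"SMART change on {device}"
    msg.set_content("\n".join(changes))
    return msg


if __name__ == "__main__":
    last_week = {"Reallocated_Sector_Ct": 0, "Offline_Uncorrectable": 0}
    this_week = {"Reallocated_Sector_Ct": 0, "Offline_Uncorrectable": 1}
    alert = build_alert("/dev/hda", last_week, this_week)
    if alert is not None:
        print(alert)
```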

Something that I've thought would be useful for smartmontools to perform
is a surface refresh of the drive.  Read each sector, then, assuming RLL
encoding, write data to the drive that would push the magnetic field to
opposite ends of its encoding range, then write the original data back.  It's
possible that offline tests already do this, but I've never heard of
such a thing.  I also don't know of potential drawbacks to this method,
or whether or not there is a way to actually do that since the drive
firmware is always in the way.  Based on the method in the Bad Blocks
HOWTO, it seems that it would be possible.
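
To make the idea concrete, here is a minimal sketch of the
read/invert/restore loop.  It is only an illustration under several
assumptions: it works on any seekable file-like object (an in-memory
buffer in the example), it uses a bitwise complement as a stand-in for
"opposite ends of the encoding range" (the drive's own encoding may not
map inverted bytes to inverted magnetization), and it ignores OS and
drive write caches, which on a real device could keep the intermediate
write from ever reaching the platter.  An interruption between the two
writes would leave a sector holding inverted data, so this is not
something to point at a disk you care about.

```python
import io

SECTOR_SIZE = 512  # assumed sector size


def refresh_surface(dev, sector_size=SECTOR_SIZE):
    """Rewrite every sector: write its bitwise complement, then the original.

    `dev` is a seekable binary file-like object opened for read and write.
    Returns the number of sectors (including a final partial one) processed.
    """
    index = 0
    while True:
        offset = index * sector_size
        dev.seek(offset)
        data = dev.read(sector_size)
        if not data:
            break  # end of device
        inverted = bytes(b ^ 0xFF for b in data)
        dev.seek(offset)
        dev.write(inverted)  # intermediate pattern
        dev.seek(offset)
        dev.write(data)      # restore the original contents
        index += 1
    return index


if __name__ == "__main__":
    original = bytes(range(256)) * 5          # 1280 bytes: 2 full sectors + 1 partial
    fake_disk = io.BytesIO(original)
    count = refresh_surface(fake_disk)
    print(count, fake_disk.getvalue() == original)
```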

Good luck,

-- 
Ryan Underwood, <ne...@ic...>
