From: Ryan U. <nem...@ic...> - 2004-11-22 17:55:19
On Mon, Nov 22, 2004 at 08:47:54AM -0800, Wright, Ryan P wrote:
> List,
>
> I operate a large archive with close to 700 PATA disks. Most are 250GB
> Western Digital, some are 200GB (also WD). I just began using
> smartmontools to analyze the drives and have found the
> offline_uncorrectable errors to be a big problem.

Yes, they are a big problem. The drive is telling you that it can't read
a sector that it was asked to.

> First, any drive with an offline_uncorrectable will not resync to the
> array (Linux, software RAID 5 on 3Ware 78xx series controllers). They
> seem to operate fine otherwise, but if another drive dies and the array
> begins resyncing, at some point during the resync the kernel will
> encounter errors on the drive with the offline_uncorrectable and will
> kick it out of the array (= double disk failure).

You need to replace drives that develop these errors if you care about
your data.

> Second, sometimes an offline_uncorrectable just fixes itself. It will be
> there one day and the next day the SMART data will show it with a value
> of 0.

It's possible that the data later became readable and the sector was
then relocated. Did the Reallocated_Sector_Count go up also? Or maybe
there's a firmware bug erasing the SMART error logs, if the UNC errors
really did "magically" disappear.

> Can anyone help me to understand more about the nature of these errors,
> how they occur, what I can do about them, etc?

They occur when the drive's internal retry logic is not sufficient to
read the data stored in a particular sector. When the drive gives up, it
logs a SMART error, and if the read was requested from the bus
interface, it throws an error that the kernel IDE driver will also log
in the kernel message buffer.

Reasons for a sector becoming unreadable could be temperature, magnetic
drift, magnetic or physical disturbance, defective media, etc. The point
is that the drive couldn't read a sector.

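On the question of whether the reallocated count went up when the
offline_uncorrectable value dropped back to 0: one way to check is to
save the `smartctl -A` output each day and compare the raw values. Below
is a rough Python sketch of that comparison, not a tested tool. It
assumes the usual ten-column attribute table and the attribute names
Offline_Uncorrectable and Reallocated_Sector_Ct (which is how smartctl
usually abbreviates the reallocated sector count). The function names
and the sample rows are made up for illustration.

```python
def parse_raw_values(smartctl_output):
    """Map attribute name -> integer raw value from `smartctl -A` text."""
    raw = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10
        # columns; the header and banner lines do not.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                raw[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return raw


def explain_change(before, after):
    """Classify what happened between two snapshots of raw values."""
    unc = "Offline_Uncorrectable"
    realloc = "Reallocated_Sector_Ct"  # assumed attribute name
    unc_delta = after.get(unc, 0) - before.get(unc, 0)
    realloc_delta = after.get(realloc, 0) - before.get(realloc, 0)
    if unc_delta > 0:
        return "more unreadable sectors"
    if unc_delta < 0 and realloc_delta > 0:
        return "sectors were relocated"
    if unc_delta < 0:
        return "count dropped without relocation"
    return "no change"


# Made-up sample rows in the layout smartctl -A prints.
yesterday = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       1
"""
today = yesterday.replace("-       0", "-       1", 1).replace(
    "Offline      -       1", "Offline      -       0")

print(explain_change(parse_raw_values(yesterday), parse_raw_values(today)))
```

A result of "count dropped without relocation" is the case that would
point toward the sector having become readable again in place, or toward
the firmware clearing its counters.
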
The question you need to ask yourself is, is this really an isolated
event? What is the likelihood that other sectors on the drive,
especially those in close proximity to the UNC event that already
occurred, might become unreadable? Is the drive operating in a safe and
stable environment with respect to heat and its power source?

The bottom line is that when a UNC event happens on an allocated sector,
you have already lost data. I would not trust such a drive to be
reliable in the future. Either replace or RMA it if your application is
mission critical; if it is not, then you can try to get rid of the error
by following these steps, which will either stabilize that sector or get
it relocated:

http://smartmontools.sourceforge.net/BadBlockHowTo.txt

but remember that the drive has already lost data, and take appropriate
action based on that knowledge.

Scheduling periodic offline tests with email to the admin is an
extremely good idea if you have not already done so. These tests will
not only inform you of status changes in the drives, but they can also
serve as canaries in the mine. A sector that is read once a year by an
application might become unreadable after six months of non-use. If you
have weekly tests scheduled, then the drive will notice that it has to
retry reads of that sector as it deteriorates, and the sector will then
be relocated.

Something that I've thought would be useful for smartmontools to perform
is a surface refresh of the drive. Read each sector, then, assuming RLL
encoding, write data to the drive that would push the magnetic field to
opposite ends of its encoding range, then write the original data back.
It's possible that offline tests already do this, but I've never heard
of such a thing. I also don't know of potential drawbacks to this
method, or whether there is a way to actually do that since the drive
firmware is always in the way. Based on the method in the Bad Blocks
HOWTO, it seems that it would be possible.

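To make the surface refresh idea a little more concrete, here is a
minimal Python sketch of the read / write-opposite / write-back loop.
It is only an illustration under several assumptions: a 512-byte sector
size, the bitwise complement standing in for "opposite" data (the drive
re-encodes everything, so this may not correspond to the opposite
magnetic state at all), and a seekable file-like object in place of a
real device. The name refresh_surface is made up. I have not tried this
on a raw disk, and the OS buffering and the drive's own write cache
could easily merge the two writes so that only the final data reaches
the platter.

```python
import io

SECTOR_SIZE = 512  # assumed sector size in bytes


def refresh_surface(dev, sector_size=SECTOR_SIZE):
    """Rewrite every readable sector in place; return unreadable offsets.

    `dev` is any seekable binary file-like object opened for read/write.
    """
    unreadable = []
    offset = 0
    while True:
        dev.seek(offset)
        try:
            data = dev.read(sector_size)
        except OSError:
            # The read failed (what the drive would report as UNC);
            # note the offset and move on to the next sector.
            unreadable.append(offset)
            offset += sector_size
            continue
        if not data:
            break  # end of device
        # Assumption: the complement is the "opposite" pattern.
        inverted = bytes(b ^ 0xFF for b in data)
        dev.seek(offset)
        dev.write(inverted)
        dev.flush()  # does not by itself force a write to the platter
        dev.seek(offset)
        dev.write(data)  # put the original contents back
        dev.flush()
        offset += len(data)
    return unreadable


# Try it on an in-memory buffer rather than a disk.
fake_disk = io.BytesIO(bytes(range(256)) * 4)
print(refresh_surface(fake_disk))
```

Note that a crash or power loss between the two writes would leave the
complemented data on disk, which is one obvious drawback of doing this
from the host side.
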
Good luck,

--
Ryan Underwood, <ne...@ic...>