From: Wright, R. P <Rya...@pn...> - 2004-11-22 18:34:47
|
Ryan, Thank you for your response.

>> You need to replace drives that develop these errors if you
>> care about your data.

This is what I've been doing. However, the rate of drive replacement seems
excessive to me. Out of ~700 drives, I've replaced just under 100 since last
December, most of them over the last couple of months since I discovered
smartmontools and began frantically replacing drives with
offline_uncorrectables.

Most of the drives are ~2 years old and run 24x7; however, they receive
little regular use. My archive is primarily write once, read rarely if ever.
It's long-term mass storage. I maintain regular tape backups, and of course
all data is RAID 5 (software; the hardware RAID on the 78xx controllers is
terrible performance-wise).

>> It's possible that the data later became readable and the
>> sector was then relocated. Did the Reallocated_Sector_Count
>> go up also?

It did not. To give one example, on Saturday smartd sent me a message:

"The following warning/error was logged by the smartd daemon:
Device: /dev/twe1 [3ware_disk_04], 1 Offline uncorrectable sectors"

So I checked out the drive in question this morning, and here's what I got:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500JB-00EVA0
Serial Number:    WD-WMAEH1162555
Firmware Version: 15.05R15
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 22 10:15:31 2004 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time            0x0007 133   122   021    Pre-fail Always      -       3850
  4 Start_Stop_Count        0x0032 100   100   040    Old_age  Always      -       16
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000b 200   200   051    Pre-fail Always      -       0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always      -       9547
 10 Spin_Retry_Count        0x0013 100   253   051    Pre-fail Always      -       0
 11 Calibration_Retry_Count 0x0013 100   253   051    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always      -       16
194 Temperature_Celsius     0x0022 122   253   000    Old_age  Always      -       28
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always      -       2
198 Offline_Uncorrectable   0x0012 200   200   000    Old_age  Always      -       0
199 UDMA_CRC_Error_Count    0x000a 200   253   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0009 200   155   051    Pre-fail Offline     -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure      90%         805         137788807
# 2  Short offline       Completed without error      00%         781         -
# 3  Extended offline    Completed: read failure      10%         778         431282233
# 4  Short offline       Completed without error      00%         758         -
# 5  Short offline       Completed without error      00%         710         -
# 6  Short offline       Completed without error      00%         686         -
# 7  Short offline       Completed: read failure      90%         662         137788807
# 8  Short offline       Completed without error      00%         638         -
# 9  Short offline       Completed: read failure      90%         614         137788807
#10  Extended offline    Completed without error      00%         610         -
#11  Short offline       Completed without error      00%         590         -
#12  Short offline       Completed without error      00%         566         -
#13  Short offline       Completed without error      00%         542         -

So the offline_uncorrectable is gone. However, I can see the read failures
in the tests, so something's up with the drive. At this point I have no idea
what, nor what to do with it, as it appears to be running fine otherwise. I
also don't understand why tests show read failures, but subsequent tests can
complete without error.

This is just one example of many; at any given time I can have several
drives in this state. I pull them and replace them with new ones, but I want
to understand them further, as this seems like an unusually high failure
rate.

Also, if I test the drive with Western Digital diagnostics, it will tell me
there's nothing wrong with it (or it will say "There was a problem, but I
fixed it" and subsequent tests come up clean; I never reuse these drives, as
I don't trust WD diags). Western Digital believes these problems will go
away if I switch to their "SB" model "Raid Edition" drives. As the BB & JB
drives fail, I've been replacing them with SB models, so we'll see.

>> drive couldn't read a sector. The question you need to ask
>> yourself is, is this really an isolated event? What is the
>> likelihood that other sectors on the drive, especially those
>> in close proximity to the UNC event that already occurred,
>> might become unreadable? Is the drive operating in a safe
>> and stable environment with respect to heat and its power source?

All servers are in a raised-floor data center: UPS-stabilized power at 208V,
room temperature kept below 60 degrees, three redundant power supplies per
server. I have perf tiles in front of each rack, and drive temperatures have
always looked OK; the example above is at 28C, which is pretty standard
across the cluster.

>> Scheduling periodic offline tests with email to the admin is
>> an extremely good idea if you have not already done so.

I run the short tests nightly, and the extended tests weekly.
Also, two weeks ago I began taking a nightly snapshot of the SMART data for
each drive and storing it in a database so I can track changes. I'm
currently working on a reporting tool to make this data usable.

-Ryan
|
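Ryan's nightly snapshot scheme can be sketched as a small shell job. The
3ware port numbers, device name, and log path below are assumptions; a real
deployment would insert the rows into the database rather than a flat file.

```shell
# Append one dated row per SMART attribute per disk, so that changes in the
# raw values (reallocations, pending sectors) can be tracked over time.
TODAY=$(date +%Y-%m-%d)
for port in 0 1 2 3; do
    smartctl -A -d 3ware,$port /dev/twe1 2>/dev/null |
    awk -v day="$TODAY" -v disk="twe1/$port" \
        '$1 ~ /^[0-9]+$/ { print day, disk, $2, $NF }'
done >> smart-history.log
```

Each emitted line then looks like a date, a disk identifier, an attribute
name, and its raw value, which is easy to load into a table or diff against
the previous night's run.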
From: Wright, R. P <Rya...@pn...> - 2004-11-24 16:37:20
|
>> I have a system using software RAID1 and one of the disks
>> in the array has logged unreadable sectors during self tests.
>>
>> First, are unreadable sectors the same as offline_uncorrectables?

Not necessarily. From what I've gathered (thank you to all for the
excellent conversation on this issue), an offline_uncorrectable is a read
error that could not be (or has yet to be) corrected. Writing to the sector
seems to force the drive to correct it.

Use smartctl to grab a list of your drive attributes; if you have
offline_uncorrectables, they'll show up here:

198 Offline_Uncorrectable   0x0012 200   200   000    Old_age  Always      -       0

The number at the end (0 for this drive) should be zero if everything is
healthy.

>> And two, since one of my disks has unreadable sectors,
>> should I be concerned that the array will not sync at a future
>> time if for some reason the array is broken? Currently, the
>> array is good according to 'cat /proc/mdstat'...

Yes, I would be concerned. It seems the Linux RAID implementation is a bit
picky. If you lose a different drive and a resync starts, the drive with
the unreadable sectors could be kicked out of the array by the kernel. If
that happens, you've got a double-disk failure and your resync immediately
stops.

Now, it is possible to recover from this; I've done it many times. It
involves carefully crafting a new /etc/raidtab, forcing a rewrite of the
superblocks on the drives, and bringing the array back online in a degraded
state. This usually allows you to copy 99%-100% of the data off the array
to another safe place. Many drives that are in bad enough shape to be
kicked out of the array during a resync will "hold up" to a copy (i.e., the
kernel will let them stay in the array).

However, I would still consider replacing the disk with unreadable sectors
now.

-Ryan
|
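The degraded-array trick Ryan describes was done with the raidtools of that
era. As a hedged sketch (the device names, disk count, and which member
failed are all assumptions), the crafted /etc/raidtab marks the lost member
with failed-disk so the superblocks can be rewritten and the array started
degraded:

```
# /etc/raidtab for a 4-disk RAID-5 with member 3 lost (names are examples)
raiddev /dev/md0
    raid-level            5
    nr-raid-disks         4
    chunk-size            64
    persistent-superblock 1
    device                /dev/sda1
    raid-disk             0
    device                /dev/sdb1
    raid-disk             1
    device                /dev/sdc1
    raid-disk             2
    device                /dev/sdd1
    failed-disk           3
```

The chunk-size and the order of the device lines must match the original
array exactly; after triple-checking them, `mkraid --really-force /dev/md0`
rewrites the superblocks and brings /dev/md0 up degraded so the data can be
copied off.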
From: Bruce A. <ba...@gr...> - 2004-11-24 21:03:33
|
>> First, are unreadable sectors the same as offline_uncorrectables?

There are two ways of discovering unreadable sectors.

(1) The operating system, on behalf of user or kernel code, tries to read a
sector of the disk, and the read fails.

(2) The disk firmware, while executing a (so-called offline) short or long
self-test, finds an unreadable sector.

Method (1) leads to 'trouble'. User data can't be read, errors appear in
the system logs, files are damaged, sysadmins get cell phone calls and gray
hair.

Method (2) is less stressful. It may be that the OS/file system has NO data
stored at the unreadable sector. Or it may be part of a file that is never
needed by users or the OS in your particular installation.

Unreadable sectors found through method (1) increment the 'Current pending
sector' count. Unreadable sectors found through method (2) increment the
'Offline uncorrectable sectors' count.

Cheers,
Bruce
|
From: Pete <pe...@co...> - 2004-11-24 21:58:02
|
Ah! Thanks for the explanation Bruce, and thanks to everyone else who
responded to my question.

/Peter

----- Original Message -----
From: "Bruce Allen" <ba...@gr...>
To: "Wright, Ryan P" <Rya...@pn...>
Cc: "pslists" <pls...@wa...>; "Ryan Underwood" <nem...@ic...>;
    "Smartmontools Mailing List" <sma...@li...>
Sent: Wednesday, November 24, 2004 1:02 PM
Subject: RE: [smartmontools-support] What are "offline_uncorrectables"?

_______________________________________________
Smartmontools-support mailing list
Sma...@li...
https://lists.sourceforge.net/lists/listinfo/smartmontools-support
|
From: Sebastian V. <seb...@he...> - 2004-11-23 11:28:26
|
On Monday 22 November 2004 20:34, Wright, Ryan P wrote:
> 197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always      -       2
>
> So the offline_uncorrectable is gone. However, I can see the read
> failures on the tests, so something's up with the drive. At this point I
> have no idea what, nor what to do with it, as it appears to be running
> fine otherwise. I also don't understand why tests show read failures,
> but subsequent tests can complete without error.

As can be seen above, the drive firmware has put the failing sectors into
pending mode, meaning it will reallocate them later. Normally, reallocation
should happen when someone tries to write to the sector.

In fact, this is what the later models of 3ware controllers are apparently
doing to correct failed sectors: they use the redundant data to rewrite the
sector. I have no idea if the 7800 models supported this functionality, and
your use of software RAID makes this functionality unavailable to you
anyway.

> Also, if I test the drive with Western Digital diagnostics, it will tell
> me there's nothing wrong with it (or it will say "There was a problem,
> but I fixed it" and subsequent tests come up clean - I never reuse these
> drives as I don't trust WD diags).

It probably forces a reallocation to take place. After that, the drive will
work normally, but it means you have lost the data on those sectors, as can
be expected.

Now, apart from the problem that the Linux kernel kicks drives with failing
sectors out of the array, one failing sector doesn't quite mean the whole
drive has failed. In fact, this is one reason for using RAID in the first
place. There are different viewpoints on how many failed sectors are too
many to trust the drive. Some say no failed sectors should be allowed;
others allow a few per year of lifetime. You will have to decide for
yourself what is acceptable to you.

I believe that the HDD manufacturers do try to test the discs that are
marked for enterprise use more thoroughly.
The manufacturers also think that a few failing sectors are acceptable for
consumer-grade disks.

I have heard that some people are using badblocks as a stress test for all
new drives before they are put into the RAID array. The idea is that if the
drive develops bad sectors during the badblocks run, it's not worth putting
in the array in the first place.

Sebastian

ps. Your archive is way beyond what I'm using smartmontools for, so take my
comments as an outside observation.
|
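The burn-in Sebastian mentions might look like the sketch below. The device
name is an example, and the destructive write test is commented out because
it erases the entire drive, which is only acceptable before the drive holds
data or joins an array.

```shell
# Four-pass destructive pattern test over the whole drive, logging any
# sectors that fail (run ONLY on a drive that holds no data):
#   badblocks -w -s -o new-badblocks.txt /dev/hdc
# Afterwards, also check whether the run consumed spare sectors:
#   smartctl -A /dev/hdc | grep Reallocated_Sector_Ct

# Accept the drive only if the badblocks log stayed empty:
if [ ! -s new-badblocks.txt ]; then
    echo "drive passed burn-in"
fi
```

A drive that grows bad sectors or reallocations during burn-in goes back to
the vendor instead of into the array.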
From: Bruce A. <ba...@gr...> - 2004-11-23 11:42:53
|
> As can be seen above the drive firmware has put the failing sectors into
> pending mode, meaning it will reallocate them later. Normally
> reallocation should happen when someone tries to write to the sector.
>
> In fact this is what the later models of 3ware controllers are apparently
> doing to correct failed sectors. They use the redundant data to rewrite
> the sector. I have no idea if the 7800 models supported this
> functionality and your usage of software raid makes this functionality
> unavailable to you anyway.

Sebastian, good point!

Is it really true that there is nothing in the software RAID subsystem that
can be used to rewrite a drive with unreadable sectors? Without this,
software RAID can't cope properly with the most common drive 'soft
failure/data loss' mode.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-23 14:00:12
|
Hello,

On Tue, 2004-11-23 at 12:42, Bruce Allen wrote:
> Is it really true that there is nothing in the software RAID subsystem
> that can be used to rewrite a drive with unreadable sectors? Without
> this, software RAID can't cope properly with the most common drive 'soft
> failure/data loss' mode.

Interesting point. I know there are RAID optimization patches floating
around that keep a "list/bitmap of unsynced blocks" for any not-yet-fully
redundant drive.

In the current (Linux) SoftRAID code, a partition is always synced from its
first block to its last (simple, but suboptimal).

Using the patch, and a user tool to mark blocks unsynced, this could be a
good combo.

I will search my bookmarks/Google for it.

Leon.
|
From: Bruce A. <ba...@gr...> - 2004-11-23 16:01:12
|
Leon,

If you can figure this out, how about writing a short howto:

    HOWTO FIX BAD BLOCKS WITH SOFTWARE RAID

which can either live in the mailing list archives, or I can post it under
the smartmontools web pages.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-24 00:36:39
|
Hello all,

Bruce Allen wrote:
> If you can figure this out, how about writing a short howto:
>     HOWTO FIX BAD BLOCKS WITH SOFTWARE RAID
> which can either live in the mailing list archives or I can post it
> under the smartmontools web pages.

At least I was able to find the modified Software RAID for Linux, which
does bitmap-based resyncs. The project is called "FastRAID".

The README is here; it makes for an interesting read, as it describes the
bitmap that marks dirty (unsynced) blocks on the RAID array:

http://cvs.sourceforge.net/viewcvs.py/fr5/fr5/README?rev=1.1.1.1&view=markup

I have contacted Peter to ask him about a user-space way of dirtying the
bitmap. Of course, I could write a tool along with documentation; but until
the user-space tool is there, there is no way of doing this, I guess.

Regards,
Leon Woestenberg.
|
From: Bruce A. <ba...@gr...> - 2004-11-24 05:18:41
|
> At least I was able to find the modified Software RAID for Linux,
> which does bitmap-based resyncs. The project is called "FastRAID".
>
> The README is here. It makes for an interesting read as it describes
> the bitmap that marks dirty (unsynced) blocks on the RAID array.
>
> http://cvs.sourceforge.net/viewcvs.py/fr5/fr5/README?rev=1.1.1.1&view=markup
>
> I have contacted Peter to ask him about a user-space way of dirtying
> the bitmap.
>
> Of course I could write a tool along with documentation. Until the
> user space tool is there, there is no way of doing this I guess.

It would be terrific to try to integrate this with smartmontools, so that
we provide failing LBAs, and then Peter or you provide a tool that forces
rewrites of those bad blocks with the correct data from the redundant
disks.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-24 11:02:40
|
Hello all,

regarding recovery of soft failures on a disk in a RAID-1/5/6 array.

A soft-failing bad block (512 bytes) is a block whose stored data cannot be
read back correctly (this is probably detected through ECC error
correction/detection coding). Writing to this block will (probably) have
the drive reallocate the block. The idea is that even if one disk in the
array shows soft failures, we should be able to reconstruct the damaged
data from the redundancy in the array.

On Wed, 2004-11-24 at 06:18, Bruce Allen wrote:
> It would be terrific to try and integrate this with smartmontools, so
> that we provide failing LBAs and then Peter or you provide a tool that
> forces rewrites of those bad blocks with the correct data from some
> redundant disks.

Let's discuss the (imaginary) procedures:

CASE I: running standard Linux SoftRAID

We have a RAID array consisting of a number of drives. Somehow* we find the
LBA of a soft-failing bad block. Hot-fail, then hot-remove the disk, then
hot-add it. The disk will be fully resynced (and any bad blocks will be
written to).

CASE II: running FastRAID5 (fr5)

We have a FastRAID-5 array consisting of a number of drives. Somehow* we
find the LBA of a soft-failing bad block. We need to calculate the md disk
block address from the LBA of the drive (how? I think I need help from the
people on the LinuxRAID mailing list). Hot-fail the disk, but DO NOT
hot-remove it. Now read-write the md disk block address (this will write to
the good drives and mark the blocks changed in the bitmap). Now hot-add the
soft-failing disk; this will commence a "hot-repair". The changed blocks
will be resynced to the soft-failing disk, and the drive will write the bad
block, thereby inducing a block reallocation.

* smart(montools) long test, or a badblocks non-modifying read on an md
drive component.

Let me know if I missed something?

With kind regards,
Leon Woestenberg.
|
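Leon's CASE I can be sketched with the raidtools commands of the day (mdadm
has equivalents); the device names are examples, and the commands are shown
commented out since hot-failing a member is disruptive. For RAID-1, the md
block address Leon asks about maps one-to-one onto the sector offset inside
the member partition; RAID-5 additionally needs the chunk-size and
parity-layout arithmetic.

```shell
# CASE I, standard SoftRAID: force a full resync of the suspect member so
# that every sector, including the bad one, is rewritten:
#   raidsetfaulty /dev/md0 /dev/hdc1   # mark the member as failed
#   raidhotremove /dev/md0 /dev/hdc1   # drop it from the array
#   raidhotadd    /dev/md0 /dev/hdc1   # re-add it; the resync rewrites it

# RAID-1 mapping from a drive LBA to an offset within the md device:
LBA=431282233      # bad sector from the extended self-test log above
PART_START=63      # start LBA of /dev/hdc1, from 'fdisk -lu' (an assumption)
MD_SECTOR=$((LBA - PART_START))
echo "md sector: $MD_SECTOR"
```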
From: Bruno W. I. <br...@wo...> - 2004-11-23 16:47:45
|
On Tue, Nov 23, 2004 at 05:42:17 -0600, Bruce Allen <ba...@gr...> wrote:
> Is it really true that there is nothing in the software RAID subsystem
> that can be used to rewrite a drive with unreadable sectors? Without
> this, software RAID can't cope properly with the most common drive 'soft
> failure/data loss' mode.

I recently did this. You can boot the system under Knoppix (or some other
live CD system) and copy good blocks from one drive to write over
unreadable blocks on the drive having problems. This should get the
unreadable blocks reallocated so that the bad drive becomes usable again.
The badblocks HOWTO's information about dd provides enough background on
how to do this.

It is probably a good idea to run the nondestructive badblocks write test
while you are doing this, to find other bad blocks while you have the
system down. Using long self-tests is useful because they can find isolated
bad sectors faster than running badblocks over the whole disk. The
disadvantage is that the self-test stops after finding one bad sector,
which makes cleaning up a chunk of bad sectors slower.
|
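For a RAID-1 pair, Bruno's technique comes down to a single dd command. The
device names below are examples, the LBA is the one from the self-test log
earlier in the thread, and the dd line is commented out because aiming it
at the wrong device or offset destroys data; run it only from a rescue
environment with the array stopped.

```shell
LBA=137788807            # LBA_of_first_error from 'smartctl -l selftest'
OFFSET=$((LBA * 512))    # byte offset, handy for cross-checking
echo "bad sector $LBA starts at byte $OFFSET"

# Copy the good mirror's sector (skip= on /dev/hdc) over the unreadable
# sector (seek= on /dev/hda), forcing the firmware to reallocate it:
#   dd if=/dev/hdc of=/dev/hda bs=512 count=1 skip=$LBA seek=$LBA conv=notrunc
```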
From: Bruce A. <ba...@gr...> - 2004-11-23 17:21:10
|
Bruno,

Does this require 'identical' mirrored drives? Or will it work, e.g., in
RAID-5?

Bruce
|
From: Bruno W. I. <br...@wo...> - 2004-11-23 21:04:58
|
On Tue, Nov 23, 2004 at 11:20:32 -0600, Bruce Allen <ba...@gr...> wrote:
> Does this require 'identical' mirrored drives? Or will it work, e.g., in
> RAID-5?

If you knew enough, you could probably fix RAID 5, but the simple case only
applies to RAID 1.
|
From: John G. <gn...@to...> - 2004-11-23 10:35:20
|
> First, any drive with an offline_uncorrectable will not resync to the
> array (Linux, software RAID 5 on 3Ware 78xx series controllers). They
> seem to operate fine otherwise, but if another drive dies and the array
> begins resyncing, at some point during the resync the kernel will
> encounter errors on the drive with the offline_uncorrectable and will
> kick it out of the array (= double disk failure).

This is amazingly foolish behavior on the part of the RAID implementation.

(1) It should be possible to trivially recover from a bad sector ("offline
uncorrectable") as soon as it happens. You already have the data replicated
on your other RAID drives. Read those drives, compute the right data, and
write it to the bad sector. The bad sector should get reallocated and be
fine thereafter. Isn't this what RAID is supposed to be for?

(2) When the RAID software gets a single bad sector during a RAID restore
operation, after an entire drive had failed, it throws away a second entire
drive, losing all the data? Why doesn't it recover everything except the
one bad sector?

(3) If the RAID software is trying to read a file and gets a bad sector,
does it automatically reconstruct that sector from the other drives? Or
does it hang with read errors? This is another common way for poorly
designed RAID systems to mess up. Ideally it not only gives you the good
data, quickly, but also eventually writes that data to the bad sector,
causing the automatic recovery mentioned in (1) above.

Here's my experience with individual drives (I've never run RAID, partly
because of the kinds of stuff above that make it not so useful):

Generally, if you get one or two bad sectors on a drive, it indicates minor
defects on the platter, like a tiny scratch or a bit of oxide that fell off
in use. The disk drive comes with lots of spare sectors to deal with this.
It knows how to reallocate sectors.
Disk drives used to have to be told when to do this, using system
administration tools, but now they do it automatically. All the drive needs
is a good copy of the data for that sector, which it can get in one of two
ways: as soon as you write good data to that sector, or if it is ever able,
by retrying over and over, to read the sector. When either of these things
happens, it is likely to move that data elsewhere and thereafter stop using
that bad physical spot on the disk.

If a drive gets a short series of bad sectors at the same time, these often
relate to some physical incident, like the drive being dropped while
running, or a particle of smoke getting through the air filter and damaging
part of the platters. This can be recovered from, too, as long as the drive
settles down to error-free operation after fixing the damage.

If a drive produces bad sectors, and you fix 'em, and it makes more, and
you fix 'em, and it makes more, etc., then you're looking at a drive that
has a more serious problem. It's on its last legs, and you should
immediately back it up. Good backup utilities will keep reading through any
errors they get, and give you a good copy of the 99.99% of your data that's
undamaged. (If you have it RAIDed, then you already have a spinning backup
online, but I suggest making an additional backup at the point you detect
this kind of drive failure.) Then replace the drive.

John Gilmore

PS: An "offline uncorrectable" really means that the drive was testing
itself ("offline") and encountered a sector that it could not read and
could not error-correct. Online uncorrectables will produce a kernel error
message (a read fails with a "media error" status code). Offline
uncorrectables are only reported via S.M.A.R.T., since no software has
explicitly tried to read that bad sector. Thus, perhaps you should make a
script that turns an offline uncorrectable (reported in SMART logs) into an
online uncorrectable (by trying to read that sector from a Unix process).
Then the kernel and RAID software will see that the sector is damaged and
will do whatever they can do to recover (perhaps your RAID really does the
right thing in case (1) above).
|
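John's proposed script might look like the sketch below; the device name is
an example, and the awk pattern assumes the self-test log layout shown
earlier in the thread, with the failing LBA in the last column.

```shell
# Turn an offline uncorrectable into an online one: pull the failing LBA
# from the SMART self-test log and read exactly that sector, so that the
# kernel (and any RAID layer above it) sees the media error too.
DEV=/dev/hda
LBA=$(smartctl -l selftest "$DEV" 2>/dev/null |
      awk '/read failure/ { print $NF; exit }')   # LBA_of_first_error
if [ -n "$LBA" ] && [ "$LBA" != "-" ]; then
    echo "forcing an online read of sector $LBA on $DEV"
    dd if="$DEV" of=/dev/null bs=512 count=1 skip="$LBA"
fi
```

If the read fails, the kernel logs a media error against that sector, and
the RAID layer gets its chance to reconstruct and rewrite it.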