From: Wright, R. P <Rya...@pn...> - 2004-11-22 18:34:47
|
Ryan, Thank you for your response.

>> You need to replace drives that develop these errors if you
>> care about your data.

This is what I've been doing. However, the rate of drive replacement seems
excessive to me. Out of ~700 drives, I've replaced just under 100 since last
December, most of them over the last couple of months since I discovered
smartmontools and began frantically replacing drives with
offline_uncorrectables.

Most of the drives are ~2 years old and run 24x7; however, they receive
little regular use. My archive is primarily write once, read rarely if ever.
It's long-term mass storage. I maintain regular tape backups, and of course
all data is RAID 5 (software; the hardware RAID on the 78xx controllers is
terrible performance-wise).

>> It's possible that the data later became readable and the
>> sector was then relocated. Did the Reallocated_Sector_Count
>> go up also?

It did not. To give one example, on Saturday smartd sent me a message:

"The following warning/error was logged by the smartd daemon:
Device: /dev/twe1 [3ware_disk_04], 1 Offline uncorrectable sectors"

So I checked out the drive in question this morning, and here's what I got:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500JB-00EVA0
Serial Number:    WD-WMAEH1162555
Firmware Version: 15.05R15
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 22 10:15:31 2004 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 200   200   051    Pre-fail Always      -       0
  3 Spin_Up_Time            0x0007 133   122   021    Pre-fail Always      -       3850
  4 Start_Stop_Count        0x0032 100   100   040    Old_age  Always      -       16
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always      -       0
  7 Seek_Error_Rate         0x000b 200   200   051    Pre-fail Always      -       0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always      -       9547
 10 Spin_Retry_Count        0x0013 100   253   051    Pre-fail Always      -       0
 11 Calibration_Retry_Count 0x0013 100   253   051    Pre-fail Always      -       0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always      -       16
194 Temperature_Celsius     0x0022 122   253   000    Old_age  Always      -       28
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always      -       0
197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always      -       2
198 Offline_Uncorrectable   0x0012 200   200   000    Old_age  Always      -       0
199 UDMA_CRC_Error_Count    0x000a 200   253   000    Old_age  Always      -       0
200 Multi_Zone_Error_Rate   0x0009 200   155   051    Pre-fail Offline     -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure      90%         805         137788807
# 2  Short offline       Completed without error      00%         781         -
# 3  Extended offline    Completed: read failure      10%         778         431282233
# 4  Short offline       Completed without error      00%         758         -
# 5  Short offline       Completed without error      00%         710         -
# 6  Short offline       Completed without error      00%         686         -
# 7  Short offline       Completed: read failure      90%         662         137788807
# 8  Short offline       Completed without error      00%         638         -
# 9  Short offline       Completed: read failure      90%         614         137788807
#10  Extended offline    Completed without error      00%         610         -
#11  Short offline       Completed without error      00%         590         -
#12  Short offline       Completed without error      00%         566         -
#13  Short offline       Completed without error      00%         542         -

So the offline_uncorrectable is gone. However, I can see the read failures
in the tests, so something's up with the drive. At this point I have no idea
what, nor what to do with it, as it appears to be running fine otherwise. I
also don't understand why tests show read failures, but subsequent tests can
complete without error.

This is just one example of many; at any given time I can have several
drives in this state. I pull them and replace them with new ones, but I want
to understand them further, as this seems like an unusually high failure
rate.

Also, if I test the drive with Western Digital diagnostics, it will tell me
there's nothing wrong with it (or it will say "There was a problem, but I
fixed it" and subsequent tests come up clean; I never reuse these drives, as
I don't trust WD diags). Western Digital believes these problems will go
away if I switch to their "SB" model "Raid Edition" drives. As the BB & JB
drives fail, I've been replacing them with SB models, so we'll see.

>> drive couldn't read a sector. The question you need to ask
>> yourself is, is this really an isolated event? What is the
>> likelihood that other sectors on the drive, especially those
>> in close proximity to the UNC event that already occurred,
>> might become unreadable? Is the drive operating in a safe
>> and stable environment with respect to heat and its power source?

All servers are in a raised-floor data center: UPS-stabilized power at 208V,
room temperature kept below 60 degrees, three redundant power supplies per
server. I have perf tiles in front of each rack, and drive temperatures have
always looked OK; the example above is at 28C, which is pretty standard
across the cluster.

>> Scheduling periodic offline tests with email to the admin is
>> an extremely good idea if you have not already done so.

I run the short tests nightly, and the extended tests weekly.
Also, two weeks ago I began taking a nightly snapshot of the SMART data for
each drive and storing it in a database so I can track changes. I'm
currently working on a reporting tool to make this data usable.

-Ryan
|
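Ryan's nightly snapshot scheme can be sketched as a small shell job. The
3ware port numbers, device name, and log path below are assumptions; a real
deployment would insert the rows into the database rather than a flat file.

```shell
# Append one dated row per SMART attribute per disk, so that changes in the
# raw values (reallocations, pending sectors) can be tracked over time.
TODAY=$(date +%Y-%m-%d)
for port in 0 1 2 3; do
    smartctl -A -d 3ware,$port /dev/twe1 2>/dev/null |
    awk -v day="$TODAY" -v disk="twe1/$port" \
        '$1 ~ /^[0-9]+$/ { print day, disk, $2, $NF }'
done >> smart-history.log
```

Each emitted line then looks like a date, a disk identifier, an attribute
name, and its raw value, which is easy to load into a table or diff against
the previous night's run.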
From: Wright, R. P <Rya...@pn...> - 2004-11-24 16:37:20
|
>> I have a system using software RAID1 and one of the disks
>> in the array has logged unreadable sectors during self tests.
>>
>> First, are unreadable sectors the same as offline_uncorrectables?

Not necessarily. From what I've gathered (thank you to all for the
excellent conversation on this issue), an offline_uncorrectable is a read
error that could not be (or has yet to be) corrected. Writing to the sector
seems to force the drive to correct it.

Use smartctl to grab a list of your drive attributes; if you have
offline_uncorrectables, they'll show up here:

198 Offline_Uncorrectable   0x0012 200   200   000    Old_age  Always      -       0

The number at the end (0 for this drive) should be zero if everything is
healthy.

>> And two, since one of my disks has unreadable sectors,
>> should I be concerned that the array will not sync at a future
>> time if for some reason the array is broken? Currently, the
>> array is good according to 'cat /proc/mdstat'...

Yes, I would be concerned. It seems the Linux RAID implementation is a bit
picky. If you lose a different drive and a resync starts, the drive with
the unreadable sectors could be kicked out of the array by the kernel. If
that happens, you've got a double-disk failure and your resync immediately
stops.

Now, it is possible to recover from this; I've done it many times. It
involves carefully crafting a new /etc/raidtab, forcing a rewrite of the
superblocks on the drives, and bringing the array back online in a degraded
state. This usually allows you to copy 99%-100% of the data off the array
to another safe place. Many drives that are in bad enough shape to be
kicked out of the array during a resync will "hold up" to a copy (i.e., the
kernel will let them stay in the array).

However, I would still consider replacing the disk with unreadable sectors
now.

-Ryan
|
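The degraded-array trick Ryan describes was done with the raidtools of that
era. As a hedged sketch (the device names, disk count, and which member
failed are all assumptions), the crafted /etc/raidtab marks the lost member
with failed-disk so the superblocks can be rewritten and the array started
degraded:

```
# /etc/raidtab for a 4-disk RAID-5 with member 3 lost (names are examples)
raiddev /dev/md0
    raid-level            5
    nr-raid-disks         4
    chunk-size            64
    persistent-superblock 1
    device                /dev/sda1
    raid-disk             0
    device                /dev/sdb1
    raid-disk             1
    device                /dev/sdc1
    raid-disk             2
    device                /dev/sdd1
    failed-disk           3
```

The chunk-size and the order of the device lines must match the original
array exactly; after triple-checking them, `mkraid --really-force /dev/md0`
rewrites the superblocks and brings /dev/md0 up degraded so the data can be
copied off.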
From: Bruce A. <ba...@gr...> - 2004-11-24 21:03:33
|
>> First, are unreadable sectors the same as offline_uncorrectables?

There are two ways of discovering unreadable sectors.

(1) The operating system, on behalf of user or kernel code, tries to read a
sector of the disk, and the read fails.

(2) The disk firmware, while executing a (so-called offline) short or long
self-test, finds an unreadable sector.

Method (1) leads to 'trouble'. User data can't be read, errors appear in
the system logs, files are damaged, sysadmins get cell phone calls and gray
hair.

Method (2) is less stressful. It may be that the OS/file system has NO data
stored at the unreadable sector. Or it may be part of a file that is never
needed by users or the OS in your particular installation.

Unreadable sectors found through method (1) increment the 'Current pending
sector' count. Unreadable sectors found through method (2) increment the
'Offline uncorrectable sectors' count.

Cheers,
Bruce
|
From: Pete <pe...@co...> - 2004-11-24 21:58:02
|
Ah! Thanks for the explanation Bruce, and thanks to everyone else who
responded to my question.

/Peter

----- Original Message -----
From: "Bruce Allen" <ba...@gr...>
To: "Wright, Ryan P" <Rya...@pn...>
Cc: "pslists" <pls...@wa...>; "Ryan Underwood" <nem...@ic...>;
    "Smartmontools Mailing List" <sma...@li...>
Sent: Wednesday, November 24, 2004 1:02 PM
Subject: RE: [smartmontools-support] What are "offline_uncorrectables"?

_______________________________________________
Smartmontools-support mailing list
Sma...@li...
https://lists.sourceforge.net/lists/listinfo/smartmontools-support
|
From: Sebastian V. <seb...@he...> - 2004-11-23 11:28:26
|
On Monday 22 November 2004 20:34, Wright, Ryan P wrote:
> 197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always      -       2
>
> So the offline_uncorrectable is gone. However, I can see the read
> failures on the tests, so something's up with the drive. At this point I
> have no idea what, nor what to do with it, as it appears to be running
> fine otherwise. I also don't understand why tests show read failures,
> but subsequent tests can complete without error.

As can be seen above, the drive firmware has put the failing sectors into
pending mode, meaning it will reallocate them later. Normally, reallocation
should happen when someone tries to write to the sector.

In fact, this is what the later models of 3ware controllers are apparently
doing to correct failed sectors: they use the redundant data to rewrite the
sector. I have no idea if the 7800 models supported this functionality, and
your use of software RAID makes this functionality unavailable to you
anyway.

> Also, if I test the drive with Western Digital diagnostics, it will tell
> me there's nothing wrong with it (or it will say "There was a problem,
> but I fixed it" and subsequent tests come up clean - I never reuse these
> drives as I don't trust WD diags).

It probably forces a reallocation to take place. After that, the drive will
work normally, but it means you have lost the data on those sectors, as can
be expected.

Now, apart from the problem that the Linux kernel kicks drives with failing
sectors out of the array, one failing sector doesn't quite mean the whole
drive has failed. In fact, this is one reason for using RAID in the first
place. There are different viewpoints on how many failed sectors are too
many to trust the drive. Some say no failed sectors should be allowed;
others allow a few per year of lifetime. You will have to decide for
yourself what is acceptable to you.

I believe that the HDD manufacturers do try to test the discs that are
marked for enterprise use more thoroughly.
The manufacturers also think that a few failing sectors are acceptable for
consumer-grade disks.

I have heard that some people are using badblocks as a stress test for all
new drives before they are put into the RAID array. The idea is that if the
drive develops bad sectors during the badblocks run, it's not worth putting
in the array in the first place.

Sebastian

ps. Your archive is way beyond what I'm using smartmontools for, so take my
comments as an outside observation.
|
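The burn-in Sebastian mentions might look like the sketch below. The device
name is an example, and the destructive write test is commented out because
it erases the entire drive, which is only acceptable before the drive holds
data or joins an array.

```shell
# Four-pass destructive pattern test over the whole drive, logging any
# sectors that fail (run ONLY on a drive that holds no data):
#   badblocks -w -s -o new-badblocks.txt /dev/hdc
# Afterwards, also check whether the run consumed spare sectors:
#   smartctl -A /dev/hdc | grep Reallocated_Sector_Ct

# Accept the drive only if the badblocks log stayed empty:
if [ ! -s new-badblocks.txt ]; then
    echo "drive passed burn-in"
fi
```

A drive that grows bad sectors or reallocations during burn-in goes back to
the vendor instead of into the array.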
From: Bruce A. <ba...@gr...> - 2004-11-23 11:42:53
|
> As can be seen above the drive firmware has put the failing sectors into
> pending mode, meaning it will reallocate them later. Normally
> reallocation should happen when someone tries to write to the sector.
>
> In fact this is what the later models of 3ware controllers are apparently
> doing to correct failed sectors. They use the redundant data to rewrite
> the sector. I have no idea if the 7800 models supported this
> functionality and your usage of software raid makes this functionality
> unavailable to you anyway.

Sebastian, good point!

Is it really true that there is nothing in the software RAID subsystem that
can be used to rewrite a drive with unreadable sectors? Without this,
software RAID can't cope properly with the most common drive 'soft
failure/data loss' mode.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-23 14:00:12
|
Hello,

On Tue, 2004-11-23 at 12:42, Bruce Allen wrote:
> Is it really true that there is nothing in the software RAID subsystem
> that can be used to rewrite a drive with unreadable sectors? Without
> this, software RAID can't cope properly with the most common drive 'soft
> failure/data loss' mode.

Interesting point. I know there are RAID optimization patches floating
around that keep a "list/bitmap of unsynced blocks" for any not-yet-fully
redundant drive.

In the current (Linux) SoftRAID code, a partition is always synced from its
first block to its last (simple, but suboptimal).

Using the patch, and a user tool to mark blocks unsynced, this could be a
good combo.

I will search my bookmarks/Google for it.

Leon.
|
From: Bruce A. <ba...@gr...> - 2004-11-23 16:01:12
|
Leon,

If you can figure this out, how about writing a short howto:

    HOWTO FIX BAD BLOCKS WITH SOFTWARE RAID

which can either live in the mailing list archives, or I can post it under
the smartmontools web pages.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-24 00:36:39
|
Hello all,

Bruce Allen wrote:
> If you can figure this out, how about writing a short howto:
>     HOWTO FIX BAD BLOCKS WITH SOFTWARE RAID
> which can either live in the mailing list archives or I can post it
> under the smartmontools web pages.

At least I was able to find the modified Software RAID for Linux, which
does bitmap-based resyncs. The project is called "FastRAID".

The README is here; it makes for an interesting read, as it describes the
bitmap that marks dirty (unsynced) blocks on the RAID array:

http://cvs.sourceforge.net/viewcvs.py/fr5/fr5/README?rev=1.1.1.1&view=markup

I have contacted Peter to ask him about a user-space way of dirtying the
bitmap. Of course, I could write a tool along with documentation; but until
the user-space tool is there, there is no way of doing this, I guess.

Regards,
Leon Woestenberg.
|
From: Bruce A. <ba...@gr...> - 2004-11-24 05:18:41
|
> At least I was able to find the modified Software RAID for Linux,
> which does bitmap-based resyncs. The project is called "FastRAID".
>
> The README is here. It makes for an interesting read as it describes
> the bitmap that marks dirty (unsynced) blocks on the RAID array.
>
> http://cvs.sourceforge.net/viewcvs.py/fr5/fr5/README?rev=1.1.1.1&view=markup
>
> I have contacted Peter to ask him about a user-space way of dirtying
> the bitmap.
>
> Of course I could write a tool along with documentation. Until the
> user space tool is there, there is no way of doing this I guess.

It would be terrific to try to integrate this with smartmontools, so that
we provide failing LBAs, and then Peter or you provide a tool that forces
rewrites of those bad blocks with the correct data from the redundant
disks.

Cheers,
Bruce
|
From: Leon W. <le...@ma...> - 2004-11-24 11:02:40
|
Hello all,

regarding recovery of soft failures on a disk in a RAID-1/5/6 array.

A soft-failing bad block (512 bytes) is a block whose stored data cannot be
read back correctly (this is probably detected through ECC error
correction/detection coding). Writing to this block will (probably) have
the drive reallocate the block. The idea is that even if one disk in the
array shows soft failures, we should be able to reconstruct the damaged
data from the redundancy in the array.

On Wed, 2004-11-24 at 06:18, Bruce Allen wrote:
> It would be terrific to try and integrate this with smartmontools, so
> that we provide failing LBAs and then Peter or you provide a tool that
> forces rewrites of those bad blocks with the correct data from some
> redundant disks.

Let's discuss the (imaginary) procedures:

CASE I: running standard Linux SoftRAID

We have a RAID array consisting of a number of drives. Somehow* we find the
LBA of a soft-failing bad block. Hot-fail, then hot-remove the disk, then
hot-add it. The disk will be fully resynced (and any bad blocks will be
written to).

CASE II: running FastRAID5 (fr5)

We have a FastRAID-5 array consisting of a number of drives. Somehow* we
find the LBA of a soft-failing bad block. We need to calculate the md disk
block address from the LBA of the drive (how? I think I need help from the
people on the LinuxRAID mailing list). Hot-fail the disk, but DO NOT
hot-remove it. Now read-write the md disk block address (this will write to
the good drives and mark the blocks changed in the bitmap). Now hot-add the
soft-failing disk; this will commence a "hot-repair". The changed blocks
will be resynced to the soft-failing disk, and the drive will write the bad
block, thereby inducing a block reallocation.

* smart(montools) long test, or a badblocks non-modifying read on an md
drive component.

Let me know if I missed something?

With kind regards,
Leon Woestenberg.
|
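Leon's CASE I can be sketched with the raidtools commands of the day (mdadm
has equivalents); the device names are examples, and the commands are shown
commented out since hot-failing a member is disruptive. For RAID-1, the md
block address Leon asks about maps one-to-one onto the sector offset inside
the member partition; RAID-5 additionally needs the chunk-size and
parity-layout arithmetic.

```shell
# CASE I, standard SoftRAID: force a full resync of the suspect member so
# that every sector, including the bad one, is rewritten:
#   raidsetfaulty /dev/md0 /dev/hdc1   # mark the member as failed
#   raidhotremove /dev/md0 /dev/hdc1   # drop it from the array
#   raidhotadd    /dev/md0 /dev/hdc1   # re-add it; the resync rewrites it

# RAID-1 mapping from a drive LBA to an offset within the md device:
LBA=431282233      # bad sector from the extended self-test log above
PART_START=63      # start LBA of /dev/hdc1, from 'fdisk -lu' (an assumption)
MD_SECTOR=$((LBA - PART_START))
echo "md sector: $MD_SECTOR"
```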
From: Bruno W. I. <br...@wo...> - 2004-11-23 16:47:45
|
On Tue, Nov 23, 2004 at 05:42:17 -0600, Bruce Allen <ba...@gr...> wrote:
> Is it really true that there is nothing in the software RAID subsystem
> that can be used to rewrite a drive with unreadable sectors? Without
> this, software RAID can't cope properly with the most common drive 'soft
> failure/data loss' mode.

I recently did this. You can boot the system under Knoppix (or some other
live CD system) and copy good blocks from one drive to write over
unreadable blocks on the drive having problems. This should get the
unreadable blocks reallocated so that the bad drive becomes usable again.
The badblocks HOWTO's information about dd provides enough background on
how to do this.

It is probably a good idea to run the nondestructive badblocks write test
while you are doing this, to find other bad blocks while you have the
system down. Using long self-tests is useful because they can find isolated
bad sectors faster than running badblocks over the whole disk. The
disadvantage is that the self-test stops after finding one bad sector,
which makes cleaning up a chunk of bad sectors slower.
|
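For a RAID-1 pair, Bruno's technique comes down to a single dd command. The
device names below are examples, the LBA is the one from the self-test log
earlier in the thread, and the dd line is commented out because aiming it
at the wrong device or offset destroys data; run it only from a rescue
environment with the array stopped.

```shell
LBA=137788807            # LBA_of_first_error from 'smartctl -l selftest'
OFFSET=$((LBA * 512))    # byte offset, handy for cross-checking
echo "bad sector $LBA starts at byte $OFFSET"

# Copy the good mirror's sector (skip= on /dev/hdc) over the unreadable
# sector (seek= on /dev/hda), forcing the firmware to reallocate it:
#   dd if=/dev/hdc of=/dev/hda bs=512 count=1 skip=$LBA seek=$LBA conv=notrunc
```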
From: Bruce A. <ba...@gr...> - 2004-11-23 17:21:10
|
Bruno,

Does this require 'identical' mirrored drives? Or will it work, e.g., in
RAID-5?

Bruce
|
From: Bruno W. I. <br...@wo...> - 2004-11-23 21:04:58
|
On Tue, Nov 23, 2004 at 11:20:32 -0600, Bruce Allen <ba...@gr...> wrote:
> Does this require 'identical' mirrored drives? Or will it work, e.g., in
> RAID-5?

If you knew enough, you could probably fix RAID 5, but the simple case only
applies to RAID 1.
|
From: John G. <gn...@to...> - 2004-11-23 10:35:20
|
> First, any drive with an offline_uncorrectable will not resync to the
> array (Linux, software RAID 5 on 3Ware 78xx series controllers). They
> seem to operate fine otherwise, but if another drive dies and the array
> begins resyncing, at some point during the resync the kernel will
> encounter errors on the drive with the offline_uncorrectable and will
> kick it out of the array (= double disk failure).

This is amazingly foolish behavior on the part of the RAID implementation.

(1) It should be possible to trivially recover from a bad sector ("offline
uncorrectable") as soon as it happens. You already have the data replicated
on your other RAID drives. Read those drives, compute the right data, and
write it to the bad sector. The bad sector should get reallocated and be
fine thereafter. Isn't this what RAID is supposed to be for?

(2) When the RAID software gets a single bad sector during a RAID restore
operation, after an entire drive had failed, it throws away a second entire
drive, losing all the data? Why doesn't it recover everything except the
one bad sector?

(3) If the RAID software is trying to read a file and gets a bad sector,
does it automatically reconstruct that sector from the other drives? Or
does it hang with read errors? This is another common way for poorly
designed RAID systems to mess up. Ideally it not only gives you the good
data, quickly, but also eventually writes that data to the bad sector,
causing the automatic recovery mentioned in (1) above.

Here's my experience with individual drives (I've never run RAID, partly
because of the kinds of stuff above that make it not so useful):

Generally, if you get one or two bad sectors on a drive, it indicates minor
defects on the platter, like a tiny scratch or a bit of oxide that fell off
in use. The disk drive comes with lots of spare sectors to deal with this.
It knows how to reallocate sectors.
Disk drives used to have to be told when to do this, using system
administration tools, but now they do it automatically. All the drive needs
is a good copy of the data for that sector, which it can get in one of two
ways: as soon as you write good data to that sector, or if it is ever able,
by retrying over and over, to read the sector. When either of these things
happens, it is likely to move that data elsewhere and thereafter stop using
that bad physical spot on the disk.

If a drive gets a short series of bad sectors at the same time, these often
relate to some physical incident, like the drive being dropped while
running, or a particle of smoke getting through the air filter and damaging
part of the platters. This can be recovered from, too, as long as the drive
settles down to error-free operation after fixing the damage.

If a drive produces bad sectors, and you fix 'em, and it makes more, and
you fix 'em, and it makes more, etc., then you're looking at a drive that
has a more serious problem. It's on its last legs, and you should
immediately back it up. Good backup utilities will keep reading through any
errors they get, and give you a good copy of the 99.99% of your data that's
undamaged. (If you have it RAIDed, then you already have a spinning backup
online, but I suggest making an additional backup at the point you detect
this kind of drive failure.) Then replace the drive.

John Gilmore

PS: An "offline uncorrectable" really means that the drive was testing
itself ("offline") and encountered a sector that it could not read and
could not error-correct. Online uncorrectables will produce a kernel error
message (a read fails with a "media error" status code). Offline
uncorrectables are only reported via S.M.A.R.T., since no software has
explicitly tried to read that bad sector. Thus, perhaps you should make a
script that turns an offline uncorrectable (reported in SMART logs) into an
online uncorrectable (by trying to read that sector from a Unix process).
Then the kernel and RAID software will see that the sector is damaged and
will do whatever they can do to recover (perhaps your RAID really does the
right thing in case (1) above).
|
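John's proposed script might look like the sketch below; the device name is
an example, and the awk pattern assumes the self-test log layout shown
earlier in the thread, with the failing LBA in the last column.

```shell
# Turn an offline uncorrectable into an online one: pull the failing LBA
# from the SMART self-test log and read exactly that sector, so that the
# kernel (and any RAID layer above it) sees the media error too.
DEV=/dev/hda
LBA=$(smartctl -l selftest "$DEV" 2>/dev/null |
      awk '/read failure/ { print $NF; exit }')   # LBA_of_first_error
if [ -n "$LBA" ] && [ "$LBA" != "-" ]; then
    echo "forcing an online read of sector $LBA on $DEV"
    dd if="$DEV" of=/dev/null bs=512 count=1 skip="$LBA"
fi
```

If the read fails, the kernel logs a media error against that sector, and
the RAID layer gets its chance to reconstruct and rewrite it.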