From: Bruce A. <bru...@ae...> - 2009-03-03 20:38:12

Hi David,

Copying the list in my reply...

>> 'Offline uncorrectable' are those that appear during a self-test, not in
>> response to READ instructions from the host.
>>
>> 'Pending' counts uncorrectable sectors discovered in response to READ
>> instructions from the host.
>>
>> Your 'offline uncorrectable' counts are zero because you did not run
>> self-tests in the past.
>
> Coming back to this for a minute, for a different seagate disk I did this:
>
>   dd if=/dev/zero of=/dev/sda bs=4096 count=11000000
>
> that ran until the 40 GB disk was overwritten then exited with an error.
> (I didn't write it down, something about the output being full.)

That's OK. The error was probably that you over-ran the end of the device.
No problem in this case.

> My assumption was that this would flush all pending and offline
> uncorrectable blocks. The disk had been sitting unused for 18 months,
> so I figured a full write scan of the surface would be a good idea.

Indeed that should reallocate any UNC sectors.

> Then
>
> (should have done sync here, but didn't...)

Probably not needed. This *might in principle* be needed to complete the
last kB or so of writes to the disk.

>   smartctl -s on /dev/sda
>   smartctl -t long /dev/sda
>
>   wait 30 minutes
>
>   smartctl -a /dev/sda
>
> (edited output)
>
> Model Family:     Seagate Barracuda ATA IV family
> Device Model:     ST340016A
> Firmware Version: 3.19
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   071   066   034    Pre-fail  Always       -       151248114
>   3 Spin_Up_Time            0x0003   070   070   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       72
>   5 Reallocated_Sector_Ct   0x0033   098   098   036    Pre-fail  Always       -       115
>   7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       374020168848
>   9 Power_On_Hours          0x0032   065   065   000    Old_age   Always       -       30792
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       72
> 194 Temperature_Celsius     0x0022   025   046   000    Old_age   Always       -       25
> 195 Hardware_ECC_Recovered  0x001a   071   065   000    Old_age   Always       -       151248114
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
> 202 TA_Increase_Count       0x0032   082   235   000    Old_age   Always       -       18
>
> # 1  Extended offline    Completed without error       00%     30792         -
>
> The long test didn't trigger any errors, so it should not have caused
> the single Offline_Uncorrectable, Current_Pending_sector count to be 1.

On this particular disk, it appears that the Offline_Uncorrectable,
Current_Pending_sector counts are NOT reset, even when there are none.

You could try a '-t offline' to see if *that* resets the counters.

> Yet the dd didn't clear these problems. The system has 1GB of RAM, and
> it's possible that the bad block was in that last 1GB, but the first
> 39GB of the disk should have been overwritten (and probably the rest too
> by the time smartctl -t long reached it.)
>
> I'll try this again on another disk with a sync, just to be sure, but at
> first glance it looks like the full dd didn't clear some of the expected
> fields. At the time the system was running on a PLD rescue
> OS, booted over the network, and the intent was to wipe the disk.

Depending upon the disk manufacturer, you might try their proprietary
utility, for example for IBM/Hitachi 'Drive Fitness Test'. That might
set the counters to zero.

Cheers,
Bruce

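The over-run mentioned above follows from simple arithmetic: 11000000 blocks of 4096 bytes is roughly 45 GB, more than a 40 GB disk holds. Here is a minimal sketch of working out a block count that stops at the end of the device. The helper name `dd_block_count` is made up for illustration, and the 40,000,000,000-byte size is an assumed round figure, not the measured capacity of this drive (on Linux, `blockdev --getsize64 /dev/sda` should report the real value).

```shell
#!/bin/sh
# dd_block_count SIZE_BYTES BLOCK_SIZE
# Prints the number of whole blocks that fit on a device of SIZE_BYTES.
# (Hypothetical helper, not part of dd or smartmontools.)
dd_block_count() {
    echo $(( $1 / $2 ))
}

# Bytes requested by the command in the thread: 11000000 blocks of 4096 bytes.
requested=$(( 11000000 * 4096 ))

# Assumed nominal size of a "40 GB" disk; query the real size with
# `blockdev --getsize64 /dev/sda` on Linux.
disk_bytes=40000000000

echo "requested: $requested bytes"
echo "fits:      $(dd_block_count "$disk_bytes" 4096) blocks of 4096 bytes"
```

Leaving out `count=` altogether has the same practical effect: dd writes until the device is full and then stops with an error, which is harmless when the goal is to overwrite the whole disk.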
From: David M. <ma...@ca...> - 2009-03-06 21:41:55

> > The long test didn't trigger any errors, so it should not have caused
> > the single Offline_Uncorrectable, Current_Pending_sector count to be 1.
>
> On this particular disk, it appears that the Offline_Uncorrectable,
> Current_Pending_sector counts are NOT reset, even when there are none.
>
> You could try a '-t offline' to see if *that* resets the counters.

<SNIP>

It did not.

> Depending upon the disk manufacturer, you might try their proprietary
> utility, for example for IBM/Hitachi 'Drive Fitness Test'. That might
> set the counters to zero.

Running the seatools (DOS) long test didn't fix it either. It is
currently running the seatools "Clear", maybe that will do it. I noticed
that the "Clear" operation is a good deal slower than Linux

  dd if=/dev/zero of=/dev/sda bs=4096

That dd ran at ~42MB/s over the whole disk and finished in 17 minutes.
However bs=512 ran at only ~10MB/s until it was stopped - it would have
taken around 68 minutes. Clear is even slower than dd with bs=512, since
it has been running for 90 minutes and is only 38% completed. Hopefully
this extra activity, whatever it is, will clear out these bits.

Regards,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech

From: David M. <ma...@ca...> - 2009-03-09 20:07:39

Apparently in many instances there is no way for mere mortals to clear
the Offline_Uncorrectable and Current_Pending_Sector counts on Seagate
Barracuda ATA IV 40GB (ST340016A) drives. None of these procedures made
the slightest difference:

  smartctl -t long /dev/sda      # did not see any problems
  smartctl -t offline /dev/sda   # did not see any problems
  dd if=/dev/zero of=/dev/sda bs=4096
  dd if=/dev/zero of=/dev/sda bs=4096 oflag=direct
  seatools (DOS):
    erase disk
    long test                    # did not see any problems
  resize disk (made it a little smaller, then resized to full size)
  reload the DISK firmware, using a floppy found here:
    http://www.seagateunlock.com/

All of these were run on two different disks, neither one of which
cleared a single count from either field. In one previous instance a
disk of this type was able to clear these counts using the seatools
"long test" - where it saw problems, asked if it could repair them, and
after doing so these two counts went to zero. Sadly that seems not to be
a reliable method.

Regards,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech

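When repeating procedures like these, it helps to pull just the two raw values out of the attribute table so they can be compared before and after each step. The following is a minimal sketch, assuming the attribute table layout quoted earlier in the thread (attribute ID in the first column, raw value in the last). `smart_raw` is a hypothetical helper name, and the sample rows are copied from that quoted output, not read from a live disk.

```shell
#!/bin/sh
# smart_raw ID
# Reads `smartctl -A` style attribute rows on stdin and prints the raw
# value (last column) of the attribute whose ID (first column) matches.
# (Hypothetical helper; assumes the column layout shown in the thread.)
smart_raw() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Sample rows taken from the output quoted in the thread.
sample='197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always  - 1
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age Offline - 1'

echo "pending:       $(printf '%s\n' "$sample" | smart_raw 197)"
echo "uncorrectable: $(printf '%s\n' "$sample" | smart_raw 198)"
```

On a live system the input would come from `smartctl -A /dev/sda` in place of `$sample`.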
From: David M. <ma...@ca...> - 2009-03-30 17:24:25

Summary so far: some old Seagate ST340016A disks were found to have
nonzero 'Offline uncorrectable' and 'Current_Pending_Sector' counts
which could not be reset to zero by writing to every block on the disk.

I contacted Seagate about this issue, and the best I could get out of
them (on the second attempt) was:

| I understand you are getting unclearable SMART 197 and 198
| fields on your drive. We do not recommend tampering with your
| SMART values on the drive in any way. We do not have any utility
| ourselves for clearing these fields. Seatools is the only valid
| diagnostic that we use to test the drives for functionality. If
| the drive passes both the short and long test of Seatools then
| the drive itself is fine. If it fails the tests then the drive
| should be replaced.

I did not want to "tamper" with the drives, I just asked if there was a
way to clear these fields. (Neither seatools nor dd will do so.)

Anyway, no answer on why these particular drives ended up with these
counts "stuck". The disks all pass both the long and short SMART tests.
I suspect that this is related to the disks having been powered off for
a very long time, well over a year. I think maybe that if the counts
are set in these two fields, and the disks are left off for a very long
time, somehow or other the firmware loses track of them. For instance,
it may associate a time field with these blocks, and allow only so long
(6 months?) before it swaps them out even if they are not overwritten,
neglecting to clear the two fields when it does so. So when the disks
were powered back up, this check may have been performed, resulting in
the observed "stuck" values in those fields. Whatever this issue is,
according to Seagate, it apparently does not indicate a failing disk.

Regards,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech

From: Bruce A. <bru...@ae...> - 2009-04-01 05:21:38

David,

Thanks for the update. I suspect this is probably due to buggy SMART
firmware on the disk. When writing and reviewing disk firmware, the disk
vendors seem to be mostly concerned with performance (read/write speed)
since this is what gets tested by reviewers and is used by customers in
determining what to buy. The SMART part of the firmware is often an
afterthought, and is not written or reviewed with the same attention to
detail.

Cheers,
Bruce

David Mathog wrote:
> Summary so far: some old Seagate ST340016A disks were found to have
> nonzero 'Offline uncorrectable' and 'Current_Pending_Sector' counts
> which could not be reset to zero by writing to every block on the disk.
>
> I contacted Seagate about this issue, and the best I could get out of
> them (on the second attempt) was:
>
> | I understand you are getting unclearable SMART 197 and 198
> | fields on your drive. We do not recommend tampering with your
> | SMART values on the drive in any way. We do not have any utility
> | ourselves for clearing these fields. Seatools is the only valid
> | diagnostic that we use to test the drives for functionality. If
> | the drive passes both the short and long test of Seatools then
> | the drive itself is fine. If it fails the tests then the drive
> | should be replaced.
>
> I did not want to "tamper" with the drives, I just asked if there was a
> way to clear these fields. (Neither seatools nor dd will do so.)
>
> Anyway, no answer on why these particular drives ended up with these
> counts "stuck". The disks all pass both the long and short SMART tests.
> I suspect that this is related to the disks having been powered off for
> a very long time, well over a year. I think maybe that if the counts
> are set in these two fields, and the disks are left off for a very long
> time, somehow or other the firmware loses track of them. For instance,
> it may associate a time field with these blocks, and allow only so long
> (6 months?) before it swaps them out even if they are not overwritten,
> neglecting to clear the two fields when it does so. So when the disks
> were powered back up, this check may have been performed, resulting in
> the observed "stuck" values in those fields. Whatever this issue is,
> according to Seagate, it apparently does not indicate a failing disk.
>
> Regards,
>
> David Mathog
> ma...@ca...
> Manager, Sequence Analysis Facility, Biology Division, Caltech

From: Tim S. <ti...@bu...> - 2009-04-01 08:51:46

> David Mathog wrote:
>> Anyway, no answer on why these particular drives ended up with these
>> counts "stuck".

Hmm. Just one thought - if you have the LBAs of the original errors, I
wonder if it's worth trying to use hdparm's "--make-bad-sector" and then
"--write-sector" commands? Bit of a long-shot but worth a try...

Cheers,
Tim.

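The hdparm suggestion above could be tried roughly as follows. This is a dry-run sketch that only prints the commands and does not execute them. `print_rewrite_cmds` is a hypothetical helper, and the LBA 12345 is a placeholder, since the thread does not give the LBAs of the original errors. Both operations destroy the contents of the target sector, and as far as I know hdparm refuses to perform them unless the extra `--yes-i-know-what-i-am-doing` flag is supplied.

```shell
#!/bin/sh
# print_rewrite_cmds LBA DEVICE
# Dry run: prints the two hdparm commands without executing them.
# (Hypothetical helper; the LBA passed below is a placeholder.)
print_rewrite_cmds() {
    lba=$1
    dev=$2
    # Step 1: deliberately mark the sector bad so the drive flags it again.
    echo "hdparm --yes-i-know-what-i-am-doing --make-bad-sector $lba $dev"
    # Step 2: rewrite the sector, which should let the drive repair or remap it.
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $lba $dev"
}

print_rewrite_cmds 12345 /dev/sda
```

Printing the commands first leaves a chance to double-check the device name and LBA before anything is written to the disk.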
From: Christian F. <Chr...@t-...> - 2009-04-06 10:27:44

David Mathog wrote:
>
> Summary so far: some old Seagate ST340016A disks were found to have
> nonzero 'Offline uncorrectable' and 'Current_Pending_Sector' counts
> which could not be reset to zero by writing to every block on the
> disk.
>

Just for info: the current CVS version of smartd provides a workaround
for this issue. If '-C 197+ -U 198+' is specified in smartd.conf, a
warning is only issued if the 'Current_Pending_Sector' or 'Offline
uncorrectable' raw value increases. If the new persistence feature
('-s' option) is used, then this also works across boot cycles.

I will also add '-v' options which will allow the drive database to
enable this.

Cheers,
Christian

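A smartd.conf entry using the directives above might look like the fragment below. This is a sketch, not a tested configuration: the device name and the use of `-a` for the default checks are assumptions, and the file location varies by distribution. As far as I understand it, the '-s' persistence option is given on the smartd command line (with a state-file prefix) and does not go in this file, where '-s' means the self-test schedule.

```
# smartd.conf fragment (assumed device /dev/sda)
# -a enables the default set of checks. The trailing '+' on the attribute
# IDs asks smartd to warn only when the raw value increases, so a count
# that is stuck at a nonzero value stops generating warnings.
/dev/sda -a -C 197+ -U 198+
```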