Synopsis:

I am at my wits' end and could use some pointers on what I am doing wrong, or on whether I simply need to buy a new drive. I realize this may not be the perfect forum for this question, and I would be happy with just a pointer to the right place.

I have been getting SMART errors on a backup drive in my Fedora 12 file server. I have tried the instructions in "Bad block HOWTO for smartmontools" to no avail. I have visited many websites and have not found anything that sheds more light on my problem.

The drive is a backup of my data drive, updated once a night with rdiff-backup. It is a full-size Western Digital 1 TB SATA drive. It is /dev/sdc and has only one partition, /dev/sdc1.

The file system used to be ext4, but since the instructions for fixing bad blocks only cover ext2/ext3, I reformatted the drive as ext3 and followed the procedure below, without success.

Details:

I get the following in my email each day:

 --------------------- Smartd Begin ------------------------


 Currently unreadable (pending) sectors detected:
       /dev/sdc [SAT] - 48 Time(s)
       44 unreadable sectors detected

 Offline uncorrectable sectors detected:
       /dev/sdc [SAT] - 48 Time(s)
       30 offline uncorrectable sectors detected

 ---------------------- Smartd End -------------------------

This ends up in /var/log/messages each day:

Jul  9 19:53:58 tux smartd[1658]: Device: /dev/sdc [SAT], 44 Currently unreadable (pending) sectors
Jul  9 19:53:58 tux smartd[1658]: Device: /dev/sdc [SAT], 30 Offline uncorrectable sectors (changed -165)

The steps I took to try to fix these problems:


1) Get SMART info

[root@tux ~]# smartctl -d ata -a /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WMAV51375649
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jun 28 20:13:46 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          (21300) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 245) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3031)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   130   126   021    Pre-fail  Always       -       6475
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       619
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9119
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       143
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       71
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       9117
194 Temperature_Celsius     0x0022   111   108   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       253
198 Offline_Uncorrectable   0x0030   199   199   000    Old_age   Offline      -       195
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   199   000    Old_age   Offline      -       291

SMART Error Log Version: 1
ATA Error Count: 805 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 805 occurred at disk power-on lifetime: 9119 hours (379 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 6f 24 e1  Error: UNC 8 sectors at LBA = 0x01246f67 = 19165031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 67 6f 24 e1 08      00:39:36.070  READ DMA
  ec 00 00 00 00 00 a0 08      00:39:36.061  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:39:36.058  SET FEATURES [Set transfer mode]

Error 804 occurred at disk power-on lifetime: 9119 hours (379 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 6f 24 e1  Error: UNC 8 sectors at LBA = 0x01246f67 = 19165031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 67 6f 24 e1 08      00:39:33.506  READ DMA
  b0 d5 01 09 4f c2 00 08      00:39:33.494  SMART READ LOG
  b0 d5 01 06 4f c2 00 08      00:39:33.490  SMART READ LOG
  b0 d5 01 01 4f c2 00 08      00:39:33.485  SMART READ LOG
  b0 d1 01 01 4f c2 00 08      00:39:33.477  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]

Error 803 occurred at disk power-on lifetime: 9119 hours (379 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 6f 24 e1  Error: UNC 8 sectors at LBA = 0x01246f67 = 19165031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 67 6f 24 e1 08      00:39:30.754  READ DMA
  ec 00 00 00 00 00 a0 08      00:39:30.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:39:30.746  SET FEATURES [Set transfer mode]

Error 802 occurred at disk power-on lifetime: 9119 hours (379 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 6f 24 e1  Error: UNC 8 sectors at LBA = 0x01246f67 = 19165031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 67 6f 24 e1 08      00:39:28.178  READ DMA
  ec 00 00 00 00 00 a0 08      00:39:28.169  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:39:28.166  SET FEATURES [Set transfer mode]

Error 801 occurred at disk power-on lifetime: 9119 hours (379 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 6f 24 e1  Error: UNC 8 sectors at LBA = 0x01246f67 = 19165031

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 67 6f 24 e1 08      00:39:25.615  READ DMA
  ec 00 00 00 00 00 a0 08      00:39:25.607  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      00:39:25.607  SET FEATURES [Set transfer mode]
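
For reference, the LBA that smartctl prints in each error entry can be rebuilt from the register dump on the same line. The sketch below assumes the 28-bit addressing used by the READ DMA command shown above, where the low four bits of DH hold the top LBA bits; the function name is made up for illustration.

```shell
#!/bin/sh
# Rebuild a 28-bit LBA from the ATA registers printed by smartctl.
# Arguments: hex values of DH, CH, CL and SN, without a 0x prefix.
lba28_from_regs() {
    dh=$1 ch=$2 cl=$3 sn=$4
    # Only the low nibble of Device/Head carries LBA bits 27..24.
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from "40 51 08 67 6f 24 e1" (ER ST SC SN CL CH DH)
lba28_from_regs e1 24 6f 67   # prints 19165031
```

This matches the "LBA = 0x01246f67 = 19165031" that smartctl reports, which is a different location from the two LBAs in the self-test log.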

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      6491         2777760
# 2  Short offline       Completed: read failure       40%      6312         2773712

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@tux]# smartctl -l selftest /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      6491         2777760
# 2  Short offline       Completed: read failure       40%      6312         2773712
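
The LBAs used in the later steps come from the last column of this log. A minimal sketch for pulling them out with awk, assuming the column layout shown above (the function name is hypothetical):

```shell
#!/bin/sh
# Print LBA_of_first_error for every self-test that ended in a read failure.
# Reads `smartctl -l selftest` output on stdin.
first_error_lbas() {
    awk '/^# *[0-9]+ / && /read failure/ { print $NF }'
}

# Sample input copied from the log above.
first_error_lbas <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      6491         2777760
# 2  Short offline       Completed: read failure       40%      6312         2773712
EOF
```

On a live system the same function could be fed directly, for example `smartctl -l selftest /dev/sdc | first_error_lbas`.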




2) Get the block size
[root@tux]# dumpe2fs /dev/sdc | grep "Block size"
dumpe2fs 1.41.9 (22-Aug-2009)
Block size:               4096


3) LBA of bad chunk is 2773712

4) LBA of start of partition is (63)

[root@tux]# # LBA of start of dev/sdc is:
[root@tux]# fdisk -lu /dev/sdc

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x0e30349b

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              63  1953520064   976760001   83  Linux
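
The start sector can also be read out of the fdisk listing mechanically. This is only a sketch that assumes the column layout printed above, with an optional `*` in the Boot column; the function name is made up.

```shell
#!/bin/sh
# Print the start sector of one partition from `fdisk -lu` output on stdin.
# $1 is the partition device name, e.g. /dev/sdc1.
partition_start() {
    # If the Boot column holds "*", the start sector is in the third field.
    awk -v part="$1" '$1 == part { print ($2 == "*" ? $3 : $2) }'
}

# Sample line copied from the listing above.
echo '/dev/sdc1              63  1953520064   976760001   83  Linux' |
    partition_start /dev/sdc1   # prints 63
```

On kernels that expose it, `cat /sys/block/sdc/sdc1/start` should report the same value.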


5) Compute offset

(2773712-63)*512/4096 = 346706.125
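
The same arithmetic as a small shell sketch, using integer division so the fractional part shows up as a sector position inside the block. The resulting block number counts from the start of the partition (sector 63 here), not from the start of the disk. The function name is made up for illustration.

```shell
#!/bin/sh
# fs_block LBA PART_START BLOCK_SIZE
# Prints the file system block holding a 512-byte sector, followed by the
# sector's position inside that block.
fs_block() {
    lba=$1 start=$2 bs=$3
    per_block=$(( bs / 512 ))   # sectors per file system block
    rel=$(( lba - start ))      # sector number relative to the partition
    echo "$(( rel / per_block )) $(( rel % per_block ))"
}

fs_block 2773712 63 4096   # prints "346706 1"
fs_block 2777760 63 4096   # prints "347212 1"
```

The remainder of 1 out of 8 sectors corresponds to the .125 in the result above.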


6) Use dd to nuke the single block at 346706

[root@tux]# dd if=/dev/zero of=/dev/sdc bs=4096 count=1 seek=346706
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 5.0141e-05 s, 81.7 MB/s

7) Nuke the block at the other error location (347212)

[root@tux]# dd if=/dev/zero of=/dev/sdc bs=4096 count=1 seek=347212
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 4.8141e-05 s, 85.1 MB/s
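
As a cross-check of steps 6 and 7, the sketch below lists which absolute 512-byte sectors a single-block dd write covers, depending on where the target device begins. It is plain arithmetic under the assumption that /dev/sdc begins at sector 0 and /dev/sdc1 at sector 63 as fdisk reports; the function name is hypothetical.

```shell
#!/bin/sh
# dd_sector_range SEEK BLOCK_SIZE TARGET_START
# Prints the first and last absolute 512-byte sectors covered by
# `dd bs=BLOCK_SIZE count=1 seek=SEEK` on a target that begins at
# absolute sector TARGET_START.
dd_sector_range() {
    seek=$1 bs=$2 base=$3
    per_block=$(( bs / 512 ))
    first=$(( base + seek * per_block ))
    echo "$first $(( first + per_block - 1 ))"
}

dd_sector_range 346706 4096 0    # whole disk:      prints "2773648 2773655"
dd_sector_range 346706 4096 63   # partition at 63: prints "2773711 2773718"
```

Only the second range contains the reported LBA 2773712, so the choice of target device (of=/dev/sdc versus of=/dev/sdc1) decides whether the write reaches that sector.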

8) At this point I rebooted the system and I still get the errors on boot up and once a day.



--
-Skye Sweeney