Re: [smartmontools-support]badblocks-howto

Hi Bruce and Kay,

> Hi Bruce,
> 
> I read your badblocks-howto at 
> http://smartmontools.sourceforge.net/BadBlockHowTo.txt and greatly 
> benefitted from it. One thing that's (maybe) missing is that often 
> the "smartctl -t long" scan finds a bad sector which is _not_ 
> assigned to any file. In that case it does not help to run debugfs,
>  or rather debugfs reports the fact that no file owns that sector. 
> Furthermore, it is somewhat laborious to come up with the correct 
> numbers for debugfs, and debugfs is slow ...
> 
> So what I suggest in the case of presence of 
> Current_Pending_Sector/Offline_Uncorrectable errors is to create a 
> huge file on that filesystem. dd if=/dev/zero of=/some/mount/point bs=4k
> creates the file. Leave it running until the partition/filesystem is 
> full. This will make the disk reallocate those sectors which do not 
> belong to a file. Check the "smartctl -a" output after that and make 
> sure that the sectors are reallocated. If any remain, use the 
> debugfs method. Of course the usual caveats apply - back it up first,
>  and so on.
> 
> just a suggestion for a future version!

I also read and went through the badblocks howto, but my situation was a
little different (which I'll explain now) so Kay's suggestion helped more.

I have two 120 GB drives in a software RAID 1 mirror, /dev/hda and /dev/hdb.

/dev/hdb started getting these issues:

# 8  Extended offline    Completed: read failure       90%     10326         24551155
# 9  Extended offline    Completed: read failure       90%     10324         24539182
#10  Extended offline    Completed: read failure       90%     10323         24551155
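
For anyone who does need the debugfs route, the howto maps an LBA like the ones in the last column to a file system block with b = (int)((L - S) * 512 / B), where L is the LBA, S the partition's starting sector and B the file system block size. A minimal sketch of that arithmetic follows; the function name lba_to_block is made up for illustration, and the start sector of 63 and block size of 4096 in the usage line are assumptions, not values read from this drive.

```shell
# lba_to_block LBA START_SECTOR BLOCK_SIZE
# Prints the file system block that contains the given 512-byte LBA.
# START_SECTOR comes from "fdisk -lu", BLOCK_SIZE from the file system
# (both are inputs here; nothing is probed from the disk).
lba_to_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Usage (assumed start sector 63, assumed 4096-byte blocks):
#   lba_to_block 24551155 63 4096
```

The resulting block number is what debugfs's icheck command expects.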

As /dev/hdb is in a software mirror, I really don't care about the data
contained on /dev/hdb as it's already mirrored on /dev/hda, so I failed and
removed all /dev/hdb partitions from the mirror sets with:

mdadm /dev/md? -f /dev/hdb? -r /dev/hdb?

and set about wiping the drive, first by removing all its partitions, then
creating one /dev/hdb1 partition spanning the entire drive and formatting it:

mke2fs -j -b 4096 -m 0 -c /dev/hdb1

The -c is the bad block check, and no bad blocks were detected. I ran it a
second time ("mke2fs -c /dev/hdb1") after that to make sure.

Then I did the "dd /dev/zero" method described by Kay (which incidentally
requires a file to be specified on the command line and not a mount point) like:

mkdir /mnt/hd
mount /dev/hdb1 /mnt/hd
dd if=/dev/zero of=/mnt/hd/test bs=4096

So the "test" file fills the drive.
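
The fill step can be wrapped up roughly as below. This is only a sketch: it assumes /dev/zero as the data source (as in Kay's original suggestion), the function name fill_with_zeros and the file name zerofill are made up, and the optional block count exists only so the function can be tried on a small scale first.

```shell
# fill_with_zeros DIR [COUNT]
# Writes zeros to DIR/zerofill in 4096-byte blocks.
# With COUNT, writes exactly COUNT blocks (handy for a trial run).
# Without COUNT, writes until the file system is full; dd then exits
# with "No space left on device", which is expected here.
fill_with_zeros() {
    dir=$1
    count=${2:-}
    if [ -n "$count" ]; then
        dd if=/dev/zero of="$dir/zerofill" bs=4096 count="$count"
    else
        dd if=/dev/zero of="$dir/zerofill" bs=4096 || true
    fi
    # Flush cached writes so the data actually reaches the disk.
    sync
}

# Usage on the mounted scratch file system:
#   fill_with_zeros /mnt/hd
#   rm /mnt/hd/zerofill    # remove the file once the SMART check is done
```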

Then I ran the smartctl check, and the failure still existed!

At this point I was going to turf the drive, but decided what the heck, I'll
low-level format it. So I grabbed my Microscope 2000 boot disk (commercial
product) and low level formatted the drive. I then ran a bad blocks check from
Microscope, butterfly read and write tests etc, all passed.

I then booted back into Linux, ran the "smartctl -t long /dev/hdb" and it
failed not at 90%, but at 60%. The badblocks were still there, but now not at
the same location.

So I then went into fdisk, which presented a different number of heads and
cylinders. Before, like /dev/hda, the drive showed 255 heads and 14593
cylinders with 63 sectors per track; after the low-level format it showed 16
heads with 22k (something) cylinders. I went into fdisk's extended options and
changed the heads and cylinders back to what they were previously, re-created
the one partition, /dev/hdb1, and re-did the dd with the /mnt/hd/test file.

After that, I ran the smartctl long check and it PASSED!

Back into fdisk, created all partitions on the drive again to match /dev/hda,
boot flags, type ids, etc and then hot added them back into their respective
arrays:

mdadm /dev/md? -a /dev/hdb?

with a:

watch "cat /proc/mdstat"

in another putty session to see what was happening during the re-syncs.
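
If watching by hand isn't practical, something like the following can block until the arrays have finished rebuilding. It is a rough sketch that assumes /proc/mdstat shows the word "resync" or "recovery" while a rebuild is running or queued; the function names and the 60-second default poll interval are arbitrary.

```shell
# resync_in_progress [FILE]
# Succeeds if FILE (default /proc/mdstat) mentions a resync or recovery.
resync_in_progress() {
    grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"
}

# wait_for_resync [FILE] [SECONDS]
# Polls every SECONDS (default 60) until no rebuild is reported.
wait_for_resync() {
    while resync_in_progress "${1:-/proc/mdstat}"; do
        sleep "${2:-60}"
    done
}

# Usage:
#   wait_for_resync && smartctl -t long /dev/hdb
```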

All done, now multiple other smartctl checks pass every time:

# 1  Extended offline    Completed without error       00%     10358         -
# 2  Extended offline    Completed without error       00%     10350         -
# 3  Extended offline    Completed without error       00%     10346         -
# 4  Extended offline    Completed without error       00%     10344         -

I still see this in the smartctl -a list though:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       26

but the checks continue to pass and no problems are reported on the drive in
the system logs. So my guess is the drive isn't zeroing that value?
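
To keep an eye on just the sector counters between tests, the attribute table can be filtered with awk. This is a sketch that assumes the usual ten-column layout of "smartctl -A" output, with the raw value in the last column; the function name smart_counters is made up.

```shell
# smart_counters
# Reads "smartctl -A" output on stdin and prints the name and raw value
# of attributes 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector)
# and 198 (Offline_Uncorrectable).
smart_counters() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) { print $2, $10 }'
}

# Usage:
#   smartctl -A /dev/hdb | smart_counters
```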

Anyway, my entire smartctl output is here for your reference:

# smartctl -a /dev/hdb
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500JB-00FUA0
Serial Number:    WD-WMAEP1857288
Firmware Version: 15.05R15
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Aug  5 09:22:30 2005 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (7590) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  95) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   198   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   113   107   021    Pre-fail  Always       -       4858
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       125
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6347
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       117
194 Temperature_Celsius     0x0022   128   253   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 36 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 36 occurred at disk power-on lifetime: 56 hours (2 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 5f a8 02 f0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  01 00 42 00 00 01 00 00      00:08:51.000  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:51.000  NOP [Abort queued commands]
  01 00 25 00 00 01 00 00      00:08:51.000  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:51.000  NOP [Abort queued commands]
  01 00 42 00 00 04 00 00      00:08:51.000  [RESERVED]

Error 35 occurred at disk power-on lifetime: 56 hours (2 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 02 5f a8 02 f0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  01 00 42 00 00 02 00 00      00:08:48.950  [RESERVED]
  00 00 02 00 00 63 a8 00      00:08:48.950  NOP [Abort queued commands]
  01 00 25 00 00 01 00 00      00:08:48.950  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:48.950  NOP [Abort queued commands]
  01 00 42 00 00 08 00 00      00:08:48.950  [RESERVED]

Error 34 occurred at disk power-on lifetime: 56 hours (2 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 04 5f a8 02 f0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  01 00 42 00 00 04 00 00      00:08:46.950  [RESERVED]
  00 00 02 00 00 67 a8 00      00:08:46.950  NOP [Abort queued commands]
  01 00 42 00 00 08 00 00      00:08:46.950  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:46.950  NOP [Abort queued commands]
  01 00 42 00 00 10 00 00      00:08:46.950  [RESERVED]

Error 33 occurred at disk power-on lifetime: 56 hours (2 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  01 51 08 5f a8 02 f0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  01 00 42 00 00 08 00 00      00:08:44.800  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:44.800  NOP [Abort queued commands]
  01 00 42 00 00 10 00 00      00:08:44.800  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:44.800  NOP [Abort queued commands]
  01 00 42 00 00 20 00 00      00:08:44.800  [RESERVED]

Error 32 occurred at disk power-on lifetime: 56 hours (2 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 5f a8 02 f0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  01 00 42 00 00 10 00 00      00:08:42.650  [RESERVED]
  00 00 00 00 00 00 00 00      00:08:42.650  NOP [Abort queued commands]
  01 00 42 00 00 20 00 00      00:08:42.650  [RESERVED]
  00 00 02 00 00 7f a8 00      00:08:42.650  NOP [Abort queued commands]
  01 00 42 00 00 40 00 00      00:08:42.650  [RESERVED]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       878         -
# 2  Short offline       Completed without error       00%       854         -
# 3  Short offline       Completed without error       00%       830         -
# 4  Short offline       Completed without error       00%       806         -
# 5  Short offline       Completed without error       00%       782         -
# 6  Short offline       Completed without error       00%       759         -
# 7  Extended offline    Completed without error       00%       741         -
# 8  Short offline       Completed without error       00%       739         -
# 9  Short offline       Completed without error       00%       716         -
#10  Short offline       Completed without error       00%       693         -
#11  Short offline       Completed without error       00%       669         -
#12  Short offline       Completed without error       00%       645         -
#13  Short offline       Completed without error       00%       622         -
#14  Short offline       Completed without error       00%       599         -
#15  Extended offline    Completed without error       00%       580         -
#16  Short offline       Completed without error       00%       579         -
#17  Short offline       Completed without error       00%       556         -
#18  Short offline       Completed without error       00%       532         -
#19  Short offline       Completed without error       00%       509         -
#20  Short offline       Completed without error       00%       485         -
#21  Short offline       Completed without error       00%       463         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Michael.
