Thread: [smartmontools-support]my-own-badblock-howto.txt [1.8]

Author: Jean-Francois Patenaude
$RCSfile: my-own-badblock-howto.txt,v $
$Revision: 1.8 $
$Date: 2005/06/24 20:23:29 $

DISCLAIMER
==========
*** Make backups first. Don't try this unless you are 100% comfortable with these tools. Don't blame me if you break anything: you're on your own. Don't run my commands as-is; adapt them to your setup and problem.

MASTER BOOT RECORD BACKUP
=========================
*** While playing with the following tools, I sometimes lost my MBR. I'm not sure why, but back it up first!

dd if=/dev/hda of=/mbr_hda bs=512 count=1

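# Or mail yourself a copy (dd writes to stdout when of= is omitted; uuencode reads stdin when given only a name)
dd if=/dev/hda bs=512 count=1 2> /dev/null | uuencode mbr_hda | mail -s mbr_hda.uue you...@yo...d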
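
*** A quick sanity check on the backup: a valid MBR ends with the boot signature bytes 0x55 0xAA at offsets 510 and 511. The sketch below only tests for those two bytes; has_mbr_signature is a hypothetical helper name, not part of any package.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file in $1 has the MBR boot
# signature 0x55 0xAA at byte offsets 510-511.
has_mbr_signature() {
	# od prints the two bytes in hex (" 55 aa"); tr removes spaces and newlines
	[ "$(od -An -tx1 -j 510 -N 2 "$1" | tr -d ' \n')" = "55aa" ]
}

# Usage: has_mbr_signature /mbr_hda && echo "backup looks like an MBR"
```

This checks the signature only, not the partition table itself.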

FIND A BAD SECTOR
=================

With the syslogs
----------------
dmesg | grep UncorrectableError

#>> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=17371583, sector=17371576
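
*** To pull just the bad LBAs out of a long log, a small filter can help. This is a sketch that assumes the message format of the sample line above (other kernels and drivers word these errors differently); bad_lbas_from_log is a hypothetical name. It prints the distinct LBAsect= values; in this example 17371583 is also what the SMART self-test below reports.

```shell
#!/bin/bash
# Hypothetical helper: read kernel messages on stdin and print each distinct
# LBAsect= value from "UncorrectableError" lines, in numeric order.
bad_lbas_from_log() {
	grep 'UncorrectableError' \
		| sed -n 's/.*LBAsect=\([0-9][0-9]*\).*/\1/p' \
		| sort -nu
}

# Usage: dmesg | bad_lbas_from_log
```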

With a Smart extended test
--------------------------
smartctl -t long /dev/hda
# Wait for the test to finish (smartctl prints an estimate); it may take an hour or more
smartctl -l selftest /dev/hda

#>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
#>> # 1  Extended offline    Completed: read failure       40%      6637         17371583
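
*** The LBA of the first error is the last column of the most recent entry ("# 1"). A sketch to extract it, assuming the table layout shown above; first_error_lba is a hypothetical name. It prints "-" when the test found no error.

```shell
#!/bin/bash
# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# the last column of the "# 1" row (the most recent test).
first_error_lba() {
	awk '$1 == "#" && $2 == "1" { print $NF; exit }'
}

# Usage: smartctl -l selftest /dev/hda | first_error_lba
```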

CONFIRM THAT THE HARD DRIVE HAS BAD SECTOR(S)
=============================================

smartctl --attributes /dev/hda | grep -E "RAW_VALUE|Pending"

#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2

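*** The RAW_VALUE of 2 means the drive has 2 "pending" sectors: sectors it could not read and would like to reallocate. It can't move them on its own because it has no good copy of their data, so it waits. If a later read of such a sector succeeds, the drive takes it off the pending list (some drives also copy it to a spare sector at that point). If, on the other hand, you write data to the sector, the old content no longer matters: the drive typically either reuses the sector (if the write succeeds) or marks it as permanently bad and writes the data to a spare sector instead. The exact behaviour is vendor-specific.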
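
*** To check the pending count from a script, take the RAW_VALUE column (the 10th field) of the Current_Pending_Sector line. A sketch assuming the attribute table layout shown above; pending_sectors is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read `smartctl --attributes` output on stdin and print
# the RAW_VALUE (10th field) of the Current_Pending_Sector attribute.
pending_sectors() {
	awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage: smartctl --attributes /dev/hda | pending_sectors
```

This is handy to rerun after each repair attempt below.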

FIND ALL SURROUNDING BAD SECTORS
================================

lba=17371583
begin=$((lba - 50))
end=$((lba + 50))
# Try to read every sector from begin to end (LBAs are 512-byte sectors)
for ((i = begin; i <= end; i++))
do
	# dd exits non-zero when the drive reports a read error for this sector
	if ! dd if=/dev/hda of=/dev/null bs=512 skip=$i count=1 2> /dev/null ; then
		echo "$i: BAD"
	fi
done

*** You'll get a list of bad sectors. In my particular case, the first and last ones were:
*** first: 17371567
*** last:  17371591
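
*** The first/last pair, and the sector count the next dd command needs, can be derived from the scan output instead of read off by hand. A sketch assuming the "N: BAD" lines printed by the loop above; summarize_bad_sectors is a hypothetical name. The span counts every sector from first to last, including any readable ones in between.

```shell
#!/bin/bash
# Hypothetical helper: read "N: BAD" lines on stdin and print the lowest and
# highest bad sector plus the inclusive span between them.
summarize_bad_sectors() {
	awk -F: '/: BAD$/ {
		v = $1 + 0
		if (n == 0 || v < min) min = v
		if (n == 0 || v > max) max = v
		n++
	}
	END {
		if (n > 0) printf "first: %d\nlast:  %d\nspan:  %d sectors\n", min, max, max - min + 1
	}'
}

# Usage: (the scan loop above) | summarize_bad_sectors
```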

TRY TO WRITE ZERO ONTO THOSE BAD SECTORS
========================================

*** Note: this will destroy the content of any file using those sectors. Use this only if you have backups of your files and absolutely want to get rid of your pending bad sectors. In my particular case, those were UNUSED sectors. See http://smartmontools.sourceforge.net/BadBlockHowTo.txt for hints on how to find which files are affected.

# seek= positions the write on the disk (the output); skip= would only skip input from /dev/zero
dd if=/dev/zero of=/dev/hda bs=512 seek=17371567 count=25
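
*** Getting seek= or count= wrong here writes zeros to the wrong place, so it may be safer to generate the command from the first and last bad sectors and inspect it before running it. A sketch; zero_sectors_cmd is a hypothetical helper that only prints the command and never runs it.

```shell
#!/bin/bash
# Hypothetical helper: print (never run) a dd command that zeroes sectors
# FIRST..LAST inclusive on DEVICE.  Usage: zero_sectors_cmd DEVICE FIRST LAST
zero_sectors_cmd() {
	local dev=$1 first=$2 last=$3
	if [ "$first" -gt "$last" ]; then
		echo "zero_sectors_cmd: FIRST must not exceed LAST" >&2
		return 1
	fi
	# seek= offsets the OUTPUT (the disk); count= is the inclusive span
	echo "dd if=/dev/zero of=$dev bs=512 seek=$first count=$((last - first + 1))"
}

# Usage: zero_sectors_cmd /dev/hda 17371567 17371591
```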

DID THE HARD DRIVE REALLOCATE THE BAD SECTORS ?
===============================================

smartctl --attributes /dev/hda | grep -E "RAW_VALUE|Pending"

#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2

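*** No, it didn't reallocate them: the pending count is still 2.
*** Before blaming the drive, check the dd command: it must use seek= (which positions the output, i.e. the disk), not skip= (which only skips input read from /dev/zero). With skip=, dd writes the zeros to sectors 0 to 24 at the very start of the disk, wiping the MBR, and never touches the bad sectors. Either way, see the next step.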

TRY WRITING PATTERNS ONTO THOSE BAD SECTORS
===========================================

*** Again, this will destroy the content of the sectors (and any associated file) you're working on.

# badblocks takes the LAST block before the FIRST block; -w is a destructive write-mode test
badblocks -w -v -b 512 /dev/hda 17371591 17371567
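
*** badblocks takes the range as last block then first block, which is easy to swap. A sketch of a hypothetical helper, badblocks_write_cmd, that accepts the range in the natural first/last order and prints the command for review (it does not run it).

```shell
#!/bin/bash
# Hypothetical helper: print (never run) a destructive badblocks write test
# for sectors FIRST..LAST.  Usage: badblocks_write_cmd DEVICE FIRST LAST
badblocks_write_cmd() {
	local dev=$1 first=$2 last=$3
	if [ "$first" -gt "$last" ]; then
		echo "badblocks_write_cmd: FIRST must not exceed LAST" >&2
		return 1
	fi
	# badblocks wants: [options] device last_block first_block
	echo "badblocks -w -v -b 512 $dev $last $first"
}

# Usage: badblocks_write_cmd /dev/hda 17371567 17371591
```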

DID THE HARD DRIVE REALLOCATE THE BAD SECTORS ?
===============================================

smartctl --attributes /dev/hda | grep -E "RAW_VALUE|Pending"

#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0

*** This time it worked.   Redo a SMART extended test to make sure everything is fine.

smartctl -t long /dev/hda
# Wait for the test to finish (smartctl prints an estimate); it may take an hour or more
smartctl -l selftest /dev/hda

#>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
#>> # 1  Extended offline    Completed without error       00%      6656         -

**************
** ADDENDUM **
**************

FINDING THE AFFECTED PARTITION UNDER LINUX/LVM2
===============================================

Suppose the affected sectors are 17371567 to 17371591.

Get the offset of the LVM partition
-----------------------------------
fdisk -lu /dev/hda

#>>    Device Boot      Start         End      Blocks   Id  System
#>> /dev/hda1              63      401624      200781   83  Linux
#>> /dev/hda2          401625    12129074     5863725   83  Linux
#>> /dev/hda3        12129075   234372284   111121605   8e  Linux LVM

#>> OFFSET=12129075

*** Note that if the bad LBA had been lower than 12129075 (the start of /dev/hda3), the affected partition would not have been under LVM control.
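
*** Finding which partition holds a given LBA, and the LBA's offset inside it, can be scripted from the fdisk -lu listing. A sketch assuming the column layout above (Device, an optional Boot "*", Start, End, ...); partition_for_lba is a hypothetical name. It prints the device, its start sector and the offset.

```shell
#!/bin/bash
# Hypothetical helper: read `fdisk -lu` output on stdin and print
# "DEVICE START OFFSET" for the partition whose Start..End range holds $1.
partition_for_lba() {
	awk -v lba="$1" '$1 ~ /^\/dev\// {
		# The Boot column holds "*" only for the active partition
		if ($2 == "*") { pstart = $3; pend = $4 } else { pstart = $2; pend = $3 }
		if (lba + 0 >= pstart + 0 && lba + 0 <= pend + 0) print $1, pstart, lba - pstart
	}'
}
```

With the listing above, fdisk -lu /dev/hda | partition_for_lba 17371567 prints "/dev/hda3 12129075 5242492".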

Get the difference between your bad LBAs and the LVM offset
-----------------------------------------------------------

#>> first: 17371567 - 12129075 = 5242492
#>> last: 17371591 - 12129075 = 5242516

Get your PE size
----------------

pvdisplay | grep -E 'PV Name|PE Size'
#>>  PV Name               /dev/hda3
#>>  PE Size (KByte)       524288

#>> Convert this number to 512-byte sectors: 524288 * 2 = 1048576

Get the affected PE
-------------------

#>> first:   5242492 / 1048576 = 4.99962
#>> last:    5242516 / 1048576 = 4.99965

#>> PE numbers start at 0, so the affected PE is #4 (the integer part of both quotients)
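
*** The same arithmetic as a function, working in whole sectors so integer division gives the PE number directly. One assumption the calculation above leaves out: LVM2 keeps metadata at the start of the PV, so the first PE begins at an offset called pe_start (pvs -o +pe_start --units s shows it in sectors). The optional fourth argument subtracts it; with the default of 0 you get the calculation above. pe_index is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: PE number holding an LBA, using integer sector math.
# Usage: pe_index LBA PART_START PE_SIZE_KB [PE_START_SECTORS]
pe_index() {
	local lba=$1 part_start=$2 pe_kb=$3 pe_start=${4:-0}
	# PE size in KiB * 2 = PE size in 512-byte sectors; PEs are numbered from 0
	echo $(( (lba - part_start - pe_start) / (pe_kb * 2) ))
}
```

For example, pe_index 17371567 12129075 524288 prints 4.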

Find the affected LV
--------------------

lvdisplay --maps | grep -E 'LV Name|Physical extents'

#>>  LV Name                /dev/vg0/var
#>>    Physical extents    3 to 4
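
*** With many LVs, the lookup can be scripted. A sketch that reads the filtered lvdisplay --maps output above and prints every LV with a segment covering the given PE. It assumes the volume group has a single PV (with several PVs you would also have to match each segment's "Physical volume" line); lv_for_pe is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read `lvdisplay --maps` lines (LV Name / Physical
# extents) on stdin and print each LV with a segment that covers PE $1.
lv_for_pe() {
	awk -v pe="$1" '
		/LV Name/ { name = $NF }
		/Physical extents/ && pe + 0 >= $3 + 0 && pe + 0 <= $5 + 0 { print name }
	'
}

# Usage: lvdisplay --maps | grep -E 'LV Name|Physical extents' | lv_for_pe 4
```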

Confirm it's the affected LV
----------------------------
# Read-only scan (badblocks' default mode) of the whole LV
badblocks -v -b 4096 /dev/vg0/var

Verify if files are affected
----------------------------
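# Read every file; md5sum should report an I/O error for any file stored on a bad sector
find /var -mount -type f -exec md5sum {} + > /dev/null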

EOF
