From: Jean-Francois P. <sma...@pa...> - 2005-06-29 12:28:30
Author: Jean-Francois Patenaude
$RCSfile: my-own-badblock-howto.txt,v $
$Revision: 1.8 $
$Date: 2005/06/24 20:23:29 $

DISCLAIMER
==========

*** Do backups. You shouldn't try this if you are not 100% comfortable
    with these tools. Don't blame me if you break anything, you're on
    your own. Don't run my commands directly (adapt them to your
    setup/problems).

MASTER BOOT RECORD BACKUP
=========================

*** While playing with the following tools, it sometimes happened that
    I lost my MBR. I'm not sure why ... but back it up first!

dd if=/dev/hda of=/mbr_hda bs=512 count=1
# with no of= operand, dd writes to stdout; uuencode then reads stdin
dd if=/dev/hda bs=512 count=1 2> /dev/null | uuencode mbr_hda | mail -s mbr_hda.uue you...@yo...d

FIND A BAD SECTOR
=================

With the syslogs
----------------

dmesg | grep UncorrectableError
#>> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=17371583, sector=17371576

With a SMART extended test
--------------------------

smartctl -t long /dev/hda
# wait long enough for the results to appear; this may take an hour or even more
smartctl -l selftest /dev/hda
#>> Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
#>> # 1  Extended offline   Completed: read failure        40%             6637  17371583

CONFIRM THAT THE HARD DRIVE HAS BAD SECTOR(S)
=============================================

smartctl --attributes /dev/hda | egrep "RAW_VALUE|Pending"
#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age  Offline  -           2

*** This means that the hard drive would like to reallocate 2 bad
    sectors. It hopes that it will eventually be able to read each of
    these sectors without error. If that happens, it copies the data
    into a spare sector and marks the original sector as permanently
    bad. On the other hand, if you write data to a pending sector, the
    drive marks it as permanently bad and writes the data directly to
    a spare sector.
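The pending count is the number to watch before and after each repair attempt. A minimal sketch for extracting it from the attribute table above follows. The function name pending_sectors is hypothetical (it is not part of smartmontools), and the sketch assumes the ten-column layout shown in the sample output, with RAW_VALUE in the tenth column.

```shell
#!/bin/sh
# pending_sectors: hypothetical helper, not part of smartmontools.
# Reads `smartctl --attributes` output on stdin and prints the RAW_VALUE
# of Current_Pending_Sector (assumed to be the 10th column).
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Demo on the sample line from this howto (no disk access needed).
sample='197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 2'
echo "$sample" | pending_sectors
```

On a real system the input would presumably come from `smartctl --attributes /dev/hda | pending_sectors`, adapted to your device.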
FIND ALL SURROUNDING BAD SECTORS
================================

lba=17371583
let begin=$lba-50
let end=$lba+50
i=$begin
while [ $i -lt $end ]
do
    # LBA is in 512-byte blocks
    dd if=/dev/hda of=/dev/null bs=512 skip=$i count=1 2> /dev/null
    if [ $? -ne 0 ] ; then echo "$i: BAD" ; fi
    let i+=1
done

*** You'll get a list of bad sectors ... in my particular case, the
    first/last ones were the following:
***   first: 17371567
***   last:  17371591

TRY TO WRITE ZERO ONTO THOSE BAD SECTORS
========================================

*** Note: this will destroy the content of any file using those
    sectors. Use this only if you have backups of your files and
    absolutely want to get rid of your pending bad sectors. In my
    particular case, those were UNUSED sectors ... See
    http://smartmontools.sourceforge.net/BadBlockHowTo.txt for hints
    on how to find which files are affected.

# seek= positions the write on the output device. skip= would only skip
# input blocks and write the zeros at sector 0, over the MBR.
# count = last - first + 1 = 25
dd if=/dev/zero of=/dev/hda bs=512 seek=17371567 count=25

DID THE HARD DRIVE REALLOCATE THE BAD SECTORS ?
===============================================

smartctl --attributes /dev/hda | egrep "RAW_VALUE|Pending"
#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age  Offline  -           2

*** No, it didn't reallocate them.
*** It can happen that the drive ignores the writes you just did and
    doesn't reallocate the bad sectors. I don't understand why,
    though ... See the next step.

TRY WRITING PATTERNS ONTO THOSE BAD SECTORS
===========================================

*** Again, this will destroy the content of the sectors (and any
    associated file) you're working on.

# argument order is: device last-block first-block
badblocks -w -v -b 512 /dev/hda 17371591 17371567

DID THE HARD DRIVE REALLOCATE THE BAD SECTORS ?
===============================================

smartctl --attributes /dev/hda | egrep "RAW_VALUE|Pending"
#>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
#>> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age  Offline  -           0

*** This time it worked.
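The arguments of the two write steps above follow from the first and last bad LBA, and they are easy to get wrong: dd needs seek= (not skip=) to position a write on the output device, and badblocks takes the last block before the first. A dry-run sketch, assuming a hypothetical helper named print_repair_cmds that only prints the commands and never touches the disk:

```shell
#!/bin/sh
# print_repair_cmds: hypothetical dry-run helper. It only PRINTS the
# commands for a device and an inclusive LBA range; nothing is executed.
print_repair_cmds() {
    dev=$1 first=$2 last=$3
    count=$(( last - first + 1 ))   # inclusive range, in 512-byte sectors
    echo "dd if=/dev/zero of=$dev bs=512 seek=$first count=$count"
    echo "badblocks -w -v -b 512 $dev $last $first"
}

# Values from this howto; review the printed lines before running any of them.
print_repair_cmds /dev/hda 17371567 17371591
```

Printing instead of executing leaves a chance to check the device name and the range, since both commands are destructive.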
Redo a SMART extended test to make sure everything is fine.

smartctl -t long /dev/hda
# wait long enough for the results to appear; this may take an hour or even more
smartctl -l selftest /dev/hda
#>> Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
#>> # 1  Extended offline   Completed without error        00%             6656  -

**************
** ADDENDUM **
**************

FINDING THE AFFECTED PARTITION UNDER LINUX/LVM2
===============================================

If the affected sectors run from 17371567 to 17371591:

Get the offset of the LVM partition
-----------------------------------

fdisk -lu /dev/hda
#>>    Device Boot      Start         End      Blocks   Id  System
#>> /dev/hda1              63      401624      200781   83  Linux
#>> /dev/hda2          401625    12129074     5863725   83  Linux
#>> /dev/hda3        12129075   234372284   111121605   8e  Linux LVM
#>> OFFSET=12129075

*** Note that if the LBA sector had been lower than 12129075, it would
    have meant that the affected partition wasn't under LVM control.

Get the difference between your LBA bad sector and the LVM offset
-----------------------------------------------------------------

#>> first: 17371567 - 12129075 = 5242492
#>> last:  17371591 - 12129075 = 5242516

Get your PE size
----------------

pvdisplay | egrep 'PV Name|PE Size'
#>> PV Name               /dev/hda3
#>> PE Size (KByte)       524288
#>> Convert this number into 512-byte blocks: 524288 * 2 = 1048576

Get the affected PE
-------------------

#>> first: 5242492 / 1048576 = 4.99962
#>> last:  5242516 / 1048576 = 4.99965
#>> Take the integer part. Your affected PE is/are: #4

Find the affected LV
--------------------

lvdisplay --maps | egrep 'LV Name|Physical extents'
#>> LV Name                /dev/vg0/var
#>> Physical extents       3 to 4

Confirm it's the affected LV
----------------------------

badblocks -v -b 4096 /dev/vg0/var

Verify if files are affected
----------------------------

find /var -mount -type f -exec md5sum {} \;

EOF
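The addendum's arithmetic (subtract the partition offset, convert the PE size to 512-byte sectors, divide) can be sketched as a small function. The name lba_to_pe is hypothetical. Like the howto, the sketch assumes the first physical extent starts at the very beginning of the partition and ignores any LVM metadata area at the start of the physical volume, so treat the result as an estimate near extent boundaries.

```shell
#!/bin/sh
# lba_to_pe: hypothetical helper reproducing the addendum's arithmetic.
# Arguments: disk LBA, partition start sector, PE size in KByte.
# Assumption: ignores any LVM metadata area at the start of the PV.
lba_to_pe() {
    lba=$1 part_start=$2 pe_kb=$3
    pe_sectors=$(( pe_kb * 2 ))                  # 1 KByte = two 512-byte sectors
    echo $(( (lba - part_start) / pe_sectors ))  # integer division gives the PE index
}

lba_to_pe 17371567 12129075 524288   # first bad sector
lba_to_pe 17371591 12129075 524288   # last bad sector
```

With the values from this howto, both calls land in the same extent, which matches the "3 to 4" range reported for /dev/vg0/var.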
From: Gary F. <ga...@in...> - 2005-07-12 01:44:24
Reading over J. F. Patenaude's note and the smartmontools howto:

http://smartmontools.sourceforge.net/BadBlockHowTo.txt

I was wondering whether running e2fsck with the bad blocks flag is all that is needed?

# umount /dev/hdb1
# e2fsck -c /dev/hdb1

(A single -c does a read-only scan; giving -c twice selects the non-destructive read/write test.)

If the bad block test encounters an error during the read step of the non-destructive read/write test, will it attempt to write back a block filled with zeros? Or will it just stop in its tracks and mark the block as bad?

I could see that badblocks probably just returns the block in error so that e2fsck can check whether the bad block is inside a file and, in that case, move it to lost+found, or whatever it does in that situation. Otherwise, if badblocks wrote back a zero block and this cleared the error, e2fsck would be unaware that the error appeared inside a file.

Once e2fsck and badblocks have done their thing, and assuming that badblocks didn't write back a zero block, we might safely be able to zero out the block that the SMART diagnostics are complaining about, since presumably e2fsck has attempted to recover what it can of the file/inode/superblock and any bad blocks have been marked. Once we zap the block, and hopefully it is relocated to an alternate sector, we can run 'e2fsck -c' again, and perhaps our bad block(s) can now be safely used again.

BTW, I think the safest thing to do is probably to run e2fsck -c, then back up the entire file system and rebuild it, running the destructive read/write test first (by specifying -c twice on the mke2fs command), and finally reload the data.
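The order of operations proposed in the last paragraph can be sketched as a dry run. print_rebuild_plan is a hypothetical helper that only prints the steps. The backup and restore steps are left as placeholders because the message does not name a tool, and the partition is the /dev/hdb1 example used above.

```shell
#!/bin/sh
# print_rebuild_plan: hypothetical dry-run helper; prints, never executes.
print_rebuild_plan() {
    part=$1
    echo "umount $part"
    echo "e2fsck -c $part"      # read-only bad block scan
    echo "# back up the file system here (tool not specified in the message)"
    echo "mke2fs -c -c $part"   # -c twice: read/write test, destroys all data
    echo "# restore the backup onto the new file system"
}

print_rebuild_plan /dev/hdb1
```

Nothing here answers the question of what badblocks does on a read error; it only lays out the sequence so each command can be checked before it is run by hand.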