From: Michael M. <mi...@np...> - 2004-12-21 01:34:24
Hi Bruno,

I appreciate your help and advice so far.

> > > > My guess is that the drive is having a problem with one sector. If I can
> > > > mark that sector as bad, the problem will be resolved. If this is the
> > > > case, how can I mark it as bad?
> > >
> > > One option is to shut down and boot with a live CD (such as knoppix) and
> > > copy the block from the good drive to the bad drive.
> >
> > Hmmm.. but how would I know exactly which block it was?
>
> It will probably show up on a self test. (If it doesn't show up on a
> short selftest, try a long one.) That number will give you the block
> number using 512 byte blocks. You will probably want to use 4096
> byte blocks and divide the number by 8, as under linux trying to
> write a 512 byte block seems to result in a read of a 4096 byte
> block that includes the bad sector which won't work very well. Be
> sure to use the whole disk device (e.g. /dev/hda), not a partition
> device (e.g. /dev/hda1) unless you want to do some more math.

After our discussion yesterday, I ran your commands below in the combined form:

# mdadm /dev/md0 -f /dev/hda1 -r /dev/hda1 -a /dev/hda1

for each md device, which successfully went through and re-synced the mirrors.

I then booted a rescue CD and ran "e2fsck -c" on each of /dev/hda1 through
/dev/hda8 (except the extended and swap partitions). No errors were found.

After booting back into the box, smartd again emailed me the usual
unrecoverable-error warnings for /dev/hda, telling me to look in
/var/log/messages, where I again find:

Dec 21 11:39:26 xxxxxxx smartd[2990]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Dec 21 11:39:26 xxxxxxx smartd[2990]: Device: /dev/hda, 1 Offline uncorrectable sectors
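The divide-by-8 step Bruno describes can be sketched in shell. This is only an illustration: the function names `lba_to_4k_block` and `print_copy_cmd` are made up here, the LBA 12345678 is a hypothetical value (no self-test in this thread has reported one), and the sketch assumes /dev/hdb is the good half of the mirror with a partition layout identical to /dev/hda. It only prints the dd command; nothing is written to disk.

```shell
#!/bin/sh
# Convert an LBA counted in 512-byte sectors (the unit used in the
# LBA_of_first_error column) into the index of the 4096-byte block
# that contains it. Integer division drops the remainder.
lba_to_4k_block() {
    echo $(( $1 / 8 ))
}

# Print (do NOT run) a dd command that would copy that single 4096-byte
# block from the good disk to the failing one, using whole-disk devices.
# Arguments: LBA, good device, bad device (all assumptions of this sketch).
print_copy_cmd() {
    lba=$1; good=$2; bad=$3
    blk=$(lba_to_4k_block "$lba")
    echo "dd if=$good of=$bad bs=4096 skip=$blk seek=$blk count=1"
}

# Hypothetical LBA, for illustration only.
print_copy_cmd 12345678 /dev/hdb /dev/hda
```

Printing the command first gives a chance to double-check the source and destination devices before anything is overwritten.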
Running a long test only shows:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5778         -
# 2  Extended offline    Completed without error       00%      5762         -
# 3  Short offline       Completed without error       00%      5762         -

so I don't know exactly which sector is the problem.

> > When you mentioned that I thought maybe then I should just run an fsck on the
> > drive and that would likely mark the block bad.
>
> The drive will reallocate the block once it can safely delete the
> old one. It can do this if it either gets a good read of the block
> or if the block gets overwritten.
>
> > > If you don't want to take the system down, but can afford a lot of
> > > disk IO for a while, you can fail the bad drive and then add it back
> > > to the mirror. This will rebuild the entire drive (or at least the
> > > partitions you failed out of their mirror sets).
> >
> > Yeah, this is only a test cluster environment so it's no issue to bring the
> > server down or hammer it with disk IO.
> >
> > I'm not too familiar with failing a drive and adding it back, do you happen
> > to have the steps involved?
>
> I believe you want to do something like the following:
> mdadm /dev/md0 -f /dev/hda1
> mdadm /dev/md0 -r /dev/hda1
> mdadm /dev/md0 -a /dev/hda1
>
> This assumes that the bad block is on the hda1 partition and that that
> partition is part of the md0 raid set.
>
> While the partition is offline, you might want to pound on it with
> badblocks for a while to see if there are any other bad blocks on
> the device (in that partition).

I did this last night with "e2fsck -c" as discussed above, but it didn't find
anything wrong with the drive. Should I be using the "badblocks" tool for this?

> > I'm using RHEL 3.0.3. I've set up md0 to md5 as mirrored partitions on /dev/hda
> > and /dev/hdb.

Thanks.

Michael.
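For the fail / remove / re-add cycle discussed above, a dry-run helper can print the three mdadm commands for one mirror member so they can be reviewed before being run as root. This is a sketch under assumptions: the function name `resync_cmds` is made up, and the md-to-partition pairing in the example calls is hypothetical (only md0 with hda1 appears in this thread); /proc/mdstat shows the real membership.

```shell
#!/bin/sh
# Print the fail (-f), remove (-r) and re-add (-a) commands for one
# member of one md array. Nothing is executed by this function.
# Arguments: md device, member partition.
resync_cmds() {
    md=$1; part=$2
    echo "mdadm $md -f $part"
    echo "mdadm $md -r $part"
    echo "mdadm $md -a $part"
}

# Pairing for md1 is a guess for illustration; check /proc/mdstat first.
resync_cmds /dev/md0 /dev/hda1
resync_cmds /dev/md1 /dev/hda2
```

After each re-add, `cat /proc/mdstat` shows the rebuild progress, so it may be worth letting one mirror finish before moving on to the next.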