Re: [smartmontools-support]SMART Reported Disk Errors (Unrecoverable etc)

Hi Bruno,

I appreciate your help and advice so far.

> > > > My guess is that the drive is having a problem with one sector. If I can mark
> > > > that sector as bad, the problem will be resolved. If this is the case, how can
> > > > I mark it as bad?
> > > 
> > > One option is to shutdown and boot with a live CD (such as knoppix) and
> > > copy the block from the good drive to the bad drive.
> > 
> > Hmmm.. but how would I know exactly which block it was?
> 
> It will probably show up on a self test. (If it doesn't show up on a 
> short selftest, try a long one.) That number will give you the block 
> number using 512 byte blocks. You will probably want to use 4096 
> byte blocks and divide the number by 8, as under linux trying to 
> write a 512 byte block seems to result in a read of a 4096 byte 
> block that includes the bad sector which won't work very well. Be 
> sure to use the whole disk device (e.g. /dev/hda), not a partition 
> device (e.g. /dev/hda1) unless you want to do some more math.
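
The block arithmetic described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the partition start sector is a value you would have to read from your own partition table.

```shell
#!/bin/sh
# Convert an LBA given in 512-byte sectors (as reported by a SMART self-test)
# to a 4096-byte block index. 4096 / 512 = 8, so this is integer division by 8.
lba_to_block4k() {
    echo $(( $1 / 8 ))
}

# The "more math" for a partition device: subtract the partition's starting
# sector (hypothetical second argument, in 512-byte sectors) before dividing.
lba_to_part_block4k() {
    echo $(( ($1 - $2) / 8 ))
}
```

For example, `lba_to_block4k 1000` gives the 4096-byte block on the whole-disk device that contains sector 1000.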

After yesterday's discussion, I ran your commands (quoted below) as a single 
invocation:

# mdadm /dev/md0 -f /dev/hda1 -r /dev/hda1 -a /dev/hda1

for each md device, which successfully went through and re-synced the mirrors. 
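
Repeating that command per md device can be scripted. The following is a minimal sketch rather than what was actually run here: `resync_member` and the `DRY_RUN` switch are hypothetical names, and only the md0/hda1 pairing is taken from this thread.

```shell
#!/bin/sh
# Fail, remove, and re-add one member partition so that md rebuilds it from
# the other half of the mirror. With DRY_RUN=1 (the default in this sketch)
# the command is only printed, not executed.
DRY_RUN=${DRY_RUN:-1}

resync_member() {
    md=$1      # md device, e.g. /dev/md0
    part=$2    # member partition, e.g. /dev/hda1
    if [ "$DRY_RUN" = "1" ]; then
        echo "mdadm $md -f $part -r $part -a $part"
    else
        mdadm "$md" -f "$part" -r "$part" -a "$part"
    fi
}

# Only this pairing comes from the thread. Add further pairs to match your own
# layout (check /proc/mdstat) and let each resync finish before the next one.
resync_member /dev/md0 /dev/hda1
```

Run it with `DRY_RUN=0` only after confirming that the printed commands match your array layout.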

I then booted a rescue CD and ran:

e2fsck -c on /dev/hda1 through /dev/hda8 (except the extended and swap partitions)

No errors were found. 

After booting back into the box, smartd again emailed me the usual unrecoverable 
error warnings for /dev/hda, telling me to look in /var/log/messages, 
where I find:

Dec 21 11:39:26 xxxxxxx smartd[2990]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Dec 21 11:39:26 xxxxxxx smartd[2990]: Device: /dev/hda, 1 Offline uncorrectable sectors

again.

Running a long test only shows:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5778         -
# 2  Extended offline    Completed without error       00%      5762         -
# 3  Short offline       Completed without error       00%      5762         -

so I don't know exactly what sector is the issue.
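
For reference, the LBA_of_first_error column can be pulled out of a self-test log mechanically. This is a minimal sketch that assumes the log layout shown above; the function name is hypothetical. For the log above it prints nothing, because every entry has "-" in that column.

```shell
#!/bin/sh
# Print the LBA_of_first_error value for each self-test log entry that has one.
# Reads self-test log text on stdin; entries whose last column is "-"
# (no error recorded) are skipped.
selftest_error_lbas() {
    awk '/^# *[0-9]+ / && $NF != "-" { print $NF }'
}
```

Typical use would be `smartctl -l selftest /dev/hda | selftest_error_lbas`.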

> > When you mentioned that I thought maybe then I should just run an fsck on the
> > drive and that would likely mark the block bad.
> 
> The drive will reallocate the block once it can safely delete the 
> old one. It can do this if it either gets a good read of the block 
> or if the block gets overwritten.
> 
> > > If you don't want to take the system down, but can afford a lot of
> > > disk IO for a while, you can fail the bad drive and then add it back
> > > to the mirror. This will rebuild the entire drive (or at least the
> > > partitions you failed out of their mirror sets).
> > 
> > Yeah, this is only a test cluster environment so it's no issue to bring the
> > server down or hammer it with disk IO.
> > 
> > I'm not too familiar with failing a drive and adding it back, do you happen
> > to have the steps involved?
> 
> I believe you want to do something like the following:
> mdadm /dev/md0 -f /dev/hda1
> mdadm /dev/md0 -r /dev/hda1
> mdadm /dev/md0 -a /dev/hda1
> 
> This assumes that the bad block is on the hda1 partition and that that
> partition is part of the md0 raid set.
> 
> While the partition is offline, you might want to pound on it with
> badblocks for a while to see if there are any other bad blocks on
> the device (in that partition).

I did this last night with "e2fsck -c" as discussed above, but it didn't find 
anything wrong with the drive. Should I be using the "badblocks" tool for 
this?

> > I'm using RHEL 3.0.3. I've set up md0 to md5 as mirrored partitions on /dev/hda
> > and /dev/hdb.

Thanks.

Michael.
