Thread: [smartmontools-support] How to fix bad blocks on software RAID?

From: Kai S. <mai...@co...> - 2008-10-28 19:31:38

I have two Samsung HD502IJ disks in a software RAID 1 array on CentOS 5. 
There are two mirrored partitions (/dev/md0, /dev/md1) on each, a small 
one for the xen hypervisor system and a large one with the remainder of 
the disk that is managed by LVM to get a bunch of smaller partitions for 
xen guests.
Both disks show one Offline_Uncorrectable error, at different offsets. 
Earlier this month I started to get errors from md in the logs about 
reallocation attempts, which seemed to fail on the first tries and then 
succeed (I can make the logs available if somebody thinks it would help). 
Since then it has happened three or four more times. The RAID has been 
fine the whole time and shows as clean.
I checked with smartctl -a -d ata, output can be found here:
http://winware.org/smart/raid-error.sda.txt
http://winware.org/smart/raid-error.sdb.txt

The short offline tests then found the Offline_Uncorrectable errors and I 
also started getting two pending sectors in sda in the smartd messages. 
However, these disappeared after some time and only the two 
Offline_Uncorrectable errors remain.

It looks to me as if I have a bunch of closely related bad blocks (after 
offset 208645048) on sda that keep upping my Raw_Read_Error_Rate each time 
a read is attempted on them, which happens fairly often because of the 
daily backups. (Forget about the error on sdb for now.)

How can I trigger a reallocation so that these are not getting used 
anymore? 
If I understand correctly I cannot follow the bad block how-to exactly, or 
only partly?
I read this thread
http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.64.0806270653350.5844%40gc.phys.uwm.edu
which suggests I could copy over good data from the other disk, but it's 
not clear to me at all how I find out where exactly the problem is and how 
I copy the correct data over.

To identify the block I followed the section of the bad block how-to that 
covers LVM, and I think I have identified the correct bad block number for 
that LVM partition. However, neither of the two methods from the how-to 
lets me confirm that there is a problem.

1. Using dd to read from that block and around it is always fine. This 
might be due to the RAID? As I'm reading from the LVM device on the RAID 
partition I might always get readable data. Would I need to destroy the 
RAID before I can get any errors? But if I remove RAID I also break the 
LVM that sits on it. How would I then access a specific LVM partition on a 
specific disk?

2. And when I try debugfs on it I always get 

debugfs:  icheck 1000
Block   Inode number
1000    <block not found>

even on a low number like 1000.
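
Note (not part of the original message): the two checks above can be
sketched roughly as follows. The first function is the sector-to-block
arithmetic from the bad block how-to, under the assumption that the
filesystem occupies one contiguous run of sectors whose starting LBA on the
disk is already known; that starting LBA is a hypothetical input here and is
not derived from LVM metadata. The second function reads sectors from a
RAID member device such as /dev/sda rather than from /dev/mdX, so the read
cannot be satisfied by the other mirror. Function and parameter names are
made up for illustration.

```shell
#!/bin/sh
# Sketch only; every number is supplied by the caller, nothing is probed.

lba_to_fs_block() {
    # $1 = failing LBA as reported by smartctl (512-byte sectors)
    # $2 = LBA of the filesystem's first sector on that disk
    #      (hypothetical input: partition start plus any LVM offset)
    # $3 = filesystem block size in bytes (e.g. from tune2fs -l)
    echo $(( ($1 - $2) * 512 / $3 ))
}

read_member_sectors() {
    # $1 = member device (e.g. /dev/sda, not /dev/mdX)
    # $2 = first LBA to read, $3 = number of 512-byte sectors
    # Any further arguments are handed to dd unchanged, for example
    # iflag=direct (GNU dd) to bypass the page cache.
    dev=$1; start=$2; count=$3; shift 3
    dd if="$dev" of=/dev/null bs=512 skip="$start" count="$count" "$@"
}
```

For example, `read_member_sectors /dev/sda 208645048 8 iflag=direct` would
try to read eight sectors starting at the offset mentioned earlier; a
non-zero exit status from dd would suggest a read error on that disk.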

Thanks for any hints.


Kai

Re: [smartmontools-support] How to fix bad blocks on software RAID?

From: Christian F. <Chr...@t-...> - 2008-10-30 12:09:14

Kai Schaetzl wrote:
> I have two Samsung HD502IJ disks in a software RAID 1 array on CentOS
> 5.
> There are two mirrored partitions (/dev/md0, /dev/md1) on each, a
> small one for the xen hypervisor system and a large one with the
> remainder of the disk that is managed by LVM to get a bunch of smaller
> partitions for xen guests.
> Both disks show one Offline_Uncorrectable error at different offsets.
> ...
> The short offline tests then found the Offline_Uncorrectable errors
> and I also started getting two pending sectors in sda in the smartd
> messages.
> However, these disappeared after some time and only the two
> Offline_Uncorrectable errors remain.
> ...
> I read this thread
> http://sourceforge.net/mailarchive/message.php?msg_id=Pine.LNX.4.64.0806270653350.5844%40gc.phys.uwm.edu
> which suggests I could copy over good data from the other disk, but
> it's not clear to me at all how I find out where exactly the problem
> is and how I copy the correct data over.
> 

AFAIK, the Linux software RAID does this for you if it encounters a bad
block on one of the disks:
http://lxr.linux.no/linux+v2.6.27/drivers/md/raid1.c#L1621

So a raw read through the RAID driver may force the reallocation - with
a probability of 50% :-)
(e.g. 'ddrescue -v /dev/md0 /dev/null read.log') 

Note: Some older Samsung disks (at least SP1614C from P80 series) do not
increment Reallocated_Sector_Ct and do not reset Offline_Uncorrectable
on bad sector reallocation. I don't know whether this is the case for T-
or F1-Series disks.

Cheers,
Christian

Re: [smartmontools-support] How to fix bad blocks on software RAID?

From: Bruce A. <ba...@gr...> - 2008-10-30 15:13:18

> AFAIK, the Linux software RAID does this for you if it encounters a bad 
> block on one of the disks: 
> http://lxr.linux.no/linux+v2.6.27/drivers/md/raid1.c#L1621

That's great!! When was this feature added to Linux software RAID?  Does 
it work for all redundant RAID levels like RAID-5 or RAID-6, or only for 
mirroring?

Cheers,
 	Bruce

Re: [smartmontools-support] How to fix bad blocks on software RAID?

From: David G. <da...@dg...> - 2008-10-30 16:00:01

Bruce Allen wrote:
>> AFAIK, the Linux software RAID does this for you if it encounters a bad 
>> block on one of the disks: 
>> http://lxr.linux.no/linux+v2.6.27/drivers/md/raid1.c#L1621
> 
> That's great!! When was this feature added to Linux software RAID?  Does 
> it work for all redundant RAID levels like RAID-5 or RAID-6, or only for 
> mirroring?

I think you guys want:
  http://linux-raid.osdl.org/index.php/RAID_Administration

Looking at:

echo check > /sys/block/mdX/md/sync_action
echo repair > /sys/block/mdX/md/sync_action
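
Note (not part of the original message): a small wrapper around these two
sysfs writes might look like the sketch below. Unlike an ordinary read
through /dev/mdX, which RAID 1 serves from only one mirror, a `check` pass
reads every member, so a sector that is unreadable on one disk should be
hit and rewritten from the good copy. The function names and the
MD_SYSFS_ROOT variable are hypothetical; the variable exists only so the
script can be dry-run against a scratch directory instead of /sys.

```shell
#!/bin/sh
# Sketch only. MD_SYSFS_ROOT is a hypothetical override for dry runs;
# leave it unset on a real system so that /sys is used.

md_attr() {
    # $1 = array name (e.g. md1), $2 = attribute file under .../md/
    printf '%s/block/%s/md/%s\n' "${MD_SYSFS_ROOT:-/sys}" "$1" "$2"
}

md_scrub() {
    # $1 = array name, $2 = check | repair
    case "$2" in
        check|repair) ;;
        *) echo "usage: md_scrub mdX check|repair" >&2; return 2 ;;
    esac
    echo "$2" > "$(md_attr "$1" sync_action)"
}

md_scrub_status() {
    # Prints the current action (idle, check, repair, ...) and the
    # mismatch count recorded by the kernel.
    printf 'action=%s mismatches=%s\n' \
        "$(cat "$(md_attr "$1" sync_action)")" \
        "$(cat "$(md_attr "$1" mismatch_cnt)")"
}
```

Run as root, `md_scrub md1 check` followed later by `md_scrub_status md1`
would start a scrub of /dev/md1 and report its state. Starting with `check`
rather than `repair` seems the more cautious choice, since `check` only
counts mismatches between the mirrors and leaves them in place.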


David
PS Bruce - did you get a recent email re Samsung?


-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."