On Fri, Apr 24, 2009 at 12:41 PM, Henry Nestler <henry.ne@arcor.de> wrote:
Hello David,


David Rorex wrote:
I'm having trouble writing a quick way to reproduce it; the same
problem doesn't seem to happen if I just use loop devices. That means the
minimal way to reproduce might be for you to have three separate actual
partitions to use for a raid-5 test :(

I'll see if I can write the minimal steps to reproduce, assuming you
first create 3 small empty partitions on your disk.

Creating partitions on a disk is a problem for most testers.
An older bug, #1569947, has some instructions and testing steps. Perhaps you can also use these for your test here?

see:
"Data corruption on md/raid5 under 0.6.3 and also 0.7.1-hn14"
https://sourceforge.net/tracker/?func=detail&aid=1569947&group_id=98788&atid=622063


Hi Henry,

I tried the test on that page, and it does appear to reproduce the problem. My problem is different from his: in his case he sees data corruption, a seg fault, and a kernel panic. In my case, the dd commands write the files with no errors reported, but md5sum hangs while trying to read the first file.

1. Create 3 files of 500MB each, filled with zeros, in Windows
2. Assign these files to coLinux as block devices using cobd
3. Create the raid (X, Y, Z are the numbers of the cobd devices set in the coLinux config file)
modprobe raid5
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/cobdX /dev/cobdY /dev/cobdZ
mkfs.ext3 /dev/md1
mkdir /raidtst
mount /dev/md1 /raidtst
4. Run test
while true; do
for i in 1 2 3; do dd_rescue -m 300M /dev/zero /raidtst/f.$i; done
for i in 1 2 3; do ls -la /raidtst/f.$i; md5sum /raidtst/f.$i; done
for i in 1 2 3; do rm -f /raidtst/f.$i; done
done
5. In my case, the last thing I see is the output of "ls -la /raidtst/f.1"; the md5sum command never completes. coLinux has not crashed: if I run 'top', I see 75-99% CPU under 'wa' (I/O wait). I can still access /raidtst via ls, and copying a small 100-byte file over there works fine and I can read it back.
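Since the files are all zeros, the test loop could also compare each file against a precomputed reference checksum instead of just printing md5sum output, so silent corruption shows up as a mismatch. A small sketch of my own (the helper name is mine, and it assumes dd_rescue's "300M" means the same 300 MiB as dd's "bs=1M count=300"):

```shell
# check_zeros: compare FILE against the md5 of SIZE_MB megabytes of zeros.
# A mismatch means corrupted data, as opposed to the hang I'm seeing.
check_zeros() {
    file=$1; size_mb=$2
    # Reference checksum of size_mb MiB of zeros, computed from /dev/zero
    ref=$(dd if=/dev/zero bs=1M count="$size_mb" 2>/dev/null | md5sum | cut -d' ' -f1)
    sum=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$sum" = "$ref" ]; then
        echo "OK: $file"
    else
        echo "CORRUPT: $file"
    fi
}
# Usage inside the loop above, instead of a bare md5sum:
#   for i in 1 2 3; do check_zeros /raidtst/f.$i 300; done
```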

Maybe it sometimes works when reading a file back directly after writing it, because the data is still cached in RAM? It seems that with many small test files it works OK, but creating a very large file exposes the problem more quickly. So if you can't reproduce it, maybe try increasing the file size even more?
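One way to take the RAM cache out of the picture before the read-back would be to flush and drop the page cache, so md5sum has to go through md to the cobd devices. This is only a sketch of mine; it needs root, and a 2.6.16 or newer kernel for /proc/sys/vm/drop_caches:

```shell
# drop_caches_and_sum: flush dirty pages, drop clean caches, then read FILE,
# so the checksum comes from the block device rather than cached pages.
# (Requires root to write drop_caches; the write is skipped quietly otherwise.)
drop_caches_and_sum() {
    sync                                              # flush dirty pages to disk
    echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true  # drop page/dentry/inode caches
    md5sum "$1"
}
# e.g.  drop_caches_and_sum /raidtst/f.1
```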

I will try running the debug daemon and see if I can get it to show anything that looks useful.

Thanks