When I started an e2fsck on my RAID-5 device (/dev/sdb1), important files in /lib (on /dev/sda3) were deleted! I ran 'e2fsck -f /dev/sdb1' to force a check on my unmounted ext3 filesystem. The check printed no error messages, but while it was running the system became unusable. It has now happened to me 3 times, and I am struggling in the dark to find help and a solution. Can someone here help?
I'm running a Gentoo Linux amd64 system with an Areca ARC-1220 controller. In the system 3 devices are used:
1. /dev/sda: A RAID-1, 2x80GB on the Areca controller. 3 partitions; /boot, swap and /
2. /dev/sdb: A RAID-5, 4x500GB on the Areca controller. One partition
3. /dev/sdc: One 200GB connected directly to the motherboard, backup disk. One partition
e2fsprogs version is 1.39.
I'm not keen on running e2fsck again, since I don't want to jeopardize the server and its data. But if someone has ideas how to debug this problem I might try it.
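One low-risk way to gather evidence would be to record the state of /lib before a check and compare it afterwards, so that any deleted or altered files can be listed exactly. A minimal sketch in Python, assuming a plain SHA-256 comparison is enough; the function names snapshot and diff_snapshots are made up for illustration and are not part of e2fsprogs:

```python
import hashlib
import os


def snapshot(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    digest = hashlib.sha256(fh.read()).hexdigest()
            except OSError:
                digest = None  # unreadable file: record it without a digest
            result[os.path.relpath(path, root)] = digest
    return result


def diff_snapshots(before, after):
    """Return (missing, changed): files that vanished and files whose digest differs."""
    missing = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return missing, changed


# Hypothetical usage: take one snapshot before the check and one after it.
# before = snapshot("/lib")
# ... run the check ...
# missing, changed = diff_snapshots(before, snapshot("/lib"))
```

Since the system became unusable during the check, the first snapshot would probably have to be saved somewhere off the affected disks, and the comparison done from a rescue environment.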
I ran CentOS 4 on an Areca 1220 using ext3 for about a year with no problems. But the only way I could see your situation happening is if the Areca driver had a bad bug in it. I doubt that e2fsck would do any errant I/O to a completely different device.
However, I am not an ext2/3 hacker so don't go by what I say.
Is there some reason why you would force an fsck on a journaled filesystem running on a RAID5 partition?
Since I was resizing the filesystem, I had to run e2fsck before running resize2fs.
Since I'm clueless, I filed a Linux kernel bug a while ago: http://bugzilla.kernel.org/show_bug.cgi?id=8209 (These guys are often clever, and know where to start looking for problems).
Unfortunately, Areca has never been able to reproduce the error. I have. Twice on my existing machine. But the interesting thing is that I think I suffered from the exact same problem more than a year ago on a completely different machine and setup (see kernel bug report).
Well, given that in your kernel.org bugzilla you reported that it happens on both reiserfsck and e2fsck, it seems very unlikely it's an e2fsprogs problem; it seems pretty clear you have some kind of device driver stability problem.
As it happens I have an Areca controller (an ARC-1160, which is I believe the 16-port version of your card), and it's completely rock solid. Of course, I'm using it on a 32-bit system, and NOT a 64-bit system. An obvious thing to try is seeing if the problem goes away if you install a 32-bit Linux system on your server --- does your server really need 64-bit support? If it's just a file server, you might be able to work just fine with the x86 distro bits installed instead of the x86_64 bits. In fact, because pointers and long integers are half the size, for some workloads 32-bit programs run measurably faster than their 64-bit counterparts. Which makes sense; if you don't need a Ford Expedition, a Toyota Scion will be much more fuel efficient.
In any case, as shack625 has said, this is very clearly an I/O device driver and/or firmware problem.
Thanks for your answer!
Interesting point! Areca only tried to reproduce it on a 32-bit system. I will contact them again and ask them to try it on a 64-bit system.
BTW, ARC-1160 is a PCI-X card, ARC-1220 is a PCIe card. Otherwise it's correct; mine has 8 (SATA2) ports. No, my server does not need a 64-bit OS. Perhaps I should switch to 32-bit Gentoo, because stability is more important to me than a fraction of a percent performance improvement in some cases.
I suspect what you will find is that for some workloads, x86_64 will be faster (by a small amount) and for some workloads, it will be slower (by a small amount). The speed advantage comes because the x86_64 architecture has double the number of general purpose registers (and the x86 design is pathetically starved in terms of having way too few registers); the speed disadvantage will come, as I have mentioned, from pointers and long integers being twice as wide. So if your program is memory bound, doubling the size of pointers and long integers could very well slow down the program on a 64-bit system. If on the other hand the program is doing something which is heavily register intensive, the extra registers could help. It really depends on the workload.
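To see the width difference on a given install, the sizes of the basic C types can be printed from Python's ctypes module. This is only a rough sketch, and the helper name c_type_widths is made up for illustration. On a typical 32-bit Linux userspace all three types are 4 bytes; on a typical x86_64 Linux userspace pointers and `long` grow to 8 bytes while plain `int` stays at 4.

```python
import ctypes


def c_type_widths():
    """Return the size in bytes of a few C types as seen by this Python build."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    for name, size in c_type_widths().items():
        print(f"{name}: {size} bytes")
```

The numbers reflect the userspace the interpreter was built for, so a 32-bit Python on a 64-bit kernel would still report 4-byte pointers.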
If you *do* find that going to a 32-bit kernel and userspace solves your problem, please note that in the kernel.org bugzilla, so the x86_64 people know that they should go bug-hunting in the architecture specific code....