From: René P. <ly...@lu...> - 2012-02-28 11:57:06
Hello!

Here's a follow-up and a warning to everyone deploying servers. It's not meant to be yet another war story; it should illustrate what to look for when deploying MooseFS master servers.

On Dec 11, 2011 at 1913 +0100, René Pfeiffer appeared and said:
> …
> We have the following scenario with a MooseFS deployment.
>
> - 3 servers
> - server #3 : 1 master
> - server #2 : chunk server
> - server #1 : chunk server, one metalogger process running on this node
>
> Server #3 suffered from a hardware RAID failure including a trashed JFS
> file system where the master logs were on.
…

We have spent a couple of days diagnosing the failover with the server vendor and the ISP that did the provisioning. Apparently the RAID meltdown was due to firmware bugs, since the servers were deployed without any upgrades (a classic case of communication failure). After recovery, all firmware on the servers was upgraded.

After that the RAID failed again. This time the controller removed 2 out of 4 disks because they weren't responding. Since the 2 removed disks formed a complete RAID1 container, the server went out of service again (this time not as master but only as metalogger).

Analysis showed that storage server #3 is the only one with 2 TB disks, which were _not_ approved by the hardware vendor (the ISP used third-party disks to boost storage capacity). Storage servers #1 and #2 run 1 TB disks approved by the hardware vendor. Apparently the RAID controller firmware does not get along with the 2 TB disks and their firmware, leading to timeouts and communication errors on the data bus. Server types and disk models are available if anyone is interested (please ask me off-list).

We haven't figured out why the metalogger data was not useful after the first failure, but we suspect that, due to the massive data corruption on storage server #3, the data sent to the metalogger was corrupt as well. I don't know if the master sends data from disk or from memory to the metalogger(s).
If it reads data from disks and sends it, then our RAID controller might have eaten the data already.

Best regards,
René Pfeiffer.

-- 
 )\._.,--....,'``.  fL  Let GNU/Linux work for you while you take a nap.
/,   _.. \   _\  (`._ ,. R. Pfeiffer <lynx at luchs.at> + http://web.luchs.at/
`._.-(,_..'--(,_..'`-.;.'  - System administration + Consulting + Teaching -
Got mail delivery problems?  http://web.luchs.at/information/blockedmail.php