From: <wk...@bn...> - 2012-01-14 19:55:59
On 1/13/12 12:47 AM, ro...@mm... wrote:
> Sorry for intervening and excuse a moosefs newbie question.
> Why are you concerned so much about mfsmaster failing? How often does this
> happen?
>
> I am considering moosefs for a small lan of 15 users, mainly for
> aggregating unused storage space from various machines. Googling suggested
> moosefs is rather robust, but this thread suggests otherwise.
> Have I misunderstood something?

The Master is a single point of failure: if it fails, your data is not available until you bring it back up. The MooseFS software itself is very reliable; we run several clusters and have only seen failures due to human error or hardware (we started off testing with old, cast-off kit).

The good news is that if you have MetaLoggers, recovery is very easy and very reliable. We have never seen data loss from a recovery (except for data "on the fly"), and we have seen some rather "inelegant" failures while playing around with the system. So use good kit (a server-class chassis, dual power supplies, a UPS, and ECC memory) and you dramatically reduce the chance of an outage. Make sure one of the MetaLoggers is capable of being a Master, so you can promote it if needed. There are lots of reasons for an outage, and MooseFS is pretty minor on the list.

Because it is a rare issue (with good kit) AND our application can tolerate some downtime, we elected not to have automated failover; instead a human identifies what the real issue is and handles it. If the Master fails, a staff member simply promotes a MetaLogger to take over the role until we can fix the real master and switch back at a convenient time. Downtime for that is 5-15 minutes once you figure in identifying the issue, recovering the metadata, moving over the IP, clearing ARPs, and maybe restarting chunkservers, depending on what happened. When you do recover, there will be garbage files (from the writes that were in flight) in the Control Panel that eventually get cleaned out automatically.
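For what it's worth, the manual promotion described above can be sketched as a dry-run runbook script. The data directory, service IP, and interface name below are assumptions for illustration; mfsmetarestore, ip, and arping are the usual tools, but check paths and options against your own install before running anything for real:

```shell
#!/bin/sh
# Dry-run sketch of promoting a MetaLogger to Master.
# DATADIR, MASTER_IP and IFACE are assumptions -- adjust for your setup.
DATADIR=/var/lib/mfs       # metalogger data directory (assumption)
MASTER_IP=192.168.1.10/24  # floating service IP of the master (assumption)
IFACE=eth0                 # interface carrying that IP (assumption)

# With DRY_RUN=1 (the default here) each step is only printed,
# so you can review the plan before executing it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Rebuild metadata.mfs from the metalogger's backup + changelogs.
run mfsmetarestore -a -d "$DATADIR"
# 2. Take over the master's service IP and send gratuitous ARPs
#    so clients and chunkservers drop their stale ARP entries.
run ip addr add "$MASTER_IP" dev "$IFACE"
run arping -U -c 3 -I "$IFACE" "${MASTER_IP%/*}"
# 3. Start the master process on the promoted machine.
run mfsmaster start
```

Run it once in dry-run mode to sanity-check the plan, then set DRY_RUN=0. Depending on what broke, you may still need to restart chunkservers afterwards so they reconnect promptly.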
As mentioned, there ARE better procedures, and we would love to have a more automated (reliable) failover, but it's not quite there yet.