From: Jay L. <jl...@sl...> - 2019-11-04 04:40:53
Hi,

I just had a weird MFS problem occur and was hoping someone could provide
guidance. (Questions are at the bottom of this note.)

My cluster is a simple one with 5 nodes, and each node has one HDD. The goal
is set to 3 for the share I am referring to in this post.

I just had a drive go offline. Annoying but manageable; however, when it went
offline, it appears to have taken another, unrelated node offline with it.
To make matters worse, when I looked at the info tab in MFS, it said that I
was missing a number of chunks! I have no idea why this would happen.

Here is the syslog from the unrelated node:

  Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
  Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks: got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
  Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3 times: [ replicator,read chunks: got status: IO error from (192.168.x.x:24CE)] <-- The IP of the failed node
  Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+

After those messages, the node stopped responding and I could not ping it.
A reboot brought it back online.

Here are my questions:

1. Why would a bad disk on one node bring down another so aggressively?
   Shouldn't they behave 100% independently of each other?

2. Since I have a goal of 3 and effectively lost two drives (i.e., the bad
   drive plus the offline node), shouldn't I still have access to all my
   data? Why was MFS indicating missing chunks in this scenario? Shouldn't
   I have three copies of my data and therefore be protected against a
   double disk failure?

Thank you,
JL
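P.S. In case it helps with diagnosis: copy counts can be checked from a
client with the stock MooseFS tools. A rough sketch (the /mnt/mfs/share
path is just a placeholder for the actual mount point on my setup):

  # Show the goal set on the share, recursively summarized,
  # to confirm goal=3 really applies to everything under it.
  mfsgetgoal -r /mnt/mfs/share

  # Summarize how many valid copies each chunk of a file currently has
  # (e.g. "chunks with 2 copies: 5" would indicate under-replication).
  mfscheckfile /mnt/mfs/share/some/file

  # Per-chunk detail: which chunkservers hold each copy of each chunk.
  mfsfileinfo /mnt/mfs/share/some/file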