From: Jay L. <jl...@sl...> - 2019-11-04 04:40:53
Hi,

I just had a weird MFS problem occur and was hoping someone could provide
guidance. (Questions are at the bottom of this note.)

My cluster is a simple one with 5 nodes, and each node has one HDD. The goal
is set to 3 for the share I am referring to in this post.

I just had a drive go offline. Annoying but manageable; however, when it went
offline, it appears to have taken another, unrelated node offline with it.
To make matters worse, when I looked at the info tab in MFS, it said that I
was missing a number of chunks! I have no idea why this would happen.

Here is the syslog from the unrelated node:

  Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
  Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks: got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
  Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3 times: [ replicator,read chunks: got status: IO error from (192.168.x.x:24CE)] <-- The IP of the failed node
  Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+

After those messages, the node stopped responding and I could not ping it.
A reboot brought it back online.

Here are my questions:

1. Why would a bad disk on one node bring down another so aggressively?
   Shouldn't they behave 100% independently of each other?

2. Since I have a goal of 3 and effectively lost two drives (i.e., the bad
   drive plus the offline node), shouldn't I still have access to all my
   data? Why was MFS indicating missing chunks in this scenario? Shouldn't
   I have three copies of my data and therefore be protected against a
   double disk failure?

Thank you,
JL
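P.S. In case it helps with diagnosis: copy counts can be checked from a
client with the stock MooseFS tools. A rough sketch (the /mnt/mfs/share
path is just a placeholder for the actual mount point on my setup):

  # Show the goal set on the share, recursively summarized,
  # to confirm goal=3 really applies to everything under it.
  mfsgetgoal -r /mnt/mfs/share

  # Summarize how many valid copies each chunk of a file currently has
  # (e.g. "chunks with 2 copies: 5" would indicate under-replication).
  mfscheckfile /mnt/mfs/share/some/file

  # Per-chunk detail: which chunkservers hold each copy of each chunk.
  mfsfileinfo /mnt/mfs/share/some/file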