From: Remolina, D. J <dij...@ae...> - 2019-11-04 23:34:39
If your drives are using ZFS, are you using the following option on each of your ZFS pools?
zpool set failmode=continue {pool}
If this is not the case, then it is possible for one disk failure to halt the whole cluster. I was told about this a while back, so I simulated a failure, and indeed I could no longer write to my MooseFS file system when I force-failed a disk. Bringing the disk back online would allow the cluster to work properly again. I later changed the setting, repeated the experiment, and upon a single disk failure the file system continued to operate properly.
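For reference, a minimal sketch of checking and changing that setting; failmode is a pool-level property, and "tank" below is just a placeholder pool name:

    # show the current failmode of every imported pool (the default, "wait",
    # blocks all I/O to the pool until the faulted device recovers)
    zpool get failmode

    # return errors on new writes instead of hanging the whole pool
    zpool set failmode=continue tank

    # verify the change
    zpool get failmode tank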
Diego
________________________________
From: Aleksander Wieliczko <ale...@mo...>
Sent: Monday, November 4, 2019 4:06 AM
To: Jay Livens <jl...@sl...>
Cc: moo...@li... <moo...@li...>
Subject: Re: [MooseFS-Users] Single disk failure brings down an unrelated node
Hi, Jay,
I believe that we are talking about MooseFS 3.0.105. Yes?
First of all, I would like to ask about hard disks.
Do you use a separate hard disk for the OS and a separate hard disk for chunks?
About question number 1:
These components are independent and they are not designed to bring each other down.
Is it possible that the OS and chunks are stored on the same physical disk?
In such a scenario, I/O errors will affect the whole machine.
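If it helps, a quick way to check this on each chunkserver, assuming the default config location /etc/mfs/mfshdd.cfg (adjust for your distribution) and with /mnt/chunk1 as a placeholder for one of the listed paths:

    # list the chunk storage paths the chunkserver is configured to use
    grep -v '^#' /etc/mfs/mfshdd.cfg

    # compare the backing device of each chunk path with the OS root device
    df -h / /mnt/chunk1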
About the second question:
That should work exactly like you described. It is extremely weird that you had some missing chunks.
Goal 3 means 3 copies, so the loss of two components should not affect access to the data.
Is it possible to get some more logs from the master server?
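Something along these lines would be enough, assuming the master logs to syslog like the chunkservers in your excerpt (the systemd unit name is only a guess and may be mfsmaster or moosefs-master depending on the packaging):

    # syslog-based setups: grab the window around the incident
    grep mfsmaster /var/log/syslog | grep 'Nov  4 03:2'

    # systemd-based setups
    journalctl -u moosefs-master --since '2019-11-04 03:00' --until '2019-11-04 04:00'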
Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro
On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:
Hi,
I just had a weird MFS problem occur and was hoping that someone could provide guidance. (Questions are at the bottom of this note.) My cluster is a simple one with 5 nodes and each node has one HDD. My goal is set to 3 for the share that I am referring to in this post.
I just had a drive go offline. Annoying but manageable; however, when it went offline, it appears to have taken another, unrelated node offline with it. To make matters worse, when I looked at the Info tab in MFS, it said that I was missing a number of chunks! I have no idea why this would happen.
Here is the syslog from the unrelated node:
Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks: got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3 times: [ replicator,read chunks: got status: IO error from (192.168.x.x:24CE)] <-- The IP of the failed node
Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+
After those messages, the node stopped responding and I could not ping it. A reboot brought it back online.
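In case it is useful, a rough way to see which chunkservers the master considers connected and the per-disk error state, assuming the mfscli tool that ships with MooseFS 3 (section flags may vary by version; add -H <master-host> if the master does not resolve as "mfsmaster"):

    # list connected chunkservers as seen by the master
    mfscli -SCS

    # per-disk status on every chunkserver (errors, last error time, used space)
    mfscli -SHD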
Here are my questions:
1. Why would a bad disk on one node bring down another so aggressively? Shouldn't they behave 100% independently of each other?
2. Since I have a goal of 3 and effectively lost two drives (i.e. the bad drive and the offline node), shouldn't I still have access to all my data? Why was MFS indicating missing chunks in this scenario? Shouldn't I have 3 copies of my data and therefore be protected from a double disk failure?
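For anyone trying to reproduce this, here is roughly how the applied goal and per-chunk copy counts can be checked with the standard MooseFS client tools (/mnt/mfs/share and somefile are placeholders, and the mfscli section flags may differ by version):

    # confirm the goal actually applied to the share
    mfsgetgoal -r /mnt/mfs/share

    # per-file view: how many valid copies each chunk currently has
    mfsfileinfo /mnt/mfs/share/somefile

    # cluster-wide chunk/copies matrix and missing-chunks report
    mfscli -SIC
    mfscli -SMF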
Thank you,
JL
_________________________________________
moosefs-users mailing list
moo...@li...<mailto:moo...@li...>
https://lists.sourceforge.net/lists/listinfo/moosefs-users