From: Jay L. <jl...@sl...> - 2019-11-04 21:53:44
Diego, I will review that. Thank you! Aleksander, I responded to your
earlier email directly. Let me know what I can provide. Thank you to both
of you!

Jay

----------------------------
Jay Livens
jl...@sl...
(617)875-1436
----------------------------

On Mon, Nov 4, 2019 at 3:00 PM Remolina, Diego J <dij...@ae...> wrote:

> If your drives are using ZFS, are you using the following option on each
> of the zfs pools?
>
> zpool set failmode=continue {pool}
>
> If this is not the case, then it is possible for one disk failure to halt
> the whole cluster. I was told about this a while back, so I simulated a
> failure, and indeed I could no longer write to my MooseFS file system
> when I force-failed a disk. Bringing the disk back online would allow the
> cluster to work properly again. I later changed the setting, repeated the
> experiment, and upon a single disk failure the file system continued to
> operate properly.
>
> Diego
>
> ------------------------------
> *From:* Aleksander Wieliczko <ale...@mo...>
> *Sent:* Monday, November 4, 2019 4:06 AM
> *To:* Jay Livens <jl...@sl...>
> *Cc:* moo...@li... <moo...@li...>
> *Subject:* Re: [MooseFS-Users] Single disk failure brings down an
> unrelated node
>
> Hi Jay,
>
> I believe we are talking about MooseFS 3.0.105, yes?
>
> First of all, I would like to ask about the hard disks: do you use a
> separate disk for the OS and a separate disk for chunks?
>
> About question number 1:
> These components are independent and are not designed to bring each
> other down. Is it possible that the OS and chunks are stored on the same
> physical disk? In such a scenario, I/O errors will affect the whole
> machine.
>
> About the second question:
> That should work exactly as you described. It is extremely strange that
> you had missing chunks. Goal 3 means 3 copies, so the loss of two
> components should not affect access to the data.
>
> Is it possible to get some more logs from the master server?
>
> Best regards,
>
> Aleksander Wieliczko
> System Engineer
> MooseFS Development & Support Team | moosefs.pro
>
>
> On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:
>
> Hi,
>
> I just had a weird MFS problem occur and was hoping that someone could
> provide guidance. (Questions are at the bottom of this note.) My cluster
> is a simple one with 5 nodes, and each node has one HDD. My goal is set
> to 3 for the share that I am referring to in this post.
>
> I just had a drive go offline. Annoying but manageable; however, when it
> went offline, it appears to have taken another, unrelated node offline
> with it, and to make matters worse, when I looked at the info tab in MFS,
> it said that I was missing a number of chunks! I have no idea why this
> would happen.
>
> Here is the syslog from the unrelated node:
>
> Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks:
> got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3
> times: [ replicator,read chunks: got status: IO error from
> (192.168.x.x:24CE)] <-- The IP of the failed node
> Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+
>
> After those messages, the node stopped responding and I could not ping
> it. A reboot brought it back online.
>
> Here are my questions:
>
>    1. Why would a bad disk on one node bring down another so
>    aggressively? Shouldn't they behave 100% independently of each other?
>    2. Since I have a goal of 3 and effectively lost 2 drives (i.e., the
>    bad drive and the offline node), shouldn't I still have access to all
>    my data? Why was MFS indicating missing chunks in this scenario?
>    Shouldn't I have 3 copies of my data, and so be protected from a
>    double disk failure?
>
> Thank you,
>
> JL
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
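
For reference, the ZFS behaviour Diego describes is controlled by the
pool-level failmode property (the default, "wait", suspends all I/O on the
pool when it loses a device it cannot operate without), and the per-file
copy count Aleksander mentions can be checked with the standard MooseFS
client tools. A minimal sketch of both checks, assuming a pool named
"tank" and a MooseFS mount at /mnt/mfs (placeholders, not names taken from
this thread):

    # show the current failmode of every imported pool
    zpool get failmode

    # switch a pool from the default "wait" to "continue", so a failed pool
    # returns EIO to new writes instead of suspending all I/O
    zpool set failmode=continue tank

    # show the goal (target copy count) set on the share, recursively
    mfsgetgoal -r /mnt/mfs/share

    # list a file's chunks, how many copies exist, and which chunkservers
    # hold them
    mfsfileinfo /mnt/mfs/share/somefile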