From: Jay L. <jl...@sl...> - 2019-11-04 21:53:44
Diego, I will review that. Thank you! Aleksander, I responded to your
earlier email directly. Let me know what I can provide. Thank you to both
of you!

Jay

----------------------------
Jay Livens
jl...@sl...
(617)875-1436
----------------------------

On Mon, Nov 4, 2019 at 3:00 PM Remolina, Diego J <dij...@ae...> wrote:

> If your drives are using ZFS, are you using the following option on each
> of the zfs pools?
>
> zpool set failmode=continue {pool}
>
> If this is not the case, then it is possible for one disk failure to halt
> the whole cluster. I was told about this a while back, so I simulated a
> failure, and indeed I could no longer write to my MooseFS file system
> when I force-failed a disk. Bringing the disk back online would allow the
> cluster to work properly again. I later changed the setting, repeated the
> experiment, and upon a single disk failure the file system continued to
> operate properly.
>
> Diego
>
> ------------------------------
> *From:* Aleksander Wieliczko <ale...@mo...>
> *Sent:* Monday, November 4, 2019 4:06 AM
> *To:* Jay Livens <jl...@sl...>
> *Cc:* moo...@li... <moo...@li...>
> *Subject:* Re: [MooseFS-Users] Single disk failure brings down an
> unrelated node
>
> Hi Jay,
>
> I believe we are talking about MooseFS 3.0.105, yes?
>
> First of all, I would like to ask about the hard disks: do you use a
> separate disk for the OS and a separate disk for chunks?
>
> About question number 1:
> These components are independent and are not designed to bring each
> other down. Is it possible that the OS and chunks are stored on the same
> physical disk? In such a scenario, I/O errors will affect the whole
> machine.
>
> About the second question:
> That should work exactly as you described. It is extremely strange that
> you had missing chunks. Goal 3 means 3 copies, so the loss of two
> components should not affect access to the data.
>
> Is it possible to get some more logs from the master server?
>
> Best regards,
>
> Aleksander Wieliczko
> System Engineer
> MooseFS Development & Support Team | moosefs.pro
>
>
> On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:
>
> Hi,
>
> I just had a weird MFS problem occur and was hoping that someone could
> provide guidance. (Questions are at the bottom of this note.) My cluster
> is a simple one with 5 nodes, and each node has one HDD. My goal is set
> to 3 for the share that I am referring to in this post.
>
> I just had a drive go offline. Annoying but manageable; however, when it
> went offline, it appears to have taken another, unrelated node offline
> with it, and to make matters worse, when I looked at the info tab in MFS,
> it said that I was missing a number of chunks! I have no idea why this
> would happen.
>
> Here is the syslog from the unrelated node:
>
> Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks:
> got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3
> times: [ replicator,read chunks: got status: IO error from
> (192.168.x.x:24CE)] <-- The IP of the failed node
> Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+
>
> After those messages, the node stopped responding and I could not ping
> it. A reboot brought it back online.
>
> Here are my questions:
>
>    1. Why would a bad disk on one node bring down another so
>    aggressively? Shouldn't they behave 100% independently of each other?
>    2. Since I have a goal of 3 and effectively lost 2 drives (i.e., the
>    bad drive and the offline node), shouldn't I still have access to all
>    my data? Why was MFS indicating missing chunks in this scenario?
>    Shouldn't I have 3 copies of my data, and so be protected from a
>    double disk failure?
>
> Thank you,
>
> JL
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
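
For reference, the ZFS behaviour Diego describes is controlled by the
pool-level failmode property (the default, "wait", suspends all I/O on the
pool when it loses a device it cannot operate without), and the per-file
copy count Aleksander mentions can be checked with the standard MooseFS
client tools. A minimal sketch of both checks, assuming a pool named
"tank" and a MooseFS mount at /mnt/mfs (placeholders, not names taken from
this thread):

    # show the current failmode of every imported pool
    zpool get failmode

    # switch a pool from the default "wait" to "continue", so a failed pool
    # returns EIO to new writes instead of suspending all I/O
    zpool set failmode=continue tank

    # show the goal (target copy count) set on the share, recursively
    mfsgetgoal -r /mnt/mfs/share

    # list a file's chunks, how many copies exist, and which chunkservers
    # hold them
    mfsfileinfo /mnt/mfs/share/somefile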