From: Agata Kruszona-Z. <ch...@mo...> - 2020-09-02 09:46:22
On 01.09.2020 at 23:19, WK wrote:
> I just added two new chunkservers to an existing cluster.
>
> I am seeing lots of these
>
> Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out
>
> and sometimes the master throws it out completely
>
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
> Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
> Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
> Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>
> All machines are running CentOS7
>
> However there is a mix of MFS versions.
>
> The master is running 3.0.105
>
> and the chunkservers are running various versions.
>
> 1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422>
> 1486939 9.8 TiB 11 TiB 90.49 - 0 0 B 0 B -
>
> 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422>
> 2313507 14 TiB 15 TiB 90.49 - 0 0 B 0 B -
>
> 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422>
> 2154089 13 TiB 14 TiB 90.49 - 0 0 B 0 B -
>
> 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422>
> 2900774 16 TiB 18 TiB 90.49 - 0 0 B 0 B -
>
> 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422>
> 2319192 13 TiB 14 TiB 90.49 - 0 0 B 0 B -
>
> 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422>
> 2131090 13 TiB 18 TiB 70.56 - 0 0 B 0 B -
>
> 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : switch on
> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422>
> 7196 47 GiB 18 TiB 0.25 - 0 0 B 0 B -
>
> Only the two newest units #6 and #7 are having problems and they are
> running the latest MFS version. They were added due to the 90% disk
> space issue, so there is a lot of rebalancing going on.
>
> I assumed the problem a mismatch between the 3.0.105 master and the new
> version but #2 is also running 3.0.114 and is not having problem (though
> it does have an older kernel)
>
> the networking appears fine (iperf runs at 1GB) no errors in dmesg etc.
>
> I will be scheduling some downtime to bring the master up to date
> shortly but I'm interested if anybody else is having this problem

Hi,

It's definitely not a good idea to run an older master with newer chunk servers. I would suggest upgrading everything to match the newest chunk servers.

The messages you are getting simply indicate timed-out connections. A connection between two MFS modules can time out because of network errors, or because one of the modules is "too busy" and does not respond in time. "Too busy" can mean a number of things: slow I/O on local disks, a CPU that cannot keep up (this happens when other processes run on the same machines as the MFS modules), and a number of other factors. You need to take a look at your system and try to find the bottleneck.

For starters, you can try lowering the replication limits (the fourth value in the CHUNKS_WRITE_REP_LIMIT and CHUNKS_READ_REP_LIMIT settings) and see if that helps get rid of the messages.

--
Agata Kruszona-Zawadzka
MooseFS Team
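
To illustrate the suggestion about the fourth value, here is a minimal Python sketch that rewrites only that value in one setting line. It assumes the setting is written as `NAME = a,b,c,d` with four comma-separated values. The helper name `set_fourth_limit` and the sample values `2,1,1,4` and `10,5,2,5` are made up for this example and are not taken from the cluster discussed above, so check the actual lines in your master configuration before changing anything.

```python
def set_fourth_limit(line: str, new_value: int) -> str:
    """Return a 'NAME = a,b,c,d' setting line with the fourth value replaced.

    Hypothetical helper for illustration; it only manipulates text and
    does not touch any MooseFS process or file.
    """
    name, sep, raw_values = line.partition("=")
    if not sep:
        raise ValueError(f"not a 'NAME = values' line: {line!r}")

    values = [v.strip() for v in raw_values.split(",")]
    if len(values) != 4:
        # Assumption: the four-value form is in use (the reply refers
        # to a "fourth value"), so anything else is rejected.
        raise ValueError(f"expected 4 comma-separated values, got {len(values)}")

    values[3] = str(new_value)  # only the fourth value changes
    return f"{name.strip()} = {','.join(values)}"


if __name__ == "__main__":
    # Sample lines with assumed values; substitute your own.
    print(set_fourth_limit("CHUNKS_WRITE_REP_LIMIT = 2,1,1,4", 2))
    print(set_fourth_limit("CHUNKS_READ_REP_LIMIT = 10,5,2,5", 3))
```

Lowering the value in small steps and watching whether the "replicator: connection lost" messages become less frequent keeps the experiment easy to reverse.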