From: WK <wk...@bn...> - 2020-09-01 21:35:43
I just added two new chunkservers to an existing cluster.

I am seeing lots of these:

Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out

and sometimes the master throws it out completely:

Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost

All machines are running CentOS 7. However, there is a mix of MFS versions. The master is running 3.0.105 and the chunkservers are running various versions:

                                                                             'regular' hdd space               'marked for removal' hdd space
#  host           ip           port  id  labels  version  load  maintenance  chunks   used     total   % used  status  chunks  used  total  % used
1  mfs66chunker1  10.166.0.21  9422  3   -       3.0.103  4     OFF          1486939  9.8 TiB  11 TiB  90.49   -       0       0 B   0 B    -
2  mfs66chunker2  10.166.0.22  9422  2   -       3.0.114  6     OFF          2313507  14 TiB   15 TiB  90.49   -       0       0 B   0 B    -
3  mfs66chunker3  10.166.0.23  9422  1   -       3.0.111  13    OFF          2154089  13 TiB   14 TiB  90.49   -       0       0 B   0 B    -
4  mfs66chunker4  10.166.0.24  9422  4   -       3.0.103  12    OFF          2900774  16 TiB   18 TiB  90.49   -       0       0 B   0 B    -
5  mfs66chunker5  10.166.0.25  9422  5   -       3.0.103  12    OFF          2319192  13 TiB   14 TiB  90.49   -       0       0 B   0 B    -
6  mfs66chunker6  10.166.0.26  9422  6   -       3.0.114  (6)   OFF          2131090  13 TiB   18 TiB  70.56   -       0       0 B   0 B    -
7  mfs66chunker7  10.166.0.27  9422  7   -       3.0.114  4     OFF          7196     47 GiB   18 TiB  0.25    -       0       0 B   0 B    -

Only the two newest units, #6 and #7, are having problems, and they are running the latest MFS version. They were added due to the 90% disk space issue, so there is a lot of rebalancing going on.

I assumed the problem was a mismatch between the 3.0.105 master and the new version, but #2 is also running 3.0.114 and is not having problems (though it does have an older kernel).

The networking appears fine (iperf runs at 1GB), there are no errors in dmesg, etc.

I will be scheduling some downtime to bring the master up to date shortly, but I'm interested to hear if anybody else is having this problem.

-wk
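
For anyone who wants to compare notes, here is a rough Python sketch for tallying the replicator messages per chunkserver, which makes it easier to see whether only the new units are affected. It is not part of MooseFS: the function name `tally_replicator_errors` is made up for illustration, and the regular expression assumes syslog lines formatted exactly like the ones quoted above.

```python
import re
from collections import Counter

# Assumed line format (taken from the log excerpts in this post):
#   "Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+"       # syslog timestamp
    r"(?P<host>\S+)\s+mfschunkserver\[\d+\]:\s+"  # hostname and daemon tag
    r"replicator:\s+(?P<msg>.+?)\s*$"             # replicator message text
)


def tally_replicator_errors(lines):
    """Count replicator messages, keyed by (host, message).

    Lines that are not mfschunkserver replicator messages are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("msg"))] += 1
    return counts


# Hypothetical usage: pass an open log file (the path depends on your
# syslog setup) and print the busiest (host, message) pairs first:
#
#   with open(path_to_syslog) as fh:
#       for (host, msg), n in tally_replicator_errors(fh).most_common():
#           print(n, host, msg)
```

Running it over the logs from every chunkserver should show whether the older units log any of these messages at all, or whether it really is only #6 and #7.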