From: WK <wk...@bn...> - 2020-09-01 21:35:38
Forgot to show the master logs.

Getting tons of these. All of the errors involve the two new systems.

Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected
Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected
Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected
Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161D0F replication status: Disconnected
Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000002A71B0 replication status: Disconnected
Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000001FDF21 replication status: Disconnected
Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000001EB5CE2 replication status: Disconnected
Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002132B93 replication status: Disconnected
Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CD13D5 replication status: Disconnected
Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000827F86 replication status: Disconnected
Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000867B8E replication status: Disconnected
Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000803F97 replication status: Disconnected
Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020A5306 replication status: Disconnected
Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000133697 replication status: Disconnected
Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000007B7FCF replication status: Disconnected
Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000002966E3 replication status: Disconnected
Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001C96E65 replication status: Disconnected
Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000213C6EA replication status: Disconnected
Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 00000000021651E6 replication status: Disconnected
Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002150C8F replication status: Disconnected
Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021070A8 replication status: Disconnected
Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CAB31F replication status: Disconnected
Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error
Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000049C99 replication status: Disconnected
Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001E3DA57 replication status: Disconnected
Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000014388A replication status: Disconnected
Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000160E1AA replication status: Disconnected
Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 00000000021656B1 replication status: Disconnected
Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001F20BAD replication status: Disconnected
Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000D03303 replication status: Disconnected
Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000215CF9C replication status: Disconnected
Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 000000000213C207 replication status: Disconnected
Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000019A2E00 replication status: Disconnected
Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F75859 replication status: Disconnected
Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000534F96 replication status: Disconnected
Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001C69388 replication status: Disconnected
Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F3B4C2 replication status: Disconnected
Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 0000000002164F77 replication status: Disconnected
Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 00000000020A8CF7 replication status: Disconnected
Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020F4BCD replication status: Disconnected
Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021509FF replication status: Disconnected
Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000025AB13 replication status: Disconnected
Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000213C226 replication status: Disconnected
Sep 1 14:21:09 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002151A03 replication status: Disconnected
Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000002165995 replication status: Disconnected
Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000001C7CD90 replication status: Disconnected
Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000021615A3 replication status: Disconnected
Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161C17 replication status: Disconnected

On 9/1/2020 2:19 PM, WK wrote:
>
> I just added two new chunkservers to an existing cluster.
>
> I am seeing lots of these
>
> Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out
>
> and sometimes the master throws it out completely
>
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
> Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
> Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
> Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>
> All machines are running CentOS7
>
> However there is a mix of MFS versions.
>
> The master is running 3.0.105
>
> and the chunkservers are running various versions.
>
> 1  mfs66chunker1  10.166.0.21  9422  3  -  3.0.103  4    OFF  1486939  9.8 TiB  11 TiB  90.49  -  0  0 B  0 B  -
> 2  mfs66chunker2  10.166.0.22  9422  2  -  3.0.114  6    OFF  2313507  14 TiB   15 TiB  90.49  -  0  0 B  0 B  -
> 3  mfs66chunker3  10.166.0.23  9422  1  -  3.0.111  13   OFF  2154089  13 TiB   14 TiB  90.49  -  0  0 B  0 B  -
> 4  mfs66chunker4  10.166.0.24  9422  4  -  3.0.103  12   OFF  2900774  16 TiB   18 TiB  90.49  -  0  0 B  0 B  -
> 5  mfs66chunker5  10.166.0.25  9422  5  -  3.0.103  12   OFF  2319192  13 TiB   14 TiB  90.49  -  0  0 B  0 B  -
> 6  mfs66chunker6  10.166.0.26  9422  6  -  3.0.114  (6)  OFF  2131090  13 TiB   18 TiB  70.56  -  0  0 B  0 B  -
> 7  mfs66chunker7  10.166.0.27  9422  7  -  3.0.114  4    OFF  7196     47 GiB   18 TiB  0.25   -  0  0 B  0 B  -
>
> Only the two newest units, #6 and #7, are having problems, and they are
> running the latest MFS version. They were added due to the 90% disk
> space issue, so there is a lot of rebalancing going on.
>
> I assumed the problem was a mismatch between the 3.0.105 master and the
> new version, but #2 is also running 3.0.114 and is not having problems
> (though it does have an older kernel).
>
> The networking appears fine (iperf runs at 1GB), no errors in dmesg etc.
>
> I will be scheduling some downtime to bring the master up to date
> shortly, but I'm interested if anybody else is having this problem.
>
> -wk
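
For anyone who wants to check a similar log dump, here is a rough Python sketch that tallies the master's "replication status" entries by source and destination chunkserver. It only assumes the line layout seen in the master log above; the function names and the `SAMPLE_LINES` list (four entries copied from that log) are made up for illustration and are not part of MooseFS.

```python
import re
from collections import Counter

# Matches the tail of a master log entry as it appears in this thread, e.g.
# "(10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected"
ENTRY_RE = re.compile(
    r"\((?P<src>[\d.]+):\d+ -> (?P<dst>[\d.]+):\d+\) "
    r"chunk: (?P<chunk>[0-9A-Fa-f]{16}) "
    r"replication status: (?P<status>.+?)\s*$"
)


def tally_replication_failures(lines):
    """Count entries per (source IP, destination IP, status)."""
    counts = Counter()
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:  # lines that are not replication entries are skipped
            counts[(match["src"], match["dst"], match["status"])] += 1
    return counts


def failures_by_destination(counts):
    """Collapse the per-pair tally into a per-destination total."""
    totals = Counter()
    for (_src, dst, _status), n in counts.items():
        totals[dst] += n
    return totals


# Hypothetical sample: four entries copied from the master log above.
SAMPLE_LINES = [
    "Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected",
    "Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected",
    "Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected",
    "Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error",
]

if __name__ == "__main__":
    pair_counts = tally_replication_failures(SAMPLE_LINES)
    for (src, dst, status), n in sorted(pair_counts.items()):
        print(f"{src} -> {dst}  {status}: {n}")
    print(dict(failures_by_destination(pair_counts)))
```

To run it against a real log, replace `SAMPLE_LINES` with the lines read from the master's syslog file. Grouping by destination makes it easy to see whether the failures really are limited to the newly added chunkservers.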