From: Aleksander W. <ale...@mo...> - 2020-09-02 09:29:41
It looks like your chunk servers 10.166.0.26 and 10.166.0.27 have some serious problems: both network and disk errors ("IO error"). Are you sure that they have enough resources? I mean, are they swapping, or do they have any network problems?

Also, at the moment it looks like your cluster is running in an unsupported configuration! The MooseFS master should always be at a version higher than or equal to the other MooseFS components (chunk servers, clients, metaloggers). Right now your master server is running version 3.0.105, while some of your chunk servers are on 3.0.111 and 3.0.114; this can lead to weird cluster behavior. Please update all components to the same version first.

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Tue, 1 Sep 2020 at 23:36, WK <wk...@bn...> wrote:

> Forgot to show the master logs
>
>
> Getting tons of these. All errors involve the two new systems.
>
> Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected
> Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected
> Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected
> Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161D0F replication status: Disconnected
> Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000002A71B0 replication status: Disconnected
> Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000001FDF21 replication status: Disconnected
> Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000001EB5CE2 replication status: Disconnected
> Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002132B93 replication status: Disconnected
> Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CD13D5 replication status: Disconnected
> Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000827F86 replication status: Disconnected
> Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000867B8E replication status: Disconnected
> Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000803F97 replication status: Disconnected
> Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020A5306 replication status: Disconnected
> Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000133697 replication status: Disconnected
> Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000007B7FCF replication status: Disconnected
> Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000002966E3 replication status: Disconnected
> Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001C96E65 replication status: Disconnected
> Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000213C6EA replication status: Disconnected
> Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 00000000021651E6 replication status: Disconnected
> Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002150C8F replication status: Disconnected
> Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021070A8 replication status: Disconnected
> Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CAB31F replication status: Disconnected
> Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error
> Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000049C99 replication status: Disconnected
> Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001E3DA57 replication status: Disconnected
> Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000014388A replication status: Disconnected
> Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000160E1AA replication status: Disconnected
> Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 00000000021656B1 replication status: Disconnected
> Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001F20BAD replication status: Disconnected
> Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000D03303 replication status: Disconnected
> Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000215CF9C replication status: Disconnected
> Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 000000000213C207 replication status: Disconnected
> Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000019A2E00 replication status: Disconnected
> Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F75859 replication status: Disconnected
> Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000534F96 replication status: Disconnected
> Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001C69388 replication status: Disconnected
> Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F3B4C2 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 0000000002164F77 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 00000000020A8CF7 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020F4BCD replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021509FF replication status: Disconnected
> Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000025AB13 replication status: Disconnected
> Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000213C226 replication status: Disconnected
> Sep 1 14:21:09 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002151A03 replication status: Disconnected
> Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000002165995 replication status: Disconnected
> Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000001C7CD90 replication status: Disconnected
> Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000021615A3 replication status: Disconnected
> Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161C17 replication status: Disconnected
>
>
> On 9/1/2020 2:19 PM, WK wrote:
>
> I just added two new chunkservers to an existing cluster.
>
> I am seeing lots of these
>
> Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out
>
>
> and sometimes the master throws it out completely
>
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
> Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
> Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
> Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>
> All machines are running CentOS7
>
> However there is a mix of MFS versions.
>
> The master is running 3.0.105
>
> and the chunkservers are running various versions.
>
>
> 1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422> 1486939 9.8 TiB 11 TiB 90.49 - 0 0 B 0 B -
> 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422> 2313507 14 TiB 15 TiB 90.49 - 0 0 B 0 B -
> 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422> 2154089 13 TiB 14 TiB 90.49 - 0 0 B 0 B -
> 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422> 2900774 16 TiB 18 TiB 90.49 - 0 0 B 0 B -
> 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422> 2319192 13 TiB 14 TiB 90.49 - 0 0 B 0 B -
> 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422> 2131090 13 TiB 18 TiB 70.56 - 0 0 B 0 B -
> 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422> 7196 47 GiB 18 TiB 0.25 - 0 0 B 0 B -
>
> Only the two newest units, #6 and #7, are having problems, and they are running the latest MFS version. They were added due to the 90% disk space issue, so there is a lot of rebalancing going on.
>
> I assumed the problem was a mismatch between the 3.0.105 master and the new version, but #2 is also running 3.0.114 and is not having problems (though it does have an older kernel).
>
> The networking appears fine (iperf runs at 1GB), with no errors in dmesg etc.
>
> I will be scheduling some downtime to bring the master up to date shortly but I'm interested if anybody else is having this problem
>
> -wk
>
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
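
For anyone checking their own cluster against the version rule in the reply (the master should be at a version higher than or equal to every other component), here is a minimal Python sketch. It is only an illustration, not a MooseFS tool: the helper names parse_version and components_newer_than_master are hypothetical, and the version strings are copied by hand from the chunk server listing quoted in this thread.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.0.105' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def components_newer_than_master(master_version, component_versions):
    """Return the components whose version is strictly newer than the master's.

    Hypothetical helper: tuples of ints compare element by element, so
    (3, 0, 114) > (3, 0, 105) evaluates to True.
    """
    master = parse_version(master_version)
    return {
        name: version
        for name, version in component_versions.items()
        if parse_version(version) > master
    }


# Versions as reported in the chunk server listing quoted in this thread.
chunkservers = {
    "mfs66chunker1": "3.0.103",
    "mfs66chunker2": "3.0.114",
    "mfs66chunker3": "3.0.111",
    "mfs66chunker4": "3.0.103",
    "mfs66chunker5": "3.0.103",
    "mfs66chunker6": "3.0.114",
    "mfs66chunker7": "3.0.114",
}

offenders = components_newer_than_master("3.0.105", chunkservers)
for name, version in sorted(offenders.items()):
    print(f"{name} runs {version}, which is newer than the master (3.0.105)")
```

With the versions from this thread, the check flags chunkservers 2, 3, 6 and 7. Comparing integer tuples rather than raw strings matters here, because a plain string comparison can misorder components of different lengths, for example "3.0.99" against "3.0.105".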