From: wkmail <wk...@bn...> - 2020-09-03 04:21:46
There *was* something going on with 10.166.0.26. There were originally three drives there, and we had added the 4th drive (along with a brand-new 10.166.0.27) due to the impending space issue.

A few hours after my initial note, I noticed that the original three drives on 10.166.0.26 went from 90% full to 99% full during replication. I was worried that running out of space on those drives would cause an issue, so I put them into marked-for-removal mode (i.e. */mfsmountA, etc.). The errors on both 10.166.0.26 and 10.166.0.27 immediately went away, and replication is proceeding as it normally does on the other nodes.

This weekend is a holiday in the US, so I will have the opportunity to shut down and update the master to a proper version. MooseFS is so good that it's easy to become lazy about such things.

Thx

-wk

On 9/2/2020 2:21 AM, Aleksander Wieliczko wrote:
> It looks like your chunk servers 10.166.0.26 and 10.166.0.27 have some
> serious problems: network and disk problems ("IO ERROR").
> Are you sure that they have enough resources - I mean, they are not
> swapping and they don't have any network problems?
>
> Also, at the moment it looks like your cluster is running with a
> disallowed configuration!
> The MooseFS master should always be equal to or higher in version than
> the other MooseFS components (chunk servers, clients, metaloggers).
> Right now your master server is running version 3.0.105 - this can
> lead to weird cluster behavior.
>
> Please update all components to the same version first.
>
> Best regards,
>
> Aleksander Wieliczko
> System Engineer
> MooseFS Development & Support Team | moosefs.pro
>
>
> On Tue, 1 Sep 2020 at 23:36, WK <wk...@bn...> wrote:
>
> Forgot to show the master logs
>
> Getting tons of these. All errors involve the two new systems.
>
> Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected
> Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected
> Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected
> Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161D0F replication status: Disconnected
> Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000002A71B0 replication status: Disconnected
> Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000001FDF21 replication status: Disconnected
> Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000001EB5CE2 replication status: Disconnected
> Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002132B93 replication status: Disconnected
> Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CD13D5 replication status: Disconnected
> Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000827F86 replication status: Disconnected
> Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000867B8E replication status: Disconnected
> Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000803F97 replication status: Disconnected
> Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020A5306 replication status: Disconnected
> Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000133697 replication status: Disconnected
> Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000007B7FCF replication status: Disconnected
> Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000002966E3 replication status: Disconnected
> Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001C96E65 replication status: Disconnected
> Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000213C6EA replication status: Disconnected
> Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 00000000021651E6 replication status: Disconnected
> Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002150C8F replication status: Disconnected
> Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021070A8 replication status: Disconnected
> Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CAB31F replication status: Disconnected
> Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error
> Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000049C99 replication status: Disconnected
> Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001E3DA57 replication status: Disconnected
> Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000014388A replication status: Disconnected
> Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000160E1AA replication status: Disconnected
> Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 00000000021656B1 replication status: Disconnected
> Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001F20BAD replication status: Disconnected
> Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000D03303 replication status: Disconnected
> Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000215CF9C replication status: Disconnected
> Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 000000000213C207 replication status: Disconnected
> Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000019A2E00 replication status: Disconnected
> Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F75859 replication status: Disconnected
> Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000534F96 replication status: Disconnected
> Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001C69388 replication status: Disconnected
> Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F3B4C2 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 0000000002164F77 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 00000000020A8CF7 replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020F4BCD replication status: Disconnected
> Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021509FF replication status: Disconnected
> Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000025AB13 replication status: Disconnected
> Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000213C226 replication status: Disconnected
> Sep 1 14:21:09 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002151A03 replication status: Disconnected
> Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000002165995 replication status: Disconnected
> Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000001C7CD90 replication status: Disconnected
> Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000021615A3 replication status: Disconnected
> Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161C17 replication status: Disconnected
>
> On 9/1/2020 2:19 PM, WK wrote:
>> I just added two new chunkservers to an existing cluster.
>>
>> I am seeing lots of these:
>>
>> Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out
>>
>> and sometimes the master throws it out completely:
>>
>> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
>> Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
>> Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
>> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
>> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
>> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
>> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
>> Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
>>
>> All machines are running CentOS7.
>>
>> However, there is a mix of MFS versions. The master is running 3.0.105
>> and the chunkservers are running various versions:
>>
>>  #  host           ip           port  id  labels  version  load  maint  chunks   used     total   used %
>>  1  mfs66chunker1  10.166.0.21  9422   3  -       3.0.103     4  OFF    1486939  9.8 TiB  11 TiB  90.49
>>  2  mfs66chunker2  10.166.0.22  9422   2  -       3.0.114     6  OFF    2313507   14 TiB  15 TiB  90.49
>>  3  mfs66chunker3  10.166.0.23  9422   1  -       3.0.111    13  OFF    2154089   13 TiB  14 TiB  90.49
>>  4  mfs66chunker4  10.166.0.24  9422   4  -       3.0.103    12  OFF    2900774   16 TiB  18 TiB  90.49
>>  5  mfs66chunker5  10.166.0.25  9422   5  -       3.0.103    12  OFF    2319192   13 TiB  18 TiB  70.56 *
>>  6  mfs66chunker6  10.166.0.26  9422   6  -       3.0.114   (6)  OFF    2131090   13 TiB  18 TiB  70.56
>>  7  mfs66chunker7  10.166.0.27  9422   7  -       3.0.114     4  OFF       7196   47 GiB  18 TiB   0.25
>>
>> Only the two newest units, #6 and #7, are having problems, and they
>> are running the latest MFS version. They were added due to the 90%
>> disk-space issue, so there is a lot of rebalancing going on.
>> I assumed the problem was a mismatch between the 3.0.105 master and
>> the new version, but #2 is also running 3.0.114 and is not having
>> problems (though it does have an older kernel).
>>
>> The networking appears fine (iperf runs at 1 Gbit/s), no errors in
>> dmesg, etc.
>>
>> I will be scheduling some downtime to bring the master up to date
>> shortly, but I'm interested in whether anybody else is having this
>> problem.
>>
>> -wk
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
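
For reference, the "*/mfsmountA" notation mentioned in the reply refers to MooseFS's marked-for-removal mechanism in the chunkserver's mfshdd.cfg: prefixing a data path with "*" tells the master to replicate that disk's chunks elsewhere and to stop placing new chunks on it. A minimal sketch of what that change might look like (the paths are illustrative, not the poster's actual configuration):

```
# /etc/mfs/mfshdd.cfg on the chunkserver -- one data path per line.
# A leading '*' marks the disk "for removal": its chunks are replicated
# to other disks/servers and no new chunks are written to it.
*/mfsmountA
*/mfsmountB
*/mfsmountC
# The newly added 4th drive stays in normal service:
/mfsmountD
```

After editing the file, reload the chunkserver (e.g. `mfschunkserver reload`) so the change takes effect; once the CGI interface shows no chunks left on the marked disks, they can be removed from the config entirely.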