Re: [MooseFS-Users] replicator: receive timed out

On 01.09.2020 at 23:19, WK wrote:
> I just added two new chunkservers to an existing cluster.
> 
> I am seeing lots of these
> 
> Sep  1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out
> 
> 
> and sometimes the master throws it out completely
> 
> Sep  1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39.
> Sep  1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster.
> Sep  1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s)
> Sep  1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master
> Sep  1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master
> Sep  1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ...
> Sep  1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master
> Sep  1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> Sep  1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost
> 
> All machines are running CentOS7
> 
> However there is a mix of MFS versions.
> 
> The master is running 3.0.105
> 
> and the chunkservers are running various versions.
> 
> 
> 1  mfs66chunker1  10.166.0.21  9422  3  -  3.0.103  4    OFF  1486939  9.8 TiB  11 TiB  90.49  -  0  0 B  0 B  -
> 2  mfs66chunker2  10.166.0.22  9422  2  -  3.0.114  6    OFF  2313507  14 TiB   15 TiB  90.49  -  0  0 B  0 B  -
> 3  mfs66chunker3  10.166.0.23  9422  1  -  3.0.111  13   OFF  2154089  13 TiB   14 TiB  90.49  -  0  0 B  0 B  -
> 4  mfs66chunker4  10.166.0.24  9422  4  -  3.0.103  12   OFF  2900774  16 TiB   18 TiB  90.49  -  0  0 B  0 B  -
> 5  mfs66chunker5  10.166.0.25  9422  5  -  3.0.103  12   OFF  2319192  13 TiB   14 TiB  90.49  -  0  0 B  0 B  -
> 6  mfs66chunker6  10.166.0.26  9422  6  -  3.0.114  (6)  OFF  2131090  13 TiB   18 TiB  70.56  -  0  0 B  0 B  -
> 7  mfs66chunker7  10.166.0.27  9422  7  -  3.0.114  4    OFF  7196     47 GiB   18 TiB  0.25   -  0  0 B  0 B  -
> 
> Only the two newest units #6 and #7 are having problems and they are 
> running the latest MFS version. They were added due to the 90% disk 
> space issue, so there is a lot of rebalancing going on.
> 
> I assumed the problem was a mismatch between the 3.0.105 master and the 
> new version, but #2 is also running 3.0.114 and is not having problems 
> (though it does have an older kernel).
> 
> The networking appears fine (iperf runs at 1GB), and there are no errors 
> in dmesg, etc.
> 
> I will be scheduling some downtime to bring the master up to date 
> shortly, but I'm interested to hear if anybody else is having this problem.

Hi,

It's definitely not a good idea to have an older master and newer chunk 
servers. I would suggest upgrading everything to match the newest chunk 
servers.

The messages you are getting are simply indicators of timed out 
connections. A connection between two MFS modules can time out due to 
network errors or due to one of the modules being "too busy" and not 
responding in time. "Too busy" might mean a number of things: slow I/O 
on local disks, CPU not keeping up (happens when you have other 
processes running on the same machines as MFS modules) and a number of 
other factors. You need to take a look at your system and try to find 
the bottleneck.
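
As a rough aid for that search, here is a minimal Python sketch that counts 
replicator messages per chunkserver and per hour, so you can compare the 
timing with disk or CPU load. It assumes syslog lines in the same format as 
the ones quoted above; the function name and the grouping by hour are 
illustrative choices, not part of MooseFS.

```python
import re
from collections import Counter

# Matches lines such as:
# "Sep  1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out"
# (format assumed from the log excerpt in this thread)
LINE_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d+)\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+mfschunkserver\[\d+\]:\s+"
    r"replicator:\s+(?P<msg>.+?)\s*$"
)


def tally_replicator_errors(lines):
    """Count replicator messages per (host, hour, message).

    Hypothetical helper: lines that are not mfschunkserver replicator
    messages are ignored.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        day = " ".join(m.group("day").split())  # "Sep  1" -> "Sep 1"
        hour = f"{day} {m.group('hour')}:00"
        counts[(m.group("host"), hour, m.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < /var/log/messages
    for key, n in sorted(tally_replicator_errors(sys.stdin).items()):
        print(n, *key, sep="\t")
```

If the counts cluster around particular hours, check what else the machine 
was doing at that time (disk I/O, other processes).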

For starters, you can try to lower the replication limits (fourth value 
in the CHUNKS_WRITE_REP_LIMIT and CHUNKS_READ_REP_LIMIT settings) and 
see if it helps get rid of the messages.
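
To make "fourth value" concrete: both settings are comma-separated lists, 
and only the last of the four numbers changes. The small Python sketch 
below shows the edit on a sample string. The value "2,1,1,4" is only an 
illustrative placeholder, not necessarily what your master is configured 
with, and the helper name is hypothetical.

```python
def lower_fourth_limit(value, new_fourth):
    """Return a four-value limit string with its fourth value replaced.

    `value` is a comma-separated string such as "2,1,1,4" (placeholder
    numbers). `new_fourth` must be a positive integer.
    """
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError("expected exactly four comma-separated values")
    if new_fourth < 1:
        raise ValueError("new_fourth must be a positive integer")
    parts[3] = str(new_fourth)
    return ",".join(parts)


# Example with placeholder numbers: only the last value changes.
print("CHUNKS_WRITE_REP_LIMIT =", lower_fourth_limit("2,1,1,4", 2))
```

Apply the same change to CHUNKS_READ_REP_LIMIT, then watch whether the 
timeout messages become less frequent.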

--
Agata Kruszona-Zawadzka
MooseFS Team
