From: Gandalf C. <gan...@gm...> - 2018-05-30 20:39:30
On Wed, 30 May 2018 at 22:11, R.C. <mil...@gm...> wrote:
>
> Are you able to always reproduce it?

Yes, always:

- kill -9 the leader
- one follower goes into "ELECT"
- start the killed leader back up (mfsmaster -a)

The old leader comes back as "FOLLOWER (DESYNC)", the elected node is still stuck in ELECT, and the one remaining follower is still FOLLOWER. So it seems that leader election doesn't work properly. Currently, all chunkservers are UP. (The exact command sequence is sketched after the log below.)

This is the syslog for the master stuck in the ELECT phase:

May 30 22:20:39 cs02 mfsmaster[3061]: connection was reset by Master
May 30 22:20:40 cs02 mfsmaster[3061]: connecting ...
May 30 22:20:40 cs02 mfsmaster[3061]: connection failed, error: ECONNREFUSED (Connection refused)
May 30 22:20:41 cs02 mfsmaster[3061]: state: ELECT ; changed 0 seconds ago ; leaderip: 0.0.0.0
May 30 22:20:45 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:20:47 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.13:9422,2)
May 30 22:20:47 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.13 / port: 9422, usedspace: 15160705024 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:50 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:20:51 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.11:9422,1)
May 30 22:20:51 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.11 / port: 9422, usedspace: 15158607872 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:55 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.12:9422,3)
May 30 22:20:55 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:55 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:09 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:10 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:14 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer

and so on...
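For reference, the full repro sequence as a rough shell sketch (assumptions: cs03 is the current leader, mfsmaster is started directly rather than via an init script, and syslog lands in /var/log/syslog):

  # on cs03 (the current leader): kill the master process hard
  kill -9 $(pidof mfsmaster)

  # on the surviving nodes (cs01, cs02): watch the HA state transitions
  grep 'mfsmaster.*state:' /var/log/syslog

  # back on cs03: restart the killed leader with automatic metadata recovery
  mfsmaster -a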
This is the log for the killed leader:

May 30 22:20:54 cs03 mfsmaster[6525]: open files limit has been set to: 16384
May 30 22:20:54 cs03 mfsmaster[6525]: set gid to 111
May 30 22:20:54 cs03 mfsmaster[6525]: set uid to 107
May 30 22:20:54 cs03 mfsmaster[6525]: out of memory killer disabled
May 30 22:20:54 cs03 mfsmaster[6525]: monotonic clock function: clock_gettime
May 30 22:20:54 cs03 mfsmaster[6525]: monotonic clock speed: 189371 ops / 10 mili seconds
May 30 22:20:54 cs03 mfsmaster[6525]: exports file has been loaded
May 30 22:20:54 cs03 mfsmaster[6525]: topology file has been loaded
May 30 22:20:55 cs03 mfsmaster[6525]: stats file has been loaded
May 30 22:20:55 cs03 mfsmaster[6525]: master <-> metaloggers module: listen on *:9419
May 30 22:20:55 cs03 mfsmaster[6525]: master <-> chunkservers module: listen on *:9420
May 30 22:20:55 cs03 mfsmaster[6525]: main master server module: listen on *:9421
May 30 22:21:00 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:00 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:00 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:05 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:05 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:05 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:10 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:10 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:10 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:15 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:15 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:15 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:20 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:20 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:20 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:25 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:25 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:25 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:30 cs03 mfsmaster[6525]: connecting ...

and so on...

Log for the follower:

May 30 22:20:39 cs01 mfsmaster[5207]: connection was reset by Master
May 30 22:20:40 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:40 cs01 mfsmaster[5207]: connection failed, error: ECONNREFUSED (Connection refused)
May 30 22:20:45 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:45 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:45 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:20:50 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:50 cs01 mfsmaster[5207]: chunkserver disconnected - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:50 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:50 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:20:55 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:55 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:55 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:00 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:00 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:00 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:05 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:05 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:05 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:10 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:10 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:10 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:15 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:15 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:15 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:20 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:20 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:20 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578

It seems that the connection to a master is successful, but which master? I don't have any master right now, only an ELECT.

Restarting the ELECT node doesn't help.
Restarting the DESYNC node doesn't help.
Restarting the FOLLOWER node puts that node in DESYNC too, and I'm still without any master.
Then, restarting the ELECT node while the other 2 are in DESYNC demotes the ELECT to DESYNC, so I'm still without any master and all nodes are in DESYNC.
After a while with all nodes in DESYNC, a leader was finally elected. Now there is one undergoal chunk.

> Could you please describe and report here the complete configuration of
> your sandbox system?

The only changes from the defaults are:

MASTER_HOST = my.master.domain  # this points to all 3 servers
ATIME_MODE = 2

In the chunkservers I've set the speed limit to 30 MB/s. Everything else is unchanged.

3 servers; on each server I run both the chunkserver and the master processes. Each server has 2 HDDs configured (ZFS). There is also another server (in production) used as a client, where I mount the MooseFS storage (on this server I can't do any kind of test beyond changing the MooseFS config; no reboots or similar, it's a production server). The test network is just one gigabit switch; no bonding or anything else.
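For completeness, roughly how those non-default lines sit in the config files (a sketch only: I'm assuming the stock /etc/mfs layout, and that MASTER_HOST is set both master-side, so followers can find the leader, and chunkserver-side; my.master.domain is just a DNS name resolving to all three masters):

  # /etc/mfs/mfsmaster.cfg (on all three servers)
  MASTER_HOST = my.master.domain
  ATIME_MODE = 2

  # /etc/mfs/mfschunkserver.cfg (on all three servers)
  MASTER_HOST = my.master.domain

  # the DNS name behind it, e.g.:
  # $ dig +short my.master.domain
  # 10.200.1.11
  # 10.200.1.12
  # 10.200.1.13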