From: Jakub Kruszona-Z. <jak...@ge...> - 2018-06-04 05:31:12
> On 30 May, 2018, at 22:39, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> On Wed, 30 May 2018 at 22:11, R.C. <mil...@gm...> wrote:
>>
>> Are you able to always reproduce it?
>
> Yes, always:
>
> - kill -9 the leader
> - one follower goes in "ELECT"
> - start back the killed leader (mfsmaster -a)
>
> "old leader" is set as "FOLLOWER DESYNC", the elect leader is still in
> ELECT, the only one follower left is still in FOLLOWER

It is ok.

> So, it seems that leader election doesn't work properly.

It works. You have ELECT.

> Currently, all chunkservers are UP

ELECT becomes LEADER when more than half of the registered chunkservers are connected. I see from the log that at least three (IP: .11, .12 and .13) connected to ELECT. How many chunkservers do you have in the "Servers" tab in CGI? Do you have any "disconnected" chunkservers?

> This is the syslog for the master stuck in ELECT phase:
>
> May 30 22:20:39 cs02 mfsmaster[3061]: connection was reset by Master
> May 30 22:20:40 cs02 mfsmaster[3061]: connecting ...
> May 30 22:20:40 cs02 mfsmaster[3061]: connection failed, error: ECONNREFUSED (Connection refused)
> May 30 22:20:41 cs02 mfsmaster[3061]: state: ELECT ; changed 0 seconds ago ; leaderip: 0.0.0.0
> May 30 22:20:45 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:20:47 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.13:9422,2)
> May 30 22:20:47 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.13 / port: 9422, usedspace: 15160705024 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:50 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:20:51 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.11:9422,1)
> May 30 22:20:51 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.11 / port: 9422, usedspace: 15158607872 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:55 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.12:9422,3)
> May 30 22:20:55 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:55 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:09 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:10 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:14 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> and so on.....
>
> It seems that connection to a master is successful, but which master?
> I don't have any master right now, only an ELECT

It is ok. You should have one ELECT after the LEADER dies, and the ELECT should become LEADER quickly after that. If it doesn't, then it looks like you have a configuration problem (number of working chunkservers vs number of known chunkservers).

-- 
Regards,
Jakub Kruszona-Zawadzki
- - - - - - - - - - - - - - - -
Segmentation fault (core dumped)
Phone: +48 602 212 039
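
P.S. The promotion rule mentioned in this message (ELECT becomes LEADER when more than half of the registered chunkservers are connected) can be illustrated with a small Python sketch. This is not MooseFS source code, only the majority check as described here; the function name `elect_can_promote` and its two arguments are hypothetical, and the counts in the example are made up for illustration.

```python
def elect_can_promote(registered: int, connected: int) -> bool:
    """Return True when strictly more than half of the registered
    (known) chunkservers are currently connected.

    Hypothetical helper: it only mirrors the rule stated in the
    message, not the actual mfsmaster implementation.
    """
    if registered < 0 or connected < 0:
        raise ValueError("counts must be non-negative")
    # Integer form of connected > registered / 2, avoids float rounding.
    return 2 * connected > registered


# Three chunkservers known, all three connected: majority reached.
print(elect_can_promote(registered=3, connected=3))

# Six chunkservers known (e.g. three stale "disconnected" entries),
# only three connected: exactly half is not a majority.
print(elect_can_promote(registered=6, connected=3))
```

The second call shows why the number of known chunkservers matters: stale "disconnected" entries raise the registered count, so an ELECT can stay in ELECT even though every working chunkserver is connected.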