From: Diego R. <dij...@ae...> - 2018-06-04 15:17:13
See below...

>>
>> In these 3 servers i'm running master and chunkservers, thus, i have
>> 3 chunkservers
>
> ok.
>
>>
>> I've not checked for any disconnected chunkserver but it should block
>> after 2 disconnected chunkservers, right?
>
> yes (if the total number of known chunkserver is 3 or 4).
>
>> In a 3 nodes cluster, quorum is met at 2, so it should survive at 1
>> chunkserver failure and i'm pretty sure that i don't have 2
>> chunkservers down during the master switch
>
> Yes. I've seen in your log three chunkservers connected to ELECT -
> this is really strange. Could you please send us some screenshots from
> your CGI?
>
> As I understand you have three masters and three chunkservers on the
> same machines and ELECT is not becoming LEADER for a long time
> (minutes) after killing LEADER, but when you stop everything and start
> again then you have LEADER?
>

Here are my observations. I have 3 servers running 4.6.0. Two of them (ae-fs02 and ae-fs03) have both roles, master and chunkserver; the other (ae-fs01) has metalogger and chunkserver.

When ae-fs03 is LEADER, I stop the chunkserver process on ae-fs01, then stop the master on ae-fs03. ae-fs02, which was FOLLOWER, then becomes ELECT, and within 2-3 seconds it becomes LEADER, while ae-fs03 shows up as dead.

At this point ae-fs02 is the LEADER, and ae-fs03 becomes FOLLOWER when I start the moosefs-master service.

If I reboot ae-fs02, ae-fs03 continues to show up as FOLLOWER, and it takes a long time, over a minute, for it to become ELECT and then LEADER.

The GUI shows:

Until it finally changes to:

After ae-fs02 has rebooted and the moosefs services have started, the cluster is back to normal.

HTH,

Diego
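
A footnote on the quorum arithmetic quoted above ("quorum is met at 2" for 3 nodes, and blocking after 2 disconnected chunkservers when 3 or 4 are known): the numbers are consistent with a plain strict-majority rule. The sketch below only illustrates that arithmetic. The strict-majority rule is an assumption read from the exchange, not taken from the MooseFS source, and the function names `quorum` and `tolerated_failures` are hypothetical.

```python
def quorum(known: int) -> int:
    """Smallest count of connected chunkservers that is a strict majority.

    Assumption: quorum means "more than half of the known chunkservers".
    """
    return known // 2 + 1


def tolerated_failures(known: int) -> int:
    """How many chunkservers may disconnect while a strict majority remains."""
    return known - quorum(known)


if __name__ == "__main__":
    # The two cluster sizes mentioned in the thread.
    for known in (3, 4):
        print(f"known={known} quorum={quorum(known)} "
              f"tolerated_failures={tolerated_failures(known)}")
```

Under this assumption, both 3 and 4 known chunkservers tolerate a single disconnect, and a second disconnect loses the majority, which matches the "yes (if the total number of known chunkserver is 3 or 4)" answer.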