From: Gandalf C. <gan...@gm...> - 2018-06-04 12:19:14

On Mon, 4 Jun 2018 at 08:55, Jakub Kruszona-Zawadzki <jak...@ge...> wrote:
> Yes. I've seen in your log three chunkservers connected to ELECT - this is really strange. Could you please send us some screenshots from your CGI?
>
> As I understand you have three masters and three chunkservers on the same machines and ELECT is not becoming LEADER for a long time (minutes) after killing LEADER, but when you stop everything and start again then you have LEADER?

I've tried again right now and the transition is happening properly. Nothing has changed between the last time I tested and today. Absolutely nothing.

From: Gandalf C. <gan...@gm...> - 2018-06-04 07:18:31

On Mon, 4 Jun 2018 at 08:55, Jakub Kruszona-Zawadzki <jak...@ge...> wrote:
> Yes. I've seen in your log three chunkservers connected to ELECT - this is
> really strange. Could you please send us some screenshots from your CGI?

I'll try today if I have spare time.

> As I understand you have three masters and three chunkservers on the same
> machines and ELECT is not becoming LEADER for a long time (minutes) after
> killing LEADER, but when you stop everything and start again then you have
> LEADER?

Exactly. I've waited for hours, not minutes. After stopping all master processes (I've never touched the chunkservers), a new leader is elected (but a different one from the hung ELECT, for what it's worth).

From: Jakub Kruszona-Z. <jak...@ge...> - 2018-06-04 06:55:49

> On 4 Jun 2018, at 8:30, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> On Mon, 4 Jun 2018 at 07:31, Jakub Kruszona-Zawadzki <jak...@ge...> wrote:
>> It is ok. You should have one ELECT after LEADER death. ELECT should become LEADER quickly after that.
>> If it doesn't work then it looks like you have some configuration problems (number of working chunkservers vs number of known chunkservers).
>
> I have a 3 servers cluster (it's a test, but the hardware is what I'll put in production)

It's ok.

> In these 3 servers I'm running master and chunkservers, thus, I have 3 chunkservers

OK.

> I've not checked for any disconnected chunkserver, but it should block after 2 disconnected chunkservers, right?

Yes (if the total number of known chunkservers is 3 or 4).

> In a 3 nodes cluster, quorum is met at 2, so it should survive 1 chunkserver failure, and I'm pretty sure that I don't have 2 chunkservers down during the master switch

Yes. I've seen in your log three chunkservers connected to ELECT - this is really strange. Could you please send us some screenshots from your CGI?

As I understand you have three masters and three chunkservers on the same machines and ELECT is not becoming LEADER for a long time (minutes) after killing LEADER, but when you stop everything and start again then you have LEADER?

> Anyway, is an odd number of metadata servers/chunkservers suggested?

No. Maybe for a small number of chunkservers it is better to have an odd number, but only because it is more efficient (the safety level, in terms of the number of chunkservers that may die, is the same for 2N and 2N-1 servers). In your case (3 servers) you may of course add one more, but still only one can die without stopping the cluster.

> Because with an even number, splitbrain could arise and quorum can't always be met. (4 nodes, 2 down: splitbrain and no quorum)

No splitbrain, because MORE than half is needed for quorum, so in the case of 4 we need 3 for quorum.

--
Regards, Jakub Kruszona-Zawadzki
- - - - - - - - - - - - - - - -
Segmentation fault (core dumped)
Phone: +48 602 212 039

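The quorum arithmetic described in the reply above can be summarized in a few lines. This is only an illustrative sketch of the "more than half" rule, not MooseFS code:

# Illustrative only - this is not MooseFS code, it just mirrors the arithmetic
# described above ("MORE than half is needed for quorum").

def quorum(known_chunkservers: int) -> int:
    """Smallest number of connected chunkservers that is more than half."""
    return known_chunkservers // 2 + 1

def tolerated_failures(known_chunkservers: int) -> int:
    """How many chunkservers may be down while quorum still holds."""
    return known_chunkservers - quorum(known_chunkservers)

for n in (3, 4, 5, 6, 7):
    print(f"{n} chunkservers: quorum = {quorum(n)}, may lose {tolerated_failures(n)}")

# Output:
# 3 chunkservers: quorum = 2, may lose 1
# 4 chunkservers: quorum = 3, may lose 1   <- same tolerance as 3 (2N vs 2N-1)
# 5 chunkservers: quorum = 3, may lose 2
# 6 chunkservers: quorum = 4, may lose 2
# 7 chunkservers: quorum = 4, may lose 3

Since a strict majority can exist in at most one partition, two LEADERs can never be active at the same time, which is the split-brain protection mentioned above.
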
From: Gandalf C. <gan...@gm...> - 2018-06-04 06:31:07

On Mon, 4 Jun 2018 at 07:31, Jakub Kruszona-Zawadzki <jak...@ge...> wrote:
> It is ok. You should have one ELECT after LEADER death. ELECT should
> become LEADER quickly after that.
> If it doesn't work then it looks like you have some configuration
> problems (number of working chunkservers vs number of known chunkservers).

I have a 3 servers cluster (it's a test, but the hardware is what I'll put in production).

In these 3 servers I'm running master and chunkservers, thus, I have 3 chunkservers.

I've not checked for any disconnected chunkserver, but it should block after 2 disconnected chunkservers, right?

In a 3 nodes cluster, quorum is met at 2, so it should survive 1 chunkserver failure, and I'm pretty sure that I don't have 2 chunkservers down during the master switch.

Anyway, is an odd number of metadata servers/chunkservers suggested? Because with an even number, splitbrain could arise and quorum can't always be met. (4 nodes, 2 down: splitbrain and no quorum.)

From: Jakub Kruszona-Z. <jak...@ge...> - 2018-06-04 05:31:12

> On 30 May 2018, at 22:39, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> On Wed, 30 May 2018 at 22:11, R.C. <mil...@gm...> wrote:
>>
>> Are you able to always reproduce it?
>
> Yes, always:
>
> - kill -9 the leader
> - one follower goes in "ELECT"
> - start back the killed leader (mfsmaster -a)
>
> The "old leader" is set as "FOLLOWER DESYNC", the elected leader is still in ELECT, the only follower left is still in FOLLOWER.

It is ok.

> So, it seems that leader election doesn't work properly.

It works. You have ELECT.

> Currently, all chunkservers are UP

ELECT becomes LEADER when more than half of the registered chunkservers are connected. I see from the log that at least three (IP: .11, .12 and .13) connected to ELECT. How many chunkservers do you have in the "Servers" tab in CGI? Do you have "disconnected" chunkservers?

> This is the syslog for the master stuck in ELECT phase:
>
> May 30 22:20:39 cs02 mfsmaster[3061]: connection was reset by Master
> May 30 22:20:40 cs02 mfsmaster[3061]: connecting ...
> May 30 22:20:40 cs02 mfsmaster[3061]: connection failed, error: ECONNREFUSED (Connection refused)
> May 30 22:20:41 cs02 mfsmaster[3061]: state: ELECT ; changed 0 seconds ago ; leaderip: 0.0.0.0
> May 30 22:20:45 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:20:47 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.13:9422,2)
> May 30 22:20:47 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.13 / port: 9422, usedspace: 15160705024 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:50 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:20:51 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.11:9422,1)
> May 30 22:20:51 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.11 / port: 9422, usedspace: 15158607872 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:55 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.12:9422,3)
> May 30 22:20:55 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
> May 30 22:20:55 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:09 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> May 30 22:21:10 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
> May 30 22:21:14 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
> and so on.....
>
> It seems that the connection to a master is successful, but which master? I don't have any master right now, only an ELECT.

It is ok. You should have one ELECT after LEADER death. ELECT should become LEADER quickly after that. If it doesn't work then it looks like you have some configuration problems (number of working chunkservers vs number of known chunkservers).

--
Regards, Jakub Kruszona-Zawadzki
- - - - - - - - - - - - - - - -
Segmentation fault (core dumped)
Phone: +48 602 212 039

From: Gandalf C. <gan...@gm...> - 2018-05-31 15:06:42

On Thu, 31 May 2018 at 16:56, R.C. <mil...@gm...> wrote:
> You expect that a production master is going to have less available space on disk than RAM?

This is not an answer, for at least 5 reasons:

1) it's a test cluster
2) our metadata server is using 2.1 GB
3) I have 4 GB free on disk
4) free disk space is the same on every other server; only one crashed
5) I'm expecting that, on a multi-master cluster, a bad metadata dump doesn't kill the instances, for 2 reasons:
   a) it's multi-master, so the metadata dump could be done on another server
   b) a server failure (for whatever reason) could move the current leader to a different node, even one without free space (metadata are kept in RAM)

So, why, with tons of free RAM available and a multi-master cluster, should running out of space during the dump kill the master process on a follower? I don't see any advantage in doing this.

From: R.C. <mil...@gm...> - 2018-05-31 14:55:42

You expect that a production master is going to have less available space on disk than RAM?

Original message
From: Gandalf Corvotempesta
Sent: Thursday, 31 May 2018 13:09
To: moo...@li...
Subject: [MooseFS-Users] [v4] master process crashed on metadata dump

May 31 13:00:07 cs01 mfsmaster[21698]: write error
May 31 13:00:07 cs01 mfsmaster[21698]: can't write metadata
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[20898]: child finished
May 31 13:00:15 cs01 mfsmaster[20898]: store process has finished - store time: 15.774
May 31 13:00:15 cs01 mfsmaster[20898]: metadata not stored !!! (child exited) - exiting
May 31 13:00:15 cs01 mfsmaster[20898]: internal terminate request
May 31 13:00:15 cs01 mfsmaster[20898]: state: transition FOLLOWER -> DUMMY ; changed 52033 seconds ago ; leaderip: 10.200.1.13
May 31 13:00:15 cs01 mfsmaster[20898]: state: DUMMY ; changed 0 seconds ago ; leaderip: 10.200.1.13
May 31 13:00:16 cs01 mfsmaster[20898]: exited from main loop
May 31 13:00:16 cs01 mfsmaster[20898]: exititng ...
May 31 13:00:16 cs01 mfsmaster[20898]: main master server module: closing *:9421
May 31 13:00:16 cs01 mfsmaster[20898]: master <-> chunkservers module: closing *:9420
May 31 13:00:16 cs01 mfsmaster[20898]: master control module: closing *:9419
May 31 13:00:23 cs01 mfsmaster[20898]: cleaning metadata ...
May 31 13:00:24 cs01 mfsmaster[20898]: metadata have been cleaned
May 31 13:00:24 cs01 mfsmaster[20898]: process exited successfully (status:0)
May 31 13:00:25 cs01 mfsmaster[24058]: set gid to 111
May 31 13:00:25 cs01 mfsmaster[24058]: set uid to 107
May 31 13:00:25 cs01 mfsmaster[24058]: can't find process to terminate

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  9.1M  1.6G   1% /run
/dev/sdd1        12G  8.2G  3.9G  69% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
chunk0          1.8T  239G  1.6T  14% /mnt/chunks/chunk0
chunk1          1.8T  243G  1.6T  14% /mnt/chunks/chunk1
tmpfs           1.6G     0  1.6G   0% /run/user/1000

Maybe I'm running out of space, but I think that the master process shouldn't crash.

# ls -la /var/lib/mfs/
total 7352160
drwxr-xr-x  2 mfs  mfs         311 May 31 13:00 .
drwxr-xr-x 25 root root       4096 May 31 00:13 ..
-rw-r-----  1 mfs  mfs           5 May 30 21:39 .mfschunkserver.lock
-rw-r-----  1 mfs  mfs  4870620149 May 31 13:00 changelog.1.mfs
-rw-r-----  1 mfs  mfs          90 May 30 18:30 changelog.12.mfs
-rw-r-----  1 mfs  mfs    16509535 May 30 22:20 changelog.2.mfs
-rw-r-----  1 mfs  mfs        1164 May 30 21:32 changelog.4.mfs
-rw-r-----  1 mfs  mfs          10 May 30 21:17 chunkserverid.mfs
-rw-r-----  1 mfs  mfs     4066348 May 31 13:00 csstats.mfs
-rw-r-----  1 mfs  mfs         144 May 31 13:00 metadata.crc
-rw-r-----  1 mfs  mfs   910498282 May 31 13:00 metadata.mfs
-rw-r-----  1 mfs  mfs   852231792 May 31 12:00 metadata.mfs.back.1
-rw-r-----  1 mfs  mfs   870973440 May 31 13:00 metadata.mfs.emergency
-rw-r--r--  1 root root          8 May 17 09:28 metadata.mfs.empty
-rw-r-----  1 mfs  mfs     3672832 May 31 13:00 stats.mfs

From: Gandalf C. <gan...@gm...> - 2018-05-31 11:09:36

May 31 13:00:07 cs01 mfsmaster[21698]: write error
May 31 13:00:07 cs01 mfsmaster[21698]: can't write metadata
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[21698]: write error
May 31 13:00:15 cs01 mfsmaster[20898]: child finished
May 31 13:00:15 cs01 mfsmaster[20898]: store process has finished - store time: 15.774
May 31 13:00:15 cs01 mfsmaster[20898]: metadata not stored !!! (child exited) - exiting
May 31 13:00:15 cs01 mfsmaster[20898]: internal terminate request
May 31 13:00:15 cs01 mfsmaster[20898]: state: transition FOLLOWER -> DUMMY ; changed 52033 seconds ago ; leaderip: 10.200.1.13
May 31 13:00:15 cs01 mfsmaster[20898]: state: DUMMY ; changed 0 seconds ago ; leaderip: 10.200.1.13
May 31 13:00:16 cs01 mfsmaster[20898]: exited from main loop
May 31 13:00:16 cs01 mfsmaster[20898]: exititng ...
May 31 13:00:16 cs01 mfsmaster[20898]: main master server module: closing *:9421
May 31 13:00:16 cs01 mfsmaster[20898]: master <-> chunkservers module: closing *:9420
May 31 13:00:16 cs01 mfsmaster[20898]: master control module: closing *:9419
May 31 13:00:23 cs01 mfsmaster[20898]: cleaning metadata ...
May 31 13:00:24 cs01 mfsmaster[20898]: metadata have been cleaned
May 31 13:00:24 cs01 mfsmaster[20898]: process exited successfully (status:0)
May 31 13:00:25 cs01 mfsmaster[24058]: set gid to 111
May 31 13:00:25 cs01 mfsmaster[24058]: set uid to 107
May 31 13:00:25 cs01 mfsmaster[24058]: can't find process to terminate

# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G     0  7.9G   0% /dev
tmpfs           1.6G  9.1M  1.6G   1% /run
/dev/sdd1        12G  8.2G  3.9G  69% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
chunk0          1.8T  239G  1.6T  14% /mnt/chunks/chunk0
chunk1          1.8T  243G  1.6T  14% /mnt/chunks/chunk1
tmpfs           1.6G     0  1.6G   0% /run/user/1000

Maybe I'm running out of space, but I think that the master process shouldn't crash.

# ls -la /var/lib/mfs/
total 7352160
drwxr-xr-x  2 mfs  mfs         311 May 31 13:00 .
drwxr-xr-x 25 root root       4096 May 31 00:13 ..
-rw-r-----  1 mfs  mfs           5 May 30 21:39 .mfschunkserver.lock
-rw-r-----  1 mfs  mfs  4870620149 May 31 13:00 changelog.1.mfs
-rw-r-----  1 mfs  mfs          90 May 30 18:30 changelog.12.mfs
-rw-r-----  1 mfs  mfs    16509535 May 30 22:20 changelog.2.mfs
-rw-r-----  1 mfs  mfs        1164 May 30 21:32 changelog.4.mfs
-rw-r-----  1 mfs  mfs          10 May 30 21:17 chunkserverid.mfs
-rw-r-----  1 mfs  mfs     4066348 May 31 13:00 csstats.mfs
-rw-r-----  1 mfs  mfs         144 May 31 13:00 metadata.crc
-rw-r-----  1 mfs  mfs   910498282 May 31 13:00 metadata.mfs
-rw-r-----  1 mfs  mfs   852231792 May 31 12:00 metadata.mfs.back.1
-rw-r-----  1 mfs  mfs   870973440 May 31 13:00 metadata.mfs.emergency
-rw-r--r--  1 root root          8 May 17 09:28 metadata.mfs.empty
-rw-r-----  1 mfs  mfs     3672832 May 31 13:00 stats.mfs

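The follower above died because the background metadata dump ran out of space on the root filesystem. A hypothetical pre-flight check along these lines (not part of MooseFS; the data path and headroom factor are assumptions) could warn before the next dump is attempted:

#!/usr/bin/env python3
# Hypothetical pre-flight check, not part of MooseFS: warn when the data
# directory has less free space than the previous metadata dump plus headroom.
import os
import shutil
import sys

MFS_DATA_DIR = "/var/lib/mfs"      # assumption: default data path
HEADROOM = 1.5                     # assumption: allow the next dump to grow by 50%

def main() -> int:
    last_dump = os.path.join(MFS_DATA_DIR, "metadata.mfs")
    try:
        needed = os.path.getsize(last_dump) * HEADROOM
    except FileNotFoundError:
        print("no previous metadata.mfs found, skipping check")
        return 0
    free = shutil.disk_usage(MFS_DATA_DIR).free
    if free < needed:
        print(f"WARNING: {free / 2**30:.1f} GiB free in {MFS_DATA_DIR}, "
              f"next dump may need ~{needed / 2**30:.1f} GiB", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())

Note that in this case the changelog (4.8 GB) plus the metadata files already filled most of the 3.9 GB left on /, so such a check would have fired well before the dump failed.
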
From: R.C. <mil...@gm...> - 2018-05-31 10:32:51

It is obviously an indicator of hashes memory allocation proportions.

Original message
From: Gandalf Corvotempesta
Sent: Thursday, 31 May 2018 11:53
To: moo...@li...
Subject: [MooseFS-Users] v4: meaning of "distribution bar"

What's the meaning of the distribution bar in the CGI panel? Any explanation for each colored bar?

From: Gandalf C. <gan...@gm...> - 2018-05-31 09:52:37

What's the meaning of the distribution bar in the CGI panel? Any explanation for each colored bar?

From: Piotr R. K. <pio...@mo...> - 2018-05-30 22:42:26

Dear MooseFS Users,

We are happy to announce that we are coming to Hannover, Germany, as we are exhibiting at CeBIT from 11th June to 15th June 2018 (opening hours: 10am - 6pm)! Come join us and attend the many conferences and festivals held throughout the week.

So, book your tickets for CeBIT, Hannover, Germany and meet us at Hall 012, Stand Number B123, near d!talk. We have limited entry passes to this exciting event, please contact us soon to reserve your ticket!

We hope to meet you there! :)

Best regards,
Peter / MooseFS Team

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com

From: Gandalf C. <gan...@gm...> - 2018-05-30 20:39:30

On Wed, 30 May 2018 at 22:11, R.C. <mil...@gm...> wrote:
>
> Are you able to always reproduce it?

Yes, always:

- kill -9 the leader
- one follower goes in "ELECT"
- start back the killed leader (mfsmaster -a)

The "old leader" is set as "FOLLOWER DESYNC", the elected leader is still in ELECT, the only follower left is still in FOLLOWER.

So, it seems that leader election doesn't work properly.
Currently, all chunkservers are UP.

This is the syslog for the master stuck in ELECT phase:

May 30 22:20:39 cs02 mfsmaster[3061]: connection was reset by Master
May 30 22:20:40 cs02 mfsmaster[3061]: connecting ...
May 30 22:20:40 cs02 mfsmaster[3061]: connection failed, error: ECONNREFUSED (Connection refused)
May 30 22:20:41 cs02 mfsmaster[3061]: state: ELECT ; changed 0 seconds ago ; leaderip: 0.0.0.0
May 30 22:20:45 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:20:47 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.13:9422,2)
May 30 22:20:47 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.13 / port: 9422, usedspace: 15160705024 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:50 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:20:51 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.11:9422,1)
May 30 22:20:51 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.11 / port: 9422, usedspace: 15158607872 (14.12 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:55 cs02 mfsmaster[3061]: csdb: found cs using ip:port and csid (10.200.1.12:9422,3)
May 30 22:20:55 cs02 mfsmaster[3061]: chunkserver register begin (packet version: 6) - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:55 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:00 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:05 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:09 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
May 30 22:21:10 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.11) has been closed by peer
May 30 22:21:14 cs02 mfsmaster[3061]: connection with MASTER(10.200.1.13) has been closed by peer
and so on.....

This is the log for the killed leader:

May 30 22:20:54 cs03 mfsmaster[6525]: open files limit has been set to: 16384
May 30 22:20:54 cs03 mfsmaster[6525]: set gid to 111
May 30 22:20:54 cs03 mfsmaster[6525]: set uid to 107
May 30 22:20:54 cs03 mfsmaster[6525]: out of memory killer disabled
May 30 22:20:54 cs03 mfsmaster[6525]: monotonic clock function: clock_gettime
May 30 22:20:54 cs03 mfsmaster[6525]: monotonic clock speed: 189371 ops / 10 mili seconds
May 30 22:20:54 cs03 mfsmaster[6525]: exports file has been loaded
May 30 22:20:54 cs03 mfsmaster[6525]: topology file has been loaded
May 30 22:20:55 cs03 mfsmaster[6525]: stats file has been loaded
May 30 22:20:55 cs03 mfsmaster[6525]: master <-> metaloggers module: listen on *:9419
May 30 22:20:55 cs03 mfsmaster[6525]: master <-> chunkservers module: listen on *:9420
May 30 22:20:55 cs03 mfsmaster[6525]: main master server module: listen on *:9421
May 30 22:21:00 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:00 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:00 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:05 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:05 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:05 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:10 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:10 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:10 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:15 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:15 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:15 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:20 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:20 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:20 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:25 cs03 mfsmaster[6525]: connecting ...
May 30 22:21:25 cs03 mfsmaster[6525]: connected to Master
May 30 22:21:25 cs03 mfsmaster[6525]: sending to LEADER meta_version: 242578
May 30 22:21:30 cs03 mfsmaster[6525]: connecting ...
and so on..........

Log for the follower:

May 30 22:20:39 cs01 mfsmaster[5207]: connection was reset by Master
May 30 22:20:40 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:40 cs01 mfsmaster[5207]: connection failed, error: ECONNREFUSED (Connection refused)
May 30 22:20:45 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:45 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:45 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:20:50 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:50 cs01 mfsmaster[5207]: chunkserver disconnected - ip: 10.200.1.12 / port: 9422, usedspace: 15166865408 (14.13 GiB), totalspace: 3861174812672 (3596.00 GiB)
May 30 22:20:50 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:50 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:20:55 cs01 mfsmaster[5207]: connecting ...
May 30 22:20:55 cs01 mfsmaster[5207]: connected to Master
May 30 22:20:55 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:00 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:00 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:00 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:05 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:05 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:05 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:10 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:10 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:10 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:15 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:15 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:15 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578
May 30 22:21:20 cs01 mfsmaster[5207]: connecting ...
May 30 22:21:20 cs01 mfsmaster[5207]: connected to Master
May 30 22:21:20 cs01 mfsmaster[5207]: sending to LEADER meta_version: 242578

It seems that the connection to a master is successful, but which master? I don't have any master right now, only an ELECT.

Restarting the ELECT node doesn't help.
Restarting the DESYNC node doesn't help.
Restarting the FOLLOWER node will put that node in DESYNC.
And I'm still without any master.

Now, restarting the ELECT node while the other 2 are in DESYNC demotes the ELECT to DESYNC. Now I'm still without any master, all nodes are in DESYNC. After a while with all nodes in DESYNC, a leader was elected. Now there is one undergoal chunk.

> Could you please describe and report here the complete configuration of
> your sandbox system?

The only changes are:

MASTER_HOST=my.master.domain # this points to all 3 servers
ATIME_MODE=2

On the chunkservers I've set the speed limit to 30 MB/s. Everything else is unchanged.

3 servers; on each server I have both chunkserver and master processes. Each server has 2 HDDs configured (ZFS). There is also another server (in production) used as a client where I'm mounting the MooseFS storage (on this server I can't do any kind of test except changing the MooseFS config; no reboot or similar, it's a production server). Just one gigabit switch for the test, no bonding or anything else.

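For failover tests like the one above, it can help to watch only the HA state transitions instead of the full syslog. A possible helper (not a MooseFS tool; it assumes a Debian-style /var/log/syslog, read access to it, and the "state:" line format shown in this thread):

#!/usr/bin/env python3
# Follow syslog and print only mfsmaster HA state-transition lines, so the
# LEADER/ELECT/FOLLOWER/DESYNC changes are easy to watch during a kill test.
# Hypothetical helper, not part of MooseFS.
import re
import subprocess

SYSLOG = "/var/log/syslog"            # assumption: Debian/Ubuntu syslog path
STATE_RE = re.compile(r"mfsmaster\[\d+\]: state:")

proc = subprocess.Popen(["tail", "-Fn0", SYSLOG],
                        stdout=subprocess.PIPE, text=True)
try:
    for line in proc.stdout:
        if STATE_RE.search(line):
            # e.g. "state: ELECT ; changed 0 seconds ago ; leaderip: 0.0.0.0"
            print(line, end="")
finally:
    proc.terminate()
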
From: R.C. <mil...@gm...> - 2018-05-30 20:11:15

Are you able to always reproduce it?

Could you please describe and report here the complete configuration of your sandbox system?

Thanks

On 30/05/2018 22:05, Gandalf Corvotempesta wrote:
> On Wed, 30 May 2018 at 21:58, Piotr Robert Konopelko <pio...@mo...> wrote:
>> So you got a new leader automatically after killing the previous one, but then something went wrong?
>
> Honestly, I didn't understand exactly what happened. I only know that the leader wasn't elected properly and the only way to get a new leader was to restart all master instances.
>
> What I've seen for sure is the leader lost: a working leader was demoted for an unknown reason. (No, the demoted leader wasn't the one that I killed.)
>
>> Are you sure all Chunkservers were up and running at that time?
>
> Yes.

From: Piotr R. K. <pio...@mo...> - 2018-05-30 20:09:53

> On 30 May 2018, at 10:05 PM, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> On Wed, 30 May 2018 at 21:58, Piotr Robert Konopelko <pio...@mo...> wrote:
>> So you got a new leader automatically after killing the previous one, but then something went wrong?
>
> Honestly, I didn't understand exactly what happened. I only know that the leader wasn't elected properly and the only way to get a new leader was to restart all master instances.
>
> What I've seen for sure is the leader lost: a working leader was demoted for an unknown reason. (No, the demoted leader wasn't the one that I killed.)
>
>> Are you sure all Chunkservers were up and running at that time?
>
> Yes.

Logs from all the servers would be much appreciated; without them I can't tell a lot.

BTW - we are having national holidays in Poland and my further replies may be delayed. The best would be to catch up on Monday and continue the topic.

Thank you,
Best regards,
Peter

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com
GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs>

From: Gandalf C. <gan...@gm...> - 2018-05-30 20:05:53

On Wed, 30 May 2018 at 21:58, Piotr Robert Konopelko <pio...@mo...> wrote:
> So you got a new leader automatically after killing the previous one, but then something went wrong?

Honestly, I didn't understand exactly what happened. I only know that the leader wasn't elected properly and the only way to get a new leader was to restart all master instances.

What I've seen for sure is the leader lost: a working leader was demoted for an unknown reason. (No, the demoted leader wasn't the one that I killed.)

> Are you sure all Chunkservers were up and running at that time?

Yes.

From: Piotr R. K. <pio...@mo...> - 2018-05-30 19:58:17

> On 30 May 2018, at 9:48 PM, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> That's fine. But I was able to break everything.
> I've killed one master, waited for a new leader, then started it back with "mfsmaster -e".

So you got a new leader automatically after killing the previous one, but then something went wrong?

> After this, I lost the leader: one of the nodes was promoted to "ELECT", the restarted master went to DESYNC and the remaining one stayed FOLLOWER. No leader.
> The only way to get a leader was to restart all masters.

Are you sure all Chunkservers were up and running at that time?

Best regards,
Peter

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com

From: Gandalf C. <gan...@gm...> - 2018-05-30 19:48:46

On Wed, 30 May 2018 at 21:40, Piotr Robert Konopelko <pio...@mo...> wrote:
> It is expected behavior. If you have a Master Server failure, you may want to analyse the situation, do a metadata backup etc. before spawning it again.
> In prod environments Master Servers work without restart for months (usually until they are updated to a newer version), and real failure situations are very rare.

That's fine. But I was able to break everything.
I've killed one master, waited for a new leader, then started it back with "mfsmaster -e".
After this, I lost the leader: one of the nodes was promoted to "ELECT", the restarted master went to DESYNC and the remaining one stayed FOLLOWER. No leader.
The only way to get a leader was to restart all masters.

From: Piotr R. K. <pio...@mo...> - 2018-05-30 19:40:44

> On 30 May 2018, at 9:28 PM, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> On Wed, 30 May 2018 at 21:22, Gandalf Corvotempesta <gan...@gm...> wrote:
>> Let's see what happens by killing one master....
>
> It syncs, but only manually.
> If I kill mfsmaster, it doesn't start again; it asks for "mfsmaster -a".
>
> Even after a simple "systemctl restart moosefs-master" it doesn't start again without the "-a".

It is expected behavior. If you have a Master Server failure, you may want to analyse the situation, do a metadata backup etc. before spawning it again.
In prod environments Master Servers work without restart for months (usually until they are updated to a newer version), and real failure situations are very rare.

Best regards,
Peter

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com

From: Gandalf C. <gan...@gm...> - 2018-05-30 19:28:41

On Wed, 30 May 2018 at 21:22, Gandalf Corvotempesta <gan...@gm...> wrote:
> Let's see what happens by killing one master....

It syncs, but only manually.
If I kill mfsmaster, it doesn't start again; it asks for "mfsmaster -a".

Even after a simple "systemctl restart moosefs-master" it doesn't start again without the "-a".

From: Gandalf C. <gan...@gm...> - 2018-05-30 19:22:58

On Wed, 30 May 2018 at 21:15, Piotr Robert Konopelko <pio...@mo...> wrote:
> This is the case.
> Please setup and run Chunkservers, as the minimum required number of CS to work with Masters HA is 3. MooseFS HA is full of protections, e.g. split-brain prevention. It is based on a voting mechanism, so Chunkservers vote on Leader election.
> In order to have a Leader in the cluster, you have to have set up and connected at least an "integer half" + 1 of Chunkservers (so if you have 3, it must be 2, if you have 4 or 5, it must be 3, if you have 6 or 7 it must be 4 and so on).

OK, now it is clear. I thought that leader election was done by the masters, not by the chunkservers.
I've configured 3 chunkservers (using the same nodes as the masters) and now it is synced.

Let's see what happens by killing one master....

From: Piotr R. K. <pio...@mo...> - 2018-05-30 19:15:57

> On 30 May 2018, at 9:04 PM, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> May 30 21:00:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers
>
> Keep in mind that I do not have any chunkservers, yet.

This is the case.
Please setup and run Chunkservers, as the minimum required number of CS to work with Masters HA is 3. MooseFS HA is full of protections, e.g. split-brain prevention. It is based on a voting mechanism, so Chunkservers vote on Leader election.
In order to have a Leader in the cluster, you have to have set up and connected at least an "integer half" + 1 of Chunkservers (so if you have 3, it must be 2, if you have 4 or 5, it must be 3, if you have 6 or 7 it must be 4 and so on).

Best regards,
Peter

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com

From: Gandalf C. <gan...@gm...> - 2018-05-30 19:05:05

On Wed, 30 May 2018 at 20:54, Piotr Robert Konopelko <pio...@mo...> wrote:
> What do Leader and Follower say in /var/log/syslog or /var/log/messages?

The follower is full of these:

May 30 21:01:12 cs01 mfsmaster[1809]: manager task spawn
May 30 21:01:12 cs01 mfsmaster[1809]: child finished
May 30 21:01:22 cs01 mfsmaster[1809]: manager task spawn
May 30 21:01:22 cs01 mfsmaster[1809]: child finished
May 30 21:01:32 cs01 mfsmaster[1809]: manager task spawn
May 30 21:01:32 cs01 mfsmaster[1809]: child finished
May 30 21:01:42 cs01 mfsmaster[1809]: manager task spawn
May 30 21:01:42 cs01 mfsmaster[1809]: child finished
May 30 21:01:52 cs01 mfsmaster[1809]: manager task spawn
May 30 21:01:52 cs01 mfsmaster[1809]: child finished
May 30 21:02:02 cs01 mfsmaster[1809]: manager task spawn
May 30 21:02:02 cs01 mfsmaster[1809]: child finished

The leader only these:

May 30 20:56:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers
May 30 20:57:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers
May 30 20:58:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers
May 30 20:59:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers
May 30 21:00:00 cs03 mfsmaster[10665]: supervising schedule hasn't been sent due to lack of valid chunkservers

> Is /etc/mfs/mfsmaster.cfg consistent on all the servers?

Yes

> Are all of them able to resolve MASTER_HOST properly to 3 addresses?

Yes

> Are ports 9419-9422 open between them?

Yes, no firewall set.

> You can also try to stop Followers and start them with the "-e" parameter:
> mfsmaster -e
> in order to start "empty" Masters and let them download metadata from the Leader.

Already done; it starts with an empty "metadata id" but it doesn't sync.

Keep in mind that I do not have any chunkservers, yet.

From: Piotr R. K. <pio...@mo...> - 2018-05-30 18:54:47

> On 30 May 2018, at 7:54 PM, Gandalf Corvotempesta <gan...@gm...> wrote:
>
> I've attached an image.
> Pretty strange: metadata version mismatch, but the "metadata id" are identical and the follower is in "DESYNC" state.
>
> On Wed, 30 May 2018 at 18:48, Gandalf Corvotempesta <gan...@gm...> wrote:
>> On Wed, 30 May 2018 at 18:45, Gandalf Corvotempesta <gan...@gm...> wrote:
>>> So, the only configuration is to create a DNS record pointing to all master IPs, then set "MASTER_HOST" on all services pointing to this DNS record.
>>
>> Another issue: after rebooting one of the follower servers, the master process is now respawning and stuck at "FOLLOWER (DESYNC)".
>> Should I manually run something after the reboot? Is it possible to automate the resync?
>
> <screen.jpg>

What do Leader and Follower say in /var/log/syslog or /var/log/messages?
Is /etc/mfs/mfsmaster.cfg consistent on all the servers?
Are all of them able to resolve MASTER_HOST properly to 3 addresses?
Are ports 9419-9422 open between them?

Please let me know.

You can also try to stop Followers and start them with the "-e" parameter:

mfsmaster -e

in order to start "empty" Masters and let them download metadata from the Leader.

Best regards,
Peter

--
Piotr Robert Konopelko | mobile: +48 601 476 440
MooseFS Client Support Team | moosefs.com

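A rough sketch of how the DNS and port checks from the list above could be scripted (illustrative only; the MASTER_HOST name, the expected master count and the checked ports are assumptions taken from this thread - 9419-9421 are the master's listening ports, while 9422 is the chunkserver's own port):

#!/usr/bin/env python3
# Hypothetical sanity check, not a MooseFS tool: does MASTER_HOST resolve to
# the expected number of masters, and are the master ports reachable on each?
import socket

MASTER_HOST = "mfsmaster"          # assumption: the name used as MASTER_HOST
EXPECTED_MASTERS = 3               # assumption: three master servers, as in this thread
PORTS = (9419, 9420, 9421)         # master's metalogger / chunkserver / client ports

ips = sorted({ai[4][0] for ai in socket.getaddrinfo(MASTER_HOST, None, socket.AF_INET)})
print(f"{MASTER_HOST} resolves to: {', '.join(ips)}")
if len(ips) != EXPECTED_MASTERS:
    print(f"WARNING: expected {EXPECTED_MASTERS} A records, got {len(ips)}")

for ip in ips:
    for port in PORTS:
        try:
            with socket.create_connection((ip, port), timeout=2):
                print(f"{ip}:{port} reachable")
        except OSError as exc:
            print(f"{ip}:{port} NOT reachable ({exc})")
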
From: Ricardo J. B. <ric...@do...> - 2018-05-30 18:00:02

On Wednesday 30/05/2018 at 13:48, Gandalf Corvotempesta wrote:
> On Wed, 30 May 2018 at 18:48, Piotr Robert Konopelko <pio...@mo...> wrote:
>> Use "vim /usr/share/mfscgi/index.html" on every host where MFS CGI is installed :)
>>
>> Then clear browser's cache
>
> I thought there was a configuration file somewhere :)

BTW, I just realized you're talking about MFS v4. I assume it'll be similar to v3 and older, which is what we're using.

Cheers,
--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com

From: Ricardo J. B. <ric...@do...> - 2018-05-30 17:59:52

On Wednesday 30/05/2018 at 13:48, Gandalf Corvotempesta wrote:
> On Wed, 30 May 2018 at 18:48, Piotr Robert Konopelko <pio...@mo...> wrote:
>> Use "vim /usr/share/mfscgi/index.html" on every host where MFS CGI is installed :)
>>
>> Then clear browser's cache
>
> I thought there was a configuration file somewhere :)

We have a separate webserver with the CGI (slightly tweaked) to check all of our clusters in one place, and we pass mfsmaster's name in the GET query to the CGI, e.g.:

http://mfscgi.lan/cgi-bin/mfs3/mfs.cgi?masterhost=10.200.1.11&mastername=mfsmaster01

IIRC, masterhost is where to connect to and mastername is what shows up in the web browser's title.

Of course, on this webserver we have an index.html with links to those personalized URLs, we don't type them by hand every time :)

Cheers,
--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com

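A minimal sketch of how such an index.html could be generated from a list of clusters (the CGI base URL and the cluster entries below are placeholders; only the masterhost/mastername query parameters come from the URL shown above):

#!/usr/bin/env python3
# Sketch: build an index.html with one link per cluster for a shared MFS CGI
# host, using the masterhost/mastername query parameters described above.
# The base URL and cluster list are placeholders, not real infrastructure.
from urllib.parse import urlencode

CGI_BASE = "http://mfscgi.lan/cgi-bin/mfs3/mfs.cgi"   # placeholder
CLUSTERS = {                                           # placeholder: name -> master address
    "mfsmaster01": "10.200.1.11",
    "mfsmaster02": "10.200.1.12",
}

links = []
for name, host in sorted(CLUSTERS.items()):
    query = urlencode({"masterhost": host, "mastername": name})
    links.append(f'<li><a href="{CGI_BASE}?{query}">{name}</a></li>')

with open("index.html", "w") as f:
    f.write("<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n")
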