From: Markus K. <mar...@tu...> - 2020-10-14 10:54:25
On Thursday, 10 September 2020 07:51:01 CEST Markus Köberl wrote:
> On Tuesday, 8 September 2020 09:04:50 CEST Agata Kruszona-Zawadzka wrote:
> > Yes, that's exactly it. By "some chunks" I meant that not every chunk is
> > able to trigger the problem, but once it happens so that enough chunks
> > do, then, due to operation limits (specifically deletions limit in this
> > instance), the system won't attempt to delete any more chunks.
>
> Thank you for confirming that it is the same problem and the good explanation.

I can confirm that all our problems are resolved with version 3.0.115. Thank you for all the good work!

regards
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: mar...@tu...
From: Ricardo J. B. <ric...@do...> - 2020-09-30 16:53:25
OK, mfsmaster started, mentioning:

loading edge: 2502428->5211944 error: empty name
loading edge: 2502428->5211944 empty filename replaced by %28empty 5211944%29
ok (4.3644)

Then I started one mfschunkserver (previously moving .chunkdb and .metaid out of the way) and it scanned fine.

Finally, I was able to mount it locally! \o/ Now to continue restoring files to the new cluster :)

Thanks again!

PS: in order to compile moosefs on a CentOS 7.8 minimal installation I had to install these additional packages: autoconf automake libtool

PPS: full output of mfsmaster:

# mfsmaster -a -i -c /etc/mfs/mfsmaster.cfg
open files limit has been set to: 16384
working directory: /mnt/mailmfs/mfs
lockfile created and locked
initializing mfsmaster modules ...
exports file has been loaded
topology file has been loaded
write replication limit in old format - change limits to new format
read replication limit in old format - change limits to new format
loading metadata ...
loading sessions data ... ok (0.0000)
loading storage classes data ... ok (0.0000)
loading objects (files,directories,etc.) ... ok (0.6527)
loading names ...
loading edge: 2502428->5211944 error: empty name
loading edge: 2502428->5211944 empty filename replaced by %28empty 5211944%29
ok (4.3644)
loading deletion timestamps ... ok (0.0046)
loading quota definitions ... ok (0.0004)
loading xattr data ... ok (0.0000)
loading posix_acl data ... ok (0.0000)
loading open files data ... ok (0.0014)
loading flock_locks data ... ok (0.0000)
loading posix_locks data ... ok (0.0000)
loading chunkservers data ... ok (0.0000)
loading chunks data ... ok (0.2275)
checking filesystem consistency ... ok
connecting files and chunks ... ok
all inodes: 12285650
directory inodes: 511090
file inodes: 11774560
chunks: 11668123
metadata file has been loaded
stats file has been loaded
master <-> metaloggers module: listen on *:9419
master <-> chunkservers module: listen on *:9420
main master server module: listen on *:9421
mfsmaster daemon initialized properly

On Wednesday, 30/09/2020 at 12:12, Piotr Robert Konopelko wrote:
> That's great, thank you for the reply. Please definitely try this patch and
> let us know.
> Looking forward to hearing from you.

--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com
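For anyone repeating the steps above, the build prerequisites and the local-mount check can be scripted roughly as follows. This is only a sketch based on this thread: the mount point, the master host name, and the extra -devel packages are assumptions, not a verified list.

# build prerequisites on a minimal CentOS 7.8 host (autoconf, automake and
# libtool were reported above; a compiler plus FUSE/zlib headers are assumed)
yum install -y gcc make autoconf automake libtool fuse-devel zlib-devel

# once the recovered mfsmaster is running, verify the tree by mounting it locally
# (placeholder mount point and master host name)
mkdir -p /mnt/mfs-check
mfsmount /mnt/mfs-check -H mfsmaster.example.com
ls /mnt/mfs-check | head
umount /mnt/mfs-check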
From: Piotr R. K. <pio...@mo...> - 2020-09-30 15:12:38
That's great, thank you for the reply. Please definitely try this patch and let us know.
Looking forward to hearing from you.

Best regards,
Piotr

*Piotr Robert Konopelko* | m: +48 601 476 440 | e: pio...@mo...
*Business & Technical Support Manager*
MooseFS Client Support Team
WWW <https://moosefs.com> | GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs>

On Wed, Sep 30, 2020 at 4:51 PM Ricardo J. Barberis <ric...@do...> wrote:
> Cool, we're in the process of restoring from backups to a new mfs cluster but
> we'll reconfigure the old cluster to put it online and try this patch.
From: Ricardo J. B. <ric...@do...> - 2020-09-30 14:51:54
Cool, we're in the process of restoring from backups to a new mfs cluster, but we'll reconfigure the old cluster to put it online and try this patch.

It'll hopefully allow us to recover whatever we don't have in the backups (which are fairly recent, but since these are mailboxes we're talking about, everything helps).

I'll let you know how the process goes.

Thank you!

On Wednesday, 30/09/2020 at 10:16, Piotr Robert Konopelko wrote:
> Sorry, I forgot to add – in order to build MooseFS from sources, you need a
> few dependencies.
> Please install them as described here:
> https://github.com/moosefs/moosefs#source-code.

--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com
From: Piotr R. K. <pio...@mo...> - 2020-09-30 13:52:16
Hello Ricardo,

After your report, we have updated the Master Server code today to let the "*-i*" parameter ignore empty file paths' (edges') names and substitute them with "*(empty <inode_number>)*" string, without quotes and lower / greater than marks of course.

Please have a look at the following commit:
https://github.com/moosefs/moosefs/commit/886ea4a703afce1b40e4853ce02101a8c43829f3

It will be included in the nearest MooseFS 3.0.115 release.

In order to get this feature before release, please clone the MooseFS Git repository from GitHub: https://github.com/moosefs/moosefs, build binaries by running *./linux_build.sh* script inside MooseFS Git repository directory (this script doesn't run "*make install*", so you can just copy (replace) "*mfsmaster*" executable binary file to "*/usr/sbin/mfsmaster*") and use this newly built "*mfsmaster*" binary (don't forget about passing "*-a*" and "*-i*" parameters to it). It should be able to load your metadata and substitute empty names with the above mentioned string.

Please let me know if it worked for you and do not hesitate to contact us if you have any questions.

Best regards,
Piotr

*Piotr Robert Konopelko* | m: +48 601 476 440 | e: pio...@mo...
*Business & Technical Support Manager*
MooseFS Client Support Team
WWW <https://moosefs.com> | GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs>
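Put together as commands, the procedure above looks roughly like this. Treat it as a sketch: the path of the built binary inside the source tree is an assumption, so adjust it to wherever ./linux_build.sh leaves mfsmaster on your system.

# clone and build (linux_build.sh does not run "make install")
git clone https://github.com/moosefs/moosefs.git
cd moosefs
./linux_build.sh

# keep a copy of the packaged binary, then replace it with the patched build
# (the in-tree location of the built binary is assumed here)
cp /usr/sbin/mfsmaster /usr/sbin/mfsmaster.orig
cp mfsmaster/mfsmaster /usr/sbin/mfsmaster

# retry metadata recovery, substituting empty edge names
mfsmaster -a -i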
From: Piotr R. K. <pio...@mo...> - 2020-09-30 13:17:10
Sorry, I forgot to add – in order to build MooseFS from sources, you need a few dependencies.
Please install them as described here: https://github.com/moosefs/moosefs#source-code.

Piotr

*Piotr Robert Konopelko* | m: +48 601 476 440 | e: pio...@mo...
*Business & Technical Support Manager*
MooseFS Client Support Team
WWW <https://moosefs.com> | GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs>

On Wed, Sep 30, 2020 at 3:00 PM Piotr Robert Konopelko <pio...@mo...> wrote:
> Hello Ricardo,
>
> After your report, we have updated the Master Server code today to let the
> "*-i*" parameter ignore empty file paths' (edges') names and substitute
> them with "*(empty <inode_number>)*" string, without quotes and lower /
> greater than marks of course.
From: Ricardo J. B. <ric...@do...> - 2020-09-30 00:34:31
On Tuesday, 29/09/2020 at 20:55, Piotr Robert Konopelko wrote:
> Hello Ricardo,
>
> 1) Have you done a backup of /var/lib/mfs after the failure? If not, please
> do so, just in case.

Yes, I have a backup post-crash, before attempting to recover.

> 2) Have you tried the following sequence:
>
> - Stopping the Metalogger,
> - Copying all the metadata_ml* files and changelog_ml* files from
>   Metalogger's /var/lib/mfs directory to Master Server's /var/lib/mfs
>   directory,
> - Starting the Master server with "-a" parameter that makes it try to
>   recover the metadata:
>
> *mfsmaster -a*
>
> ?

Yes, I tried this sequence in the master server and also in the metalogger, copying the files to a new directory (so I also got a backup of the files from the metalogger).

> The above sequence is usually the best way to recover the metadata.

I also tried 'mfsmaster -a -i' to no avail.

BTW, all of this is in my original email, sorry if I wasn't very clear.

--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com
From: Piotr R. K. <pio...@mo...> - 2020-09-30 00:19:47
Hello Ricardo,

1) Have you done a backup of /var/lib/mfs after the failure? If not, please do so, just in case.

2) Have you tried the following sequence:

- Stopping the Metalogger,
- Copying all the metadata_ml* files and changelog_ml* files from Metalogger's /var/lib/mfs directory to Master Server's /var/lib/mfs directory,
- Starting the Master server with "-a" parameter that makes it try to recover the metadata:

*mfsmaster -a*

?

The above sequence is usually the best way to recover the metadata.

Best regards,
Piotr

*Piotr Robert Konopelko* | m: +48 601 476 440 | e: pio...@mo...
*Business & Technical Support Manager*
MooseFS Client Support Team
WWW <https://moosefs.com> | GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs>

On Wed, Sep 30, 2020 at 1:46 AM Ricardo J. Barberis <ric...@do...> wrote:
> Hi all,
>
> My mfsmaster crashed today and when trying to start it I get this error:
>
> loading edge: 2502428->5211944 error: empty name
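Expressed as commands, the sequence above looks roughly like this, assuming the default /var/lib/mfs data directory on both machines; "master" stands for the Master Server's host name and is only a placeholder.

# on the Metalogger host: stop the metalogger
mfsmetalogger stop

# copy its metadata and changelog files to the Master Server's data directory
scp /var/lib/mfs/metadata_ml* /var/lib/mfs/changelog_ml* master:/var/lib/mfs/

# on the Master Server: attempt automatic metadata recovery
mfsmaster -a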
From: Ricardo J. B. <ric...@do...> - 2020-09-29 23:45:24
Hi all,

My mfsmaster crashed today and when trying to start it I get this error:

loading edge: 2502428->5211944 error: empty name

I tried with metadata.mfs.back, metadata.mfs.back.1 from master, and also metadata_ml.mfs.back and metadata_ml.mfs.back.1 from metalogger; all of them fail with the same error.

My mfsmaster was 3.0.100 when it crashed, I upgraded to 3.0.114 but it's the same:

# mfsmaster -v
version: 3.0.114-1 ; build: 1257

Any hints to solve this much appreciated, as this is a production cluster.

I have dumped all the metadata files and all of them have this:

# mfsmetadump metadata.mfs.back > metadata.mfs.dump
# grep 2502428 metadata.mfs.dump | grep -1 5211944
EDGE|p: 2502428|c: 5154814|i:0x7FFFFFFFF92D4A30|n:1519914956.H419257P2952.c106-dr.dattaweb.com:2,S
EDGE|p: 2502428|c: 5211944|i:0x000000095F864814|n:
EDGE|p: 2502428|c: 2920292|i:0x7FFFFFFFF947F4F2|n:1519809696.H296554P8952.c105-dr.dattaweb.com:2,S

Full output of 'mfsmaster -xx -a -i -c /etc/mfs/mfsmaster.cfg':

open files limit has been set to: 16384
working directory: /mnt/mailmfs/mfs
lockfile created and locked
initializing mfsmaster modules ...
exports file has been loaded
topology file has been loaded
write replication limit in old format - change limits to new format
read replication limit in old format - change limits to new format
loading metadata ...
found valid metadata file: metadata.mfs.back.1 (version: 16411110888 ; id: 59AEE396E5130F20)
found invalid metadata file (wrong header): metadata.crc
found valid metadata file: metadata.mfs.back (version: 16412023641 ; id: 59AEE396E5130F20)
chosen most recent metadata file: metadata.mfs.back (version: 16412023641 ; id: 59AEE396E5130F20)
loading sessions data ... ok (0.0000)
loading storage classes data ... ok (0.0000)
loading objects (files,directories,etc.) ... ok (1.1999)
loading names ...
loading edge: 2502428->5211944 error: empty name
cleaning metadata ...
cleaning objects ... done
cleaning names ... done
cleaning deletion timestamps ... done
cleaning quota definitions ... done
cleaning chunks data ...done
cleaning xattr data ...done
cleaning posix_acl data ...done
cleaning flock locks data ...done
cleaning posix locks data ...done
cleaning chunkservers data ...done
cleaning open files data ...done
cleaning sessions data ...done
cleaning storage classes data ...done
cleaning dictionary data ...done
metadata have been cleaned
error loading metadata file (metadata.mfs.back): ENOENT (No such file or directory)
init: metadata manager failed !!!
error occurred during initialization - exiting

--
Ricardo J. Barberis
Senior SysAdmin / IT Architect
DonWeb
La Actitud Es Todo
www.DonWeb.com
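If the offending inode numbers are not already known from the master's log, the dump can also be scanned for every edge whose name field is empty. This sketch assumes the EDGE record layout shown above, with the name as the final |n: field.

# dump the metadata and list every edge record with an empty name field
mfsmetadump metadata.mfs.back > metadata.mfs.dump
grep -E '^EDGE\|.*\|n:$' metadata.mfs.dump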
From: Markus K. <mar...@tu...> - 2020-09-10 05:51:14
|
On Tuesday, 8 September 2020 09:04:50 CEST Agata Kruszona-Zawadzka wrote: > W dniu 07.09.2020 o 15:17, Markus Köberl pisze: > > On Monday, 7 September 2020 12:19:42 CEST Agata Kruszona-Zawadzka wrote: > >> > >> W dniu 04.09.2020 o 15:22, Markus Köberl pisze: > >>> Since some time (last few versions of MooseFS) on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 till I restart the affected chunkserver. > >>> On the webinterface i see huge numbers for overgoal (even 4 extra copies). After the restart of the chunkserver the overgoal goes down but starts growing again after some time. > >> > >> We have an issue in MooseFS currently, where on disks with I/O errors in > >> certain circustances some chunks get locked and cannot be deleted until > >> the whole chunk server process is restarted. We introduced a fix for > >> that, it's gonna be available in version 3.0.115. The issue does not > >> affect disks without I/O errors. > > > > Thanks good to hear that a fix might be on the way. > > > > Could it be that instead of "some chunks get locked and cannot be deleted" that there a no deletes at all on this chunk server, or might that be a different problem? > > Yes, that's exactly it. By "some chunks" I meant that not every chunk is > able to trigger the problem, but once it happens so that enough chunks > do, then, due to operation limits (specifically deletions limit in this > instance), the system won't attempt to delete any more chunks. Thank you for confirming that it is the same problem and the good explanation. regards Markus Köberl -- Markus Koeberl Graz University of Technology Signal Processing and Speech Communication Laboratory E-mail: mar...@tu... |
From: Agata Kruszona-Z. <ch...@mo...> - 2020-09-08 07:05:13
|
W dniu 07.09.2020 o 15:17, Markus Köberl pisze: > On Monday, 7 September 2020 12:19:42 CEST Agata Kruszona-Zawadzka wrote: >> >> W dniu 04.09.2020 o 15:22, Markus Köberl pisze: >>> For some time now (the last few versions of MooseFS) on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 until I restart the affected chunkserver. >>> On the web interface I see huge numbers for overgoal (even 4 extra copies). After the restart of the chunkserver the overgoal goes down but starts growing again after some time. >> >> We have an issue in MooseFS currently, where on disks with I/O errors in >> certain circumstances some chunks get locked and cannot be deleted until >> the whole chunk server process is restarted. We introduced a fix for >> that, it's going to be available in version 3.0.115. The issue does not >> affect disks without I/O errors. > > Thanks, good to hear that a fix might be on the way. > > Could it be that instead of "some chunks get locked and cannot be deleted" there are no deletes at all on this chunk server, or might that be a different problem? Yes, that's exactly it. By "some chunks" I meant that not every chunk is able to trigger the problem, but once enough chunks do, then, due to operation limits (specifically the deletions limit in this instance), the system won't attempt to delete any more chunks. -- Regards, Agata Kruszona-Zawadzka MooseFS Team |
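A side note on the deletion limit mentioned here: in MooseFS 3.x the per-loop deletion limits are set in mfsmaster.cfg. The excerpt below is only a hedged sketch — the option names CHUNKS_SOFT_DEL_LIMIT and CHUNKS_HARD_DEL_LIMIT and the values are assumptions based on a typical 3.0 configuration (check mfsmaster.cfg(5) for your version), and raising them would not work around the locked-chunk bug itself; only the fix in 3.0.115 addresses that:

  # /etc/mfs/mfsmaster.cfg (excerpt, assumed option names and example values)
  # soft limit on chunk deletions per chunkserver per loop
  CHUNKS_SOFT_DEL_LIMIT = 10
  # hard limit on chunk deletions per chunkserver per loop
  CHUNKS_HARD_DEL_LIMIT = 25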
From: Markus K. <mar...@tu...> - 2020-09-07 13:17:36
|
On Monday, 7 September 2020 12:19:42 CEST Agata Kruszona-Zawadzka wrote: > > W dniu 04.09.2020 o 15:22, Markus Köberl pisze: > > For some time now (the last few versions of MooseFS) on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 until I restart the affected chunkserver. > > On the web interface I see huge numbers for overgoal (even 4 extra copies). After the restart of the chunkserver the overgoal goes down but starts growing again after some time. > > We have an issue in MooseFS currently, where on disks with I/O errors in > certain circumstances some chunks get locked and cannot be deleted until > the whole chunk server process is restarted. We introduced a fix for > that, it's going to be available in version 3.0.115. The issue does not > affect disks without I/O errors. Thanks, good to hear that a fix might be on the way. Could it be that instead of "some chunks get locked and cannot be deleted" there are no deletes at all on this chunk server, or might that be a different problem? The reason I ask is that on Friday I restarted some of our chunk servers so that everything got back into balance. Over the weekend one chunk server crashed. Because the combined disk space in my chunk servers varies between 2TiB and 25TiB, during rebalancing the used disk space varies for some time. But it usually got close to zero again after some time. I rebooted the crashed chunk server today. But on a few chunk servers (not related to the number and size of the installed disks) there are no deletes happening at all. On the "Server Charts" tab the diagrams for "number of chunk deletions per minute" are completely empty for 7 of my 35 chunk servers, while all the others had heavy activity for about 2 hours. Now the used disk space varies from avg+1.22 up to avg+12.6% on those 7 hosts, while all the others are in balance. There are no deletions in any diagram for the past 45 minutes. regards Markus Köberl -- Markus Koeberl Graz University of Technology Signal Processing and Speech Communication Laboratory E-mail: mar...@tu... |
From: Agata Kruszona-Z. <ch...@mo...> - 2020-09-07 10:47:14
|
W dniu 04.09.2020 o 15:22, Markus Köberl pisze: > For some time now (the last few versions of MooseFS) on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 until I restart the affected chunkserver. > On the web interface I see huge numbers for overgoal (even 4 extra copies). After the restart of the chunkserver the overgoal goes down but starts growing again after some time. We have an issue in MooseFS currently, where on disks with I/O errors in certain circumstances some chunks get locked and cannot be deleted until the whole chunk server process is restarted. We introduced a fix for that, it's going to be available in version 3.0.115. The issue does not affect disks without I/O errors. -- Regards, Agata Kruszona-Zawadzka MooseFS Team |
From: Markus K. <mar...@tu...> - 2020-09-04 13:40:58
|
For some time now (the last few versions of MooseFS) on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 until I restart the affected chunkserver. On the web interface I see huge numbers for overgoal (even 4 extra copies). After the restart of the chunkserver the overgoal goes down but starts growing again after some time. Right at the moment I have only one chunkserver with average+1.2, but the number for overgoal is 20% of the stable count. I have been running MooseFS for many years and never encountered this kind of problem. Master and chunkservers are all up to date with the newest MFS version. In the log file of a chunkserver I see: Sep 3 01:15:41 ravenriad mfschunkserver[11969]: replicator,read chunks: got status: IO error from (172.16.140.244:24CE) And on chunkserver cfsh11 (= 172.16.140.244) at the same time: Sep 3 01:15:41 cfsh11 mfschunkserver[713]: chunk_readcrc: file:/srv/MooseFS6//C0/chunk_0000000009E27A36_00000001.mfs - wrong id/version in header (0000000009E27A36_00000000) Sep 3 01:15:41 cfsh11 mfschunkserver[713]: hdd_io_begin: file:/srv/MooseFS6//C0/chunk_0000000009E27A36_00000001.mfs - read error: Success (errno=0) Sep 3 01:16:24 cfsh11 mfschunkserver[713]: chunk_readcrc: file:/srv/MooseFS3//39/chunk_0000000014D7E46B_00000001.mfs - wrong id/version in header (0000000014D7E46B_00000000) Sep 3 01:16:24 cfsh11 mfschunkserver[713]: hdd_io_begin: file:/srv/MooseFS3//39/chunk_0000000014D7E46B_00000001.mfs - read error: Success (errno=0) On master I also see log entries like: Sep 4 10:54:21 cfshm1 mfsmaster[7386]: (172.16.140.140:9422 -> 172.16.140.106:9422) chunk: 0000000009524179 replication status: IO error Sep 4 10:54:42 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.52:9422) chunk: 0000000004B6062C replication status: IO error Sep 4 10:54:42 cfshm1 mfsmaster[7386]: got replication status from one server, but another is set as busy !!! Sep 4 10:54:43 cfshm1 mfsmaster[7386]: chunk 0000000004B6062C_00000001: unexpected BUSY copies - fixing Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from one server, but another is set as busy !!! Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from server not set as busy !!! Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from server which had had that chunk before (chunk:0000000004B6062C_00000001) Sep 4 10:54:43 cfshm1 mfsmaster[7386]: chunk 0000000004B6062C_00000001: unexpected BUSY copies - fixing Sep 4 10:55:56 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.89:9422) chunk: 00000000113406F1 replication status: No such chunk Sep 4 10:59:16 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.186:9422) chunk: 0000000012DAB6CA replication status: IO error I am not aware of any changes I might have made. From netdata I get warnings regarding inbound packets dropped, fifo errors and tcp handshake resets sent or received.
I also got these warnings before the problems with overgoal began. In my configuration I have net.core.rmem_max=26214400 and net.ipv4.tcp_challenge_ack_limit=999999999, otherwise it should be the Debian buster defaults. The setup is for a lab at our university with 35 chunkservers distributed over several buildings on the campus network. About one third are old workstations running only a chunkserver, with 6-8 HDs. The others are cluster nodes (6 HDs) or workstations (1-4 HDs) with users also working on the graphical interface. Therefore not all switches involved are data-center switches, and the network cabling might date from 1998, when some of the buildings were built. Also it is often the case that the hosts are under heavy load or high memory pressure. I would be very happy for ideas about what might cause this new behavior (within the last half year) or how to eliminate it again. regards Markus Köberl -- Markus Koeberl Graz University of Technology Signal Processing and Speech Communication Laboratory E-mail: mar...@tu... |
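For readers who want to reproduce the checks mentioned above (the netdata warnings about dropped inbound packets and the two sysctls), a minimal shell sketch follows; the interface name eth0 is a placeholder, and the sysctl values are simply the ones quoted in the message, not recommendations:

  # current values of the two sysctls mentioned above
  sysctl net.core.rmem_max net.ipv4.tcp_challenge_ack_limit

  # per-interface drop/error counters (replace eth0 with the actual NIC)
  ip -s link show dev eth0
  ethtool -S eth0 | grep -Ei 'drop|fifo|error'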
From: wkmail <wk...@bn...> - 2020-09-03 04:21:46
|
There *was* something going on with 10.166.0.26. There were three drives originally there and we had added a 4th drive (along with a brand new 10.166.0.27) due to the impending space issue. A few hours after my initial note I noticed that the original three drives on 10.166.0.26 went from 90% full to 99% full during replication. I was worried that running out of space on those drives would cause an issue, so I put them into deprecated mode (i.e. */mfsmountA, etc.). The errors on both 10.166.0.26 and 10.166.0.27 immediately went away and the replication is proceeding as it normally would on the other nodes. This weekend is a holiday in the US, so I will have the opportunity to shut down and update the master to a proper version. MooseFS is so good that it's easy to become lazy about such things. Thx -wk On 9/2/2020 2:21 AM, Aleksander Wieliczko wrote: > It looks like you chunk servers: 10.166.0.26 and 10.166.0.27 have some > serious problems: network and disks problems "IO ERROR". > Are you sure that they have enough resources - I mean they are not > swapping or they don't have any network problems? > > Also at the moment, it looks like your cluster is running with not > allowed configuration! > MooseFS master should always be higher or equal in version to other > MooseFS components(chunk servers, clients, meta loggers). > Right now your master server is running in version 3.0105 - this can > lead to weird cluster behavior. > > Please update all components to the same version first. > > Best regards, > > Aleksander Wieliczko > System Engineer > MooseFS Development & Support Team | moosefs.pro <http://moosefs.pro> > > > wt., 1 wrz 2020 o 23:36 WK <wk...@bn... > <mailto:wk...@bn...>> napisał(a): > > Forgot to show the master logs > > > Getting tons of these. All error to the two new systems.
> > Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000021490FB replication > status: Disconnected > Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000016A270 replication > status: Disconnected > Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000014B7ACC replication > status: Disconnected > Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 > <http://10.166.0.21:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000002161D0F replication > status: Disconnected > Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000002A71B0 replication > status: Disconnected > Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000001FDF21 replication > status: Disconnected > Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001EB5CE2 replication > status: Disconnected > Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000002132B93 replication > status: Disconnected > Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 > <http://10.166.0.23:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001CD13D5 replication > status: Disconnected > Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000827F86 replication > status: Disconnected > Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000867B8E replication > status: Disconnected > Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000803F97 replication > status: Disconnected > Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000020A5306 replication > status: Disconnected > Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000133697 replication > status: Disconnected > Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000007B7FCF replication > status: Disconnected > Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000002966E3 replication > status: Disconnected > Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 > <http://10.166.0.22:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001C96E65 replication > status: Disconnected > Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 
000000000213C6EA replication > status: Disconnected > Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 > <http://10.166.0.22:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000021651E6 replication > status: Disconnected > Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000002150C8F replication > status: Disconnected > Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 > <http://10.166.0.25:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000021070A8 replication > status: Disconnected > Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 > <http://10.166.0.23:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001CAB31F replication > status: Disconnected > Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 000000000004BE69 replication > status: IO error > Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000049C99 replication > status: Disconnected > Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000001E3DA57 replication > status: Disconnected > Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000014388A replication > status: Disconnected > Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000160E1AA replication > status: Disconnected > Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 > <http://10.166.0.23:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000021656B1 replication > status: Disconnected > Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 > <http://10.166.0.22:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001F20BAD replication > status: Disconnected > Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000D03303 replication > status: Disconnected > Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000215CF9C replication > status: Disconnected > Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 > <http://10.166.0.22:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 000000000213C207 replication > status: Disconnected > Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000019A2E00 replication > status: Disconnected > Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000001F75859 replication > status: Disconnected > Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000000534F96 replication > status: Disconnected > Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 
10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000001C69388 replication > status: Disconnected > Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000001F3B4C2 replication > status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 > <http://10.166.0.25:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000002164F77 replication > status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 > <http://10.166.0.21:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000020A8CF7 replication > status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000020F4BCD replication > status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 > <http://10.166.0.25:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 00000000021509FF replication > status: Disconnected > Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000025AB13 replication > status: Disconnected > Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 000000000213C226 replication > status: Disconnected > Sep 1 14:21:09 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 0000000002151A03 replication > status: Disconnected > Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 > <http://10.166.0.24:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000002165995 replication > status: Disconnected > Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 > <http://10.166.0.21:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000001C7CD90 replication > status: Disconnected > Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 > <http://10.166.0.26:9422> -> 10.166.0.27:9422 > <http://10.166.0.27:9422>) chunk: 00000000021615A3 replication > status: Disconnected > Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 > <http://10.166.0.21:9422> -> 10.166.0.26:9422 > <http://10.166.0.26:9422>) chunk: 0000000002161C17 replication > status: Disconnected > > > On 9/1/2020 2:19 PM, WK wrote: >> >> I just added two new chunkservers to an existing cluster. 
>> >> I am seeing lots of these >> >> Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: >> receive timed out >> >> >> and sometimes the master throws it out completely >> >> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39. >> Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice >> of mfsmaster. >> Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop >> detected (23.797745s) >> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection >> was reset by Master >> Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing >> connection with master >> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ... >> Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to >> Master >> Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: >> connection lost >> >> All machines are running CentOS7 >> >> However there is a mix of MFS versions. >> >> The master is running 3.0.105 >> >> and the chunkservers are running various versions. 
>> >> >> 1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422> >> 1486939 9.8 TiB 11 TiB >> 90.49 >> - 0 0 B 0 B >> - >> 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422> >> 2313507 14 TiB 15 TiB >> 90.49 >> - 0 0 B 0 B >> - >> 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422> >> 2154089 13 TiB 14 TiB >> 90.49 >> - 0 0 B 0 B >> - >> 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422> >> 2900774 16 TiB 18 TiB >> 90.49 >> - 0 0 B 0 B >> - >> 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422> >> 2319192 13 TiB 14 TiB >> 90.49 >> - 0 0 B 0 B >> - >> 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422> >> 2131090 13 TiB 18 TiB >> 70.56 >> - 0 0 B 0 B >> - >> 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : >> switch on >> <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422> >> 7196 47 GiB 18 TiB >> 0.25 >> - 0 0 B 0 B >> - >> >> Only the two newest units #6 and #7 are having problems and they >> are running the latest MFS version. They were added due to the >> 90% disk space issue, so there is a lot of rebalancing going on. >> >> I assumed the problem a mismatch between the 3.0.105 master and >> the new version but #2 is also running 3.0.114 and is not having >> problem (though it does have an older kernel) >> >> the networking appears fine (iperf runs at 1GB) no errors in >> dmesg etc. >> >> I will be scheduling some downtime to bring the master up to date >> shortly but I'm interested if anybody else is having this problem >> >> -wk >> >> >> > _________________________________________ > moosefs-users mailing list > moo...@li... > <mailto:moo...@li...> > https://lists.sourceforge.net/lists/listinfo/moosefs-users > |
From: Agata Kruszona-Z. <ch...@mo...> - 2020-09-02 09:46:22
|
W dniu 01.09.2020 o 23:19, WK pisze: > I just added two new chunkservers to an existing cluster. > > I am seeing lots of these > > Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive > timed out > > > and sometimes the master throws it out completely > > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39. > Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of > mfsmaster. > Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected > (23.797745s) > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was > reset by Master > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection > with master > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ... > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master > Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > > All machines are running CentOS7 > > However there is a mix of MFS versions. > > The master is running 3.0.105 > > and the chunkservers are running various versions. 
> > > 1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422> > 1486939 9.8 TiB 11 TiB > 90.49 > - 0 0 B 0 B > - > 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422> > 2313507 14 TiB 15 TiB > 90.49 > - 0 0 B 0 B > - > 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422> > 2154089 13 TiB 14 TiB > 90.49 > - 0 0 B 0 B > - > 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422> > 2900774 16 TiB 18 TiB > 90.49 > - 0 0 B 0 B > - > 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422> > 2319192 13 TiB 14 TiB > 90.49 > - 0 0 B 0 B > - > 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422> > 2131090 13 TiB 18 TiB > 70.56 > - 0 0 B 0 B > - > 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422> > 7196 47 GiB 18 TiB > 0.25 > - 0 0 B 0 B > - > > Only the two newest units #6 and #7 are having problems and they are > running the latest MFS version. They were added due to the 90% disk > space issue, so there is a lot of rebalancing going on. > > I assumed the problem a mismatch between the 3.0.105 master and the new > version but #2 is also running 3.0.114 and is not having problem (though > it does have an older kernel) > > the networking appears fine (iperf runs at 1GB) no errors in dmesg etc. > > I will be scheduling some downtime to bring the master up to date > shortly but I'm interested if anybody else is having this problem Hi, It's definitely not a good idea to have an older master and newer chunk servers. I would suggest upgrade of everything to match the newest chunk servers. The messages you are getting are simply indicators of timed out connections. A connection between two MFS modules can time out due to network errors or due to one of the modules being "too busy" and not responding in time. "Too busy" might mean a number of things: slow I/O on local disks, CPU not keeping up (happens when you have other processes running on the same machines as MFS modules) and a number of other factors. You need to take a look at your system and try to find the bottleneck. For starters, you can try to lower the replication limits (fourth value in the CHUNKS_WRITE_REP_LIMIT and CHUNKS_READ_REP_LIMIT settings) and see if it helps get rid of the messages. -- Agata Kruszona-Zawadzka MooseFS Team |
From: Aleksander W. <ale...@mo...> - 2020-09-02 09:29:41
|
It looks like you chunk servers: 10.166.0.26 and 10.166.0.27 have some serious problems: network and disks problems "IO ERROR". Are you sure that they have enough resources - I mean they are not swapping or they don't have any network problems? Also at the moment, it looks like your cluster is running with not allowed configuration! MooseFS master should always be higher or equal in version to other MooseFS components(chunk servers, clients, meta loggers). Right now your master server is running in version 3.0105 - this can lead to weird cluster behavior. Please update all components to the same version first. Best regards, Aleksander Wieliczko System Engineer MooseFS Development & Support Team | moosefs.pro wt., 1 wrz 2020 o 23:36 WK <wk...@bn...> napisał(a): > Forgot to show the master logs > > > Getting tons of these. All error to the two new systems. > > Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected > Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected > Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected > Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> > 10.166.0.26:9422) chunk: 0000000002161D0F replication status: Disconnected > Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 00000000002A71B0 replication status: Disconnected > Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000001FDF21 replication status: Disconnected > Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 0000000001EB5CE2 replication status: Disconnected > Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000002132B93 replication status: Disconnected > Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> > 10.166.0.26:9422) chunk: 0000000001CD13D5 replication status: Disconnected > Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000827F86 replication status: Disconnected > Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000867B8E replication status: Disconnected > Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000803F97 replication status: Disconnected > Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000020A5306 replication status: Disconnected > Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000133697 replication status: Disconnected > Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000007B7FCF replication status: Disconnected > Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000002966E3 replication status: Disconnected > Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> > 10.166.0.26:9422) chunk: 0000000001C96E65 replication status: Disconnected > Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 000000000213C6EA replication status: Disconnected > Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> > 10.166.0.26:9422) chunk: 
00000000021651E6 replication status: Disconnected > Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000002150C8F replication status: Disconnected > Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> > 10.166.0.26:9422) chunk: 00000000021070A8 replication status: Disconnected > Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> > 10.166.0.26:9422) chunk: 0000000001CAB31F replication status: Disconnected > Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error > Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000049C99 replication status: Disconnected > Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000001E3DA57 replication status: Disconnected > Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000014388A replication status: Disconnected > Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000160E1AA replication status: Disconnected > Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> > 10.166.0.26:9422) chunk: 00000000021656B1 replication status: Disconnected > Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> > 10.166.0.26:9422) chunk: 0000000001F20BAD replication status: Disconnected > Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000D03303 replication status: Disconnected > Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000215CF9C replication status: Disconnected > Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> > 10.166.0.26:9422) chunk: 000000000213C207 replication status: Disconnected > Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000019A2E00 replication status: Disconnected > Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000001F75859 replication status: Disconnected > Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000000534F96 replication status: Disconnected > Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000001C69388 replication status: Disconnected > Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000001F3B4C2 replication status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> > 10.166.0.26:9422) chunk: 0000000002164F77 replication status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> > 10.166.0.26:9422) chunk: 00000000020A8CF7 replication status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000020F4BCD replication status: Disconnected > Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> > 10.166.0.26:9422) chunk: 00000000021509FF replication status: Disconnected > Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000025AB13 replication status: Disconnected > Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 000000000213C226 replication status: Disconnected > Sep 1 14:21:09 mfs66master mfsmaster[1297]: 
(10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 0000000002151A03 replication status: Disconnected > Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> > 10.166.0.26:9422) chunk: 0000000002165995 replication status: Disconnected > Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> > 10.166.0.26:9422) chunk: 0000000001C7CD90 replication status: Disconnected > Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> > 10.166.0.27:9422) chunk: 00000000021615A3 replication status: Disconnected > Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> > 10.166.0.26:9422) chunk: 0000000002161C17 replication status: Disconnected > > > On 9/1/2020 2:19 PM, WK wrote: > > I just added two new chunkservers to an existing cluster. > > I am seeing lots of these > > Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive > timed out > > > and sometimes the master throws it out completely > > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39. > Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of > mfsmaster. > Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected > (23.797745s) > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset > by Master > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection > with master > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ... > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master > Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > > All machines are running CentOS7 > > However there is a mix of MFS versions. > > The master is running 3.0.105 > > and the chunkservers are running various versions. 
> > > 1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422> > 1486939 9.8 TiB 11 TiB > 90.49 > - 0 0 B 0 B > - > 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422> > 2313507 14 TiB 15 TiB > 90.49 > - 0 0 B 0 B > - > 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422> > 2154089 13 TiB 14 TiB > 90.49 > - 0 0 B 0 B > - > 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422> > 2900774 16 TiB 18 TiB > 90.49 > - 0 0 B 0 B > - > 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422> > 2319192 13 TiB 14 TiB > 90.49 > - 0 0 B 0 B > - > 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422> > 2131090 13 TiB 18 TiB > 70.56 > - 0 0 B 0 B > - > 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : switch on > <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422> > 7196 47 GiB 18 TiB > 0.25 > - 0 0 B 0 B > - > > Only the two newest units #6 and #7 are having problems and they are > running the latest MFS version. They were added due to the 90% disk space > issue, so there is a lot of rebalancing going on. > > I assumed the problem a mismatch between the 3.0.105 master and the new > version but #2 is also running 3.0.114 and is not having problem (though it > does have an older kernel) > > the networking appears fine (iperf runs at 1GB) no errors in dmesg etc. > > I will be scheduling some downtime to bring the master up to date shortly > but I'm interested if anybody else is having this problem > > -wk > > > > _________________________________________ > moosefs-users mailing list > moo...@li... > https://lists.sourceforge.net/lists/listinfo/moosefs-users > |
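Since the advice here is to keep the master at a version greater than or equal to every other component, a quick shell sketch for auditing versions across hosts may help; the host list is a placeholder, it assumes passwordless ssh, and it assumes the chunkserver binary accepts the same -v flag that mfsmaster is shown using earlier in this thread:

  # master version
  mfsmaster -v

  # chunkserver versions (replace the host list with your own)
  for h in mfs66chunker1 mfs66chunker2 mfs66chunker3; do
      printf '%s: ' "$h"
      ssh "$h" mfschunkserver -v
  done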
From: WK <wk...@bn...> - 2020-09-01 21:35:43
|
I just added two new chunkservers to an existing cluster. I am seeing lots of these Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: receive timed out and sometimes the master throws it out completely Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39. Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of mfsmaster. Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop detected (23.797745s) Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was reset by Master Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing connection with master Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ... Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: connection lost All machines are running CentOS7 However there is a mix of MFS versions. The master is running 3.0.105 and the chunkservers are running various versions. 
1 mfs66chunker1 10.166.0.21 9422 3 - 3.0.103 4 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.21%3A9422> 1486939 9.8 TiB 11 TiB 90.49 - 0 0 B 0 B - 2 mfs66chunker2 10.166.0.22 9422 2 - 3.0.114 6 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.22%3A9422> 2313507 14 TiB 15 TiB 90.49 - 0 0 B 0 B - 3 mfs66chunker3 10.166.0.23 9422 1 - 3.0.111 13 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.23%3A9422> 2154089 13 TiB 14 TiB 90.49 - 0 0 B 0 B - 4 mfs66chunker4 10.166.0.24 9422 4 - 3.0.103 12 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.24%3A9422> 2900774 16 TiB 18 TiB 90.49 - 0 0 B 0 B - 5 mfs66chunker5 10.166.0.25 9422 5 - 3.0.103 12 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.25%3A9422> 2319192 13 TiB 14 TiB 90.49 - 0 0 B 0 B - 6 mfs66chunker6 10.166.0.26 9422 6 - 3.0.114 (6) OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.26%3A9422> 2131090 13 TiB 18 TiB 70.56 - 0 0 B 0 B - 7 mfs66chunker7 10.166.0.27 9422 7 - 3.0.114 4 OFF : switch on <http://mfs66master.pixelgate.net:9425/mfs.cgi?sections=CS&CSmaintenanceon=10.166.0.27%3A9422> 7196 47 GiB 18 TiB 0.25 - 0 0 B 0 B - Only the two newest units #6 and #7 are having problems and they are running the latest MFS version. They were added due to the 90% disk space issue, so there is a lot of rebalancing going on. I assumed the problem was a mismatch between the 3.0.105 master and the new version, but #2 is also running 3.0.114 and is not having problems (though it does have an older kernel). The networking appears fine (iperf runs at 1GB), no errors in dmesg, etc. I will be scheduling some downtime to bring the master up to date shortly, but I'm interested whether anybody else is having this problem. -wk |
From: WK <wk...@bn...> - 2020-09-01 21:35:38
|
Forgot to show the master logs Getting tons of these. All error to the two new systems. Sep 1 14:09:52 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected Sep 1 14:10:12 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000016A270 replication status: Disconnected Sep 1 14:10:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000014B7ACC replication status: Disconnected Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161D0F replication status: Disconnected Sep 1 14:11:09 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000002A71B0 replication status: Disconnected Sep 1 14:11:49 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000001FDF21 replication status: Disconnected Sep 1 14:12:14 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000001EB5CE2 replication status: Disconnected Sep 1 14:12:54 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002132B93 replication status: Disconnected Sep 1 14:14:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CD13D5 replication status: Disconnected Sep 1 14:14:14 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000827F86 replication status: Disconnected Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000867B8E replication status: Disconnected Sep 1 14:15:10 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000803F97 replication status: Disconnected Sep 1 14:15:15 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020A5306 replication status: Disconnected Sep 1 14:15:16 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000133697 replication status: Disconnected Sep 1 14:15:21 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000007B7FCF replication status: Disconnected Sep 1 14:15:26 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000002966E3 replication status: Disconnected Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001C96E65 replication status: Disconnected Sep 1 14:15:31 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000213C6EA replication status: Disconnected Sep 1 14:15:58 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 00000000021651E6 replication status: Disconnected Sep 1 14:16:01 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002150C8F replication status: Disconnected Sep 1 14:16:06 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021070A8 replication status: Disconnected Sep 1 14:16:07 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 0000000001CAB31F replication status: Disconnected Sep 1 14:16:18 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 000000000004BE69 replication status: IO error Sep 1 14:16:34 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000049C99 replication status: Disconnected Sep 1 14:16:35 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001E3DA57 
replication status: Disconnected Sep 1 14:16:39 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000014388A replication status: Disconnected Sep 1 14:16:40 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000160E1AA replication status: Disconnected Sep 1 14:16:43 mfs66master mfsmaster[1297]: (10.166.0.23:9422 -> 10.166.0.26:9422) chunk: 00000000021656B1 replication status: Disconnected Sep 1 14:17:59 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 0000000001F20BAD replication status: Disconnected Sep 1 14:18:41 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000D03303 replication status: Disconnected Sep 1 14:19:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000215CF9C replication status: Disconnected Sep 1 14:19:18 mfs66master mfsmaster[1297]: (10.166.0.22:9422 -> 10.166.0.26:9422) chunk: 000000000213C207 replication status: Disconnected Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000019A2E00 replication status: Disconnected Sep 1 14:20:17 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F75859 replication status: Disconnected Sep 1 14:20:22 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000000534F96 replication status: Disconnected Sep 1 14:20:24 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001C69388 replication status: Disconnected Sep 1 14:20:28 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000001F3B4C2 replication status: Disconnected Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 0000000002164F77 replication status: Disconnected Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 00000000020A8CF7 replication status: Disconnected Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000020F4BCD replication status: Disconnected Sep 1 14:20:29 mfs66master mfsmaster[1297]: (10.166.0.25:9422 -> 10.166.0.26:9422) chunk: 00000000021509FF replication status: Disconnected Sep 1 14:21:02 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000025AB13 replication status: Disconnected Sep 1 14:21:06 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 000000000213C226 replication status: Disconnected Sep 1 14:21:09 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 0000000002151A03 replication status: Disconnected Sep 1 14:22:30 mfs66master mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 0000000002165995 replication status: Disconnected Sep 1 14:22:52 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000001C7CD90 replication status: Disconnected Sep 1 14:23:25 mfs66master mfsmaster[1297]: (10.166.0.26:9422 -> 10.166.0.27:9422) chunk: 00000000021615A3 replication status: Disconnected Sep 1 14:23:39 mfs66master mfsmaster[1297]: (10.166.0.21:9422 -> 10.166.0.26:9422) chunk: 0000000002161C17 replication status: Disconnected On 9/1/2020 2:19 PM, WK wrote: > > I just added two new chunkservers to an existing cluster. 
> > I am seeing lots of these > > Sep 1 14:09:31 mfs66chunker7 mfschunkserver[1364]: replicator: > receive timed out > > > and sometimes the master throws it out completely > > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:38:46 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:39 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:39:40 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:16 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:40:44 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:19 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:41:49 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:43:09 mfs66chunker6 systemd-logind: Removed session 39. > Sep 1 13:43:09 mfs66chunker6 systemd: Removed slice User Slice of > mfsmaster. > Sep 1 13:43:34 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:44:08 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:45:53 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:03 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:47:51 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:31 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:49:41 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: long loop > detected (23.797745s) > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: connection was > reset by Master > Sep 1 13:50:05 mfs66chunker6 mfschunkserver[27693]: closing > connection with master > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connecting ... > Sep 1 13:50:06 mfs66chunker6 mfschunkserver[27693]: connected to Master > Sep 1 13:50:36 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > Sep 1 13:51:17 mfs66chunker6 mfschunkserver[27693]: replicator: > connection lost > > All machines are running CentOS7 > > However there is a mix of MFS versions. > > The master is running 3.0.105 > > and the chunkservers are running various versions. 
>
> #  host           IP           port  id  version  load  maint.  chunks    used     total   used %  to remove
> 1  mfs66chunker1  10.166.0.21  9422  3   3.0.103  4     OFF     1486939   9.8 TiB  11 TiB  90.49   0 (0 B)
> 2  mfs66chunker2  10.166.0.22  9422  2   3.0.114  6     OFF     2313507   14 TiB   15 TiB  90.49   0 (0 B)
> 3  mfs66chunker3  10.166.0.23  9422  1   3.0.111  13    OFF     2154089   13 TiB   14 TiB  90.49   0 (0 B)
> 4  mfs66chunker4  10.166.0.24  9422  4   3.0.103  12    OFF     2900774   16 TiB   18 TiB  90.49   0 (0 B)
> 5  mfs66chunker5  10.166.0.25  9422  5   3.0.103  12    OFF     2319192   13 TiB   14 TiB  90.49   0 (0 B)
> 6  mfs66chunker6  10.166.0.26  9422  6   3.0.114  (6)   OFF     2131090   13 TiB   18 TiB  70.56   0 (0 B)
> 7  mfs66chunker7  10.166.0.27  9422  7   3.0.114  4     OFF     7196      47 GiB   18 TiB  0.25    0 (0 B)
>
> Only the two newest units, #6 and #7, are having problems, and they are
> running the latest MFS version. They were added due to the 90% disk
> space issue, so there is a lot of rebalancing going on.
>
> I assumed the problem was a mismatch between the 3.0.105 master and the
> new version, but #2 is also running 3.0.114 and is not having problems
> (though it does have an older kernel).
>
> The networking appears fine (iperf runs at 1GB), and there are no errors
> in dmesg etc.
>
> I will be scheduling some downtime to bring the master up to date
> shortly, but I'm interested in whether anybody else is having this problem.
>
> -wk
>
>
>
|
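A quick way to confirm the pattern described above (that the failed replications all involve the two new chunkservers) is to tally the source and destination addresses straight out of the master syslog. Below is a minimal sketch, assuming the master logs to /var/log/messages and that the lines look exactly like the ones quoted above; the log path and the regular expression are assumptions to adjust for your own syslog setup.

#!/usr/bin/env python3
# Tally "replication status" errors from the mfsmaster syslog by
# (source -> destination) chunkserver pair and by status text.
# Assumption: the lines match the format quoted in this thread, e.g.
#   ... mfsmaster[1297]: (10.166.0.24:9422 -> 10.166.0.26:9422) chunk: 00000000021490FB replication status: Disconnected

import re
import sys
from collections import Counter

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"  # assumption

PATTERN = re.compile(
    r"mfsmaster\[\d+\]: \((?P<src>[\d.]+:\d+) -> (?P<dst>[\d.]+:\d+)\) "
    r"chunk: (?P<chunk>[0-9A-F]+) replication status: (?P<status>.+)$"
)

pairs = Counter()
statuses = Counter()

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        m = PATTERN.search(line.rstrip())
        if not m:
            continue
        pairs[(m["src"], m["dst"])] += 1
        statuses[m["status"]] += 1

print("Failures by source -> destination:")
for (src, dst), count in pairs.most_common():
    print(f"  {src} -> {dst}: {count}")

print("\nFailures by status:")
for status, count in statuses.most_common():
    print(f"  {status}: {count}")

If every pair lists 10.166.0.26 or 10.166.0.27 as one endpoint, the failures are confined to the new machines rather than spread across the whole cluster.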
From: web u. <web...@gm...> - 2020-08-15 17:17:17
|
Is there a free version of a Windows client available? If so, where can I get it? What about a commercial product? Can I just buy the MooseFS Windows client, or do I have to upgrade my entire MooseFS installation to a paid version? My setup is mostly Linux, but I have one Windows machine that I need to support, hence the need for a MooseFS Windows client.
|
From: Jay L. <jl...@sl...> - 2020-07-27 17:41:40
|
Hi all,

I am happy to report that this problem resolved itself. All data was there, and it just took another automatic MFS filesystem check loop to identify that the blocks were not actually missing. (i.e., I did nothing and MFS just fixed itself.) Awesome!

Jay

On Mon, Jul 27, 2020 at 12:37 PM Jay Livens <jl...@sl...> wrote:

> *Edit: I just realized that I had previously sent this with an email
> address that was not subscribed to this list. Sorry if this gets duped.*
>
> I had a catastrophic hardware outage due to a switch upgrade this AM, and
> MFS was not happy about it. Everything was working great prior to the
> problem. Once MooseFS came back online, I was hit with the following: Missing
> files (gathered by previous file-loop) for 3,470 files, and the
> error message in each row in the missing file table says "NO COPY."
>
> A quick check of some of the files listed under missing files indicates
> that they still exist and are still accessible, and so I am trying to
> understand what this error means and what I should do about it. I do have a
> backup of my files if I need to go that route. (The admin page does not
> show any missing chunks.)
>
> Thank you in advance!
>
|
From: Piotr R. K. <pio...@mo...> - 2020-07-27 17:07:39
|
Hello Jay, Do you have any missing chunks (in red) in the "All chunks state matrix" (MFS CGI, "Info" tab) at the moment? Best regards, Piotr *Piotr Robert Konopelko* | m: +48 601 476 440 | e: pio...@mo... *Business & Technical Support Manager* MooseFS Client Support Team WWW <https://moosefs.com> | GitHub <https://github.com/moosefs/moosefs> | Twitter <https://twitter.com/moosefs> | Facebook <https://www.facebook.com/moosefs> | LinkedIn <https://www.linkedin.com/company/moosefs> On Mon, Jul 27, 2020 at 6:38 PM Jay Livens <jl...@sl...> wrote: > *Edit: I just realized that I had previously sent this with an email > address that was not subscribed to this list. Sorry if this gets duped.* > > I had a catastrophic hardware outage due to a switch upgrade this AM, and > MFS was not happy about it. Everything was working great prior to the > problem. Once MooseFS came back online, I was hit with the following: Missing > files (gathered by previous file-loop) for 3,470 files, and the > error message in each row in the missing file table says "NO COPY." > > A quick check of some of the files listed under missing files indicates > that they still exist and are still accessible, and so I am trying to > understand what this error means and what I should do about it. I do have a > backup of my files if I need to go that route. (The admin page does not > show any missing chunks.) > > Thank you in advance! > _________________________________________ > moosefs-users mailing list > moo...@li... > https://lists.sourceforge.net/lists/listinfo/moosefs-users > |
From: Jay L <jl...@li...> - 2020-07-27 16:46:40
|
Hi, I had a catastrophic hardware outage due to a switch upgrade this AM, and MFS was not happy about it. Everything was working great prior to the problem. Once MooseFS came back online, I was hit with the following: Missing files (gathered by previous file-loop) for 3,470 files, and the error message in each row in the missing file table says "NO COPY." A quick check of some of the files listed under missing files indicated that they still existed and were still accessible, and so I am trying to understand what this error means and what I should do about it. I do have a backup if I need to go that route. Thank you in advance! |
From: Jay L. <jl...@sl...> - 2020-07-27 16:37:48
|
*Edit: I just realized that I had previously sent this with an email address that was not subscribed to this list. Sorry if this gets duped.* I had a catastrophic hardware outage due to a switch upgrade this AM, and MFS was not happy about it. Everything was working great prior to the problem. Once MooseFS came back online, I was hit with the following: Missing files (gathered by previous file-loop) for 3,470 files, and the error message in each row in the missing file table says "NO COPY." A quick check of some of the files listed under missing files indicates that they still exist and are still accessible, and so I am trying to understand what this error means and what I should do about it. I do have a backup of my files if I need to go that route. (The admin page does not show any missing chunks.) Thank you in advance! |
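Before restoring from backup in a situation like this, it can help to script the spot check described above, that is, to actually read a little data from every path the CGI reports as missing. Below is a minimal sketch under the assumption that the reported paths have been saved one per line to a text file and that the filesystem is mounted at /mnt/mfs; both names are placeholders for illustration.

#!/usr/bin/env python3
# Spot-check files reported as "missing" / "NO COPY": try to read the first
# block of each one and report which paths actually fail.
# Assumptions: missing_files.txt holds one path per line, and /mnt/mfs is the
# MooseFS mount point -- both are placeholders, adjust to your setup.

import os

MOUNT_POINT = "/mnt/mfs"          # assumption: your MooseFS mount
LIST_FILE = "missing_files.txt"   # assumption: paths dumped from the CGI

ok, failed = 0, 0
with open(LIST_FILE) as listing:
    for raw in listing:
        rel = raw.strip()
        if not rel:
            continue
        path = os.path.join(MOUNT_POINT, rel.lstrip("/"))
        try:
            with open(path, "rb") as f:
                f.read(65536)     # force a real chunk read, not just a stat
            ok += 1
        except OSError as err:
            failed += 1
            print(f"UNREADABLE: {path} ({err})")

print(f"\nreadable: {ok}, unreadable: {failed}")

If everything reads back cleanly, waiting for the next automatic file/chunk loop may be all that is needed, which is how this particular incident ended up resolving.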
From: Tianon G. <ti...@in...> - 2020-06-10 18:22:29
|
FWIW, https://github.com/moosefs/moosefs/issues/290 is an (existing) related discussion.

Tianon Gravi
SVP of Operations
InfoSiftr, LLC
ti...@in...
Office: (702) 724-2670 ext. 6180; Las Vegas, NV
(Main office phone -- does not reach my desk)
4096R / B42F 6819 007F 00F8 8E36 4FD4 036A 9C25 BF35 7DD4

On Wed, 10 Jun 2020 at 06:28, Agata Kruszona-Zawadzka <ch...@mo...> wrote:

> Hi,
>
> If I may suggest - ask this on GitHub. The community there is much more
> active than here; someone might have a solution that they can share with
> you.
>
> Regards,
> Agata
>
> On 05.06.2020 at 13:42, Markus Köberl wrote:
> > Are there nagios/icinga scripts available for monitoring? This should be
> > quite easy using mfscli.
> > I guess somebody has already done the work and is willing to share the
> > scripts.
> >
> >
> > Thank you,
> > Markus Köberl
> >
> >
> --
> --
> Agata Kruszona-Zawadzka
> MooseFS Team
>
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
>
|
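For the monitoring question quoted above, a Nagios/Icinga-style check can indeed be built around mfscli. Below is a minimal sketch assuming mfscli is installed on the monitoring host and that "mfscli -SIN -H <master-host>" works on your release; the master hostname, the keyword list, and the output parsing are assumptions, since the exact wording of mfscli output differs between MooseFS versions, so check mfscli -h and adapt.

#!/usr/bin/env python3
# Minimal Nagios/Icinga-style check built around mfscli.  It verifies that the
# master answers an info query and scans the output for lines that mention an
# alarm keyword together with a non-zero count.  The hostname, keywords and
# parsing are assumptions -- mfscli output wording varies between versions.

import re
import subprocess
import sys

MASTER_HOST = "mfsmaster"                    # assumption: your master's hostname
ALARM_KEYWORDS = ("missing", "endangered")   # assumption: adjust to your output

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes

try:
    result = subprocess.run(
        ["mfscli", "-SIN", "-H", MASTER_HOST],
        capture_output=True, text=True, timeout=30,
    )
except (OSError, subprocess.TimeoutExpired) as err:
    print(f"UNKNOWN - could not run mfscli: {err}")
    sys.exit(UNKNOWN)

if result.returncode != 0:
    print(f"CRITICAL - mfscli exited {result.returncode}: {result.stderr.strip()}")
    sys.exit(CRITICAL)

suspicious = []
for line in result.stdout.splitlines():
    lowered = line.lower()
    if not any(word in lowered for word in ALARM_KEYWORDS):
        continue
    counts = re.findall(r"\d+", line)
    # only alarm when the line carries a non-zero number (e.g. "missing chunks: 3")
    if counts and any(int(n) > 0 for n in counts):
        suspicious.append(line.strip())

if suspicious:
    print(f"WARNING - {len(suspicious)} suspicious line(s), first: {suspicious[0]}")
    sys.exit(WARNING)

print("OK - mfsmaster info query succeeded")
sys.exit(OK)

Hooking it into Nagios or Icinga is then just a matter of defining a command that runs the script and a service check that uses it.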