From: Markus K. <mar...@tu...> - 2020-09-04 13:40:58
For some time now (the last few MooseFS versions), on a few chunkservers the used space grows above the default ACCEPTABLE_PERCENTAGE_DIFFERENCE = 1.0 until I restart the affected chunkserver. On the web interface I see huge numbers for overgoal (even 4 extra copies). After a restart of the chunkserver the overgoal goes down, but it starts growing again after some time. At the moment I have only one chunkserver at average +1.2, but the number reported for overgoal is 20% of the number reported for stable. I have been running MooseFS for many years and have never encountered this kind of problem. Master and chunkservers are all up to date with the newest mfs version.

In the log file of a chunkserver I see:

Sep 3 01:15:41 ravenriad mfschunkserver[11969]: replicator,read chunks: got status: IO error from (172.16.140.244:24CE)

And on chunkserver cfsh11 (= 172.16.140.244) at the same time:

Sep 3 01:15:41 cfsh11 mfschunkserver[713]: chunk_readcrc: file:/srv/MooseFS6//C0/chunk_0000000009E27A36_00000001.mfs - wrong id/version in header (0000000009E27A36_00000000)
Sep 3 01:15:41 cfsh11 mfschunkserver[713]: hdd_io_begin: file:/srv/MooseFS6//C0/chunk_0000000009E27A36_00000001.mfs - read error: Success (errno=0)
Sep 3 01:16:24 cfsh11 mfschunkserver[713]: chunk_readcrc: file:/srv/MooseFS3//39/chunk_0000000014D7E46B_00000001.mfs - wrong id/version in header (0000000014D7E46B_00000000)
Sep 3 01:16:24 cfsh11 mfschunkserver[713]: hdd_io_begin: file:/srv/MooseFS3//39/chunk_0000000014D7E46B_00000001.mfs - read error: Success (errno=0)

On the master I also see log entries like:

Sep 4 10:54:21 cfshm1 mfsmaster[7386]: (172.16.140.140:9422 -> 172.16.140.106:9422) chunk: 0000000009524179 replication status: IO error
Sep 4 10:54:42 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.52:9422) chunk: 0000000004B6062C replication status: IO error
Sep 4 10:54:42 cfshm1 mfsmaster[7386]: got replication status from one server, but another is set as busy !!!
Sep 4 10:54:43 cfshm1 mfsmaster[7386]: chunk 0000000004B6062C_00000001: unexpected BUSY copies - fixing
Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from one server, but another is set as busy !!!
Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from server not set as busy !!!
Sep 4 10:54:43 cfshm1 mfsmaster[7386]: got replication status from server which had had that chunk before (chunk:0000000004B6062C_00000001)
Sep 4 10:54:43 cfshm1 mfsmaster[7386]: chunk 0000000004B6062C_00000001: unexpected BUSY copies - fixing
Sep 4 10:55:56 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.89:9422) chunk: 00000000113406F1 replication status: No such chunk
Sep 4 10:59:16 cfshm1 mfsmaster[7386]: (172.16.140.244:9422 -> 172.16.140.186:9422) chunk: 0000000012DAB6CA replication status: IO error

I am not aware of any changes I might have made. From netdata I get warnings about dropped inbound packets, fifo errors, and TCP handshake resets sent or received.
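To look at the chunks named in the chunk_readcrc messages by hand, the following is a minimal sketch of what I would run on the chunkserver. It assumes the chunk file header starts with an 8-byte "MFSC 1.0" signature followed by a big-endian 64-bit chunk id and a 32-bit version (the fields the message seems to compare against the filename); please correct me if that layout assumption is wrong.

#!/usr/bin/env python3
# Sketch only: compare the id/version stored in a MooseFS chunk file header
# with the id/version encoded in its filename.
# ASSUMPTION: header layout = 8-byte "MFSC 1.0" signature, then a big-endian
# 64-bit chunk id, then a big-endian 32-bit version.

import re
import struct
import sys

NAME_RE = re.compile(r'chunk_([0-9A-Fa-f]{16})_([0-9A-Fa-f]{8})\.mfs$')

def check(path):
    m = NAME_RE.search(path)
    if not m:
        print(f'{path}: filename does not look like a chunk file')
        return
    want_id, want_ver = int(m.group(1), 16), int(m.group(2), 16)
    with open(path, 'rb') as f:
        hdr = f.read(20)
    if len(hdr) < 20:
        print(f'{path}: file shorter than the expected header')
        return
    sig = hdr[:8]
    cid, ver = struct.unpack('>QL', hdr[8:20])
    ok = sig == b'MFSC 1.0' and cid == want_id and ver == want_ver
    print(f'{path}: header id={cid:016X} ver={ver:08X} '
          f'(expected {want_id:016X}_{want_ver:08X}) -> {"OK" if ok else "MISMATCH"}')

if __name__ == '__main__':
    for p in sys.argv[1:]:
        check(p)

I would run it read-only against e.g. /srv/MooseFS6/C0/chunk_0000000009E27A36_00000001.mfs just to see whether the header really contains version 00000000, as the log suggests.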
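As for the netdata warnings, to compare the raw counters across hosts I would use a small sketch like the one below. It is nothing MooseFS-specific; it only reads the standard /sys/class/net/*/statistics counters and the Tcp line from /proc/net/snmp, and the choice of counters is my guess at what netdata is alarming on.

#!/usr/bin/env python3
# Sketch only: print per-interface drop/error counters and TCP reset counters
# from standard Linux files, to compare chunkserver hosts with each other.

import glob
import os

COUNTERS = ('rx_dropped', 'rx_errors', 'rx_fifo_errors', 'tx_dropped', 'tx_errors')

def nic_counters():
    # /sys/class/net/<iface>/statistics/<counter> holds one integer per file.
    for stats_dir in sorted(glob.glob('/sys/class/net/*/statistics')):
        iface = stats_dir.split('/')[4]
        values = {}
        for name in COUNTERS:
            with open(os.path.join(stats_dir, name)) as f:
                values[name] = int(f.read())
        print(iface, ' '.join(f'{k}={v}' for k, v in values.items()))

def tcp_resets():
    # /proc/net/snmp has a header line and a value line, both starting with "Tcp:".
    with open('/proc/net/snmp') as f:
        lines = [line.split() for line in f if line.startswith('Tcp:')]
    tcp = dict(zip(lines[0][1:], (int(v) for v in lines[1][1:])))
    print(f"TCP: EstabResets={tcp['EstabResets']} OutRsts={tcp['OutRsts']} "
          f"AttemptFails={tcp['AttemptFails']}")

if __name__ == '__main__':
    nic_counters()
    tcp_resets()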
These netdata warnings were already present before the problems with overgoal began. In my configuration I have net.core.rmem_max=26214400 and net.ipv4.tcp_challenge_ack_limit=999999999; otherwise it should be the Debian buster defaults.

The setup is for a lab at our university, with 35 chunkservers distributed over several buildings on the campus network. About one third are old workstations running only the chunkserver, with 6-8 HDDs each. The others are cluster nodes (6 HDDs) or workstations (1-4 HDDs) on which users also work on a graphical interface. As a result, not all of the switches involved are data-center switches, and some of the network cabling may date back to 1998, when some of the buildings were built. It is also often the case that the hosts are under heavy load or high memory pressure.

I would be very happy for ideas about what might cause this new behaviour (it started within the last half year) or how to eliminate it again.

regards
Markus Köberl

--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: mar...@tu...