From: wkmail <wk...@bn...> - 2012-05-17 01:00:23
A previously stable cluster went crazy and we had to bring everything down and then back up again to recover. Does this indicate a NETWORK issue of some sort, or is something else going on? The server is recovering from a failed disk from a few days ago, and we had also designated another drive for removal, so it is busy fixing two separate issues.

On the chunkserver logs we see this:

May 16 15:16:50 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 15:57:37 mfs1chunker5 mfschunkserver[1399]: replicator: got status: 19 from (C0A80017:24CE)
May 16 16:17:34 mfs1chunker5 mfschunkserver[1399]: replicator: got status: 19 from (C0A80017:24CE)
May 16 16:38:40 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 16:39:15 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 16:45:35 mfs1chunker5 mfschunkserver[1399]: replicator: connect timed out
May 16 16:46:05 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 16:46:24 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 16:49:28 mfs1chunker5 mfschunkserver[1399]: (write) write error: ECONNRESET (Connection reset by peer)
May 16 17:24:51 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 17:24:57 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out
May 16 17:25:02 mfs1chunker5 mfschunkserver[1399]: replicator: connect timed out
May 16 17:39:55 mfs1chunker5 mfschunkserver[1399]: replicator: receive timed out

On the mfsmaster logs we see entries like this:

May 16 17:14:43 mfs1master mfsmaster[32522]: (192.168.0.24:9422) chunk: 0000000000A59D5B replication status: 28
May 16 17:14:45 mfs1master mfsmaster[32522]: (192.168.0.27:9422) chunk: 0000000000C8D906 replication status: 28
May 16 17:14:45 mfs1master mfsmaster[32522]: (192.168.0.22:9422) chunk: 0000000000CD6FCD replication status: 28
May 16 17:14:47 mfs1master mfsmaster[32522]: (192.168.0.25:9422) chunk: 000000000024AB78 replication status: 28
May 16 17:14:49 mfs1master mfsmaster[32522]: (192.168.0.21:9422) chunk: 0000000000CC7DEA replication status: 28
May 16 17:14:49 mfs1master mfsmaster[32522]: (192.168.0.24:9422) chunk: 00000000001A7DEA replication status: 28
May 16 17:14:52 mfs1master mfsmaster[32522]: (192.168.0.27:9422) chunk: 000000000027B995 replication status: 28
May 16 17:14:52 mfs1master mfsmaster[32522]: (192.168.0.22:9422) chunk: 00000000002BB995 replication status: 28
May 16 17:14:54 mfs1master mfsmaster[32522]: (192.168.0.25:9422) chunk: 0000000000258C07 replication status: 28
May 16 17:14:54 mfs1master mfsmaster[32522]: (192.168.0.21:9422) chunk: 0000000000238C07 replication status: 28
May 16 17:14:56 mfs1master mfsmaster[32522]: (192.168.0.24:9422) chunk: 0000000000205E79 replication status: 26
May 16 17:14:58 mfs1master mfsmaster[32522]: (192.168.0.22:9422) chunk: 00000000005330EB replication status: 26
May 16 17:14:59 mfs1master mfsmaster[32522]: (192.168.0.27:9422) chunk: 0000000000579A24 replication status: 26
May 16 17:14:59 mfs1master mfsmaster[32522]: (192.168.0.25:9422) chunk: 0000000000C89A24 replication status: 26
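Incidentally, the "(C0A80017:24CE)" in the replicator messages appears to be a hex-encoded peer address: eight hex digits of IPv4 followed by the TCP port. A quick Python sketch to decode it (the decoded result, 192.168.0.23:9422, matches the chunkserver addresses seen in the mfsmaster log, which suggests the failing peer is one specific chunkserver):

```python
def decode_peer(addr: str) -> tuple[str, int]:
    """Turn a hex peer tag like '(C0A80017:24CE)' into ('192.168.0.23', 9422)."""
    ip_hex, port_hex = addr.strip("()").split(":")
    # First 8 hex digits are the IPv4 octets, two digits each.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(0, 8, 2)]
    return ".".join(octets), int(port_hex, 16)

print(decode_peer("(C0A80017:24CE)"))  # -> ('192.168.0.23', 9422)
```

The status numbers should correspond to the error codes defined in MooseFS's MFSCommunication.h header; it would be worth grepping that header in your installed source tree to see what 19, 26, and 28 mean for your version rather than trusting a table from memory.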