From: W K. <wk...@bn...> - 2011-05-23 22:32:55
Last night, one of our 4 chunkservers 'locked up' in some mysterious way. The master started giving off these messages:

May 22 22:37:50 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 000000000023DB7F replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 0000000000149C9D replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 00000000002EB8F6 deletion status: 22
May 22 22:38:26 mfs1master mfsmaster[2522]: connection with ML(192.168.0.24) has been closed by peer
May 23 11:12:11 mfs1master mfsmaster[2522]: chunkserver disconnected - ip: 192.168.0.24, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)

MooseFS did the right thing and kicked the chunkserver out. There was no interruption of service, and we didn't even notice the problem until someone looked at the CGI this morning and saw that we had a large number of undergoal (goal=2) files, which MooseFS was fixing (and had been fixing all night) at a rate of about 2-4 chunks a second.

So we replaced the failed chunkserver and continued on, quite content with how resilient MooseFS was under a failure.

We then thought about it and decided that we had gone a long time with only 1 copy of a large number of chunks, and that perhaps a goal of 3 would have been safer (i.e. if 1 of the 4 chunkservers dies, we still have 2 copies and could still lose a second chunkserver without harm).

So we reset the goal from 2 to 3. We did this while we were still in an undergoal position at goal=2 for about 10,000 chunks that hadn't yet been healed.

So now the CGI is showing 10,000+ chunks with a single copy (red), 2 million+ chunks are now orange (2 copies), and the system is happily increasing the 'green' column of 3 valid copies.

The problem is that it seems to be concentrating on the orange (2 copy) files and ignoring the 10,000+ red ones that are most at risk.
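
To make the expectation concrete, here is a minimal sketch of the ordering we would have guessed at: undergoal chunks sorted so that those with the fewest valid copies are replicated first. This is purely an illustration of the idea, not MooseFS's actual scheduler; the function name `replication_order` and the chunk names are hypothetical.

```python
def replication_order(valid_copies, goal):
    """Return IDs of undergoal chunks, fewest valid copies first.

    valid_copies: dict mapping a chunk ID to its number of valid copies.
    goal: the desired number of copies.
    Hypothetical helper for illustration only; not how mfsmaster works.
    """
    undergoal = [
        (copies, chunk_id)
        for chunk_id, copies in valid_copies.items()
        if copies < goal
    ]
    # Sorting the (copies, chunk_id) tuples puts the most at-risk chunks first.
    undergoal.sort()
    return [chunk_id for _, chunk_id in undergoal]


if __name__ == "__main__":
    # Made-up chunk names and copy counts, with the goal raised to 3.
    state = {"chunk_a": 2, "chunk_b": 1, "chunk_c": 3, "chunk_d": 2}
    print(replication_order(state, goal=3))
```

With an ordering like this, the single-copy (red) chunks would be healed before any of the 2-copy (orange) ones were touched.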

In the last hour we've seen a few 'red' chunks disappear, but the vast majority of activity is occurring in the orange (2 copy) column. Shouldn't the replication worry about the single copy files first?

I also realize we could simply set the goal back to 2, let it finish that up, and THEN switch it to 3, but I'm curious as to what the community says.

-WK