[Moosefs-users] Replication Priority?


Last night, one of our 4 chunkservers 'locked up' in some mysterious 
way. The master started logging these messages:

May 22 22:37:50 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 000000000023DB7F replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 0000000000149C9D replication status: 22
May 22 22:37:51 mfs1master mfsmaster[2522]: (192.168.0.24:9422) chunk: 00000000002EB8F6 deletion status: 22
May 22 22:38:26 mfs1master mfsmaster[2522]: connection with ML(192.168.0.24) has been closed by peer
May 23 11:12:11 mfs1master mfsmaster[2522]: chunkserver disconnected - ip: 192.168.0.24, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)

MooseFS did the right thing and kicked the chunkserver out. There was 
no interruption of service, and we didn't even notice the problem until 
someone looked at the CGI this morning and saw a large number of 
undergoal (goal=2) files. MooseFS was fixing the undergoal condition 
(and had been all night) at a rate of about 2-4 chunks a second.

So we replaced the failed chunkserver and continued on, quite content 
with how resilient MooseFS was under a failure.

We then thought about it and decided that we had gone a long time with 
only 1 copy of a large number of chunks and that perhaps a goal of 3 
would have been safer. (i.e. if 1 of the 4 chunkservers dies, we still 
have 2 copies and could still lose a second chunkserver without harm).
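The arithmetic behind that reasoning can be sketched in a few lines of Python. This is only an illustration, assuming each copy of a chunk sits on a different chunkserver and that no re-replication has happened yet; the function names are made up for this example and are not part of MooseFS.

```python
def worst_case_copies(goal: int, failed_servers: int) -> int:
    """Copies of a chunk left in the worst case after some chunkservers fail.

    Assumes every copy lives on a different chunkserver and that no
    re-replication has completed between the failures.
    """
    return max(goal - failed_servers, 0)


def tolerable_failures(goal: int) -> int:
    """Chunkserver failures a chunk can survive with at least one copy left."""
    return max(goal - 1, 0)


if __name__ == "__main__":
    for goal in (2, 3):
        for failed in (1, 2):
            left = worst_case_copies(goal, failed)
            print(f"goal={goal}, failed chunkservers={failed}: {left} copy/copies left")
```

With goal=2, a single failure leaves affected chunks on one copy; with goal=3, the same failure still leaves two.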

So we reset the goal from 2 to 3. We did this while we were still in an 
undergoal position at goal=2 for about 10,000 chunks that hadn't yet 
been healed.

So now the CGI is showing 10,000+ chunks with a single copy (red) and 2 
million+ chunks with 2 copies (orange), and the system is happily 
increasing the 'green' column (3 valid copies).

The problem is that it seems to be concentrating on the orange (2 copy) 
files and ignoring the 10,000+ red ones that are most at risk. In the 
last hour we've seen a few 'red' chunks disappear, but the vast majority 
of activity is occurring in the orange (2 copy) column.

Shouldn't the replication worry about the single copy files first?
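To make the question concrete, here is a rough Python sketch of the ordering I would have expected: undergoal chunks queued so that those with the fewest valid copies are handled first. This is not how the MooseFS master actually schedules replication (I don't know its internals); the function name and the chunk IDs are hypothetical.

```python
import heapq


def replication_order(valid_copies: dict, goal: int) -> list:
    """Return undergoal chunk IDs, fewest valid copies first.

    valid_copies maps a chunk ID to its current number of valid copies.
    Chunks already at or above the goal are skipped. Ties are broken by
    chunk ID so the result is deterministic.
    """
    heap = [(copies, chunk_id)
            for chunk_id, copies in valid_copies.items()
            if copies < goal]
    heapq.heapify(heap)

    order = []
    while heap:
        _copies, chunk_id = heapq.heappop(heap)
        order.append(chunk_id)
    return order


if __name__ == "__main__":
    # Hypothetical chunk IDs: one 'red' (1 copy), two 'orange' (2 copies),
    # one 'green' (3 copies) under goal=3.
    chunks = {"chunk_a": 2, "chunk_b": 1, "chunk_c": 3, "chunk_d": 2}
    print(replication_order(chunks, goal=3))
```

Under this ordering the single-copy chunk would be replicated before any of the 2-copy ones.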

I also realize we could simply set the goal back to 2, let it finish 
that up, and THEN switch it to 3, but I'm curious what the community 
says.

-WK
