[Moosefs-users] Replication of undergoal chunks can take a long time


Hi All

I have noticed that if I restart a chunkserver, when it rejoins, the CGI
shows some chunks as undergoal (about 100 or so, depending on how long
it was offline).

I assume this is because chunks are changing while the chunkserver is
offline, and it has outdated copies.

Most of the undergoal chunks are re-replicated fairly quickly (a minute
or two), but I often see a few chunks that take much longer to get
replicated (up to an hour or more).

I can see that this often happens to the same chunks (same ID). In my
case, this chunk came up undergoal a lot while I was restarting my
chunkservers:
ndb-test1-02.os.img
chunk 224: 0000000000001068_00000036 / (id:4200 ver:54)
copy 1: 10.168.8.54:9422
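
For reference, the hexadecimal chunk name in that output lines up with
the decimal id and version shown next to it. A minimal sketch, assuming
the name is always "<id in hex>_<version in hex>" as it appears above
(parse_chunk_name is a hypothetical helper, not part of MooseFS):

```python
def parse_chunk_name(name):
    """Split '<hex id>_<hex version>' into (chunk id, version) as integers.

    Assumption: the layout is taken from the single example above.
    """
    chunk_hex, version_hex = name.split("_")
    return int(chunk_hex, 16), int(version_hex, 16)


chunk_id, version = parse_chunk_name("0000000000001068_00000036")
print(chunk_id, version)  # 4200 54
```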

I had also been seeing the following in my logs:
replicator: got status: 19 from (XXXXX)

Status 19 means "wrong chunk version".

I am assuming that the replicator is trying to replicate that chunk, but
because the chunk changes so often, by the time the replicator has
copied the data the copy is already out of date, so it is not used.

Can someone confirm my thoughts above?

Would it be useful to have a patch that forces replication of a chunk
after X failed attempts (by locking the source chunk for a short while,
to ensure that replication completes)?
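
To make the idea concrete, here is a toy model of what I have in mind.
It is only a sketch of the behaviour I am guessing at, not MooseFS code:
SourceChunk, replicate and max_attempts are all hypothetical names, and
the "copy" is simulated by comparing the version before and after.

```python
import threading


class SourceChunk:
    """Toy stand-in for a chunk on the source chunkserver."""

    def __init__(self, version=54):
        self.version = version
        self.lock = threading.Lock()  # taken by writers and by a forced copy

    def write(self):
        # Every modification bumps the version, invalidating in-flight copies.
        with self.lock:
            self.version += 1


def replicate(chunk, writes_during_copy, max_attempts=3):
    """Return (attempts_used, forced).

    writes_during_copy is called while the copy is "in flight" and
    simulates client writes racing with the replicator.
    """
    for attempt in range(1, max_attempts + 1):
        version_at_start = chunk.version
        writes_during_copy(chunk)  # the chunk may change mid-copy
        if chunk.version == version_at_start:
            return attempt, False  # copy is still valid
        # Otherwise: the equivalent of status 19, discard and retry.

    # Fallback after X failures: hold the lock so no write can land.
    with chunk.lock:
        version_at_start = chunk.version
        # (the data would be copied here while writers are blocked)
        assert chunk.version == version_at_start
    return max_attempts + 1, True


print(replicate(SourceChunk(), lambda c: c.write()))  # (4, True)
print(replicate(SourceChunk(), lambda c: None))       # (1, False)
```

With a chunk that is written during every copy, the normal attempts
never succeed and only the locked attempt goes through; an idle chunk
replicates on the first try. The cost is that writers stall while the
lock is held, so X would need to be chosen with care.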

Regards

Chris
