From: Chris P. <ch...@ec...> - 2012-02-28 08:09:54
Hi All,

I have noticed that if I restart a chunkserver, then when it rejoins, the CGI shows some of the chunks as undergoal (about 100 or so, depending on how long it was offline). I assume this is because chunks change while the chunkserver is offline, so it comes back with outdated copies.

Most of the undergoal chunks are re-replicated fairly quickly (a minute or two), but I often see a few chunks that take much longer to get replicated (up to an hour or more). Often it is the same chunks (same ID) each time. In my case, this chunk came up undergoal a lot while I was restarting my chunkservers:

  ndb-test1-02.os.img
  chunk 224: 0000000000001068_00000036 / (id:4200 ver:54)
  copy 1: 10.168.8.54:9422

I have also been seeing the following in my logs:

  replicator: got status: 19 from (XXXXX)

Status 19 is "wrong chunk version". I am assuming that the replicator is trying to replicate that chunk, but because the chunk changes so often, by the time the replicator has copied the data the copy is already out of date, so it is not used.

Can someone confirm my thoughts above? Would it be useful to have a patch that forces replication of a chunk after X failed attempts (by locking the source chunk for a short while, to ensure that replication completes)?

Regards
Chris
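
P.S. For anyone comparing chunk names against the version in the error: in the line quoted above, the name 0000000000001068_00000036 appears to be the chunk ID and version in hexadecimal, joined by an underscore, which matches the "id:4200 ver:54" printed next to it. A small Python sketch for decoding it, assuming that layout holds in general (the helper name parse_chunk_name is made up for illustration and is not part of any MooseFS tool):

```python
def parse_chunk_name(name: str) -> tuple:
    """Split '<hex id>_<hex version>' into (chunk_id, version) integers.

    Assumes the layout seen in the quoted output: two hexadecimal
    fields separated by a single underscore.
    """
    chunk_hex, version_hex = name.split("_")
    return int(chunk_hex, 16), int(version_hex, 16)


chunk_id, version = parse_chunk_name("0000000000001068_00000036")
print(chunk_id, version)  # 4200 54
```

Decoding the name a few times during a restart would show whether the version keeps increasing between replication attempts, which is what I suspect is happening.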