#135 ShortIdStock::stockboy: Assertion `leaseValid' failed

open
nobody
Repository (41)
5
2007-03-05
2007-03-05
No

It looks like I may have exposed some lurking problems with repos/85. A long-running program (the weeder, which had been running for about 2.5 hours when this occurred) running against the new repository code with the lower shortid block lease time (2 hours rather than 24 hours) died with this assertion failure:

SourceOrDerived.C:207: static void* ShortIdStock::stockboy(void*): Assertion `leaseValid' failed.

Before changing the lease duration, I looked over the client-side implementation which acquires and renews shortid block leases (which is where this assertion occurred), and it seemed to me that it should function correctly with this change. However, with the old duration it's unlikely that many client programs ran long enough to have their shortid block leases expire, so it's fair to say that the expiration case hasn't had much testing.

There are a few ways that the client-side renewal code could get into trouble. First, the lease expiration time is given to the client as an absolute time on the server. The client waits until its system clock is close to that time before renewing the lease. (It calls the server to renew if it's within 120 seconds of expiration, but sleeps until it's within 60 seconds.) If the client's system clock is significantly behind the server's, the lease could expire before the client makes the renewal call. Second, while 60 seconds should be enough margin for lease renewal, there could be a combination of delays on the client, on the server, and in the network between them. While it seems unlikely that those delays would add up to 60 seconds, it's not completely inconceivable in cases of high load, high I/O latency, or when the server has difficulty acquiring the locks needed for lease renewal.
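To make the arithmetic concrete, here is a rough sketch of how much lease time is left (by the server's clock) when the renewal request is actually processed. The function and parameter names are made up for illustration and are not taken from SourceOrDerived.C; the only number from the real code is the 60 second wake-up margin.

```cpp
#include <cassert>

// Hypothetical helper, not part of the real client code.
// wake_margin:   how close to expiration (by the client's clock) the
//                client wakes up to renew, e.g. 60 seconds.
// client_behind: how many seconds the client's clock lags the server's
//                (negative if the client's clock is ahead).
// delay:         total client + network + server delay before the
//                renewal is processed, in seconds.
// Returns the seconds left on the lease, by the server's clock, when
// the renewal is processed. Zero or less means the lease has expired.
inline long remaining_at_renewal(long wake_margin, long client_behind,
                                 long delay) {
    assert(wake_margin >= 0 && delay >= 0);
    return wake_margin - client_behind - delay;
}

// True if the renewal would reach the server while the lease is valid.
inline bool renewal_in_time(long wake_margin, long client_behind,
                            long delay) {
    return remaining_at_renewal(wake_margin, client_behind, delay) > 0;
}
```

With synchronized clocks and a 5 second delay there are 55 seconds to spare, but a client running 90 seconds behind the server would already be 35 seconds too late with the same delay.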

There are a couple of steps we could take to improve the situation. First, the client code could renew at the midpoint of the lease rather than waiting until it has almost expired. This would give much more margin to absorb any potential delays. Second, we could change the network protocol to have the server send the lease expiration back to the client as both an absolute time and a relative time. That would make it possible to work around clock skew: the client would compute a second absolute time from its own system clock plus the relative time, and assume that the lease will expire at the earlier of the two absolute times.
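A minimal sketch of both ideas follows, assuming the server reply carried the two values described above. The struct, its field names, and the function names are hypothetical; the current protocol only sends the absolute time, and none of this is existing code.

```cpp
#include <algorithm>
#include <cassert>
#include <ctime>

// Hypothetical lease reply; field names are illustrative only.
struct LeaseReply {
    std::time_t server_expiry;  // absolute expiration, server's clock
    long        seconds_left;   // relative time until expiration
};

// Expiration the client should assume: the earlier of the server's
// absolute time and (client clock at receipt + relative time).
inline std::time_t effective_expiry(const LeaseReply &r,
                                    std::time_t client_now) {
    assert(r.seconds_left >= 0);
    std::time_t from_relative = client_now + r.seconds_left;
    return std::min(r.server_expiry, from_relative);
}

// Time at which to renew: the midpoint between when the lease was
// granted (by the client's clock) and its assumed expiration.
inline std::time_t renewal_time(std::time_t granted_at,
                                std::time_t expiry) {
    return granted_at + (expiry - granted_at) / 2;
}
```

For example, with a 2 hour lease and a client clock 300 seconds behind the server, the relative time yields the earlier (and correct) expiration, and the client would renew an hour after the grant instead of about a minute before expiry. If the client's clock is ahead instead, the server's absolute time is the earlier one, which is merely conservative.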

Discussion