#128 Replication fails with "connection reset by peer"

open
nobody
Repository (41)
4
2006-08-12
2006-08-12
No

The repository client interface (src/VDirSurrogate.C in
/vesta/vestasys.org/vesta/repos) uses the MultiSRPC
class to re-use connections (src/MultiSRPC.C in
/vesta/vestasys.org/vesta/srpc). This reduces the cost
of an individual RPC in applications that make more
than one.

In long-running processes, open connections can lie
dormant in a MultiSRPC instance for a long time, and can
even outlive a reboot of the peer. When that happens,
the next RPC on such a connection can fail with a
"connection reset by peer" error, which in this
situation simply indicates that the other end has
rebooted. A new connection would probably work, so
ideally the RPC would be retried automatically.

The repository server also acts as a client to other
repositories for replication and mastership transfer
operations. If repository A reboots and open
connections to it are cached in repository B's
VDirSurrogate MultiSRPC instance, replication and
mastership transfer operations with repository B as the
destination and repository A as the source will fail
with errors like:

07/25/2006 11:15:34 Replicate:
"example.com/foo/bar/27" initial RPC to
vesta.peer.example.com:21776 failed: connection reset
by peer

It would be preferable to handle this transparently.
While we could handle it specifically for the initial
RPCs in the replication and mastership transfer
portions of the repository server, I think it would be
better to handle it in the VDirSurrogate class.
Unfortunately, I don't think we can handle it down in
the MultiSRPC class as that would require at a minimum
an additional round-trip before each RPC, which seems
unacceptable.

In the event of a "connection reset by peer" error,
VDirSurrogate should start by calling MultiSRPC::Purge
for the host/port in question. After that the RPC
could be retried a single time with a different
connection. If that fails, then it should be
reasonable to return the error to the client.
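
The purge-then-retry-once logic described above could be
sketched roughly as follows. This is only an illustration
and not the actual Vesta code: RpcFailure, ConnectionCache
and call_with_reset_retry are hypothetical names, and
ConnectionCache::purge merely stands in for
MultiSRPC::Purge, whose real signature is not shown here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an RPC failure; not the real SRPC failure type.
struct RpcFailure : std::runtime_error {
    bool connection_reset;  // true for "connection reset by peer"
    RpcFailure(const std::string& msg, bool reset)
        : std::runtime_error(msg), connection_reset(reset) {}
};

// Hypothetical stand-in for a MultiSRPC-like connection cache.
struct ConnectionCache {
    int purge_count = 0;
    // Models MultiSRPC::Purge: drop cached connections to host:port.
    // Here it only counts calls so the sketch stays self-contained.
    void purge(const std::string& host, const std::string& port) {
        (void)host;
        (void)port;
        ++purge_count;
    }
};

// Run rpc(). If it fails with a connection reset, purge the cached
// connections to the peer and retry exactly once. Any other failure,
// or a failure of the retry, propagates to the caller.
template <typename Rpc>
auto call_with_reset_retry(ConnectionCache& cache,
                           const std::string& host,
                           const std::string& port,
                           Rpc rpc) -> decltype(rpc()) {
    try {
        return rpc();
    } catch (const RpcFailure& f) {
        if (!f.connection_reset) throw;  // unrelated error: do not retry
        cache.purge(host, port);         // discard stale connections
    }
    return rpc();  // single retry, expected to use a fresh connection
}
```

The sketch assumes it is safe to issue the same call a
second time; it does not try to determine whether the
peer processed any part of the first attempt before the
reset.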

Discussion