[Chromium-dev] Mothership connection brokering bug

I believe I've found and fixed a long-hidden bug in the mothership
code related to connection brokering.

I'd been testing a simple sort-last configuration with Infiniband.
The app/readback nodes were being automatically started.  Sometimes
everything would start up fine, but other times Chromium would appear
to hang during start-up.  It definitely felt like a timing problem.

The short story is the ib_accept_wait and ib_connect_wait fields on
socket wrappers need to be cleared after they're used/consumed.  This
also needs to be done for SDP and TCPIP.  It looks like GM, Teac and
TSComm are OK.

Here's the long story:

The SocketWrapper class has a number of <protocol>_accept_wait and
<protocol>_connect_wait fields that are used for connection brokering
between Chromium nodes.

When the mothership gets a 'connectionrequest' message, we loop over
all socket wrappers looking for a <protocol>_accept_wait field that's
not 'None' and matches the incoming request's hostname and port
number.  When we find one, that means there was an earlier
"acceptrequest" message that we can now satisfy.  So, we send out two
messages, one to the connect-node and one to the accept-node, to tell
the two endpoints about each other.  Now, we should never re-use the
<protocol>_accept_wait field.  But I found that we were never
resetting the <protocol>_accept_wait field to 'None' so we _were_
reusing the info!
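
To make the fix concrete, here is a rough sketch of the idea in Python.
It is not the actual mothership code: the SocketWrapper stand-in, the
handle_connect_request function, the reply method and the message
tuples are all made up for illustration, and only the TCP/IP field is
modelled.

```python
class SocketWrapper:
    """Hypothetical stand-in for the mothership's socket wrapper."""

    def __init__(self, name):
        self.name = name
        self.tcpip_accept_wait = None  # (hostname, port) of a pending accept
        self.sent = []                 # messages "sent" to this node

    def reply(self, msg):
        self.sent.append(msg)


def handle_connect_request(wrappers, requester, hostname, port, conn_id):
    """Pair a connect request with one pending accept, then consume it."""
    for w in wrappers:
        if w.tcpip_accept_wait == (hostname, port):
            requester.reply(("connect-ok", conn_id))  # tell the connect-node
            w.reply(("accept-ok", conn_id))           # tell the accept-node
            w.tcpip_accept_wait = None                # the fix: never reuse it
            return True
    return False  # no pending accept; the real code would park the request


server = SocketWrapper("server")
server.tcpip_accept_wait = ("node1", 7000)
c1, c2 = SocketWrapper("c1"), SocketWrapper("c2")
wrappers = [server, c1, c2]

first = handle_connect_request(wrappers, c1, "node1", 7000, 1)
second = handle_connect_request(wrappers, c2, "node1", 7000, 2)
```

With the reset in place, only the first connect request is paired with
the pending accept.  Without that one line, the second request would
match the same stale entry.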

Depending on the protocol, this was harmless or deadly.  With TCP/IP
the messages sent to the connect-node and accept-node were just a
connection ID and endian flag.  It turns out the connection ID isn't
especially significant to the two nodes.  So, returning the same data
to several connections wasn't a big deal.

But with Infiniband there's a whole bunch of other info sent to the
two parties: node_id, server_lid1, server_qp_ous, server_qp, etc.
That led to trouble.

In my case, several "connectionrequest" messages were all getting
satisfied by just one or two "acceptrequest" messages; the one-to-one
correspondence wasn't enforced.

Conversely, the <protocol>_connect_wait field must also be cleared
when we process an "acceptrequest" message for an earlier
"connectionrequest" message.

Interestingly, the GM protocol does this already.
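
The converse case, where the accept arrives after the connect, might
look roughly like this.  Again, this is only a sketch under my own
assumptions: SocketWrapper, handle_accept_request, reply and the
message tuples are hypothetical names, and the IB-specific payload
(node_id, server_qp, etc.) is left out.

```python
class SocketWrapper:
    """Hypothetical stand-in for a mothership socket wrapper."""

    def __init__(self, name):
        self.name = name
        self.ib_accept_wait = None   # (hostname, port) of a parked accept
        self.ib_connect_wait = None  # (hostname, port) of a parked connect
        self.sent = []               # messages "sent" to this node

    def reply(self, msg):
        self.sent.append(msg)


def handle_accept_request(wrappers, acceptor, hostname, port, conn_id):
    """Satisfy an accept from one parked connect, or park the accept."""
    for w in wrappers:
        if w.ib_connect_wait == (hostname, port):
            w.reply(("connect-ok", conn_id))
            acceptor.reply(("accept-ok", conn_id))
            w.ib_connect_wait = None  # consume the parked connect request
            return True
    # Nothing to pair with yet: remember the accept for a later connect.
    acceptor.ib_accept_wait = (hostname, port)
    return False


client = SocketWrapper("client")
client.ib_connect_wait = ("node1", 7000)
server = SocketWrapper("server")
wrappers = [client, server]

first = handle_accept_request(wrappers, server, "node1", 7000, 1)
second = handle_accept_request(wrappers, server, "node1", 7000, 2)
```

The first accept consumes the parked connect; the second finds nothing
to pair with and is parked itself, which keeps the one-to-one
correspondence.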

Finally, the timing of events was significant.  Depending on the order
in which "connectrequest" and "acceptrequest" messages came in, we
could sometimes get lucky and manage to establish all the connections.
But it was just dumb luck when we did.

If there aren't any concerns, I'll check in my fixes tomorrow morning.
I believe the TCP/IP, IB and SDP protocols are the only ones that need
to be fixed.

For the TSComm/Teac protocols, the <protocol>_connect/accept_wait
fields are actually lists.  It appears that entries from the list are
correctly popped/removed after they're used.
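
For comparison, the list-based approach can be sketched like this.  The
match_and_pop helper and the (hostname, port) entry layout are
assumptions for illustration, not the real TSComm/Teac code.

```python
def match_and_pop(pending, hostname, port):
    """Remove and return the first pending entry matching hostname/port.

    Returns None when nothing matches, so an entry can satisfy at most
    one request.
    """
    for i, entry in enumerate(pending):
        if entry == (hostname, port):
            return pending.pop(i)
    return None


# Two accepts are pending for node1:7000, one for node2:7001.
accept_wait = [("node1", 7000), ("node2", 7001), ("node1", 7000)]

r1 = match_and_pop(accept_wait, "node1", 7000)
r2 = match_and_pop(accept_wait, "node1", 7000)
r3 = match_and_pop(accept_wait, "node1", 7000)
```

Because each match removes its entry from the list, a third request for
the same host and port finds nothing left to reuse.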

-Brian