From: Brian P. <bri...@tu...> - 2004-10-27 22:59:07
I believe I've found and fixed a long-hidden bug in the mothership code related to connection brokering.

I'd been testing a simple sort-last configuration with Infiniband. The app/readback nodes were being automatically started. Sometimes everything would start up fine, but other times Chromium would appear to hang during start-up. It definitely felt like a timing problem.

The short story is that the ib_accept_wait and ib_connect_wait fields on socket wrappers need to be cleared after they're used/consumed. This also needs to be done for SDP and TCPIP. It looks like GM, Teac and TSComm are OK.

Here's the long story:

The SocketWrapper class has a number of <protocol>_accept_wait and <protocol>_connect_wait fields that are used for connection brokering between Chromium nodes. When the mothership gets a 'connectionrequest' message, we loop over all socket wrappers looking for a <protocol>_accept_wait field that's not 'None' and matches the incoming request's hostname and port number. When we find one, that means there was an earlier "acceptrequest" message that we can now satisfy. So, we send out two messages, one to the connect-node and one to the accept-node, to tell the two endpoints about each other.

Now, we should never re-use the <protocol>_accept_wait field. But I found that we were never resetting the <protocol>_accept_wait field to 'None', so we _were_ reusing the info! Depending on the protocol, this was harmless or deadly.

With TCP/IP the messages sent to the connect-node and accept-node were just a connection ID and endian flag. It turns out the connection ID isn't especially significant to the two nodes. So, returning the same data to several connections wasn't a big deal. But with Infiniband there's a whole bunch of other info sent to the two parties: node_id, server_lid1, server_qp_ous, server_qp, etc. That led to trouble.
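To make the failure mode concrete, here is a minimal Python sketch of the brokering step described above. It is not the actual mothership code: the do_connect_request function, the reply method, the message tuples, and the clear_after_use flag are all made up for illustration. Only the SocketWrapper name and the ib_accept_wait field come from the description above, and the real code also handles the case where no accept is pending yet, which is omitted here.

```python
class SocketWrapper:
    """Hypothetical stand-in for the mothership's per-socket wrapper."""

    def __init__(self, name):
        self.name = name
        # (hostname, port) of a pending accept request, or None if nothing is pending.
        self.ib_accept_wait = None
        # Replies "sent" to this node, recorded here instead of going over a socket.
        self.sent = []

    def reply(self, msg):
        self.sent.append(msg)


def do_connect_request(wrappers, connect_sock, hostname, port, clear_after_use=True):
    """Try to satisfy a connect request from a pending accept request.

    Returns True if a matching pending accept was found, False otherwise.
    Passing clear_after_use=False reproduces the old (buggy) behaviour.
    """
    for w in wrappers:
        if w.ib_accept_wait is not None and w.ib_accept_wait == (hostname, port):
            # Tell the two endpoints about each other.
            connect_sock.reply(("connect-ok", w.name))
            w.reply(("accept-ok", connect_sock.name))
            if clear_after_use:
                # The fix: a pending accept may satisfy exactly one connect.
                w.ib_accept_wait = None
            return True
    return False


if __name__ == "__main__":
    server = SocketWrapper("server")
    server.ib_accept_wait = ("host1", 7000)
    c1, c2 = SocketWrapper("c1"), SocketWrapper("c2")
    wrappers = [server, c1, c2]
    print(do_connect_request(wrappers, c1, "host1", 7000))
    print(do_connect_request(wrappers, c2, "host1", 7000))
```

With the field cleared, the second connect request finds nothing pending and has to wait for a second accept request. With clear_after_use=False, both connect requests are matched against the same stale accept info, which is the situation that broke Infiniband.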
In my case, several "connectionrequest" messages were all getting satisfied by just one or two "acceptrequest" messages; the one-to-one correspondence wasn't enforced.

Conversely, the <protocol>_connect_wait field must also be cleared when we process an "acceptrequest" message for an earlier "connectionrequest" message. Interestingly, the GM protocol does this already.

Finally, the timing of events was significant. Depending on the order in which "connectrequest" and "acceptrequest" messages came in, we could sometimes get lucky and manage to establish all the connections. But it was just dumb luck when we did.

If there aren't any concerns, I'll check in my fixes tomorrow morning. I believe the TCP/IP, IB and SDP protocols are the only ones that need to be fixed. For the TSComm/Teac protocols, the <protocol>_connect/accept_wait fields are actually lists. It appears that entries from the list are correctly popped/removed after they're used.

-Brian