#83 SocketInitiator: Timeout connecting kills other connections


QuickFIX 1.13.2 and later have a fairly serious bug: while the engine is trying to connect to a host, it does not service other connections. If the host is on a network that has suddenly gone dark (which happened to us yesterday), WinSock can take about 25 seconds to time out. If enough failing sessions go to the same network, the resulting delays on all other lines are enough to cause the counterparties to hang up.

This happens with the SocketInitiator. I tracked the behavior to commit #2159, which also deals with timeouts. An effective way to reproduce it is to set up several sessions to an invalid host through a null route (http://en.wikipedia.org/wiki/Null_route) and then check whether other, valid sessions still connect properly.

I don't know if switching to ThreadedSocketInitiator would fix the problem for us. I'll be sure to try that if I can't figure out how to fix the bug itself. I'm sure I can produce sample C# code if needed.


  • The original bug report doesn't exactly describe the problem it's trying to fix, but as I understand it, there is a circular dependency: the SocketInitiator needs an error to happen before it calls connect() on the base Initiator class, and connect() needs to be called for an error to happen. Thanks to the m_reconnectInterval logic in SocketInitiator::onTimeout(), that cycle sometimes gets interrupted. The solution chosen there seems to have been to make connects synchronous, which, as I noted above, is disastrous for all other lines serviced by the same SocketInitiator.

    So it needs to be put back to asynchronous, which means rolling back commit #2159, and we need to find a better way to ensure connect() is called again in a reasonable amount of time.

    One simple fix would be for SocketInitiator::onStart() to call block() with a nonzero timeout, say one second. When select() times out in SocketMonitor::block(), it causes a call to onTimeout(), which is what we want in the first place. Of course, we also want that call when select() never times out (for example, because other sockets stay busy), so really, why not call onTimeout() on every block()?
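The loop proposed in the comment above could look roughly like the following. This is a minimal sketch, not QuickFIX code: `SketchInitiator`, `SketchSession`, `blockOnce()` and the `waitForActivity` callback are hypothetical names made up for illustration, time is a logical clock in seconds, and the callback stands in for `select()` with a one-second timeout.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one initiator session.
struct SketchSession {
    bool connected = false;
    long lastAttempt = -1000000;  // logical seconds; far in the past
    int attempts = 0;
};

// Hypothetical loop owner; the names echo the discussion, not QuickFIX's classes.
class SketchInitiator {
public:
    // waitForActivity(timeoutSeconds) plays the role of select(): it returns
    // true when a socket became ready and false when the timeout expired.
    SketchInitiator(std::size_t sessionCount, long reconnectInterval,
                    std::function<bool(int)> waitForActivity)
        : m_sessions(sessionCount),
          m_reconnectInterval(reconnectInterval),
          m_wait(std::move(waitForActivity)) {
        assert(reconnectInterval > 0);
    }

    // One pass of the loop: wait at most one second, then always run
    // onTimeout(), whether or not the wait actually timed out.
    void blockOnce(long now) {
        const bool activity = m_wait(1);
        if (activity) ++m_eventsHandled;  // a real engine would read/write here
        onTimeout(now);
    }

    // Retry every disconnected session whose reconnect interval has elapsed.
    void onTimeout(long now) {
        for (SketchSession& s : m_sessions) {
            if (s.connected) continue;
            if (now - s.lastAttempt < m_reconnectInterval) continue;
            s.lastAttempt = now;
            ++s.attempts;  // a real engine would start a non-blocking connect
        }
    }

    int attempts(std::size_t index) const { return m_sessions.at(index).attempts; }
    int eventsHandled() const { return m_eventsHandled; }

private:
    std::vector<SketchSession> m_sessions;
    long m_reconnectInterval;
    std::function<bool(int)> m_wait;
    int m_eventsHandled = 0;
};
```

With a callback that always reports activity (the case where `select()` never times out), reconnects are still attempted once per interval, because `onTimeout()` no longer depends on the wait expiring.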

  • Patch to revert the original fix for 2895449

  • Added two patches. First one reverts #2159 and makes connects asynchronous again. The second adds a call to onTimeout() within SocketInitiator's block() loop, and a one-second timeout to the block() call to make sure that part of the loop is called at least that often. That should make sure that disconnected sessions are reattempted *sometime*.
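For readers unfamiliar with what "asynchronous" connects mean here, the sketch below shows the general technique of starting a connect without waiting for it and checking the outcome later. It is not taken from the patches. It assumes POSIX sockets rather than the WinSock setup described in the report (on Windows the equivalent would use `ioctlsocket()` with `FIONBIO` and check for `WSAEWOULDBLOCK`), and `startConnect()`, `pollConnect()` and `ConnectState` are hypothetical names.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: begins a TCP connect and returns immediately.
// Returns the socket descriptor, or -1 if the attempt could not be started.
int startConnect(const char* ipv4, unsigned short port) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (::inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) return -1;

    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // Switch to non-blocking mode so connect() cannot stall the caller.
    const int flags = ::fcntl(fd, F_GETFL, 0);
    if (flags < 0 || ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        ::close(fd);
        return -1;
    }

    const int rc = ::connect(fd, reinterpret_cast<const sockaddr*>(&addr),
                             sizeof(addr));
    if (rc == 0 || errno == EINPROGRESS) return fd;  // done, or still in flight
    ::close(fd);
    return -1;
}

enum class ConnectState { Pending, Connected, Failed };

// Hypothetical helper: checks a pending connect, waiting at most timeoutMs.
ConnectState pollConnect(int fd, int timeoutMs) {
    assert(timeoutMs >= 0);
    fd_set writable;
    FD_ZERO(&writable);
    FD_SET(fd, &writable);
    timeval tv;
    tv.tv_sec = timeoutMs / 1000;
    tv.tv_usec = (timeoutMs % 1000) * 1000;

    const int ready = ::select(fd + 1, nullptr, &writable, nullptr, &tv);
    if (ready == 0) return ConnectState::Pending;  // still connecting
    if (ready < 0) return ConnectState::Failed;

    // Writability alone is not proof of success; SO_ERROR holds the outcome.
    int err = 0;
    socklen_t len = sizeof(err);
    if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return ConnectState::Failed;
    return ConnectState::Connected;
}
```

The point of the split is that a dead network only leaves one descriptor in the `Pending` state; the loop that calls `pollConnect()` with a short timeout stays free to service the sessions that are already connected.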

  • Oren Miller

    Patches have been applied and checked into the repository.

  • Oren Miller

    • status: open --> pending