Re: [Quickfix-developers] Intermittent disconnects on Solaris

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

On Tue, 22 Mar 2005 12:05:43 -0000, kri...@rb...
<kri...@rb...> wrote:
> Oren, Caleb -
> 
> When debugging this problem, I was quite surprised by QuickFIX's socket handling. Despite the use of select(), non-blocking I/O is not actually used which seems to be why platform-specific ioctl's are used (to figure out how much you can read with out blocking).
> 
> As for using recv()/read() == 0 to determine EOF, that fact that the actual socket reads and writes are scattered among various functions and methods of unrelated classes makes this quite harder to do. It also complicates error reporting (i.e. errno on failed reads/writes) and disconnection reasons that I've seen people rquesting on the list.
> 
> There is a SocketConnection class which 'owns' a connected socket and there are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv() calls returns two calls to the syscall recv(), and neither one is in either SocketConnection or in Utility.cpp. FIX::Parser's readFromStream()'s method seems an especially dubious place to put recv(); the fact that FIX::Parser has a member varible for holding a file descriptor as well a pointer to an std::istream also strikes me as wrong.

I agree that the design of the connections, strategies, etc. is
confusing and hard to follow.  I also agree that the Parser should not
have any knowledge of a file descriptor or iostream.  It should just
be handed char*s or std::strings.

> lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h
> Parser.cpp-144-
> Parser.cpp:145:    size = recv( m_socket, m_readBuffer, m_bufferSize, 0 );
> Parser.cpp-146-    if ( size <= 0 ) throw RecvFailed();
> --
> ThreadedSocketConnection.cpp-96-    buffer = new char[ bytes + 1 ];
> ThreadedSocketConnection.cpp:97:    int result = recv( m_socket, buffer, bytes, 0 );
> ThreadedSocketConnection.cpp-98-    if ( result <= 0 ) { throw std::exception(); }
> 
> During my attempts to recreate the Solaris disconnect problem, I wrote sample QuickFIX apps that would send torrents of messages. It was quite easy to lock up two QuickFIX apps talking to each other during a reconnect, both sides would requests resends from each other and then the SocketInitiator and SocketAcceptor threads in the apps would block on send() when resending the messages to each other, as send() would not return until the other side called recv(), which it could not do until the other side called recv(), etc. The apps would then sleep forever.

I have previously submitted a patch (attached here) that enables
non-blocking I/O by using the MSG_DONTWAIT send/recv flag.  It is
lightly tested, and was made originally against 1.9.0, but I was able
to clear up a lock-up type scenario as you describe with it.

I'm not totally happy with the end-result: it won't work on systems
that don't support MSG_DONTWAIT, and wedges yet another callback into
the Socket Strategy interface (onWriteable).  A better approach to the
first problem would be to use the appropriate ioctl to set the sockets
into non-blocking mode for all operations.  As far as the Strategy
interface goes, I think it is over-designed (look at all the empty
callbacks!) and should probably be scrapped in favor of something
simpler.

> This problem can only be avoided by using non-block I/O or having one thread for reading and another for writing, neither of which SocketInitiator or SocketAcceptor do (I haven't looked at ThreadedSocketInitiator).

I think the overall approach of using select and then calling
send/recv as appropriate is the right one.  I don't think the idea of
another thread per-connection or possibly two is tenable.  There is
already one that is handing off strings to the Parser for processing
(when using ThreadedSocketConnection).  Adding more would be
gratuitous.

The down-side of using select is that you essentially lock yourself
into a certain amount of latency: the timeval you pass to the select
call.  If the SocketMonitor is sitting in select and you have a new fd
to add to the list (e.g. there is suddenly some data to send), you
either have to wait for the select timeout to expire, or you add a
special "signalling" fd to your fd_set that you write to when you want
to force select to break out of its wait.

Anyway, thanks for the analysis and insight Kristofer.  I hope we can
improve QuickFIX as a result of the points raised in this thread.

-- 
Caleb Epstein
caleb dot epstein at gmail dot com