Re: [Quickfix-developers] Intermittent disconnects on Solaris
Brought to you by:
orenmnero
From: Caleb E. <cal...@gm...> - 2005-03-22 13:35:43
|
On Tue, 22 Mar 2005 12:05:43 -0000, kri...@rb... <kri...@rb...> wrote: > Oren, Caleb - > > When debugging this problem, I was quite surprised by QuickFIX's socket handling. Despite the use of select(), non-blocking I/O is not actually used which seems to be why platform-specific ioctl's are used (to figure out how much you can read with out blocking). > > As for using recv()/read() == 0 to determine EOF, that fact that the actual socket reads and writes are scattered among various functions and methods of unrelated classes makes this quite harder to do. It also complicates error reporting (i.e. errno on failed reads/writes) and disconnection reasons that I've seen people rquesting on the list. > > There is a SocketConnection class which 'owns' a connected socket and there are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv() calls returns two calls to the syscall recv(), and neither one is in either SocketConnection or in Utility.cpp. FIX::Parser's readFromStream()'s method seems an especially dubious place to put recv(); the fact that FIX::Parser has a member varible for holding a file descriptor as well a pointer to an std::istream also strikes me as wrong. I agree that the design of the connections, strategies, etc. is confusing and hard to follow. I also agree that the Parser should not have any knowledge of a file descriptor or iostream. It should just be handed char*s or std::strings. > lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h > Parser.cpp-144- > Parser.cpp:145: size = recv( m_socket, m_readBuffer, m_bufferSize, 0 ); > Parser.cpp-146- if ( size <= 0 ) throw RecvFailed(); > -- > ThreadedSocketConnection.cpp-96- buffer = new char[ bytes + 1 ]; > ThreadedSocketConnection.cpp:97: int result = recv( m_socket, buffer, bytes, 0 ); > ThreadedSocketConnection.cpp-98- if ( result <= 0 ) { throw std::exception(); } > > During my attempts to recreate the Solaris disconnect problem, I wrote sample QuickFIX apps that would send torrents of messages. It was quite easy to lock up two QuickFIX apps talking to each other during a reconnect, both sides would requests resends from each other and then the SocketInitiator and SocketAcceptor threads in the apps would block on send() when resending the messages to each other, as send() would not return until the other side called recv(), which it could not do until the other side called recv(), etc. The apps would then sleep forever. I have previously submitted a patch (attached here) that enables non-blocking I/O by using the MSG_DONTWAIT send/recv flag. It is lightly tested, and was made originally against 1.9.0, but I was able to clear up a lock-up type scenario as you describe with it. I'm not totally happy with the end-result: it won't work on systems that don't support MSG_DONTWAIT, and wedges yet another callback into the Socket Strategy interface (onWriteable). A better approach to the first problem would be to use the appropriate ioctl to set the sockets into non-blocking mode for all operations. As far as the Strategy interface goes, I think it is over-designed (look at all the empty callbacks!) and should probably be scrapped in favor of something simpler. > This problem can only be avoided by using non-block I/O or having one thread for reading and another for writing, neither of which SocketInitiator or SocketAcceptor do (I haven't looked at ThreadedSocketInitiator). I think the overall approach of using select and then calling send/recv as appropriate is the right one. I don't think the idea of another thread per-connection or possibly two is tenable. There is already one that is handing off strings to the Parser for processing (when using ThreadedSocketConnection). Adding more would be gratuitous. The down-side of using select is that you essentially lock yourself into a certain amount of latency: the timeval you pass to the select call. If the SocketMonitor is sitting in select and you have a new fd to add to the list (e.g. there is suddenly some data to send), you either have to wait for the select timeout to expire, or you add a special "signalling" fd to your fd_set that you write to when you want to force select to break out of its wait. Anyway, thanks for the analysis and insight Kristofer. I hope we can improve QuickFIX as a result of the points raised in this thread. -- Caleb Epstein caleb dot epstein at gmail dot com |