RE: [Quickfix-developers] Intermittent disconnects on Solaris
From: Bishop, B. <Bar...@gs...> - 2005-03-22 12:35:49
Hi Kristofer,

Firstly, I'd like to say thanks very much for your hard work in finding this 'bug'. The later versions of QuickFIX are much more graceful when reconnecting, but it still remains an annoying problem.

Secondly, I totally agree with all of your comments. I would further add that it has always struck me as madness that the Parser class contains a socket handle. It also contains an istream pointer, so I conclude that every time a new strategy is devised for collecting bytes to parse, a new member will be added to Parser. Personally, I would have designed Parser to have a method for processing bytes, with separate classes for collecting bundles of bytes to give to the Parser, whether from a socket, a stream, a file or whatever.

I would also reinforce your point about threading and sockets. For clarity of design and ease of partitioning work, I would absolutely always design components so that every socket has its own dedicated reading thread, and very often a dedicated writing thread as well. These would, of course, always use blocking I/O.

At the moment, I think these improvements would involve significant upheaval in the design of QuickFIX. Oren, is this kind of thing at all likely? Later in the year I might actually have time to work on some of these ideas, but in the meantime, thanks again.

Barry Bishop

-----Original Message-----
From: kri...@rb... [mailto:kri...@rb...]
Sent: Tuesday, March 22, 2005 12:06 PM
To: or...@qu...; cal...@gm...
Cc: qui...@li...; Bishop, Barry
Subject: RE: [Quickfix-developers] Intermittent disconnects on Solaris

Oren, Caleb -

When debugging this problem, I was quite surprised by QuickFIX's socket handling. Despite the use of select(), non-blocking I/O is not actually used, which seems to be why platform-specific ioctls are used (to figure out how much can be read without blocking).
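To make the point about select() without non-blocking I/O concrete, here is a minimal POSIX C++ sketch. It is not QuickFIX code; `set_nonblocking`, `drain_socket` and `ReadResult` are hypothetical names. Once a descriptor is in `O_NONBLOCK` mode, the code that runs after select() reports it readable can simply call recv() until it fails with `EAGAIN`/`EWOULDBLOCK`, treating a return of 0 as EOF, so no I_NREAD/FIONREAD ioctl is needed. It also follows Barry's suggested split: the helper only collects bytes into a buffer, and whatever parses that buffer never needs to know where the bytes came from.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

// Possible outcomes of draining a readable socket (hypothetical type).
enum class ReadResult { Drained, PeerClosed, Error };

// Switch the descriptor to non-blocking mode so recv() can never stall
// a select() loop; returns false if fcntl() fails.
bool set_nonblocking(int fd) {
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags == -1) return false;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Hypothetical helper, meant to be called after select() reports fd as
// readable. Appends everything currently queued to `inbound` (the buffer a
// parser would consume) without any ioctl: recv() == 0 signals EOF, and
// EAGAIN/EWOULDBLOCK means the queue is empty for now.
ReadResult drain_socket(int fd, std::string& inbound) {
  assert(fd >= 0);
  char buffer[4096];
  for (;;) {
    ssize_t n = recv(fd, buffer, sizeof buffer, 0);
    if (n > 0) {
      inbound.append(buffer, static_cast<std::size_t>(n));
    } else if (n == 0) {
      return ReadResult::PeerClosed;  // orderly shutdown by the peer
    } else if (errno == EINTR) {
      continue;                       // interrupted by a signal; retry
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
      return ReadResult::Drained;     // nothing more to read right now
    } else {
      return ReadResult::Error;       // errno says why
    }
  }
}
```

A caller would hand `inbound` to its parser after each call, and close the socket only on `PeerClosed` (after parsing whatever arrived before EOF) or `Error`. On `Error`, errno is still available for the kind of disconnect reason people have asked for on the list.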
As for using recv()/read() == 0 to determine EOF, the fact that the actual socket reads and writes are scattered among various functions and methods of unrelated classes makes this much harder to do. It also complicates error reporting (i.e. errno on failed reads/writes) and the disconnection reasons that I've seen people requesting on the list. There is a SocketConnection class which 'owns' a connected socket, and there are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv() turns up two calls to the recv() syscall, and neither one is in SocketConnection or in Utility.cpp. FIX::Parser's readFromStream() method seems an especially dubious place to put recv(); the fact that FIX::Parser has a member variable for holding a file descriptor as well as a pointer to an std::istream also strikes me as wrong.

    lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h
    Parser.cpp-144-
    Parser.cpp:145: size = recv( m_socket, m_readBuffer, m_bufferSize, 0 );
    Parser.cpp-146- if ( size <= 0 ) throw RecvFailed();
    --
    ThreadedSocketConnection.cpp-96- buffer = new char[ bytes + 1 ];
    ThreadedSocketConnection.cpp:97: int result = recv( m_socket, buffer, bytes, 0 );
    ThreadedSocketConnection.cpp-98- if ( result <= 0 ) { throw std::exception(); }

During my attempts to recreate the Solaris disconnect problem, I wrote sample QuickFIX apps that would send torrents of messages. It was quite easy to lock up two QuickFIX apps talking to each other during a reconnect: both sides would request resends from each other, and then the SocketInitiator and SocketAcceptor threads in the apps would block on send() while resending the messages. Each send() would not return until the other side called recv(), which it could not do while it was itself blocked in send() waiting for the first side to call recv(). The apps would then sleep forever.
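The send()/send() deadlock goes away once reads never queue behind writes. Below is a minimal POSIX C++ sketch of the "one thread for reading" remedy (and of Barry's dedicated-reader design); `ReaderThreadConnection` is a hypothetical class, not part of QuickFIX, and it assumes blocking sockets throughout. A dedicated thread loops on recv() for the lifetime of the connection, so the owning thread may block in send() for as long as it likes while the peer's data keeps being drained.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <thread>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical wrapper: a dedicated thread performs blocking recv() calls,
// so the owning thread can block in send() without starving the read side.
class ReaderThreadConnection {
 public:
  explicit ReaderThreadConnection(int fd) : fd_(fd) {
    assert(fd_ >= 0);
    reader_ = std::thread([this] { read_loop(); });
  }

  // Blocks until the peer has closed its side and the reader has finished.
  ~ReaderThreadConnection() { join(); }

  // Blocking write of the whole buffer; returns false on a send() error.
  bool send_all(const char* data, std::size_t len) {
    while (len > 0) {
      ssize_t n = send(fd_, data, len, 0);
      if (n < 0) {
        if (errno == EINTR) continue;  // interrupted; retry
        return false;
      }
      data += n;
      len -= static_cast<std::size_t>(n);
    }
    return true;
  }

  // Tell the peer we are done writing; our reader keeps running.
  void shutdown_write() { shutdown(fd_, SHUT_WR); }

  // Wait for the reader to hit EOF (or an error) and return what it read.
  const std::string& join() {
    if (reader_.joinable()) reader_.join();
    return received_;
  }

 private:
  void read_loop() {
    char buffer[4096];
    for (;;) {
      ssize_t n = recv(fd_, buffer, sizeof buffer, 0);
      if (n > 0) {
        received_.append(buffer, static_cast<std::size_t>(n));
      } else if (n < 0 && errno == EINTR) {
        continue;  // interrupted; retry
      } else {
        break;     // 0 = peer closed its side, < 0 = error
      }
    }
  }

  int fd_;
  std::string received_;  // written only by reader_, read after join()
  std::thread reader_;
};
```

With two such connections pushing a payload far larger than the socket buffers at each other simultaneously (the resend scenario), both send_all() calls complete, because each side's reader thread keeps consuming. A real engine would hand received bytes to a parser instead of accumulating them, and would need its own policy for tearing down a peer that never closes.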
This deadlock can only be avoided by using non-blocking I/O or by having one thread for reading and another for writing, neither of which SocketInitiator or SocketAcceptor does (I haven't looked at ThreadedSocketInitiator). Addressing this would be a fair amount of work; however, the patch above has eliminated the disconnects, which were causing us major headaches as they tended to happen at around market close.

Regards,
- Kris

-----Original Message-----
From: Oren Miller [mailto:or...@qu...]
Sent: 21 March 2005 19:23
To: Caleb Epstein; Peterson, Kristofer
Cc: qui...@li...; Bar...@gs...
Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris

Yeah, I agree.

--oren

----- Original Message -----
From: "Caleb Epstein" <cal...@gm...>
To: <kri...@rb...>
Cc: <qui...@li...>; <Bar...@gs...>
Sent: Monday, March 21, 2005 1:20 PM
Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris

> QuickFIX Documentation:
> http://www.quickfixengine.org/quickfix/doc/html/index.html
> QuickFIX FAQ: http://www.quickfixengine.org/wikifix/index.php?QuickFixFAQ
> QuickFIX Support: http://www.quickfixengine.org/services.html
>
> On Mon, 21 Mar 2005 19:02:19 -0000, kri...@rb...
> <kri...@rb...> wrote:
>
>> I have isolated the cause of the intermittent disconnects caused by
>> QuickFIX on Solaris. The problem was due to the use of the I_NREAD ioctl
>> to determine whether a readable socket was at EOF or not.
>>
>> In certain circumstances, which seem to involve high network traffic
>> and low machine load, I_NREAD will return zero for a readable socket
>> that actually has data. In such cases, QuickFIX would erroneously close
>> the socket.
>>
>> I replaced the I_NREAD code in socket_disconnected() with a one-byte
>> recv() using the MSG_PEEK flag. This appears to have resolved this
>> rather troublesome issue for us in production.
>>
>
> Shouldn't QuickFIX just rely on recv returning 0 to detect a socket
> disconnect, instead of relying on this ioctl?
> I've never seen a socket-based application using this technique to
> detect disconnects before. Clearly it's not 100% reliable.
>
> --
> Caleb Epstein
> caleb dot epstein at gmail dot com
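Kris's patch itself is not shown in the thread, so the following is only a sketch of the approach he describes, combined with Caleb's "recv returning 0" rule. The function name `socket_disconnected` comes from the thread, but the `bool socket_disconnected(int)` signature is an assumption. It assumes the function is called only after select() has reported the descriptor readable, so the blocking one-byte recv() with `MSG_PEEK` returns immediately. Peeking leaves the byte queued for the real read, and only a return of 0 (orderly shutdown) or a hard error counts as a disconnect; pending data, however little, never does.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

// Sketch of the MSG_PEEK check described in the thread (signature assumed).
// Call only after select() reports fd readable, so recv() does not block.
// Returns true only for an orderly shutdown (recv() == 0) or a hard error.
bool socket_disconnected(int fd) {
  assert(fd >= 0);
  char byte;
  for (;;) {
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK);  // byte stays in the queue
    if (n > 0) return false;                   // data pending: still connected
    if (n == 0) return true;                   // peer closed the connection
    if (errno == EINTR) continue;              // interrupted by a signal; retry
    // Spurious readiness on a non-blocking socket is not a disconnect.
    return errno != EAGAIN && errno != EWOULDBLOCK;
  }
}
```

Compared with I_NREAD, this asks the kernel the question that actually matters (is the stream at EOF?) rather than how many bytes are queued, which is the quantity Kris observed being misreported as zero under load.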