Thread: RE: [Quickfix-developers] Intermittent disconnects on Solaris
Brought to you by:
orenmnero
From: <kri...@rb...> - 2005-03-22 12:06:20
|
Oren, Caleb - When debugging this problem, I was quite surprised by QuickFIX's socket han= dling. Despite the use of select(), non-blocking I/O is not actually used w= hich seems to be why platform-specific ioctl's are used (to figure out how = much you can read with out blocking). As for using recv()/read() =3D=3D 0 to determine EOF, that fact that the ac= tual socket reads and writes are scattered among various functions and meth= ods of unrelated classes makes this quite harder to do. It also complicates= error reporting (i.e. errno on failed reads/writes) and disconnection reas= ons that I've seen people rquesting on the list. There is a SocketConnection class which 'owns' a connected socket and there= are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv()= calls returns two calls to the syscall recv(), and neither one is in eithe= r SocketConnection or in Utility.cpp. FIX::Parser's readFromStream()'s meth= od seems an especially dubious place to put recv(); the fact that FIX::Pars= er has a member varible for holding a file descriptor as well a pointer to = an std::istream also strikes me as wrong. lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h Parser.cpp-144- Parser.cpp:145: size =3D recv( m_socket, m_readBuffer, m_bufferSize, 0 ); Parser.cpp-146- if ( size <=3D 0 ) throw RecvFailed(); -- ThreadedSocketConnection.cpp-96- buffer =3D new char[ bytes + 1 ]; ThreadedSocketConnection.cpp:97: int result =3D recv( m_socket, buffer, = bytes, 0 ); ThreadedSocketConnection.cpp-98- if ( result <=3D 0 ) { throw std::excep= tion(); } During my attempts to recreate the Solaris disconnect problem, I wrote samp= le QuickFIX apps that would send torrents of messages. It was quite easy to= lock up two QuickFIX apps talking to each other during a reconnect, both s= ides would requests resends from each other and then the SocketInitiator an= d SocketAcceptor threads in the apps would block on send() when resending t= he messages to each other, as send() would not return until the other side = called recv(), which it could not do until the other side called recv(), et= c. The apps would then sleep forever. This problem can only be avoided by using non-block I/O or having one threa= d for reading and another for writing, neither of which SocketInitiator or = SocketAcceptor do (I haven't looked at ThreadedSocketInitiator). Addressing this would be a fair amount of work, however the patch above has= eliminated the disconnects which were causing us major headaches as they t= ended to happen at around market close. Regards, - Kris -----Original Message----- From: Oren Miller [mailto:or...@qu...] Sent: 21 March 2005 19:23 To: Caleb Epstein; Peterson, Kristofer Cc: qui...@li...; Bar...@gs... Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris Yeah, I agree. --oren ----- Original Message -----=20 From: "Caleb Epstein" <cal...@gm...> To: <kri...@rb...> Cc: <qui...@li...>; <Bar...@gs...> Sent: Monday, March 21, 2005 1:20 PM Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris > QuickFIX Documentation:=20 > http://www.quickfixengine.org/quickfix/doc/html/index.html > QuickFIX FAQ: http://www.quickfixengine.org/wikifix/index.php?QuickFixFAQ > QuickFIX Support: http://www.quickfixengine.org/services.html > > On Mon, 21 Mar 2005 19:02:19 -0000, kri...@rb... > <kri...@rb...> wrote: > >> I have isolated the cause of the intermittent disconnects caused by=20 >> QuickFIX on Solaris. The problem was due to the use of the I_NREAD ioctl= >> to determine whether a readable socket was EOF or not. >> >> In certain circumstances which seem to involve high network traffic and= >> low machine load, I_NREAD will return zero for a readable socket that= >> actually has data. In such cases, QuickFIX would erroneously close the= >> socket. >> >> I replaced the I_NREAD code in socket_disconnected() one byte recv() wit= h=20 >> the MSG_PEEK flag. This appears to have resolved this rather troublesome= >> issue for us in production. >> > > Shouldn't QuickFIX just rely on recv returning 0 to detect a socket > disconnect, instead of relying on this ioctl? I've never seen a > socket-based application using this technique to detect disconnects > before. Clearly its not 100% reliable. > > --=20 > Caleb Epstein > caleb dot epstein at gmail dot com > > > ------------------------------------------------------- > SF email is sponsored by - The IT Product Guide > Read honest & candid reviews on hundreds of IT Products from real users. > Discover which products truly live up to the hype. Start reading now. > http://ads.osdn.com/?ad_id=3D6595&alloc_id=3D14396&op=3Dclick > _______________________________________________ > Quickfix-developers mailing list > Qui...@li... > https://lists.sourceforge.net/lists/listinfo/quickfix-developers >=20 <font face=3D"Times New Roman" size=3D"3"> <p>------------------------------------------------------------------------= ------</p> <p> This email is intended only for the use of the individual(s) to whom it= is addressed and may be privileged and confidential. Unauthorised use or d= isclosure is prohibited. If you receive this e-mail in error, please advise= immediately and delete the original message. This message may have been al= tered without your or our knowledge and the sender does not accept any liab= ility for any errors or omissions in the message.</p> <p>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D</p> </font> |
From: Bishop, B. <Bar...@gs...> - 2005-03-22 12:35:49
|
Hi Kristofer, Firstly, I'd like to say thanks very much for your hard work in finding this 'bug'. The later versions of quickfix are much more graceful when reconnecting, but it still remains an annoying problem. Secondly, I would totally agree to all of your comments. I would further add, that it has always struck me as madness that the Parser class contains a socket handle. It also contains an istream pointer, so I conclude that every time a new strategy is devised for collecting bytes to parse a new member will be added to Parser. Personally, I would have designed Parser to have a method for processing bytes and separate classes for collecting bundles of bytes to give to the Parser, whether from a socket, a stream, a file or whatever. I would also re-enforce your point about threading and sockets. For clearness of design and ease of partitioning work, I would absolutely always design components so that every socket has its own dedicated reading thread and very often a dedicated writing thread also. These would of course, always using blocking I/O. At the moment, I think these improvements would involve significant upheaval in the design of quickfix. Oren, is this kind of thing at all likely? Later in the year I might actually have time to work on some of these ideas, but in the mean time, thanks again. Barry Bishop -----Original Message----- From: kri...@rb... [mailto:kri...@rb...] Sent: Tuesday, March 22, 2005 12:06 PM To: or...@qu...; cal...@gm... Cc: qui...@li...; Bishop, Barry Subject: RE: [Quickfix-developers] Intermittent disconnects on Solaris Oren, Caleb - When debugging this problem, I was quite surprised by QuickFIX's socket handling. Despite the use of select(), non-blocking I/O is not actually used which seems to be why platform-specific ioctl's are used (to figure out how much you can read with out blocking). As for using recv()/read() == 0 to determine EOF, that fact that the actual socket reads and writes are scattered among various functions and methods of unrelated classes makes this quite harder to do. It also complicates error reporting (i.e. errno on failed reads/writes) and disconnection reasons that I've seen people rquesting on the list. There is a SocketConnection class which 'owns' a connected socket and there are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv() calls returns two calls to the syscall recv(), and neither one is in either SocketConnection or in Utility.cpp. FIX::Parser's readFromStream()'s method seems an especially dubious place to put recv(); the fact that FIX::Parser has a member varible for holding a file descriptor as well a pointer to an std::istream also strikes me as wrong. lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h Parser.cpp-144- Parser.cpp:145: size = recv( m_socket, m_readBuffer, m_bufferSize, 0 ); Parser.cpp-146- if ( size <= 0 ) throw RecvFailed(); -- ThreadedSocketConnection.cpp-96- buffer = new char[ bytes + 1 ]; ThreadedSocketConnection.cpp:97: int result = recv( m_socket, buffer, bytes, 0 ); ThreadedSocketConnection.cpp-98- if ( result <= 0 ) { throw std::exception(); } During my attempts to recreate the Solaris disconnect problem, I wrote sample QuickFIX apps that would send torrents of messages. It was quite easy to lock up two QuickFIX apps talking to each other during a reconnect, both sides would requests resends from each other and then the SocketInitiator and SocketAcceptor threads in the apps would block on send() when resending the messages to each other, as send() would not return until the other side called recv(), which it could not do until the other side called recv(), etc. The apps would then sleep forever. This problem can only be avoided by using non-block I/O or having one thread for reading and another for writing, neither of which SocketInitiator or SocketAcceptor do (I haven't looked at ThreadedSocketInitiator). Addressing this would be a fair amount of work, however the patch above has eliminated the disconnects which were causing us major headaches as they tended to happen at around market close. Regards, - Kris -----Original Message----- From: Oren Miller [mailto:or...@qu...] Sent: 21 March 2005 19:23 To: Caleb Epstein; Peterson, Kristofer Cc: qui...@li...; Bar...@gs... Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris Yeah, I agree. --oren ----- Original Message ----- From: "Caleb Epstein" <cal...@gm...> To: <kri...@rb...> Cc: <qui...@li...>; <Bar...@gs...> Sent: Monday, March 21, 2005 1:20 PM Subject: Re: [Quickfix-developers] Intermittent disconnects on Solaris > QuickFIX Documentation: > http://www.quickfixengine.org/quickfix/doc/html/index.html > QuickFIX FAQ: http://www.quickfixengine.org/wikifix/index.php?QuickFixFAQ > QuickFIX Support: http://www.quickfixengine.org/services.html > > On Mon, 21 Mar 2005 19:02:19 -0000, kri...@rb... > <kri...@rb...> wrote: > >> I have isolated the cause of the intermittent disconnects caused by >> QuickFIX on Solaris. The problem was due to the use of the I_NREAD ioctl >> to determine whether a readable socket was EOF or not. >> >> In certain circumstances which seem to involve high network traffic >> and >> low machine load, I_NREAD will return zero for a readable >> socket that >> actually has data. In such cases, QuickFIX would >> erroneously close the >> socket. >> >> I replaced the I_NREAD code in socket_disconnected() one byte recv() >> with >> the MSG_PEEK flag. This appears to have resolved this rather troublesome >> issue for us in production. >> > > Shouldn't QuickFIX just rely on recv returning 0 to detect a socket > disconnect, instead of relying on this ioctl? I've never seen a > socket-based application using this technique to detect disconnects > before. Clearly its not 100% reliable. > > -- > Caleb Epstein > caleb dot epstein at gmail dot com > > > ------------------------------------------------------- > SF email is sponsored by - The IT Product Guide > Read honest & candid reviews on hundreds of IT Products from real > users. Discover which products truly live up to the hype. Start > reading now. http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click > _______________________________________________ > Quickfix-developers mailing list > Qui...@li... > https://lists.sourceforge.net/lists/listinfo/quickfix-developers > <font face="Times New Roman" size="3"> <p>------------------------------------------------------------------------- -----</p> <p> This email is intended only for the use of the individual(s) to whom it is addressed and may be privileged and confidential. Unauthorised use or disclosure is prohibited. If you receive this e-mail in error, please advise immediately and delete the original message. This message may have been altered without your or our knowledge and the sender does not accept any liability for any errors or omissions in the message.</p> <p>====================================================</p> </font> |
From: Caleb E. <cal...@gm...> - 2005-03-22 13:35:43
Attachments:
quickfix-1.9.0-nonblocking-io.patch
|
On Tue, 22 Mar 2005 12:05:43 -0000, kri...@rb... <kri...@rb...> wrote: > Oren, Caleb - > > When debugging this problem, I was quite surprised by QuickFIX's socket handling. Despite the use of select(), non-blocking I/O is not actually used which seems to be why platform-specific ioctl's are used (to figure out how much you can read with out blocking). > > As for using recv()/read() == 0 to determine EOF, that fact that the actual socket reads and writes are scattered among various functions and methods of unrelated classes makes this quite harder to do. It also complicates error reporting (i.e. errno on failed reads/writes) and disconnection reasons that I've seen people rquesting on the list. > > There is a SocketConnection class which 'owns' a connected socket and there are functions in Utility.cpp for doing socket I/O, yet grep'ing for recv() calls returns two calls to the syscall recv(), and neither one is in either SocketConnection or in Utility.cpp. FIX::Parser's readFromStream()'s method seems an especially dubious place to put recv(); the fact that FIX::Parser has a member varible for holding a file descriptor as well a pointer to an std::istream also strikes me as wrong. I agree that the design of the connections, strategies, etc. is confusing and hard to follow. I also agree that the Parser should not have any knowledge of a file descriptor or iostream. It should just be handed char*s or std::strings. > lsdev2:~/work/quickfix/src/C++ $ grep -n -1 recv *.cpp *.h > Parser.cpp-144- > Parser.cpp:145: size = recv( m_socket, m_readBuffer, m_bufferSize, 0 ); > Parser.cpp-146- if ( size <= 0 ) throw RecvFailed(); > -- > ThreadedSocketConnection.cpp-96- buffer = new char[ bytes + 1 ]; > ThreadedSocketConnection.cpp:97: int result = recv( m_socket, buffer, bytes, 0 ); > ThreadedSocketConnection.cpp-98- if ( result <= 0 ) { throw std::exception(); } > > During my attempts to recreate the Solaris disconnect problem, I wrote sample QuickFIX apps that would send torrents of messages. It was quite easy to lock up two QuickFIX apps talking to each other during a reconnect, both sides would requests resends from each other and then the SocketInitiator and SocketAcceptor threads in the apps would block on send() when resending the messages to each other, as send() would not return until the other side called recv(), which it could not do until the other side called recv(), etc. The apps would then sleep forever. I have previously submitted a patch (attached here) that enables non-blocking I/O by using the MSG_DONTWAIT send/recv flag. It is lightly tested, and was made originally against 1.9.0, but I was able to clear up a lock-up type scenario as you describe with it. I'm not totally happy with the end-result: it won't work on systems that don't support MSG_DONTWAIT, and wedges yet another callback into the Socket Strategy interface (onWriteable). A better approach to the first problem would be to use the appropriate ioctl to set the sockets into non-blocking mode for all operations. As far as the Strategy interface goes, I think it is over-designed (look at all the empty callbacks!) and should probably be scrapped in favor of something simpler. > This problem can only be avoided by using non-block I/O or having one thread for reading and another for writing, neither of which SocketInitiator or SocketAcceptor do (I haven't looked at ThreadedSocketInitiator). I think the overall approach of using select and then calling send/recv as appropriate is the right one. I don't think the idea of another thread per-connection or possibly two is tenable. There is already one that is handing off strings to the Parser for processing (when using ThreadedSocketConnection). Adding more would be gratuitous. The down-side of using select is that you essentially lock yourself into a certain amount of latency: the timeval you pass to the select call. If the SocketMonitor is sitting in select and you have a new fd to add to the list (e.g. there is suddenly some data to send), you either have to wait for the select timeout to expire, or you add a special "signalling" fd to your fd_set that you write to when you want to force select to break out of its wait. Anyway, thanks for the analysis and insight Kristofer. I hope we can improve QuickFIX as a result of the points raised in this thread. -- Caleb Epstein caleb dot epstein at gmail dot com |
From: Oren M. <or...@qu...> - 2005-03-22 15:35:58
|
I know all of this is due for an overhaul. I would have actually prefered to use one of the libraries dedicated to socket handling, but a couple reasons prevented me from doing so: First licensing issues limited our choices (can't use GPL and distribute under our license), and second I felt it important that QF have as few third party dependencies as possible to make the build process simple. It was far from assured anyone would bother to download and build quickfix, much less have to build an equally complicated third party socket library. In order to encourage adoption, the basics had to be encapsulated into the main library. Well ok, that part of the plan worked well enough, and now we are paying the price for that decision. Fortunately unlike before we have the benefit of experienced socket developers to advise on how to correct the inexperienced design decisions. First thing to know is the Parser's use of an std::stream is an experiment gone bad. It's not actually used and should have been removed a long time ago. (At one point, the sockets were wrapped in a custom std::stream. So at the time that was intended to be the abstraction. In practice there were some problems with this and it was abandoned.) The only reason it still exists is so there is a way for the unit tests to pass data to the Parser. Obviously this reduces the value of the tests somewhat and brings glaring attention to the design flaw. Why was the recv call placed there? Probably a poor attempt to share code between the Socket and ThreadedSocket implementations. If we had ever gotten around to implementing any non-socket based transports, this would certainly have been addressed sooner. But this will be easy to fix. We basically need to pull the functionality out of the readFromStream call, and instead pass in a parameter to readFixMessage which will contain just the bytes to be processed. The other problems will be harder to address, although I believe some of you would likely benefit from the ThreadedSocket classes. These are what is currently recommended for high frequency lines, particularly if you are running several of them. I personally don't want to add threads to the standard Socket implementation as that pretty much defeats the existence of its purpose. It is intended for applications which do not want or need to deal with threads. When the SocketInitiator/Acceptor Combined with the block() or poll() call, QuickFIX can be run entirely single threaded. I think there is value in this. I also think there is plenty of precedence for this if done correctly. I think the naming of the Socket and ThreadedSocket implementations has caused confusion. Perhaps it would be better to just have classes named SocketInitiator/Acceptor and pass in the threading model through the constructor or configuration settings. What I think I would like to have happen is if we can come up with a plan to address the current shortcomings within the existing framework, do a release, and then focus on a more comprehensive overhaul. Comments? --oren |