Re: [Quickfix-developers] MsgSeqNum too high problem
Brought to you by:
orenmnero
|
From: Brian E. <azz...@ya...> - 2005-06-28 20:27:17
|
Alexey - I'm not a QuickFIX developer, but I've rooted around the code a little bit and about the only explanation I can think of that might explain your problem is if the sequence number being used by the acceptor is corrupt or the message offset table is broken. Looking at nextResendRequest(...) in Session.cpp, it looks like it may be possible to fall into a trap when an attempt to retrieve the "missing" messages fails. This may be because the message store is bad in some non-fatal way (read fails but doesn't throw an exception - looks like FileStore may do this if message number 'n' simply doesn't exist in the offsets table) or because the "beginning" sequence number is greater than the "end" sequence number that actually got stored (corrupt sequence number). This could happen if, in your scenario, message number "5" for some reason did not exist in the message store - the FileStore get() will stop on the first non-existent message number. Even if 6-8 exist (in the form of heartbeats), the get will return an empty set. In this case, because the main for loop is skipped (no messages to iterate thru), no messages are sent and, as a side-effect of this, no sequence number reset gets sent either (the code that calls a generateSequenceReset() call is within an if clause that won't be true - the "begin" variable never gets set [fails check #1] and endSeqNo and msgSeqNum are both 0 [fails check #2]). The acceptor is happy because it increments the initiator's "expected" sequence number regardless of whether any messages are sent in response to the resend request, so it will continue to send heartbeats and will not error out on additional sequence resets. This is all EXTREMELY seat-of-the-pants analysis - there may well be checks gonig on under the covers that I'm not aware of. It's an odd sort of scenario. - Brian Erst Thynk Software, Inc. --- Alexey Zubko <ale...@in...> wrote: > Hello Brian, > > Your fix prevents the disconnecting, but I still have a sequencing > problem. > > Actually I thing there are two problems - > 1. Acceptor sends nothing back. > 2. Initiator ignores incoming application messages. > > I've got the problem during debugging - terminating initiator's > process, > but it's easy to reproduce: > If to increase NextSenderMsgSeqNum number in acceptor's file, after > restart the initiator sends resend requests, the acceptor sends > nothing > back and ignores all incoming application messages (there is no > fromApp() call). > Both acceptor and initiator are QF (CVS version + your patch). > > > Initiator: > > 20050628-19:38:55 : Created session > 20050628-19:38:55 : Connecting to eng-server on port 57258 > 20050628-19:38:55 : Connection succeeded > 20050628-19:38:56 : Initiated logon request > 20050628-19:38:59 : Received logon response > 20050628-19:40:50 : Created session > 20050628-19:40:50 : Connecting to eng-server on port 57258 > 20050628-19:40:50 : Connection succeeded > 20050628-19:40:51 : Initiated logon request > 20050628-19:40:51 : Received logon response > 20050628-19:40:51 : MsgSeqNum too high, expecting 5 but received 7 > 20050628-19:40:51 : Sent ResendRequest FROM: 5 TO: 6 > 20050628-19:40:51 : MsgSeqNum too high, expecting 5 but received 8 > 20050628-19:40:51 : Sent ResendRequest FROM: 5 TO: 7 > 20050628-19:41:06 : MsgSeqNum too high, expecting 5 but received 9 > 20050628-19:41:06 : Sent ResendRequest FROM: 5 TO: 8 > ......... > > Acceptor: > > 20050628-19:38:48 : Created session > 20050628-19:38:56 : Received logon request > 20050628-19:38:56 : Responding to logon request > 20050628-19:39:41 : Socket Error > 20050628-19:39:41 : Disconnecting > 20050628-19:40:42 : Created session > 20050628-19:40:51 : Received logon request > 20050628-19:40:51 : Responding to logon request > 20050628-19:40:51 : Received ResendRequest FROM: 5 TO: 0 > 20050628-19:40:51 : Received ResendRequest FROM: 5 TO: 0 > 20050628-19:41:06 : Received ResendRequest FROM: 5 TO: 0 > 20050628-19:41:21 : Received ResendRequest FROM: 5 TO: 0 > 20050628-19:41:36 : Received ResendRequest FROM: 5 TO: 0 > 20050628-19:41:51 : Socket Error > 20050628-19:41:51 : Disconnecting > > > Regards, > Alexey Zubko > > Infinium Capital Corporation > (416) 360-7000 ext. 305 > > > > Brian Erst wrote: > > >Oren - > > > >I believe the following change to Session.cpp will fix the timeout > >problem when receiving out-of-sequence messages while awaiting a > >sequence reset/gap-fill: > > > >Move the following lines in Session::verify(...) > > > > UtcTimeStamp now; > > m_state.lastReceivedTime( now ); > > > >from their current position up to a spot just before the preceding > >if/else clause: > > > >-->>Place the code here<<-- > >if ( checkTooHigh && isTargetTooHigh( msgSeqNum ) ) > > > >This will increment the "lastReceivedTime" in the SessionState > object > >even when the sequence number is wrong. This appears to solve a > whole > >bunch of interrelated timeouts that could occur in this scenario > (test > >request, heartbeat, logon response, etc.) in one quick hit. Sequence > >too low is largely unaffected by this change, as it will cause a > >disconnect when hit, so that part of the code shouldn't need to be > >rearranged. > > > >It was somewhat unclear to me whether the following line: > > > > m_state.testRequest( 0 ); > > > >Also needed to be moved, but it seemed like it did not. > > > >- Brian Erst > >Thynk Software, Inc. > >--- Oren Miller <or...@qu...> wrote: > > > > > > > >>QuickFIX Documentation: > >>http://www.quickfixengine.org/quickfix/doc/html/index.html > >>QuickFIX FAQ: > >>http://www.quickfixengine.org/wikifix/index.php?QuickFixFAQ > >>QuickFIX Support: http://www.quickfixengine.org/services.html > >> > >>Seems to me this shouldn't be happening. I'm guessing that the > >>engine isn't > >>processing the message because the sequence number is too high, > hence > >>no > >>heartbeat is processed. I believe that out of sequence messages > >>should > >>probably count as a keep-alive even if their contents arn't > >>processed. > >> > >>--oren > >> > >>----- Original Message ----- > >>From: "Alexey Zubko" <ale...@in...> > >>To: <qui...@li...> > >>Sent: Monday, June 27, 2005 1:44 PM > >>Subject: [Quickfix-developers] MsgSeqNum too high problem > >> > >> > >> > >> > >>>QuickFIX Documentation: > >>>http://www.quickfixengine.org/quickfix/doc/html/index.html > >>>QuickFIX FAQ: > >>> > >>> > >>http://www.quickfixengine.org/wikifix/index.php?QuickFixFAQ > >> > >> > >>>QuickFIX Support: http://www.quickfixengine.org/services.html > >>> > >>>Hi, > >>> > >>>I'd like to ask if QF initiator acts correctly in the following > >>> > >>> > >>scenario. > >> > >> > >>>The initiator detects that seqnum of a server is too high and > sends > >>> > >>> > >>a > >> > >> > >>>resend message. > >>>The server doesn't resend requested messages, but there are > >>> > >>> > >>heartbeat > >> > >> > >>>messages. > >>>The initiator disconnects because of timeout on heartbeat. :-) > >>>Below is the initiator's log. > >>> > >>>20050627-17:52:50 : Created session > >>>20050627-17:52:51 : Connecting to eng-server on port 5725 > >>>20050627-17:52:51 : Connection succeeded > >>>20050627-17:52:52 : Initiated logon request > >>>20050627-17:53:15 : Received logon response > >>>20050627-17:53:15 : MsgSeqNum too high, expecting 13 but received > >>> > >>> > >>15 > >> > >> > >>>20050627-17:53:15 : Sent ResendRequest FROM: 13 TO: 14 > >>>20050627-17:53:15 : MsgSeqNum too high, expecting 13 but received > >>> > >>> > >>16 > >> > >> > >>>20050627-17:53:15 : Sent ResendRequest FROM: 13 TO: 15 > >>>20050627-17:53:15 : MsgSeqNum too high, expecting 13 but received > >>> > >>> > >>17 > >> > >> > >>>20050627-17:53:15 : Sent ResendRequest FROM: 13 TO: 16 > >>>20050627-17:53:22 : Sent test request TEST > >>>20050627-17:53:22 : MsgSeqNum too high, expecting 13 but received > >>> > === message truncated === |