Enquire Link behavior

-]-[
2005-01-07
2013-05-21
  • -]-[
    2005-01-07

    Hi, happy new year!

    As far as I can tell, the documentation surrounding enquire_link does not state anything about an effect it would have on the connection status.  If there is no response to the query, nothing happens.  It is just a confidence check.

    The inactivity timer of oserl does not get reset on enquire_link messages.  This means that if there were no actual message traffic, the link would be dropped after the inactivity_time.  The default value of infinity means that the link is never dropped.

    There is a missing case here.  If the enquire_link messages stopped arriving, or no response came to them, and the inactivity_time was set to infinity, the dead link would be held open indefinitely.  Setting the inactivity time to, say, 5 minutes would mean that a live bind that still had enquire_link messages on it but no message traffic would get unbound.

    It would seem that what is needed is a modified behavior for the enquire_link_timer.  It should stop the session if no response is received to an enquire_link message in a reasonable time, or if a set number of enquire_link messages are not replied to.  Perhaps just check whether there are enquire_link messages in the requests queue when the enquire_link_timer expires, and react accordingly.

    At the moment the only way to prevent a dead link from being held open is to set the inactivity_timer to a period long enough that at least one data message will arrive within it.  For low volume links that need to be open continuously, this might be hard to predict.
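
    A minimal sketch of the "check the requests queue" idea, not oserl's actual code.  It assumes pending requests are kept in an ETS table of {SeqNum, CmdId, BrokerPid} tuples; the module and function names are hypothetical.  The command_id value 16#00000015 is the one the SMPP specification assigns to enquire_link.

    ```erlang
    -module(enquire_link_check).
    -export([has_pending_enquire_link/1]).

    %% SMPP command_id of enquire_link (0x00000015 in the SMPP specification).
    -define(COMMAND_ID_ENQUIRE_LINK, 16#00000015).

    %% Assumption: Requests is an ETS table holding one
    %% {SeqNum, CmdId, BrokerPid} tuple per request still awaiting a response.
    %% Returns true if at least one enquire_link is still unanswered.
    has_pending_enquire_link(Requests) ->
        Pattern = {'_', ?COMMAND_ID_ENQUIRE_LINK, '_'},
        ets:match_object(Requests, Pattern) =/= [].
    ```

    Calling this when the enquire_link_timer expires would tell the session whether the previous enquire_link was ever answered, without adding another timer.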

     
    • Happy new year,

      That's right, the inactivity timer does not get reset on enquire_link messages; otherwise the inactivity timer would never expire (if it is greater than the enquire_link timer, as it usually is).  Thus, if the inactivity timer is set to 5 minutes, and no other messages but enquire_links are issued, the session must be dropped.

      From the SMPP specification, I would also guess that the enquire_link operation shouldn't have any effect on the connection status.

      Well, now I've been thinking about the missing case you mention.  I'm not sure it would be appropriate to drop the session if the other peer doesn't respond to a number (what number?) of enquire_link messages.  Since this behaviour is not specified in the SMPP specification, it could be annoying for people not expecting it.

      I don't think we should drop the session if the other peer doesn't respond to our enquire_link messages but still holds the underlying TCP connection up.  If the connection gets closed, the session will be dropped by oserl; at least this is what is intended (and what I believe happens, doesn't it?).  But as long as the inactivity timer is running and the TCP connection is open, the session is held.  Why do you think we should change that?  Could I know more details on your particular problem?

      Maybe, instead of dropping the session, we could add the callback "handle_enquire_link_failure" to notify upper layers that the other peer did not respond to our enquire_link message.  Would that solve the situation? This new callback should be fairly easy to implement.  I'll add it to the TODO list for the next release if you find it useful.

      Best,

      Quique

       
    • -]-[
      2005-01-07

      The spec is indeed a bit unclear about how to handle the case where enquire_link does not create a response.

      What I have found on our live server is that the situation does occur where the peer stops responding to any messages, but the socket connection is still intact.  With the inactivity_time set to infinity, oserl keeps the session and socket connection open, as designed.

      To fix this, I changed the inactivity_time to some large number (10 minutes at present).  The problem is that during times of low activity, it might happen that there is no actual traffic for this amount of time.  This will cause the session to be terminated when that is not what I actually want.

      What I want is the following: inactivity_time should be infinity as before.  No disconnects should happen due to low volume traffic.  If the enquire_link queries indicate that the peer is no longer responsive, the session should be terminated.

      Your suggested handle_enquire_link_failure sounds like the thing that will solve the problem elegantly.  It will allow the esme (or smsc) implementer to implement her own counters or reconnection logic.  This seems to be what the spec intends.

      The question now is how to decide when an enquire_link has actually failed.  Perhaps it is reasonable to assume that it has failed if the enquire_link_timer has expired and there is an enquire_link still pending a response.  Otherwise an enquire_link_request_time needs to be specified and another timer added (enquire_link_request_timer?).

      The code that manages the connections in my system will then handle the enquire_link failures and can take appropriate action as it sees fit.

      Sounds like a very good idea to me :)

      -]-[einrich

       
      • Instead of introducing a new timer, I was thinking about using the response_timer. 

        Once the enquire_link is issued, if neither an enquire_link_resp nor any other valid operation PDU arrives before the response_timer elapses, then the handle_enquire_link_failure callback is triggered. Wouldn't that do the job?

        Thanks for your feedback Heinrich.

        Best regards,

        Quique

         
    • I have looked into hooking the enquire_link_faliure event into the request_broker, where the response timer handles its timeout.  The code is as follows:

      request_broker(undefined, CmdId, Time) ->
          receive
              {FsmRef, {response, RespId, Pdu}} ->
                  case operation:get_param(command_status, Pdu) of
                      ?ESME_ROK when RespId == ?COMMAND_ID_BIND_RECEIVER_RESP;
                                     RespId == ?COMMAND_ID_BIND_TRANSMITTER_RESP;
                                     RespId == ?COMMAND_ID_BIND_TRANSCEIVER_RESP;
                                     RespId == ?COMMAND_ID_UNBIND_RESP ->
                          gen_fsm:send_event(FsmRef, RespId);
                      ?ESME_ROK ->
                          ok;
                      Error ->
                          {error, Error}
                  end
          after Time ->
                  Error = operation:request_failure_code(CmdId),
                  {error, Error}
          end;

      The snag here is that in the 'after Time ->' clause there is no reference to the running FSM, so there is no way to launch an event at it.  The reference comes from the handle_input_correct_pdu function, which sends a message to the broker process:

      Broker ! {self(), {response, CmdId, Pdu}}

      If I understand the code correctly, the reason there are two request_broker functions is that the one with Caller=undefined does not launch events back at the FSM.  The problem is that we now want to change the code so that it does send an event back.

      Can you think of any other way than sending an event to the FSM to fire off a callback function?

       
      • Hello,

        The assumptions you made about the code are correct.  Having Caller=undefined means 'nobody' is waiting for the operation response, thus there's no need for a gen_fsm:reply(Caller, Reply) to be issued.  Internal enquire_link requests, sent after every enquire_link timeout, always have the Caller set to undefined.  The main reason for ignoring enquire_link responses is explained at the end of this message.

        First of all, the easiest way to ensure you also have the FSM reference in the 'after Time ->' clause is simply to pass this reference to the request_broker when you spawn it.  To do so you need to make a few small changes in gen_esme_session.erl and/or gen_smsc_session.erl:

        send_request(CmdId, ParamList, From, StateData) ->
            SeqNum = StateData#state.sequence_number + 1,
            send_pdu(StateData#state.socket, operation:new(CmdId, SeqNum, ParamList)),
            Time   = StateData#state.response_time,

            %%%%%%%%%%%%% Add the FsmRef %%%%%%%%%%%%%%%%%%%%
            FsmRef = self(),
            Broker = spawn_link(fun() -> request_broker(FsmRef, From, CmdId, Time) end),
            %%%%%%%%%%%%%

            ets:insert(StateData#state.requests, {SeqNum, CmdId, Broker}),
            {ok, StateData#state{sequence_number = SeqNum}}.

        and add this extra argument on both clauses of the request_broker:

        request_broker(FsmRef, undefined, CmdId, Time) ->
            receive
                {FsmRef, {response, RespId, Pdu}} ->
        ...

            end;
        request_broker(FsmRef, Caller, CmdId, Time) ->
            receive
                {FsmRef, {response, RespId, Pdu}} ->
        ...

            end.

        I think that should work, but please be aware of the following.  The request_broker waits until the counterpart response PDU arrives or the response timeout expires.  This means that a request_broker spawned on behalf of an ENQUIRE_LINK request would wait for up to RESPONSE_TIME for the ENQUIRE_LINK_RESP to arrive.  The problem here is that this ENQUIRE_LINK_RESP might never arrive, since the other peer, in response to an ENQUIRE_LINK, may send an ENQUIRE_LINK_RESP PDU or any other valid operation.  Let me see if I can explain myself:

        RESPONSE_TIMER = 30 secs.

        * 1st scenario (enquire_link failure)

          Time 0: (ESME) ENQUIRE_LINK

          Time +30: (ESME) No ENQUIRE_LINK_RESP received after
              RESPONSE_TIMEOUT -> enquire_link failure

        * 2nd scenario (enquire_link response received)

          Time 0: (ESME) ENQUIRE_LINK

          Time +5: (SMSC) ENQUIRE_LINK_RESP

        * 3rd scenario (the tricky one)

          Time 0: (ESME) ENQUIRE_LINK

          Time +10: (SMSC) DELIVER_SM

          Time +30: (ESME) No ENQUIRE_LINK_RESP received after
              RESPONSE_TIMEOUT -> Since we received a DELIVER_SM
              in the meanwhile, we are not under an enquire_link
              failure.
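
        The three scenarios can be written down as a small pure function, given here only as a sketch to make the rule explicit.  The module name, the event representation ({SecondsAfterEnquireLink, PduName} tuples) and the result atoms are all hypothetical; nothing here is taken from oserl.

        ```erlang
        -module(enquire_link_scenarios).
        -export([outcome/2]).

        %% Events: list of {SecondsAfterEnquireLink, PduName} received from the peer.
        %% ResponseTime: the response timer, in seconds.
        %% Any PDU that arrives within ResponseTime counts as proof of life.
        outcome(Events, ResponseTime) ->
            InTime = [Pdu || {T, Pdu} <- Events, T =< ResponseTime],
            case InTime of
                [] ->
                    enquire_link_failure;          % 1st scenario
                _ ->
                    case lists:member(enquire_link_resp, InTime) of
                        true  -> responded;        % 2nd scenario
                        false -> alive_without_resp % 3rd scenario
                    end
            end.
        ```

        For example, outcome([{10, deliver_sm}], 30) yields alive_without_resp, which is the case that must not be reported as a failure.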

        We have to think about the best way of avoiding these fake enquire_link failures.  Any ideas?  

        Right now enquire_link failures are ignored, and the session is dropped when the INACTIVITY_TIMER expires.

        Best regards,

        Quique

         
    • -]-[
      2005-02-18

      Just in case you were wondering, the post above was by me :)

       
      • I wonder why I suspected that? :-)

         
    • -]-[
      2005-02-21

      Not sure how to post files here, so I will just post my changes here.  Let me know if I should mail you the files.

      gen_esme_session.erl

      request_broker({undefined, FSM}, CmdId, Time) ->
          receive
              {FsmRef, {response, RespId, Pdu}} ->
                  case operation:get_param(command_status, Pdu) of
                      ?ESME_ROK when RespId == ?COMMAND_ID_BIND_RECEIVER_RESP;
                                     RespId == ?COMMAND_ID_BIND_TRANSMITTER_RESP;
                                     RespId == ?COMMAND_ID_BIND_TRANSCEIVER_RESP;
                                     RespId == ?COMMAND_ID_UNBIND_RESP ->
                          gen_fsm:send_event(FsmRef, RespId);
                      ?ESME_ROK ->
                          ok;
                      Error ->
                          {error, Error}
                  end
          after Time ->
                  Error = operation:request_failure_code(CmdId),
                  case CmdId of
                      ?COMMAND_ID_ENQUIRE_LINK ->
                        gen_fsm:send_all_state_event(FSM, enquire_link_faliure),
                        {error, Error};
                      _ ->
                        {error, Error}
                  end
          end;

      Two things changed here:
      A reference to the FSM is passed in as parameter.
      In the 'after Time' clause the time out of the enquire_link is handled specially.

      Unfortunately this means that all the internal calls to send_request need to be updated to pass in the reference to the FSM, for example:
      {ok, NewS} = send_request(?COMMAND_ID_ENQUIRE_LINK, [], {undefined,self()}, S),

      The extra event handler is

      handle_event(enquire_link_faliure, StateName, StateData) ->
        spawn_link(fun() -> handle_peer_enquire_link_faliure(StateData) end),
        {next_state, StateName, StateData};

      handle_peer_enquire_link_faliure(S) ->
        (S#state.mod):handle_enquire_link_faliure(S#state.esme).

      Obviously the handle_enquire_link_faliure function needs to be added to the behaviour_info.

      In gen_esme the only change is the addition of

      handle_enquire_link_faliure (ServerRef) ->
          gen_server:call(ServerRef, enquire_link_faliure, infinity).

      And

      handle_call(enquire_link_faliure, From, S) ->
          pack((S#state.mod):handle_enquire_link_faliure(S#state.mod_state), S);

      This must be added to the behaviour_info too.

      Finally your module can get the handle_enquire_link_faliure(State) callback and do whatever is appropriate.

      Is there a better way of doing this?

       
      • Hi Heinrich,

        That's exactly the approach I was thinking of.  Could you please e-mail me (mpquique_at_users.sourceforge.net) the files with the changes to save me the typing?  I'd appreciate that :)  As soon as you send them to me I will upload them into the CVS repository.

        Thanks a lot,

        Quique

         
    • Anders Nygren
      2005-03-01

      Hi
      I am new here and a little late to this discussion, but have traded some emails with Quique about other aspects of enquire_link, so I just thought that I should add my opinions here.

      - If there is no response to an enquire_link message within the defined time, the session MUST be dropped.
      The fact that the TCP connection is up does not mean anything, the peer is not responding and assumed dead.

      - When an enquire_link is received it must propagate ALL the way up to the application level. Not automatically responded to by gen_*_session. (See my previous point on why)

      I am not sure if You have resolved this issue yet, but here is my suggestion on how to handle the problem that an enquire_link may be responded to by any valid message, not just enquire_link_resp. (I haven't studied the code on how to do this yet, but here is my reasoning.)

      - Sending enquire_link is NOT a normal operation, we can only send one at a time. So it is possible to add state information in gen_esme_session to track it.

      - When we send enquire_link we can save the pid of the process spawned for sending it.

      - If enquire_link_resp is received the spawned process signals back to gen_*_session that enquire_link is OK.

      - If any other message is received we send cancel to the process that is waiting for the enquire_link_resp.
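
      The points above could be sketched roughly as follows.  This is only an illustration of the idea, not code from gen_esme_session: the module name, the function names and the {enquire_link, ok | failure} message sent back to the session are all assumptions.

      ```erlang
      -module(enquire_link_watch).
      -export([start/2, resp_received/1, cancel/1]).

      %% Spawns a watcher for the single outstanding enquire_link.
      %% Session is the pid to notify; ResponseTime is in milliseconds.
      %% The session should keep the returned pid in its state.
      start(Session, ResponseTime) ->
          spawn(fun() -> wait(Session, ResponseTime) end).

      %% Call when an enquire_link_resp arrives.
      resp_received(Watcher) ->
          Watcher ! enquire_link_resp,
          ok.

      %% Call when any other valid PDU arrives: the peer is alive,
      %% so the watcher is told to stop without reporting anything.
      cancel(Watcher) ->
          Watcher ! cancel,
          ok.

      wait(Session, ResponseTime) ->
          receive
              enquire_link_resp -> Session ! {enquire_link, ok};
              cancel            -> ok
          after ResponseTime ->
              Session ! {enquire_link, failure}
          end.
      ```

      With this shape the session only ever sees {enquire_link, failure} when nothing at all arrived from the peer within the response time.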

      /Anders Nygren

       
    • Please read:

      http://smsforum.net/smf/index.php?PHPSESSID=583e8d46bd1cc6ddc459e6082f8739ba;topic=1980.0

      The inactivity timer is there to solve the problem Heinrich mentions.

      Right now, as I said, the inactivity timer does not get reset on enquire_link responses.  This is wrong; it should be reset.  The default value won't be infinity anymore.  If the other peer stops responding to our enquire_links, the inactivity timer will drop the session on expiration.  Following this approach, the new callback we implemented is unnecessary.
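
      As a rough sketch of that behaviour (not the actual oserl change), the session could restart its inactivity timer whenever any PDU arrives from the peer, enquire_link_resp included.  The module and function names are hypothetical; erlang:start_timer/3 and erlang:cancel_timer/1 are the standard BIFs.

      ```erlang
      -module(inactivity_timer).
      -export([start/1, reset/2]).

      %% Starts the inactivity timer.  When it expires, the calling process
      %% receives {timeout, Ref, inactivity_timer} and should drop the session.
      start(Time) ->
          erlang:start_timer(Time, self(), inactivity_timer).

      %% Call on every PDU received from the peer, enquire_link_resp included.
      %% Cancels the running timer and starts a fresh one.
      %% Note: a timeout message already sitting in the mailbox is not flushed here.
      reset(Ref, Time) ->
          erlang:cancel_timer(Ref),
          start(Time).
      ```

      A peer that stops answering enquire_links then produces no PDUs at all, so the timer is never reset and eventually fires.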

      Quique

       
    • Anders Nygren
      2005-08-05

      Which callback do You mean?

      /Anders

       
    • handle_enquire_link_failure.  I think it is not needed if the inactivity timer works as stated in the smsforum discussion.

      Quique