#1127 IMM: Failure to send completed to PBE can cause cluster restart.

4.3.3
fixed
None
defect
imm
nd
4.0
major
2014-09-24
2014-09-23
No

This ticket is similar to #1096:

http://sourceforge.net/p/opensaf/tickets/1096/

The PBE detaches after having received the ccb-operations for a ccb but before
having received the completed-callback. In this case there are no OIs so
the completed-callback to PBE is to be sent directly when handling the apply
downcall from the user.

Detachment itself (of the PBE or any imm client) arrives over fevs, so that
is actually not the problem. The client node will only be removed in conjunction
with clearing of the implementer in ImmModel. Thus the return from ImmModel of
a non-null pbeConn means the client-node must exist. This is an "invariant",
i.e. an assertable condition.

The problem that does exist in immnd_evt_proc_ccb_apply is that the send
itself over MDS may fail, due to a race with a PBE going down. In that case
the code in immnd_evt_proc_ccb_apply will explicitly abort. Since this happens
on all nodes, the result is a cluster restart.

It is this abort() on send failure which is wrong. The other abort on client
node not found should be changed to an assert.

So the fix is to remove the abort on send failure and instead "drop" the
ccb apply to the recovery case, letting the apply result be resolved by the
PBE restart/recovery.
Indeed, it is conceivable that the PBE may have received the completed & commit
message even if the sending IMMND receives an error from MDS on the send.
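
The intended change in behaviour can be sketched roughly as follows. This is
only an illustration and not the actual OpenSAF code: the types and the
functions send_completed_to_pbe and handle_ccb_apply are hypothetical
stand-ins, and the MDS send is simulated by a flag. It shows the two points
made above: a missing client node is treated as an invariant violation
(assert), while a send failure is logged and the ccb is left for PBE
restart/recovery instead of aborting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the OpenSAF types or APIs. */
typedef enum { SEND_OK, SEND_FAILED } send_rc_t;
typedef enum { CCB_COMPLETED_SENT, CCB_LEFT_FOR_RECOVERY } apply_outcome_t;

typedef struct {
    unsigned int conn; /* connection id of the PBE client */
} client_node_t;

/* Simulated transport: the send fails when the PBE is going down. */
static send_rc_t send_completed_to_pbe(const client_node_t *cl,
                                       bool pbe_going_down)
{
    (void)cl;
    return pbe_going_down ? SEND_FAILED : SEND_OK;
}

/* Called when the model has returned a non-null PBE connection. */
static apply_outcome_t handle_ccb_apply(const client_node_t *pbe_client,
                                        bool pbe_going_down)
{
    /* Invariant: a non-null pbeConn implies the client node exists. */
    assert(pbe_client != NULL);

    if (send_completed_to_pbe(pbe_client, pbe_going_down) != SEND_OK) {
        /* Do not abort here. The PBE may or may not have received the
         * message, so the outcome is resolved by PBE restart/recovery. */
        fprintf(stderr,
                "send of completed to PBE (conn %u) failed, "
                "leaving ccb for recovery\n", pbe_client->conn);
        return CCB_LEFT_FOR_RECOVERY;
    }
    return CCB_COMPLETED_SENT;
}
```

With a client node such as `client_node_t cl = { 7u };`, calling
`handle_ccb_apply(&cl, false)` yields CCB_COMPLETED_SENT, and
`handle_ccb_apply(&cl, true)` yields CCB_LEFT_FOR_RECOVERY without
terminating the process.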

Related

Tickets: #1127
Wiki: ChangeLog-4.3.3
Wiki: ChangeLog-4.4.1

Discussion

  • Anders Bjornerstedt

    • summary: IMM: Detached PBE just before a ccb-apply can cause cluster restart. --> IMM: FAilure to send to completed to PBE can cause cluster restart.
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -3,12 +3,26 @@
      http://sourceforge.net/p/opensaf/tickets/1096/
    
     The PBE detaches after having received the ccb-operations for a ccb but before
    -having received the completed-callback. In this case there a re no OIs so
    +having received the completed-callback. In this case there are no OIs so
     the completed-callback to PBE is to be sent directly when handling the apply
     downcall from the user.
    
    -The problem is that if the PBE is detached here, the IMMNMDs will abort,
    -causing a cluster restart. 
    +Detachment itself (of the PBE or any imm client) arrives over fevs, so that
    +is actually not the problem. The client node will only be removed in conjuction
    +with clearing of the implementer in ImmModel. Thus the return from ImmModel of 
    +a non-null pbeConn means the client-node must exist. This is an "invariant"
    +i.e. an assertable condition. 
    
    -The IMMNDs must not abort in this case, they should simply let the apply be
    -handled by the PBE restart/recovery. 
    +The problem that *does* exist in immnd_evt_proc_ccb_apply is that the send 
    +itself over MDS may fail, due to a race with a PBE going down. In that case
    +the code in immnd_evt_proc_ccb_apply will explititly abort, which will happen
    +on all nodes, which will result in a cluster restart.
    +
    +It is this abort() on send failure which is wrong. The other abort on client
    +node not found should be changed to an assert.
    +
    +So the problem that needs to be fixed is to remove the abort on send failure
    +and instead "drop" the ccb apply to the recovery case, lettting the apply
    +result be resolved by the PBE restart/recovery.
    +Indeed, it is concewivable that the PBE may have received the completed&commit
    +message even if the sending IMMND receives an error from MDS on the send. 
    
     
  • Anders Bjornerstedt

    • summary: IMM: FAilure to send to completed to PBE can cause cluster restart. --> IMM: Failure to send completed to PBE can cause cluster restart.
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -22,7 +22,7 @@
     node not found should be changed to an assert.
    
     So the problem that needs to be fixed is to remove the abort on send failure
    -and instead "drop" the ccb apply to the recovery case, lettting the apply
    +and instead "drop" the ccb apply to the recovery case, letting the apply
     result be resolved by the PBE restart/recovery.
     Indeed, it is concewivable that the PBE may have received the completed&commit
     message even if the sending IMMND receives an error from MDS on the send. 
    
     
  • Anders Bjornerstedt

    • status: accepted --> review
     
  • Anders Bjornerstedt

    • status: review --> fixed
     
  • Anders Bjornerstedt

    changeset: 5933:bb53270bfe18
    tag: tip
    parent: 5929:468f7cf19611
    user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
    date: Wed Sep 24 15:48:12 2014 +0200
    summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

    changeset: 5932:2505c06b19ca
    branch: opensaf-4.5.x
    parent: 5928:3cd62e8831a7
    user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
    date: Wed Sep 24 15:48:12 2014 +0200
    summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

    changeset: 5931:3fff80ea7b42
    branch: opensaf-4.4.x
    parent: 5927:832244b78b65
    user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
    date: Wed Sep 24 15:52:11 2014 +0200
    summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

    changeset: 5930:214972614415
    branch: opensaf-4.3.x
    parent: 5926:72def88cf2f8
    user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
    date: Wed Sep 24 15:52:11 2014 +0200
    summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

     

    Related

    Tickets: #1127