OpenSAF / Tickets / #1127 IMM: Failure to send completed to PBE can cause cluster restart.

summary: IMM: Detached PBE just before a ccb-apply can cause cluster restart. --> IMM: FAilure to send to completed to PBE can cause cluster restart.
Description has changed:

Diff:

--- old
+++ new
@@ -3,12 +3,26 @@
  http://sourceforge.net/p/opensaf/tickets/1096/

 The PBE detaches after having received the ccb-operations for a ccb but before
-having received the completed-callback. In this case there a re no OIs so
+having received the completed-callback. In this case there are no OIs so
 the completed-callback to PBE is to be sent directly when handling the apply
 downcall from the user.

-The problem is that if the PBE is detached here, the IMMNMDs will abort,
-causing a cluster restart. 
+Detachment itself (of the PBE or any imm client) arrives over fevs, so that
+is actually not the problem. The client node will only be removed in conjunction
+with clearing of the implementer in ImmModel. Thus the return from ImmModel of 
+a non-null pbeConn means the client-node must exist. This is an "invariant"
+i.e. an assertable condition. 

-The IMMNDs must not abort in this case, they should simply let the apply be
-handled by the PBE restart/recovery. 
+The problem that *does* exist in immnd_evt_proc_ccb_apply is that the send 
+itself over MDS may fail, due to a race with a PBE going down. In that case
+the code in immnd_evt_proc_ccb_apply will explicitly abort, which will happen
+on all nodes, which will result in a cluster restart.
+
+It is this abort() on send failure which is wrong. The other abort on client
+node not found should be changed to an assert.
+
+So the problem that needs to be fixed is to remove the abort on send failure
+and instead "drop" the ccb apply to the recovery case, lettting the apply
+result be resolved by the PBE restart/recovery.
+Indeed, it is conceivable that the PBE may have received the completed&commit
+message even if the sending IMMND receives an error from MDS on the send.

summary: IMM: FAilure to send to completed to PBE can cause cluster restart. --> IMM: Failure to send completed to PBE can cause cluster restart.
Description has changed:

Diff:

--- old
+++ new
@@ -22,7 +22,7 @@
 node not found should be changed to an assert.

 So the problem that needs to be fixed is to remove the abort on send failure
-and instead "drop" the ccb apply to the recovery case, lettting the apply
+and instead "drop" the ccb apply to the recovery case, letting the apply
 result be resolved by the PBE restart/recovery.
 Indeed, it is conceivable that the PBE may have received the completed&commit
 message even if the sending IMMND receives an error from MDS on the send.
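
The control flow proposed in the description can be sketched in C as follows. This is only an illustration under stated assumptions, not the actual change to immnd_evt_proc_ccb_apply: every type and function name below (client_node_t, send_fn_t, send_completed_to_pbe, the stub senders) is hypothetical and stands in for the real IMMND and MDS code. It shows the two points made above: a missing client node for a non-null pbeConn is treated as an asserted invariant, while a failed send is not fatal and leaves the ccb outcome to PBE restart/recovery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these are real OpenSAF types or calls. */
typedef enum { SEND_OK, SEND_FAILED } send_rc_t;
typedef enum { APPLY_SENT_TO_PBE, APPLY_LEFT_TO_RECOVERY } apply_outcome_t;

typedef struct {
    unsigned int conn; /* connection id of the PBE client, 0 means "none" */
} client_node_t;

/* Assumed shape of the transport call (MDS in the real system). */
typedef send_rc_t (*send_fn_t)(const client_node_t *pbe, unsigned int ccb_id);

/* Stub transports so the sketch can be exercised without a cluster. */
static send_rc_t stub_send_ok(const client_node_t *pbe, unsigned int ccb_id)
{
    (void)pbe;
    (void)ccb_id;
    return SEND_OK;
}

static send_rc_t stub_send_fail(const client_node_t *pbe, unsigned int ccb_id)
{
    (void)pbe;
    (void)ccb_id;
    return SEND_FAILED; /* simulates the race with a PBE going down */
}

static apply_outcome_t send_completed_to_pbe(const client_node_t *pbe_node,
                                             unsigned int pbe_conn,
                                             unsigned int ccb_id,
                                             send_fn_t send)
{
    /* Invariant from the description: a non-null pbeConn implies that the
     * client node still exists, so this is asserted, not handled. */
    assert(pbe_conn != 0);
    assert(pbe_node != NULL);

    if (send(pbe_node, ccb_id) != SEND_OK) {
        /* No abort here. The PBE may or may not have received the message,
         * so the outcome is left to PBE restart/recovery. */
        fprintf(stderr, "ccb %u: send of completed to PBE failed, "
                        "deferring to recovery\n", ccb_id);
        return APPLY_LEFT_TO_RECOVERY;
    }
    return APPLY_SENT_TO_PBE;
}
```

Calling send_completed_to_pbe with stub_send_fail returns APPLY_LEFT_TO_RECOVERY instead of terminating the process, which is the behavioural difference the ticket asks for: a send failure on one node no longer turns into an abort on all nodes.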

Anders Bjornerstedt - 2014-09-24

status: accepted --> review

Anders Bjornerstedt - 2014-09-24

http://sourceforge.net/p/opensaf/mailman/message/32863407/


Anders Bjornerstedt - 2014-09-24

status: review --> fixed

Anders Bjornerstedt - 2014-09-24

changeset: 5933:bb53270bfe18
tag: tip
parent: 5929:468f7cf19611
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Wed Sep 24 15:48:12 2014 +0200
summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

changeset: 5932:2505c06b19ca
branch: opensaf-4.5.x
parent: 5928:3cd62e8831a7
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Wed Sep 24 15:48:12 2014 +0200
summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

changeset: 5931:3fff80ea7b42
branch: opensaf-4.4.x
parent: 5927:832244b78b65
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Wed Sep 24 15:52:11 2014 +0200
summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]

changeset: 5930:214972614415
branch: opensaf-4.3.x
parent: 5926:72def88cf2f8
user: Anders Bjornerstedt anders.bjornerstedt@ericsson.com
date: Wed Sep 24 15:52:11 2014 +0200
summary: IMM: Failure to send completed to PBE defaulted to ccb-recovery [#1127]
