Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast ack

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Karthick:

The commit that introduced the tipc_recv_msg() check can be found at http://tipc.git.sourceforge.net/git/gitweb.cgi?p=tipc/tipc;a=commitdiff;h=0e2824ace97f11685ab7869c17f621e0c4d2c5ca . (Alternatively, you can find it at http://tipc.git.sourceforge.net/git/gitweb.cgi?p=tipc/tipc;a=shortlog and look for the item labeled "tipc: Ensure both nodes recognize loss of contact between them"; the change went in November of 2010.

It sounds like the issue you've uncovered is more serious than you first indicated. If so, the need to resolve it is clearly greater than was indicated in your previous emails.

I'm a bit puzzled by your reference to a "link snap". We've previously been discussing a scenario in which the user has explicitly destroyed one side of a link by deleting the associated bearer, but you now seem to be talking about problems caused by a malfunction of the bonded interface (possibly in conjunction with broadcast link congestion); if so, this sounds like a different scenario in which both link endpoints continue to exist throughout the event, and the flow of messages between them may be different than in the deleted bearer case. If you can provide a description and/or analysis similar to the one you provided previously this would be a great help.

Regards,
Al

From: A.R Karthick [mailto:a.r...@gm...]
Sent: Thursday, March 31, 2011 11:29 AM
To: Stephens, Allan
Cc: jon...@er...; tipc-discussion
Subject: Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch

Hi Allan

 Thanks for the mail. Comments inline -
On Wed, Mar 30, 2011 at 11:46 AM, Stephens, Allan <all...@wi...<mailto:all...@wi...>> wrote:
Hi Karthick:

Unfortunately, I don't think that the fix for this problem is as simple as the one you've proposed. The check that you've modified was specifically designed to ignore STATE messages because they could come from a previous incarnation of the peer's link endpoint, and we need to ensure these messages continue to be ignored.

The ignoring of STATE_MSG check which seems to be 1.7.7-rc4 specific is now causing us nightmares with 1.7.7-rc4, We have had very few issues with 1.7.6. But because of some rare  bcast link congestion bugs, we have our customers move to 1.7.7-rc4 which is now introducing split brain very easily. In fact, it isn't production ready without the fix for this bug considering that in the field, you could very easily hit a link snap (bonding bearer down), and restore immediately. And this is causing the split brain to occur with 1.7.7-rc4.

What are the impacts of my patch for not ignoring the STATE_MSGs. You mentioned that it could be from a previous incarnation of the peer's link endpoint. Can you point me to the exact bug/test or commit log that it addressed since this check in tipc_recv_msg seems to be newly introduced.

I think the proper solution is to ensure that the node doesn't send an ACTIVATE message when it isn't really prepared to bring the link up. However, this is easier said than done. I thought at first that a simple change to link_state_event() could be made to suppress the sending of ACTIVATE messages if the associated node's "blocked_setup" flag is set, but I don't think this will work because the current flow of control in TIPC 1.7 means that the flag won't be set by node_lost_contact() until after link_state_event() has finished processing the incoming RESET message. I'll have to think on this some more, and discuss the situation with Jon Maloy, to see what can be done to correct this problem. It's possible that some significant reworking of TIPC's link state machine may be needed; if so, this will probably have to be done in TIPC 2.0.


I understand your predicament here for skipping the ACTIVATE pertaining to a peer RESET_MSG but I request you guys to go through the commit again in tipc_recv_msg which has made tipc 1.7.7-rc4 unstable or almost unusable in the field causing split brains very easily.

Regards,
-Karthick

Regards,
Al

From: A.R Karthick [mailto:a.r...@gm...<mailto:a.r...@gm...>]
Sent: Friday, March 25, 2011 6:26 PM
To: Stephens, Allan; tipc-discussion

Subject: Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch

Hi Allan,

On Fri, Mar 25, 2011 at 11:21 AM, Stephens, Allan <all...@wi...<mailto:all...@wi...>> wrote:
Hi Karthick:

Thanks for the update on the patch. Based on the feedback you've provided I'm concluding the patch is sane and have pushed it to the TIPC 1.7 git repository. (Peter Butler from PerfTech says the he will do some additional testing with that patch when he returns from vacation in a week or so; I'm assuming that he won't turn up any new issues, but if he does we'll deal with them then.)

I'll look into your report of a rapid bearer disable/enable sequence causing problems to see if it can be replicated. As you've already stated, this isn't really a major problem if it truly exists, but it's still something that we would want to fix (in TIPC 2.0, if not in TIPC 1.7.7).


I now think that this issue ought to be fixed in 1.7.7-rc4. The split brain in TIPC on a bearer disable and enable is extremely easy to reproduce. It was first reproduced by a customer using our product that relies heavily on tipc. But he isn't doing the bearer disable/enable step now for various reasons in his script.

However I just had about 1 hr at my disposal and wanted to try my hand at reproducing the split brain in TIPC.
Turns out that it is extremely easy to reproduce.

Have 2 nodes. with say eth0. established. Then set the link tolerance time to say 10 seconds (just for the heck as it isn't required really)

Then disable the bearer on one side:

tipc-config -bd=eth:eth0

The node on which you disable the bearer would get into RESET_UNKNOWN, lose contact with the peer as expected. But the peer node would have hit a TIMEOUT_EVT on the continuation interval (500 milliseconds is the MAX apparently), and the link state in the peer node becomes WORKING_UNKNOWN.

So far so good.

Now within the link tolerance abort limit/time, enable the bearer back.

tipc-config -be=eth:eth0

Now the node, resets a RESET_MSG on enable bearer,  the peer in WORKING_UNKNOWN state receives the RESET_MSG in link_state_event (reset while probing case), resets the link (loses contact with the bearer disabled node), changes state to RESET_RESET, and sends an ACTIVATE_MSG to the node that enabled the bearer.

The bearer enabled node then transitions from RESET_UNKNOWN to WORKING_WORKING, establishes contact and sends the STATE_MSG to activate the peer node in RESET_RESET state.

Apparently when the peer node lost contact in WORKING_UNKNOWN state above, it set the:
 n_ptr->block_setup = WAIT_PEER_DOWN | WAIT_NAMES_GONE ;
to prevent reestablishing contact with the peer till cleanup, schedules the kernel tasklet to clear off the WAIT_NAMES_GONE bitmask in node_name_purge_complete.

Now with the n_ptr->block_setup at WAIT_PEER_DOWN state, the peer in RESET_RESET link state now receives a STATE_MSG from the bearer enabled node that got activated back.

However this message is _discarded_ leading to a split brain with the bearer enabled node establishing contact with the peer but the peer apparently always remaining in lost contact or reset_reset state for the link.

tipc-config -n after bearer disable and enable on NODE0 shows NODE0 in established contact whilst NODE1 is still in lost contact:
[root@NODE0<mailto:ro...@sb...>]# tipc-config -n
Neighbors:
<1.1.1>: up

[root@NODE1<mailto:ro...@sb...>]# tipc-config -n
Neighbors:
<1.1.2>: down

Apparently, this code in tipc_recv_msg results in the STATE_MSG from the bearer enabled node that went to WORKING_WORKING from RESET_UNKNOWN being DROPPED and hence resulting in NODE1 not activating the link from RESET_RESET state to WORKING_WORKING.

       /* Verify that communication with node is currently allowed */
        if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
        (msg_user(msg) == LINK_PROTOCOL) &&
        (msg_type(msg) == RESET_MSG ||
             msg_type(msg) == ACTIVATE_MSG) &&
        !msg_redundant_link(msg))
            n_ptr->block_setup &= ~WAIT_PEER_DOWN;

        if (n_ptr->block_setup) { -> ends up being DROPPED for STATE_MSG
            tipc_node_unlock(n_ptr);
           goto cont;
        }

If I do the following change to clear the WAIT_PEER_DOWN mask for STATE_MSG above, the split brain is resolved and NODE1 reestablishes contact after the reset: In short, tipc seems to behave as expected on a bearer disable/enable without resulting in a split brain :)

       /* Verify that communication with node is currently allowed */

        if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
        (msg_user(msg) == LINK_PROTOCOL) &&
        (msg_type(msg) == RESET_MSG ||
         msg_type(msg) == ACTIVATE_MSG ||
         msg_type(msg) == STATE_MSG) &&
        !msg_redundant_link(msg))
            n_ptr->block_setup &= ~WAIT_PEER_DOWN;

        if (n_ptr->block_setup) {
            tipc_node_unlock(n_ptr);
            goto cont;
        }

Now I don't know if the above fix is safe to be integrated into TIPC and would cause regression for other cases.
And you are the best person to take a judgement call for the above fix which resolves the split brain thats very easy to reproduce :)

Regards,
-Karthick
From: A.R Karthick [mailto:a.r...@gm...<mailto:a.r...@gm...>]
Sent: Wednesday, March 23, 2011 6:25 PM

To: Stephens, Allan
Subject: Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch

Hi Allan,
On Tue, Mar 22, 2011 at 10:23 AM, A.R Karthick <a.r...@gm...<mailto:a.r...@gm...>> wrote:

I meant the latest git. It doesn't apply clean even with that patch. Already have that since it was released.

And I believe that is the cause of the panic in the first place :) I have already asked our customer to revert my fail-safe patch which had worked fine for him. And anyway he is testing your patch as I speak as I provided him the one that applied clean. We will know in a day.

The patch seems to work. Atleast in 1 node he reported that he didn't see the panic. So its good. However he reported another issue which isn't serious for now as there is a workaround but it appears to expose a bug in 1.7.7-rc4 or TIPC discovery which is resulting in split brain when a bearer is disabled and immediately reenabled on 1 NODE (within the tipc link tolerance time)

I don't have much information on this but in short this was the state he had in the logs: (the logs are confusing because of the timestamps which are 1 hr. off and were also bounced).

(2 machines connected via cross-over: sbx103  and sbx31 have ~1 hr gap approximately):
>From sbx103:
Mar 23 14:23:27 sbx103 setupNP.sh: Failed to program NP, nptool returned 255, rebooting the server to recover from error
Mar 23 15:25:32 sbx103 kernel: imklog 3.18.6, log source = /proc/kmsg started.
Mar 23 15:25:32 sbx103 rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="3573" x-info="http://www.rsyslog.com<http://www.rsyslog.com/>"] restart

Mar 23 15:25:59 sbx103 kernel: [   65.972770] TIPC: Activated (version 1.7.7-rc4 compiled Mar 18 2011 13:34:25)
:
Mar 23 15:26:00 sbx103 kernel: [   68.909315] TIPC: Established link <1.1.2:bond0-1.1.1:bond0> on network plane A

Mar 23 15:41:42 sbx103 kernel: [ 1580.418058] TIPC: Disabling bearer <eth:bond0>
Mar 23 15:41:42 sbx103 kernel: [ 1580.418064] TIPC: Lost link <1.1.2:bond0-1.1.1:bond0> on network plane A
Mar 23 15:41:42 sbx103 kernel: [ 1580.418148] TIPC: Lost contact with <1.1.1>
Mar 23 15:41:42 sbx103 kernel: [ 1580.426061] TIPC: Enabled bearer <eth:bond0>, discovery domain <1.1.0>, priority 10
Mar 23 15:41:42 sbx103 kernel: [ 1580.898375] TIPC: Established link <1.1.2:bond0-1.1.1:bond0> on network plane A
Mar 23 15:41:47 sbx103 kernel: [ 1588.012302] drbd0: user_isp( 0 -> 1 )
Mar 23 15:41:47 sbx103 kernel: [ 1588.176513] SBX NRS module loaded.
Mar 23 15:41:47 sbx103 kernel: [ 1588.180711] SBX TCP Redundancy module loaded
Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing
Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Lost link <1.1.2:bond0-1.1.1:bond0> on network plane A
Mar 23 15:41:48 sbx103 kernel: [ 1588.432555] TIPC: Lost contact with <1.1.1>

:
Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing
:
Mar 23 15:58:02 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Taking backup in /opt/sonus/sbx/.configRestoreStaging/beforeRestore(CDB)...
:
Mar 23 15:59:58 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Starting Active...
Mar 23 15:59:59 sbx103 kernel: [ 6016.485441<tel:6016.485441>] TIPC: Disabling bearer <eth:bond0>
Mar 23 15:59:59 sbx103 kernel: [ 6016.491483<tel:6016.491483>] TIPC: Enabled bearer <eth:bond0>, discovery domain <1.1.0>, priority 10
Mar 23 15:59:59 sbx103 kernel: [ 6016.892589<tel:6016.892589>] TIPC: Established link <1.1.2:bond0-1.1.1:bond0> on network plane A


On SBX31:

Mar 23 10:53:44 sbx31 kernel: [    0.000000] Initializing cgroup subsys cpuset
:
:
Mar 23 14:18:54 sbx31 kernel: [54840.044003] SBX NRS module unloaded.
Mar 23 14:18:54 sbx31 kernel: [54840.106331] SBX TCP Redundancy module unloaded
Mar 23 14:20:00 sbx31 kernel: [54921.563407] drbd0: peer_isp( 0 -> 1 )
Mar 23 14:23:32 sbx31 kernel: [55189.890178] drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
Mar 23 14:23:32 sbx31 kernel: [55189.894180] drbd0: asender terminated
Mar 23 14:23:32 sbx31 kernel: [55189.976223] drbd0: Terminating asender thread
Mar 23 14:23:32 sbx31 kernel: [55189.985769] drbd0: Connection closed
Mar 23 14:23:32 sbx31 kernel: [55189.985774] drbd0: conn( NetworkFailure -> Unconnected )
Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: receiver terminated
Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: Restarting receiver thread
Mar 23 14:23:32 sbx31 kernel: [55189.989776] drbd0: receiver (re)started
Mar 23 14:23:32 sbx31 kernel: [55190.472951] drbd0: conn( Unconnected -> WFConnection )
Mar 23 14:23:32 sbx31 kernel: [55190.998516] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, peer not responding
Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost link <1.1.1:bond0-1.1.2:bond0> on network plane A
Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost contact with <1.1.2>
:
:
Mar 23 14:26:02 sbx31 kernel: [55382.915309] TIPC: Established link <1.1.1:bond0-1.1.2:bond0> on network plane A
:
:
Mar 23 14:41:44 sbx31 kernel: [56909.655509] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing
:
:
Mar 23 14:41:50 sbx31 kernel: [56919.533777] TIPC: Established link <1.1.1:bond0-1.1.2:bond0> on network plane A
:
:
Mar 23 15:00:01 sbx31 kernel: [61684.311304] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing


[ro...@sb...<mailto:ro...@sb...> openclovis]# tipc-config -n
Neighbors:
<1.1.1>: up

[ro...@sb...<mailto:ro...@sb...> openclovis]# tipc-config -n
Neighbors:
<1.1.2>: down
It looks as if the tipc link RESET_MSG on bearer disable and ACTIVATE_MSG on bearer enable went out of order from sbx103 to sbx31.

Since it isn't a blocker for us now as neither me or the customer has the bandwidth to reproduce this and try out patches as of now, he is presently not doing the bearer disable/enable step (which should avoid this issue) that he apparently had introduced to work around a bcast congestion bug in 1.7.6 which is resolved in 1.7.7-rc4 but ends up showing a split brain now :)

Regards,
-Karthick

From: A.R Karthick [mailto:a.r...@gm...<mailto:a.r...@gm...>]
Sent: Tuesday, March 22, 2011 12:41 PM
To: Stephens, Allan
Subject: Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch

Hi Allan,
 Your patch doesn't apply clean on top of 1.7.7-rc4 (latest git).  You seem to have some local changes which causes the discrepancy.
Anyway I have applied it manually for now. Find the one that applies clean attached.

Regards,
-Karthick
On Mon, Mar 21, 2011 at 2:09 PM, Stephens, Allan <all...@wi...<mailto:all...@wi...>> wrote:
A.R. Karthick has recently identified an apparent problem with the recently published TIPC 1.7 patch "Ignore broadcast acknowledgements that are out-of-range". This email contains a supplementary patch that is designed to correct the bug introduced by that patch.

Any TIPC 1.7.7-rc4 user who has incorporated the defective patch and wishes to test this corrective patch can apply it to their system directly; there is no need to remove the previous patch. Feedback on this new patch -- either positive or negative -- is greatly appreciated. Once I've got some assurance that the corrective patch really does what it's supposed to, and doesn't introduce any new issues, I'll make it generally available in the TIPC 1.7 git repository at SourceForge.

Regards,
Al

...............

This patch revises the checks that were added to TIPC's broadcast link by the
patch "Ignore broadcast acknowledgements that are out-of-range" to
prevent a failed logic assertion from triggering a kernel BUG report.

This revision ensures that any unsent messages in the broadcast link's
transmit queue are properly purged when contact with TIPC's only
neighboring node is lost, thereby restoring the clean up capability
that existed before the new broadcast link checks were added. The checks
are now performed only if TIPC is still in contact with the acknowledging
node. Conversely, if contact with the acknowledging node has been lost
the checks are unnecessary and the broadcast link determines which
messages in the transmit queue need to be cleaned up, based on whether
or not the unsent messages (if any) can still be delivered.

Signed-off-by: Allan Stephens <all...@wi...<mailto:all...@wi...>>
---
 net/tipc/tipc_bcast.c |   47 +++++++++++++++++++++++++----------------------
 1 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/net/tipc/tipc_bcast.c b/net/tipc/tipc_bcast.c
index c07fc4c..c53dcfb 100644
--- a/net/tipc/tipc_bcast.c
+++ b/net/tipc/tipc_bcast.c
@@ -218,32 +218,37 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr, u32 acked)

       spin_lock_bh(&bc_lock);

-       /*
-        * Invalid number indicates that all queued messages are acknowledged
-        * (used when contact with specified node has been lost)
-        */
-
-       if (acked == INVALID_LINK_SEQ)
-               acked = bcl->fsm_msg_cnt;
-
-       /*
-        * Bail out if no unacknowledged messages in queue
-        * (either the node is acknowledging messages it has previously ack'd
-        * or the ack value is invalid and should be ignored)
-        */
+       /* Bail out if tx queue is empty (no clean up is required) */

       crs = bcl->first_out;
       if (!crs)
               goto exit;

-       /*
-        * Validate sequence number to ensure it corresponds to a message
-        * that has been sent and not yet acknowledged
-        */
+       /* Determine which messages need to be acknowledged */

-       if (less(acked, buf_seqno(crs)) || less(bcl->fsm_msg_cnt, acked) ||
-           less_eq(acked, n_ptr->bclink.acked)) {
-               goto exit;
+       if (acked == INVALID_LINK_SEQ) {
+
+               /*
+                * Contact with specified node has been lost, so need to
+                * acknowledge sent messages only (if other nodes still exist)
+                * or both sent and unsent messages (otherwise)
+                */
+
+               if (bclink->bcast_nodes.count)
+                       acked = bcl->fsm_msg_cnt;
+               else
+                       acked = bcl->next_out_no;
+       } else {
+
+               /*
+                * Bail out if specified sequence number does not correspond
+                * to a message that has been sent and not yet acknowledged
+                */
+
+               if (less(acked, buf_seqno(crs)) ||
+                   less(bcl->fsm_msg_cnt, acked) ||
+                   less_eq(acked, n_ptr->bclink.acked))
+                       goto exit;
       }

       /* Skip over packets that node has previously acknowledged */
@@ -259,8 +264,6 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr, u32 acked)

               if (crs != bcl->next_out)
                       bcbuf_decr_acks(crs);
-               else if (bclink->bcast_nodes.count)
-                       break;
               else {
                       bcbuf_set_acks(crs, 0);
                       bcl->next_out = next;
--
1.7.1



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



--
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18



Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast ack

Cluster wide IPC providing datagram, connection, and bus messaging

Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch