From: A.R K. <a.r...@gm...> - 2011-03-31 15:29:36
|
Hi Allan Thanks for the mail. Comments inline - On Wed, Mar 30, 2011 at 11:46 AM, Stephens, Allan < all...@wi...> wrote: > Hi Karthick: > > > Unfortunately, I don’t think that the fix for this problem is as simple as > the one you’ve proposed. The check that you’ve modified was specifically > designed to ignore STATE messages because they could come from a previous > incarnation of the peer’s link endpoint, and we need to ensure these > messages continue to be ignored. > The ignoring of STATE_MSG check which seems to be 1.7.7-rc4 specific is now causing us nightmares with 1.7.7-rc4, We have had very few issues with 1.7.6. But because of some rare bcast link congestion bugs, we have our customers move to 1.7.7-rc4 which is now introducing split brain very easily. In fact, it isn't production ready without the fix for this bug considering that in the field, you could very easily hit a link snap (bonding bearer down), and restore immediately. And this is causing the split brain to occur with 1.7.7-rc4. What are the impacts of my patch for not ignoring the STATE_MSGs. You mentioned that it could be from a previous incarnation of the peer's link endpoint. Can you point me to the exact bug/test or commit log that it addressed since this check in tipc_recv_msg seems to be newly introduced. > I think the proper solution is to ensure that the node doesn’t send an > ACTIVATE message when it isn’t really prepared to bring the link up. > However, this is easier said than done. I thought at first that a simple > change to link_state_event() could be made to suppress the sending of > ACTIVATE messages if the associated node’s “blocked_setup” flag is set, but > I don’t think this will work because the current flow of control in TIPC 1.7 > means that the flag won’t be set by node_lost_contact() until after > link_state_event() has finished processing the incoming RESET message. I’ll > have to think on this some more, and discuss the situation with Jon Maloy, > to see what can be done to correct this problem. It’s possible that some > significant reworking of TIPC’s link state machine may be needed; if so, > this will probably have to be done in TIPC 2.0. > > > I understand your predicament here for skipping the ACTIVATE pertaining to a peer RESET_MSG but I request you guys to go through the commit again in tipc_recv_msg which has made tipc 1.7.7-rc4 unstable or almost unusable in the field causing split brains very easily. Regards, -Karthick > Regards, > > Al > > > > *From:* A.R Karthick [mailto:a.r...@gm...] > *Sent:* Friday, March 25, 2011 6:26 PM > *To:* Stephens, Allan; tipc-discussion > > *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier > "Ignore broadcast acknowledgements ..." patch > > > > Hi Allan, > > > On Fri, Mar 25, 2011 at 11:21 AM, Stephens, Allan < > all...@wi...> wrote: > > Hi Karthick: > > > > Thanks for the update on the patch. Based on the feedback you’ve provided > I’m concluding the patch is sane and have pushed it to the TIPC 1.7 git > repository. (Peter Butler from PerfTech says the he will do some additional > testing with that patch when he returns from vacation in a week or so; I’m > assuming that he won’t turn up any new issues, but if he does we’ll deal > with them then.) > > > > I’ll look into your report of a rapid bearer disable/enable sequence > causing problems to see if it can be replicated. As you’ve already stated, > this isn’t really a major problem if it truly exists, but it’s still > something that we would want to fix (in TIPC 2.0, if not in TIPC 1.7.7). > > > > I now think that this issue ought to be fixed in 1.7.7-rc4. The split brain > in TIPC on a bearer disable and enable is extremely easy to reproduce. It > was first reproduced by a customer using our product that relies heavily on > tipc. But he isn't doing the bearer disable/enable step now for various > reasons in his script. > > However I just had about 1 hr at my disposal and wanted to try my hand at > reproducing the split brain in TIPC. > Turns out that it is extremely easy to reproduce. > > Have 2 nodes. with say eth0. established. Then set the link tolerance time > to say 10 seconds (just for the heck as it isn't required really) > > Then disable the bearer on one side: > > tipc-config -bd=eth:eth0 > > The node on which you disable the bearer would get into RESET_UNKNOWN, lose > contact with the peer as expected. But the peer node would have hit a > TIMEOUT_EVT on the continuation interval (500 milliseconds is the MAX > apparently), and the link state in the peer node becomes WORKING_UNKNOWN. > > So far so good. > > Now within the link tolerance abort limit/time, enable the bearer back. > > tipc-config -be=eth:eth0 > > Now the node, resets a RESET_MSG on enable bearer, the peer in > WORKING_UNKNOWN state receives the RESET_MSG in link_state_event (reset > while probing case), resets the link (loses contact with the bearer disabled > node), changes state to RESET_RESET, and sends an ACTIVATE_MSG to the node > that enabled the bearer. > > The bearer enabled node then transitions from RESET_UNKNOWN to > WORKING_WORKING, establishes contact and sends the STATE_MSG to activate the > peer node in RESET_RESET state. > > Apparently when the peer node lost contact in WORKING_UNKNOWN state above, > it set the: > n_ptr->block_setup = WAIT_PEER_DOWN | WAIT_NAMES_GONE ; > to prevent reestablishing contact with the peer till cleanup, schedules the > kernel tasklet to clear off the WAIT_NAMES_GONE bitmask in > node_name_purge_complete. > > Now with the n_ptr->block_setup at WAIT_PEER_DOWN state, the peer in > RESET_RESET link state now receives a STATE_MSG from the bearer enabled node > that got activated back. > > However this message is _discarded_ leading to a split brain with the > bearer enabled node establishing contact with the peer but the peer > apparently always remaining in lost contact or reset_reset state for the > link. > > tipc-config -n after bearer disable and enable on NODE0 shows NODE0 in > established contact whilst NODE1 is still in lost contact: > > [root@NODE0 <ro...@sb...>]# tipc-config -n > > Neighbors: > > <1.1.1>: up > > > > [root@NODE1 <ro...@sb...>]# tipc-config -n > > Neighbors: > > <1.1.2>: down > > > > Apparently, this code in tipc_recv_msg results in the STATE_MSG from the > bearer enabled node that went to WORKING_WORKING from RESET_UNKNOWN being > DROPPED and hence resulting in NODE1 not activating the link from > RESET_RESET state to WORKING_WORKING. > > /* Verify that communication with node is currently allowed */ > if ((n_ptr->block_setup & WAIT_PEER_DOWN) && > (msg_user(msg) == LINK_PROTOCOL) && > (msg_type(msg) == RESET_MSG || > msg_type(msg) == ACTIVATE_MSG) && > !msg_redundant_link(msg)) > n_ptr->block_setup &= ~WAIT_PEER_DOWN; > > if (n_ptr->block_setup) { -> ends up being DROPPED for STATE_MSG > tipc_node_unlock(n_ptr); > goto cont; > } > > If I do the following change to clear the WAIT_PEER_DOWN mask for STATE_MSG > above, the split brain is resolved and NODE1 reestablishes contact after the > reset: In short, tipc seems to behave as expected on a bearer disable/enable > without resulting in a split brain :) > > /* Verify that communication with node is currently allowed */ > > if ((n_ptr->block_setup & WAIT_PEER_DOWN) && > (msg_user(msg) == LINK_PROTOCOL) && > (msg_type(msg) == RESET_MSG || > msg_type(msg) == ACTIVATE_MSG || > msg_type(msg) == STATE_MSG) && > !msg_redundant_link(msg)) > n_ptr->block_setup &= ~WAIT_PEER_DOWN; > > if (n_ptr->block_setup) { > tipc_node_unlock(n_ptr); > goto cont; > } > > Now I don't know if the above fix is safe to be integrated into TIPC and > would cause regression for other cases. > And you are the best person to take a judgement call for the above fix > which resolves the split brain thats very easy to reproduce :) > > Regards, > -Karthick > > *From:* A.R Karthick [mailto:a.r...@gm...] > *Sent:* Wednesday, March 23, 2011 6:25 PM > > > *To:* Stephens, Allan > *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier > "Ignore broadcast acknowledgements ..." patch > > > > Hi Allan, > > On Tue, Mar 22, 2011 at 10:23 AM, A.R Karthick <a.r...@gm...> > wrote: > > > > I meant the latest git. It doesn't apply clean even with that patch. > Already have that since it was released. > > > And I believe that is the cause of the panic in the first place :) I have > already asked our customer to revert my fail-safe patch which had worked > fine for him. And anyway he is testing your patch as I speak as I provided > him the one that applied clean. We will know in a day. > > > The patch seems to work. Atleast in 1 node he reported that he didn't see > the panic. So its good. However he reported another issue which isn't > serious for now as there is a workaround but it appears to expose a bug in > 1.7.7-rc4 or TIPC discovery which is resulting in split brain when a bearer > is disabled and immediately reenabled on 1 NODE (within the tipc link > tolerance time) > > I don't have much information on this but in short this was the state he > had in the logs: (the logs are confusing because of the timestamps which are > 1 hr. off and were also bounced). > > (2 machines connected via cross-over: sbx103 and sbx31 have ~1 hr gap > approximately): > > From sbx103: > > *Mar 23 14:23:27 sbx103 setupNP.sh: Failed to program NP, nptool returned > 255, rebooting the server to recover from error* > > *Mar 23 15:25:32* sbx103 kernel: imklog 3.18.6, log source = /proc/kmsg > started. > > Mar 23 15:25:32 sbx103 rsyslogd: [origin software="rsyslogd" > swVersion="3.18.6" x-pid="3573" x-info="http://www.rsyslog.com"] restart > > > > Mar 23 15:25:59 sbx103 kernel: [ 65.972770] TIPC: Activated (version > 1.7.7-rc4 compiled Mar 18 2011 13:34:25) > > : > > *Mar 23 15:26:00 sbx103 kernel: [ 68.909315] TIPC: Established link > <1.1.2:bond0-1.1.1:bond0> on network plane A* > > > > Mar 23 15:41:42 sbx103 kernel: [ 1580.418058] TIPC: Disabling bearer > <eth:bond0> > > Mar 23 15:41:42 sbx103 kernel: [ 1580.418064] TIPC: Lost link > <1.1.2:bond0-1.1.1:bond0> on network plane A > > Mar 23 15:41:42 sbx103 kernel: [ 1580.418148] TIPC: Lost contact with > <1.1.1> > > Mar 23 15:41:42 sbx103 kernel: [ 1580.426061] TIPC: Enabled bearer > <eth:bond0>, discovery domain <1.1.0>, priority 10 > > Mar 23 15:41:42 sbx103 kernel: [ 1580.898375] TIPC: Established link > <1.1.2:bond0-1.1.1:bond0> on network plane A > > Mar 23 15:41:47 sbx103 kernel: [ 1588.012302] drbd0: user_isp( 0 -> 1 ) > > Mar 23 15:41:47 sbx103 kernel: [ 1588.176513] SBX NRS module loaded. > > Mar 23 15:41:47 sbx103 kernel: [ 1588.180711] SBX TCP Redundancy module > loaded > > Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link > <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing > > Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Lost link > <1.1.2:bond0-1.1.1:bond0> on network plane A > > Mar 23 15:41:48 sbx103 kernel: [ 1588.432555] TIPC: Lost contact with > <1.1.1> > > > > : > > Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link > <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing > > : > > Mar 23 15:58:02 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Taking backup > in /opt/sonus/sbx/.configRestoreStaging/beforeRestore(CDB)... > > : > > Mar 23 15:59:58 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Starting > Active... > > Mar 23 15:59:59 sbx103 kernel: [ 6016.485441] TIPC: Disabling bearer > <eth:bond0> > > Mar 23 15:59:59 sbx103 kernel: [ 6016.491483] TIPC: Enabled bearer > <eth:bond0>, discovery domain <1.1.0>, priority 10 > > Mar 23 15:59:59 sbx103 kernel: [ 6016.892589] TIPC: Established link > <1.1.2:bond0-1.1.1:bond0> on network plane A > > > > > > On SBX31: > > > > Mar 23 10:53:44 sbx31 kernel: [ 0.000000] Initializing cgroup subsys > cpuset > > : > > : > > Mar 23 14:18:54 sbx31 kernel: [54840.044003] SBX NRS module unloaded. > > Mar 23 14:18:54 sbx31 kernel: [54840.106331] SBX TCP Redundancy module > unloaded > > Mar 23 14:20:00 sbx31 kernel: [54921.563407] drbd0: peer_isp( 0 -> 1 ) > > Mar 23 14:23:32 sbx31 kernel: [55189.890178] drbd0: peer( Primary -> > Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) > peer_isp( 1 -> 0 ) > > Mar 23 14:23:32 sbx31 kernel: [55189.894180] drbd0: asender terminated > > Mar 23 14:23:32 sbx31 kernel: [55189.976223] drbd0: Terminating asender > thread > > Mar 23 14:23:32 sbx31 kernel: [55189.985769] drbd0: Connection closed > > Mar 23 14:23:32 sbx31 kernel: [55189.985774] drbd0: conn( NetworkFailure -> > Unconnected ) > > Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: receiver terminated > > Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: Restarting receiver > thread > > Mar 23 14:23:32 sbx31 kernel: [55189.989776] drbd0: receiver (re)started > > Mar 23 14:23:32 sbx31 kernel: [55190.472951] drbd0: conn( Unconnected -> > WFConnection ) > > *Mar 23 14:23:32 sbx31 kernel: [55190.998516] TIPC: Resetting link > <1.1.1:bond0-1.1.2:bond0>, peer not responding* > > Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost link > <1.1.1:bond0-1.1.2:bond0> on network plane A > > Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost contact with > <1.1.2> > > : > > : > > *Mar 23 14:26:02 sbx31 kernel: [55382.915309] TIPC: Established link > <1.1.1:bond0-1.1.2:bond0> on network plane A* > > : > > : > > Mar 23 14:41:44 sbx31 kernel: [56909.655509] TIPC: Resetting link > <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing > > : > > : > > Mar 23 14:41:50 sbx31 kernel: [56919.533777] TIPC: Established link > <1.1.1:bond0-1.1.2:bond0> on network plane A > > : > > : > > Mar 23 15:00:01 sbx31 kernel: [61684.311304] TIPC: Resetting link > <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing > > > > > > [ro...@sb... openclovis]# tipc-config -n > > Neighbors: > > <1.1.1>: up > > > > [ro...@sb... openclovis]# tipc-config -n > > Neighbors: > > <1.1.2>: down > > It looks as if the tipc link RESET_MSG on bearer disable and ACTIVATE_MSG > on bearer enable went out of order from sbx103 to sbx31. > > Since it isn't a blocker for us now as neither me or the customer has the > bandwidth to reproduce this and try out patches as of now, he is presently > not doing the bearer disable/enable step (which should avoid this issue) > that he apparently had introduced to work around a bcast congestion bug in > 1.7.6 which is resolved in 1.7.7-rc4 but ends up showing a split brain now > :) > > Regards, > -Karthick > > > *From:* A.R Karthick [mailto:a.r...@gm...] > *Sent:* Tuesday, March 22, 2011 12:41 PM > *To:* Stephens, Allan > *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier > "Ignore broadcast acknowledgements ..." patch > > > > Hi Allan, > Your patch doesn't apply clean on top of 1.7.7-rc4 (latest git). You seem > to have some local changes which causes the discrepancy. > Anyway I have applied it manually for now. Find the one that applies clean > attached. > > Regards, > -Karthick > > On Mon, Mar 21, 2011 at 2:09 PM, Stephens, Allan < > all...@wi...> wrote: > > A.R. Karthick has recently identified an apparent problem with the recently > published TIPC 1.7 patch "Ignore broadcast acknowledgements that are > out-of-range". This email contains a supplementary patch that is designed to > correct the bug introduced by that patch. > > Any TIPC 1.7.7-rc4 user who has incorporated the defective patch and wishes > to test this corrective patch can apply it to their system directly; there > is no need to remove the previous patch. Feedback on this new patch -- > either positive or negative -- is greatly appreciated. Once I've got some > assurance that the corrective patch really does what it's supposed to, and > doesn't introduce any new issues, I'll make it generally available in the > TIPC 1.7 git repository at SourceForge. > > Regards, > Al > > ............... > > This patch revises the checks that were added to TIPC's broadcast link by > the > patch "Ignore broadcast acknowledgements that are out-of-range" to > prevent a failed logic assertion from triggering a kernel BUG report. > > This revision ensures that any unsent messages in the broadcast link's > transmit queue are properly purged when contact with TIPC's only > neighboring node is lost, thereby restoring the clean up capability > that existed before the new broadcast link checks were added. The checks > are now performed only if TIPC is still in contact with the acknowledging > node. Conversely, if contact with the acknowledging node has been lost > the checks are unnecessary and the broadcast link determines which > messages in the transmit queue need to be cleaned up, based on whether > or not the unsent messages (if any) can still be delivered. > > Signed-off-by: Allan Stephens <all...@wi...> > --- > net/tipc/tipc_bcast.c | 47 > +++++++++++++++++++++++++---------------------- > 1 files changed, 25 insertions(+), 22 deletions(-) > > diff --git a/net/tipc/tipc_bcast.c b/net/tipc/tipc_bcast.c > index c07fc4c..c53dcfb 100644 > --- a/net/tipc/tipc_bcast.c > +++ b/net/tipc/tipc_bcast.c > @@ -218,32 +218,37 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr, > u32 acked) > > spin_lock_bh(&bc_lock); > > - /* > - * Invalid number indicates that all queued messages are > acknowledged > - * (used when contact with specified node has been lost) > - */ > - > - if (acked == INVALID_LINK_SEQ) > - acked = bcl->fsm_msg_cnt; > - > - /* > - * Bail out if no unacknowledged messages in queue > - * (either the node is acknowledging messages it has previously > ack'd > - * or the ack value is invalid and should be ignored) > - */ > + /* Bail out if tx queue is empty (no clean up is required) */ > > crs = bcl->first_out; > if (!crs) > goto exit; > > - /* > - * Validate sequence number to ensure it corresponds to a message > - * that has been sent and not yet acknowledged > - */ > + /* Determine which messages need to be acknowledged */ > > - if (less(acked, buf_seqno(crs)) || less(bcl->fsm_msg_cnt, acked) || > - less_eq(acked, n_ptr->bclink.acked)) { > - goto exit; > + if (acked == INVALID_LINK_SEQ) { > + > + /* > + * Contact with specified node has been lost, so need to > + * acknowledge sent messages only (if other nodes still > exist) > + * or both sent and unsent messages (otherwise) > + */ > + > + if (bclink->bcast_nodes.count) > + acked = bcl->fsm_msg_cnt; > + else > + acked = bcl->next_out_no; > + } else { > + > + /* > + * Bail out if specified sequence number does not > correspond > + * to a message that has been sent and not yet acknowledged > + */ > + > + if (less(acked, buf_seqno(crs)) || > + less(bcl->fsm_msg_cnt, acked) || > + less_eq(acked, n_ptr->bclink.acked)) > + goto exit; > } > > /* Skip over packets that node has previously acknowledged */ > @@ -259,8 +264,6 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr, > u32 acked) > > if (crs != bcl->next_out) > bcbuf_decr_acks(crs); > - else if (bclink->bcast_nodes.count) > - break; > else { > bcbuf_set_acks(crs, 0); > bcl->next_out = next; > -- > 1.7.1 > > > > > -- > Software is like sex: it's better when it's free. > -Linus Torvalds > > A.R.Karthick > http://github.com/karthick18 > > > > > -- > Software is like sex: it's better when it's free. > -Linus Torvalds > > A.R.Karthick > http://github.com/karthick18 > > > > > -- > Software is like sex: it's better when it's free. > -Linus Torvalds > > A.R.Karthick > http://github.com/karthick18 > > > > > -- > Software is like sex: it's better when it's free. > -Linus Torvalds > > A.R.Karthick > http://github.com/karthick18 > > > > > -- > Software is like sex: it's better when it's free. > -Linus Torvalds > > A.R.Karthick > http://github.com/karthick18 > -- Software is like sex: it's better when it's free. -Linus Torvalds A.R.Karthick http://github.com/karthick18 |