Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast ack

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Allan,

> I cannot do what I did before as this is coming directly from the field or
> a customer using our product where we use tipc heavily.
> I just have these logs and no other information for now but its similar to
> the split brain mentioned and reproduced earlier:
>
> Mar 25 16:55:04 garfield kernel: TIPC: Activated (version 1.7.7-rc4
> compiled Mar  2 2011 09:10:04)
>
> Mar 25 16:55:04 garfield kernel: NET: Registered protocol family 30
>
> Mar 25 16:55:04 garfield kernel: TIPC: Started in single node mode
>
> Mar 25 16:55:05 garfield kernel: TIPC: Started in network mode
>
> Mar 25 16:55:05 garfield kernel: TIPC: Own node address <1.1.2>, network
> identity 1500
>
> Mar 25 16:55:05 garfield kernel: TIPC: Enabled bearer <eth:bond0>,
> discovery domain <1.1.0>, priority 10
>
> Mar 25 16:55:05 garfield kernel: TIPC: Established link
> <1.1.2:bond0-1.1.1:bond0> on network plane A
>
>
>
>
> Mar 25 16:55:05 odie kernel: TIPC: Resetting link
> <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing
>
> Mar 25 16:55:05 odie kernel: TIPC: Lost link <1.1.1:bond0-1.1.2:bond0> on
> network plane A
>
> Mar 25 16:55:05 odie kernel: TIPC: Lost contact with <1.1.2>
>
>
> It appears that the peer link which reset garfields link wasn't really in
> WORKING_UNKNOWN state as before since it appears that there sufficient delay
> before garfied was brought up (5-6 seconds). But I don't have any
> information for now. Just a heads up and symptoms similar to the one we have
> seen before with 1.7.7-rc4 ...
>
>
Confirmed that its the exact same split brain issue mentioned earlier but
caused by an immediate tipc module unload and reload within the link
tolerance time when the peer remains in WORKING_UNKNOWN.

Looking at your 1.7.7-rc4 which waited for the peer to acknowledge that it
went down with a LINK_PROTOCOL/RESET_MSG/ACTIVATE_MSG in tipc_recv_msg
before processing the packet for the peer node, why can't you reset the
WAIT_PEER_DOWN when you send the ACTIVATE in the first place for the RESET
from the peer.

The reason the state machine got beaten was because it got a RESET when the
peer wasn't detected to be down. (WORKING_UNKNOWN). And it was the reset
which triggered the ACTIVATE from the node to establish the peer in the
first place.

How about resetting the WAIT_PEER_DOWN for the node whenever you trigger an
ACTIVATE.
For the WORKING_UNKNOWN, RESET_MSG case, it would boil down to:

link_state_event:

 [...SNIPPED...]

case WORKING_UNKNOWN:

[...SNIPPED...]

        case RESET_MSG:

          info("Resetting link <%s>, requested by peer "
                 "while probing\n", l_ptr->name);
            tipc_link_reset(l_ptr);
            l_ptr->state = RESET_RESET;
            l_ptr->fsm_msg_cnt = 0;

            l_ptr->owner->block_setup &= ~WAIT_PEER_DOWN;  // ->NOTE

            tipc_link_send_proto_msg(l_ptr,
                         ACTIVATE_MSG, 0, 0, 0, 0, 0);

            l_ptr->fsm_msg_cnt++;
            link_set_timer(l_ptr, cont_intv);
            break;
-------------------------------------------------------

So the above would reset the nodes WAIT_PEER_DOWN bit while sending an
activate. Does it make sense to leave the WAIT_PEER_DOWN enabled while
sending an ACTIVATE msg to a peer sending a RESET ?

If the above isn't safe, then you guys have to anyway change the state
machine as you were discussing to avoid sending the ACTIVATE for a peers
RESET msg. ( could be unsafe and more risky)

Regards,
-Karthick

>   *From:* A.R Karthick [mailto:a.r...@gm...]
>> *Sent:* Thursday, March 31, 2011 11:29 AM
>> *To:* Stephens, Allan
>> *Cc:* jon...@er...; tipc-discussion
>>
>> *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier
>> "Ignore broadcast acknowledgements ..." patch
>>
>>
>>
>> Hi Allan
>>
>>  Thanks for the mail. Comments inline -
>>
>> On Wed, Mar 30, 2011 at 11:46 AM, Stephens, Allan <
>> all...@wi...> wrote:
>>
>> Hi Karthick:
>>
>>
>>
>> Unfortunately, I don’t think that the fix for this problem is as simple as
>> the one you’ve proposed. The check that you’ve modified was specifically
>> designed to ignore STATE messages because they could come from a previous
>> incarnation of the peer’s link endpoint, and we need to ensure these
>> messages continue to be ignored.
>>
>>
>> The ignoring of STATE_MSG check which seems to be 1.7.7-rc4 specific is
>> now causing us nightmares with 1.7.7-rc4, We have had very few issues with
>> 1.7.6. But because of some rare  bcast link congestion bugs, we have our
>> customers move to 1.7.7-rc4 which is now introducing split brain very
>> easily. In fact, it isn't production ready without the fix for this bug
>> considering that in the field, you could very easily hit a link snap
>> (bonding bearer down), and restore immediately. And this is causing the
>> split brain to occur with 1.7.7-rc4.
>>
>> What are the impacts of my patch for not ignoring the STATE_MSGs. You
>> mentioned that it could be from a previous incarnation of the peer's link
>> endpoint. Can you point me to the exact bug/test or commit log that it
>> addressed since this check in tipc_recv_msg seems to be newly introduced.
>>
>>
>>  I think the proper solution is to ensure that the node doesn’t send an
>> ACTIVATE message when it isn’t really prepared to bring the link up.
>> However, this is easier said than done. I thought at first that a simple
>> change to link_state_event() could be made to suppress the sending of
>> ACTIVATE messages if the associated node’s “blocked_setup” flag is set, but
>> I don’t think this will work because the current flow of control in TIPC 1.7
>> means that the flag won’t be set by node_lost_contact() until after
>> link_state_event() has finished processing the incoming RESET message. I’ll
>> have to think on this some more, and discuss the situation with Jon Maloy,
>> to see what can be done to correct this problem. It’s possible that some
>> significant reworking of TIPC’s link state machine may be needed; if so,
>> this will probably have to be done in TIPC 2.0.
>>
>>
>>
>>
>> I understand your predicament here for skipping the ACTIVATE pertaining to
>> a peer RESET_MSG but I request you guys to go through the commit again in
>> tipc_recv_msg which has made tipc 1.7.7-rc4 unstable or almost unusable in
>> the field causing split brains very easily.
>>
>> Regards,
>> -Karthick
>>
>>
>>  Regards,
>>
>> Al
>>
>>
>>
>> *From:* A.R Karthick [mailto:a.r...@gm...]
>> *Sent:* Friday, March 25, 2011 6:26 PM
>> *To:* Stephens, Allan; tipc-discussion
>>
>>
>> *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier
>> "Ignore broadcast acknowledgements ..." patch
>>
>>
>>
>> Hi Allan,
>>
>>
>> On Fri, Mar 25, 2011 at 11:21 AM, Stephens, Allan <
>> all...@wi...> wrote:
>>
>> Hi Karthick:
>>
>>
>>
>> Thanks for the update on the patch. Based on the feedback you’ve provided
>> I’m concluding the patch is sane and have pushed it to the TIPC 1.7 git
>> repository. (Peter Butler from PerfTech says the he will do some additional
>> testing with that patch when he returns from vacation in a week or so; I’m
>> assuming that he won’t turn up any new issues, but if he does we’ll deal
>> with them then.)
>>
>>
>>
>> I’ll look into your report of a rapid bearer disable/enable sequence
>> causing problems to see if it can be replicated. As you’ve already stated,
>> this isn’t really a major problem if it truly exists, but it’s still
>> something that we would want to fix (in TIPC 2.0, if not in TIPC 1.7.7).
>>
>>
>>
>> I now think that this issue ought to be fixed in 1.7.7-rc4. The split
>> brain in TIPC on a bearer disable and enable is extremely easy to reproduce.
>> It was first reproduced by a customer using our product that relies heavily
>> on tipc. But he isn't doing the bearer disable/enable step now for various
>> reasons in his script.
>>
>> However I just had about 1 hr at my disposal and wanted to try my hand at
>> reproducing the split brain in TIPC.
>> Turns out that it is extremely easy to reproduce.
>>
>> Have 2 nodes. with say eth0. established. Then set the link tolerance time
>> to say 10 seconds (just for the heck as it isn't required really)
>>
>> Then disable the bearer on one side:
>>
>> tipc-config -bd=eth:eth0
>>
>> The node on which you disable the bearer would get into RESET_UNKNOWN,
>> lose contact with the peer as expected. But the peer node would have hit a
>> TIMEOUT_EVT on the continuation interval (500 milliseconds is the MAX
>> apparently), and the link state in the peer node becomes WORKING_UNKNOWN.
>>
>> So far so good.
>>
>> Now within the link tolerance abort limit/time, enable the bearer back.
>>
>> tipc-config -be=eth:eth0
>>
>> Now the node, resets a RESET_MSG on enable bearer,  the peer in
>> WORKING_UNKNOWN state receives the RESET_MSG in link_state_event (reset
>> while probing case), resets the link (loses contact with the bearer disabled
>> node), changes state to RESET_RESET, and sends an ACTIVATE_MSG to the node
>> that enabled the bearer.
>>
>> The bearer enabled node then transitions from RESET_UNKNOWN to
>> WORKING_WORKING, establishes contact and sends the STATE_MSG to activate the
>> peer node in RESET_RESET state.
>>
>> Apparently when the peer node lost contact in WORKING_UNKNOWN state above,
>> it set the:
>>  n_ptr->block_setup = WAIT_PEER_DOWN | WAIT_NAMES_GONE ;
>> to prevent reestablishing contact with the peer till cleanup, schedules
>> the kernel tasklet to clear off the WAIT_NAMES_GONE bitmask in
>> node_name_purge_complete.
>>
>> Now with the n_ptr->block_setup at WAIT_PEER_DOWN state, the peer in
>> RESET_RESET link state now receives a STATE_MSG from the bearer enabled node
>> that got activated back.
>>
>> However this message is _discarded_ leading to a split brain with the
>> bearer enabled node establishing contact with the peer but the peer
>> apparently always remaining in lost contact or reset_reset state for the
>> link.
>>
>> tipc-config -n after bearer disable and enable on NODE0 shows NODE0 in
>> established contact whilst NODE1 is still in lost contact:
>>
>> [root@NODE0 <ro...@sb...>]# tipc-config -n
>>
>> Neighbors:
>>
>> <1.1.1>: up
>>
>>
>>
>> [root@NODE1 <ro...@sb...>]# tipc-config -n
>>
>> Neighbors:
>>
>> <1.1.2>: down
>>
>>
>>
>> Apparently, this code in tipc_recv_msg results in the STATE_MSG from the
>> bearer enabled node that went to WORKING_WORKING from RESET_UNKNOWN being
>> DROPPED and hence resulting in NODE1 not activating the link from
>> RESET_RESET state to WORKING_WORKING.
>>
>>        /* Verify that communication with node is currently allowed */
>>         if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
>>         (msg_user(msg) == LINK_PROTOCOL) &&
>>         (msg_type(msg) == RESET_MSG ||
>>              msg_type(msg) == ACTIVATE_MSG) &&
>>         !msg_redundant_link(msg))
>>             n_ptr->block_setup &= ~WAIT_PEER_DOWN;
>>
>>         if (n_ptr->block_setup) { -> ends up being DROPPED for STATE_MSG
>>             tipc_node_unlock(n_ptr);
>>            goto cont;
>>         }
>>
>> If I do the following change to clear the WAIT_PEER_DOWN mask for
>> STATE_MSG above, the split brain is resolved and NODE1 reestablishes contact
>> after the reset: In short, tipc seems to behave as expected on a bearer
>> disable/enable without resulting in a split brain :)
>>
>>        /* Verify that communication with node is currently allowed */
>>
>>         if ((n_ptr->block_setup & WAIT_PEER_DOWN) &&
>>         (msg_user(msg) == LINK_PROTOCOL) &&
>>         (msg_type(msg) == RESET_MSG ||
>>          msg_type(msg) == ACTIVATE_MSG ||
>>          msg_type(msg) == STATE_MSG) &&
>>         !msg_redundant_link(msg))
>>             n_ptr->block_setup &= ~WAIT_PEER_DOWN;
>>
>>         if (n_ptr->block_setup) {
>>             tipc_node_unlock(n_ptr);
>>             goto cont;
>>         }
>>
>> Now I don't know if the above fix is safe to be integrated into TIPC and
>> would cause regression for other cases.
>> And you are the best person to take a judgement call for the above fix
>> which resolves the split brain thats very easy to reproduce :)
>>
>> Regards,
>> -Karthick
>>
>>    *From:* A.R Karthick [mailto:a.r...@gm...]
>> *Sent:* Wednesday, March 23, 2011 6:25 PM
>>
>>
>> *To:* Stephens, Allan
>> *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier
>> "Ignore broadcast acknowledgements ..." patch
>>
>>
>>
>> Hi Allan,
>>
>> On Tue, Mar 22, 2011 at 10:23 AM, A.R Karthick <a.r...@gm...>
>> wrote:
>>
>>
>>
>> I meant the latest git. It doesn't apply clean even with that patch.
>> Already have that since it was released.
>>
>>
>> And I believe that is the cause of the panic in the first place :) I have
>> already asked our customer to revert my fail-safe patch which had worked
>> fine for him. And anyway he is testing your patch as I speak as I provided
>> him the one that applied clean. We will know in a day.
>>
>>
>> The patch seems to work. Atleast in 1 node he reported that he didn't see
>> the panic. So its good. However he reported another issue which isn't
>> serious for now as there is a workaround but it appears to expose a bug in
>> 1.7.7-rc4 or TIPC discovery which is resulting in split brain when a bearer
>> is disabled and immediately reenabled on 1 NODE (within the tipc link
>> tolerance time)
>>
>> I don't have much information on this but in short this was the state he
>> had in the logs: (the logs are confusing because of the timestamps which are
>> 1 hr. off and were also bounced).
>>
>> (2 machines connected via cross-over: sbx103  and sbx31 have ~1 hr gap
>> approximately):
>>
>> From sbx103:
>>
>> *Mar 23 14:23:27 sbx103 setupNP.sh: Failed to program NP, nptool returned
>> 255, rebooting the server to recover from error*
>>
>> *Mar 23 15:25:32* sbx103 kernel: imklog 3.18.6, log source = /proc/kmsg
>> started.
>>
>> Mar 23 15:25:32 sbx103 rsyslogd: [origin software="rsyslogd"
>> swVersion="3.18.6" x-pid="3573" x-info="http://www.rsyslog.com"] restart
>>
>>
>>
>> Mar 23 15:25:59 sbx103 kernel: [   65.972770] TIPC: Activated (version
>> 1.7.7-rc4 compiled Mar 18 2011 13:34:25)
>>
>> :
>>
>> *Mar 23 15:26:00 sbx103 kernel: [   68.909315] TIPC: Established link
>> <1.1.2:bond0-1.1.1:bond0> on network plane A*
>>
>>
>>
>> Mar 23 15:41:42 sbx103 kernel: [ 1580.418058] TIPC: Disabling bearer
>> <eth:bond0>
>>
>> Mar 23 15:41:42 sbx103 kernel: [ 1580.418064] TIPC: Lost link
>> <1.1.2:bond0-1.1.1:bond0> on network plane A
>>
>> Mar 23 15:41:42 sbx103 kernel: [ 1580.418148] TIPC: Lost contact with
>> <1.1.1>
>>
>> Mar 23 15:41:42 sbx103 kernel: [ 1580.426061] TIPC: Enabled bearer
>> <eth:bond0>, discovery domain <1.1.0>, priority 10
>>
>> Mar 23 15:41:42 sbx103 kernel: [ 1580.898375] TIPC: Established link
>> <1.1.2:bond0-1.1.1:bond0> on network plane A
>>
>> Mar 23 15:41:47 sbx103 kernel: [ 1588.012302] drbd0: user_isp( 0 -> 1 )
>>
>> Mar 23 15:41:47 sbx103 kernel: [ 1588.176513] SBX NRS module loaded.
>>
>> Mar 23 15:41:47 sbx103 kernel: [ 1588.180711] SBX TCP Redundancy module
>> loaded
>>
>> Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link
>> <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing
>>
>> Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Lost link
>> <1.1.2:bond0-1.1.1:bond0> on network plane A
>>
>> Mar 23 15:41:48 sbx103 kernel: [ 1588.432555] TIPC: Lost contact with
>> <1.1.1>
>>
>>
>>
>> :
>>
>> Mar 23 15:41:48 sbx103 kernel: [ 1588.428553] TIPC: Resetting link
>> <1.1.2:bond0-1.1.1:bond0>, requested by peer while probing
>>
>> :
>>
>> Mar 23 15:58:02 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Taking backup
>> in /opt/sonus/sbx/.configRestoreStaging/beforeRestore(CDB)...
>>
>> :
>>
>> Mar 23 15:59:58 sbx103 /opt/sonus/sbx/scripts/loadConfig.pl: Starting
>> Active...
>>
>> Mar 23 15:59:59 sbx103 kernel: [ 6016.485441] TIPC: Disabling bearer
>> <eth:bond0>
>>
>> Mar 23 15:59:59 sbx103 kernel: [ 6016.491483] TIPC: Enabled bearer
>> <eth:bond0>, discovery domain <1.1.0>, priority 10
>>
>> Mar 23 15:59:59 sbx103 kernel: [ 6016.892589] TIPC: Established link
>> <1.1.2:bond0-1.1.1:bond0> on network plane A
>>
>>
>>
>>
>>
>> On SBX31:
>>
>>
>>
>> Mar 23 10:53:44 sbx31 kernel: [    0.000000] Initializing cgroup subsys
>> cpuset
>>
>> :
>>
>> :
>>
>> Mar 23 14:18:54 sbx31 kernel: [54840.044003] SBX NRS module unloaded.
>>
>> Mar 23 14:18:54 sbx31 kernel: [54840.106331] SBX TCP Redundancy module
>> unloaded
>>
>> Mar 23 14:20:00 sbx31 kernel: [54921.563407] drbd0: peer_isp( 0 -> 1 )
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.890178] drbd0: peer( Primary ->
>> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> peer_isp( 1 -> 0 )
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.894180] drbd0: asender terminated
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.976223] drbd0: Terminating asender
>> thread
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.985769] drbd0: Connection closed
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.985774] drbd0: conn( NetworkFailure
>> -> Unconnected )
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: receiver terminated
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.986325] drbd0: Restarting receiver
>> thread
>>
>> Mar 23 14:23:32 sbx31 kernel: [55189.989776] drbd0: receiver (re)started
>>
>> Mar 23 14:23:32 sbx31 kernel: [55190.472951] drbd0: conn( Unconnected ->
>> WFConnection )
>>
>> *Mar 23 14:23:32 sbx31 kernel: [55190.998516] TIPC: Resetting link
>> <1.1.1:bond0-1.1.2:bond0>, peer not responding*
>>
>> Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost link
>> <1.1.1:bond0-1.1.2:bond0> on network plane A
>>
>> Mar 23 14:23:32 sbx31 kernel: [55190.998625] TIPC: Lost contact with
>> <1.1.2>
>>
>> :
>>
>> :
>>
>> *Mar 23 14:26:02 sbx31 kernel: [55382.915309] TIPC: Established link
>> <1.1.1:bond0-1.1.2:bond0> on network plane A*
>>
>> :
>>
>> :
>>
>> Mar 23 14:41:44 sbx31 kernel: [56909.655509] TIPC: Resetting link
>> <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing
>>
>> :
>>
>> :
>>
>> Mar 23 14:41:50 sbx31 kernel: [56919.533777] TIPC: Established link
>> <1.1.1:bond0-1.1.2:bond0> on network plane A
>>
>> :
>>
>> :
>>
>> Mar 23 15:00:01 sbx31 kernel: [61684.311304] TIPC: Resetting link
>> <1.1.1:bond0-1.1.2:bond0>, requested by peer while probing
>>
>>
>>
>>
>>
>> [ro...@sb... openclovis]# tipc-config -n
>>
>> Neighbors:
>>
>> <1.1.1>: up
>>
>>
>>
>> [ro...@sb... openclovis]# tipc-config -n
>>
>> Neighbors:
>>
>> <1.1.2>: down
>>
>> It looks as if the tipc link RESET_MSG on bearer disable and ACTIVATE_MSG
>> on bearer enable went out of order from sbx103 to sbx31.
>>
>> Since it isn't a blocker for us now as neither me or the customer has the
>> bandwidth to reproduce this and try out patches as of now, he is presently
>> not doing the bearer disable/enable step (which should avoid this issue)
>> that he apparently had introduced to work around a bcast congestion bug in
>> 1.7.6 which is resolved in 1.7.7-rc4 but ends up showing a split brain now
>> :)
>>
>> Regards,
>> -Karthick
>>
>>        *From:* A.R Karthick [mailto:a.r...@gm...]
>> *Sent:* Tuesday, March 22, 2011 12:41 PM
>> *To:* Stephens, Allan
>> *Subject:* Re: Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier
>> "Ignore broadcast acknowledgements ..." patch
>>
>>
>>
>> Hi Allan,
>>  Your patch doesn't apply clean on top of 1.7.7-rc4 (latest git).  You
>> seem to have some local changes which causes the discrepancy.
>> Anyway I have applied it manually for now. Find the one that applies clean
>> attached.
>>
>> Regards,
>> -Karthick
>>
>> On Mon, Mar 21, 2011 at 2:09 PM, Stephens, Allan <
>> all...@wi...> wrote:
>>
>> A.R. Karthick has recently identified an apparent problem with the
>> recently published TIPC 1.7 patch "Ignore broadcast acknowledgements that
>> are out-of-range". This email contains a supplementary patch that is
>> designed to correct the bug introduced by that patch.
>>
>> Any TIPC 1.7.7-rc4 user who has incorporated the defective patch and
>> wishes to test this corrective patch can apply it to their system directly;
>> there is no need to remove the previous patch. Feedback on this new patch --
>> either positive or negative -- is greatly appreciated. Once I've got some
>> assurance that the corrective patch really does what it's supposed to, and
>> doesn't introduce any new issues, I'll make it generally available in the
>> TIPC 1.7 git repository at SourceForge.
>>
>> Regards,
>> Al
>>
>> ...............
>>
>> This patch revises the checks that were added to TIPC's broadcast link by
>> the
>> patch "Ignore broadcast acknowledgements that are out-of-range" to
>> prevent a failed logic assertion from triggering a kernel BUG report.
>>
>> This revision ensures that any unsent messages in the broadcast link's
>> transmit queue are properly purged when contact with TIPC's only
>> neighboring node is lost, thereby restoring the clean up capability
>> that existed before the new broadcast link checks were added. The checks
>> are now performed only if TIPC is still in contact with the acknowledging
>> node. Conversely, if contact with the acknowledging node has been lost
>> the checks are unnecessary and the broadcast link determines which
>> messages in the transmit queue need to be cleaned up, based on whether
>> or not the unsent messages (if any) can still be delivered.
>>
>> Signed-off-by: Allan Stephens <all...@wi...>
>> ---
>>  net/tipc/tipc_bcast.c |   47
>> +++++++++++++++++++++++++----------------------
>>  1 files changed, 25 insertions(+), 22 deletions(-)
>>
>> diff --git a/net/tipc/tipc_bcast.c b/net/tipc/tipc_bcast.c
>> index c07fc4c..c53dcfb 100644
>> --- a/net/tipc/tipc_bcast.c
>> +++ b/net/tipc/tipc_bcast.c
>> @@ -218,32 +218,37 @@ void tipc_bclink_acknowledge(struct tipc_node
>> *n_ptr, u32 acked)
>>
>>        spin_lock_bh(&bc_lock);
>>
>> -       /*
>> -        * Invalid number indicates that all queued messages are
>> acknowledged
>> -        * (used when contact with specified node has been lost)
>> -        */
>> -
>> -       if (acked == INVALID_LINK_SEQ)
>> -               acked = bcl->fsm_msg_cnt;
>> -
>> -       /*
>> -        * Bail out if no unacknowledged messages in queue
>> -        * (either the node is acknowledging messages it has previously
>> ack'd
>> -        * or the ack value is invalid and should be ignored)
>> -        */
>> +       /* Bail out if tx queue is empty (no clean up is required) */
>>
>>        crs = bcl->first_out;
>>        if (!crs)
>>                goto exit;
>>
>> -       /*
>> -        * Validate sequence number to ensure it corresponds to a message
>> -        * that has been sent and not yet acknowledged
>> -        */
>> +       /* Determine which messages need to be acknowledged */
>>
>> -       if (less(acked, buf_seqno(crs)) || less(bcl->fsm_msg_cnt, acked)
>> ||
>> -           less_eq(acked, n_ptr->bclink.acked)) {
>> -               goto exit;
>> +       if (acked == INVALID_LINK_SEQ) {
>> +
>> +               /*
>> +                * Contact with specified node has been lost, so need to
>> +                * acknowledge sent messages only (if other nodes still
>> exist)
>> +                * or both sent and unsent messages (otherwise)
>> +                */
>> +
>> +               if (bclink->bcast_nodes.count)
>> +                       acked = bcl->fsm_msg_cnt;
>> +               else
>> +                       acked = bcl->next_out_no;
>> +       } else {
>> +
>> +               /*
>> +                * Bail out if specified sequence number does not
>> correspond
>> +                * to a message that has been sent and not yet
>> acknowledged
>> +                */
>> +
>> +               if (less(acked, buf_seqno(crs)) ||
>> +                   less(bcl->fsm_msg_cnt, acked) ||
>> +                   less_eq(acked, n_ptr->bclink.acked))
>> +                       goto exit;
>>        }
>>
>>        /* Skip over packets that node has previously acknowledged */
>> @@ -259,8 +264,6 @@ void tipc_bclink_acknowledge(struct tipc_node *n_ptr,
>> u32 acked)
>>
>>                if (crs != bcl->next_out)
>>                        bcbuf_decr_acks(crs);
>> -               else if (bclink->bcast_nodes.count)
>> -                       break;
>>                else {
>>                        bcbuf_set_acks(crs, 0);
>>                        bcl->next_out = next;
>> --
>> 1.7.1
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>>
>>
>>
>> --
>> Software is like sex: it's better when it's free.
>> -Linus Torvalds
>>
>> A.R.Karthick
>> http://github.com/karthick18
>>
>
>
>
> --
> Software is like sex: it's better when it's free.
> -Linus Torvalds
>
> A.R.Karthick
> http://github.com/karthick18
>
>

-- 
Software is like sex: it's better when it's free.
-Linus Torvalds

A.R.Karthick
http://github.com/karthick18

Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast ack

Cluster wide IPC providing datagram, connection, and bus messaging

Re: [tipc-discussion] Proposed TIPC 1.7.7-rc4 patch to correct bug in earlier "Ignore broadcast acknowledgements ..." patch