From: Jon M. <jm...@re...> - 2020-11-20 17:25:34
Hi Howard,

This is the code executed when TIPC receives a NETDEV_CHANGE event:

    switch (evt) {
    case NETDEV_CHANGE:
            if (netif_carrier_ok(dev) && netif_oper_up(dev)) {
                    test_and_set_bit_lock(0, &b->up);
                    break;
            }
            fallthrough;
    case NETDEV_GOING_DOWN:
            clear_bit_unlock(0, &b->up);
            tipc_reset_bearer(net, b);
            break;
    case NETDEV_UP:
            test_and_set_bit_lock(0, &b->up);
            break;
    case NETDEV_CHANGEMTU:

So, unless the bond interface really reports that it is going down, TIPC
doesn't reset any links. And if it *does* report that it is going down,
what else can we do? To me this looks more like a problem with the bond
device than with TIPC, but of course we may have misunderstood its
expected behavior. We will look into this.

On a different note, you could omit the bond interface and instead use
dual TIPC links, which work in active-active mode and give better
performance. Is that an option for you?

BR
Jon Maloy

On 11/19/20 11:36 PM, Howard Finer wrote:
> I am trying to use TIPC (kernel version 4.19) over a bond device that
> is configured for active-backup with ARP monitoring of the slaves. If a
> slave goes down, TIPC receives a netdev_change event during the
> timeframe in which the bond device is working towards bringing up the
> new slave. This causes TIPC to disable the bearer, which in turn causes
> a temporary loss of communication between the nodes.
>
> Instrumentation of the bond and tipc drivers shows the following:
>
> <6> 1 2020-11-19T23:58:33.111549+01:00 LABNBS5A kernel - - - [  153.655776]
> Enabled bearer <eth:bond0>, priority 10
> <6> 1 2020-11-20T00:07:58.544040+01:00 LABNBS5A kernel - - - [  718.799259]
> bond0: bond_ab_arp_commit: BOND_LINK_DOWN: link status definitely down for
> interface eth1, disabling it
> <6> 1 2020-11-20T00:07:58.544063+01:00 LABNBS5A kernel - - - [  718.799261]
> bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call
> select_active_slave
> <6> 1 2020-11-20T00:07:58.544069+01:00 LABNBS5A kernel - - - [  718.799263]
> bond0: bond_select_active_slave: bond_find_best_slave returned NULL
> <6> 1 2020-11-20T00:07:58.544072+01:00 LABNBS5A kernel - - - [  718.799347]
> bond0: bond_select_active_slave: now running without any active interface!
> <6> 1 2020-11-20T00:07:58.544080+01:00 LABNBS5A kernel - - - [  718.799349]
> bond0: bond_ab_arp_commit: do_failover, returned from select_active_slave
> and unblock netpoll tx
> <6> 1 2020-11-20T00:07:58.544081+01:00 LABNBS5A kernel - - - [  718.799611]
> Resetting bearer <eth:bond0>
> <6> 1 2020-11-20T00:07:58.655535+01:00 LABNBS5A kernel - - - [  718.907245]
> bond0: bond_ab_arp_commit: BOND_LINK_UP: link status definitely up for
> interface eth0
> <6> 1 2020-11-20T00:07:58.655545+01:00 LABNBS5A kernel - - - [  718.907247]
> bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call
> select_active_slave
> <6> 1 2020-11-20T00:07:58.655548+01:00 LABNBS5A kernel - - - [  718.907248]
> bond0: bond_select_active_slave: bond_find_best_slave returned slave eth0
> <6> 1 2020-11-20T00:07:58.655559+01:00 LABNBS5A kernel - - - [  718.907249]
> bond0: making interface eth0 the new active one
> <6> 1 2020-11-20T00:07:58.655562+01:00 LABNBS5A kernel - - - [  718.907560]
> bond0: bond_select_active_slave: first active interface up!
>
> With ARP-based monitoring only one slave will be 'up'.
> When the active slave goes down, the other slave needs to be brought
> up. During that timeframe we see TIPC resetting the bearer. That
> defeats the entire purpose of using the bond device.
>
> It seems that the handling of the netdev_change event for an
> active-backup bond device is not correct. It needs to leave the bearer
> intact so that when the backup slave is brought up, communication is
> properly restored without any upper-layer applications being aware that
> something happened at the lower level.
>
> Thanks,
> Howard
>
> _______________________________________________
> tipc-discussion mailing list
> tip...@li...
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
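
[For the archive: a rough sketch of the dual-link setup Jon suggests,
using the iproute2 `tipc` tool. The device names eth0/eth1 are
assumptions; substitute whatever your two NICs are called.]

```shell
# Instead of enslaving eth0 and eth1 to bond0, enable a TIPC bearer on
# each NIC directly. TIPC then establishes one link per bearer to each
# peer node and runs them active-active, so losing either NIC leaves
# traffic flowing on the surviving link without any bearer reset.
tipc bearer enable media eth device eth0
tipc bearer enable media eth device eth1

# Verify the bearers and the resulting links:
tipc bearer list
tipc link list
```

Note this requires the `tipc` module to be loaded and root privileges;
the bearers replace the bond0 bearer rather than stacking on top of it.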