From: Howard F. <ho...@th...> - 2020-11-20 18:39:24
Hi Jon,

Thanks for the response. I was continuing to debug the situation today and I agree
that the issue is with the bond device. When using active/backup mode with ARP
monitoring, it is issuing a NETDEV_CHANGE event that arguably it should not be
issuing. I am already in the process of filing a bug against the bond driver.

Essentially what happens is that the bond driver sees the slave go down and stops
using it. It doesn't immediately switch to the backup, though, because the backup is
technically 'down'. However, it does send a NETDEV_CHANGE at this point. It then
immediately brings up the backup slave and sends another NETDEV_CHANGE. By then,
however, the bearer has already been reset. The whole point of the bond driver is
that it should switch slaves without notifying the upper layers that anything
happened. That is not what happens here, hence the bond driver bug. If MII
monitoring is used instead of ARP, it switches the slave immediately and TIPC is
unaware. However, MII monitoring does not make much sense when going across a
network involving switches, etc.

Dual TIPC links is an interesting suggestion. However, the functionality where I see
this issue is on a product that is 10+ years old, and we use the bond device for
redundancy. It is not a throughput/performance issue.

Thanks again for the quick response.

Howard

-----Original Message-----
From: Jon Maloy <jm...@re...>
Sent: Friday, November 20, 2020 12:25 PM
To: Howard Finer <ho...@th...>; tip...@li...
Subject: Re: [tipc-discussion] tipc over an active/backup bond device

Hi Howard,

This is the code executed when TIPC receives a NETDEV_CHANGE event:

        switch (evt) {
        case NETDEV_CHANGE:
                if (netif_carrier_ok(dev) && netif_oper_up(dev)) {
                        test_and_set_bit_lock(0, &b->up);
                        break;
                }
                fallthrough;
        case NETDEV_GOING_DOWN:
                clear_bit_unlock(0, &b->up);
                tipc_reset_bearer(net, b);
                break;
        case NETDEV_UP:
                test_and_set_bit_lock(0, &b->up);
                break;
        case NETDEV_CHANGEMTU:

So, unless the bond interface really reports that it is going down, TIPC doesn't
reset any links. And if it *does* report that it is going down, what else can we do?

To me this looks more like a problem with the bond device than with TIPC, but we
might of course have misunderstood its expected behavior. We will look into this.

On a different note, you could instead omit the bond interface and try using dual
TIPC links, which work in active-active mode and give better performance. Is that an
option for you?

BR
Jon Maloy

On 11/19/20 11:36 PM, Howard Finer wrote:
> I am trying to use TIPC (kernel version 4.19) over a bond device that is
> configured for active-backup and ARP monitoring for the slaves. If a slave
> goes down, TIPC is receiving a netdev_change during the timeframe that the
> bond device is working towards bringing up the new slave. This causes TIPC
> to disable the bearer, which in turn causes a temporary loss of
> communication between the nodes.
>
> Instrumentation of the bond and tipc drivers shows the following:
>
> <6> 1 2020-11-19T23:58:33.111549+01:00 LABNBS5A kernel - - - [ 153.655776] Enabled bearer <eth:bond0>, priority 10
> <6> 1 2020-11-20T00:07:58.544040+01:00 LABNBS5A kernel - - - [ 718.799259] bond0: bond_ab_arp_commit: BOND_LINK_DOWN: link status definitely down for interface eth1, disabling it
> <6> 1 2020-11-20T00:07:58.544063+01:00 LABNBS5A kernel - - - [ 718.799261] bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call select_active_slave
> <6> 1 2020-11-20T00:07:58.544069+01:00 LABNBS5A kernel - - - [ 718.799263] bond0: bond_select_active_slave: bond_find_best_slave returned NULL
> <6> 1 2020-11-20T00:07:58.544072+01:00 LABNBS5A kernel - - - [ 718.799347] bond0: bond_select_active_slave: now running without any active interface!
> <6> 1 2020-11-20T00:07:58.544080+01:00 LABNBS5A kernel - - - [ 718.799349] bond0: bond_ab_arp_commit: do_failover, returned from select_active_slave and unblock netpoll tx
> <6> 1 2020-11-20T00:07:58.544081+01:00 LABNBS5A kernel - - - [ 718.799611] Resetting bearer <eth:bond0>
> <6> 1 2020-11-20T00:07:58.655535+01:00 LABNBS5A kernel - - - [ 718.907245] bond0: bond_ab_arp_commit: BOND_LINK_UP: link status definitely up for interface eth0
> <6> 1 2020-11-20T00:07:58.655545+01:00 LABNBS5A kernel - - - [ 718.907247] bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call select_active_slave
> <6> 1 2020-11-20T00:07:58.655548+01:00 LABNBS5A kernel - - - [ 718.907248] bond0: bond_select_active_slave: bond_find_best_slave returned slave eth0
> <6> 1 2020-11-20T00:07:58.655559+01:00 LABNBS5A kernel - - - [ 718.907249] bond0: making interface eth0 the new active one
> <6> 1 2020-11-20T00:07:58.655562+01:00 LABNBS5A kernel - - - [ 718.907560] bond0: bond_select_active_slave: first active interface up!
>
> With ARP-based monitoring only one slave will be 'up'. When the active slave
> goes down, the other slave needs to be brought up. During that timeframe we
> see that TIPC is resetting the bearer. That defeats the entire purpose of
> using the bond device.
>
> It seems that the handling of the netdev_change event for an active/backup
> bond device is not correct. It needs to leave the bearer intact so that,
> when the backup slave is brought up, communication is properly restored
> without any upper-layer applications being aware that something happened at
> the lower level.
>
> Thanks,
>
> Howard
>
>
> _______________________________________________
> tipc-discussion mailing list
> tip...@li...
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
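
To make the failover window concrete, here is a small, self-contained userspace
model of the active-backup/ARP-monitoring sequence visible in the trace above. It is
only a sketch: do_failover() and select_active_slave() are names borrowed from the
log messages, and their bodies here are simplified stand-ins, not the real bonding
driver code. What it illustrates is that the first failover step runs while no slave
is up at all, so a carrier-down NETDEV_CHANGE reaches TIPC and (per the switch
statement quoted above) the bearer is reset before the backup slave comes up and
triggers the second, carrier-up NETDEV_CHANGE.

/* Simplified model of the active-backup failover described above.
 * NOT the real bonding driver code -- the function names are taken
 * from the log messages only. Build with e.g.: cc -o failover failover.c
 */
#include <stdbool.h>
#include <stdio.h>

struct slave { const char *name; bool link_up; };

static struct slave slaves[] = {
        { "eth1", true  },      /* currently active slave */
        { "eth0", false },      /* backup; with ARP monitoring it stays down */
};

static struct slave *active;

/* Pick the best slave; returns NULL if no slave has link up. */
static struct slave *select_active_slave(void)
{
        for (unsigned int i = 0; i < sizeof(slaves) / sizeof(slaves[0]); i++)
                if (slaves[i].link_up)
                        return &slaves[i];
        return NULL;
}

/* The point where the upper layers (e.g. the TIPC bearer) get notified. */
static void notify_netdev_change(void)
{
        printf("NETDEV_CHANGE: active slave %s, carrier %s\n",
               active ? active->name : "none",
               active ? "up" : "down");
}

static void do_failover(void)
{
        active = select_active_slave();
        notify_netdev_change();
}

int main(void)
{
        active = &slaves[0];

        /* 1. The ARP monitor declares the active slave (eth1) down. */
        slaves[0].link_up = false;
        do_failover();          /* no slave is up yet -> carrier-down event;
                                   this is where TIPC resets the bearer     */

        /* 2. Shortly afterwards the backup slave (eth0) is brought up. */
        slaves[1].link_up = true;
        do_failover();          /* carrier-up event, but the bearer has
                                   already been reset in the window above   */
        return 0;
}

In the real trace, the "Resetting bearer <eth:bond0>" line falls exactly in that
window, between "now running without any active interface!" and "link status
definitely up for interface eth0". For completeness, Jon's dual-link alternative
would mean enabling one TIPC bearer per physical interface instead of a single
bearer on bond0; with the tipc(8) tool that is typically something like
"tipc bearer enable media eth device eth0" (and likewise for eth1), though whether
that fits an existing deployed product is of course a separate question.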