From: Howard F. <ho...@th...> - 2020-11-20 04:56:19
I am trying to use TIPC (kernel version 4.19) over a bond device that is configured for active-backup mode with ARP monitoring of the slaves. If a slave goes down, TIPC receives a NETDEV_CHANGE event while the bond device is still bringing up the new slave. This causes TIPC to reset the bearer, which in turn causes a temporary loss of communication between the nodes.

Instrumentation of the bond and tipc drivers shows the following:

<6> 1 2020-11-19T23:58:33.111549+01:00 LABNBS5A kernel - - - [ 153.655776] Enabled bearer <eth:bond0>, priority 10
<6> 1 2020-11-20T00:07:58.544040+01:00 LABNBS5A kernel - - - [ 718.799259] bond0: bond_ab_arp_commit: BOND_LINK_DOWN: link status definitely down for interface eth1, disabling it
<6> 1 2020-11-20T00:07:58.544063+01:00 LABNBS5A kernel - - - [ 718.799261] bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call select_active_slave
<6> 1 2020-11-20T00:07:58.544069+01:00 LABNBS5A kernel - - - [ 718.799263] bond0: bond_select_active_slave: bond_find_best_slave returned NULL
<6> 1 2020-11-20T00:07:58.544072+01:00 LABNBS5A kernel - - - [ 718.799347] bond0: bond_select_active_slave: now running without any active interface!
<6> 1 2020-11-20T00:07:58.544080+01:00 LABNBS5A kernel - - - [ 718.799349] bond0: bond_ab_arp_commit: do_failover, returned from select_active_slave and unblock netpoll tx
<6> 1 2020-11-20T00:07:58.544081+01:00 LABNBS5A kernel - - - [ 718.799611] Resetting bearer <eth:bond0>
<6> 1 2020-11-20T00:07:58.655535+01:00 LABNBS5A kernel - - - [ 718.907245] bond0: bond_ab_arp_commit: BOND_LINK_UP: link status definitely up for interface eth0
<6> 1 2020-11-20T00:07:58.655545+01:00 LABNBS5A kernel - - - [ 718.907247] bond0: bond_ab_arp_commit: do_failover, block netpoll_tx and call select_active_slave
<6> 1 2020-11-20T00:07:58.655548+01:00 LABNBS5A kernel - - - [ 718.907248] bond0: bond_select_active_slave: bond_find_best_slave returned slave eth0
<6> 1 2020-11-20T00:07:58.655559+01:00 LABNBS5A kernel - - - [ 718.907249] bond0: making interface eth0 the new active one
<6> 1 2020-11-20T00:07:58.655562+01:00 LABNBS5A kernel - - - [ 718.907560] bond0: bond_select_active_slave: first active interface up!

With ARP-based monitoring, only one slave is 'up' at a time. When the active slave goes down, the other slave has to be brought up. During that window (roughly 108 ms in the log above, from the bearer reset at 718.799611 to the first active interface at 718.907560), TIPC resets the bearer. That defeats the entire purpose of using the bond device.

It seems that the handling of the NETDEV_CHANGE event for an active-backup bond device is not correct. It should leave the bearer intact so that, when the backup slave is brought up, communication is restored without any upper-layer applications being aware that something happened at the lower level.

Thanks,
Howard
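
For readers following the log, the sequence can be modelled in a few lines of userspace C. This is only a sketch of the behaviour described above, not the TIPC source: the names bearer_model, dev_event, bearer_on_event and replay_failover are hypothetical, and it assumes that bond0 reports no carrier while it has no active slave and that a change event seen without carrier is handled like a going-down event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
enum dev_event { DEV_CHANGE, DEV_GOING_DOWN, DEV_UP };

struct bearer_model {
    bool up;      /* bearer currently usable */
    int  resets;  /* number of times the bearer was reset */
};

/*
 * Assumed policy: a change event that arrives while the device has no
 * carrier is treated exactly like a going-down event, i.e. the bearer
 * is reset. A change event with carrier marks the bearer usable again.
 */
static void bearer_on_event(struct bearer_model *b, enum dev_event ev,
                            bool carrier_ok)
{
    assert(b != NULL);

    switch (ev) {
    case DEV_CHANGE:
        if (carrier_ok) {
            b->up = true;
            break;
        }
        /* no carrier: handled like DEV_GOING_DOWN */
        /* fall through */
    case DEV_GOING_DOWN:
        b->up = false;
        b->resets++;
        break;
    case DEV_UP:
        break;
    }
}

/*
 * Replays the two steps visible in the log: eth1 is declared down and
 * the bond has no active slave, then eth0 becomes the active slave.
 * Returns the number of bearer resets seen so far.
 */
static int replay_failover(struct bearer_model *b)
{
    bearer_on_event(b, DEV_CHANGE, false); /* ~718.799: no active slave */
    bearer_on_event(b, DEV_CHANGE, true);  /* ~718.907: eth0 now active */
    return b->resets;
}
```

Starting from a bearer with up set to true and resets at 0, replay_failover leaves the bearer up again but with one reset recorded. That single reset corresponds to the temporary loss of communication reported above; a policy that tolerated the short no-carrier window on an active-backup bond would leave the counter at zero.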