From: Amit J. <ami...@te...> - 2019-02-08 05:29:38
|
Hi , We are using tipc (version 3.10.0-862 for RHEL 7.3). Scenario : kernel crash in 2 node cluster , no traffic (bearers enabled/disabled) We are able to consistently reproduce the crash on a 2 node setup and enabling/disabling bearers in a loop. (On 1 node enable/disable bearers within 0.5 seconds , on 2nd node with 4 seconds interval) Root Cause Analysis of the issue : Disabling the bearer invokes tipc_link_delete which frees the struct tipc_link *l_ptr. This is done under tipc_net_lock, bearer locks Enabling the bearer, sends discovery messages and on reception from peer would invoke tipc_link_create which creates a new l_ptr and associates with the bearer. All this is done with correct locks. However it schedules link_start as a tasklet. Inside link_start, bearer lock is taken and work is performed (send proto msg). So if this tasklet is scheduled before tipc_link_delete , everything is fine. However if the tipc_link_delete happens first then the tasklet is invoked with dangling pointer. Proposed Solution: Cancel the tasklet when deleting the link. But since its a work queue item within the tasklet, its not possible to cancel a particular work item. So within link_start, we do sanity check whether l_ptr is valid o r not. if l_ptr is invalid (does not match with l_ptr inside any n_ptr in node list) we simply return. Following is the diff which seems to solve the issue, # git diff diff --git a/tipc/link.c b/tipc/link.c index 349dbfd..f52dca9 100644 --- a/tipc/link.c +++ b/tipc/link.c @@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr) static void link_start(struct tipc_link *l_ptr) { - tipc_node_lock(l_ptr->owner); + struct tipc_node *n_ptr; + read_lock_bh(&tipc_net_lock); + list_for_each_entry(n_ptr, &tipc_node_list, list) { + u32 i; + tipc_node_lock(n_ptr); + for (i = 0; i < MAX_BEARERS; i++) { + if (n_ptr->links[i] == l_ptr) { + goto LINK_START; + } + } + tipc_node_unlock(n_ptr); + } + read_unlock_bh(&tipc_net_lock); + return; +LINK_START: link_state_event(l_ptr, STARTING_EVT); tipc_node_unlock(l_ptr->owner); + read_unlock_bh(&tipc_net_lock); } Would like to know if there are any obvious issues due to above change? Appreciate any feedback. Regards, Amit |