From: Jon M. <jon...@er...> - 2019-02-08 11:50:58
Hi Amit,

Your solution seems ok to me.

BR
///jon

> -----Original Message-----
> From: Amit Jain <ami...@te...>
> Sent: 8-Feb-19 00:04
> To: tip...@li...
> Subject: [tipc-discussion] Race condition in link_start and tipc_link_delete
> function - RHEL 7.3
>
> Hi,
>
> We are using TIPC (kernel version 3.10.0-862, RHEL 7.3).
>
> Scenario:
> Kernel crash in a 2-node cluster with no traffic, while bearers are being
> enabled/disabled. We can consistently reproduce the crash on a 2-node setup
> by enabling/disabling bearers in a loop (on node 1, enable/disable bearers
> within 0.5 seconds; on node 2, at a 4-second interval).
>
> Root cause analysis:
>
> Disabling a bearer invokes tipc_link_delete(), which frees the
> struct tipc_link *l_ptr. This is done under tipc_net_lock and the bearer
> locks.
>
> Enabling a bearer sends discovery messages; when a reply is received from
> the peer, tipc_link_create() is invoked, which allocates a new l_ptr and
> associates it with the bearer. All of this is done with the correct locks
> held. However, it schedules link_start() as a tasklet.
>
> Inside link_start(), the node lock is taken and work is performed (a
> protocol message is sent). If this tasklet runs before tipc_link_delete(),
> everything is fine. But if tipc_link_delete() runs first, the tasklet is
> invoked with a dangling pointer.
>
> Proposed solution:
> Cancel the tasklet when deleting the link. But since link_start() is just
> one work item on the queue drained by a shared tasklet, it is not possible
> to cancel an individual work item. So within link_start() we sanity-check
> whether l_ptr is still valid: if it is invalid (it does not match any
> l_ptr inside any n_ptr on the node list), we simply return.
>
> The following diff seems to solve the issue:
>
> # git diff
> diff --git a/tipc/link.c b/tipc/link.c
> index 349dbfd..f52dca9 100644
> --- a/tipc/link.c
> +++ b/tipc/link.c
> @@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr)
>
>  static void link_start(struct tipc_link *l_ptr)
>  {
> -	tipc_node_lock(l_ptr->owner);
> +	struct tipc_node *n_ptr;
> +	read_lock_bh(&tipc_net_lock);
> +	list_for_each_entry(n_ptr, &tipc_node_list, list) {
> +		u32 i;
> +		tipc_node_lock(n_ptr);
> +		for (i = 0; i < MAX_BEARERS; i++) {
> +			if (n_ptr->links[i] == l_ptr) {
> +				goto LINK_START;
> +			}
> +		}
> +		tipc_node_unlock(n_ptr);
> +	}
> +	read_unlock_bh(&tipc_net_lock);
> +	return;
> +LINK_START:
>  	link_state_event(l_ptr, STARTING_EVT);
>  	tipc_node_unlock(l_ptr->owner);
> +	read_unlock_bh(&tipc_net_lock);
>  }
>
> We would like to know if there are any obvious issues with the above
> change. Any feedback is appreciated.
>
> Regards,
> Amit
>
> _______________________________________________
> tipc-discussion mailing list
> tip...@li...
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
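
For readers hitting the same pattern elsewhere: the per-item cancellation
that is unavailable here (because link_start() is funnelled through a shared
signal-queue tasklet) is exactly what the generic kernel workqueue API
provides when each link owns its own work item. Below is a minimal sketch of
that pattern, not the actual TIPC code; the start_work field and the helper
names are hypothetical.

#include <linux/workqueue.h>
#include <linux/slab.h>

/* Sketch only: give each link its own cancellable work item.
 * The start_work field and these helpers are hypothetical;
 * they do not exist in TIPC 3.10. */
struct tipc_link_sketch {
	/* ... existing tipc_link fields ... */
	struct work_struct start_work;
};

static void link_start_work(struct work_struct *work)
{
	struct tipc_link_sketch *l_ptr =
		container_of(work, struct tipc_link_sketch, start_work);

	/* l_ptr cannot be stale here: delete cancels this work
	 * item (or waits for it) before freeing the link. */
	/* ... original body: link_state_event(l_ptr, STARTING_EVT); ... */
	(void)l_ptr;
}

static void link_schedule_start(struct tipc_link_sketch *l_ptr)
{
	INIT_WORK(&l_ptr->start_work, link_start_work);
	schedule_work(&l_ptr->start_work);
}

static void link_delete(struct tipc_link_sketch *l_ptr)
{
	/* Removes a pending item or waits for a running handler,
	 * so the handler can never observe freed memory. */
	cancel_work_sync(&l_ptr->start_work);
	kfree(l_ptr);
}

One caveat: cancel_work_sync() may sleep and must not be called while
holding a lock that the handler itself takes, so it could not simply be
dropped into tipc_link_delete() under tipc_net_lock; the teardown ordering
would need rework. Given that, the list-walk sanity check in the patch above
is the less invasive fix for this kernel.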
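
Another standard way to make a deferred pointer safe is reference counting:
pin the link when the work is queued and drop the reference in the handler,
with a flag so the handler skips logically deleted links. Again a
hypothetical sketch under the same caveat; the ref and deleted fields do not
exist in TIPC 3.10.

#include <linux/kref.h>
#include <linux/slab.h>

struct tipc_link_sketch {
	/* ... existing tipc_link fields ... */
	struct kref ref;	/* hypothetical: kref_init() at create time */
	bool deleted;		/* hypothetical: guarded by the node lock */
};

static void link_release(struct kref *ref)
{
	kfree(container_of(ref, struct tipc_link_sketch, ref));
}

/* Scheduling side: pin the link before queueing link_start(). */
static void link_pin_for_start(struct tipc_link_sketch *l_ptr)
{
	kref_get(&l_ptr->ref);
	/* ... queue link_start(l_ptr) as before ... */
}

/* Handler side: the memory is pinned, but a logically deleted
 * link must still be skipped. */
static void link_start(struct tipc_link_sketch *l_ptr)
{
	if (!l_ptr->deleted) {
		/* ... original body: link_state_event(l_ptr, STARTING_EVT); ... */
	}
	kref_put(&l_ptr->ref, link_release);
}

/* Delete side: mark deleted, then drop the creation reference. */
static void link_delete(struct tipc_link_sketch *l_ptr)
{
	l_ptr->deleted = true;
	kref_put(&l_ptr->ref, link_release);
}

Note that the deleted flag must be read and written under the same lock the
handler takes (the node lock here), otherwise the check itself races with
delete; that locking is omitted from the sketch for brevity.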