From: Amit J. <ami...@te...> - 2019-02-11 19:27:51
|
Thanks Jon, Regards, Amit On Fri, 8 Feb 2019, 17:20 Jon Maloy <jon...@er... wrote: > Hi Amit, > Your solution seems ok to me. > > BR > ///jon > > > > -----Original Message----- > > From: Amit Jain <ami...@te...> > > Sent: 8-Feb-19 00:04 > > To: tip...@li... > > Subject: [tipc-discussion] Race condition in link_start and > tipc_link_delete > > function - RHEL 7.3 > > > > Hi , > > > > We are using tipc (version 3.10.0-862 for RHEL 7.3). > > > > Scenario : > > kernel crash in 2 node cluster , no traffic (bearers enabled/disabled) > We are > > able to consistently reproduce the crash on a 2 node setup and > > enabling/disabling bearers in a loop. (On 1 node enable/disable bearers > > within 0.5 seconds , on 2nd node with 4 seconds interval) > > > > Root Cause Analysis of the issue : > > > > Disabling the bearer invokes tipc_link_delete which frees the struct > tipc_link > > *l_ptr. > > This is done under tipc_net_lock, bearer locks > > > > Enabling the bearer, sends discovery messages and on reception from peer > > would invoke tipc_link_create which creates a new l_ptr and associates > with > > the bearer. All this is done with correct locks. However it schedules > link_start > > as a tasklet. > > > > Inside link_start, bearer lock is taken and work is performed (send proto > > msg). So if this tasklet is scheduled before tipc_link_delete , > everything is > > fine. However if the tipc_link_delete happens first then the tasklet is > > invoked with dangling pointer. > > > > Proposed Solution: > > Cancel the tasklet when deleting the link. But since its a work queue > item > > within the tasklet, its not possible to cancel a particular work item. > So within > > link_start, we do sanity check whether l_ptr is valid o r not. if l_ptr > is invalid > > (does not match with l_ptr inside any n_ptr in node list) we simply > return. > > > > Following is the diff which seems to solve the issue, > > > > # git diff > > diff --git a/tipc/link.c b/tipc/link.c > > index 349dbfd..f52dca9 100644 > > --- a/tipc/link.c > > +++ b/tipc/link.c > > @@ -393,9 +393,24 @@ void tipc_link_delete(struct tipc_link *l_ptr) > > > > static void link_start(struct tipc_link *l_ptr) { > > - tipc_node_lock(l_ptr->owner); > > + struct tipc_node *n_ptr; > > + read_lock_bh(&tipc_net_lock); > > + list_for_each_entry(n_ptr, &tipc_node_list, list) { > > + u32 i; > > + tipc_node_lock(n_ptr); > > + for (i = 0; i < MAX_BEARERS; i++) { > > + if (n_ptr->links[i] == l_ptr) { > > + goto LINK_START; > > + } > > + } > > + tipc_node_unlock(n_ptr); > > + } > > + read_unlock_bh(&tipc_net_lock); > > + return; > > +LINK_START: > > link_state_event(l_ptr, STARTING_EVT); > > tipc_node_unlock(l_ptr->owner); > > + read_unlock_bh(&tipc_net_lock); > > } > > > > Would like to know if there are any obvious issues due to above change? > > Appreciate any feedback. > > > > Regards, > > Amit > > > > _______________________________________________ > > tipc-discussion mailing list > > tip...@li... > > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > |