From: Jon M. <jm...@re...> - 2020-05-12 21:21:42
cc to tipc-discussion and others.  ///jon

-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Tue, 12 May 2020 14:38:29 -0400
From: Jon Maloy <jm...@re...>
To: Hoang Huu Le <hoa...@de...>, ma...@do... <ma...@do...>

Hi Hoang,
I had to re-read the discussion we had back in November to refresh my memory about this. To me, v2 looks like a better base for further development than the most recent version, so I am sorry to have wasted your time here.

I am still convinced that the mcast/rcast synch protocol we developed can solve almost all our problems if we use it right. What about the following:

Sending side:
- Even the initial bulk is sent via tipc_mcast_xmit(), but with only one destination node and as "mandatory rcast". The broadcast link may still choose to send it as bcast in case rcast has been manually disabled by the user, but the surplus messages will be dropped on the non-destination nodes. (In the case of legacy nodes this is certain; in the case of newer nodes we must make sure it happens by setting the 'destnode' field in the header and always checking for this in tipc_named_rcv().) I think this is a weird and rare case where we can accept the extra overhead of broadcast. Under all normal conditions rcast will be used here.
- Regular PUBLISH/WITHDRAW messages are sent as "mandatory bcast", or, if some nodes don't support TIPC_MCAST_RBCTL, as "mandatory rcast". The broadcast link may still choose to send it as rcast, but this is nothing we need to worry about as long as the synchronization mechanism is active. Under all normal conditions bcast will be selected when possible, since all nodes are destinations.

Receiving side:
- We need a deferred queue, but last time I missed that we don't need one per node; it is sufficient to have a global one, as you were suggesting. So to me the code is ok here.
- As mentioned earlier, tipc_named_rcv() only accepts messages where destnode is "self" (bulk) or zero (publish/withdraw).

The point with all this is that all messages will now be subject to the synchronization mechanism, so the bulk messages are guaranteed to be delivered in order and ahead of individual publications/withdraws, and even topology changes will almost never lead to a change of sending method and a need for synchronization. We must make one change to the broadcast protocol though: even "mandatory" messages must be made subject to the synchronization mechanism, something that doesn't seem to be the case now. (Or maybe we don't need to set "mandatory" at all? This needs to be checked.)

The advantage of this method is simplicity:
- We don't need any separate synchronization mechanism for NAME_DISTR messages at all; we just use what is already there.

The disadvantage is that there will be a lot of SYN messages sent if we are not careful:
- Every time a new node comes up, the other nodes will send it a bcast SYN before they send the bulk, because they believe they have just switched from bcast (normal mode) to rcast (bulk mode). This one is in reality unnecessary, since we can be sure that the new node has not been sent any previous bcast that needs to be synched with.
- At the first publication/withdraw sent after the bulk, the neighbor nodes will send an rcast SYN to all other nodes, because they just "switched" back from rcast to bcast. In this case we really do need to send an rcast SYN to the new node that came up, since this is the only one where there is a risk of a race. This message in reality serves as an "end-of-bulk" message, but it is handled like any other SYN by the receiving tipc_mcast_filter_rcv().

So, to avoid this, the broadcast protocol needs to be able to recognize NAME_DISTR messages and treat them slightly differently from the others, or it needs to be told what to do by the name_distr.c code via the API.
Maybe a couple of new fields in struct tipc_mc_method? What do you think?

Regards
///jon

On 5/12/20 6:22 AM, Hoang Huu Le wrote:
> Just forward the patch I mentioned.
>
> -----Original Message-----
> From: Hoang Le <hoa...@de...>
> Sent: Tuesday, November 19, 2019 5:01 PM
> To: jon...@er...; ma...@do...; tip...@de...
> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Currently, updating the binding table (adding a service binding to the
> name table/withdrawing a service binding) is done over replicast.
> However, if we scale clusters up to > 100 nodes/containers, this
> method is inefficient because we loop through the nodes in a cluster
> one by one.
>
> It is worth using broadcast to update a binding service; the binding
> tables on all nodes are then updated in one shot.
>
> The mechanism is backwards compatible, as sync messages are silently
> dropped.
>
> v2: resolve synchronization problem when switching from unicast to
> broadcast
>
> Signed-off-by: Hoang Le <hoa...@de...>
> ---
>  net/tipc/bcast.c      |  3 ++-
>  net/tipc/link.c       |  6 ++++++
>  net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
>  net/tipc/name_table.h |  4 ++++
>  net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
>  net/tipc/node.h       |  2 ++
>  6 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index e06f05d55534..44ed481fec47 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>  	hdr = buf_msg(skb);
>  	if (msg_user(hdr) == MSG_FRAGMENTER)
>  		hdr = msg_inner_hdr(hdr);
> -	if (msg_type(hdr) != TIPC_MCAST_MSG)
> +	if (msg_user(hdr) != NAME_DISTRIBUTOR &&
> +	    msg_type(hdr) != TIPC_MCAST_MSG)
>  		return 0;
>
>  	/* Allocate dummy message */
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index fb72031228c9..a2e9a64d5a0f 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>  			    struct sk_buff_head *inputq)
>  {
>  	struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
> +	struct name_table *nt = tipc_name_table(l->net);
> +	struct sk_buff_head *defnq = &nt->defer_namedq;
>  	struct tipc_msg *hdr = buf_msg(skb);
>
>  	switch (msg_user(hdr)) {
> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>  	case NAME_DISTRIBUTOR:
>  		l->bc_rcvlink->state = LINK_ESTABLISHED;
>  		skb_queue_tail(l->namedq, skb);
> +
> +		spin_lock_bh(&defnq->lock);
> +		tipc_mcast_filter_msg(l->net, defnq, l->namedq);
> +		spin_unlock_bh(&defnq->lock);
>  		return true;
>  	case MSG_BUNDLER:
>  	case TUNNEL_PROTOCOL:
> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
> index 66a65c2cdb23..593dcd11357f 100644
> --- a/net/tipc/name_table.c
> +++ b/net/tipc/name_table.c
> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>  	struct tipc_net *tn = tipc_net(net);
>  	struct publication *p = NULL;
>  	struct sk_buff *skb = NULL;
> +	bool rcast;
>
>  	spin_lock_bh(&tn->nametbl_lock);
> +	rcast = nt->rcast;
>  	if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>  		pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>  		goto exit;
> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>  exit:
>  	spin_unlock_bh(&tn->nametbl_lock);
>
> -	if (skb)
> -		tipc_node_broadcast(net, skb);
> +	if (skb) {
> +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
> +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +			tipc_node_broadcast_named_publish(net, skb, &rcast);
> +			spin_lock_bh(&tn->nametbl_lock);
> +			nt->rcast = rcast;
> +			spin_unlock_bh(&tn->nametbl_lock);
> +		} else {
> +			/* Otherwise, be backwards compatible */
> +			tipc_node_broadcast(net, skb);
> +		}
> +	}
>  	return p;
>  }
> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>  	u32 self = tipc_own_addr(net);
>  	struct sk_buff *skb = NULL;
>  	struct publication *p;
> +	bool rcast;
>
>  	spin_lock_bh(&tn->nametbl_lock);
> +	rcast = nt->rcast;
>  	p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>  	if (p) {
> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>  	spin_unlock_bh(&tn->nametbl_lock);
>
>  	if (skb) {
> -		tipc_node_broadcast(net, skb);
> +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
> +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +			tipc_node_broadcast_named_publish(net, skb, &rcast);
> +			spin_lock_bh(&tn->nametbl_lock);
> +			nt->rcast = rcast;
> +			spin_unlock_bh(&tn->nametbl_lock);
> +		} else {
> +			/* Otherwise, be backwards compatible */
> +			tipc_node_broadcast(net, skb);
> +		}
>  		return 1;
>  	}
>  	return 0;
> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>  	INIT_LIST_HEAD(&nt->cluster_scope);
>  	rwlock_init(&nt->cluster_scope_lock);
>  	tn->nametbl = nt;
> +	/* 'bulk' updated messages via unicast */
> +	nt->rcast = true;
> +	skb_queue_head_init(&nt->defer_namedq);
>  	spin_lock_init(&tn->nametbl_lock);
>  	return 0;
>  }
> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>  	 * publications, then release the name table
>  	 */
>  	spin_lock_bh(&tn->nametbl_lock);
> +	skb_queue_purge(&nt->defer_namedq);
>  	for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>  		if (hlist_empty(&nt->services[i]))
>  			continue;
> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
> index f79066334cc8..b8cdf2a29d48 100644
> --- a/net/tipc/name_table.h
> +++ b/net/tipc/name_table.h
> @@ -95,6 +95,8 @@ struct publication {
>   *         - used by name_distr to send bulk updates to new nodes
>   *         - used by name_distr during re-init of name table
>   * @local_publ_count: number of publications issued by this node
> + * @defer_namedq: temporary queue for 'synching' updates
> + * @rcast: previous method used to publish/withdraw a service
>   */
>  struct name_table {
>  	struct hlist_head services[TIPC_NAMETBL_SIZE];
> @@ -102,6 +104,8 @@ struct name_table {
>  	struct list_head cluster_scope;
>  	rwlock_t cluster_scope_lock;
>  	u32 local_publ_count;
> +	struct sk_buff_head defer_namedq;
> +	bool rcast;
>  };
>
>  int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index aaf595613e6e..b058647fa78b 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>  	}
>  	rcu_read_unlock();
>  }
> +
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +				      bool *rcast)
> +{
> +	struct tipc_mc_method method = {.rcast = *rcast};
> +	struct sk_buff_head xmitq;
> +	struct tipc_nlist dests;
> +	struct tipc_node *n;
> +	u16 cong_link_cnt;
> +	int rc = 0;
> +
> +	__skb_queue_head_init(&xmitq);
> +	__skb_queue_tail(&xmitq, skb);
> +
> +	tipc_nlist_init(&dests, tipc_own_addr(net));
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(n, tipc_nodes(net), list) {
> +		if (in_own_node(net, n->addr))
> +			continue;
> +		if (!node_is_up(n))
> +			continue;
> +		tipc_nlist_add(&dests, n->addr);
> +	}
> +	rcu_read_unlock();
> +
> +	rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
> +	*rcast = method.rcast;
> +
> +	tipc_nlist_purge(&dests);
> +	__skb_queue_purge(&xmitq);
> +	return rc;
> +}
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index a6803b449a2c..d7d19f9932b1 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
>  int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
>  #endif
>  void tipc_node_pre_cleanup_net(struct net *exit_net);
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +				      bool *rcast);
>  #endif