From: Jon M. <jm...@re...> - 2020-06-04 13:59:33
On 6/4/20 5:14 AM, Hoang Huu Le wrote:
> Hi Jon,
>
> Please see my inline comments.
>
> Regards,
> Hoang
>
> *From:* Jon Maloy <jm...@re...>
> *Sent:* Friday, May 29, 2020 11:11 PM
> *To:* tip...@li...; tipc-dek <tip...@de...>;
>   Tuong Tong Lien <tuo...@de...>; Xin Long <luc...@gm...>;
>   Tung Quang Nguyen <tun...@de...>; Ying Xue <yin...@wi...>
> *Subject:* Fwd: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Added more recipients.
>
> -------- Forwarded Message --------
> *Subject:* Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
> *Date:* Fri, 29 May 2020 12:08:02 -0400
> *From:* Jon Maloy <jm...@re...>
> *To:* Hoang Huu Le <hoa...@de...>, ma...@do...
>
> Hi Hoang,
> See below.
>
> On 5/27/20 6:49 AM, Hoang Huu Le wrote:
>> Hi Jon,
>>
>> I have a DRAFT version based on your idea (attached file). But from
>> my point of view, this version introduces too much code on the
>> sending side. I don't think it is a bright idea to keep rcast_list
>> and bcast_list in the name table. I think we should find a new way,
>> or just drop the feature.
>
> Yes, you are right.
> I came up with a new idea: just add a sequence number to the
> broadcast/replicast messages and re-order them at reception. This
> even handles the case where the first broadcast message arrives
> before the first bulk message, something we had not anticipated
> before.
> I couldn't resist the temptation to code it, as you can see in the
> patch I just sent out. It is totally untested; I just added the code
> as I thought it should be and made sure it compiled.
> There is still a little too much new code for my taste, but this
> might be a way forward. Please give your feedback on this.
>
> I also noticed a couple of things while working with this:
>
> 1) There is still an 'expires' field in tipc_mcast_method, and it
> seems to even be considered when switching between bcast and rcast.
> The whole point of adding the mcast synchronization mechanism was to
> get rid of this delay. Have you tested that synchronization really
> works without the 'expires'?
>
> [Hoang] Yes, I did. I issued the commands below to force rcast/bcast
> regardless of 'expires':
>
>   $ tipc link set bro REPLICAST
> or
>   $ tipc link set bro BROADCAST

Yes, but what happens when the protocol selects by itself, based on the
number of destinations, and this number has just passed a threshold?

> 2) There are some remnants of old code for the name table dist_queue.
> This functionality was made redundant by me at least two years ago,
> so this should be cleaned up.
>
> [Hoang] I will check and clean that up in a separate commit.
>
> 3) We might have a potential race condition when new nodes come up,
> so that publications are distributed twice:
>    a) A publication is added to the name table, and the name table
>       lock is released.
>    b) A new node comes up, and the new publication is delivered in
>       the bulk message.
>    c) The broadcast of the publication goes ahead and sends it out
>       to all nodes, even the one that just came up.
>    d) We end up with a double publication on one of the nodes.
>    e) One of those will linger in the name table after the
>       publication is withdrawn.
> I have never seen this happen, and my analysis might be wrong, but
> to me this looks like a possible scenario.
> Note that my patch doesn't fix this, but we could possibly arrange
> it by adding a 'distributed' flag in the publication item on the
> sending side, so that the bulk distribution will ignore it.
>
> [Hoang] I tried to simulate your scenario on 8 nodes, publishing more
> than 100 services at the same time while bringing an interface
> down/up. Sometimes I got the error logs below:
>
>   [ 537.322414] tipc: Failed to remove binding 1000,2 from 1001001
>   […]
>   [ 537.358957] tipc: Failed to remove binding 1000,11 from 1001001
>
> I'm not sure whether the above counts as a bug or not. If it does,
> we should fix it in another commit too.

This is not what I expected, but it might be another manifestation of
the same problem. We are probably observing a replicast withdraw
arriving before the corresponding bulk publication. If you see a
binding <1000,2> in the table after this printout, that would be a
confirmation.

For my scenario: do you see duplicate publication instances before you
do the withdraw? Do you see a lingering publication after a withdraw?

Luckily, all of this will be fixed with the new broadcast distribution
protocol.

Regards
///jon
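For illustration, the sequence-number reordering idea described above
can be modeled in a few lines of user-space C. This is only a sketch of
the concept, not TIPC code; the names here (struct msg, after(),
defer(), release()) are invented:

/* Minimal user-space model of the sequence-number idea: defer
 * out-of-order messages and release them in order. Illustrative
 * names only, not actual TIPC code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct msg {
    uint16_t seqno;
    struct msg *next;
};

/* Wraparound-safe test: does 'a' come after 'b'? */
static int after(uint16_t a, uint16_t b)
{
    return (int16_t)(a - b) > 0;
}

/* Insert a message into the deferred queue, kept sorted by seqno */
static void defer(struct msg **q, struct msg *m)
{
    while (*q && !after((*q)->seqno, m->seqno))
        q = &(*q)->next;
    m->next = *q;
    *q = m;
}

/* Deliver every consecutive message starting at *expect */
static void release(struct msg **q, uint16_t *expect)
{
    while (*q && (*q)->seqno == *expect) {
        struct msg *m = *q;

        *q = m->next;
        printf("deliver seqno %u\n", m->seqno);
        free(m);
        (*expect)++;
    }
}

On reception, a message with an unexpected sequence number would be
parked with defer(), and release() would deliver any consecutive run
once the expected message arrives. The (int16_t) cast keeps the
comparison correct across sequence-number wraparound.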
> Regards
> ///jon
>
> Regards,
> Hoang
>
> -----Original Message-----
> From: Jon Maloy <jm...@re...>
> Sent: Thursday, May 21, 2020 9:44 PM
> To: Hoang Huu Le <hoa...@de...>; ma...@do...
> Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Hi,
> I have one more comment below. Looking forward to your feedback.
> ///jon
>
> On 5/20/20 9:03 PM, Hoang Huu Le wrote:
>> Yeah, thanks Jon.
>> I will investigate more on your idea when I finish the issue with
>> the lab infrastructure.
>>
>> Hoang
>
> -----Original Message-----
> From: Jon Maloy <jm...@re...>
> Sent: Thursday, May 21, 2020 7:13 AM
> To: Hoang Huu Le <hoa...@de...>; ma...@do...
> Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Hi Hoang,
> Below I try to summarize my newest proposal in relation to v2 of your
> patch.
>
> 1) The bulk can be sent just as is done now, with the addition that
>    we add the NOT_LAST bit to the header.
> 2) We need a new capability bit identifying nodes which support
>    broadcast NAME_DISTR. I don't see that we can just reuse
>    TIPC_MCAST_RBCTL, because the recipients need to have code for
>    handling concurrent bulk/broadcast receptions. This bit is not
>    added to the cluster capability word.
> 3) We need to keep two structs of type tipc_nlist in the name table:
>    one containing tipc_dest structures for all nodes NOT supporting
>    TIPC_MCAST_NAMEDISTR (rcast_list), and the other for those which
>    do (bcast_list).
>
> For more comments see below.
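For illustration, the destination-list split proposed in points 2) and
3) above could look roughly like the user-space sketch below.
CAP_MCAST_NAMEDISTR, struct node_info and sort_dests() are invented
names; the real code would use the node capability flags and struct
tipc_nlist:

/* Illustrative model of points 2) and 3): split the cluster nodes
 * into a replicast list and a broadcast list based on a per-node
 * capability bit. All names here are hypothetical.
 */
#include <stdint.h>

#define CAP_MCAST_NAMEDISTR (1u << 15)  /* hypothetical bit */

struct node_info {
    uint32_t addr;
    uint32_t capabilities;
};

/* Distribute node addresses into rcast_list (legacy nodes) and
 * bcast_list (nodes supporting broadcast NAME_DISTR).
 */
static void sort_dests(const struct node_info *nodes, int n,
                       uint32_t *rcast_list, int *nr,
                       uint32_t *bcast_list, int *nb)
{
    int i;

    *nr = 0;
    *nb = 0;
    for (i = 0; i < n; i++) {
        if (nodes[i].capabilities & CAP_MCAST_NAMEDISTR)
            bcast_list[(*nb)++] = nodes[i].addr;
        else
            rcast_list[(*nr)++] = nodes[i].addr;
    }
}

The sender would then transmit once per non-empty list: replicast to
rcast_list, and broadcast (or replicast, at the protocol's discretion)
to bcast_list.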
> On 5/12/20 6:22 AM, Hoang Huu Le wrote:
>> Just forwarding the patch I mentioned.
>>
>> -----Original Message-----
>> From: Hoang Le <hoa...@de...>
>> Sent: Tuesday, November 19, 2019 5:01 PM
>> To: jon...@er...; ma...@do...; tip...@de...
>> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>>
>> Currently, updating the binding table (adding a service binding to
>> the name table/withdrawing a service binding) is sent over
>> replicast. However, if we are scaling up clusters to > 100
>> nodes/containers, this method is less efficient because it loops
>> through the nodes in a cluster one by one.
>>
>> It is worthwhile to use broadcast to update a binding service
>> instead. Then the binding table is updated on all nodes in one shot.
>>
>> The mechanism is backwards compatible, as the sync message is
>> silently dropped.
>>
>> v2: resolve the synchronization problem when switching from unicast
>> to broadcast.
>>
>> Signed-off-by: Hoang Le <hoa...@de...>
>> ---
>>  net/tipc/bcast.c      |  3 ++-
>>  net/tipc/link.c       |  6 ++++++
>>  net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
>>  net/tipc/name_table.h |  4 ++++
>>  net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
>>  net/tipc/node.h       |  2 ++
>>  6 files changed, 76 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
>> index e06f05d55534..44ed481fec47 100644
>> --- a/net/tipc/bcast.c
>> +++ b/net/tipc/bcast.c
>> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>>      hdr = buf_msg(skb);
>>      if (msg_user(hdr) == MSG_FRAGMENTER)
>>          hdr = msg_inner_hdr(hdr);
>> -    if (msg_type(hdr) != TIPC_MCAST_MSG)
>> +    if (msg_user(hdr) != NAME_DISTRIBUTOR &&
>> +        msg_type(hdr) != TIPC_MCAST_MSG)
>>          return 0;
>>
>>      /* Allocate dummy message */
>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>> index fb72031228c9..a2e9a64d5a0f 100644
>> --- a/net/tipc/link.c
>> +++ b/net/tipc/link.c
>> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>>                  struct sk_buff_head *inputq)
>>  {
>>      struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
>> +    struct name_table *nt = tipc_name_table(l->net);
>> +    struct sk_buff_head *defnq = &nt->defer_namedq;
>>      struct tipc_msg *hdr = buf_msg(skb);
>>
>>      switch (msg_user(hdr)) {
>> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>>      case NAME_DISTRIBUTOR:
>>          l->bc_rcvlink->state = LINK_ESTABLISHED;
>>          skb_queue_tail(l->namedq, skb);
>> +
>> +        spin_lock_bh(&defnq->lock);
>> +        tipc_mcast_filter_msg(l->net, defnq, l->namedq);
>
> Should not be needed here.
> You can instead do this in tipc_named_rcv(), using l->namedq as the
> deferred queue and creating a temporary namedq queue on the stack for
> the messages ready to be delivered.
> We sort the messages in three steps:
> 1) If there are any chains of bulk messages, we sort those per source
>    node into the temporary namedq and deliver them first, when a
>    chain is complete.
> 2) If we find that a chain is incomplete, we push it back to the head
>    of n->namedq and return without further action.
> 3) When there are no bulk messages left in n->namedq, we call
>    tipc_mcast_filter_msgs() to sort the remaining messages into the
>    temporary namedq, as far as possible, and deliver those which are
>    ready to be delivered.
>
>> +        spin_unlock_bh(&defnq->lock);
>>          return true;
>>      case MSG_BUNDLER:
>>      case TUNNEL_PROTOCOL:
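For illustration, the chain-completeness test implied by steps 1) and
2) above can be modeled like this in user-space C. FLAG_NOT_LAST and
the other names are invented; the real code would inspect the NOT_LAST
header bit mentioned earlier in the thread:

/* User-space model of steps 1) and 2): a per-node bulk chain is
 * only deliverable once its final message (NOT_LAST bit cleared)
 * has arrived. All names here are invented for illustration.
 */
#include <stdbool.h>

#define FLAG_NOT_LAST 0x1   /* hypothetical header bit */

struct bulk_msg {
    unsigned int node;      /* originating node address */
    unsigned int flags;
    struct bulk_msg *next;
};

/* True if the queued chain from 'node' ends with a last message */
static bool chain_complete(const struct bulk_msg *q, unsigned int node)
{
    bool complete = false;

    for (; q; q = q->next)
        if (q->node == node)
            complete = !(q->flags & FLAG_NOT_LAST);
    return complete;
}

A receiver would deliver the chain for a node only when
chain_complete() returns true, and otherwise push the messages back
and wait for the rest of the bulk.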
>> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
>> index 66a65c2cdb23..593dcd11357f 100644
>> --- a/net/tipc/name_table.c
>> +++ b/net/tipc/name_table.c
>> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>>      struct tipc_net *tn = tipc_net(net);
>>      struct publication *p = NULL;
>>      struct sk_buff *skb = NULL;
>> +    bool rcast;
>>
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    rcast = nt->rcast;
>>      if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>>          pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>>          goto exit;
>> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>>  exit:
>>      spin_unlock_bh(&tn->nametbl_lock);
>>
>> -    if (skb)
>> -        tipc_node_broadcast(net, skb);
>
> Here we make two calls to tipc_mcast_xmit(), one with
> method->rcast/mandatory and rcast_list, if not empty, and one with
> method->bcast/mandatory and bcast_list, if not empty.
>
> Actually, the latter should not be mandatory. We can easily imagine a
> situation where we start out with only legacy nodes, and then upgrade
> to bcast nodes, one by one.
>
> In the beginning we want tipc_mcast_xmit() to select rcast even for
> the bcast nodes, and then, as their proportion of the cluster grows,
> it should switch to bcast. This does mean that the method struct for
> bcast must be kept between the calls, i.e., as another member of
> struct name_table.
>
> ///jon
>
> skb must of course be cloned if necessary.
>
>> +    if (skb) {
>> +        /* Use broadcast if all nodes support broadcast NAME_DISTR */
>> +        if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
>> +            tipc_node_broadcast_named_publish(net, skb, &rcast);
>> +            spin_lock_bh(&tn->nametbl_lock);
>> +            nt->rcast = rcast;
>> +            spin_unlock_bh(&tn->nametbl_lock);
>> +        } else {
>> +            /* Otherwise, be backwards compatible */
>> +            tipc_node_broadcast(net, skb);
>> +        }
>
> Not needed according to the above.
>
>>      return p;
>>  }
>> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>>      u32 self = tipc_own_addr(net);
>>      struct sk_buff *skb = NULL;
>>      struct publication *p;
>> +    bool rcast;
>>
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    rcast = nt->rcast;
>>      p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>>      if (p) {
>> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>>      spin_unlock_bh(&tn->nametbl_lock);
>>
>>      if (skb) {
>> -        tipc_node_broadcast(net, skb);
>> +        /* Use broadcast if all nodes support broadcast NAME_DISTR */
>> +        if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
>> +            tipc_node_broadcast_named_publish(net, skb, &rcast);
>> +            spin_lock_bh(&tn->nametbl_lock);
>> +            nt->rcast = rcast;
>> +            spin_unlock_bh(&tn->nametbl_lock);
>> +        } else {
>> +            /* Otherwise, be backwards compatible */
>> +            tipc_node_broadcast(net, skb);
>> +        }
>
> One or two calls to tipc_mcast_xmit(), just as above.
>
>>          return 1;
>>      }
>>      return 0;
>> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>>      INIT_LIST_HEAD(&nt->cluster_scope);
>>      rwlock_init(&nt->cluster_scope_lock);
>>      tn->nametbl = nt;
>> +    /* 'bulk' updated messages via unicast */
>> +    nt->rcast = true;
>> +    skb_queue_head_init(&nt->defer_namedq);
>
> Not needed. Now, node->namedq *is* the deferred queue.
>
>>      spin_lock_init(&tn->nametbl_lock);
>>      return 0;
>>  }
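For illustration, the "one or two calls to tipc_mcast_xmit()" logic
Jon describes above could be modeled as below. struct buf, clone_buf(),
xmit() and named_xmit() are stand-ins invented here for sk_buff,
skb_clone() and tipc_mcast_xmit():

/* Rough model of the "one or two calls" idea: send once per
 * non-empty destination list, cloning the buffer only when both
 * lists need it. All names are stand-ins, not TIPC code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    char data[64];
};

static struct buf *clone_buf(const struct buf *b)
{
    struct buf *c = malloc(sizeof(*c));

    if (c)
        memcpy(c, b, sizeof(*c));
    return c;
}

/* Stand-in for tipc_mcast_xmit(); consumes the buffer */
static void xmit(struct buf *b, const uint32_t *dests, int n,
                 bool rcast, bool mandatory)
{
    /* ... transmit to the n destinations here ... */
    free(b);
}

static void named_xmit(struct buf *b,
                       const uint32_t *rcast_list, int nr,
                       const uint32_t *bcast_list, int nb)
{
    struct buf *c = NULL;

    if (!nr && !nb) {   /* no destinations at all */
        free(b);
        return;
    }
    if (nr && nb)       /* both paths need a copy */
        c = clone_buf(b);

    if (nr)             /* legacy nodes: replicast, mandatory */
        xmit(b, rcast_list, nr, true, true);
    else
        c = b;          /* no legacy nodes: bcast path gets the original */

    if (nb && c)        /* capable nodes: let the method auto-select */
        xmit(c, bcast_list, nb, false, false);
}

Keeping the bcast-side call non-mandatory is what lets the method keep
selecting replicast while bcast-capable nodes are still a minority, as
noted above.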
>> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>>       * publications, then release the name table
>>       */
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    skb_queue_purge(&nt->defer_namedq);
>>      for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>>          if (hlist_empty(&nt->services[i]))
>>              continue;
>> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
>> index f79066334cc8..b8cdf2a29d48 100644
>> --- a/net/tipc/name_table.h
>> +++ b/net/tipc/name_table.h
>> @@ -95,6 +95,8 @@ struct publication {
>>   * - used by name_distr to send bulk updates to new nodes
>>   * - used by name_distr during re-init of name table
>>   * @local_publ_count: number of publications issued by this node
>> + * @defer_namedq: temporary queue for 'synching' updates
>> + * @rcast: previous method used to publish/withdraw a service
>>   */
>>  struct name_table {
>>      struct hlist_head services[TIPC_NAMETBL_SIZE];
>>      struct list_head cluster_scope;
>>      rwlock_t cluster_scope_lock;
>>      u32 local_publ_count;
>> +    struct sk_buff_head defer_namedq;
>> +    bool rcast;
>>  };
>>
>>  int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
>> diff --git a/net/tipc/node.c b/net/tipc/node.c
>> index aaf595613e6e..b058647fa78b 100644
>> --- a/net/tipc/node.c
>> +++ b/net/tipc/node.c
>> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>>      }
>>      rcu_read_unlock();
>>  }
>> +
>> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
>> +                      bool *rcast)
>> +{
>> +    struct tipc_mc_method method = {.rcast = *rcast};
>> +    struct sk_buff_head xmitq;
>> +    struct tipc_nlist dests;
>> +    struct tipc_node *n;
>> +    u16 cong_link_cnt;
>> +    int rc = 0;
>> +
>> +    __skb_queue_head_init(&xmitq);
>> +    __skb_queue_tail(&xmitq, skb);
>> +
>> +    tipc_nlist_init(&dests, tipc_own_addr(net));
>> +    rcu_read_lock();
>> +    list_for_each_entry_rcu(n, tipc_nodes(net), list) {
>> +        if (in_own_node(net, n->addr))
>> +            continue;
>> +        if (!node_is_up(n))
>> +            continue;
>> +        tipc_nlist_add(&dests, n->addr);
>> +    }
>> +    rcu_read_unlock();
>> +
>> +    rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
>> +    *rcast = method.rcast;
>> +
>> +    tipc_nlist_purge(&dests);
>> +    __skb_queue_purge(&xmitq);
>> +    return rc;
>> +}
>
> Not needed any more.
>
>> diff --git a/net/tipc/node.h b/net/tipc/node.h
>> index a6803b449a2c..d7d19f9932b1 100644
>> --- a/net/tipc/node.h
>> +++ b/net/tipc/node.h
>> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
>>  int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
>>  #endif
>>  void tipc_node_pre_cleanup_net(struct net *exit_net);
>> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
>> +                      bool *rcast);
>>  #endif
>
> There may of course be flaws or potential for improvements to this,
> but to me this should solve our problem without too much new
> complexity and code.
>
> ///jon