From: Jon M. <jm...@re...> - 2020-06-04 13:59:33
On 6/4/20 5:14 AM, Hoang Huu Le wrote:
> Hi Jon,
>
> Please see my inline comments.
>
> Regards,
> Hoang
>
> *From:* Jon Maloy <jm...@re...>
> *Sent:* Friday, May 29, 2020 11:11 PM
> *To:* tip...@li...; tipc-dek <tip...@de...>;
>   Tuong Tong Lien <tuo...@de...>; Xin Long <luc...@gm...>;
>   Tung Quang Nguyen <tun...@de...>; Ying Xue <yin...@wi...>
> *Subject:* Fwd: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Added more recipients.
>
> -------- Forwarded Message --------
> *Subject:* Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
> *Date:* Fri, 29 May 2020 12:08:02 -0400
> *From:* Jon Maloy <jm...@re...>
> *To:* Hoang Huu Le <hoa...@de...>, ma...@do...
>
> Hi Hoang,
> See below.
>
> On 5/27/20 6:49 AM, Hoang Huu Le wrote:
>> Hi Jon,
>>
>> I have a DRAFT version based on your idea (attached file). But from
>> my point of view, this version introduces too much code on the
>> sending side. I don't think it is a bright idea to keep rcast_list
>> and bcast_list in the name table. I think we should find a new way,
>> or just drop the feature.
>
> Yes, you are right.
> I came up with a new idea: just add a sequence number to the
> broadcast/replicast messages and re-order them at reception. This
> even handles the case where the first broadcast message arrives
> before the first bulk message, something we had not anticipated
> before.
> I couldn't resist the temptation to code it, as you can see in the
> patch I just sent out. It is totally untested; I just added the code
> as I thought it should be and made sure it compiled.
> There is still a little too much new code for my taste, but this
> might be a way forward. Please give your feedback on this.
>
> I also noticed a couple of things while working with this:
>
> 1) There is still an 'expires' field in tipc_mcast_method, and it
> seems to even be considered when switching between bcast and rcast.
> The whole point of adding the mcast synchronization mechanism was to
> get rid of this delay. Have you tested that synchronization really
> works without the 'expires'?
>
> [Hoang] Yes, I did. I issued the commands below to force rcast/bcast
> regardless of 'expires':
>
>   $ tipc link set bro REPLICAST
> or
>   $ tipc link set bro BROADCAST

Yes, but what happens when the protocol selects by itself, based on the
number of destinations, and this number has just passed a threshold?

> 2) There are some remnants of old code for the name table dist_queue.
> This functionality was made redundant by me at least two years ago,
> so this should be cleaned up.
>
> [Hoang] I will check and clean that up in a separate commit.
>
> 3) We might have a potential race condition when new nodes come up,
> so that publications are distributed twice:
>    a) A publication is added to the name table, and the name table
>       lock is released.
>    b) A new node comes up, and the new publication is delivered in
>       the bulk message.
>    c) The broadcast of the publication goes ahead and sends it out
>       to all nodes, even the one that just came up.
>    d) We end up with a double publication on one of the nodes.
>    e) One of those will linger in the name table after the
>       publication is withdrawn.
> I have never seen this happen, and my analysis might be wrong, but
> to me this looks like a possible scenario.
> Note that my patch doesn't fix this, but we could possibly arrange
> it by adding a 'distributed' flag in the publication item on the
> sending side, so that the bulk distribution will ignore it.
>
> [Hoang] I tried to simulate your scenario on 8 nodes, publishing more
> than 100 services at the same time while bringing an interface
> down/up. Sometimes I got the error logs below:
>
>   [ 537.322414] tipc: Failed to remove binding 1000,2 from 1001001
>   […]
>   [ 537.358957] tipc: Failed to remove binding 1000,11 from 1001001
>
> I'm not sure whether the above counts as a bug or not. If it does,
> we should fix it in another commit too.

This is not what I expected, but it might be another manifestation of
the same problem. We are probably observing a replicast withdraw
arriving before the corresponding bulk publication. If you see a
binding <1000,2> in the table after this printout, that would be a
confirmation.

For my scenario: do you see duplicate publication instances before you
do the withdraw? Do you see a lingering publication after a withdraw?

Luckily, all of this will be fixed with the new broadcast distribution
protocol.

Regards
///jon
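For illustration, the sequence-number reordering idea described above
can be modeled in a few lines of user-space C. This is only a sketch of
the concept, not TIPC code; the names here (struct msg, after(),
defer(), release()) are invented:

/* Minimal user-space model of the sequence-number idea: defer
 * out-of-order messages and release them in order. Illustrative
 * names only, not actual TIPC code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct msg {
    uint16_t seqno;
    struct msg *next;
};

/* Wraparound-safe test: does 'a' come after 'b'? */
static int after(uint16_t a, uint16_t b)
{
    return (int16_t)(a - b) > 0;
}

/* Insert a message into the deferred queue, kept sorted by seqno */
static void defer(struct msg **q, struct msg *m)
{
    while (*q && !after((*q)->seqno, m->seqno))
        q = &(*q)->next;
    m->next = *q;
    *q = m;
}

/* Deliver every consecutive message starting at *expect */
static void release(struct msg **q, uint16_t *expect)
{
    while (*q && (*q)->seqno == *expect) {
        struct msg *m = *q;

        *q = m->next;
        printf("deliver seqno %u\n", m->seqno);
        free(m);
        (*expect)++;
    }
}

On reception, a message with an unexpected sequence number would be
parked with defer(), and release() would deliver any consecutive run
once the expected message arrives. The (int16_t) cast keeps the
comparison correct across sequence-number wraparound.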
> Regards
> ///jon
>
> Regards,
> Hoang
>
> -----Original Message-----
> From: Jon Maloy <jm...@re...>
> Sent: Thursday, May 21, 2020 9:44 PM
> To: Hoang Huu Le <hoa...@de...>; ma...@do...
> Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Hi,
> I have one more comment below. Looking forward to your feedback.
> ///jon
>
> On 5/20/20 9:03 PM, Hoang Huu Le wrote:
>> Yeah, thanks Jon.
>> I will investigate more on your idea when I finish the issue with
>> the lab infrastructure.
>>
>> Hoang
>
> -----Original Message-----
> From: Jon Maloy <jm...@re...>
> Sent: Thursday, May 21, 2020 7:13 AM
> To: Hoang Huu Le <hoa...@de...>; ma...@do...
> Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Hi Hoang,
> Below I try to summarize my newest proposal in relation to v2 of your
> patch.
>
> 1) The bulk can be sent just as is done now, with the addition that
>    we add the NOT_LAST bit to the header.
> 2) We need a new capability bit identifying nodes which support
>    broadcast NAME_DISTR. I don't see that we can just reuse
>    TIPC_MCAST_RBCTL, because the recipients need to have code for
>    handling concurrent bulk/broadcast receptions. This bit is not
>    added to the cluster capability word.
> 3) We need to keep two structs of type tipc_nlist in the name table:
>    one containing tipc_dest structures for all nodes NOT supporting
>    TIPC_MCAST_NAMEDISTR (rcast_list), and the other for those which
>    do (bcast_list).
>
> For more comments see below.
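For illustration, the destination-list split proposed in points 2) and
3) above could look roughly like the user-space sketch below.
CAP_MCAST_NAMEDISTR, struct node_info and sort_dests() are invented
names; the real code would use the node capability flags and struct
tipc_nlist:

/* Illustrative model of points 2) and 3): split the cluster nodes
 * into a replicast list and a broadcast list based on a per-node
 * capability bit. All names here are hypothetical.
 */
#include <stdint.h>

#define CAP_MCAST_NAMEDISTR (1u << 15)  /* hypothetical bit */

struct node_info {
    uint32_t addr;
    uint32_t capabilities;
};

/* Distribute node addresses into rcast_list (legacy nodes) and
 * bcast_list (nodes supporting broadcast NAME_DISTR).
 */
static void sort_dests(const struct node_info *nodes, int n,
                       uint32_t *rcast_list, int *nr,
                       uint32_t *bcast_list, int *nb)
{
    int i;

    *nr = 0;
    *nb = 0;
    for (i = 0; i < n; i++) {
        if (nodes[i].capabilities & CAP_MCAST_NAMEDISTR)
            bcast_list[(*nb)++] = nodes[i].addr;
        else
            rcast_list[(*nr)++] = nodes[i].addr;
    }
}

The sender would then transmit once per non-empty list: replicast to
rcast_list, and broadcast (or replicast, at the protocol's discretion)
to bcast_list.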
> On 5/12/20 6:22 AM, Hoang Huu Le wrote:
>> Just forwarding the patch I mentioned.
>>
>> -----Original Message-----
>> From: Hoang Le <hoa...@de...>
>> Sent: Tuesday, November 19, 2019 5:01 PM
>> To: jon...@er...; ma...@do...; tip...@de...
>> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>>
>> Currently, updating the binding table (adding a service binding to
>> the name table/withdrawing a service binding) is sent over
>> replicast. However, if we are scaling up clusters to > 100
>> nodes/containers, this method is less efficient because it loops
>> through the nodes in a cluster one by one.
>>
>> It is worthwhile to use broadcast to update a binding service
>> instead. Then the binding table is updated on all nodes in one shot.
>>
>> The mechanism is backwards compatible, as the sync message is
>> silently dropped.
>>
>> v2: resolve the synchronization problem when switching from unicast
>> to broadcast.
>>
>> Signed-off-by: Hoang Le <hoa...@de...>
>> ---
>>  net/tipc/bcast.c      |  3 ++-
>>  net/tipc/link.c       |  6 ++++++
>>  net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
>>  net/tipc/name_table.h |  4 ++++
>>  net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
>>  net/tipc/node.h       |  2 ++
>>  6 files changed, 76 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
>> index e06f05d55534..44ed481fec47 100644
>> --- a/net/tipc/bcast.c
>> +++ b/net/tipc/bcast.c
>> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>>      hdr = buf_msg(skb);
>>      if (msg_user(hdr) == MSG_FRAGMENTER)
>>          hdr = msg_inner_hdr(hdr);
>> -    if (msg_type(hdr) != TIPC_MCAST_MSG)
>> +    if (msg_user(hdr) != NAME_DISTRIBUTOR &&
>> +        msg_type(hdr) != TIPC_MCAST_MSG)
>>          return 0;
>>
>>      /* Allocate dummy message */
>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>> index fb72031228c9..a2e9a64d5a0f 100644
>> --- a/net/tipc/link.c
>> +++ b/net/tipc/link.c
>> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>>                  struct sk_buff_head *inputq)
>>  {
>>      struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
>> +    struct name_table *nt = tipc_name_table(l->net);
>> +    struct sk_buff_head *defnq = &nt->defer_namedq;
>>      struct tipc_msg *hdr = buf_msg(skb);
>>
>>      switch (msg_user(hdr)) {
>> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>>      case NAME_DISTRIBUTOR:
>>          l->bc_rcvlink->state = LINK_ESTABLISHED;
>>          skb_queue_tail(l->namedq, skb);
>> +
>> +        spin_lock_bh(&defnq->lock);
>> +        tipc_mcast_filter_msg(l->net, defnq, l->namedq);
>
> Should not be needed here.
> You can instead do this in tipc_named_rcv(), using l->namedq as the
> deferred queue and creating a temporary namedq queue on the stack for
> the messages ready to be delivered.
> We sort the messages in three steps:
> 1) If there are any chains of bulk messages, we sort those per source
>    node into the temporary namedq and deliver them first, when a
>    chain is complete.
> 2) If we find that a chain is incomplete, we push it back to the head
>    of n->namedq and return without further action.
> 3) When there are no bulk messages left in n->namedq, we call
>    tipc_mcast_filter_msgs() to sort the remaining messages into the
>    temporary namedq, as far as possible, and deliver those which are
>    ready to be delivered.
>
>> +        spin_unlock_bh(&defnq->lock);
>>          return true;
>>      case MSG_BUNDLER:
>>      case TUNNEL_PROTOCOL:
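For illustration, the chain-completeness test implied by steps 1) and
2) above can be modeled like this in user-space C. FLAG_NOT_LAST and
the other names are invented; the real code would inspect the NOT_LAST
header bit mentioned earlier in the thread:

/* User-space model of steps 1) and 2): a per-node bulk chain is
 * only deliverable once its final message (NOT_LAST bit cleared)
 * has arrived. All names here are invented for illustration.
 */
#include <stdbool.h>

#define FLAG_NOT_LAST 0x1   /* hypothetical header bit */

struct bulk_msg {
    unsigned int node;      /* originating node address */
    unsigned int flags;
    struct bulk_msg *next;
};

/* True if the queued chain from 'node' ends with a last message */
static bool chain_complete(const struct bulk_msg *q, unsigned int node)
{
    bool complete = false;

    for (; q; q = q->next)
        if (q->node == node)
            complete = !(q->flags & FLAG_NOT_LAST);
    return complete;
}

A receiver would deliver the chain for a node only when
chain_complete() returns true, and otherwise push the messages back
and wait for the rest of the bulk.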
>> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
>> index 66a65c2cdb23..593dcd11357f 100644
>> --- a/net/tipc/name_table.c
>> +++ b/net/tipc/name_table.c
>> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>>      struct tipc_net *tn = tipc_net(net);
>>      struct publication *p = NULL;
>>      struct sk_buff *skb = NULL;
>> +    bool rcast;
>>
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    rcast = nt->rcast;
>>      if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>>          pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>>          goto exit;
>> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>>  exit:
>>      spin_unlock_bh(&tn->nametbl_lock);
>>
>> -    if (skb)
>> -        tipc_node_broadcast(net, skb);
>
> Here we make two calls to tipc_mcast_xmit(), one with
> method->rcast/mandatory and rcast_list, if not empty, and one with
> method->bcast/mandatory and bcast_list, if not empty.
>
> Actually, the latter should not be mandatory. We can easily imagine a
> situation where we start out with only legacy nodes, and then upgrade
> to bcast nodes, one by one.
>
> In the beginning we want tipc_mcast_xmit() to select rcast even for
> the bcast nodes, and then, as their proportion of the cluster grows,
> it should switch to bcast. This does mean that the method struct for
> bcast must be kept between the calls, i.e., as another member of
> struct name_table.
>
> ///jon
>
> skb must of course be cloned if necessary.
>
>> +    if (skb) {
>> +        /* Use broadcast if all nodes support broadcast NAME_DISTR */
>> +        if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
>> +            tipc_node_broadcast_named_publish(net, skb, &rcast);
>> +            spin_lock_bh(&tn->nametbl_lock);
>> +            nt->rcast = rcast;
>> +            spin_unlock_bh(&tn->nametbl_lock);
>> +        } else {
>> +            /* Otherwise, be backwards compatible */
>> +            tipc_node_broadcast(net, skb);
>> +        }
>
> Not needed according to the above.
>
>>      return p;
>>  }
>> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>>      u32 self = tipc_own_addr(net);
>>      struct sk_buff *skb = NULL;
>>      struct publication *p;
>> +    bool rcast;
>>
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    rcast = nt->rcast;
>>      p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>>      if (p) {
>> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>>      spin_unlock_bh(&tn->nametbl_lock);
>>
>>      if (skb) {
>> -        tipc_node_broadcast(net, skb);
>> +        /* Use broadcast if all nodes support broadcast NAME_DISTR */
>> +        if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
>> +            tipc_node_broadcast_named_publish(net, skb, &rcast);
>> +            spin_lock_bh(&tn->nametbl_lock);
>> +            nt->rcast = rcast;
>> +            spin_unlock_bh(&tn->nametbl_lock);
>> +        } else {
>> +            /* Otherwise, be backwards compatible */
>> +            tipc_node_broadcast(net, skb);
>> +        }
>
> One or two calls to tipc_mcast_xmit(), just as above.
>
>>          return 1;
>>      }
>>      return 0;
>> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>>      INIT_LIST_HEAD(&nt->cluster_scope);
>>      rwlock_init(&nt->cluster_scope_lock);
>>      tn->nametbl = nt;
>> +    /* 'bulk' updated messages via unicast */
>> +    nt->rcast = true;
>> +    skb_queue_head_init(&nt->defer_namedq);
>
> Not needed. Now, node->namedq *is* the deferred queue.
>
>>      spin_lock_init(&tn->nametbl_lock);
>>      return 0;
>>  }
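For illustration, the "one or two calls to tipc_mcast_xmit()" logic
Jon describes above could be modeled as below. struct buf, clone_buf(),
xmit() and named_xmit() are stand-ins invented here for sk_buff,
skb_clone() and tipc_mcast_xmit():

/* Rough model of the "one or two calls" idea: send once per
 * non-empty destination list, cloning the buffer only when both
 * lists need it. All names are stand-ins, not TIPC code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    char data[64];
};

static struct buf *clone_buf(const struct buf *b)
{
    struct buf *c = malloc(sizeof(*c));

    if (c)
        memcpy(c, b, sizeof(*c));
    return c;
}

/* Stand-in for tipc_mcast_xmit(); consumes the buffer */
static void xmit(struct buf *b, const uint32_t *dests, int n,
                 bool rcast, bool mandatory)
{
    /* ... transmit to the n destinations here ... */
    free(b);
}

static void named_xmit(struct buf *b,
                       const uint32_t *rcast_list, int nr,
                       const uint32_t *bcast_list, int nb)
{
    struct buf *c = NULL;

    if (!nr && !nb) {   /* no destinations at all */
        free(b);
        return;
    }
    if (nr && nb)       /* both paths need a copy */
        c = clone_buf(b);

    if (nr)             /* legacy nodes: replicast, mandatory */
        xmit(b, rcast_list, nr, true, true);
    else
        c = b;          /* no legacy nodes: bcast path gets the original */

    if (nb && c)        /* capable nodes: let the method auto-select */
        xmit(c, bcast_list, nb, false, false);
}

Keeping the bcast-side call non-mandatory is what lets the method keep
selecting replicast while bcast-capable nodes are still a minority, as
noted above.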
>> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>>       * publications, then release the name table
>>       */
>>      spin_lock_bh(&tn->nametbl_lock);
>> +    skb_queue_purge(&nt->defer_namedq);
>>      for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>>          if (hlist_empty(&nt->services[i]))
>>              continue;
>> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
>> index f79066334cc8..b8cdf2a29d48 100644
>> --- a/net/tipc/name_table.h
>> +++ b/net/tipc/name_table.h
>> @@ -95,6 +95,8 @@ struct publication {
>>   * - used by name_distr to send bulk updates to new nodes
>>   * - used by name_distr during re-init of name table
>>   * @local_publ_count: number of publications issued by this node
>> + * @defer_namedq: temporary queue for 'synching' updates
>> + * @rcast: previous method used to publish/withdraw a service
>>   */
>>  struct name_table {
>>      struct hlist_head services[TIPC_NAMETBL_SIZE];
>>      struct list_head cluster_scope;
>>      rwlock_t cluster_scope_lock;
>>      u32 local_publ_count;
>> +    struct sk_buff_head defer_namedq;
>> +    bool rcast;
>>  };
>>
>>  int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
>> diff --git a/net/tipc/node.c b/net/tipc/node.c
>> index aaf595613e6e..b058647fa78b 100644
>> --- a/net/tipc/node.c
>> +++ b/net/tipc/node.c
>> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>>      }
>>      rcu_read_unlock();
>>  }
>> +
>> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
>> +                      bool *rcast)
>> +{
>> +    struct tipc_mc_method method = {.rcast = *rcast};
>> +    struct sk_buff_head xmitq;
>> +    struct tipc_nlist dests;
>> +    struct tipc_node *n;
>> +    u16 cong_link_cnt;
>> +    int rc = 0;
>> +
>> +    __skb_queue_head_init(&xmitq);
>> +    __skb_queue_tail(&xmitq, skb);
>> +
>> +    tipc_nlist_init(&dests, tipc_own_addr(net));
>> +    rcu_read_lock();
>> +    list_for_each_entry_rcu(n, tipc_nodes(net), list) {
>> +        if (in_own_node(net, n->addr))
>> +            continue;
>> +        if (!node_is_up(n))
>> +            continue;
>> +        tipc_nlist_add(&dests, n->addr);
>> +    }
>> +    rcu_read_unlock();
>> +
>> +    rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
>> +    *rcast = method.rcast;
>> +
>> +    tipc_nlist_purge(&dests);
>> +    __skb_queue_purge(&xmitq);
>> +    return rc;
>> +}
>
> Not needed any more.
>
>> diff --git a/net/tipc/node.h b/net/tipc/node.h
>> index a6803b449a2c..d7d19f9932b1 100644
>> --- a/net/tipc/node.h
>> +++ b/net/tipc/node.h
>> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
>>  int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
>>  #endif
>>  void tipc_node_pre_cleanup_net(struct net *exit_net);
>> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
>> +                      bool *rcast);
>>  #endif
>
> There may of course be flaws or potential for improvements to this,
> but to me this should solve our problem without too much new
> complexity and code.
>
> ///jon