From: Jon M. <jm...@re...> - 2020-05-12 21:21:42
cc to tipc-discussion and others.  ///jon

-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Tue, 12 May 2020 14:38:29 -0400
From: Jon Maloy <jm...@re...>
To: Hoang Huu Le <hoa...@de...>, ma...@do... <ma...@do...>

Hi Hoang,
I had to re-read the discussion we had back in November to refresh my memory about this. To me, v2 looks like a better base for further development than the most recent version, so I am sorry to have wasted your time here.

I am still convinced that the mcast/rcast synch protocol we developed can solve almost all our problems if we use it right. What about the following:

Sending side:
- Even the initial bulk is sent via tipc_mcast_xmit(), but with only one destination node and as "mandatory rcast". The broadcast link may still choose to send it as bcast in case rcast has been manually disabled by the user, but the surplus messages will be dropped on the non-destination nodes. (In the case of legacy nodes this is certain; in the case of newer nodes we must make sure it happens by setting the 'destnode' field in the header and always checking for this in tipc_named_rcv().) I think this is a weird and rare case where we can accept the extra overhead of broadcast. Under all normal conditions rcast will be used here.
- Regular PUBLISH/WITHDRAW messages are sent as "mandatory bcast", or, if some nodes don't support TIPC_MCAST_RBCTL, as "mandatory rcast". The broadcast link may still choose to send it as rcast, but this is nothing we need to worry about as long as the synchronization mechanism is active. Under all normal conditions bcast will be selected when possible, since all nodes are destinations.

Receiving side:
- We need a deferred queue, but last time I missed that we don't need one per node; it is sufficient to have a global one, as you were suggesting. So to me the code is ok here.
- As mentioned earlier, tipc_named_rcv() only accepts messages where destnode is "self" (bulk) or zero (publish/withdraw).

The point with all this is that all messages will now be subject to the synchronization mechanism, so the bulk messages are guaranteed to be delivered in order and ahead of individual publications/withdraws, and even topology changes will almost never lead to a change of sending method and a need for synchronization. We must make one change to the broadcast protocol though: even "mandatory" messages must be made subject to the synchronization mechanism, something that doesn't seem to be the case now. (Or maybe we don't need to set "mandatory" at all? This needs to be checked.)

The advantage of this method is simplicity:
- We don't need any separate synchronization mechanism for NAME_DISTR messages at all; we just use what is already there.

The disadvantage is that there will be a lot of SYN messages sent if we are not careful:
- Every time a new node comes up, the other nodes will send it a bcast SYN before they send the bulk, because they believe they have just switched from bcast (normal mode) to rcast (bulk mode). This one is in reality unnecessary, since we can be sure that the new node has not been sent any previous bcast that needs to be synched with.
- At the first publication/withdraw sent after the bulk, the neighbor nodes will send an rcast SYN to all other nodes, because they just "switched" back from rcast to bcast. In this case we really do need to send an rcast SYN to the new node that came up, since this is the only one where there is a risk of a race. This message in reality serves as an "end-of-bulk" message, but it is handled like any other SYN by the receiving tipc_mcast_filter_rcv().

So, to avoid this, the broadcast protocol needs to be able to recognize NAME_DISTR messages and treat them slightly differently from the others, or it needs to be told what to do by the name_distr.c code via the API.
Maybe a couple of new fields in struct tipc_mc_method? What do you think?

Regards
///jon

On 5/12/20 6:22 AM, Hoang Huu Le wrote:
> Just forward the patch I mentioned.
>
> -----Original Message-----
> From: Hoang Le <hoa...@de...>
> Sent: Tuesday, November 19, 2019 5:01 PM
> To: jon...@er...; ma...@do...; tip...@de...
> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Currently, updating the binding table (adding a service binding to the
> name table/withdrawing a service binding) is done over replicast.
> However, if we scale clusters up to > 100 nodes/containers, this
> method is inefficient because we loop through the nodes in a cluster
> one by one.
>
> It is worth using broadcast to update a binding service; the binding
> tables on all nodes are then updated in one shot.
>
> The mechanism is backwards compatible, as sync messages are silently
> dropped.
>
> v2: resolve synchronization problem when switching from unicast to
> broadcast
>
> Signed-off-by: Hoang Le <hoa...@de...>
> ---
>  net/tipc/bcast.c      |  3 ++-
>  net/tipc/link.c       |  6 ++++++
>  net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
>  net/tipc/name_table.h |  4 ++++
>  net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
>  net/tipc/node.h       |  2 ++
>  6 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index e06f05d55534..44ed481fec47 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>  	hdr = buf_msg(skb);
>  	if (msg_user(hdr) == MSG_FRAGMENTER)
>  		hdr = msg_inner_hdr(hdr);
> -	if (msg_type(hdr) != TIPC_MCAST_MSG)
> +	if (msg_user(hdr) != NAME_DISTRIBUTOR &&
> +	    msg_type(hdr) != TIPC_MCAST_MSG)
>  		return 0;
>
>  	/* Allocate dummy message */
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index fb72031228c9..a2e9a64d5a0f 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>  			    struct sk_buff_head *inputq)
>  {
>  	struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
> +	struct name_table *nt = tipc_name_table(l->net);
> +	struct sk_buff_head *defnq = &nt->defer_namedq;
>  	struct tipc_msg *hdr = buf_msg(skb);
>
>  	switch (msg_user(hdr)) {
> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>  	case NAME_DISTRIBUTOR:
>  		l->bc_rcvlink->state = LINK_ESTABLISHED;
>  		skb_queue_tail(l->namedq, skb);
> +
> +		spin_lock_bh(&defnq->lock);
> +		tipc_mcast_filter_msg(l->net, defnq, l->namedq);
> +		spin_unlock_bh(&defnq->lock);
>  		return true;
>  	case MSG_BUNDLER:
>  	case TUNNEL_PROTOCOL:
> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
> index 66a65c2cdb23..593dcd11357f 100644
> --- a/net/tipc/name_table.c
> +++ b/net/tipc/name_table.c
> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>  	struct tipc_net *tn = tipc_net(net);
>  	struct publication *p = NULL;
>  	struct sk_buff *skb = NULL;
> +	bool rcast;
>
>  	spin_lock_bh(&tn->nametbl_lock);
> +	rcast = nt->rcast;
>  	if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>  		pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>  		goto exit;
> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>  exit:
>  	spin_unlock_bh(&tn->nametbl_lock);
>
> -	if (skb)
> -		tipc_node_broadcast(net, skb);
> +	if (skb) {
> +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
> +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +			tipc_node_broadcast_named_publish(net, skb, &rcast);
> +			spin_lock_bh(&tn->nametbl_lock);
> +			nt->rcast = rcast;
> +			spin_unlock_bh(&tn->nametbl_lock);
> +		} else {
> +			/* Otherwise, be backwards compatible */
> +			tipc_node_broadcast(net, skb);
> +		}
> +	}
>  	return p;
>  }
> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>  	u32 self = tipc_own_addr(net);
>  	struct sk_buff *skb = NULL;
>  	struct publication *p;
> +	bool rcast;
>
>  	spin_lock_bh(&tn->nametbl_lock);
> +	rcast = nt->rcast;
>  	p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>  	if (p) {
> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>  	spin_unlock_bh(&tn->nametbl_lock);
>
>  	if (skb) {
> -		tipc_node_broadcast(net, skb);
> +		/* Use broadcast if all nodes support broadcast NAME_DISTR */
> +		if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +			tipc_node_broadcast_named_publish(net, skb, &rcast);
> +			spin_lock_bh(&tn->nametbl_lock);
> +			nt->rcast = rcast;
> +			spin_unlock_bh(&tn->nametbl_lock);
> +		} else {
> +			/* Otherwise, be backwards compatible */
> +			tipc_node_broadcast(net, skb);
> +		}
>  		return 1;
>  	}
>  	return 0;
> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>  	INIT_LIST_HEAD(&nt->cluster_scope);
>  	rwlock_init(&nt->cluster_scope_lock);
>  	tn->nametbl = nt;
> +	/* 'bulk' updated messages via unicast */
> +	nt->rcast = true;
> +	skb_queue_head_init(&nt->defer_namedq);
>  	spin_lock_init(&tn->nametbl_lock);
>  	return 0;
>  }
> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>  	 * publications, then release the name table
>  	 */
>  	spin_lock_bh(&tn->nametbl_lock);
> +	skb_queue_purge(&nt->defer_namedq);
>  	for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>  		if (hlist_empty(&nt->services[i]))
>  			continue;
> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
> index f79066334cc8..b8cdf2a29d48 100644
> --- a/net/tipc/name_table.h
> +++ b/net/tipc/name_table.h
> @@ -95,6 +95,8 @@ struct publication {
>   *         - used by name_distr to send bulk updates to new nodes
>   *         - used by name_distr during re-init of name table
>   * @local_publ_count: number of publications issued by this node
> + * @defer_namedq: temporary queue for 'synching' updates
> + * @rcast: previous method used to publish/withdraw a service
>   */
>  struct name_table {
>  	struct hlist_head services[TIPC_NAMETBL_SIZE];
> @@ -102,6 +104,8 @@ struct name_table {
>  	struct list_head cluster_scope;
>  	rwlock_t cluster_scope_lock;
>  	u32 local_publ_count;
> +	struct sk_buff_head defer_namedq;
> +	bool rcast;
>  };
>
>  int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index aaf595613e6e..b058647fa78b 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>  	}
>  	rcu_read_unlock();
>  }
> +
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +				      bool *rcast)
> +{
> +	struct tipc_mc_method method = {.rcast = *rcast};
> +	struct sk_buff_head xmitq;
> +	struct tipc_nlist dests;
> +	struct tipc_node *n;
> +	u16 cong_link_cnt;
> +	int rc = 0;
> +
> +	__skb_queue_head_init(&xmitq);
> +	__skb_queue_tail(&xmitq, skb);
> +
> +	tipc_nlist_init(&dests, tipc_own_addr(net));
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(n, tipc_nodes(net), list) {
> +		if (in_own_node(net, n->addr))
> +			continue;
> +		if (!node_is_up(n))
> +			continue;
> +		tipc_nlist_add(&dests, n->addr);
> +	}
> +	rcu_read_unlock();
> +
> +	rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
> +	*rcast = method.rcast;
> +
> +	tipc_nlist_purge(&dests);
> +	__skb_queue_purge(&xmitq);
> +	return rc;
> +}
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index a6803b449a2c..d7d19f9932b1 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
>  int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
>  #endif
>  void tipc_node_pre_cleanup_net(struct net *exit_net);
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +				      bool *rcast);
>  #endif