From: Jon M. <jm...@re...> - 2020-05-04 16:04:10
On 5/4/20 7:28 AM, Tuong Lien wrote:
> Currently when a connection is in Nagle mode, we set the 'ack_required'
> bit in the last sending buffer and wait for the corresponding ACK prior
> to pushing more data. However, on the receiving side, the ACK is issued
> only when application actually gets /gets/reads/ the whole data. Even if
> part of the last buffer is received, we will not do the ACK as required.

Hmm, this sounds more like a bug than a weakness. Good that you caught it.

> This might cause an unnecessary delay since the receiver does not always
> fetch the message as fast as the sender, resulting in a large latency in
> the user message sending, which is [one RTT + the receiver processing
> time].
>
> The commit makes Nagle ACK as soon as possible i.e. when a message with
> the 'ack_required' arrives in the receiving side's stack even before it
> is processed / backlogged, put in the socket receive queue...
> This way, we can limit the streaming latency to one RTT as committed in
> Nagle mode.
>
> v2: optimize code
> v3: rebase to non debug
> v4: rename patch subject
>
> Signed-off-by: Tuong Lien <tuo...@de...>
> [full diff quoted unchanged -- see the original patch posting below]

Acked-by: Jon Maloy <jm...@re...>
From: Tuong L. <tuo...@de...> - 2020-05-04 12:50:47
When an application connects to the TIPC topology server and subscribes
to some services, a new connection is created along with some objects -
'tipc_subscription' - to store the related data correspondingly...
However, there is one omission in the connection handling: when the
connection or application is orderly shutdown (e.g. via SIGQUIT, etc.),
the connection is not closed in the kernel, and the 'tipc_subscription'
objects are not freed either. This results in:

- The maximum number of subscriptions (65535) will be reached soon, and
  new subscriptions will be rejected;
- The TIPC module cannot be removed (unless the objects are somehow
  forced to release first);

The commit fixes the issue by closing the connection if 'recvmsg()'
returns '0', i.e. when the peer is shutdown gracefully. It also covers
the other unexpected cases.

Acked-by: Jon Maloy <jm...@re...>
Acked-by: Ying Xue <yin...@wi...>
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/topsrv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index ad78f7cff379..c364335623ab 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -405,10 +405,11 @@ static int tipc_conn_rcv_from_sock(struct tipc_conn *con)
 		read_lock_bh(&sk->sk_callback_lock);
 		ret = tipc_conn_rcv_sub(srv, con, &s);
 		read_unlock_bh(&sk->sk_callback_lock);
+		if (!ret)
+			return 0;
 	}
-	if (ret < 0)
-		tipc_conn_close(con);
 
+	tipc_conn_close(con);
 	return ret;
 }
-- 
2.13.7
From: Tuong L. <tuo...@de...> - 2020-05-04 11:44:47
When streaming in Nagle mode, we try to bundle as many small messages
from the user as possible if there is one outstanding buffer, i.e. not
yet ACK-ed by the receiving side, which helps boost the overall
throughput. So, the algorithm's effectiveness really depends on when the
Nagle ACK comes and what the specific network latency (RTT) is, compared
to the user's message sending rate.

In a bad case, where the user's sending rate is low or the network
latency is small, there will not be many bundles, so making a Nagle ACK
or waiting for it is not meaningful. For example: if a user sends its
messages every 100ms and the RTT is 50ms, then for each message we
require one Nagle ACK, but there is only one user message sent without
any bundles.

In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages of medium size, there will not be any
difference at all; that is, 3 x 1000-byte data messages, if bundled,
will still result in 3 bundles with MTU = 1500.

When Nagle is ineffective, the delay in user message sending is clearly
wasted compared to sending directly. Besides, adding Nagle ACKs consumes
some processor load on both the sending and receiving sides.

This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.

In addition, the 'ack_required' bit is now only set when really needed,
so the number of Nagle ACKs is reduced during Nagle mode.

Testing with the benchmark tool showed that, with the patch, there was
not much difference in throughput for small messages, since the tool
continuously sends messages without a break, so Nagle still takes
effect.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/msg.c    |  3 ---
 net/tipc/msg.h    | 14 +++++++++++--
 net/tipc/socket.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 69d68512300a..732cd95b5c75 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -235,9 +235,6 @@ int tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen,
 			msg_set_size(hdr, MIN_H_SIZE);
 			__skb_queue_tail(txq, skb);
 			total += 1;
-			if (prev)
-				msg_set_ack_required(buf_msg(prev), 0);
-			msg_set_ack_required(hdr, 1);
 		}
 		hdr = buf_msg(skb);
 		curr = msg_blocks(hdr);
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index 5f37a611e8c9..44543892af11 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -340,9 +340,19 @@ static inline int msg_ack_required(struct tipc_msg *m)
 	return msg_bits(m, 0, 18, 1);
 }
 
-static inline void msg_set_ack_required(struct tipc_msg *m, u32 d)
+static inline void msg_set_ack_required(struct tipc_msg *m)
 {
-	msg_set_bits(m, 0, 18, 1, d);
+	msg_set_bits(m, 0, 18, 1, 1);
+}
+
+static inline int msg_nagle_ack(struct tipc_msg *m)
+{
+	return msg_bits(m, 0, 18, 1);
+}
+
+static inline void msg_set_nagle_ack(struct tipc_msg *m)
+{
+	msg_set_bits(m, 0, 18, 1, 1);
 }
 
 static inline bool msg_is_rcast(struct tipc_msg *m)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4e71774528ad..93b0a6159cb1 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -119,7 +119,10 @@ struct tipc_sock {
 	struct rcu_head rcu;
 	struct tipc_group *group;
 	u32 oneway;
+	u32 nagle_start;
 	u16 snd_backlog;
+	u16 msg_acc;
+	u16 pkt_cnt;
 	bool expect_ack;
 	bool nodelay;
 	bool group_is_open;
@@ -143,7 +146,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk);
 static void tipc_sk_remove(struct tipc_sock *tsk);
 static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz);
 static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz);
-static void tipc_sk_push_backlog(struct tipc_sock *tsk);
+static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack);
 
 static const struct proto_ops packet_ops;
 static const struct proto_ops stream_ops;
@@ -474,6 +477,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock,
 	tsk = tipc_sk(sk);
 	tsk->max_pkt = MAX_PKT_DEFAULT;
 	tsk->maxnagle = 0;
+	tsk->nagle_start = 4;
 	INIT_LIST_HEAD(&tsk->publications);
 	INIT_LIST_HEAD(&tsk->cong_links);
 	msg = &tsk->phdr;
@@ -541,7 +545,7 @@ static void __tipc_shutdown(struct socket *sock, int error)
 					    !tsk_conn_cong(tsk)));
 
 	/* Push out delayed messages if in Nagle mode */
-	tipc_sk_push_backlog(tsk);
+	tipc_sk_push_backlog(tsk, false);
 	/* Remove pending SYN */
 	__skb_queue_purge(&sk->sk_write_queue);
@@ -1252,14 +1256,37 @@ void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq,
 /* tipc_sk_push_backlog(): send accumulated buffers in socket write queue
  *                         when socket is in Nagle mode
  */
-static void tipc_sk_push_backlog(struct tipc_sock *tsk)
+static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack)
 {
 	struct sk_buff_head *txq = &tsk->sk.sk_write_queue;
+	struct sk_buff *skb = skb_peek_tail(txq);
 	struct net *net = sock_net(&tsk->sk);
 	u32 dnode = tsk_peer_node(tsk);
-	struct sk_buff *skb = skb_peek(txq);
 	int rc;
 
+	if (nagle_ack) {
+		tsk->pkt_cnt += skb_queue_len(txq);
+		if (!tsk->pkt_cnt || tsk->msg_acc / tsk->pkt_cnt < 2) {
+			tsk->oneway = 0;
+			if (tsk->nagle_start < 1000)
+				tsk->nagle_start *= 2;
+			tsk->expect_ack = false;
+			pr_debug("tsk %10u: bad nagle %u -> %u, next start %u!\n",
+				 tsk->portid, tsk->msg_acc, tsk->pkt_cnt,
+				 tsk->nagle_start);
+		} else {
+			tsk->nagle_start = 4;
+			if (skb) {
+				msg_set_ack_required(buf_msg(skb));
+				tsk->expect_ack = true;
+			} else {
+				tsk->expect_ack = false;
+			}
+		}
+		tsk->msg_acc = 0;
+		tsk->pkt_cnt = 0;
+	}
+
 	if (!skb || tsk->cong_link_cnt)
 		return;
 
@@ -1267,9 +1294,10 @@ static void tipc_sk_push_backlog(struct tipc_sock *tsk)
 	if (msg_is_syn(buf_msg(skb)))
 		return;
 
+	if (tsk->msg_acc)
+		tsk->pkt_cnt += skb_queue_len(txq);
 	tsk->snt_unacked += tsk->snd_backlog;
 	tsk->snd_backlog = 0;
-	tsk->expect_ack = true;
 	rc = tipc_node_xmit(net, txq, dnode, tsk->portid);
 	if (rc == -ELINKCONG)
 		tsk->cong_link_cnt = 1;
@@ -1322,8 +1350,7 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb,
 		return;
 	} else if (mtyp == CONN_ACK) {
 		was_cong = tsk_conn_cong(tsk);
-		tsk->expect_ack = false;
-		tipc_sk_push_backlog(tsk);
+		tipc_sk_push_backlog(tsk, msg_nagle_ack(hdr));
 		tsk->snt_unacked -= msg_conn_ack(hdr);
 		if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL)
 			tsk->snd_win = msg_adv_win(hdr);
@@ -1544,17 +1571,24 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen)
 			break;
 		send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE);
 		blocks = tsk->snd_backlog;
-		if (tsk->oneway++ >= 4 && send <= maxnagle) {
+		if (tsk->oneway++ >= tsk->nagle_start && send <= maxnagle) {
 			rc = tipc_msg_append(hdr, m, send, maxnagle, txq);
 			if (unlikely(rc < 0))
 				break;
 			blocks += rc;
+			tsk->msg_acc++;
 			if (blocks <= 64 && tsk->expect_ack) {
 				tsk->snd_backlog = blocks;
 				sent += send;
 				break;
+			} else if (blocks > 64) {
+				tsk->pkt_cnt += skb_queue_len(txq);
+			} else {
+				msg_set_ack_required(buf_msg(skb_peek_tail(txq)));
+				tsk->expect_ack = true;
+				tsk->msg_acc = 0;
+				tsk->pkt_cnt = 0;
 			}
-			tsk->expect_ack = true;
 		} else {
 			rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq);
 			if (unlikely(rc != send))
@@ -2092,7 +2126,7 @@ static void tipc_sk_proto_rcv(struct sock *sk,
 		smp_wmb();
 		tsk->cong_link_cnt--;
 		wakeup = true;
-		tipc_sk_push_backlog(tsk);
+		tipc_sk_push_backlog(tsk, false);
 		break;
 	case GROUP_PROTOCOL:
 		tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq);
@@ -2181,7 +2215,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb,
 		return false;
 	case TIPC_ESTABLISHED:
 		if (!skb_queue_empty(&sk->sk_write_queue))
-			tipc_sk_push_backlog(tsk);
+			tipc_sk_push_backlog(tsk, false);
 		/* Accept only connection-based messages sent by peer */
 		if (likely(con_msg && !err && pport == oport &&
 			   pnode == onode)) {
@@ -2189,8 +2223,10 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb,
 			struct sk_buff *skb;
 
 			skb = tipc_sk_build_ack(tsk);
-			if (skb)
+			if (skb) {
+				msg_set_nagle_ack(buf_msg(skb));
 				__skb_queue_tail(xmitq, skb);
+			}
 		}
 		return true;
 	}
-- 
2.13.7
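The ratio test above can be illustrated outside the kernel. Below is a minimal user-space sketch; the struct and helper are hypothetical, merely mirroring the 'msg_acc'/'pkt_cnt'/'nagle_start' fields introduced in the diff:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mirror of the per-connection state used above:
 * msg_acc     - user messages accumulated since the last Nagle ACK
 * pkt_cnt     - packets actually sent for those messages
 * nagle_start - one-way messages required before (re-)entering Nagle
 */
struct conn_state {
        unsigned int msg_acc;
        unsigned int pkt_cnt;
        unsigned int nagle_start;
        bool nagle_on;
};

/* Called on receipt of a Nagle ACK: keep Nagle only if bundling at
 * least halved the packet count (msg_acc / pkt_cnt >= 2); otherwise
 * leave Nagle and double the re-entry threshold (capped at 1000).
 */
static void on_nagle_ack(struct conn_state *c, unsigned int backlog_pkts)
{
        c->pkt_cnt += backlog_pkts;
        if (!c->pkt_cnt || c->msg_acc / c->pkt_cnt < 2) {
                c->nagle_on = false;            /* bad ratio: leave Nagle */
                if (c->nagle_start < 1000)
                        c->nagle_start *= 2;    /* penalty before re-entry */
        } else {
                c->nagle_start = 4;             /* good ratio: reset penalty */
        }
        c->msg_acc = 0;
        c->pkt_cnt = 0;
}

int main(void)
{
        struct conn_state c = { .msg_acc = 6, .pkt_cnt = 0,
                                .nagle_start = 4, .nagle_on = true };

        on_nagle_ack(&c, 2);    /* 6 msgs in 2 pkts: ratio 3, Nagle kept */
        printf("nagle_on=%d nagle_start=%u\n", c.nagle_on, c.nagle_start);
        return 0;
}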
From: Tuong L. <tuo...@de...> - 2020-05-04 11:43:43
Currently when a connection is in Nagle mode, we set the 'ack_required'
bit in the last sending buffer and wait for the corresponding ACK prior
to pushing more data. However, on the receiving side, the ACK is issued
only when the application actually gets the whole data. Even if part of
the last buffer is received, we will not do the ACK as required. This
might cause an unnecessary delay since the receiver does not always
fetch the message as fast as the sender, resulting in a large latency in
the user message sending, which is [one RTT + the receiver processing
time].

The commit makes Nagle ACK as soon as possible, i.e. when a message with
the 'ack_required' bit arrives in the receiving side's stack, even
before it is processed / backlogged, put in the socket receive queue...
This way, we can limit the streaming latency to one RTT as committed in
Nagle mode.

v2: optimize code
v3: rebase to non debug
v4: rename patch subject

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/socket.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 693e8902161e..4e71774528ad 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1739,22 +1739,21 @@ static int tipc_sk_anc_data_recv(struct msghdr *m, struct sk_buff *skb,
 	return 0;
 }
 
-static void tipc_sk_send_ack(struct tipc_sock *tsk)
+static struct sk_buff *tipc_sk_build_ack(struct tipc_sock *tsk)
 {
 	struct sock *sk = &tsk->sk;
-	struct net *net = sock_net(sk);
 	struct sk_buff *skb = NULL;
 	struct tipc_msg *msg;
 	u32 peer_port = tsk_peer_port(tsk);
 	u32 dnode = tsk_peer_node(tsk);
 
 	if (!tipc_sk_connected(sk))
-		return;
+		return NULL;
 	skb = tipc_msg_create(CONN_MANAGER, CONN_ACK, INT_H_SIZE, 0,
 			      dnode, tsk_own_node(tsk), peer_port,
 			      tsk->portid, TIPC_OK);
 	if (!skb)
-		return;
+		return NULL;
 	msg = buf_msg(skb);
 	msg_set_conn_ack(msg, tsk->rcv_unacked);
 	tsk->rcv_unacked = 0;
@@ -1764,7 +1763,20 @@ static void tipc_sk_send_ack(struct tipc_sock *tsk)
 		tsk->rcv_win = tsk_adv_blocks(tsk->sk.sk_rcvbuf);
 		msg_set_adv_win(msg, tsk->rcv_win);
 	}
-	tipc_node_xmit_skb(net, skb, dnode, msg_link_selector(msg));
+
+	return skb;
+}
+
+static void tipc_sk_send_ack(struct tipc_sock *tsk)
+{
+	struct sk_buff *skb;
+
+	skb = tipc_sk_build_ack(tsk);
+	if (!skb)
+		return;
+
+	tipc_node_xmit_skb(sock_net(&tsk->sk), skb, tsk_peer_node(tsk),
+			   msg_link_selector(buf_msg(skb)));
 }
 
 static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop)
@@ -1938,7 +1950,6 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m,
 	bool peek = flags & MSG_PEEK;
 	int offset, required, copy, copied = 0;
 	int hlen, dlen, err, rc;
-	bool ack = false;
 	long timeout;
 
 	/* Catch invalid receive attempts */
@@ -1983,7 +1994,6 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m,
 
 		/* Copy data if msg ok, otherwise return error/partial data */
 		if (likely(!err)) {
-			ack = msg_ack_required(hdr);
 			offset = skb_cb->bytes_read;
 			copy = min_t(int, dlen - offset, buflen - copied);
 			rc = skb_copy_datagram_msg(skb, hlen + offset, m, copy);
@@ -2011,7 +2021,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m,
 
 		/* Send connection flow control advertisement when applicable */
 		tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen);
-		if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)
+		if (tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)
 			tipc_sk_send_ack(tsk);
 
 		/* Exit if all requested data or FIN/error received */
@@ -2105,9 +2115,11 @@ static void tipc_sk_proto_rcv(struct sock *sk,
  * tipc_sk_filter_connect - check incoming message for a connection-based socket
  * @tsk: TIPC socket
  * @skb: pointer to message buffer.
+ * @xmitq: for Nagle ACK if any
  * Returns true if message should be added to receive queue, false otherwise
 */
-static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb)
+static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb,
+				   struct sk_buff_head *xmitq)
 {
 	struct sock *sk = &tsk->sk;
 	struct net *net = sock_net(sk);
@@ -2171,8 +2183,17 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb)
 		if (!skb_queue_empty(&sk->sk_write_queue))
 			tipc_sk_push_backlog(tsk);
 		/* Accept only connection-based messages sent by peer */
-		if (likely(con_msg && !err && pport == oport && pnode == onode))
+		if (likely(con_msg && !err && pport == oport &&
+			   pnode == onode)) {
+			if (msg_ack_required(hdr)) {
+				struct sk_buff *skb;
+
+				skb = tipc_sk_build_ack(tsk);
+				if (skb)
+					__skb_queue_tail(xmitq, skb);
+			}
 			return true;
+		}
 		if (!tsk_peer_msg(tsk, hdr))
 			return false;
 		if (!err)
@@ -2267,7 +2288,7 @@ static void tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb,
 	while ((skb = __skb_dequeue(&inputq))) {
 		hdr = buf_msg(skb);
 		limit = rcvbuf_limit(sk, skb);
-		if ((sk_conn && !tipc_sk_filter_connect(tsk, skb)) ||
+		if ((sk_conn && !tipc_sk_filter_connect(tsk, skb, xmitq)) ||
 		    (!sk_conn && msg_connected(hdr)) ||
 		    (!grp && msg_in_group(hdr)))
 			err = TIPC_ERR_NO_PORT;
-- 
2.13.7
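Stripped of kernel context, the behavioral change is small. A minimal sketch (hypothetical types and helper names, not TIPC code) of ACKing at arrival time instead of read time:

#include <stdbool.h>

struct msg  { bool ack_required; };
struct conn { int unread; };

/* Before the patch, the ACK went out only from the read path, so a
 * slow reader stalled the sender for up to [RTT + reader delay].
 * After it, the ACK is emitted as soon as the message is accepted
 * into the receive queue, bounding the stall to one RTT.
 */
static void send_ack(struct conn *c) { (void)c; /* build + send CONN_ACK */ }

static bool accept_msg(struct conn *c, const struct msg *m)
{
        if (m->ack_required)
                send_ack(c);    /* early ACK: do not wait for the reader */
        c->unread++;
        return true;            /* then queue for the application */
}

int main(void)
{
        struct conn c = { 0 };
        struct msg m = { .ack_required = true };

        return accept_msg(&c, &m) ? 0 : 1;
}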
From: Tuong L. <tuo...@de...> - 2020-05-04 11:43:41
Hi Jon, all,

Here are the patches for the Nagle issues that I mentioned in the last
meeting. Please take a look and give me feedback. Thanks a lot!

BR/Tuong

Tuong Lien (2):
  tipc: fix large latency in smart Nagle streaming
  tipc: add test for Nagle algorithm effectiveness

 net/tipc/msg.c    |   3 --
 net/tipc/msg.h    |  14 ++++++--
 net/tipc/socket.c | 101 ++++++++++++++++++++++++++++++++++++++++++------------
 3 files changed, 91 insertions(+), 27 deletions(-)

-- 
2.13.7
From: Jon M. <jm...@re...> - 2020-04-29 15:51:28
On 4/28/20 4:58 AM, Tuong Lien wrote:
> When an application connects to the TIPC topology server and subscribes
> to some services, a new connection is created along with some objects -
> 'tipc_subscription' - to store the related data correspondingly...
> [...]
>
> Signed-off-by: Tuong Lien <tuo...@de...>
> [rest of patch quoted unchanged -- see the original posting below]

Acked-by: Jon Maloy <jm...@re...>
From: Ying X. <yin...@wi...> - 2020-04-29 13:53:07
On 4/28/20 4:58 PM, Tuong Lien wrote:
> When an application connects to the TIPC topology server and subscribes
> to some services, a new connection is created along with some objects -
> 'tipc_subscription' - to store the related data correspondingly...
> [...]
>
> Signed-off-by: Tuong Lien <tuo...@de...>

Acked-by: Ying Xue <yin...@wi...>

> [rest of patch quoted unchanged -- see the original posting below]
From: Tuong L. <tuo...@de...> - 2020-04-28 08:58:48
When an application connects to the TIPC topology server and subscribes
to some services, a new connection is created along with some objects -
'tipc_subscription' - to store the related data correspondingly...
However, there is one omission in the connection handling: when the
connection or application is orderly shutdown (e.g. via SIGQUIT, etc.),
the connection is not closed in the kernel, and the 'tipc_subscription'
objects are not freed either. This results in:

- The maximum number of subscriptions (65535) will be reached soon, and
  new subscriptions will be rejected;
- The TIPC module cannot be removed (unless the objects are somehow
  forced to release first);

The commit fixes the issue by closing the connection if 'recvmsg()'
returns '0', i.e. when the peer is shutdown gracefully. It also covers
the other unexpected cases.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/topsrv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index ad78f7cff379..c364335623ab 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -405,10 +405,11 @@ static int tipc_conn_rcv_from_sock(struct tipc_conn *con)
 		read_lock_bh(&sk->sk_callback_lock);
 		ret = tipc_conn_rcv_sub(srv, con, &s);
 		read_unlock_bh(&sk->sk_callback_lock);
+		if (!ret)
+			return 0;
 	}
-	if (ret < 0)
-		tipc_conn_close(con);
 
+	tipc_conn_close(con);
 	return ret;
 }
-- 
2.13.7
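The rule this fix relies on is generic to stream sockets: a return value of 0 from recvmsg()/recv() signals an orderly peer shutdown, and the per-connection resources must be released. A minimal user-space sketch of the same policy (plain AF_UNIX sockets here, not the topology server code):

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Drain a connected stream socket; returns when the peer is gone.
 * ret > 0: data; ret == 0: orderly shutdown; ret < 0: error.
 * In both of the last two cases the connection must be closed,
 * which is exactly the policy the patch applies in the kernel.
 */
static void serve_conn(int fd)
{
        char buf[256];
        ssize_t ret;

        for (;;) {
                ret = recv(fd, buf, sizeof(buf), 0);
                if (ret > 0)
                        continue;       /* process 'ret' bytes ... */
                if (ret == 0)
                        printf("peer shut down gracefully\n");
                else
                        perror("recv");
                break;
        }
        close(fd);      /* free per-connection resources in all cases */
}

int main(void)
{
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
                return 1;
        send(sv[1], "hi", 2, 0);
        close(sv[1]);           /* orderly shutdown from the "peer" */
        serve_conn(sv[0]);
        return 0;
}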
From: <jm...@re...> - 2020-04-28 01:53:34
From: Jon Maloy <jm...@re...>

TIPC would be more attractive in a modern user environment such as
Kubernetes if it could provide a larger address range. Advantages:

- Users could directly use UUIDs, strings or other values as service
  types and instances.
- No more risk of collisions between randomly selected service types.

The effect on the TIPC implementation and protocol would be
significant, but this is still worth considering.
---
 include/linux/socket.h     |  5 ++-
 include/uapi/linux/tipc3.h | 79 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 2 deletions(-)
 create mode 100644 include/uapi/linux/tipc3.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 54338fac45cb..ff2268ceedaf 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -209,8 +209,8 @@ struct ucred {
 				 * reuses AF_INET address family
 				 */
 #define AF_XDP		44	/* XDP sockets			*/
-
-#define AF_MAX		45	/* For now.. */
+#define AF_TIPC3	45	/* TIPC version 3 sockets	*/
+#define AF_MAX		46	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -260,6 +260,7 @@ struct ucred {
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
 #define PF_XDP		AF_XDP
+#define PF_TIPC3	AF_TIPC3
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen. */
diff --git a/include/uapi/linux/tipc3.h b/include/uapi/linux/tipc3.h
new file mode 100644
index 000000000000..0d385bc41b66
--- /dev/null
+++ b/include/uapi/linux/tipc3.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * include/uapi/linux/tipc3.h: Header for TIPC v3 socket interface
+ *
+ * Copyright (c) 2020 Red Hat Inc
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _LINUX_TIPC3_H_
+#define _LINUX_TIPC3_H_
+
+#include <linux/types.h>
+#include <linux/sockios.h>
+#include <linux/tipc.h>
+
+struct tipc3_addr {
+	__u8 type[16];			/* zero if socket address */
+	__u8 instance[16];		/* port if socket address */
+	__u8 node[16];			/* zero if whole cluster */
+};
+
+struct tipc3_subscr {
+	__u8 type[16];
+	__u8 lower[16];
+	__u8 upper[16];
+	__u8 node[16];
+	__u32 timeout;			/* subscription duration (in ms) */
+	__u32 filter;			/* bitmask of filter options */
+	__u8 usr_handle[16];		/* available for subscriber use */
+};
+
+struct tipc3_event {
+	__u8 lower[16];			/* matching range */
+	__u8 upper[16];			/*    "      "    */
+	struct tipc3_addr socket;	/* associated socket */
+	struct tipc3_subscr sub;	/* associated subscription */
+	__u32 event;			/* event type */
+};
+
+struct sockaddr_tipc3 {
+	unsigned short family;
+	bool mcast;
+	struct tipc3_addr addr;
+};
+
+struct tipc3_group_req {
+	struct tipc3_addr addr;
+	__u32 flags;
+};
+
+#endif
-- 
2.25.1
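For a sense of how the proposed API might look from user space, here is a purely hypothetical sketch. AF_TIPC3 and 'struct sockaddr_tipc3' exist only in the RFC above, so the structs are re-declared locally and all values are invented:

#include <string.h>
#include <sys/socket.h>
/* #include <linux/tipc3.h> -- would exist only if the RFC were merged */

/* Hypothetical: local copies of the 128-bit fields from the proposal */
struct tipc3_addr {
        unsigned char type[16];         /* zero if socket address */
        unsigned char instance[16];     /* port if socket address */
        unsigned char node[16];         /* zero if whole cluster */
};

struct sockaddr_tipc3 {
        unsigned short family;
        unsigned char mcast;
        struct tipc3_addr addr;
};

int main(void)
{
        struct sockaddr_tipc3 sa;
        /* e.g. a UUID used directly as the service type, as the RFC
         * envisions (example value only) */
        static const unsigned char svc_uuid[16] = {
                0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
                0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
        };

        memset(&sa, 0, sizeof(sa));
        sa.family = 45;                 /* AF_TIPC3 per the proposal */
        memcpy(sa.addr.type, svc_uuid, 16);
        sa.addr.instance[15] = 1;       /* service instance #1 */

        /* bind(sd, (struct sockaddr *)&sa, sizeof(sa)) would follow,
         * once a kernel actually implemented AF_TIPC3. */
        return 0;
}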
From: Andy S. <And...@in...> - 2020-04-16 18:27:05
After upgrading from Redhat 7.7 to 7.8 (kernel 3.10.0-1127.el7.x86_64)
we're getting "no reply detected from Netlink" when issuing any
tipc-config command. Here's the stack trace:

5819  execve("/usr/sbin/tipc-config", ["/usr/sbin/tipc-config", "-netid=56", "-a=1.1.10", "-be=eth:eth0"], 0x7ffe9d102928 /* 18 vars */) = 0
[... dynamic linker setup elided ...]
5819  socket(AF_TIPC, SOCK_RDM, 0) = 3
5819  getsockname(3, {sa_family=AF_TIPC, sa_data="\x03\x00\x03\x00\x40\x9b\x00\x00\x00\x00\x00\x00\x00\x00"}, [16]) = 0
5819  close(3) = 0
5819  socket(AF_TIPC, SOCK_RDM, 0) = 3
5819  getsockname(3, {sa_family=AF_TIPC, sa_data="\x03\x00\x03\x00\x48\x9b\x00\x00\x00\x00\x00\x00\x00\x00"}, [16]) = 0
5819  close(3) = 0
5819  brk(NULL) = 0xd39000
5819  brk(0xd5a000) = 0xd5a000
5819  brk(NULL) = 0xd5a000
5819  getpid() = 5819
5819  socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC) = 3
5819  bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
5819  setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
5819  setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
5819  write(3, "\x20\x00\x00\x00\x10\x00\x01\x00\x00\x00\x00\x00\xbb\x16\x00\x00\x03\x00\x00\x00\x09\x00\x02\x00\x54\x49\x50\x43\x00\x00\x00\x00", 32) = 32
5819  poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLWRNORM|POLLRDBAND|POLLWRBAND|POLLERR|POLLHUP|POLLNVAL|POLLMSG|POLLREMOVE|POLLRDHUP|POLL_BUSY_LOOP|0x4800}], 1, 3000) = 1 ([{fd=3, revents=POLLIN|POLLRDNORM|POLLWRNORM|POLLWRBAND}])
5819  recvfrom(3, {{len=88, type=nlctrl, flags=0, seq=0, pid=5819}, "\x01\x02\x00\x00\x09\x00\x02\x00\x54\x49\x50\x43\x00\x00\x00\x00\x06\x00\x01\x00\x1b\x00\x00\x00\x08\x00\x03\x00\x01\x00\x00\x00\x08\x00\x04\x00\x08\x00\x00\x00\x08\x00\x05\x00\x00\x00\x00\x00\x18\x00\x06\x00\x14\x00\x01\x00\x08\x00\x01\x00\x01\x00\x00\x00\x08\x00\x02\x00\x02\x00\x00\x00"}, 276, 0, NULL, NULL) = 88
5819  close(3) = 0
5819  socket(AF_NETLINK, SOCK_DGRAM, NETLINK_GENERIC) = 3
5819  bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
5819  setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
5819  setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
5819  write(3, "\x24\x00\x00\x00\x1b\x00\x01\x00\x00\x00\x00\x00\xbb\x16\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x0b\x80\x00\x00\x00\x08\x00\x02\x00\x00\x00\x38", 36) = 36
5819  poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLWRNORM|POLLRDBAND|POLLWRBAND|POLLERR|POLLHUP|POLLNVAL|POLLMSG|POLLREMOVE|POLLRDHUP|POLL_BUSY_LOOP|0x4800}], 1, 3000) = 1 ([{fd=3, revents=POLLWRNORM|POLLWRBAND}])
5819  fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
5819  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f211c3dd000
5819  write(1, "no reply detected from Netlink\n", 31) = 31
5819  exit_group(1) = ?
5819  +++ exited with 1 +++
From: David M. <da...@da...> - 2020-04-15 23:41:22
From: Tuong Lien <tuo...@de...>
Date: Wed, 15 Apr 2020 18:34:49 +0700

> In commit 16ad3f4022bb ("tipc: introduce variable window congestion
> control"), we allow link window to change with the congestion avoidance
> algorithm. However, there is a bug that during the slow-start if packet
> retransmission occurs, the link will enter the fast-recovery phase, set
> its window to the 'ssthresh' which is never less than 300, so the link
> window suddenly increases to that limit instead of decreasing.
> [...]
> Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control")
> Acked-by: Jon Maloy <jm...@re...>
> Signed-off-by: Tuong Lien <tuo...@de...>

Applied, thanks.
From: Tuong L. <tuo...@de...> - 2020-04-15 11:35:17
In commit 16ad3f4022bb ("tipc: introduce variable window congestion
control"), we allow link window to change with the congestion avoidance
algorithm. However, there is a bug that during the slow-start if packet
retransmission occurs, the link will enter the fast-recovery phase, set
its window to the 'ssthresh' which is never less than 300, so the link
window suddenly increases to that limit instead of decreasing.

Consequently, two issues have been observed:

- For broadcast-link: it can leave a gap between the link queues such
  that a new packet will be inserted and sent before the previous ones,
  i.e. not in-order.

- For unicast: the algorithm does not work as expected, the link window
  jumps to the slow-start threshold whereas packet retransmission
  occurs.

This commit fixes the issues by avoiding such a link window increase,
while still decreasing it if the 'ssthresh' is lowered.

Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control")
Acked-by: Jon Maloy <jm...@re...>
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/link.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 467c53a1fb5c..d4675e922a8f 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1065,7 +1065,7 @@ static void tipc_link_update_cwin(struct tipc_link *l, int released,
 	/* Enter fast recovery */
 	if (unlikely(retransmitted)) {
 		l->ssthresh = max_t(u16, l->window / 2, 300);
-		l->window = l->ssthresh;
+		l->window = min_t(u16, l->ssthresh, l->window);
 		return;
 	}
 	/* Enter slow start */
-- 
2.13.7
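A worked example of the buggy behavior, assuming a slow-start window of 50 (an illustrative value below the 300 floor):

#include <stdio.h>

/* Minimal re-enactment of the fast-recovery update. With window = 50,
 * ssthresh becomes max(50/2, 300) = 300; the old code then jumped the
 * window *up* to 300, while the fixed code keeps min(300, 50) = 50.
 */
static unsigned short max_u16(unsigned short a, unsigned short b)
{
        return a > b ? a : b;
}

static unsigned short min_u16(unsigned short a, unsigned short b)
{
        return a < b ? a : b;
}

int main(void)
{
        unsigned short window = 50;     /* slow-start, below the floor */
        unsigned short ssthresh = max_u16(window / 2, 300);

        printf("old: window = %u\n", ssthresh);                 /* 300 */
        printf("new: window = %u\n", min_u16(ssthresh, window)); /* 50 */
        return 0;
}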
From: Jon M. <jm...@re...> - 2020-04-14 20:20:09
On 4/14/20 5:57 AM, Tuong Tong Lien wrote:
> Hi Jon,
>
> As mentioned in the last meeting, we got a report from a customer that
> TIPC has some latency increase (~30-50%) in message streaming after
> backporting some recent patches. By only disabling/enabling the Nagle
> feature, we are now able to confirm this Nagle patch is causing the
> latency issue.
> [...]
> 1) On the receiving side, the required 'ACK' is only done when the
> application actually gets the whole data. Even if only part of the
> last bundle is received, we will not do the 'ACK' as required. Why do
> we need to do that so strictly? This will cause some latency since the
> receiver doesn't always get the message as quickly as the sender. I
> think we can separate it from the traditional 'CONN_ACK' which is for
> the receiver buffer flow control; the Nagle 'ACK' should be done as
> soon as possible, i.e. just after the link layer indicates that the
> receiving side has got the bundles, so the sending side can push the
> next ones if any?

Yes, I now remember that I thought about this myself; that we could
send the ACK already in tipc_sk_filter_connect(), but I deemed that was
slightly more complex and that sending ACK in tipc_sk_recvstream()
would be fast enough. I was clearly wrong in that decision. My concern
was also that RTT could become too short, so that fewer messages would
be bundled, with lower throughput as result. Obviously we should try
this change and see what it brings. The longer latency is unacceptable,
so we must be ready to sacrifice some throughput in this case if
necessary. We could even send the ACK earlier, directly from where the
link adds the message to inputq, but that will cost more CPU cycles and
I doubt it will make much difference.

> 2) The 'tsk->oneway' doesn't really make sense as it is only reset
> after the peer sends back a data message. In case there is only one
> side sending data on a socket (like our benchmark tool), the counter
> just keeps increasing, so we never leave the Nagle mode after
> entering...

That is the point. Why would we want to leave nagle mode if traffic
only goes in one direction?

> Do we need to force it to leave the Nagle mode somehow, for example:
> when receiving the Nagle 'ACK' without any bundles to be pushed?

In practice we do that already. When we receive an ACK and the send
queue is empty, we enter the state "no outstanding acks", and the next
message will be sent immediately, with the need_ack bit set. But we are
still in nagle mode for the following message, as we should be.

> 3) Do you have any other ideas that could cause the latency increase?

No. And I think your diagnosis makes perfect sense. I suggest you go
ahead and move the sending of ACK to tipc_sk_filter_connect() and
observe the result.

BR ///jon

> [rest of the original message quoted unchanged -- see below]
From: Jon M. <jm...@re...> - 2020-04-14 19:51:49
On 4/14/20 3:30 AM, Tuong Lien wrote:
> Since commit 6a939f365bdb ("tipc: Auto removal of peer down node
> instance"), a peer node instance is automatically cleaned up after 5
> minutes without contact. This changed behavior is not really expected
> by users, who would expect to see the node with the "down" state in
> the node list as before.
> Also, the timer value is said to be so small that the peer node might
> just be rebooting, so will come back soon. This is absolutely no
> problem, but one may want to extend it to e.g. 10 minutes.
>
> This commit makes the timer configurable via the sysctl file:
>
>   /proc/sys/net/tipc/node_cleanup
>
> which indicates in seconds how long after going down a peer node
> should be removed. The default value is 300, i.e. 5 minutes, while a
> value of '0' will disable the auto removal feature.
>
> Signed-off-by: Tuong Lien <tuo...@de...>
> ---
>  net/tipc/core.h   |  1 +
>  net/tipc/node.c   | 15 +++++++++------
>  net/tipc/sysctl.c |  8 ++++++++
>  3 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/net/tipc/core.h b/net/tipc/core.h
> index 631d83c9705f..7d07f9086936 100644
> --- a/net/tipc/core.h
> +++ b/net/tipc/core.h
> @@ -89,6 +89,7 @@ struct tipc_crypto;
>  extern unsigned int tipc_net_id __read_mostly;
>  extern int sysctl_tipc_rmem[3] __read_mostly;
>  extern int sysctl_tipc_named_timeout __read_mostly;
> +extern int sysctl_tipc_node_cleanup __read_mostly;
>  
>  struct tipc_net {
>  	u8 node_id[NODE_ID_LEN];
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index 2373a66f3b59..5cf01e182730 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -46,8 +46,11 @@
>  #include "trace.h"
>  #include "crypto.h"
>  
> +int sysctl_tipc_node_cleanup __read_mostly = 300; /* in seconds */
> +
>  #define INVALID_NODE_SIG 0x10000
> -#define NODE_CLEANUP_AFTER 300000
> +#define NODE_CLEANUP_AFTER \
> +	msecs_to_jiffies(sysctl_tipc_node_cleanup * 1000)
>  
>  /* Flags used to take different actions according to flag type
>   * TIPC_NOTIFY_NODE_DOWN: notify node is down
> @@ -130,7 +133,7 @@ struct tipc_node {
>  	unsigned long keepalive_intv;
>  	struct timer_list timer;
>  	struct rcu_head rcu;
> -	unsigned long delete_at;
> +	unsigned long checkpt;
>  	struct net *peer_net;
>  	u32 peer_hash_mix;
>  #ifdef CONFIG_TIPC_CRYPTO
> @@ -537,7 +540,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id,
>  	for (i = 0; i < MAX_BEARERS; i++)
>  		spin_lock_init(&n->links[i].lock);
>  	n->state = SELF_DOWN_PEER_LEAVING;
> -	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
> +	n->checkpt = jiffies;
>  	n->signature = INVALID_NODE_SIG;
>  	n->active_links[0] = INVALID_BEARER_ID;
>  	n->active_links[1] = INVALID_BEARER_ID;
> @@ -726,8 +729,8 @@ static bool tipc_node_cleanup(struct tipc_node *peer)
>  		return false;
>  
>  	tipc_node_write_lock(peer);
> -
> -	if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
> +	if (!node_is_up(peer) && NODE_CLEANUP_AFTER &&
> +	    time_after(jiffies, peer->checkpt + NODE_CLEANUP_AFTER)) {
>  		tipc_node_clear_links(peer);
>  		tipc_node_delete_from_list(peer);
>  		deleted = true;
> @@ -1478,7 +1481,7 @@ static void node_lost_contact(struct tipc_node *n,
>  	uint i;
>  
>  	pr_debug("Lost contact with %x\n", n->addr);
> -	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
> +	n->checkpt = jiffies;
>  	trace_tipc_node_lost_contact(n, true, " ");
>  
>  	/* Clean up broadcast state */
> diff --git a/net/tipc/sysctl.c b/net/tipc/sysctl.c
> index 58ab3d6dcdce..087a89b27b9a 100644
> --- a/net/tipc/sysctl.c
> +++ b/net/tipc/sysctl.c
> @@ -75,6 +75,14 @@ static struct ctl_table tipc_table[] = {
>  		.extra1		= SYSCTL_ONE,
>  	},
>  #endif
> +	{
> +		.procname	= "node_cleanup",
> +		.data		= &sysctl_tipc_node_cleanup,
> +		.maxlen		= sizeof(sysctl_tipc_node_cleanup),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +	},
>  	{}
>  };

Acked-by: Jon Maloy <jm...@re...>
From: Jon M. <jm...@re...> - 2020-04-14 19:49:09
Hmm, I was sure I already did. But anyway:

Acked-by: Jon Maloy <jm...@re...>

On 4/14/20 3:54 AM, Tuong Tong Lien wrote:
> Hi Jon, all,
>
> I understand that you have agreed on the solution to the issue in
> previous discussions, but please make a formal "ack-by" so I can send
> it to net. Thanks a lot!
>
> BR/Tuong
>
> -----Original Message-----
> From: Tuong Lien <tuo...@de...>
> Sent: Friday, March 27, 2020 7:31 PM
> To: jm...@re...; ma...@do...; yin...@wi...; tip...@li...
> Cc: tip...@de...
> Subject: [net] tipc: fix incorrect increasing of link window
>
> [patch quoted unchanged -- see the posting above]
From: Tuong T. L. <tuo...@de...> - 2020-04-14 09:57:33
|
Hi Jon,

As mentioned in the last meeting, we got a report from a customer that TIPC streaming latency has increased (~30-50%) after backporting some recent patches. By simply disabling/enabling the Nagle feature, we can now confirm that it is this Nagle patch causing the latency issue. We may need to turn the feature off by default (i.e. initializing 'tsk->nodelay = true'), but then likely no one will ever turn it on again by setting 'TIPC_NODELAY' = false ☹. However, I wonder if that increase in latency is expected at all, since we declared that it must be < 1 RTT...?

When testing with our benchmark tool, I can indeed observe a large delay: some messages are bundled and queued in the socket backlog for hundreds of milliseconds before they can be sent out, i.e. when the required 'ACK' finally arrives. I have some concerns here:

1) On the receiving side, the required 'ACK' is only issued when the application has actually read the whole data. Even if part of the last bundle has been received, we do not send the 'ACK' as required. Why do we need to be so strict? This causes some latency, since the receiver does not always read the message as quickly as the sender sends it. I think we can separate it from the traditional 'CONN_ACK', which is for receive-buffer flow control; the Nagle 'ACK' should be sent as soon as possible, i.e. right at the link layer once the receiving side has got the bundles, so the sending side can push the next ones, if any.

2) The 'tsk->oneway' counter doesn't really make sense, as it is only reset when the peer sends back a data message. If only one side is sending data on a socket (like our benchmark tool), the counter just keeps increasing, so we never leave Nagle mode after entering it... Do we need to force the socket to leave Nagle mode somehow, for example when it receives the Nagle 'ACK' while there are no more bundles to push?

3) Do you have any other ideas about what could cause the latency increase?

BR/Tuong

-----Original Message-----
From: David Miller <da...@da...>
Sent: Thursday, October 31, 2019 2:17 AM
To: jon...@er...
Cc: ne...@vg...; lx...@re...; edu...@go...; tip...@li...
Subject: Re: [tipc-discussion] [net-next 1/1] tipc: add smart nagle feature

From: Jon Maloy <jon...@er...>
Date: Wed, 30 Oct 2019 14:00:41 +0100

> We introduce a feature that works like a combination of TCP_NAGLE and
> TCP_CORK, but without some of the weaknesses of those. In particular,
> we will not observe long delivery delays because of delayed acks, since
> the algorithm itself decides if and when acks are to be sent from the
> receiving peer.
>
> - The nagle property as such is determined by manipulating a new
>   'maxnagle' field in struct tipc_sock. If certain conditions are met,
>   'maxnagle' will define the max size of the messages which can be
>   bundled. If it is set to zero no messages are ever bundled, implying
>   that the nagle property is disabled.
> - A socket with the nagle property enabled enters nagle mode when more
>   than 4 messages have been sent out without receiving any data message
>   from the peer.
> - A socket leaves nagle mode whenever it receives a data message from
>   the peer.
>
> In nagle mode, messages smaller than 'maxnagle' are accumulated in the
> socket write queue. The last buffer in the queue is marked with a new
> 'ack_required' bit, which forces the receiving peer to send a CONN_ACK
> message back to the sender upon reception.
>
> The accumulated contents of the write queue are transmitted when one of
> the following events or conditions occurs:
>
> - A CONN_ACK message is received from the peer.
> - A data message is received from the peer.
> - A SOCK_WAKEUP pseudo message is received from the link level.
> - The write queue contains more than 64 1k blocks of data.
> - The connection is being shut down.
> - There is no CONN_ACK message to expect. I.e., there is currently
>   no outstanding message where the 'ack_required' bit was set. As a
>   consequence, the first message added after we enter nagle mode
>   is always sent directly with this bit set.
>
> This new feature gives a 50-100% improvement in throughput for small
> (i.e., less than MTU size) messages, while it might add up to one RTT
> to latency time when the socket is in nagle mode.
>
> Acked-by: Ying Xue <yin...@wi...>
> Signed-off-by: Jon Maloy <jon...@er...>

Applied, thanks Jon.

_______________________________________________
tipc-discussion mailing list
tip...@li...
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
|
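To make the mechanism under discussion easier to follow, here is a minimal userspace sketch of the send-side decision described in the commit message above; the struct, its fields and the helper name are illustrative assumptions, not the actual net/tipc/socket.c code:

#include <stdbool.h>

/* Illustrative per-socket state, loosely modeled on struct tipc_sock */
struct nagle_state {
	int  maxnagle;    /* max bundleable message size; 0 = nagle off */
	int  oneway;      /* msgs sent without any data msg from peer */
	bool expect_ack;  /* a buffer with 'ack_required' is outstanding */
};

/* Returns true if the message should be accumulated in the write
 * queue, false if it should be transmitted immediately.
 */
static bool nagle_should_queue(struct nagle_state *s, int msg_sz)
{
	if (!s->maxnagle || msg_sz > s->maxnagle)
		return false;         /* nagle off, or too big to bundle */
	if (s->oneway <= 4)
		return false;         /* not yet in nagle mode */
	if (!s->expect_ack) {
		s->expect_ack = true; /* first msg in nagle mode goes out
				       * directly, with 'ack_required' set */
		return false;
	}
	return true;                  /* bundle until the CONN_ACK arrives */
}

In these terms, Tuong's concern 1) is about how quickly the peer answers the 'ack_required' mark, and concern 2) is that 'oneway' only ever grows on a one-directional connection.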
From: Tuong T. L. <tuo...@de...> - 2020-04-14 08:20:28
|
Hi a.Hoang,

I have really tried to make the name as "short" as possible because of the "80-character" rule. Moreover, the value is not only a time value but also acts as a "switch" to turn the auto removal feature on/off (see the commit message). Anyway, it is just a name, so we can change it to whatever. I will wait for other opinions...

Regarding its unit: yes, I thought about that (the coding would even be one step shorter), but 'milliseconds' does not seem to make sense here, especially from a user perspective, since a node typically only needs to be removed after minutes, like 5 or 10 minutes. So I'd prefer to keep it in 'seconds', which is scalable enough.

BR/Tuong

-----Original Message-----
From: Hoang Huu Le <hoa...@de...>
Sent: Tuesday, April 14, 2020 2:54 PM
To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; yin...@wi...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: RE: [PATCH RFC] tipc: make peer node cleanup timer configurable

Hi Tuong,

The name 'node_cleanup' is not clear in meaning. I'd prefer using 'node_cleanup_time_secs'. Also, I think we should use 'milliseconds', which is more common than 'seconds' in proc/sysctl.

Regards,
Hoang

-----Original Message-----
From: Tuong Tong Lien <tuo...@de...>
Sent: Tuesday, April 14, 2020 2:30 PM
To: jm...@re...; ma...@do...; yin...@wi...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: [PATCH RFC] tipc: make peer node cleanup timer configurable

Since commit 6a939f365bdb ("tipc: Auto removal of peer down node instance"), a peer node instance is automatically cleaned up after 5 mins without contact. This changed behavior is not really expected by users, who would instead expect to keep seeing the node in the "down" state in the node list as before. Also, the timer value is arguably so small that the peer node might just be rebooting and will come back soon; that is no problem in itself, but one may want to extend the timer to e.g. 10 minutes.

This commit makes the timer configurable via the sysctl file:

/proc/sys/net/tipc/node_cleanup

which indicates, in seconds, after how long a down peer node should be removed. The default value is 300, i.e. 5 mins, while a value of '0' disables the auto removal feature.
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/core.h   |  1 +
 net/tipc/node.c   | 15 +++++++++------
 net/tipc/sysctl.c |  8 ++++++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index 631d83c9705f..7d07f9086936 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -89,6 +89,7 @@ struct tipc_crypto;
 extern unsigned int tipc_net_id __read_mostly;
 extern int sysctl_tipc_rmem[3] __read_mostly;
 extern int sysctl_tipc_named_timeout __read_mostly;
+extern int sysctl_tipc_node_cleanup __read_mostly;
 
 struct tipc_net {
 	u8 node_id[NODE_ID_LEN];
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 2373a66f3b59..5cf01e182730 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -46,8 +46,11 @@
 #include "trace.h"
 #include "crypto.h"
 
+int sysctl_tipc_node_cleanup __read_mostly = 300; /* in seconds */
+
 #define INVALID_NODE_SIG 0x10000
-#define NODE_CLEANUP_AFTER 300000
+#define NODE_CLEANUP_AFTER \
+	msecs_to_jiffies(sysctl_tipc_node_cleanup * 1000)
 
 /* Flags used to take different actions according to flag type
  * TIPC_NOTIFY_NODE_DOWN: notify node is down
@@ -130,7 +133,7 @@ struct tipc_node {
 	unsigned long keepalive_intv;
 	struct timer_list timer;
 	struct rcu_head rcu;
-	unsigned long delete_at;
+	unsigned long checkpt;
 	struct net *peer_net;
 	u32 peer_hash_mix;
 #ifdef CONFIG_TIPC_CRYPTO
@@ -537,7 +540,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id,
 	for (i = 0; i < MAX_BEARERS; i++)
 		spin_lock_init(&n->links[i].lock);
 	n->state = SELF_DOWN_PEER_LEAVING;
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	n->signature = INVALID_NODE_SIG;
 	n->active_links[0] = INVALID_BEARER_ID;
 	n->active_links[1] = INVALID_BEARER_ID;
@@ -726,8 +729,8 @@ static bool tipc_node_cleanup(struct tipc_node *peer)
 		return false;
 
 	tipc_node_write_lock(peer);
-
-	if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
+	if (!node_is_up(peer) && NODE_CLEANUP_AFTER &&
+	    time_after(jiffies, peer->checkpt + NODE_CLEANUP_AFTER)) {
 		tipc_node_clear_links(peer);
 		tipc_node_delete_from_list(peer);
 		deleted = true;
@@ -1478,7 +1481,7 @@ static void node_lost_contact(struct tipc_node *n,
 	uint i;
 
 	pr_debug("Lost contact with %x\n", n->addr);
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	trace_tipc_node_lost_contact(n, true, " ");
 
 	/* Clean up broadcast state */
diff --git a/net/tipc/sysctl.c b/net/tipc/sysctl.c
index 58ab3d6dcdce..087a89b27b9a 100644
--- a/net/tipc/sysctl.c
+++ b/net/tipc/sysctl.c
@@ -75,6 +75,14 @@ static struct ctl_table tipc_table[] = {
 		.extra1	= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "node_cleanup",
+		.data		= &sysctl_tipc_node_cleanup,
+		.maxlen		= sizeof(sysctl_tipc_node_cleanup),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{}
 };
--
2.13.7
|
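A small userspace model of the new cleanup check may help show why a value of '0' disables the feature; the names and the assumed 1000 jiffies per second are illustrative, not kernel API:

#include <stdbool.h>

#define HZ_MODEL 1000UL                   /* assumed jiffies per second */

static unsigned long cleanup_secs = 300;  /* models the sysctl value */

static bool node_should_be_removed(unsigned long now_jiffies,
				   unsigned long checkpt, bool node_is_up)
{
	unsigned long after = cleanup_secs * HZ_MODEL; /* NODE_CLEANUP_AFTER */

	/* With cleanup_secs == 0, 'after' is 0 and the condition
	 * short-circuits, i.e. the node is never auto-removed. The
	 * subtract-and-cast idiom mirrors time_after() and stays
	 * correct across jiffies wraparound.
	 */
	return !node_is_up && after &&
	       (long)(now_jiffies - (checkpt + after)) > 0;
}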
From: Tuong T. L. <tuo...@de...> - 2020-04-14 07:54:38
|
Hi Jon, all,

I understand that you agreed on the solution to this issue in previous discussions, but please give it a formal "Acked-by" so I can send it to 'net'. Thanks a lot!

BR/Tuong

-----Original Message-----
From: Tuong Lien <tuo...@de...>
Sent: Friday, March 27, 2020 7:31 PM
To: jm...@re...; ma...@do...; yin...@wi...; tip...@li...
Cc: tip...@de...
Subject: [net] tipc: fix incorrect increasing of link window

In commit 16ad3f4022bb ("tipc: introduce variable window congestion control"), we allow the link window to change with the congestion avoidance algorithm. However, there is a bug: if packet retransmission occurs during slow start, the link enters the fast-recovery phase and sets its window to the 'ssthresh', which is never less than 300, so the link window suddenly increases to that limit instead of decreasing.

Consequently, two issues have been observed:

- For the broadcast-link: it can leave a gap between the link queues, so that a new packet is inserted and sent before the previous ones, i.e. not in order.
- For unicast: the algorithm does not work as expected; the link window jumps to the slow-start threshold when packet retransmission occurs.

This commit fixes the issues by avoiding such a link window increase, while still allowing it to decrease if the 'ssthresh' is lowered.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/link.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 467c53a1fb5c..d4675e922a8f 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1065,7 +1065,7 @@ static void tipc_link_update_cwin(struct tipc_link *l, int released,
 	/* Enter fast recovery */
 	if (unlikely(retransmitted)) {
 		l->ssthresh = max_t(u16, l->window / 2, 300);
-		l->window = l->ssthresh;
+		l->window = min_t(u16, l->ssthresh, l->window);
 		return;
 	}
 	/* Enter slow start */
--
2.13.7
|
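For readers following the discussion, a minimal userspace model of the variable-window logic the commit text describes (slow start below 'ssthresh', fast recovery on retransmission); the growth steps and names are simplifying assumptions, not the actual tipc_link_update_cwin():

#include <stdbool.h>

struct cwin_model {
	unsigned int window;    /* current congestion window (packets) */
	unsigned int ssthresh;  /* slow-start threshold */
	unsigned int max_win;
};

static void update_cwin(struct cwin_model *l, unsigned int released,
			bool retransmitted)
{
	/* Fast recovery: halve into ssthresh (floor 300). The fix is to
	 * only ever shrink the window here - assigning ssthresh
	 * directly could *grow* a small slow-start window up to 300.
	 */
	if (retransmitted) {
		l->ssthresh = l->window / 2 > 300 ? l->window / 2 : 300;
		if (l->ssthresh < l->window)
			l->window = l->ssthresh;
		return;
	}
	if (l->window <= l->ssthresh)
		l->window += released;    /* slow start: fast growth */
	else if (released)
		l->window += 1;           /* congestion avoidance */
	if (l->window > l->max_win)
		l->window = l->max_win;
}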
From: Hoang H. Le <hoa...@de...> - 2020-04-14 07:54:31
|
Hi Tuong,

The name 'node_cleanup' is not clear in meaning. I'd prefer using 'node_cleanup_time_secs'. Also, I think we should use 'milliseconds', which is more common than 'seconds' in proc/sysctl.

Regards,
Hoang

-----Original Message-----
From: Tuong Tong Lien <tuo...@de...>
Sent: Tuesday, April 14, 2020 2:30 PM
To: jm...@re...; ma...@do...; yin...@wi...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: [PATCH RFC] tipc: make peer node cleanup timer configurable

Since commit 6a939f365bdb ("tipc: Auto removal of peer down node instance"), a peer node instance is automatically cleaned up after 5 mins without contact. This changed behavior is not really expected by users, who would instead expect to keep seeing the node in the "down" state in the node list as before. Also, the timer value is arguably so small that the peer node might just be rebooting and will come back soon; that is no problem in itself, but one may want to extend the timer to e.g. 10 minutes.

This commit makes the timer configurable via the sysctl file:

/proc/sys/net/tipc/node_cleanup

which indicates, in seconds, after how long a down peer node should be removed. The default value is 300, i.e. 5 mins, while a value of '0' disables the auto removal feature.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/core.h   |  1 +
 net/tipc/node.c   | 15 +++++++++------
 net/tipc/sysctl.c |  8 ++++++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index 631d83c9705f..7d07f9086936 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -89,6 +89,7 @@ struct tipc_crypto;
 extern unsigned int tipc_net_id __read_mostly;
 extern int sysctl_tipc_rmem[3] __read_mostly;
 extern int sysctl_tipc_named_timeout __read_mostly;
+extern int sysctl_tipc_node_cleanup __read_mostly;
 
 struct tipc_net {
 	u8 node_id[NODE_ID_LEN];
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 2373a66f3b59..5cf01e182730 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -46,8 +46,11 @@
 #include "trace.h"
 #include "crypto.h"
 
+int sysctl_tipc_node_cleanup __read_mostly = 300; /* in seconds */
+
 #define INVALID_NODE_SIG 0x10000
-#define NODE_CLEANUP_AFTER 300000
+#define NODE_CLEANUP_AFTER \
+	msecs_to_jiffies(sysctl_tipc_node_cleanup * 1000)
 
 /* Flags used to take different actions according to flag type
  * TIPC_NOTIFY_NODE_DOWN: notify node is down
@@ -130,7 +133,7 @@ struct tipc_node {
 	unsigned long keepalive_intv;
 	struct timer_list timer;
 	struct rcu_head rcu;
-	unsigned long delete_at;
+	unsigned long checkpt;
 	struct net *peer_net;
 	u32 peer_hash_mix;
 #ifdef CONFIG_TIPC_CRYPTO
@@ -537,7 +540,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id,
 	for (i = 0; i < MAX_BEARERS; i++)
 		spin_lock_init(&n->links[i].lock);
 	n->state = SELF_DOWN_PEER_LEAVING;
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	n->signature = INVALID_NODE_SIG;
 	n->active_links[0] = INVALID_BEARER_ID;
 	n->active_links[1] = INVALID_BEARER_ID;
@@ -726,8 +729,8 @@ static bool tipc_node_cleanup(struct tipc_node *peer)
 		return false;
 
 	tipc_node_write_lock(peer);
-
-	if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
+	if (!node_is_up(peer) && NODE_CLEANUP_AFTER &&
+	    time_after(jiffies, peer->checkpt + NODE_CLEANUP_AFTER)) {
 		tipc_node_clear_links(peer);
 		tipc_node_delete_from_list(peer);
 		deleted = true;
@@ -1478,7 +1481,7 @@ static void node_lost_contact(struct tipc_node *n,
 	uint i;
 
 	pr_debug("Lost contact with %x\n", n->addr);
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	trace_tipc_node_lost_contact(n, true, " ");
 
 	/* Clean up broadcast state */
diff --git a/net/tipc/sysctl.c b/net/tipc/sysctl.c
index 58ab3d6dcdce..087a89b27b9a 100644
--- a/net/tipc/sysctl.c
+++ b/net/tipc/sysctl.c
@@ -75,6 +75,14 @@ static struct ctl_table tipc_table[] = {
 		.extra1	= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "node_cleanup",
+		.data		= &sysctl_tipc_node_cleanup,
+		.maxlen		= sizeof(sysctl_tipc_node_cleanup),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{}
 };
--
2.13.7
|
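For comparison, a hypothetical fragment of the milliseconds-based alternative suggested here (not part of the posted patch); the '* 1000' conversion step then disappears:

/* Hypothetical: sysctl value kept in milliseconds instead of seconds */
int sysctl_tipc_node_cleanup_ms __read_mostly = 300000; /* 5 mins */

#define NODE_CLEANUP_AFTER msecs_to_jiffies(sysctl_tipc_node_cleanup_ms)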
From: Tuong L. <tuo...@de...> - 2020-04-14 07:30:38
|
Since commit 6a939f365bdb ("tipc: Auto removal of peer down node instance"), a peer node instance is automatically cleaned up after 5 mins without contact. This changed behavior is not really expected by users, who would instead expect to keep seeing the node in the "down" state in the node list as before. Also, the timer value is arguably so small that the peer node might just be rebooting and will come back soon; that is no problem in itself, but one may want to extend the timer to e.g. 10 minutes.

This commit makes the timer configurable via the sysctl file:

/proc/sys/net/tipc/node_cleanup

which indicates, in seconds, after how long a down peer node should be removed. The default value is 300, i.e. 5 mins, while a value of '0' disables the auto removal feature.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/core.h   |  1 +
 net/tipc/node.c   | 15 +++++++++------
 net/tipc/sysctl.c |  8 ++++++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index 631d83c9705f..7d07f9086936 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -89,6 +89,7 @@ struct tipc_crypto;
 extern unsigned int tipc_net_id __read_mostly;
 extern int sysctl_tipc_rmem[3] __read_mostly;
 extern int sysctl_tipc_named_timeout __read_mostly;
+extern int sysctl_tipc_node_cleanup __read_mostly;
 
 struct tipc_net {
 	u8 node_id[NODE_ID_LEN];
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 2373a66f3b59..5cf01e182730 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -46,8 +46,11 @@
 #include "trace.h"
 #include "crypto.h"
 
+int sysctl_tipc_node_cleanup __read_mostly = 300; /* in seconds */
+
 #define INVALID_NODE_SIG 0x10000
-#define NODE_CLEANUP_AFTER 300000
+#define NODE_CLEANUP_AFTER \
+	msecs_to_jiffies(sysctl_tipc_node_cleanup * 1000)
 
 /* Flags used to take different actions according to flag type
  * TIPC_NOTIFY_NODE_DOWN: notify node is down
@@ -130,7 +133,7 @@ struct tipc_node {
 	unsigned long keepalive_intv;
 	struct timer_list timer;
 	struct rcu_head rcu;
-	unsigned long delete_at;
+	unsigned long checkpt;
 	struct net *peer_net;
 	u32 peer_hash_mix;
 #ifdef CONFIG_TIPC_CRYPTO
@@ -537,7 +540,7 @@ struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id,
 	for (i = 0; i < MAX_BEARERS; i++)
 		spin_lock_init(&n->links[i].lock);
 	n->state = SELF_DOWN_PEER_LEAVING;
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	n->signature = INVALID_NODE_SIG;
 	n->active_links[0] = INVALID_BEARER_ID;
 	n->active_links[1] = INVALID_BEARER_ID;
@@ -726,8 +729,8 @@ static bool tipc_node_cleanup(struct tipc_node *peer)
 		return false;
 
 	tipc_node_write_lock(peer);
-
-	if (!node_is_up(peer) && time_after(jiffies, peer->delete_at)) {
+	if (!node_is_up(peer) && NODE_CLEANUP_AFTER &&
+	    time_after(jiffies, peer->checkpt + NODE_CLEANUP_AFTER)) {
 		tipc_node_clear_links(peer);
 		tipc_node_delete_from_list(peer);
 		deleted = true;
@@ -1478,7 +1481,7 @@ static void node_lost_contact(struct tipc_node *n,
 	uint i;
 
 	pr_debug("Lost contact with %x\n", n->addr);
-	n->delete_at = jiffies + msecs_to_jiffies(NODE_CLEANUP_AFTER);
+	n->checkpt = jiffies;
 	trace_tipc_node_lost_contact(n, true, " ");
 
 	/* Clean up broadcast state */
diff --git a/net/tipc/sysctl.c b/net/tipc/sysctl.c
index 58ab3d6dcdce..087a89b27b9a 100644
--- a/net/tipc/sysctl.c
+++ b/net/tipc/sysctl.c
@@ -75,6 +75,14 @@ static struct ctl_table tipc_table[] = {
 		.extra1	= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "node_cleanup",
+		.data		= &sysctl_tipc_node_cleanup,
+		.maxlen		= sizeof(sysctl_tipc_node_cleanup),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{}
 };
--
2.13.7
|
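Assuming the patch above is applied, the new file behaves like any other sysctl and can, for instance, be exercised from a userspace C program (illustrative):

#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/net/tipc/node_cleanup";
	FILE *f = fopen(path, "w");
	int secs = 0;

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "600\n");   /* extend the cleanup time to 10 minutes */
	fclose(f);

	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &secs) == 1)
		printf("node_cleanup = %d seconds\n", secs);
	fclose(f);
	return 0;
}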
From: Jon M. <jm...@re...> - 2020-04-07 14:38:54
|
Hi all,

I have followed the discussion, and think Ying's concerns have been well responded to.

That switch is annoying, but I cannot see how we can get rid of it easily, at least for now. My hunch is that retransmission by unicast will work well in smaller clusters, while broadcast may work better in larger clusters where multiple losses are more likely. I.e., when one node reports a loss, it is highly likely that other nodes have experienced the same loss as well, so a broadcast retransmission may be more optimal. But we have no way of knowing this except from empirical data, and the optimal cluster size at which to switch may vary from system to system and even depend on the overall load situation.

The best outcome would be if we could test this in large clusters and find that unicast is always best. Then no switch would be needed. The second best, given the assumption above, would be some algorithm that auto-adapts (probably with hysteresis) and does the switch based on counters: e.g., the number of identical unicast retransmissions as a percentage of cluster size in the unicast->broadcast direction, and the number of broadcast duplicates received by the peers in the broadcast->unicast direction. But my assumption may of course be totally wrong, and a different algorithm is needed, or none at all.

Anyway, series Acked-by: Jon Maloy <jm...@re...>

On 4/7/20 2:52 AM, Xue, Ying wrote:
> Good work Tuong!
>
> Acked-by: Ying Xue <yin...@wi...>
>
> -----Original Message-----
> From: Tuong Lien [mailto:tuo...@de...]
> Sent: Saturday, March 28, 2020 12:03 PM
> To: jm...@re...; ma...@do...; Xue, Ying; tip...@li...
> Cc: tip...@de...
> Subject: [PATCH RFC 0/4] tipc: add some improvements for broadcast
>
> Hi Jon, all,
>
> Please find the full series here,
> + For the 1st patch: it's really the last one I sent before, so you have
>   acked it already.
> + For the other ones, please help take a look. Also, I will send another
>   patch for iproute2/tipc which is the user-space part of the last one in
>   this series, i.e. broadcast rcv stats dumping.
>
> Thanks a lot!
>
> Tuong Lien (4):
>   tipc: introduce Gap ACK blocks for broadcast link
>   tipc: add back link trace events
>   tipc: enable broadcast retrans via unicast
>   tipc: add support for broadcast rcv stats dumping
>
>  net/tipc/bcast.c   |  22 ++-
>  net/tipc/bcast.h   |   9 +-
>  net/tipc/link.c    | 500 +++++++++++++++++++++++++++++++----------------------
>  net/tipc/link.h    |  11 +-
>  net/tipc/msg.c     |   9 +-
>  net/tipc/msg.h     |  16 +-
>  net/tipc/netlink.c |   2 +-
>  net/tipc/node.c    |  75 ++++++--
>  net/tipc/sysctl.c  |   9 +-
>  net/tipc/trace.h   |  17 +-
>  10 files changed, 424 insertions(+), 246 deletions(-)
>
|
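As a thought experiment only, the auto-adapting switch Jon sketches might look roughly as follows; the thresholds, counter names and helper are entirely made up, and nothing like this exists in the posted series:

#include <stdbool.h>

struct retr_stats {
	unsigned int identical_uc_retr; /* same pkt re-sent via unicast */
	unsigned int bc_dups_reported;  /* duplicates peers saw via bcast */
	unsigned int cluster_size;
};

/* Switch to broadcast retransmission when many peers keep asking for
 * the same packet; switch back when broadcast mostly yields
 * duplicates. Using two different thresholds provides the hysteresis
 * Jon mentions.
 */
static bool prefer_bcast_retransmit(const struct retr_stats *s,
				    bool cur_bcast)
{
	unsigned int uc_pct, dup_pct;

	if (!s->cluster_size)
		return cur_bcast;
	uc_pct = s->identical_uc_retr * 100 / s->cluster_size;
	dup_pct = s->bc_dups_reported * 100 / s->cluster_size;

	if (!cur_bcast && uc_pct > 25)  /* many peers lost the same pkt */
		return true;
	if (cur_bcast && dup_pct > 75)  /* broadcast mostly wasted */
		return false;
	return cur_bcast;
}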
From: Xue, Y. <Yin...@wi...> - 2020-04-07 08:25:30
|
Good work Tuong!

Acked-by: Ying Xue <yin...@wi...>

-----Original Message-----
From: Tuong Lien [mailto:tuo...@de...]
Sent: Saturday, March 28, 2020 12:03 PM
To: jm...@re...; ma...@do...; Xue, Ying; tip...@li...
Cc: tip...@de...
Subject: [PATCH RFC 0/4] tipc: add some improvements for broadcast

Hi Jon, all,

Please find the full series here,
+ For the 1st patch: it's really the last one I sent before, so you have
  acked it already.
+ For the other ones, please help take a look. Also, I will send another
  patch for iproute2/tipc which is the user-space part of the last one in
  this series, i.e. broadcast rcv stats dumping.

Thanks a lot!

Tuong Lien (4):
  tipc: introduce Gap ACK blocks for broadcast link
  tipc: add back link trace events
  tipc: enable broadcast retrans via unicast
  tipc: add support for broadcast rcv stats dumping

 net/tipc/bcast.c   |  22 ++-
 net/tipc/bcast.h   |   9 +-
 net/tipc/link.c    | 500 +++++++++++++++++++++++++++++++----------------------
 net/tipc/link.h    |  11 +-
 net/tipc/msg.c     |   9 +-
 net/tipc/msg.h     |  16 +-
 net/tipc/netlink.c |   2 +-
 net/tipc/node.c    |  75 ++++++--
 net/tipc/sysctl.c  |   9 +-
 net/tipc/trace.h   |  17 +-
 10 files changed, 424 insertions(+), 246 deletions(-)
--
2.13.7
|
From: Xue, Y. <Yin...@wi...> - 2020-04-07 06:50:29
|
Basically, when this is performed by the iproute2/tipc tool, the tool first asks the kernel to dump everything and then filters out the specific links according to the command options. Please see the other patch on iproute2/tipc for more details:

[iproute2-next] tipc: enable printing of broadcast rcv link stats

So, for this patch, we just introduce a flag for the user to dump the broadcast-receiver link stats (in addition to the traditional links' ones) when needed. That is the 'TIPC_NL_LINK_GET'/'TIPC_NLA_LINK_BROADCAST' flag mentioned in the commit message. In iproute2/tipc, it will be:

> +	/* Set the flag to dump all bc links */
> +	attrs = mnl_attr_nest_start(nlh, TIPC_NLA_LINK);
> +	mnl_attr_put(nlh, TIPC_NLA_LINK_BROADCAST, 0, NULL);
> +	mnl_attr_nest_end(nlh, attrs);

===
The reason why I asked the question is that "type" is still hard-coded in the following function, which makes it difficult to understand what "type == 2" means:

 /* Caller should hold node lock  */
 static int __tipc_nl_add_node_links(struct net *net, struct tipc_nl_msg *msg,
-				    struct tipc_node *node, u32 *prev_link)
+				    struct tipc_node *node, u32 *prev_link,
+				    u32 type)
 {
 	u32 i;
 	int err;
@@ -2536,6 +2553,14 @@ static int __tipc_nl_add_node_links(struct net *net, struct tipc_nl_msg *msg,
 		if (err)
 			return err;
 	}
+
+	if (type == 2) {
+		*prev_link = 3;
+		err = tipc_nl_add_bc_link(net, msg, node->bc_entry.link);
+		if (err)
+			return err;
+	}
+
 	*prev_link = 0;
===

Thanks,
Ying

-----Original Message-----
From: Xue, Ying <Yin...@wi...>
Sent: Monday, April 6, 2020 1:45 PM
To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: RE: [PATCH RFC 4/4] tipc: add support for broadcast rcv stats dumping

Just a minor comment: please define macros for the cases:
1. Dump broadcast-link & unicast links
2. Dump broadcast-receiver links

Thanks,
Ying

-----Original Message-----
From: Tuong Lien [mailto:tuo...@de...]
Sent: Saturday, March 28, 2020 12:03 PM
To: jm...@re...; ma...@do...; Xue, Ying; tip...@li...
Cc: tip...@de...
Subject: [PATCH RFC 4/4] tipc: add support for broadcast rcv stats dumping

This commit enables dumping the statistics of a broadcast-receiver link like the traditional 'broadcast-link' one (which is for the broadcast-sender). The link dumping can be triggered via netlink (e.g. the iproute2/tipc tool) with the link flag 'TIPC_NLA_LINK_BROADCAST' as the indicator. The name of the broadcast-receiver link of a specific peer will be in the format: 'broadcast-link:<peer-id>'. For example:

Link <broadcast-link:1001002>
  Window:50 packets
  RX packets:7841 fragments:2408/440 bundles:0/0
  TX packets:0 fragments:0/0 bundles:0/0
  RX naks:0 defs:124 dups:0
  TX naks:21 acks:0 retrans:0
  Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the usual way via netlink by specifying that link name in the command.

Note: 'tipc_link_name_ext()' is removed because the link name can now be retrieved simply via 'l->name'.
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/bcast.c   |  6 ++---
 net/tipc/bcast.h   |  5 +++--
 net/tipc/link.c    | 65 +++++++++++++++++++++++++++---------------------
 net/tipc/link.h    |  3 +--
 net/tipc/msg.c     |  9 ++++----
 net/tipc/msg.h     |  2 +-
 net/tipc/netlink.c |  2 +-
 net/tipc/node.c    | 63 +++++++++++++++++++++++++++++++++++++++++++++-------
 net/tipc/trace.h   |  4 ++--
 9 files changed, 103 insertions(+), 56 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 50a16f8bebd9..383f87bc1061 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -563,10 +563,8 @@ void tipc_bcast_remove_peer(struct net *net, struct tipc_link *rcv_l)
 	tipc_sk_rcv(net, inputq);
 }
 
-int tipc_bclink_reset_stats(struct net *net)
+int tipc_bclink_reset_stats(struct net *net, struct tipc_link *l)
 {
-	struct tipc_link *l = tipc_bc_sndlink(net);
-
 	if (!l)
 		return -ENOPROTOOPT;
 
@@ -694,7 +692,7 @@ int tipc_bcast_init(struct net *net)
 	tn->bcbase = bb;
 	spin_lock_init(&tipc_net(net)->bclock);
 
-	if (!tipc_link_bc_create(net, 0, 0,
+	if (!tipc_link_bc_create(net, 0, 0, NULL,
 				 FB_MTU,
 				 BCLINK_WIN_DEFAULT,
 				 BCLINK_WIN_DEFAULT,
diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
index 97d3cf9d3e4d..4240c95188b1 100644
--- a/net/tipc/bcast.h
+++ b/net/tipc/bcast.h
@@ -96,9 +96,10 @@ void tipc_bcast_ack_rcv(struct net *net, struct tipc_link *l,
 int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
 			struct tipc_msg *hdr,
 			struct sk_buff_head *retrq);
-int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg);
+int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg,
+			struct tipc_link *bcl);
 int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]);
-int tipc_bclink_reset_stats(struct net *net);
+int tipc_bclink_reset_stats(struct net *net, struct tipc_link *l);
 
 u32 tipc_bcast_get_broadcast_mode(struct net *net);
 u32 tipc_bcast_get_broadcast_ratio(struct net *net);
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 3071e46f029a..808d3a76c27f 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -539,7 +539,7 @@ bool tipc_link_create(struct net *net, char *if_name, int bearer_id,
  *
  * Returns true if link was created, otherwise false
  */
-bool tipc_link_bc_create(struct net *net, u32 ownnode, u32 peer,
+bool tipc_link_bc_create(struct net *net, u32 ownnode, u32 peer, u8 *peer_id,
 			 int mtu, u32 min_win, u32 max_win, u16 peer_caps,
 			 struct sk_buff_head *inputq,
 			 struct sk_buff_head *namedq,
@@ -554,7 +554,18 @@ bool tipc_link_bc_create(struct net *net, u32 ownnode, u32 peer,
 		return false;
 
 	l = *link;
-	strcpy(l->name, tipc_bclink_name);
+	if (peer_id) {
+		char peer_str[NODE_ID_STR_LEN] = {0,};
+
+		tipc_nodeid2string(peer_str, peer_id);
+		if (strlen(peer_str) > 16)
+			sprintf(peer_str, "%x", peer);
+		/* Broadcast receiver link name: "broadcast-link:<peer>" */
+		snprintf(l->name, sizeof(l->name), "%s:%s", tipc_bclink_name,
+			 peer_str);
+	} else {
+		strcpy(l->name, tipc_bclink_name);
+	}
 	trace_tipc_link_reset(l, TIPC_DUMP_ALL, "bclink created!");
 	tipc_link_reset(l);
 	l->state = LINK_RESET;
@@ -1412,11 +1423,8 @@ static u8 __tipc_build_gap_ack_blks(struct tipc_gap_ack_blks *ga,
 			gacks[n].ack = htons(expect - 1);
 			gacks[n].gap = htons(seqno - expect);
 			if (++n >= MAX_GAP_ACK_BLKS / 2) {
-				char buf[TIPC_MAX_LINK_NAME];
-
 				pr_info_ratelimited("Gacks on %s: %d, ql: %d!\n",
-						    tipc_link_name_ext(l, buf),
-						    n,
+						    l->name, n,
 						    skb_queue_len(&l->deferdq));
 				return n;
 			}
@@ -1600,6 +1608,8 @@ static int tipc_link_advance_transmq(struct tipc_link *l, struct tipc_link *r,
 		_skb->priority = TC_PRIO_CONTROL;
 		__skb_queue_tail(xmitq, _skb);
 		l->stats.retransmitted++;
+		if (!is_uc)
+			r->stats.retransmitted++;
 		*retransmitted = true;
 		/* Increase actual retrans counter & mark first time */
 		if (!TIPC_SKB_CB(skb)->retr_cnt++)
@@ -1766,7 +1776,8 @@ int tipc_link_rcv(struct tipc_link *l, struct sk_buff *skb,
 
 		/* Defer delivery if sequence gap */
 		if (unlikely(seqno != rcv_nxt)) {
-			__tipc_skb_queue_sorted(defq, seqno, skb);
+			if (!__tipc_skb_queue_sorted(defq, seqno, skb))
+				l->stats.duplicates++;
 			rc |= tipc_link_build_nack_msg(l, xmitq);
 			break;
 		}
@@ -1800,15 +1811,15 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
 				      int tolerance, int priority,
 				      struct sk_buff_head *xmitq)
 {
+	struct tipc_mon_state *mstate = &l->mon_state;
+	struct sk_buff_head *dfq = &l->deferdq;
 	struct tipc_link *bcl = l->bc_rcvlink;
-	struct sk_buff *skb;
 	struct tipc_msg *hdr;
-	struct sk_buff_head *dfq = &l->deferdq;
+	struct sk_buff *skb;
 	bool node_up = link_is_up(bcl);
-	struct tipc_mon_state *mstate = &l->mon_state;
+	u16 glen = 0, bc_rcvgap = 0;
 	int dlen = 0;
 	void *data;
-	u16 glen = 0;
 
 	/* Don't send protocol message during reset or link failover */
 	if (tipc_link_is_blocked(l))
@@ -1846,7 +1857,8 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
 		if (l->peer_caps & TIPC_LINK_PROTO_SEQNO)
 			msg_set_seqno(hdr, l->snd_nxt_state++);
 		msg_set_seq_gap(hdr, rcvgap);
-		msg_set_bc_gap(hdr, link_bc_rcv_gap(bcl));
+		bc_rcvgap = link_bc_rcv_gap(bcl);
+		msg_set_bc_gap(hdr, bc_rcvgap);
 		msg_set_probe(hdr, probe);
 		msg_set_is_keepalive(hdr, probe || probe_reply);
 		if (l->peer_caps & TIPC_GAP_ACK_BLOCK)
@@ -1871,6 +1883,8 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
 		l->stats.sent_probes++;
 		if (rcvgap)
 			l->stats.sent_nacks++;
+		if (bc_rcvgap)
+			bcl->stats.sent_nacks++;
 		skb->priority = TC_PRIO_CONTROL;
 		__skb_queue_tail(xmitq, skb);
 		trace_tipc_proto_build(skb, false, l->name);
@@ -2371,8 +2385,6 @@ int tipc_link_bc_sync_rcv(struct tipc_link *l, struct tipc_msg *hdr,
 	if (!l->bc_peer_is_up)
 		return rc;
 
-	l->stats.recv_nacks++;
-
 	/* Ignore if peers_snd_nxt goes beyond receive window */
 	if (more(peers_snd_nxt, l->rcv_nxt + l->window))
 		return rc;
@@ -2423,6 +2435,11 @@ int tipc_link_bc_ack_rcv(struct tipc_link *r, u16 acked, u16 gap,
 	if (!link_is_up(r) || !r->bc_peer_is_up)
 		return 0;
 
+	if (gap) {
+		l->stats.recv_nacks++;
+		r->stats.recv_nacks++;
+	}
+
 	if (less(acked, r->acked) || (acked == r->acked && !gap && !ga))
 		return 0;
 
@@ -2734,16 +2751,15 @@ static int __tipc_nl_add_bc_link_stat(struct sk_buff *skb,
 	return -EMSGSIZE;
 }
 
-int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg)
+int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg,
+			struct tipc_link *bcl)
 {
 	int err;
 	void *hdr;
 	struct nlattr *attrs;
 	struct nlattr *prop;
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
 	u32 bc_mode = tipc_bcast_get_broadcast_mode(net);
 	u32 bc_ratio = tipc_bcast_get_broadcast_ratio(net);
-	struct tipc_link *bcl = tn->bcl;
 
 	if (!bcl)
 		return 0;
@@ -2830,21 +2846,6 @@ void tipc_link_set_abort_limit(struct tipc_link *l, u32 limit)
 	l->abort_limit = limit;
 }
 
-char *tipc_link_name_ext(struct tipc_link *l, char *buf)
-{
-	if (!l)
-		scnprintf(buf, TIPC_MAX_LINK_NAME, "null");
-	else if (link_is_bc_sndlink(l))
-		scnprintf(buf, TIPC_MAX_LINK_NAME, "broadcast-sender");
-	else if (link_is_bc_rcvlink(l))
-		scnprintf(buf, TIPC_MAX_LINK_NAME,
-			  "broadcast-receiver, peer %x", l->addr);
-	else
-		memcpy(buf, l->name, TIPC_MAX_LINK_NAME);
-
-	return buf;
-}
-
 /**
  * tipc_link_dump - dump TIPC link data
  * @l: tipc link to be dumped
diff --git a/net/tipc/link.h b/net/tipc/link.h
index 4d0768cf91d5..fc07232c9a12 100644
--- a/net/tipc/link.h
+++ b/net/tipc/link.h
@@ -80,7 +80,7 @@ bool tipc_link_create(struct net *net, char *if_name, int bearer_id,
 		      struct sk_buff_head *inputq,
 		      struct sk_buff_head *namedq,
 		      struct tipc_link **link);
-bool tipc_link_bc_create(struct net *net, u32 ownnode, u32 peer,
+bool tipc_link_bc_create(struct net *net, u32 ownnode, u32 peer, u8 *peer_id,
 			 int mtu, u32 min_win, u32 max_win, u16 peer_caps,
 			 struct sk_buff_head *inputq,
 			 struct sk_buff_head *namedq,
@@ -111,7 +111,6 @@ u16 tipc_link_rcv_nxt(struct tipc_link *l);
 u16 tipc_link_acked(struct tipc_link *l);
 u32 tipc_link_id(struct tipc_link *l);
 char *tipc_link_name(struct tipc_link *l);
-char *tipc_link_name_ext(struct tipc_link *l, char *buf);
 u32 tipc_link_state(struct tipc_link *l);
 char tipc_link_plane(struct tipc_link *l);
 int tipc_link_prio(struct tipc_link *l);
diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 0d515d20b056..69d68512300a 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -828,19 +828,19 @@ bool tipc_msg_pskb_copy(u32 dst, struct sk_buff_head *msg,
  * @seqno: sequence number of buffer to add
  * @skb: buffer to add
  */
-void __tipc_skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
+bool __tipc_skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
 			     struct sk_buff *skb)
 {
 	struct sk_buff *_skb, *tmp;
 
 	if (skb_queue_empty(list) || less(seqno, buf_seqno(skb_peek(list)))) {
 		__skb_queue_head(list, skb);
-		return;
+		return true;
 	}
 
 	if (more(seqno, buf_seqno(skb_peek_tail(list)))) {
 		__skb_queue_tail(list, skb);
-		return;
+		return true;
 	}
 
 	skb_queue_walk_safe(list, _skb, tmp) {
@@ -849,9 +849,10 @@ void __tipc_skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
 		if (seqno == buf_seqno(_skb))
 			break;
 		__skb_queue_before(list, _skb, skb);
-		return;
+		return true;
 	}
 	kfree_skb(skb);
+	return false;
 }
 
 void tipc_skb_reject(struct net *net, int err, struct sk_buff *skb,
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index 9a38f9c9d6eb..87e2d472f75f 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -1127,7 +1127,7 @@ bool tipc_msg_assemble(struct sk_buff_head *list);
 bool tipc_msg_reassemble(struct sk_buff_head *list, struct sk_buff_head *rcvq);
 bool tipc_msg_pskb_copy(u32 dst, struct sk_buff_head *msg,
 			struct sk_buff_head *cpy);
-void __tipc_skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
+bool __tipc_skb_queue_sorted(struct sk_buff_head *list, u16 seqno,
 			     struct sk_buff *skb);
 bool tipc_msg_skb_clone(struct sk_buff_head *msg, struct sk_buff_head *cpy);
 
diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c
index 7c35094c20b8..8dfad18330bc 100644
--- a/net/tipc/netlink.c
+++ b/net/tipc/netlink.c
@@ -187,7 +187,7 @@ static const struct genl_ops tipc_genl_v2_ops[] = {
 	},
 	{
 		.cmd	= TIPC_NL_LINK_GET,
-		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
+		.validate = GENL_DONT_VALIDATE_STRICT,
 		.doit	= tipc_nl_node_get_link,
 		.dumpit	= tipc_nl_node_dump_link,
 	},
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 917ad3920fac..373d07ae6730 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1138,7 +1138,7 @@ void tipc_node_check_dest(struct net *net, u32 addr,
 	if (unlikely(!n->bc_entry.link)) {
 		snd_l = tipc_bc_sndlink(net);
 		if (!tipc_link_bc_create(net, tipc_own_addr(net),
-					 addr, U16_MAX,
+					 addr, peer_id, U16_MAX,
 					 tipc_link_min_win(snd_l),
 					 tipc_link_max_win(snd_l),
 					 n->capabilities,
@@ -2432,7 +2432,7 @@ int tipc_nl_node_get_link(struct sk_buff *skb, struct genl_info *info)
 		return -ENOMEM;
 
 	if (strcmp(name, tipc_bclink_name) == 0) {
-		err = tipc_nl_add_bc_link(net, &msg);
+		err = tipc_nl_add_bc_link(net, &msg, tipc_net(net)->bcl);
 		if (err)
 			goto err_free;
 	} else {
@@ -2476,6 +2476,7 @@ int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info *info)
 	struct tipc_node *node;
 	struct nlattr *attrs[TIPC_NLA_LINK_MAX + 1];
 	struct net *net = sock_net(skb->sk);
+	struct tipc_net *tn = tipc_net(net);
 	struct tipc_link_entry *le;
 
 	if (!info->attrs[TIPC_NLA_LINK])
@@ -2492,11 +2493,26 @@ int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info *info)
 
 	link_name = nla_data(attrs[TIPC_NLA_LINK_NAME]);
 
-	if (strcmp(link_name, tipc_bclink_name) == 0) {
-		err = tipc_bclink_reset_stats(net);
+	err = -EINVAL;
+	if (!strcmp(link_name, tipc_bclink_name)) {
+		err = tipc_bclink_reset_stats(net, tipc_bc_sndlink(net));
 		if (err)
 			return err;
 		return 0;
+	} else if (strstr(link_name, tipc_bclink_name)) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(node, &tn->node_list, list) {
+			tipc_node_read_lock(node);
+			link = node->bc_entry.link;
+			if (link && !strcmp(link_name, tipc_link_name(link))) {
+				err = tipc_bclink_reset_stats(net, link);
+				tipc_node_read_unlock(node);
+				break;
+			}
+			tipc_node_read_unlock(node);
+		}
+		rcu_read_unlock();
+		return err;
 	}
 
 	node = tipc_node_find_by_name(net, link_name, &bearer_id);
@@ -2520,7 +2536,8 @@ int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info *info)
 
 /* Caller should hold node lock  */
 static int __tipc_nl_add_node_links(struct net *net, struct tipc_nl_msg *msg,
-				    struct tipc_node *node, u32 *prev_link)
+				    struct tipc_node *node, u32 *prev_link,
+				    u32 type)
 {
 	u32 i;
 	int err;
@@ -2536,6 +2553,14 @@ static int __tipc_nl_add_node_links(struct net *net, struct tipc_nl_msg *msg,
 		if (err)
 			return err;
 	}
+
+	if (type == 2) {
+		*prev_link = 3;
+		err = tipc_nl_add_bc_link(net, msg, node->bc_entry.link);
+		if (err)
+			return err;
+	}
+
 	*prev_link = 0;
 
 	return 0;
@@ -2544,17 +2569,38 @@ static int __tipc_nl_add_node_links(struct net *net, struct tipc_nl_msg *msg,
 int tipc_nl_node_dump_link(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
+	struct nlattr **attrs = genl_dumpit_info(cb)->attrs;
+	struct nlattr *link[TIPC_NLA_LINK_MAX + 1];
 	struct tipc_net *tn = net_generic(net, tipc_net_id);
 	struct tipc_node *node;
 	struct tipc_nl_msg msg;
 	u32 prev_node = cb->args[0];
 	u32 prev_link = cb->args[1];
 	int done = cb->args[2];
+	u32 type = cb->args[3];
 	int err;
 
 	if (done)
 		return 0;
 
+	if (!type) {
+		/* Dump broadcast-link & unicast links */
+		type = 1;
+		if (attrs && attrs[TIPC_NLA_LINK]) {
+			err = nla_parse_nested_deprecated(link,
+							  TIPC_NLA_LINK_MAX,
+							  attrs[TIPC_NLA_LINK],
+							  tipc_nl_link_policy,
+							  NULL);
+			if (unlikely(err))
+				return err;
+			if (unlikely(!link[TIPC_NLA_LINK_BROADCAST]))
+				return -EINVAL;
+			/* Dump broadcast-receiver links as well */
+			type = 2;
+		}
+	}
+
 	msg.skb = skb;
 	msg.portid = NETLINK_CB(cb->skb).portid;
 	msg.seq = cb->nlh->nlmsg_seq;
@@ -2578,7 +2624,7 @@ int tipc_nl_node_dump_link(struct sk_buff *skb, struct netlink_callback *cb)
 				       list) {
 			tipc_node_read_lock(node);
 			err = __tipc_nl_add_node_links(net, &msg, node,
-						       &prev_link);
+						       &prev_link, type);
 			tipc_node_read_unlock(node);
 			if (err)
 				goto out;
@@ -2586,14 +2632,14 @@ int tipc_nl_node_dump_link(struct sk_buff *skb, struct netlink_callback *cb)
 			prev_node = node->addr;
 		}
 	} else {
-		err = tipc_nl_add_bc_link(net, &msg);
+		err = tipc_nl_add_bc_link(net, &msg, tn->bcl);
 		if (err)
 			goto out;
 
 		list_for_each_entry_rcu(node, &tn->node_list, list) {
 			tipc_node_read_lock(node);
 			err = __tipc_nl_add_node_links(net, &msg, node,
-						       &prev_link);
+						       &prev_link, type);
 			tipc_node_read_unlock(node);
 			if (err)
 				goto out;
@@ -2608,6 +2654,7 @@ int tipc_nl_node_dump_link(struct sk_buff *skb, struct netlink_callback *cb)
 	cb->args[0] = prev_node;
 	cb->args[1] = prev_link;
 	cb->args[2] = done;
+	cb->args[3] = type;
 
 	return skb->len;
 }
diff --git a/net/tipc/trace.h b/net/tipc/trace.h
index e7535ab75255..04af83f0500c 100644
--- a/net/tipc/trace.h
+++ b/net/tipc/trace.h
@@ -255,7 +255,7 @@ DECLARE_EVENT_CLASS(tipc_link_class,
 
 	TP_fast_assign(
 		__assign_str(header, header);
-		tipc_link_name_ext(l, __entry->name);
+		memcpy(__entry->name, tipc_link_name(l), TIPC_MAX_LINK_NAME);
 		tipc_link_dump(l, dqueues, __get_str(buf));
 	),
 
@@ -295,7 +295,7 @@ DECLARE_EVENT_CLASS(tipc_link_transmq_class,
 	),
 
 	TP_fast_assign(
-		tipc_link_name_ext(r, __entry->name);
+		memcpy(__entry->name, tipc_link_name(r), TIPC_MAX_LINK_NAME);
 		__entry->from = f;
 		__entry->to = t;
 		__entry->len = skb_queue_len(tq);
--
2.13.7
|
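One possible shape for the macros Ying asks for above; the names are hypothetical and not from the posted patch:

/* Hypothetical dump-scope constants replacing the magic numbers
 * 'type == 1' / 'type == 2' discussed above.
 */
enum tipc_nl_link_dump_scope {
	TIPC_NL_DUMP_UNSET   = 0, /* scope not decided yet */
	TIPC_NL_DUMP_DEFAULT = 1, /* broadcast-link + unicast links */
	TIPC_NL_DUMP_BC_RCV  = 2, /* also the broadcast-receiver links */
};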
From: Xue, Y. <Yin...@wi...> - 2020-04-07 05:57:53
|
Hi Tuong,

Thank you for the clear explanations below. The reason why I asked about your test environment is that the more cores the CPU has, the more queues the NIC has, and the more TIPC nodes there are in one cluster, the more packets will be disordered and the more retransmission requests will happen. If we have such a test environment to validate broadcast throughput, it can help us clearly identify whether our current proposal is good or bad; if good, it can also tell us by how many percent the broadcast throughput is improved. In addition, under an RT_PREEMPT kernel, packets are more easily disordered since hard-IRQs and soft-IRQs are threaded; as a consequence, retransmission requests become common. On the contrary, if we only validate with two nodes in a virtual environment, it's a bit hard to measure whether the bcl flow control algorithms are good or bad.

Anyway, I am still happy to see this patch being merged upstream. But it would be wonderful if the switch option could be eliminated one day.

Thanks,
Ying

-----Original Message-----
From: Tuong Tong Lien [mailto:tuo...@de...]
Sent: Monday, April 6, 2020 4:54 PM
To: Xue, Ying; jm...@re...; ma...@do...; tip...@li...
Cc: tipc-dek
Subject: RE: [PATCH RFC 3/4] tipc: enable broadcast retrans via unicast

Hi Ying,

No problem, I'm using Outlook too... Please see my answers correspondingly:

1. When you mention "this proposal", do you mean the single patch or the whole series? These features are actually separate and not dependent on each other... Anyway, there have been some issues with the lab here, so I have only tested this new feature on KVM/QEMU nodes using the virtio_net driver, with 4 vCPUs and only one TX/RX queue enabled. Also, the real-time kernel has not been patched yet... If you have a better environment, may I ask you to help verify this?

Anyhow, if I caught your concerns from the last meeting correctly, they were mainly about the amount of packet retransmissions, which could panic the NIC or kernel, so not really scalable? If so, in theory it should not be a problem, since we already have the following mechanisms to control it:

- Link window (e.g. max 50 outstanding/retransmitted packets);
- Retransmission restricting timer on individual packets (e.g. within 10ms, a new retransmission request for the same packet is ignored; see the sketch after this message);
- The priority queue for packet retransmissions (which is unlikely to get congested);

Or, if you have any other concerns, please clarify.

2. Yes, the commit mentions the "bandwidth limit on broadcast", but it can be invisible to the user. One obvious way to see it is probably through the broadcast statistics (hence the need for the other patch, for the broadcast rcv link stats): users can see the sender trying to make a lot of (re-)transmissions that don't really work, while the receiver gets only a few... Ok, I will make this clear by repeating some performance tests.

3. Hmm, this was totally my mistake... I removed it when merging/separating the patches for this series ☹. In a premature patch, it was:

@@ -2425,7 +2426,7 @@ int tipc_link_bc_ack_rcv(struct tipc_link *r, u16 acked, u16 gap,
 		return 0;
 
 	trace_tipc_link_bc_ack(r, r->acked, acked, &l->transmq);
-	tipc_link_advance_transmq(l, r, acked, gap, ga, xmitq, &unused, &rc);
+	tipc_link_advance_transmq(l, r, acked, gap, ga, retrq, &unused, &rc);
 
 	tipc_link_advance_backlog(l, xmitq);
 	if (unlikely(!skb_queue_empty(&l->wakeupq)))

Thanks a lot for your finding, I will update the series!

BR/Tuong

-----Original Message-----
From: Xue, Ying <Yin...@wi...>
Sent: Monday, April 6, 2020 1:20 PM
To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: RE: [PATCH RFC 3/4] tipc: enable broadcast retrans via unicast

Hi Tuong,

Sorry, I have to use the Outlook client to reply to your email, which will make the email a bit messy. Please see my following comments:

==
[Ying] 1. Did you ever conduct comprehensive verification of this proposal? What kind of test environment did you use in your testing? For example, how many physical TIPC nodes were involved in your testing? Did the NICs used in your testing support the multiqueue feature? How many cores were there on each of your physical TIPC machines? In addition, if possible, I suggest you try enabling an RT_PREEMPT kernel to measure what throughput results we would get.
==

In some environments, broadcast traffic is suppressed at high rates (i.e. a kind of bandwidth-limit setting). When this is applied, TIPC broadcast can still run successfully. However, at high load some packets will be dropped first, and TIPC tries to retransmit them, but the packet retransmission is intentionally broadcast too, making things worse rather than helping at all.

This commit enables broadcast retransmission via unicast, which only retransmits packets to the specific peer that has actually reported a gap, i.e. without broadcasting to all nodes in the cluster. This prevents the retransmissions from being suppressed, reduces some overhead on the other peers due to duplicates, and finally improves the overall TIPC broadcast performance.

Note: the functionality can be turned on/off via the sysctl file:

echo 1 > /proc/sys/net/tipc/bc_retruni
echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual.

==
[Ying] 2. Actually, I had a similar idea before, so I also think the broadcast performance might be significantly improved by this proposal. But as TIPC developers we should explicitly tell users under what conditions they should enable this option and under what conditions they should disable it; otherwise, users have no idea at all when to enable or disable it. So, please give more performance data obtained under different test conditions. If this patch gives broadcast performance a clear benefit under all test conditions, ideally we remove the option completely. Otherwise, at least we can tell users when to enable it.
==

Signed-off-by: Tuong Lien <tuo...@de...>

 int tipc_link_bc_ack_rcv(struct tipc_link *r, u16 acked, u16 gap,
 			 struct tipc_gap_ack_blks *ga,
-			 struct sk_buff_head *xmitq)
+			 struct sk_buff_head *xmitq,
+			 struct sk_buff_head *retrq)
 {
 	struct tipc_link *l = r->bc_sndlink;
 	bool unused = false;

==
3. [Ying] Sorry, I felt a bit confused. A new "retrq" parameter was introduced, but I didn't find where it is used in this function. Can you please explain how the new parameter works?
==

Thanks,
Ying

@@ -2460,7 +2461,8 @@ int tipc_link_bc_nack_rcv(struct tipc_link *l, struct sk_buff *skb,
 		return 0;
 
 	if (dnode == tipc_own_addr(l->net)) {
-		rc = tipc_link_bc_ack_rcv(l, acked, to - acked, NULL, xmitq);
+		rc = tipc_link_bc_ack_rcv(l, acked, to - acked, NULL, xmitq,
+					  xmitq);
 		l->stats.recv_nacks++;
 		return rc;
 	}
diff --git a/net/tipc/link.h b/net/tipc/link.h
index 0a0fa7350722..4d0768cf91d5 100644
--- a/net/tipc/link.h
+++ b/net/tipc/link.h
@@ -147,7 +147,8 @@ u16 tipc_get_gap_ack_blks(struct tipc_gap_ack_blks **ga, struct tipc_link *l,
 			  struct tipc_msg *hdr, bool uc);
 int tipc_link_bc_ack_rcv(struct tipc_link *l, u16 acked, u16 gap,
 			 struct tipc_gap_ack_blks *ga,
-			 struct sk_buff_head *xmitq);
+			 struct sk_buff_head *xmitq,
+			 struct sk_buff_head *retrq);
 void tipc_link_build_bc_sync_msg(struct tipc_link *l,
 				 struct sk_buff_head *xmitq);
 void tipc_link_bc_init_rcv(struct tipc_link *l, struct tipc_msg *hdr);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index eb6b62de81a7..917ad3920fac 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1771,7 +1771,7 @@ static void tipc_node_bc_sync_rcv(struct tipc_node *n, struct tipc_msg *hdr,
 	struct tipc_link *ucl;
 	int rc;
 
-	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr);
+	rc = tipc_bcast_sync_rcv(n->net, n->bc_entry.link, hdr, xmitq);
 
 	if (rc & TIPC_LINK_DOWN_EVT) {
 		tipc_node_reset_links(n);
diff --git a/net/tipc/sysctl.c b/net/tipc/sysctl.c
index 58ab3d6dcdce..97a6264a2993 100644
--- a/net/tipc/sysctl.c
+++ b/net/tipc/sysctl.c
@@ -36,7 +36,7 @@
 #include "core.h"
 #include "trace.h"
 #include "crypto.h"
-
+#include "bcast.h"
 #include <linux/sysctl.h>
 
 static struct ctl_table_header *tipc_ctl_hdr;
@@ -75,6 +75,13 @@ static struct ctl_table tipc_table[] = {
 		.extra1	= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "bc_retruni",
+		.data		= &sysctl_tipc_bc_retruni,
+		.maxlen		= sizeof(sysctl_tipc_bc_retruni),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 	{}
 };
--
2.13.7
|
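A small userspace model of the second mechanism in Tuong's list, the per-packet retransmission throttle (the sketch referenced after his message); the field and constant names are illustrative, not the kernel's:

#include <stdbool.h>

#define RETR_HOLDOFF_MS 10UL

struct pkt_state {
	unsigned long nxt_retr_ms; /* earliest time of next retransmit */
	unsigned int  retr_cnt;    /* times this pkt was retransmitted */
};

static bool may_retransmit(struct pkt_state *p, unsigned long now_ms)
{
	if ((long)(now_ms - p->nxt_retr_ms) < 0)
		return false;      /* request ignored: came too soon */
	p->nxt_retr_ms = now_ms + RETR_HOLDOFF_MS;
	p->retr_cnt++;
	return true;
}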
From: Tuong T. L. <tuo...@de...> - 2020-04-06 09:06:25
|
-----Original Message-----
From: Xue, Ying <Yin...@wi...>
Sent: Monday, April 6, 2020 2:19 PM
To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; tip...@li...
Cc: tipc-dek <tip...@de...>
Subject: RE: [PATCH RFC 1/4] tipc: introduce Gap ACK blocks for broadcast link

 31                       16 15                        0
+-------------+-------------+-------------+-------------+
|  bgack_cnt  |  ugack_cnt  |            len            |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > bc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > uc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible.

===
[Ying] In my opinion, this patch will cause the backward-compatibility issue below:

1) On the TIPC node with the patch: when sending a 'PROTOCOL/STATE_MSG' message, its 'Gap ACK blocks' data field only contains bcl gap ack blocks, but no unicast link gap ack blocks.

2) On the TIPC node without the patch: upon receiving the message sent by the node of case 1), this node will assume the 'Gap ACK blocks' data field contains unicast link gap ack blocks rather than broadcast link gap ack blocks.

[Tuong]: As you can see in the figure above, we have two different "b/ugack_cnt" fields which determine the number of broadcast/unicast gap ack blocks in the message. The "ugack_cnt" is fully identical to the "gack_cnt" in the old version (i.e. without the patch), indicating the number of unicast gap ack blocks anyway, whereas the "bgack_cnt" was a reserved field. So, in your scenario, the sending side will send the message with "ugack_cnt" = 0, and this is completely compatible with the old version: the receiving side will see no unicast gap ack blocks and just ignore the broadcast gap ack blocks (it doesn't really know about them). Actually, there is also a sanity check on the length in the old code that will simply ignore such a gap ack block report... So, we have no problem at all. That is why I've declared it automatically backward-compatible.

>>[Ying]: Thanks for your clarification. Yes, you are right. Now it's really compatible between the old and new versions. So I wonder whether any backward-compatibility issue would exist at all, and everything would become pretty easy, if we used LINK_PROTOCOL to carry only unicast gap ack blocks and used BCAST_PROTOCOL to convey the broadcast gap ack blocks. In other words, we wouldn't need to enlarge the current gap ack block space, and we wouldn't need to change the current code related to unicast gap ack blocks. Instead, we would just need to add support for broadcast gap ack blocks through BCAST_PROTOCOL rather than LINK_PROTOCOL.

[Tuong]: The BCAST_PROTOCOL is currently only used for broadcast initializing or synching when a new peer joins; the old mechanism of broadcast NACKs is deprecated... I suppose that using the LINK_PROTOCOL is much more convenient, since the traditional ack/gap reports for the broadcast link are also made via this message, so we don't need to create a new code flow to handle the gap/ack blocks. Actually, the changes in the current code related to unicast gap ack blocks are just code optimizations, e.g. removing an old function, etc.; there is no impact on its functionality.

>>[Ying]: Sorry, I forgot this comment: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02d11ca20091fcef904f05defda80c53e5b4e793. It made broadcast NACKs be delivered through link state messages.

Okay, both the unicast link and bcl gap ack blocks can be transferred in the same link state message.

>>[Ying]: By the way, I have another minor comment: as "bgack_cnt" is defined after "ugack_cnt" in struct tipc_gap_ack_blks, please reverse their order in the struct description section.

/* struct tipc_gap_ack_blks
 * @len: actual length of the record
- * @gack_cnt: number of Gap ACK blocks in the record
+ * @bgack_cnt: number of Gap ACK blocks for broadcast in the record
+ * @ugack_cnt: number of Gap ACK blocks for unicast (following the broadcast
+ *             ones)
+ * @start_index: starting index for "valid" broadcast Gap ACK blocks
 * @gacks: array of Gap ACK blocks
 */
 struct tipc_gap_ack_blks {
 	__be16 len;
-	u8 gack_cnt;
-	u8 reserved;
+	union {
+		u8 ugack_cnt;
+		u8 start_index;
+	};
+	u8 bgack_cnt;
 	struct tipc_gap_ack gacks[];
 };

Thanks, I will update the order in the struct description!

BR/Tuong
|
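To illustrate the backward compatibility argument, here is a userspace sketch of walking the combined record in the figure above (broadcast blocks first, then unicast blocks). The struct mirrors the patch but is simplified (the 'start_index' union member is omitted), and the parsing helper itself is an illustration: an old receiver reads 'ugack_cnt' at the offset of its old 'gack_cnt', so a message carrying only broadcast blocks looks to it like a record with zero gap ack blocks.

#include <stdint.h>
#include <stdio.h>

struct tipc_gap_ack {
	uint16_t ack;
	uint16_t gap;
};

struct tipc_gap_ack_blks {
	uint16_t len;
	uint8_t  ugack_cnt; /* unicast blocks, after the broadcast ones */
	uint8_t  bgack_cnt; /* broadcast blocks, first in gacks[] */
	struct tipc_gap_ack gacks[];
};

static void walk_gacks(const struct tipc_gap_ack_blks *ga, int want_bcast)
{
	int start = want_bcast ? 0 : ga->bgack_cnt;
	int cnt = want_bcast ? ga->bgack_cnt : ga->ugack_cnt;
	int i;

	for (i = start; i < start + cnt; i++)
		printf("ack=%u gap=%u\n", ga->gacks[i].ack, ga->gacks[i].gap);
}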