From: Tung N. <tun...@de...> - 2019-02-19 07:03:31
Fix issue of hung sendto() in user space.

Tung Nguyen (1):
  tipc: fix race condition causing hung sendto

 net/tipc/socket.c | 4 ++++
 1 file changed, 4 insertions(+)
 mode change 100644 => 100755 net/tipc/socket.c

--
2.17.1
From: Tung N. <tun...@de...> - 2019-02-19 04:21:11
Some improvements for tipc_wait_for_xyz().

Tung Nguyen (2):
  tipc: improve function tipc_wait_for_cond()
  tipc: improve function tipc_wait_for_rcvmsg()

 net/tipc/socket.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--
2.17.1
From: Tung N. <tun...@de...> - 2019-02-19 04:21:11
Commit 844cf763fba6 ("tipc: make macro tipc_wait_for_cond() smp safe") replaced finish_wait() with remove_wait_queue() but still used prepare_to_wait(). This causes unnecessary conditional checking before adding to wait queue in prepare_to_wait(). This commit replaces prepare_to_wait() with add_wait_queue() as the pair function with remove_wait_queue(). Acked-by: Ying Xue <yin...@wi...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tung Nguyen <tun...@de...> --- net/tipc/socket.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 1217c90a363b..81b87916a0eb 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -388,7 +388,7 @@ static int tipc_sk_sock_err(struct socket *sock, long *timeout) rc_ = tipc_sk_sock_err((sock_), timeo_); \ if (rc_) \ break; \ - prepare_to_wait(sk_sleep(sk_), &wait_, TASK_INTERRUPTIBLE); \ + add_wait_queue(sk_sleep(sk_), &wait_); \ release_sock(sk_); \ *(timeo_) = wait_woken(&wait_, TASK_INTERRUPTIBLE, *(timeo_)); \ sched_annotate_sleep(); \ -- 2.17.1 |
From: Tung N. <tun...@de...> - 2019-02-19 04:21:09
This commit replaces schedule_timeout() with wait_woken() in function tipc_wait_for_rcvmsg(). wait_woken() uses memory barriers in its implementation to avoid potential race condition when putting a process into sleeping state and then waking it up. Acked-by: Ying Xue <yin...@wi...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tung Nguyen <tun...@de...> --- net/tipc/socket.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 81b87916a0eb..684f2125fc6b 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -1677,7 +1677,7 @@ static void tipc_sk_send_ack(struct tipc_sock *tsk) static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) { struct sock *sk = sock->sk; - DEFINE_WAIT(wait); + DEFINE_WAIT_FUNC(wait, woken_wake_function); long timeo = *timeop; int err = sock_error(sk); @@ -1685,15 +1685,17 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) return err; for (;;) { - prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); if (timeo && skb_queue_empty(&sk->sk_receive_queue)) { if (sk->sk_shutdown & RCV_SHUTDOWN) { err = -ENOTCONN; break; } + add_wait_queue(sk_sleep(sk), &wait); release_sock(sk); - timeo = schedule_timeout(timeo); + timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo); + sched_annotate_sleep(); lock_sock(sk); + remove_wait_queue(sk_sleep(sk), &wait); } err = 0; if (!skb_queue_empty(&sk->sk_receive_queue)) @@ -1709,7 +1711,6 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) if (err) break; } - finish_wait(sk_sleep(sk), &wait); *timeop = timeo; return err; } -- 2.17.1 |
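For quick reference, the shared pattern introduced by the two patches above, condensed from the quoted hunks (excerpt only; the signal and error handling of the real functions is omitted):

	/* Condensed from the two patches above: sleep via wait_woken() instead of
	 * prepare_to_wait()/schedule_timeout()/finish_wait().  wait_woken() contains
	 * the memory barriers needed for a safe sleep/wakeup handshake.
	 */
	DEFINE_WAIT_FUNC(wait, woken_wake_function);            /* was DEFINE_WAIT(wait) */

	add_wait_queue(sk_sleep(sk), &wait);                    /* was prepare_to_wait() */
	release_sock(sk);
	timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo);   /* was schedule_timeout() */
	sched_annotate_sleep();
	lock_sock(sk);
	remove_wait_queue(sk_sleep(sk), &wait);                 /* was finish_wait() */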
From: Jon M. <jon...@er...> - 2019-02-19 00:03:11
This also looks very good now, I only have some comments to the log text. Just make sure you test this with the groupcast test program before you post it. Acked-by: Jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Feb-19 05:08 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [net-next v3 3/3] tipc: smooth change between replicast and > broadcast > > Currently, a multicast stream may start out using replicast, because there are > few destinations, and then it should ideally switch to L2/broadcast > IGMP/multicast when the number of destinations grows beyond a certain > limit. The opposite should happen when the number decreases below the > limit. > > To eliminate the risk of message reordering caused by method change, a > sending socket must stick to a previously selected method until it enters an > idle period of 5 seconds. Means there is a 5 seconds pause in the traffic from > the sender socket. If the sender never makes such a pause, the method will never change, and transmission may become very inefficient as the cluster grows. > > With this commit, we allow such a switch between replicast and broadcast > without 5 seconds pause in the traffic. without any need for a traffic pause. > > Solution is to send a dummy message with only the header, also with the SYN > bit set, via broadcast or replicast. For the data message, the SYN bit is set and > sending via replicast or broadcast (inverse method with dummy). > > Then, at receiving side any messages follow first SYN bit message (data or > dummy message), they will be hold s/hold/held in deferred queue until another pair > (dummy or data message) arrived in other link. ///jon > > Signed-off-by: Hoang Le <hoa...@de...> > --- > net/tipc/bcast.c | 165 > +++++++++++++++++++++++++++++++++++++++++++++- > net/tipc/bcast.h | 5 ++ > net/tipc/msg.h | 10 +++ > net/tipc/socket.c | 5 ++ > 4 files changed, 184 insertions(+), 1 deletion(-) > > diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index > 12b59268bdd6..d806e3714280 100644 > --- a/net/tipc/bcast.c > +++ b/net/tipc/bcast.c > @@ -220,9 +220,24 @@ static void tipc_bcast_select_xmit_method(struct > net *net, int dests, > } > /* Can current method be changed ? */ > method->expires = jiffies + TIPC_METHOD_EXPIRE; > - if (method->mandatory || time_before(jiffies, exp)) > + if (method->mandatory) > return; > > + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) && > + time_before(jiffies, exp)) > + return; > + > + /* Configuration as force 'broadcast' method */ > + if (bb->force_bcast) { > + method->rcast = false; > + return; > + } > + /* Configuration as force 'replicast' method */ > + if (bb->force_rcast) { > + method->rcast = true; > + return; > + } > + /* Configuration as 'selectable' or default method */ > /* Determine method to use now */ > method->rcast = dests <= bb->bc_threshold; } @@ -285,6 +300,63 > @@ static int tipc_rcast_xmit(struct net *net, struct sk_buff_head *pkts, > return 0; > } > > +/* tipc_mcast_send_sync - deliver a dummy message with SYN bit > + * @net: the applicable net namespace > + * @skb: socket buffer to copy > + * @method: send method to be used > + * @dests: destination nodes for message. 
> + * @cong_link_cnt: returns number of encountered congested destination > +links > + * Returns 0 if success, otherwise errno */ static int > +tipc_mcast_send_sync(struct net *net, struct sk_buff *skb, > + struct tipc_mc_method *method, > + struct tipc_nlist *dests, > + u16 *cong_link_cnt) > +{ > + struct sk_buff_head tmpq; > + struct sk_buff *_skb; > + struct tipc_msg *hdr, *_hdr; > + > + /* Is a cluster supporting with new capabilities ? */ > + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) > + return 0; > + > + hdr = buf_msg(skb); > + if (msg_user(hdr) == MSG_FRAGMENTER) > + hdr = msg_get_wrapped(hdr); > + if (msg_type(hdr) != TIPC_MCAST_MSG) > + return 0; > + > + /* Allocate dummy message */ > + _skb = tipc_buf_acquire(MCAST_H_SIZE, GFP_KERNEL); > + if (!skb) > + return -ENOMEM; > + > + /* Preparing for 'synching' header */ > + msg_set_syn(hdr, 1); > + > + /* Copy skb's header into a dummy header */ > + skb_copy_to_linear_data(_skb, hdr, MCAST_H_SIZE); > + skb_orphan(_skb); > + > + /* Reverse method for dummy message */ > + _hdr = buf_msg(_skb); > + msg_set_size(_hdr, MCAST_H_SIZE); > + msg_set_is_rcast(_hdr, !msg_is_rcast(hdr)); > + > + skb_queue_head_init(&tmpq); > + __skb_queue_tail(&tmpq, _skb); > + if (method->rcast) > + tipc_bcast_xmit(net, &tmpq, cong_link_cnt); > + else > + tipc_rcast_xmit(net, &tmpq, dests, cong_link_cnt); > + > + /* This queue should normally be empty by now */ > + __skb_queue_purge(&tmpq); > + > + return 0; > +} > + > /* tipc_mcast_xmit - deliver message to indicated destination nodes > * and to identified node local sockets > * @net: the applicable net namespace > @@ -300,6 +372,9 @@ int tipc_mcast_xmit(struct net *net, struct > sk_buff_head *pkts, > u16 *cong_link_cnt) > { > struct sk_buff_head inputq, localq; > + struct sk_buff *skb; > + struct tipc_msg *hdr; > + bool rcast = method->rcast; > int rc = 0; > > skb_queue_head_init(&inputq); > @@ -313,6 +388,18 @@ int tipc_mcast_xmit(struct net *net, struct > sk_buff_head *pkts, > /* Send according to determined transmit method */ > if (dests->remote) { > tipc_bcast_select_xmit_method(net, dests->remote, > method); > + > + skb = skb_peek(pkts); > + hdr = buf_msg(skb); > + if (msg_user(hdr) == MSG_FRAGMENTER) > + hdr = msg_get_wrapped(hdr); > + msg_set_is_rcast(hdr, method->rcast); > + > + /* Switch method ? */ > + if (rcast != method->rcast) > + tipc_mcast_send_sync(net, skb, method, > + dests, cong_link_cnt); > + > if (method->rcast) > rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); > else > @@ -672,3 +759,79 @@ u32 tipc_bcast_get_broadcast_ratio(struct net *net) > > return bb->rc_ratio; > } > + > +void tipc_mcast_filter_msg(struct sk_buff_head *defq, > + struct sk_buff_head *inputq) > +{ > + struct sk_buff *skb, *_skb, *tmp; > + struct tipc_msg *hdr, *_hdr; > + bool match = false; > + u32 node, port; > + > + skb = skb_peek(inputq); > + hdr = buf_msg(skb); > + > + if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) > + return; > + > + node = msg_orignode(hdr); > + port = msg_origport(hdr); > + > + /* Has the twin SYN message already arrived ? 
*/ > + skb_queue_walk(defq, _skb) { > + _hdr = buf_msg(_skb); > + if (msg_orignode(_hdr) != node) > + continue; > + if (msg_origport(_hdr) != port) > + continue; > + match = true; > + break; > + } > + > + if (!match) { > + if (!msg_is_syn(hdr)) > + return; > + __skb_dequeue(inputq); > + __skb_queue_tail(defq, skb); > + return; > + } > + > + /* Deliver non-SYN message from other link, otherwise queue it */ > + if (!msg_is_syn(hdr)) { > + if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) > + return; > + __skb_dequeue(inputq); > + __skb_queue_tail(defq, skb); > + return; > + } > + > + /* Queue non-SYN/SYN message from same link */ > + if (msg_is_rcast(hdr) == msg_is_rcast(_hdr)) { > + __skb_dequeue(inputq); > + __skb_queue_tail(defq, skb); > + return; > + } > + > + /* Matching SYN messages => return the one with data, if any */ > + __skb_unlink(_skb, defq); > + if (msg_data_sz(hdr)) { > + kfree_skb(_skb); > + } else { > + __skb_dequeue(inputq); > + kfree_skb(skb); > + __skb_queue_tail(inputq, _skb); > + } > + > + /* Deliver subsequent non-SYN messages from same peer */ > + skb_queue_walk_safe(defq, _skb, tmp) { > + _hdr = buf_msg(_skb); > + if (msg_orignode(_hdr) != node) > + continue; > + if (msg_origport(_hdr) != port) > + continue; > + if (msg_is_syn(_hdr)) > + break; > + __skb_unlink(_skb, defq); > + __skb_queue_tail(inputq, _skb); > + } > +} > diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index > 37c55e7347a5..484bde289d3a 100644 > --- a/net/tipc/bcast.h > +++ b/net/tipc/bcast.h > @@ -67,11 +67,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 node); > /* Cookie to be used between socket and broadcast layer > * @rcast: replicast (instead of broadcast) was used at previous xmit > * @mandatory: broadcast/replicast indication was set by user > + * @deferredq: defer queue to make message in order > * @expires: re-evaluate non-mandatory transmit method if we are past this > */ > struct tipc_mc_method { > bool rcast; > bool mandatory; > + struct sk_buff_head deferredq; > unsigned long expires; > }; > > @@ -99,6 +101,9 @@ int tipc_bclink_reset_stats(struct net *net); > u32 tipc_bcast_get_broadcast_mode(struct net *net); > u32 tipc_bcast_get_broadcast_ratio(struct net *net); > > +void tipc_mcast_filter_msg(struct sk_buff_head *defq, > + struct sk_buff_head *inputq); > + > static inline void tipc_bcast_lock(struct net *net) { > spin_lock_bh(&tipc_net(net)->bclock); > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > d7e4b8b93f9d..528ba9241acc 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -257,6 +257,16 @@ static inline void msg_set_src_droppable(struct > tipc_msg *m, u32 d) > msg_set_bits(m, 0, 18, 1, d); > } > > +static inline bool msg_is_rcast(struct tipc_msg *m) { > + return msg_bits(m, 0, 18, 0x1); > +} > + > +static inline void msg_set_is_rcast(struct tipc_msg *m, bool d) { > + msg_set_bits(m, 0, 18, 0x1, d); > +} > + > static inline void msg_set_size(struct tipc_msg *m, u32 sz) { > m->hdr[0] = htonl((msg_word(m, 0) & ~0x1ffff) | sz); diff --git > a/net/tipc/socket.c b/net/tipc/socket.c index 8fc5acd4820d..de83eb1e718e > 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -483,6 +483,7 @@ static int tipc_sk_create(struct net *net, struct socket > *sock, > tsk_set_unreturnable(tsk, true); > if (sock->type == SOCK_DGRAM) > tsk_set_unreliable(tsk, true); > + __skb_queue_head_init(&tsk->mc_method.deferredq); > } > > trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " "); @@ -580,6 > +581,7 @@ static int tipc_release(struct socket *sock) > sk->sk_shutdown = SHUTDOWN_MASK; 
> tipc_sk_leave(tsk); > tipc_sk_withdraw(tsk, 0, NULL); > + __skb_queue_purge(&tsk->mc_method.deferredq); > sk_stop_timer(sk, &sk->sk_timer); > tipc_sk_remove(tsk); > > @@ -2157,6 +2159,9 @@ static void tipc_sk_filter_rcv(struct sock *sk, struct > sk_buff *skb, > if (unlikely(grp)) > tipc_group_filter_msg(grp, &inputq, xmitq); > > + if (msg_type(hdr) == TIPC_MCAST_MSG) > + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, > &inputq); > + > /* Validate and add to receive buffer if there is space */ > while ((skb = __skb_dequeue(&inputq))) { > hdr = buf_msg(skb); > -- > 2.17.1 |
From: Jon M. <jon...@er...> - 2019-02-18 23:54:21
Hi Hoang, I couldn't resist to improve the commit log, as I think it is very important that people understand why we are doing this. The patch as such looks good to me, and you can add an ack from me if you update the log text. See below. > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Feb-19 05:08 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [net-next v3 1/3] tipc: support broadcast/replicast configurable for > bc-link > > Currently, a multicast stream uses method broadcast or replicast to transmit > base on cluster size and number of destinations. [Jon] Currently, a multicast stream uses either broadcast or replicast as transmission method, based on the ratio between number of actual destinations nodes and cluster size. > > However, when L2 interface (e.g VXLAN) pseudo support broadcast, we > should make broadcast link possible for the user to switch to replicast for all > multicast traffic; besides, we would like to test the broadcast implementation > we also want to switch to broadcast for all multicast traffic. [jon] However, when an L2 interface (e.g., VXLAN) provides pseudo broadcast support, this becomes very inefficient, as it blindly replicates multicast packets to all cluster/subnet nodes, irrespective of whether they host actual target sockets or not. The TIPC multicast algorithm is able to distinguish real destination nodes from other nodes, and hence provides a smarter and more efficient method for transferring multicast messages than pseudo broadcast can do. Because of this, we now make it possible for users to force the broadcast link to permanently switch to using replicast, irrespective of which capabilities the bearer provides, or pretend to provide. > multicast traffic; besides, we would like to test the broadcast implementation > we also want to switch to broadcast for all multicast traffic. [jon] Conversely, we also make it possible to force the broadcast link to always use true broadcast. While maybe less useful in deployed systems, this may at least be useful for testing the broadcast algorithm in small clusters. > > The current algorithm implementation is still available to switch between > broadcast and replicast depending on cluster size and destination number > (default), but ratio (selection threshold) is also configurable. And the default > for 'ratio' reduce to 10 (meaning 10%). [Jon] We retain the current AUTOSELECT ability, i.e., to let the broadcast link automatically select which algorithm to use, and to switch back and forth between broadcast and replicast as the ratio between destination node number and cluster size changes. This remains the default method. Furthermore, we make it possible to configure the threshold ratio for such switches. The default ratio is now set to 10%, down from 25% in the earlier implementation. 
///jon > > Signed-off-by: Hoang Le <hoa...@de...> > --- > include/uapi/linux/tipc_netlink.h | 2 + > net/tipc/bcast.c | 104 ++++++++++++++++++++++++++++-- > net/tipc/bcast.h | 7 ++ > net/tipc/link.c | 8 +++ > net/tipc/netlink.c | 4 +- > 5 files changed, 120 insertions(+), 5 deletions(-) > > diff --git a/include/uapi/linux/tipc_netlink.h > b/include/uapi/linux/tipc_netlink.h > index 0ebe02ef1a86..efb958fd167d 100644 > --- a/include/uapi/linux/tipc_netlink.h > +++ b/include/uapi/linux/tipc_netlink.h > @@ -281,6 +281,8 @@ enum { > TIPC_NLA_PROP_TOL, /* u32 */ > TIPC_NLA_PROP_WIN, /* u32 */ > TIPC_NLA_PROP_MTU, /* u32 */ > + TIPC_NLA_PROP_BROADCAST, /* u32 */ > + TIPC_NLA_PROP_BROADCAST_RATIO, /* u32 */ > > __TIPC_NLA_PROP_MAX, > TIPC_NLA_PROP_MAX = __TIPC_NLA_PROP_MAX - 1 diff --git > a/net/tipc/bcast.c b/net/tipc/bcast.c index d8026543bf4c..12b59268bdd6 > 100644 > --- a/net/tipc/bcast.c > +++ b/net/tipc/bcast.c > @@ -54,7 +54,9 @@ const char tipc_bclink_name[] = "broadcast-link"; > * @dests: array keeping number of reachable destinations per bearer > * @primary_bearer: a bearer having links to all broadcast destinations, if any > * @bcast_support: indicates if primary bearer, if any, supports broadcast > + * @force_bcast: forces broadcast for multicast traffic > * @rcast_support: indicates if all peer nodes support replicast > + * @force_rcast: forces replicast for multicast traffic > * @rc_ratio: dest count as percentage of cluster size where send method > changes > * @bc_threshold: calculated from rc_ratio; if dests > threshold use > broadcast > */ > @@ -64,7 +66,9 @@ struct tipc_bc_base { > int dests[MAX_BEARERS]; > int primary_bearer; > bool bcast_support; > + bool force_bcast; > bool rcast_support; > + bool force_rcast; > int rc_ratio; > int bc_threshold; > }; > @@ -485,10 +489,63 @@ static int tipc_bc_link_set_queue_limits(struct net > *net, u32 limit) > return 0; > } > > +static int tipc_bc_link_set_broadcast_mode(struct net *net, u32 > +bc_mode) { > + struct tipc_bc_base *bb = tipc_bc_base(net); > + > + switch (bc_mode) { > + case BCLINK_MODE_BCAST: > + if (!bb->bcast_support) > + return -ENOPROTOOPT; > + > + bb->force_bcast = true; > + bb->force_rcast = false; > + break; > + case BCLINK_MODE_RCAST: > + if (!bb->rcast_support) > + return -ENOPROTOOPT; > + > + bb->force_bcast = false; > + bb->force_rcast = true; > + break; > + case BCLINK_MODE_SEL: > + if (!bb->bcast_support || !bb->rcast_support) > + return -ENOPROTOOPT; > + > + bb->force_bcast = false; > + bb->force_rcast = false; > + break; > + default: > + return -EINVAL; > + } > + > + return 0; > +} > + > +static int tipc_bc_link_set_broadcast_ratio(struct net *net, u32 > +bc_ratio) { > + struct tipc_bc_base *bb = tipc_bc_base(net); > + > + if (!bb->bcast_support || !bb->rcast_support) > + return -ENOPROTOOPT; > + > + if (bc_ratio > 100 || bc_ratio <= 0) > + return -EINVAL; > + > + bb->rc_ratio = bc_ratio; > + tipc_bcast_lock(net); > + tipc_bcbase_calc_bc_threshold(net); > + tipc_bcast_unlock(net); > + > + return 0; > +} > + > int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]) { > int err; > u32 win; > + u32 bc_mode; > + u32 bc_ratio; > struct nlattr *props[TIPC_NLA_PROP_MAX + 1]; > > if (!attrs[TIPC_NLA_LINK_PROP]) > @@ -498,12 +555,28 @@ int tipc_nl_bc_link_set(struct net *net, struct nlattr > *attrs[]) > if (err) > return err; > > - if (!props[TIPC_NLA_PROP_WIN]) > + if (!props[TIPC_NLA_PROP_WIN] && > + !props[TIPC_NLA_PROP_BROADCAST] && > + !props[TIPC_NLA_PROP_BROADCAST_RATIO]) { > return -EOPNOTSUPP; > + } > + 
> + if (props[TIPC_NLA_PROP_BROADCAST]) { > + bc_mode = > nla_get_u32(props[TIPC_NLA_PROP_BROADCAST]); > + err = tipc_bc_link_set_broadcast_mode(net, bc_mode); > + } > > - win = nla_get_u32(props[TIPC_NLA_PROP_WIN]); > + if (!err && props[TIPC_NLA_PROP_BROADCAST_RATIO]) { > + bc_ratio = > nla_get_u32(props[TIPC_NLA_PROP_BROADCAST_RATIO]); > + err = tipc_bc_link_set_broadcast_ratio(net, bc_ratio); > + } > > - return tipc_bc_link_set_queue_limits(net, win); > + if (!err && props[TIPC_NLA_PROP_WIN]) { > + win = nla_get_u32(props[TIPC_NLA_PROP_WIN]); > + err = tipc_bc_link_set_queue_limits(net, win); > + } > + > + return err; > } > > int tipc_bcast_init(struct net *net) > @@ -529,7 +602,7 @@ int tipc_bcast_init(struct net *net) > goto enomem; > bb->link = l; > tn->bcl = l; > - bb->rc_ratio = 25; > + bb->rc_ratio = 10; > bb->rcast_support = true; > return 0; > enomem: > @@ -576,3 +649,26 @@ void tipc_nlist_purge(struct tipc_nlist *nl) > nl->remote = 0; > nl->local = false; > } > + > +u32 tipc_bcast_get_broadcast_mode(struct net *net) { > + struct tipc_bc_base *bb = tipc_bc_base(net); > + > + if (bb->force_bcast) > + return BCLINK_MODE_BCAST; > + > + if (bb->force_rcast) > + return BCLINK_MODE_RCAST; > + > + if (bb->bcast_support && bb->rcast_support) > + return BCLINK_MODE_SEL; > + > + return 0; > +} > + > +u32 tipc_bcast_get_broadcast_ratio(struct net *net) { > + struct tipc_bc_base *bb = tipc_bc_base(net); > + > + return bb->rc_ratio; > +} > diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index > 751530ab0c49..37c55e7347a5 100644 > --- a/net/tipc/bcast.h > +++ b/net/tipc/bcast.h > @@ -48,6 +48,10 @@ extern const char tipc_bclink_name[]; > > #define TIPC_METHOD_EXPIRE msecs_to_jiffies(5000) > > +#define BCLINK_MODE_BCAST 0x1 > +#define BCLINK_MODE_RCAST 0x2 > +#define BCLINK_MODE_SEL 0x4 > + > struct tipc_nlist { > struct list_head list; > u32 self; > @@ -92,6 +96,9 @@ int tipc_nl_add_bc_link(struct net *net, struct > tipc_nl_msg *msg); int tipc_nl_bc_link_set(struct net *net, struct nlattr > *attrs[]); int tipc_bclink_reset_stats(struct net *net); > > +u32 tipc_bcast_get_broadcast_mode(struct net *net); > +u32 tipc_bcast_get_broadcast_ratio(struct net *net); > + > static inline void tipc_bcast_lock(struct net *net) { > spin_lock_bh(&tipc_net(net)->bclock); > diff --git a/net/tipc/link.c b/net/tipc/link.c index 341ecd796aa4..52d23b3ffaf5 > 100644 > --- a/net/tipc/link.c > +++ b/net/tipc/link.c > @@ -2197,6 +2197,8 @@ int tipc_nl_add_bc_link(struct net *net, struct > tipc_nl_msg *msg) > struct nlattr *attrs; > struct nlattr *prop; > struct tipc_net *tn = net_generic(net, tipc_net_id); > + u32 bc_mode = tipc_bcast_get_broadcast_mode(net); > + u32 bc_ratio = tipc_bcast_get_broadcast_ratio(net); > struct tipc_link *bcl = tn->bcl; > > if (!bcl) > @@ -2233,6 +2235,12 @@ int tipc_nl_add_bc_link(struct net *net, struct > tipc_nl_msg *msg) > goto attr_msg_full; > if (nla_put_u32(msg->skb, TIPC_NLA_PROP_WIN, bcl->window)) > goto prop_msg_full; > + if (nla_put_u32(msg->skb, TIPC_NLA_PROP_BROADCAST, > bc_mode)) > + goto prop_msg_full; > + if (bc_mode & BCLINK_MODE_SEL) > + if (nla_put_u32(msg->skb, > TIPC_NLA_PROP_BROADCAST_RATIO, > + bc_ratio)) > + goto prop_msg_full; > nla_nest_end(msg->skb, prop); > > err = __tipc_nl_add_bc_link_stat(msg->skb, &bcl->stats); diff --git > a/net/tipc/netlink.c b/net/tipc/netlink.c index 99ee419210ba..5240f64e8ccc > 100644 > --- a/net/tipc/netlink.c > +++ b/net/tipc/netlink.c > @@ -110,7 +110,9 @@ const struct nla_policy > tipc_nl_prop_policy[TIPC_NLA_PROP_MAX + 1] = { > 
[TIPC_NLA_PROP_UNSPEC] = { .type = NLA_UNSPEC }, > [TIPC_NLA_PROP_PRIO] = { .type = NLA_U32 }, > [TIPC_NLA_PROP_TOL] = { .type = NLA_U32 }, > - [TIPC_NLA_PROP_WIN] = { .type = NLA_U32 } > + [TIPC_NLA_PROP_WIN] = { .type = NLA_U32 }, > + [TIPC_NLA_PROP_BROADCAST] = { .type = NLA_U32 }, > + [TIPC_NLA_PROP_BROADCAST_RATIO] = { .type = NLA_U32 } > }; > > const struct nla_policy tipc_nl_bearer_policy[TIPC_NLA_BEARER_MAX + 1] > = { > -- > 2.17.1 |
From: Jon M. <jon...@er...> - 2019-02-18 23:14:03
Same here. Change SELECTABLE to AUTOSELECT. Acked-by: Jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Feb-19 02:55 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [iproute2-next v2 2/2] tipc: add link broadcast get > > The command prints the actually method that multicast is running in the > system. > Also 'ratio' value for SELECTABLE method. > > A sample usage is shown below: > $tipc link get broadcast > BROADCAST > > $tipc link get broadcast > SELECTABLE ratio:30% > > $tipc link get broadcast -j -p > [ { > "method": "SELECTABLE" > },{ > "ratio": 30 > } ] > > Signed-off-by: Hoang Le <hoa...@de...> > --- > tipc/link.c | 83 > ++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 82 insertions(+), 1 deletion(-) > > diff --git a/tipc/link.c b/tipc/link.c > index c1db400f1b26..97c75122038e 100644 > --- a/tipc/link.c > +++ b/tipc/link.c > @@ -175,10 +175,90 @@ static void cmd_link_get_help(struct cmdl *cmdl) > "PROPERTIES\n" > " tolerance - Get link tolerance\n" > " priority - Get link priority\n" > - " window - Get link window\n", > + " window - Get link window\n" > + " broadcast - Get link broadcast\n", > cmdl->argv[0]); > } > > +static int cmd_link_get_bcast_cb(const struct nlmsghdr *nlh, void > +*data) { > + int *prop = data; > + struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh); > + struct nlattr *info[TIPC_NLA_MAX + 1] = {}; > + struct nlattr *attrs[TIPC_NLA_LINK_MAX + 1] = {}; > + struct nlattr *props[TIPC_NLA_PROP_MAX + 1] = {}; > + int bc_mode; > + > + mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info); > + if (!info[TIPC_NLA_LINK]) > + return MNL_CB_ERROR; > + > + mnl_attr_parse_nested(info[TIPC_NLA_LINK], parse_attrs, attrs); > + if (!attrs[TIPC_NLA_LINK_PROP]) > + return MNL_CB_ERROR; > + > + mnl_attr_parse_nested(attrs[TIPC_NLA_LINK_PROP], parse_attrs, > props); > + if (!props[*prop]) > + return MNL_CB_ERROR; > + > + bc_mode = mnl_attr_get_u32(props[*prop]); > + > + new_json_obj(json); > + open_json_object(NULL); > + switch (bc_mode) { > + case 0x1: > + print_string(PRINT_ANY, "method", "%s\n", > "BROADCAST"); > + break; > + case 0x2: > + print_string(PRINT_ANY, "method", "%s\n", > "REPLICAST"); > + break; > + case 0x4: > + print_string(PRINT_ANY, "method", "%s", > "SELECTABLE"); > + close_json_object(); > + open_json_object(NULL); > + print_uint(PRINT_ANY, "ratio", " ratio:%u%\n", > mnl_attr_get_u32(props[TIPC_NLA_PROP_BROADCAST_RATIO])); > + break; > + default: > + print_string(PRINT_ANY, NULL, "UNKNOWN\n", > NULL); > + break; > + } > + close_json_object(); > + delete_json_obj(); > + return MNL_CB_OK; > +} > + > +static void cmd_link_get_bcast_help(struct cmdl *cmdl) { > + fprintf(stderr, "Usage: %s link get PPROPERTY\n\n" > + "PROPERTIES\n" > + " broadcast - Get link broadcast\n", > + cmdl->argv[0]); > +} > + > +static int cmd_link_get_bcast(struct nlmsghdr *nlh, const struct cmd *cmd, > + struct cmdl *cmdl, void *data) > +{ > + int prop = TIPC_NLA_PROP_BROADCAST; > + char buf[MNL_SOCKET_BUFFER_SIZE]; > + struct nlattr *attrs; > + > + if (help_flag) { > + (cmd->help)(cmdl); > + return -EINVAL; > + } > + > + nlh = msg_init(buf, TIPC_NL_LINK_GET); > + if (!nlh) { > + fprintf(stderr, "error, message initialisation failed\n"); > + return -1; > + } > + attrs = mnl_attr_nest_start(nlh, TIPC_NLA_LINK); > + /* Direct to broadcast-link setting */ > + mnl_attr_put_strz(nlh, TIPC_NLA_LINK_NAME, tipc_bclink_name); > + mnl_attr_nest_end(nlh, attrs); > + return msg_doit(nlh, cmd_link_get_bcast_cb, &prop); 
} > + > static int cmd_link_get(struct nlmsghdr *nlh, const struct cmd *cmd, > struct cmdl *cmdl, void *data) > { > @@ -186,6 +266,7 @@ static int cmd_link_get(struct nlmsghdr *nlh, const > struct cmd *cmd, > { PRIORITY_STR, cmd_link_get_prop, > cmd_link_get_help }, > { TOLERANCE_STR, cmd_link_get_prop, > cmd_link_get_help }, > { WINDOW_STR, cmd_link_get_prop, > cmd_link_get_help }, > + { BROADCAST_STR, cmd_link_get_bcast, > cmd_link_get_bcast_help }, > { NULL } > }; > > -- > 2.17.1 |
From: Jon M. <jon...@er...> - 2019-02-18 23:12:27
Acked if you make the small changes I am suggesting below. ///jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Feb-19 02:54 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [iproute2-next v2 1/2] tipc: add link broadcast set method and ratio > > This command is to support broacast/replicast configurable for broadcast- > link. [jon] The command added here makes it possible to forcibly configure the broadcast link to use either broadcast or replicast, in addition to the already existing auto selection algorithm. > > A sample usage is shown below: > $tipc link set broadcast BROADCAST > $tipc link set broadcast SELECTABLE ratio 25 > > $tipc link set broadcast -h > Usage: tipc link set broadcast PROPERTY > > PROPERTIES > BROADCAST - Forces all multicast traffic to be > transmitted via broadcast only, > irrespective of cluster size and number > of destinations > > REPLICAST - Forces all multicast traffic to be > transmitted via replicast only, > irrespective of cluster size and number > of destinations > > SELECTABLE - Auto switching to broadcast or replicast > depending on cluster size and destination [jon] node > number [jon] I would like to rename SELECTABLE to AUTOSELECT. The same number of letter, but easier to understand. > > ratio SIZE - Set the selection ratio for SELECTABLE PROPERTY [jon] - Set the AUTOSELECT criteria, percentage of destination nodes vs cluster size > > Signed-off-by: Hoang Le <hoa...@de...> > --- > include/uapi/linux/tipc_netlink.h | 2 + > tipc/link.c | 95 ++++++++++++++++++++++++++++++- > 2 files changed, 96 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/tipc_netlink.h > b/include/uapi/linux/tipc_netlink.h > index 0ebe02ef1a86..efb958fd167d 100644 > --- a/include/uapi/linux/tipc_netlink.h > +++ b/include/uapi/linux/tipc_netlink.h > @@ -281,6 +281,8 @@ enum { > TIPC_NLA_PROP_TOL, /* u32 */ > TIPC_NLA_PROP_WIN, /* u32 */ > TIPC_NLA_PROP_MTU, /* u32 */ > + TIPC_NLA_PROP_BROADCAST, /* u32 */ > + TIPC_NLA_PROP_BROADCAST_RATIO, /* u32 */ > > __TIPC_NLA_PROP_MAX, > TIPC_NLA_PROP_MAX = __TIPC_NLA_PROP_MAX - 1 diff --git > a/tipc/link.c b/tipc/link.c index 43e26da3fa6b..c1db400f1b26 100644 > --- a/tipc/link.c > +++ b/tipc/link.c > @@ -28,6 +28,9 @@ > #define PRIORITY_STR "priority" > #define TOLERANCE_STR "tolerance" > #define WINDOW_STR "window" > +#define BROADCAST_STR "broadcast" > + > +static const char tipc_bclink_name[] = "broadcast-link"; > > static int link_list_cb(const struct nlmsghdr *nlh, void *data) { @@ -521,7 > +524,8 @@ static void cmd_link_set_help(struct cmdl *cmdl) > "PROPERTIES\n" > " tolerance TOLERANCE - Set link tolerance\n" > " priority PRIORITY - Set link priority\n" > - " window WINDOW - Set link window\n", > + " window WINDOW - Set link window\n" > + " broadcast BROADCAST - Set link broadcast\n", > cmdl->argv[0]); > } > > @@ -585,6 +589,94 @@ static int cmd_link_set_prop(struct nlmsghdr *nlh, > const struct cmd *cmd, > return msg_doit(nlh, link_get_cb, &prop); } > > +static void cmd_link_set_bcast_help(struct cmdl *cmdl) { > + fprintf(stderr, "Usage: %s link set broadcast PROPERTY\n\n" > + "PROPERTIES\n" > + " BROADCAST - Forces all multicast traffic to be\n" > + " transmitted via broadcast only,\n" > + " irrespective of cluster size and number\n" > + " of destinations\n\n" > + " REPLICAST - Forces all multicast traffic to be\n" > + " transmitted via replicast only,\n" > + " irrespective of cluster size and number\n" > + " of destinations\n\n" > + " SELECTABLE - Auto 
switching to broadcast or replicast\n" > + " depending on cluster size and destination\n" > + " number\n\n" > + " ratio SIZE - Set the selection ratio for SELECTABLE\n\n", > + cmdl->argv[0]); > +} > + > +static int cmd_link_set_bcast(struct nlmsghdr *nlh, const struct cmd *cmd, > + struct cmdl *cmdl, void *data) > +{ > + char buf[MNL_SOCKET_BUFFER_SIZE]; > + struct nlattr *props; > + struct nlattr *attrs; > + struct opt *opt; > + struct opt opts[] = { > + { "BROADCAST", OPT_KEY, NULL }, > + { "REPLICAST", OPT_KEY, NULL }, > + { "SELECTABLE", OPT_KEY, NULL }, > + { "ratio", OPT_KEYVAL, NULL }, > + { NULL } > + }; > + int method = 0; > + > + if (help_flag) { > + (cmd->help)(cmdl); > + return -EINVAL; > + } > + > + if (parse_opts(opts, cmdl) < 0) > + return -EINVAL; > + > + for (opt = opts; opt->key; opt++) > + if (opt->val) > + break; > + > + if (!opt || !opt->key) { > + (cmd->help)(cmdl); > + return -EINVAL; > + } > + > + nlh = msg_init(buf, TIPC_NL_LINK_SET); > + if (!nlh) { > + fprintf(stderr, "error, message initialisation failed\n"); > + return -1; > + } > + > + attrs = mnl_attr_nest_start(nlh, TIPC_NLA_LINK); > + /* Direct to broadcast-link setting */ > + mnl_attr_put_strz(nlh, TIPC_NLA_LINK_NAME, tipc_bclink_name); > + props = mnl_attr_nest_start(nlh, TIPC_NLA_LINK_PROP); > + > + if (get_opt(opts, "BROADCAST")) > + method = 0x1; > + else if (get_opt(opts, "REPLICAST")) > + method = 0x2; > + else if (get_opt(opts, "SELECTABLE")) > + method = 0x4; > + > + opt = get_opt(opts, "ratio"); > + if (!method && !opt) { > + (cmd->help)(cmdl); > + return -EINVAL; > + } > + > + if (method) > + mnl_attr_put_u32(nlh, TIPC_NLA_PROP_BROADCAST, > method); > + > + if (opt) > + mnl_attr_put_u32(nlh, > TIPC_NLA_PROP_BROADCAST_RATIO, > + atoi(opt->val)); > + > + mnl_attr_nest_end(nlh, props); > + mnl_attr_nest_end(nlh, attrs); > + return msg_doit(nlh, NULL, NULL); > +} > + > static int cmd_link_set(struct nlmsghdr *nlh, const struct cmd *cmd, > struct cmdl *cmdl, void *data) > { > @@ -592,6 +684,7 @@ static int cmd_link_set(struct nlmsghdr *nlh, const > struct cmd *cmd, > { PRIORITY_STR, cmd_link_set_prop, > cmd_link_set_help }, > { TOLERANCE_STR, cmd_link_set_prop, > cmd_link_set_help }, > { WINDOW_STR, cmd_link_set_prop, > cmd_link_set_help }, > + { BROADCAST_STR, cmd_link_set_bcast, > cmd_link_set_bcast_help }, > { NULL } > }; > > -- > 2.17.1 |
From: Jon M. <jon...@er...> - 2019-02-18 20:05:37
Acked-by: Jon Maloy <jon...@er...> ///jon > -----Original Message----- > From: Tung Nguyen <tun...@de...> > Sent: 14-Feb-19 01:42 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [tipc-discussion][PATCH v1 1/1] tipc: fix race condition causing hung > sendto > > When sending multicast messages via blocking socket, if sending link is > congested (tsk->cong_link_cnt is set to 1), the sending thread is put into > sleeping state. However, > tipc_sk_filter_rcv() is called under socket spin lock but > tipc_wait_for_cond() is not. So, there is no guarantee that the setting of tsk- > >cong_link_cnt to 0 in tipc_sk_proto_rcv() in > CPU-1 will be perceived by CPU-0. If that is the case, the sending thread in > CPU-0 after being waken up, will continue to see > tsk->cong_link_cnt as 1 and put the sending thread into sleeping > state again. The sending thread will sleep forever. > > CPU-0 | CPU-1 > tipc_wait_for_cond() | > { | > // condition_ = !tsk->cong_link_cnt | > while ((rc_ = !(condition_))) { | > ... | > release_sock(sk_); | > wait_woken(); | > | if (!sock_owned_by_user(sk)) > | tipc_sk_filter_rcv() > | { > | ... > | tipc_sk_proto_rcv() > | { > | ... > | tsk->cong_link_cnt--; > | ... > | sk->sk_write_space(sk); > | ... > | } > | ... > | } > sched_annotate_sleep(); | > lock_sock(sk_); | > remove_wait_queue(); | > } | > } | > > This commit fixes it by adding memory barrier to tipc_sk_proto_rcv() and > tipc_wait_for_cond(). > > Signed-off-by: Tung Nguyen <tun...@de...> > --- > net/tipc/socket.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c index > 1217c90a363b..d8f054d45941 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -383,6 +383,8 @@ static int tipc_sk_sock_err(struct socket *sock, long > *timeout) > int rc_; \ > \ > while ((rc_ = !(condition_))) { \ > + /* coupled with smp_wmb() in tipc_sk_proto_rcv() */ \ > + smp_rmb(); \ > DEFINE_WAIT_FUNC(wait_, woken_wake_function); > \ > sk_ = (sock_)->sk; \ > rc_ = tipc_sk_sock_err((sock_), timeo_); \ > @@ -1982,6 +1984,8 @@ static void tipc_sk_proto_rcv(struct sock *sk, > return; > case SOCK_WAKEUP: > tipc_dest_del(&tsk->cong_links, msg_orignode(hdr), 0); > + /* coupled with smp_rmb() in tipc_wait_for_cond() */ > + smp_wmb(); > tsk->cong_link_cnt--; > wakeup = true; > break; > -- > 2.17.1 |
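For quick reference, the two hunks of the patch above in condensed form (excerpt of the quoted change only, not a complete fix on its own):

	/* CPU-1, tipc_sk_proto_rcv(), SOCK_WAKEUP case: pair a write barrier with
	 * the update of the congestion counter that the sleeping sender re-checks.
	 */
	case SOCK_WAKEUP:
		tipc_dest_del(&tsk->cong_links, msg_orignode(hdr), 0);
		/* coupled with smp_rmb() in tipc_wait_for_cond() */
		smp_wmb();
		tsk->cong_link_cnt--;
		wakeup = true;
		break;

	/* CPU-0, tipc_wait_for_cond() retry loop: matching read barrier so the
	 * sender re-reads an up-to-date cong_link_cnt instead of going back to
	 * sleep on a stale value.
	 */
	while ((rc_ = !(condition_))) {
		/* coupled with smp_wmb() in tipc_sk_proto_rcv() */
		smp_rmb();
		...
	}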
From: Jon M. <jon...@er...> - 2019-02-18 20:04:10
Looks good. Both acked by me.

///jon

> -----Original Message-----
> From: Tung Nguyen <tun...@de...>
> Sent: 13-Feb-19 03:18
> To: tip...@li...; Jon Maloy <jon...@er...>; ma...@do...; yin...@wi...
> Subject: [tipc-discussion][PATCH v1 0/2] improvement for wait and wakeup
>  function at socket layer
>
> Some improvements for tipc_wait_for_xyz().
>
> Tung Nguyen (2):
>   tipc: improve function tipc_wait_for_cond()
>   tipc: improve function tipc_wait_for_rcvmsg()
>
>  net/tipc/socket.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> --
> 2.16.2
From: Hoang Le <hoa...@de...> - 2019-02-18 10:08:29
Currently, a multicast stream uses method broadcast or replicast to transmit base on cluster size and number of destinations. However, when L2 interface (e.g VXLAN) pseudo support broadcast, we should make broadcast link possible for the user to switch to replicast for all multicast traffic; besides, we would like to test the broadcast implementation we also want to switch to broadcast for all multicast traffic. The current algorithm implementation is still available to switch between broadcast and replicast depending on cluster size and destination number (default), but ratio (selection threshold) is also configurable. And the default for 'ratio' reduce to 10 (meaning 10%). Signed-off-by: Hoang Le <hoa...@de...> --- include/uapi/linux/tipc_netlink.h | 2 + net/tipc/bcast.c | 104 ++++++++++++++++++++++++++++-- net/tipc/bcast.h | 7 ++ net/tipc/link.c | 8 +++ net/tipc/netlink.c | 4 +- 5 files changed, 120 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h index 0ebe02ef1a86..efb958fd167d 100644 --- a/include/uapi/linux/tipc_netlink.h +++ b/include/uapi/linux/tipc_netlink.h @@ -281,6 +281,8 @@ enum { TIPC_NLA_PROP_TOL, /* u32 */ TIPC_NLA_PROP_WIN, /* u32 */ TIPC_NLA_PROP_MTU, /* u32 */ + TIPC_NLA_PROP_BROADCAST, /* u32 */ + TIPC_NLA_PROP_BROADCAST_RATIO, /* u32 */ __TIPC_NLA_PROP_MAX, TIPC_NLA_PROP_MAX = __TIPC_NLA_PROP_MAX - 1 diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index d8026543bf4c..12b59268bdd6 100644 --- a/net/tipc/bcast.c +++ b/net/tipc/bcast.c @@ -54,7 +54,9 @@ const char tipc_bclink_name[] = "broadcast-link"; * @dests: array keeping number of reachable destinations per bearer * @primary_bearer: a bearer having links to all broadcast destinations, if any * @bcast_support: indicates if primary bearer, if any, supports broadcast + * @force_bcast: forces broadcast for multicast traffic * @rcast_support: indicates if all peer nodes support replicast + * @force_rcast: forces replicast for multicast traffic * @rc_ratio: dest count as percentage of cluster size where send method changes * @bc_threshold: calculated from rc_ratio; if dests > threshold use broadcast */ @@ -64,7 +66,9 @@ struct tipc_bc_base { int dests[MAX_BEARERS]; int primary_bearer; bool bcast_support; + bool force_bcast; bool rcast_support; + bool force_rcast; int rc_ratio; int bc_threshold; }; @@ -485,10 +489,63 @@ static int tipc_bc_link_set_queue_limits(struct net *net, u32 limit) return 0; } +static int tipc_bc_link_set_broadcast_mode(struct net *net, u32 bc_mode) +{ + struct tipc_bc_base *bb = tipc_bc_base(net); + + switch (bc_mode) { + case BCLINK_MODE_BCAST: + if (!bb->bcast_support) + return -ENOPROTOOPT; + + bb->force_bcast = true; + bb->force_rcast = false; + break; + case BCLINK_MODE_RCAST: + if (!bb->rcast_support) + return -ENOPROTOOPT; + + bb->force_bcast = false; + bb->force_rcast = true; + break; + case BCLINK_MODE_SEL: + if (!bb->bcast_support || !bb->rcast_support) + return -ENOPROTOOPT; + + bb->force_bcast = false; + bb->force_rcast = false; + break; + default: + return -EINVAL; + } + + return 0; +} + +static int tipc_bc_link_set_broadcast_ratio(struct net *net, u32 bc_ratio) +{ + struct tipc_bc_base *bb = tipc_bc_base(net); + + if (!bb->bcast_support || !bb->rcast_support) + return -ENOPROTOOPT; + + if (bc_ratio > 100 || bc_ratio <= 0) + return -EINVAL; + + bb->rc_ratio = bc_ratio; + tipc_bcast_lock(net); + tipc_bcbase_calc_bc_threshold(net); + tipc_bcast_unlock(net); + + return 0; +} + int tipc_nl_bc_link_set(struct net *net, struct 
nlattr *attrs[]) { int err; u32 win; + u32 bc_mode; + u32 bc_ratio; struct nlattr *props[TIPC_NLA_PROP_MAX + 1]; if (!attrs[TIPC_NLA_LINK_PROP]) @@ -498,12 +555,28 @@ int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]) if (err) return err; - if (!props[TIPC_NLA_PROP_WIN]) + if (!props[TIPC_NLA_PROP_WIN] && + !props[TIPC_NLA_PROP_BROADCAST] && + !props[TIPC_NLA_PROP_BROADCAST_RATIO]) { return -EOPNOTSUPP; + } + + if (props[TIPC_NLA_PROP_BROADCAST]) { + bc_mode = nla_get_u32(props[TIPC_NLA_PROP_BROADCAST]); + err = tipc_bc_link_set_broadcast_mode(net, bc_mode); + } - win = nla_get_u32(props[TIPC_NLA_PROP_WIN]); + if (!err && props[TIPC_NLA_PROP_BROADCAST_RATIO]) { + bc_ratio = nla_get_u32(props[TIPC_NLA_PROP_BROADCAST_RATIO]); + err = tipc_bc_link_set_broadcast_ratio(net, bc_ratio); + } - return tipc_bc_link_set_queue_limits(net, win); + if (!err && props[TIPC_NLA_PROP_WIN]) { + win = nla_get_u32(props[TIPC_NLA_PROP_WIN]); + err = tipc_bc_link_set_queue_limits(net, win); + } + + return err; } int tipc_bcast_init(struct net *net) @@ -529,7 +602,7 @@ int tipc_bcast_init(struct net *net) goto enomem; bb->link = l; tn->bcl = l; - bb->rc_ratio = 25; + bb->rc_ratio = 10; bb->rcast_support = true; return 0; enomem: @@ -576,3 +649,26 @@ void tipc_nlist_purge(struct tipc_nlist *nl) nl->remote = 0; nl->local = false; } + +u32 tipc_bcast_get_broadcast_mode(struct net *net) +{ + struct tipc_bc_base *bb = tipc_bc_base(net); + + if (bb->force_bcast) + return BCLINK_MODE_BCAST; + + if (bb->force_rcast) + return BCLINK_MODE_RCAST; + + if (bb->bcast_support && bb->rcast_support) + return BCLINK_MODE_SEL; + + return 0; +} + +u32 tipc_bcast_get_broadcast_ratio(struct net *net) +{ + struct tipc_bc_base *bb = tipc_bc_base(net); + + return bb->rc_ratio; +} diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index 751530ab0c49..37c55e7347a5 100644 --- a/net/tipc/bcast.h +++ b/net/tipc/bcast.h @@ -48,6 +48,10 @@ extern const char tipc_bclink_name[]; #define TIPC_METHOD_EXPIRE msecs_to_jiffies(5000) +#define BCLINK_MODE_BCAST 0x1 +#define BCLINK_MODE_RCAST 0x2 +#define BCLINK_MODE_SEL 0x4 + struct tipc_nlist { struct list_head list; u32 self; @@ -92,6 +96,9 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg); int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); int tipc_bclink_reset_stats(struct net *net); +u32 tipc_bcast_get_broadcast_mode(struct net *net); +u32 tipc_bcast_get_broadcast_ratio(struct net *net); + static inline void tipc_bcast_lock(struct net *net) { spin_lock_bh(&tipc_net(net)->bclock); diff --git a/net/tipc/link.c b/net/tipc/link.c index 341ecd796aa4..52d23b3ffaf5 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -2197,6 +2197,8 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg) struct nlattr *attrs; struct nlattr *prop; struct tipc_net *tn = net_generic(net, tipc_net_id); + u32 bc_mode = tipc_bcast_get_broadcast_mode(net); + u32 bc_ratio = tipc_bcast_get_broadcast_ratio(net); struct tipc_link *bcl = tn->bcl; if (!bcl) @@ -2233,6 +2235,12 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg) goto attr_msg_full; if (nla_put_u32(msg->skb, TIPC_NLA_PROP_WIN, bcl->window)) goto prop_msg_full; + if (nla_put_u32(msg->skb, TIPC_NLA_PROP_BROADCAST, bc_mode)) + goto prop_msg_full; + if (bc_mode & BCLINK_MODE_SEL) + if (nla_put_u32(msg->skb, TIPC_NLA_PROP_BROADCAST_RATIO, + bc_ratio)) + goto prop_msg_full; nla_nest_end(msg->skb, prop); err = __tipc_nl_add_bc_link_stat(msg->skb, &bcl->stats); diff --git a/net/tipc/netlink.c 
b/net/tipc/netlink.c index 99ee419210ba..5240f64e8ccc 100644 --- a/net/tipc/netlink.c +++ b/net/tipc/netlink.c @@ -110,7 +110,9 @@ const struct nla_policy tipc_nl_prop_policy[TIPC_NLA_PROP_MAX + 1] = { [TIPC_NLA_PROP_UNSPEC] = { .type = NLA_UNSPEC }, [TIPC_NLA_PROP_PRIO] = { .type = NLA_U32 }, [TIPC_NLA_PROP_TOL] = { .type = NLA_U32 }, - [TIPC_NLA_PROP_WIN] = { .type = NLA_U32 } + [TIPC_NLA_PROP_WIN] = { .type = NLA_U32 }, + [TIPC_NLA_PROP_BROADCAST] = { .type = NLA_U32 }, + [TIPC_NLA_PROP_BROADCAST_RATIO] = { .type = NLA_U32 } }; const struct nla_policy tipc_nl_bearer_policy[TIPC_NLA_BEARER_MAX + 1] = { -- 2.17.1 |
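For quick reference, the central new helper added by this patch, excerpted from the quoted change:

	static int tipc_bc_link_set_broadcast_mode(struct net *net, u32 bc_mode)
	{
		struct tipc_bc_base *bb = tipc_bc_base(net);

		switch (bc_mode) {
		case BCLINK_MODE_BCAST:
			if (!bb->bcast_support)
				return -ENOPROTOOPT;
			bb->force_bcast = true;
			bb->force_rcast = false;
			break;
		case BCLINK_MODE_RCAST:
			if (!bb->rcast_support)
				return -ENOPROTOOPT;
			bb->force_bcast = false;
			bb->force_rcast = true;
			break;
		case BCLINK_MODE_SEL:
			if (!bb->bcast_support || !bb->rcast_support)
				return -ENOPROTOOPT;
			bb->force_bcast = false;
			bb->force_rcast = false;
			break;
		default:
			return -EINVAL;
		}

		return 0;
	}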
From: Hoang Le <hoa...@de...> - 2019-02-18 10:08:27
Currently, a multicast stream may start out using replicast, because there are few destinations, and then it should ideally switch to L2/broadcast IGMP/multicast when the number of destinations grows beyond a certain limit. The opposite should happen when the number decreases below the limit. To eliminate the risk of message reordering caused by method change, a sending socket must stick to a previously selected method until it enters an idle period of 5 seconds. Means there is a 5 seconds pause in the traffic from the sender socket. With this commit, we allow such a switch between replicast and broadcast without 5 seconds pause in the traffic. Solution is to send a dummy message with only the header, also with the SYN bit set, via broadcast or replicast. For the data message, the SYN bit is set and sending via replicast or broadcast (inverse method with dummy). Then, at receiving side any messages follow first SYN bit message (data or dummy message), they will be hold in deferred queue until another pair (dummy or data message) arrived in other link. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/bcast.c | 165 +++++++++++++++++++++++++++++++++++++++++++++- net/tipc/bcast.h | 5 ++ net/tipc/msg.h | 10 +++ net/tipc/socket.c | 5 ++ 4 files changed, 184 insertions(+), 1 deletion(-) diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index 12b59268bdd6..d806e3714280 100644 --- a/net/tipc/bcast.c +++ b/net/tipc/bcast.c @@ -220,9 +220,24 @@ static void tipc_bcast_select_xmit_method(struct net *net, int dests, } /* Can current method be changed ? */ method->expires = jiffies + TIPC_METHOD_EXPIRE; - if (method->mandatory || time_before(jiffies, exp)) + if (method->mandatory) return; + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) && + time_before(jiffies, exp)) + return; + + /* Configuration as force 'broadcast' method */ + if (bb->force_bcast) { + method->rcast = false; + return; + } + /* Configuration as force 'replicast' method */ + if (bb->force_rcast) { + method->rcast = true; + return; + } + /* Configuration as 'selectable' or default method */ /* Determine method to use now */ method->rcast = dests <= bb->bc_threshold; } @@ -285,6 +300,63 @@ static int tipc_rcast_xmit(struct net *net, struct sk_buff_head *pkts, return 0; } +/* tipc_mcast_send_sync - deliver a dummy message with SYN bit + * @net: the applicable net namespace + * @skb: socket buffer to copy + * @method: send method to be used + * @dests: destination nodes for message. + * @cong_link_cnt: returns number of encountered congested destination links + * Returns 0 if success, otherwise errno + */ +static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb, + struct tipc_mc_method *method, + struct tipc_nlist *dests, + u16 *cong_link_cnt) +{ + struct sk_buff_head tmpq; + struct sk_buff *_skb; + struct tipc_msg *hdr, *_hdr; + + /* Is a cluster supporting with new capabilities ? 
*/ + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) + return 0; + + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + if (msg_type(hdr) != TIPC_MCAST_MSG) + return 0; + + /* Allocate dummy message */ + _skb = tipc_buf_acquire(MCAST_H_SIZE, GFP_KERNEL); + if (!skb) + return -ENOMEM; + + /* Preparing for 'synching' header */ + msg_set_syn(hdr, 1); + + /* Copy skb's header into a dummy header */ + skb_copy_to_linear_data(_skb, hdr, MCAST_H_SIZE); + skb_orphan(_skb); + + /* Reverse method for dummy message */ + _hdr = buf_msg(_skb); + msg_set_size(_hdr, MCAST_H_SIZE); + msg_set_is_rcast(_hdr, !msg_is_rcast(hdr)); + + skb_queue_head_init(&tmpq); + __skb_queue_tail(&tmpq, _skb); + if (method->rcast) + tipc_bcast_xmit(net, &tmpq, cong_link_cnt); + else + tipc_rcast_xmit(net, &tmpq, dests, cong_link_cnt); + + /* This queue should normally be empty by now */ + __skb_queue_purge(&tmpq); + + return 0; +} + /* tipc_mcast_xmit - deliver message to indicated destination nodes * and to identified node local sockets * @net: the applicable net namespace @@ -300,6 +372,9 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, u16 *cong_link_cnt) { struct sk_buff_head inputq, localq; + struct sk_buff *skb; + struct tipc_msg *hdr; + bool rcast = method->rcast; int rc = 0; skb_queue_head_init(&inputq); @@ -313,6 +388,18 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, /* Send according to determined transmit method */ if (dests->remote) { tipc_bcast_select_xmit_method(net, dests->remote, method); + + skb = skb_peek(pkts); + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + msg_set_is_rcast(hdr, method->rcast); + + /* Switch method ? */ + if (rcast != method->rcast) + tipc_mcast_send_sync(net, skb, method, + dests, cong_link_cnt); + if (method->rcast) rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); else @@ -672,3 +759,79 @@ u32 tipc_bcast_get_broadcast_ratio(struct net *net) return bb->rc_ratio; } + +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq) +{ + struct sk_buff *skb, *_skb, *tmp; + struct tipc_msg *hdr, *_hdr; + bool match = false; + u32 node, port; + + skb = skb_peek(inputq); + hdr = buf_msg(skb); + + if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) + return; + + node = msg_orignode(hdr); + port = msg_origport(hdr); + + /* Has the twin SYN message already arrived ? 
*/ + skb_queue_walk(defq, _skb) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + match = true; + break; + } + + if (!match) { + if (!msg_is_syn(hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Deliver non-SYN message from other link, otherwise queue it */ + if (!msg_is_syn(hdr)) { + if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Queue non-SYN/SYN message from same link */ + if (msg_is_rcast(hdr) == msg_is_rcast(_hdr)) { + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Matching SYN messages => return the one with data, if any */ + __skb_unlink(_skb, defq); + if (msg_data_sz(hdr)) { + kfree_skb(_skb); + } else { + __skb_dequeue(inputq); + kfree_skb(skb); + __skb_queue_tail(inputq, _skb); + } + + /* Deliver subsequent non-SYN messages from same peer */ + skb_queue_walk_safe(defq, _skb, tmp) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + if (msg_is_syn(_hdr)) + break; + __skb_unlink(_skb, defq); + __skb_queue_tail(inputq, _skb); + } +} diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index 37c55e7347a5..484bde289d3a 100644 --- a/net/tipc/bcast.h +++ b/net/tipc/bcast.h @@ -67,11 +67,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 node); /* Cookie to be used between socket and broadcast layer * @rcast: replicast (instead of broadcast) was used at previous xmit * @mandatory: broadcast/replicast indication was set by user + * @deferredq: defer queue to make message in order * @expires: re-evaluate non-mandatory transmit method if we are past this */ struct tipc_mc_method { bool rcast; bool mandatory; + struct sk_buff_head deferredq; unsigned long expires; }; @@ -99,6 +101,9 @@ int tipc_bclink_reset_stats(struct net *net); u32 tipc_bcast_get_broadcast_mode(struct net *net); u32 tipc_bcast_get_broadcast_ratio(struct net *net); +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq); + static inline void tipc_bcast_lock(struct net *net) { spin_lock_bh(&tipc_net(net)->bclock); diff --git a/net/tipc/msg.h b/net/tipc/msg.h index d7e4b8b93f9d..528ba9241acc 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -257,6 +257,16 @@ static inline void msg_set_src_droppable(struct tipc_msg *m, u32 d) msg_set_bits(m, 0, 18, 1, d); } +static inline bool msg_is_rcast(struct tipc_msg *m) +{ + return msg_bits(m, 0, 18, 0x1); +} + +static inline void msg_set_is_rcast(struct tipc_msg *m, bool d) +{ + msg_set_bits(m, 0, 18, 0x1, d); +} + static inline void msg_set_size(struct tipc_msg *m, u32 sz) { m->hdr[0] = htonl((msg_word(m, 0) & ~0x1ffff) | sz); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 8fc5acd4820d..de83eb1e718e 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -483,6 +483,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, tsk_set_unreturnable(tsk, true); if (sock->type == SOCK_DGRAM) tsk_set_unreliable(tsk, true); + __skb_queue_head_init(&tsk->mc_method.deferredq); } trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " "); @@ -580,6 +581,7 @@ static int tipc_release(struct socket *sock) sk->sk_shutdown = SHUTDOWN_MASK; tipc_sk_leave(tsk); tipc_sk_withdraw(tsk, 0, NULL); + __skb_queue_purge(&tsk->mc_method.deferredq); sk_stop_timer(sk, &sk->sk_timer); tipc_sk_remove(tsk); @@ -2157,6 +2159,9 @@ static void tipc_sk_filter_rcv(struct sock *sk, struct 
sk_buff *skb, if (unlikely(grp)) tipc_group_filter_msg(grp, &inputq, xmitq); + if (msg_type(hdr) == TIPC_MCAST_MSG) + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, &inputq); + /* Validate and add to receive buffer if there is space */ while ((skb = __skb_dequeue(&inputq))) { hdr = buf_msg(skb); -- 2.17.1 |
From: Hoang Le <hoa...@de...> - 2019-02-18 10:08:20
|
As a preparation for introducing a moothly switching between replicast and broadcast method for multicast message We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature because of compatibility reasons. When a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field and new feature only applicable if the cluster supports this new capability. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 2 ++ net/tipc/core.h | 3 +++ net/tipc/node.c | 18 ++++++++++++++++++ net/tipc/node.h | 6 ++++-- 4 files changed, 27 insertions(+), 2 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 5b38f5164281..27cccd101ef6 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -43,6 +43,7 @@ #include "net.h" #include "socket.h" #include "bcast.h" +#include "node.h" #include <linux/module.h> @@ -59,6 +60,7 @@ static int __net_init tipc_init_net(struct net *net) tn->node_addr = 0; tn->trial_addr = 0; tn->addr_trial_end = 0; + tn->capabilities = TIPC_NODE_CAPABILITIES; memset(tn->node_id, 0, sizeof(tn->node_id)); memset(tn->node_id_string, 0, sizeof(tn->node_id_string)); tn->mon_threshold = TIPC_DEF_MON_THRESHOLD; diff --git a/net/tipc/core.h b/net/tipc/core.h index 8020a6c360ff..7a68e1b6a066 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -122,6 +122,9 @@ struct tipc_net { /* Topology subscription server */ struct tipc_topsrv *topsrv; atomic_t subscription_count; + + /* Cluster capabilities */ + u16 capabilities; }; static inline struct tipc_net *tipc_net(struct net *net) diff --git a/net/tipc/node.c b/net/tipc/node.c index 2dc4919ab23c..2717893e9dbe 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -383,6 +383,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_link_update_caps(l, capabilities); } write_unlock_bh(&n->lock); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -433,6 +438,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, break; } list_add_tail_rcu(&n->list, &temp_node->list); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } trace_tipc_node_create(n, true, " "); exit: spin_unlock_bh(&tn->node_list_lock); @@ -589,6 +599,7 @@ static void tipc_node_clear_links(struct tipc_node *node) */ static bool tipc_node_cleanup(struct tipc_node *peer) { + struct tipc_node *temp_node; struct tipc_net *tn = tipc_net(peer->net); bool deleted = false; @@ -604,6 +615,13 @@ static bool tipc_node_cleanup(struct tipc_node *peer) deleted = true; } tipc_node_write_unlock(peer); + + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } + spin_unlock_bh(&tn->node_list_lock); return deleted; } diff --git a/net/tipc/node.h b/net/tipc/node.h index 4f59a30e989a..2404225c5d58 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -51,7 +51,8 @@ enum { TIPC_BLOCK_FLOWCTL = (1 << 3), TIPC_BCAST_RCAST = (1 << 4), TIPC_NODE_ID128 = (1 << 5), - TIPC_LINK_PROTO_SEQNO = (1 << 6) + TIPC_LINK_PROTO_SEQNO = (1 << 6), + TIPC_MCAST_RBCTL = (1 << 7) }; #define TIPC_NODE_CAPABILITIES 
(TIPC_SYN_BIT | \ @@ -60,7 +61,8 @@ enum { TIPC_BCAST_RCAST | \ TIPC_BLOCK_FLOWCTL | \ TIPC_NODE_ID128 | \ - TIPC_LINK_PROTO_SEQNO) + TIPC_LINK_PROTO_SEQNO | \ + TIPC_MCAST_RBCTL) #define INVALID_BEARER_ID -1 void tipc_node_stop(struct net *net); -- 2.17.1 |
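The essence of the patch above is that tn->capabilities is recomputed as the bitwise AND of the capabilities of every node currently known in the cluster, so a new bit such as TIPC_MCAST_RBCTL only takes effect once every peer advertises it. Below is a minimal, userspace-style sketch of that intersection rule, for illustration only; the function names, the array-based peer list and the OWN_CAPABILITIES stand-in are invented here and are not part of the patch.

    #include <stdint.h>

    #define TIPC_MCAST_RBCTL  (1 << 7)   /* new capability bit from the patch */
    #define OWN_CAPABILITIES  0x00ff     /* stand-in for TIPC_NODE_CAPABILITIES */

    /* Recompute the cluster capability mask: start from what this node
     * supports and clear every bit that at least one peer lacks.
     */
    static uint16_t cluster_capabilities(const uint16_t *peer_caps, int npeers)
    {
            uint16_t caps = OWN_CAPABILITIES;
            int i;

            for (i = 0; i < npeers; i++)
                    caps &= peer_caps[i];
            return caps;
    }

    /* A single legacy peer without TIPC_MCAST_RBCTL disables the smooth
     * broadcast/replicast switch for the whole cluster.
     */
    static int mcast_rbctl_usable(const uint16_t *peer_caps, int npeers)
    {
            return !!(cluster_capabilities(peer_caps, npeers) & TIPC_MCAST_RBCTL);
    }

This mirrors why the mask is recalculated at node create and node cleanup time in the patch: the set of peers, and therefore the intersection, changes at exactly those points.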
From: Ying X. <yin...@wi...> - 2019-02-18 09:09:23
|
On 2/13/19 4:18 PM, Tung Nguyen wrote: > This commit replaces schedule_timeout() with wait_woken() > in function tipc_wait_for_rcvmsg(). wait_woken() uses > memory barriers in its implementation to avoid potential > race condition when putting a process into sleeping state > and then waking it up. > > Signed-off-by: Tung Nguyen <tun...@de...> Acked-by: Ying Xue <yin...@wi...> > --- > net/tipc/socket.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > index 81b87916a0eb..684f2125fc6b 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -1677,7 +1677,7 @@ static void tipc_sk_send_ack(struct tipc_sock *tsk) > static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) > { > struct sock *sk = sock->sk; > - DEFINE_WAIT(wait); > + DEFINE_WAIT_FUNC(wait, woken_wake_function); > long timeo = *timeop; > int err = sock_error(sk); > > @@ -1685,15 +1685,17 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) > return err; > > for (;;) { > - prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); > if (timeo && skb_queue_empty(&sk->sk_receive_queue)) { > if (sk->sk_shutdown & RCV_SHUTDOWN) { > err = -ENOTCONN; > break; > } > + add_wait_queue(sk_sleep(sk), &wait); > release_sock(sk); > - timeo = schedule_timeout(timeo); > + timeo = wait_woken(&wait, TASK_INTERRUPTIBLE, timeo); > + sched_annotate_sleep(); > lock_sock(sk); > + remove_wait_queue(sk_sleep(sk), &wait); > } > err = 0; > if (!skb_queue_empty(&sk->sk_receive_queue)) > @@ -1709,7 +1711,6 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, long *timeop) > if (err) > break; > } > - finish_wait(sk_sleep(sk), &wait); > *timeop = timeo; > return err; > } > |
From: Ying X. <yin...@wi...> - 2019-02-18 09:07:23
|
On 2/13/19 4:18 PM, Tung Nguyen wrote: > Commit 844cf763fba654436d3a4279b6a672c196cf1901 > (tipc: make macro tipc_wait_for_cond() smp safe) Please use the following words to describe commit info: Commit 844cf763fba6 ("tipc: make macro tipc_wait_for_cond() smp safe") replaced > finish_wait() with remove_wait_queue() but still used > prepare_to_wait(). This causes unnecessary conditional > checking before adding to wait queue in prepare_to_wait(). > > This commit replaces prepare_to_wait() with add_wait_queue() > as the pair function with remove_wait_queue(). > > Signed-off-by: Tung Nguyen <tun...@de...> Acked-by: Ying Xue <yin...@wi...> > --- > net/tipc/socket.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > index 1217c90a363b..81b87916a0eb 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -388,7 +388,7 @@ static int tipc_sk_sock_err(struct socket *sock, long *timeout) > rc_ = tipc_sk_sock_err((sock_), timeo_); \ > if (rc_) \ > break; \ > - prepare_to_wait(sk_sleep(sk_), &wait_, TASK_INTERRUPTIBLE); \ > + add_wait_queue(sk_sleep(sk_), &wait_); \ > release_sock(sk_); \ > *(timeo_) = wait_woken(&wait_, TASK_INTERRUPTIBLE, *(timeo_)); \ > sched_annotate_sleep(); \ > |
From: Hoang Le <hoa...@de...> - 2019-02-18 08:13:46
|
Currently, a multicast stream may start out using replicast, because there are few destinations, and then it should ideally switch to L2/broadcast IGMP/multicast when the number of destinations grows beyond a certain limit. The opposite should happen when the number decreases below the limit. To eliminate the risk of message reordering caused by method change, a sending socket must stick to a previously selected method until it enters an idle period of 5 seconds. Means there is a 5 seconds pause in the traffic from the sender socket. With this commit, we allow such a switch between replicast and broadcast without 5 seconds pause in the traffic. Solution is to send a dummy message with only the header, also with the SYN bit set, via broadcast or replicast. For the data message, the SYN bit is set and sending via replicast or broadcast (inverse method with dummy). Then, at receiving side any messages follow first SYN bit message (data or dummy message), they will be hold in deferred queue until another pair (dummy or data message) arrived in other link. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/bcast.c | 166 +++++++++++++++++++++++++++++++++++++++++++++- net/tipc/bcast.h | 5 ++ net/tipc/msg.h | 10 +++ net/tipc/socket.c | 5 ++ 4 files changed, 185 insertions(+), 1 deletion(-) diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index 401e725ea535..a5c9ac5a57fc 100644 --- a/net/tipc/bcast.c +++ b/net/tipc/bcast.c @@ -218,9 +218,24 @@ static void tipc_bcast_select_xmit_method(struct net *net, int dests, } /* Can current method be changed ? */ method->expires = jiffies + TIPC_METHOD_EXPIRE; - if (method->mandatory || time_before(jiffies, exp)) + if (method->mandatory) return; + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) && + time_before(jiffies, exp)) + return; + + /* Configuration as force 'broadcast' method */ + if (bb->force_bcast) { + method->rcast = false; + return; + } + /* Configuration as force 'replicast' method */ + if (bb->force_rcast) { + method->rcast = true; + return; + } + /* Configuration as 'selectable' or default method */ /* Determine method to use now */ method->rcast = dests <= bb->bc_threshold; } @@ -283,6 +298,64 @@ static int tipc_rcast_xmit(struct net *net, struct sk_buff_head *pkts, return 0; } +/* tipc_mcast_send_sync - deliver a dummy message with SYN bit + * @net: the applicable net namespace + * @skb: socket buffer to copy + * @method: send method to be used + * @dests: destination nodes for message. + * @cong_link_cnt: returns number of encountered congested destination links + * Returns 0 if success, otherwise errno + */ +static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb, + struct tipc_mc_method *method, + struct tipc_nlist *dests, + u16 *cong_link_cnt) +{ + struct sk_buff_head tmpq; + struct sk_buff *_skb; + struct tipc_msg *hdr, *_hdr; + + /* Is a cluster supporting with new capabilities ? 
*/ + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) + return 0; + + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + if (msg_type(hdr) != TIPC_MCAST_MSG) + return 0; + + /* Allocate dummy message */ + _skb = tipc_buf_acquire(MCAST_H_SIZE, GFP_KERNEL); + if (!skb) + return -ENOMEM; + + /* Preparing for 'synching' header */ + msg_set_syn(hdr, 1); + msg_set_is_rcast(hdr, method->rcast); + + /* Copy skb's header into a dummy header */ + skb_copy_to_linear_data(_skb, hdr, MCAST_H_SIZE); + skb_orphan(_skb); + + /* Reverse method for dummy message */ + _hdr = buf_msg(_skb); + msg_set_size(_hdr, MCAST_H_SIZE); + msg_set_is_rcast(_hdr, !method->rcast); + + skb_queue_head_init(&tmpq); + __skb_queue_tail(&tmpq, _skb); + if (method->rcast) + tipc_bcast_xmit(net, &tmpq, cong_link_cnt); + else + tipc_rcast_xmit(net, &tmpq, dests, cong_link_cnt); + + /* This queue should normally be empty by now */ + __skb_queue_purge(&tmpq); + + return 0; +} + /* tipc_mcast_xmit - deliver message to indicated destination nodes * and to identified node local sockets * @net: the applicable net namespace @@ -298,6 +371,9 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, u16 *cong_link_cnt) { struct sk_buff_head inputq, localq; + struct sk_buff *skb; + struct tipc_msg *hdr; + bool rcast = method->rcast; int rc = 0; skb_queue_head_init(&inputq); @@ -311,6 +387,18 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, /* Send according to determined transmit method */ if (dests->remote) { tipc_bcast_select_xmit_method(net, dests->remote, method); + + skb = skb_peek(pkts); + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + msg_set_is_rcast(hdr, method->rcast); + + /* Switch method ? */ + if (rcast != method->rcast) + tipc_mcast_send_sync(net, skb, method, + dests, cong_link_cnt); + if (method->rcast) rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); else @@ -670,3 +758,79 @@ u32 tipc_bcast_get_broadcast_ratio(struct net *net) return bb->rc_ratio; } + +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq) +{ + struct sk_buff *skb, *_skb, *tmp; + struct tipc_msg *hdr, *_hdr; + bool match = false; + u32 node, port; + + skb = skb_peek(inputq); + hdr = buf_msg(skb); + + if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) + return; + + node = msg_orignode(hdr); + port = msg_origport(hdr); + + /* Has the twin SYN message already arrived ? 
*/ + skb_queue_walk(defq, _skb) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + match = true; + break; + } + + if (!match) { + if (!msg_is_syn(hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Deliver non-SYN message from other link, otherwise queue it */ + if (!msg_is_syn(hdr)) { + if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Queue non-SYN/SYN message from same link */ + if (msg_is_rcast(hdr) == msg_is_rcast(_hdr)) { + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Matching SYN messages => return the one with data, if any */ + __skb_unlink(_skb, defq); + if (msg_data_sz(hdr)) { + kfree_skb(_skb); + } else { + __skb_dequeue(inputq); + kfree_skb(skb); + __skb_queue_tail(inputq, _skb); + } + + /* Deliver subsequent non-SYN messages from same peer */ + skb_queue_walk_safe(defq, _skb, tmp) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + if (msg_is_syn(_hdr)) + break; + __skb_unlink(_skb, defq); + __skb_queue_tail(inputq, _skb); + } +} diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index ae3af139d299..1b6772de7dd8 100644 --- a/net/tipc/bcast.h +++ b/net/tipc/bcast.h @@ -67,11 +67,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 node); /* Cookie to be used between socket and broadcast layer * @rcast: replicast (instead of broadcast) was used at previous xmit * @mandatory: broadcast/replicast indication was set by user + * @deferredq: defer queue to make message in order * @expires: re-evaluate non-mandatory transmit method if we are past this */ struct tipc_mc_method { bool rcast; bool mandatory; + struct sk_buff_head deferredq; unsigned long expires; }; @@ -96,6 +98,9 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg); int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); int tipc_bclink_reset_stats(struct net *net); +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq); + static inline void tipc_bcast_lock(struct net *net) { spin_lock_bh(&tipc_net(net)->bclock); diff --git a/net/tipc/msg.h b/net/tipc/msg.h index a0924956bb61..70ddff2206a0 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -257,6 +257,16 @@ static inline void msg_set_src_droppable(struct tipc_msg *m, u32 d) msg_set_bits(m, 0, 18, 1, d); } +static inline bool msg_is_rcast(struct tipc_msg *m) +{ + return msg_bits(m, 0, 18, 0x1); +} + +static inline void msg_set_is_rcast(struct tipc_msg *m, bool d) +{ + msg_set_bits(m, 0, 18, 0x1, d); +} + static inline void msg_set_size(struct tipc_msg *m, u32 sz) { m->hdr[0] = htonl((msg_word(m, 0) & ~0x1ffff) | sz); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 8fc5acd4820d..de83eb1e718e 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -483,6 +483,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, tsk_set_unreturnable(tsk, true); if (sock->type == SOCK_DGRAM) tsk_set_unreliable(tsk, true); + __skb_queue_head_init(&tsk->mc_method.deferredq); } trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " "); @@ -580,6 +581,7 @@ static int tipc_release(struct socket *sock) sk->sk_shutdown = SHUTDOWN_MASK; tipc_sk_leave(tsk); tipc_sk_withdraw(tsk, 0, NULL); + __skb_queue_purge(&tsk->mc_method.deferredq); sk_stop_timer(sk, &sk->sk_timer); tipc_sk_remove(tsk); @@ -2157,6 +2159,9 @@ static void 
tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb, if (unlikely(grp)) tipc_group_filter_msg(grp, &inputq, xmitq); + if (msg_type(hdr) == TIPC_MCAST_MSG) + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, &inputq); + /* Validate and add to receive buffer if there is space */ while ((skb = __skb_dequeue(&inputq))) { hdr = buf_msg(skb); -- 2.17.1 |
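The mechanism in the patch above is transparent to applications: a sender keeps using one socket and the stack performs the replicast/broadcast switch underneath it, without the earlier 5 second idle requirement. For context, a minimal userspace multicast sender could look like the sketch below. This is illustrative only and not part of the patch; the service type 1000 and the instance range 0..99 are arbitrary example values.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
            struct sockaddr_tipc dst;
            const char msg[] = "hello, cluster";
            int sd = socket(AF_TIPC, SOCK_RDM, 0);

            if (sd < 0) {
                    perror("socket");
                    return 1;
            }

            /* Multicast destination: all sockets bound to service type 1000,
             * instances 0..99, anywhere in the cluster.
             */
            memset(&dst, 0, sizeof(dst));
            dst.family = AF_TIPC;
            dst.addrtype = TIPC_ADDR_MCAST;
            dst.addr.nameseq.type = 1000;
            dst.addr.nameseq.lower = 0;
            dst.addr.nameseq.upper = 99;

            if (sendto(sd, msg, sizeof(msg), 0,
                       (struct sockaddr *)&dst, sizeof(dst)) < 0)
                    perror("sendto");
            return 0;
    }

A receiver would bind a SOCK_RDM socket to the same service range and simply recvfrom(); with the patch applied, messages from one sender socket are delivered in order even when the stack switches transmit method mid-stream.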
From: Hoang Le <hoa...@de...> - 2019-02-18 08:13:43
|
As a preparation for introducing a moothly switching between replicast and broadcast method for multicast message We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature because of compatibility reasons. When a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field and new feature only applicable if the cluster supports this new capability. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 2 ++ net/tipc/core.h | 3 +++ net/tipc/node.c | 18 ++++++++++++++++++ net/tipc/node.h | 6 ++++-- 4 files changed, 27 insertions(+), 2 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 5b38f5164281..27cccd101ef6 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -43,6 +43,7 @@ #include "net.h" #include "socket.h" #include "bcast.h" +#include "node.h" #include <linux/module.h> @@ -59,6 +60,7 @@ static int __net_init tipc_init_net(struct net *net) tn->node_addr = 0; tn->trial_addr = 0; tn->addr_trial_end = 0; + tn->capabilities = TIPC_NODE_CAPABILITIES; memset(tn->node_id, 0, sizeof(tn->node_id)); memset(tn->node_id_string, 0, sizeof(tn->node_id_string)); tn->mon_threshold = TIPC_DEF_MON_THRESHOLD; diff --git a/net/tipc/core.h b/net/tipc/core.h index 8020a6c360ff..7a68e1b6a066 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -122,6 +122,9 @@ struct tipc_net { /* Topology subscription server */ struct tipc_topsrv *topsrv; atomic_t subscription_count; + + /* Cluster capabilities */ + u16 capabilities; }; static inline struct tipc_net *tipc_net(struct net *net) diff --git a/net/tipc/node.c b/net/tipc/node.c index db2a6c3e0be9..b2b0b6f8ab07 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -383,6 +383,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_link_update_caps(l, capabilities); } write_unlock_bh(&n->lock); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -433,6 +438,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, break; } list_add_tail_rcu(&n->list, &temp_node->list); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } trace_tipc_node_create(n, true, " "); exit: spin_unlock_bh(&tn->node_list_lock); @@ -589,6 +599,7 @@ static void tipc_node_clear_links(struct tipc_node *node) */ static bool tipc_node_cleanup(struct tipc_node *peer) { + struct tipc_node *temp_node; struct tipc_net *tn = tipc_net(peer->net); bool deleted = false; @@ -604,6 +615,13 @@ static bool tipc_node_cleanup(struct tipc_node *peer) deleted = true; } tipc_node_write_unlock(peer); + + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } + spin_unlock_bh(&tn->node_list_lock); return deleted; } diff --git a/net/tipc/node.h b/net/tipc/node.h index 4f59a30e989a..2404225c5d58 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -51,7 +51,8 @@ enum { TIPC_BLOCK_FLOWCTL = (1 << 3), TIPC_BCAST_RCAST = (1 << 4), TIPC_NODE_ID128 = (1 << 5), - TIPC_LINK_PROTO_SEQNO = (1 << 6) + TIPC_LINK_PROTO_SEQNO = (1 << 6), + TIPC_MCAST_RBCTL = (1 << 7) }; #define TIPC_NODE_CAPABILITIES 
(TIPC_SYN_BIT | \ @@ -60,7 +61,8 @@ enum { TIPC_BCAST_RCAST | \ TIPC_BLOCK_FLOWCTL | \ TIPC_NODE_ID128 | \ - TIPC_LINK_PROTO_SEQNO) + TIPC_LINK_PROTO_SEQNO | \ + TIPC_MCAST_RBCTL) #define INVALID_BEARER_ID -1 void tipc_node_stop(struct net *net); -- 2.17.1 |
From: Hoang Le <hoa...@de...> - 2019-02-18 07:59:17
|
Currently, a multicast stream may start out using replicast, because there are few destinations, and then it should ideally switch to L2/broadcast IGMP/multicast when the number of destinations grows beyond a certain limit. The opposite should happen when the number decreases below the limit. To eliminate the risk of message reordering caused by method change, a sending socket must stick to a previously selected method until it enters an idle period of 5 seconds. Means there is a 5 seconds pause in the traffic from the sender socket. With this commit, we allow such a switch between replicast and broadcast without 5 seconds pause in the traffic. Solution is to send a dummy message with only the header, also with the SYN bit set, via broadcast or replicast. For the data message, the SYN bit is set and sending via replicast or broadcast (inverse method with dummy). Then, at receiving side any messages follow first SYN bit message (data or dummy message), they will be hold in deferred queue until another pair (dummy or data message) arrived in other link. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/bcast.c | 166 +++++++++++++++++++++++++++++++++++++++++++++- net/tipc/bcast.h | 5 ++ net/tipc/msg.h | 10 +++ net/tipc/socket.c | 5 ++ 4 files changed, 185 insertions(+), 1 deletion(-) diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index 401e725ea535..a5c9ac5a57fc 100644 --- a/net/tipc/bcast.c +++ b/net/tipc/bcast.c @@ -218,9 +218,24 @@ static void tipc_bcast_select_xmit_method(struct net *net, int dests, } /* Can current method be changed ? */ method->expires = jiffies + TIPC_METHOD_EXPIRE; - if (method->mandatory || time_before(jiffies, exp)) + if (method->mandatory) return; + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) && + time_before(jiffies, exp)) + return; + + /* Configuration as force 'broadcast' method */ + if (bb->force_bcast) { + method->rcast = false; + return; + } + /* Configuration as force 'replicast' method */ + if (bb->force_rcast) { + method->rcast = true; + return; + } + /* Configuration as 'selectable' or default method */ /* Determine method to use now */ method->rcast = dests <= bb->bc_threshold; } @@ -283,6 +298,64 @@ static int tipc_rcast_xmit(struct net *net, struct sk_buff_head *pkts, return 0; } +/* tipc_mcast_send_sync - deliver a dummy message with SYN bit + * @net: the applicable net namespace + * @skb: socket buffer to copy + * @method: send method to be used + * @dests: destination nodes for message. + * @cong_link_cnt: returns number of encountered congested destination links + * Returns 0 if success, otherwise errno + */ +static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb, + struct tipc_mc_method *method, + struct tipc_nlist *dests, + u16 *cong_link_cnt) +{ + struct sk_buff_head tmpq; + struct sk_buff *_skb; + struct tipc_msg *hdr, *_hdr; + + /* Is a cluster supporting with new capabilities ? 
*/ + if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) + return 0; + + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + if (msg_type(hdr) != TIPC_MCAST_MSG) + return 0; + + /* Allocate dummy message */ + _skb = tipc_buf_acquire(MCAST_H_SIZE, GFP_KERNEL); + if (!skb) + return -ENOMEM; + + /* Preparing for 'synching' header */ + msg_set_syn(hdr, 1); + msg_set_is_rcast(hdr, method->rcast); + + /* Copy skb's header into a dummy header */ + skb_copy_to_linear_data(_skb, hdr, MCAST_H_SIZE); + skb_orphan(_skb); + + /* Reverse method for dummy message */ + _hdr = buf_msg(_skb); + msg_set_size(_hdr, MCAST_H_SIZE); + msg_set_is_rcast(_hdr, !method->rcast); + + skb_queue_head_init(&tmpq); + __skb_queue_tail(&tmpq, _skb); + if (method->rcast) + tipc_bcast_xmit(net, &tmpq, cong_link_cnt); + else + tipc_rcast_xmit(net, &tmpq, dests, cong_link_cnt); + + /* This queue should normally be empty by now */ + __skb_queue_purge(&tmpq); + + return 0; +} + /* tipc_mcast_xmit - deliver message to indicated destination nodes * and to identified node local sockets * @net: the applicable net namespace @@ -298,6 +371,9 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, u16 *cong_link_cnt) { struct sk_buff_head inputq, localq; + struct sk_buff *skb; + struct tipc_msg *hdr; + bool rcast = method->rcast; int rc = 0; skb_queue_head_init(&inputq); @@ -311,6 +387,18 @@ int tipc_mcast_xmit(struct net *net, struct sk_buff_head *pkts, /* Send according to determined transmit method */ if (dests->remote) { tipc_bcast_select_xmit_method(net, dests->remote, method); + + skb = skb_peek(pkts); + hdr = buf_msg(skb); + if (msg_user(hdr) == MSG_FRAGMENTER) + hdr = msg_get_wrapped(hdr); + msg_set_is_rcast(hdr, method->rcast); + + /* Switch method ? */ + if (rcast != method->rcast) + tipc_mcast_send_sync(net, skb, method, + dests, cong_link_cnt); + if (method->rcast) rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); else @@ -670,3 +758,79 @@ u32 tipc_bcast_get_broadcast_ratio(struct net *net) return bb->rc_ratio; } + +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq) +{ + struct sk_buff *skb, *_skb, *tmp; + struct tipc_msg *hdr, *_hdr; + bool match = false; + u32 node, port; + + skb = skb_peek(inputq); + hdr = buf_msg(skb); + + if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) + return; + + node = msg_orignode(hdr); + port = msg_origport(hdr); + + /* Has the twin SYN message already arrived ? 
*/ + skb_queue_walk(defq, _skb) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + match = true; + break; + } + + if (!match) { + if (!msg_is_syn(hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Deliver non-SYN message from other link, otherwise queue it */ + if (!msg_is_syn(hdr)) { + if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) + return; + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Queue non-SYN/SYN message from same link */ + if (msg_is_rcast(hdr) == msg_is_rcast(_hdr)) { + __skb_dequeue(inputq); + __skb_queue_tail(defq, skb); + return; + } + + /* Matching SYN messages => return the one with data, if any */ + __skb_unlink(_skb, defq); + if (msg_data_sz(hdr)) { + kfree_skb(_skb); + } else { + __skb_dequeue(inputq); + kfree_skb(skb); + __skb_queue_tail(inputq, _skb); + } + + /* Deliver subsequent non-SYN messages from same peer */ + skb_queue_walk_safe(defq, _skb, tmp) { + _hdr = buf_msg(_skb); + if (msg_orignode(_hdr) != node) + continue; + if (msg_origport(_hdr) != port) + continue; + if (msg_is_syn(_hdr)) + break; + __skb_unlink(_skb, defq); + __skb_queue_tail(inputq, _skb); + } +} diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index ae3af139d299..1b6772de7dd8 100644 --- a/net/tipc/bcast.h +++ b/net/tipc/bcast.h @@ -67,11 +67,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 node); /* Cookie to be used between socket and broadcast layer * @rcast: replicast (instead of broadcast) was used at previous xmit * @mandatory: broadcast/replicast indication was set by user + * @deferredq: defer queue to make message in order * @expires: re-evaluate non-mandatory transmit method if we are past this */ struct tipc_mc_method { bool rcast; bool mandatory; + struct sk_buff_head deferredq; unsigned long expires; }; @@ -96,6 +98,9 @@ int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg); int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]); int tipc_bclink_reset_stats(struct net *net); +void tipc_mcast_filter_msg(struct sk_buff_head *defq, + struct sk_buff_head *inputq); + static inline void tipc_bcast_lock(struct net *net) { spin_lock_bh(&tipc_net(net)->bclock); diff --git a/net/tipc/msg.h b/net/tipc/msg.h index a0924956bb61..70ddff2206a0 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -257,6 +257,16 @@ static inline void msg_set_src_droppable(struct tipc_msg *m, u32 d) msg_set_bits(m, 0, 18, 1, d); } +static inline bool msg_is_rcast(struct tipc_msg *m) +{ + return msg_bits(m, 0, 18, 0x1); +} + +static inline void msg_set_is_rcast(struct tipc_msg *m, bool d) +{ + msg_set_bits(m, 0, 18, 0x1, d); +} + static inline void msg_set_size(struct tipc_msg *m, u32 sz) { m->hdr[0] = htonl((msg_word(m, 0) & ~0x1ffff) | sz); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 8fc5acd4820d..de83eb1e718e 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -483,6 +483,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, tsk_set_unreturnable(tsk, true); if (sock->type == SOCK_DGRAM) tsk_set_unreliable(tsk, true); + __skb_queue_head_init(&tsk->mc_method.deferredq); } trace_tipc_sk_create(sk, NULL, TIPC_DUMP_NONE, " "); @@ -580,6 +581,7 @@ static int tipc_release(struct socket *sock) sk->sk_shutdown = SHUTDOWN_MASK; tipc_sk_leave(tsk); tipc_sk_withdraw(tsk, 0, NULL); + __skb_queue_purge(&tsk->mc_method.deferredq); sk_stop_timer(sk, &sk->sk_timer); tipc_sk_remove(tsk); @@ -2157,6 +2159,9 @@ static void 
tipc_sk_filter_rcv(struct sock *sk, struct sk_buff *skb, if (unlikely(grp)) tipc_group_filter_msg(grp, &inputq, xmitq); + if (msg_type(hdr) == TIPC_MCAST_MSG) + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, &inputq); + /* Validate and add to receive buffer if there is space */ while ((skb = __skb_dequeue(&inputq))) { hdr = buf_msg(skb); -- 2.17.1 |
From: Hoang Le <hoa...@de...> - 2019-02-18 07:59:16
|
As a preparation for introducing a moothly switching between replicast and broadcast method for multicast message We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature because of compatibility reasons. When a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field and new feature only applicable if the cluster supports this new capability. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 2 ++ net/tipc/core.h | 3 +++ net/tipc/node.c | 17 +++++++++++++++++ net/tipc/node.h | 6 ++++-- 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 5b38f5164281..27cccd101ef6 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -43,6 +43,7 @@ #include "net.h" #include "socket.h" #include "bcast.h" +#include "node.h" #include <linux/module.h> @@ -59,6 +60,7 @@ static int __net_init tipc_init_net(struct net *net) tn->node_addr = 0; tn->trial_addr = 0; tn->addr_trial_end = 0; + tn->capabilities = TIPC_NODE_CAPABILITIES; memset(tn->node_id, 0, sizeof(tn->node_id)); memset(tn->node_id_string, 0, sizeof(tn->node_id_string)); tn->mon_threshold = TIPC_DEF_MON_THRESHOLD; diff --git a/net/tipc/core.h b/net/tipc/core.h index 8020a6c360ff..7a68e1b6a066 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -122,6 +122,9 @@ struct tipc_net { /* Topology subscription server */ struct tipc_topsrv *topsrv; atomic_t subscription_count; + + /* Cluster capabilities */ + u16 capabilities; }; static inline struct tipc_net *tipc_net(struct net *net) diff --git a/net/tipc/node.c b/net/tipc/node.c index db2a6c3e0be9..f182b9129059 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -383,6 +383,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_link_update_caps(l, capabilities); } write_unlock_bh(&n->lock); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -433,6 +438,11 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, break; } list_add_tail_rcu(&n->list, &temp_node->list); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } trace_tipc_node_create(n, true, " "); exit: spin_unlock_bh(&tn->node_list_lock); @@ -604,6 +614,13 @@ static bool tipc_node_cleanup(struct tipc_node *peer) deleted = true; } tipc_node_write_unlock(peer); + + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } + spin_unlock_bh(&tn->node_list_lock); return deleted; } diff --git a/net/tipc/node.h b/net/tipc/node.h index 4f59a30e989a..2404225c5d58 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -51,7 +51,8 @@ enum { TIPC_BLOCK_FLOWCTL = (1 << 3), TIPC_BCAST_RCAST = (1 << 4), TIPC_NODE_ID128 = (1 << 5), - TIPC_LINK_PROTO_SEQNO = (1 << 6) + TIPC_LINK_PROTO_SEQNO = (1 << 6), + TIPC_MCAST_RBCTL = (1 << 7) }; #define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT | \ @@ -60,7 +61,8 @@ enum { TIPC_BCAST_RCAST | \ TIPC_BLOCK_FLOWCTL | \ TIPC_NODE_ID128 | \ - TIPC_LINK_PROTO_SEQNO) + TIPC_LINK_PROTO_SEQNO | \ + TIPC_MCAST_RBCTL) #define INVALID_BEARER_ID -1 void tipc_node_stop(struct 
net *net); -- 2.17.1 |
From: Hoang Le <hoa...@de...> - 2019-02-18 07:54:51
|
This command is to support broacast/replicast configurable for broadcast-link. A sample usage is shown below: $tipc link set broadcast BROADCAST $tipc link set broadcast SELECTABLE ratio 25 $tipc link set broadcast -h Usage: tipc link set broadcast PROPERTY PROPERTIES BROADCAST - Forces all multicast traffic to be transmitted via broadcast only, irrespective of cluster size and number of destinations REPLICAST - Forces all multicast traffic to be transmitted via replicast only, irrespective of cluster size and number of destinations SELECTABLE - Auto switching to broadcast or replicast depending on cluster size and destination number ratio SIZE - Set the selection ratio for SELECTABLE PROPERTY Signed-off-by: Hoang Le <hoa...@de...> --- include/uapi/linux/tipc_netlink.h | 2 + tipc/link.c | 95 ++++++++++++++++++++++++++++++- 2 files changed, 96 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h index 0ebe02ef1a86..efb958fd167d 100644 --- a/include/uapi/linux/tipc_netlink.h +++ b/include/uapi/linux/tipc_netlink.h @@ -281,6 +281,8 @@ enum { TIPC_NLA_PROP_TOL, /* u32 */ TIPC_NLA_PROP_WIN, /* u32 */ TIPC_NLA_PROP_MTU, /* u32 */ + TIPC_NLA_PROP_BROADCAST, /* u32 */ + TIPC_NLA_PROP_BROADCAST_RATIO, /* u32 */ __TIPC_NLA_PROP_MAX, TIPC_NLA_PROP_MAX = __TIPC_NLA_PROP_MAX - 1 diff --git a/tipc/link.c b/tipc/link.c index 43e26da3fa6b..c1db400f1b26 100644 --- a/tipc/link.c +++ b/tipc/link.c @@ -28,6 +28,9 @@ #define PRIORITY_STR "priority" #define TOLERANCE_STR "tolerance" #define WINDOW_STR "window" +#define BROADCAST_STR "broadcast" + +static const char tipc_bclink_name[] = "broadcast-link"; static int link_list_cb(const struct nlmsghdr *nlh, void *data) { @@ -521,7 +524,8 @@ static void cmd_link_set_help(struct cmdl *cmdl) "PROPERTIES\n" " tolerance TOLERANCE - Set link tolerance\n" " priority PRIORITY - Set link priority\n" - " window WINDOW - Set link window\n", + " window WINDOW - Set link window\n" + " broadcast BROADCAST - Set link broadcast\n", cmdl->argv[0]); } @@ -585,6 +589,94 @@ static int cmd_link_set_prop(struct nlmsghdr *nlh, const struct cmd *cmd, return msg_doit(nlh, link_get_cb, &prop); } +static void cmd_link_set_bcast_help(struct cmdl *cmdl) +{ + fprintf(stderr, "Usage: %s link set broadcast PROPERTY\n\n" + "PROPERTIES\n" + " BROADCAST - Forces all multicast traffic to be\n" + " transmitted via broadcast only,\n" + " irrespective of cluster size and number\n" + " of destinations\n\n" + " REPLICAST - Forces all multicast traffic to be\n" + " transmitted via replicast only,\n" + " irrespective of cluster size and number\n" + " of destinations\n\n" + " SELECTABLE - Auto switching to broadcast or replicast\n" + " depending on cluster size and destination\n" + " number\n\n" + " ratio SIZE - Set the selection ratio for SELECTABLE\n\n", + cmdl->argv[0]); +} + +static int cmd_link_set_bcast(struct nlmsghdr *nlh, const struct cmd *cmd, + struct cmdl *cmdl, void *data) +{ + char buf[MNL_SOCKET_BUFFER_SIZE]; + struct nlattr *props; + struct nlattr *attrs; + struct opt *opt; + struct opt opts[] = { + { "BROADCAST", OPT_KEY, NULL }, + { "REPLICAST", OPT_KEY, NULL }, + { "SELECTABLE", OPT_KEY, NULL }, + { "ratio", OPT_KEYVAL, NULL }, + { NULL } + }; + int method = 0; + + if (help_flag) { + (cmd->help)(cmdl); + return -EINVAL; + } + + if (parse_opts(opts, cmdl) < 0) + return -EINVAL; + + for (opt = opts; opt->key; opt++) + if (opt->val) + break; + + if (!opt || !opt->key) { + (cmd->help)(cmdl); + return -EINVAL; + } + + nlh = 
msg_init(buf, TIPC_NL_LINK_SET); + if (!nlh) { + fprintf(stderr, "error, message initialisation failed\n"); + return -1; + } + + attrs = mnl_attr_nest_start(nlh, TIPC_NLA_LINK); + /* Direct to broadcast-link setting */ + mnl_attr_put_strz(nlh, TIPC_NLA_LINK_NAME, tipc_bclink_name); + props = mnl_attr_nest_start(nlh, TIPC_NLA_LINK_PROP); + + if (get_opt(opts, "BROADCAST")) + method = 0x1; + else if (get_opt(opts, "REPLICAST")) + method = 0x2; + else if (get_opt(opts, "SELECTABLE")) + method = 0x4; + + opt = get_opt(opts, "ratio"); + if (!method && !opt) { + (cmd->help)(cmdl); + return -EINVAL; + } + + if (method) + mnl_attr_put_u32(nlh, TIPC_NLA_PROP_BROADCAST, method); + + if (opt) + mnl_attr_put_u32(nlh, TIPC_NLA_PROP_BROADCAST_RATIO, + atoi(opt->val)); + + mnl_attr_nest_end(nlh, props); + mnl_attr_nest_end(nlh, attrs); + return msg_doit(nlh, NULL, NULL); +} + static int cmd_link_set(struct nlmsghdr *nlh, const struct cmd *cmd, struct cmdl *cmdl, void *data) { @@ -592,6 +684,7 @@ static int cmd_link_set(struct nlmsghdr *nlh, const struct cmd *cmd, { PRIORITY_STR, cmd_link_set_prop, cmd_link_set_help }, { TOLERANCE_STR, cmd_link_set_prop, cmd_link_set_help }, { WINDOW_STR, cmd_link_set_prop, cmd_link_set_help }, + { BROADCAST_STR, cmd_link_set_bcast, cmd_link_set_bcast_help }, { NULL } }; -- 2.17.1 |
From: Hoang Le <hoa...@de...> - 2019-02-18 07:54:50
|
The command prints the actually method that multicast is running in the system. Also 'ratio' value for SELECTABLE method. A sample usage is shown below: $tipc link get broadcast BROADCAST $tipc link get broadcast SELECTABLE ratio:30% $tipc link get broadcast -j -p [ { "method": "SELECTABLE" },{ "ratio": 30 } ] Signed-off-by: Hoang Le <hoa...@de...> --- tipc/link.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 82 insertions(+), 1 deletion(-) diff --git a/tipc/link.c b/tipc/link.c index c1db400f1b26..97c75122038e 100644 --- a/tipc/link.c +++ b/tipc/link.c @@ -175,10 +175,90 @@ static void cmd_link_get_help(struct cmdl *cmdl) "PROPERTIES\n" " tolerance - Get link tolerance\n" " priority - Get link priority\n" - " window - Get link window\n", + " window - Get link window\n" + " broadcast - Get link broadcast\n", cmdl->argv[0]); } +static int cmd_link_get_bcast_cb(const struct nlmsghdr *nlh, void *data) +{ + int *prop = data; + struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh); + struct nlattr *info[TIPC_NLA_MAX + 1] = {}; + struct nlattr *attrs[TIPC_NLA_LINK_MAX + 1] = {}; + struct nlattr *props[TIPC_NLA_PROP_MAX + 1] = {}; + int bc_mode; + + mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info); + if (!info[TIPC_NLA_LINK]) + return MNL_CB_ERROR; + + mnl_attr_parse_nested(info[TIPC_NLA_LINK], parse_attrs, attrs); + if (!attrs[TIPC_NLA_LINK_PROP]) + return MNL_CB_ERROR; + + mnl_attr_parse_nested(attrs[TIPC_NLA_LINK_PROP], parse_attrs, props); + if (!props[*prop]) + return MNL_CB_ERROR; + + bc_mode = mnl_attr_get_u32(props[*prop]); + + new_json_obj(json); + open_json_object(NULL); + switch (bc_mode) { + case 0x1: + print_string(PRINT_ANY, "method", "%s\n", "BROADCAST"); + break; + case 0x2: + print_string(PRINT_ANY, "method", "%s\n", "REPLICAST"); + break; + case 0x4: + print_string(PRINT_ANY, "method", "%s", "SELECTABLE"); + close_json_object(); + open_json_object(NULL); + print_uint(PRINT_ANY, "ratio", " ratio:%u%\n", mnl_attr_get_u32(props[TIPC_NLA_PROP_BROADCAST_RATIO])); + break; + default: + print_string(PRINT_ANY, NULL, "UNKNOWN\n", NULL); + break; + } + close_json_object(); + delete_json_obj(); + return MNL_CB_OK; +} + +static void cmd_link_get_bcast_help(struct cmdl *cmdl) +{ + fprintf(stderr, "Usage: %s link get PPROPERTY\n\n" + "PROPERTIES\n" + " broadcast - Get link broadcast\n", + cmdl->argv[0]); +} + +static int cmd_link_get_bcast(struct nlmsghdr *nlh, const struct cmd *cmd, + struct cmdl *cmdl, void *data) +{ + int prop = TIPC_NLA_PROP_BROADCAST; + char buf[MNL_SOCKET_BUFFER_SIZE]; + struct nlattr *attrs; + + if (help_flag) { + (cmd->help)(cmdl); + return -EINVAL; + } + + nlh = msg_init(buf, TIPC_NL_LINK_GET); + if (!nlh) { + fprintf(stderr, "error, message initialisation failed\n"); + return -1; + } + attrs = mnl_attr_nest_start(nlh, TIPC_NLA_LINK); + /* Direct to broadcast-link setting */ + mnl_attr_put_strz(nlh, TIPC_NLA_LINK_NAME, tipc_bclink_name); + mnl_attr_nest_end(nlh, attrs); + return msg_doit(nlh, cmd_link_get_bcast_cb, &prop); +} + static int cmd_link_get(struct nlmsghdr *nlh, const struct cmd *cmd, struct cmdl *cmdl, void *data) { @@ -186,6 +266,7 @@ static int cmd_link_get(struct nlmsghdr *nlh, const struct cmd *cmd, { PRIORITY_STR, cmd_link_get_prop, cmd_link_get_help }, { TOLERANCE_STR, cmd_link_get_prop, cmd_link_get_help }, { WINDOW_STR, cmd_link_get_prop, cmd_link_get_help }, + { BROADCAST_STR, cmd_link_get_bcast, cmd_link_get_bcast_help }, { NULL } }; -- 2.17.1 |
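For a program that reads the TIPC_NLA_PROP_BROADCAST attribute directly instead of going through the tipc tool, the value-to-name mapping used in the callback above (0x1, 0x2, 0x4) can be wrapped in a small helper. The sketch below only restates that mapping for illustration; the macro and function names are invented here and are not part of the patch.

    #include <stdint.h>

    /* Values carried in TIPC_NLA_PROP_BROADCAST, as used by the patch */
    #define BCLINK_MODE_BCAST  0x1   /* forced broadcast */
    #define BCLINK_MODE_RCAST  0x2   /* forced replicast */
    #define BCLINK_MODE_SEL    0x4   /* selectable, uses the ratio attribute */

    static const char *bclink_mode_str(uint32_t mode)
    {
            switch (mode) {
            case BCLINK_MODE_BCAST: return "BROADCAST";
            case BCLINK_MODE_RCAST: return "REPLICAST";
            case BCLINK_MODE_SEL:   return "SELECTABLE";
            default:                return "UNKNOWN";
            }
    }

When the mode is SELECTABLE, the ratio is carried separately in TIPC_NLA_PROP_BROADCAST_RATIO, as shown by the callback above.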
From: Amit J. <ami...@te...> - 2019-02-15 06:32:07
|
Hi All, We want to understand which socket options are supported in TIPC (specifically for UDP bearers). TIPC supports socket options at two different levels - SOL_SOCKET and SOL_TIPC. The previous version of the documentation stated explicitly: "TIPC does not currently support many socket options for level SOL_SOCKET, such as SO_SNDBUF. Options that are supported include SO_RCVTIMEO (for all socket types) and SO_RCVLOWAT (for SOCK_STREAM only)". The new documentation makes no such explicit statement, but SO_RCVBUF is explicitly mentioned as supported at the SOL_SOCKET level. What about the other socket options, such as SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUFFORCE, SO_KEEPALIVE, SO_REUSEADDR, SO_LINGER, SO_RCVLOWAT and SO_SNDLOWAT, SO_RCVTIMEO and SO_SNDTIMEO, SO_TIMESTAMP, SO_MARK and SO_OOBINLINE - are these applicable to TIPC sockets? Regards, Amit |
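One practical way to answer this on a given kernel is to probe the options at runtime: SOL_SOCKET options are handled by the generic socket layer for every address family, so setsockopt() will normally accept them on a TIPC socket, and the real question is whether the value has any effect on the TIPC send/receive path. The sketch below only shows how such a probe might look; it makes no claim about which options TIPC actually honours.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    #ifndef SOL_TIPC
    #define SOL_TIPC 271   /* value from linux/socket.h */
    #endif

    int main(void)
    {
            int sd = socket(AF_TIPC, SOCK_RDM, 0);
            int rcvbuf = 512 * 1024;
            int imp = TIPC_HIGH_IMPORTANCE;
            socklen_t len = sizeof(rcvbuf);

            if (sd < 0) {
                    perror("socket");
                    return 1;
            }

            /* Generic SOL_SOCKET option: accepted by the common socket layer */
            if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
                    perror("SO_RCVBUF");
            if (getsockopt(sd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) == 0)
                    printf("effective SO_RCVBUF: %d\n", rcvbuf);

            /* TIPC-specific option at level SOL_TIPC */
            if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE, &imp, sizeof(imp)) < 0)
                    perror("TIPC_IMPORTANCE");
            return 0;
    }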
From: Hoang L. <hoa...@de...> - 2019-02-15 04:51:42
|
Hi Jon, I realized below would never happen in real and become redundancy. I will remove these lines of code as well. Regards, Hoang -----Original Message----- From: Hoang Le <hoa...@de...> Sent: Thursday, February 14, 2019 4:09 PM To: 'Jon Maloy' <jon...@er...>; ma...@do...; yin...@wi...; tip...@li... Subject: Re: [tipc-discussion] [net-next] tipc: smooth change between replicast and broadcast Hi Jon, Do we need purge cascade in this case? This happen when having a packet dropped from A source/peer. Then, filter out only that source/peer instead of __skb_queue_purge(defq) is preferable. [...] > if (!msg_is_syn(_hdr)) { > pr_warn_ratelimited("Non-sync mcast heads deferred queue\n"); > __skb_queue_purge(defq); > return __skb_queue_tail(inputq, skb); > } Thanks and regards, Hoang -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Friday, February 1, 2019 7:39 PM To: Hoang Huu Le <hoa...@de...>; ma...@do...; yin...@wi...; tip...@li... Subject: RE: [net-next] tipc: smooth change between replicast and broadcast Hi Hoang, I realized one more thing. Since this mechanism only is intended for MCAST messages, we must filter out those at the send side (in tipc_mcast_send_sync()) so that SYN messages are sent only for those. Group messaging, which has its own mechanism, does not need this new method, and cannot discover message duplicates, -it would just deliver them. This must be avoided. BR ///jon > -----Original Message----- > From: Jon Maloy > Sent: 31-Jan-19 16:05 > To: 'Hoang Le' <hoa...@de...>; ma...@do...; > yin...@wi...; tip...@li... > Subject: RE: [net-next] tipc: smooth change between replicast and broadcast > > Hi Hoang, > Nice job, but still a few things to improve. See below. > > > -----Original Message----- > > From: Hoang Le <hoa...@de...> > > Sent: 29-Jan-19 04:22 > > To: Jon Maloy <jon...@er...>; ma...@do...; > > yin...@wi...; tip...@li... > > Subject: [net-next] tipc: smooth change between replicast and > > broadcast > > > > Currently, a multicast stream may start out using replicast, because > > there are few destinations, and then it should ideally switch to > > L2/broadcast IGMP/multicast when the number of destinations grows > > beyond a certain limit. The opposite should happen when the number > > decreases below the limit. > > > > To eliminate the risk of message reordering caused by method change, a > > sending socket must stick to a previously selected method until it > > enters an idle period of 5 seconds. Means there is a 5 seconds pause > > in the traffic from the sender socket. > > > > In this fix, we allow such a switch between replicast and broadcast > > without a > > 5 seconds pause in the traffic. > > > > Solution is to send a dummy message with only the header, also with > > the SYN bit set, via broadcast/replicast. For the data message, the > > SYN bit set and sending via replicast/broadcast. > > > > Then, at receiving side any messages follow first SYN bit message > > (data or dummy), they will be hold in deferred queue until another > > pair (dummy or > > data) arrived. > > > > For compatibility reasons we have to introduce a new capability flag > > TIPC_MCAST_RBCTL to handle this new feature. Because of there is a > > dummy message sent out, then poll return empty at old machines. 
> > > > Signed-off-by: Jon Maloy <jon...@er...> > > Signed-off-by: Hoang Le <hoa...@de...> > > --- > > net/tipc/bcast.c | 116 > > +++++++++++++++++++++++++++++++++++++++++++++- > > net/tipc/bcast.h | 5 ++ > > net/tipc/core.c | 2 + > > net/tipc/core.h | 3 ++ > > net/tipc/msg.h | 10 ++++ > > net/tipc/node.c | 10 ++++ > > net/tipc/node.h | 6 ++- > > net/tipc/socket.c | 10 ++++ > > 8 files changed, 159 insertions(+), 3 deletions(-) > > > > diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index > > d8026543bf4c..e3a85227d4aa 100644 > > --- a/net/tipc/bcast.c > > +++ b/net/tipc/bcast.c > > @@ -295,11 +295,15 @@ int tipc_mcast_xmit(struct net *net, struct > > sk_buff_head *pkts, > > struct tipc_mc_method *method, struct tipc_nlist *dests, > > u16 *cong_link_cnt) > > { > > - struct sk_buff_head inputq, localq; > > + struct sk_buff_head inputq, localq, tmpq; > > + bool rcast = method->rcast; > > + struct sk_buff *skb, *_skb; > > + struct tipc_msg *hdr, *_hdr; > > int rc = 0; > > > > skb_queue_head_init(&inputq); > > skb_queue_head_init(&localq); > > + skb_queue_head_init(&tmpq); > > > > /* Clone packets before they are consumed by next call */ > > if (dests->local && !tipc_msg_reassemble(pkts, &localq)) { @@ - > > 309,6 +313,53 @@ int tipc_mcast_xmit(struct net *net, struct > > sk_buff_head *pkts, > > /* Send according to determined transmit method */ > > if (dests->remote) { > > tipc_bcast_select_xmit_method(net, dests->remote, > method); > > I would suggest that you move the whole code block below into a separate > function: > msg_set_is_rcast(hdr, method->rcast); > if (rcast != method->rcast) > tipc_mcast_send_sync(net, skb_peek(pkts)); > > This function sets the SYN bit in the packet header, then copies that header > into a dummy header, inverts the is_rcast bit in that header and sends it out > via the appropriate method. > Note that this involves a small change: the real message is sent out via the > selected method, the dummy message always via the other method, > whichever it is. 
> > > + > > + if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) { > > + skb = skb_peek(pkts); > > + hdr = buf_msg(skb); > > + > > + if (msg_user(hdr) == MSG_FRAGMENTER) > > + hdr = msg_get_wrapped(hdr); > > + if (msg_type(hdr) != TIPC_MCAST_MSG) > > + goto xmit; > > + > > + msg_set_syn(hdr, 0); > > + msg_set_is_rcast(hdr, method->rcast); > > + > > + /* switch mode */ > > + if (rcast != method->rcast) { > > + /* Build message's copied */ > > + _skb = tipc_buf_acquire(MCAST_H_SIZE, > > + GFP_KERNEL); > > + if (!skb) { > > + rc = -ENOMEM; > > + goto exit; > > + } > > + skb_orphan(_skb); > > + skb_copy_to_linear_data(_skb, hdr, > > + MCAST_H_SIZE); > > + > > + /* Build dummy header */ > > + _hdr = buf_msg(_skb); > > + msg_set_size(_hdr, MCAST_H_SIZE); > > + __skb_queue_tail(&tmpq, _skb); > > + > > + msg_set_syn(hdr, 1); > > + msg_set_syn(_hdr, 1); > > + msg_set_is_rcast(_hdr, rcast); > > + /* Prepare for 'synching' */ > > + if (rcast) > > + tipc_rcast_xmit(net, &tmpq, dests, > > + cong_link_cnt); > > + else > > + tipc_bcast_xmit(net, &tmpq, > > + cong_link_cnt); > > + > > + /* This queue should normally be empty by > > now */ > > + __skb_queue_purge(&tmpq); > > + } > > + } > > +xmit: > > if (method->rcast) > > rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); > > else > > @@ -576,3 +627,66 @@ void tipc_nlist_purge(struct tipc_nlist *nl) > > nl->remote = 0; > > nl->local = false; > > } > > + > > +void tipc_mcast_filter_msg(struct sk_buff_head *defq, > > + struct sk_buff_head *inputq) > > +{ > > + struct sk_buff *skb, *_skb; > > + struct tipc_msg *hdr, *_hdr; > > + u32 node, port, _node, _port; > > + bool match = false; > > If you put in the following lines here we will save some instruction cycles: > hdr = buf_msg(skb_peek(inputq)); > if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) > return; > > After all this will be the case in the vast majority of cases. It is safe to just > peek the queue and access, since inputq never can be empty. > > > + > > + skb = __skb_dequeue(inputq); > > + if (!skb) > > + return; > > + > > + hdr = buf_msg(skb); > > + node = msg_orignode(hdr); > > + port = msg_origport(hdr); > > + > > + /* Find a peer port if its existing in defer queue */ > > + while ((_skb = skb_peek(defq))) { > > + _hdr = buf_msg(_skb); > > + _node = msg_orignode(_hdr); > > + _port = msg_origport(_hdr); > > + > > + if (_node != node) > > + continue; > > + if (_port != port) > > + continue; > > + > > + if (!match) { > > + if (msg_is_syn(hdr) && > > + msg_is_rcast(hdr) != msg_is_rcast(_hdr)) { > > + __skb_dequeue(defq); > > + if (msg_data_sz(hdr)) { > > + __skb_queue_tail(inputq, skb); > > + kfree_skb(_skb); > > + } else { > > + __skb_queue_tail(inputq, _skb); > > + kfree_skb(skb); > > + } > > + match = true; > > + } else { > > + break; > > + } > > + } else { > > + if (msg_is_syn(_hdr)) > > + return; > > + /* Dequeued to receive buffer */ > > + __skb_dequeue(defq); > > + __skb_queue_tail(inputq, _skb); > > + } > > + } > > + > > + if (match) > > + return; > > + > > + if (msg_is_syn(hdr)) { > > + /* Enqueue and defer to next synching */ > > + __skb_queue_tail(defq, skb); > > + } else { > > + /* Direct enqueued */ > > + __skb_queue_tail(inputq, skb); > > + } > > +} > > Above function is hard to follow and does not convince me. What if there are > messages from many sources, and you find the match in the middle of the > queue? 
> I suggest you break down the logics to smaller tasks, e.g., as follows (code > compiles ok, but is untested): > > void tipc_mcast_filter_msg(struct sk_buff_head *defq, > struct sk_buff_head *inputq) { > struct sk_buff *skb, *_skb, *tmp; > struct tipc_msg *hdr, *_hdr; > bool match = false; > u32 node, port; > > skb = skb_peek(inputq); > hdr = buf_msg(skb); > if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) > return; > > __skb_dequeue(inputq); > node = msg_orignode(hdr); > port = msg_origport(hdr); > > /* Has the twin SYN message already arrived ? */ > skb_queue_walk(defq, _skb) { > _hdr = buf_msg(_skb); > if (msg_orignode(_hdr) != node) > continue; > if (msg_origport(_hdr) != port) > continue; > match = true; > break; > } > > if (!match) > return __skb_queue_tail(defq, skb); > > if (!msg_is_syn(_hdr)) { > pr_warn_ratelimited("Non-sync mcast heads deferred queue\n"); > __skb_queue_purge(defq); > return __skb_queue_tail(inputq, skb); > } > > /* Non-SYN message from other link can be delivered right away */ > if (!msg_is_syn(hdr)) { > if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) > return __skb_queue_tail(inputq, skb); > else > return __skb_queue_tail(defq, skb); > } > > /* Matching SYN messages => return the one with data, if any */ > __skb_unlink(_skb, defq); > if (msg_data_sz(hdr)) { > kfree_skb(_skb); > __skb_queue_tail(inputq, skb); > } else { > kfree_skb(skb); > __skb_queue_tail(inputq, _skb); > } > /* Deliver subsequent non-SYN messages from same peer */ > skb_queue_walk_safe(defq, _skb, tmp) { > _hdr = buf_msg(_skb); > if (msg_orignode(_hdr) != node) > continue; > if (msg_origport(_hdr) != port) > continue; > if (msg_is_syn(_hdr)) > break; > __skb_unlink(_skb, defq); > __skb_queue_tail(inputq, _skb); > } > } > > > diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index > > 751530ab0c49..165d88a503e4 100644 > > --- a/net/tipc/bcast.h > > +++ b/net/tipc/bcast.h > > @@ -63,11 +63,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 > > node); > > /* Cookie to be used between socket and broadcast layer > > * @rcast: replicast (instead of broadcast) was used at previous xmit > > * @mandatory: broadcast/replicast indication was set by user > > + * @deferredq: defer queue to make message in order > > * @expires: re-evaluate non-mandatory transmit method if we are past > this > > */ > > [...] > > > @@ -383,6 +383,11 @@ static struct tipc_node *tipc_node_create(struct > > net *net, u32 addr, > > tipc_link_update_caps(l, capabilities); > > } > > write_unlock_bh(&n->lock); > > + /* Calculate cluster capabilities */ > > + tn->capabilities = TIPC_NODE_CAPABILITIES; > > + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { > > + tn->capabilities &= temp_node->capabilities; > > + } > > Yes, you are right here. During a cluster upgrade a node can come back with > new capabilities which also must be reflected in the cluster capabilities field. > Actually, I think it would be a good idea to add cluster capabilities as a > separate patch. This makes this rather complex patch slightly smaller. > > > goto exit; > > } > > n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -433,6 +438,11 @@ static > > struct tipc_node *tipc_node_create(struct net *net, u32 addr, > > break; > > [...] 
> > > > > @@ -817,6 +819,11 @@ static int tipc_sendmcast(struct socket *sock, > > struct tipc_name_seq *seq, > > &tsk->cong_link_cnt); > > } > > > > + /* Broadcast link is now free to choose method for next broadcast */ > > + if (rc == 0) { > > + method->mandatory = false; > > + method->expires = jiffies; > > + } > > No, we should leave the socket code as is, so we are sure it works with legacy > nodes. > We should instead make tipc_bcast_select_input_method() slightly smarter: > > static void tipc_bcast_select_xmit_method(struct net *net, int dests, > struct tipc_mc_method *method) { > ....... > /* Can current method be changed ? */ > method->expires = jiffies + TIPC_METHOD_EXPIRE; > if (method->mandatory) > return; > if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) && > time_before(jiffies, exp)) > return; > > /* Determine method to use now */ > method->rcast = dests <= bb->bc_threshold; } > > I.e., we respect the 'mandatory' setting, because we need that for > group_cast wo work correctly, but we override 'method->expire' if the > cluster capabilities says that all nodes support MCAST_RBCTL. > Combined with the patch where we add forced BCAST or REPLICAST (make > sure you add this patch on top of that one) I think we have a achieved a > pretty smart and adaptive multicast subsystem. > > BR > ///jon > > > > tipc_nlist_purge(&dsts); > > > > return rc ? rc : dlen; > > @@ -2157,6 +2164,9 @@ static void tipc_sk_filter_rcv(struct sock *sk, > > struct sk_buff *skb, > > if (unlikely(grp)) > > tipc_group_filter_msg(grp, &inputq, xmitq); > > > > + if (msg_type(hdr) == TIPC_MCAST_MSG) > > + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, > > &inputq); > > + > > /* Validate and add to receive buffer if there is space */ > > while ((skb = __skb_dequeue(&inputq))) { > > hdr = buf_msg(skb); > > -- > > 2.17.1 _______________________________________________ tipc-discussion mailing list tip...@li... https://lists.sourceforge.net/lists/listinfo/tipc-discussion |
From: Hoang L. <hoa...@de...> - 2019-02-14 09:09:08
|
Hi Jon, Do we need purge cascade in this case? This happen when having a packet dropped from A source/peer. Then, filter out only that source/peer instead of __skb_queue_purge(defq) is preferable. [...] > if (!msg_is_syn(_hdr)) { > pr_warn_ratelimited("Non-sync mcast heads deferred queue\n"); > __skb_queue_purge(defq); > return __skb_queue_tail(inputq, skb); > } Thanks and regards, Hoang -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Friday, February 1, 2019 7:39 PM To: Hoang Huu Le <hoa...@de...>; ma...@do...; yin...@wi...; tip...@li... Subject: RE: [net-next] tipc: smooth change between replicast and broadcast Hi Hoang, I realized one more thing. Since this mechanism only is intended for MCAST messages, we must filter out those at the send side (in tipc_mcast_send_sync()) so that SYN messages are sent only for those. Group messaging, which has its own mechanism, does not need this new method, and cannot discover message duplicates, -it would just deliver them. This must be avoided. BR ///jon > -----Original Message----- > From: Jon Maloy > Sent: 31-Jan-19 16:05 > To: 'Hoang Le' <hoa...@de...>; ma...@do...; > yin...@wi...; tip...@li... > Subject: RE: [net-next] tipc: smooth change between replicast and broadcast > > Hi Hoang, > Nice job, but still a few things to improve. See below. > > > -----Original Message----- > > From: Hoang Le <hoa...@de...> > > Sent: 29-Jan-19 04:22 > > To: Jon Maloy <jon...@er...>; ma...@do...; > > yin...@wi...; tip...@li... > > Subject: [net-next] tipc: smooth change between replicast and > > broadcast > > > > Currently, a multicast stream may start out using replicast, because > > there are few destinations, and then it should ideally switch to > > L2/broadcast IGMP/multicast when the number of destinations grows > > beyond a certain limit. The opposite should happen when the number > > decreases below the limit. > > > > To eliminate the risk of message reordering caused by method change, a > > sending socket must stick to a previously selected method until it > > enters an idle period of 5 seconds. Means there is a 5 seconds pause > > in the traffic from the sender socket. > > > > In this fix, we allow such a switch between replicast and broadcast > > without a > > 5 seconds pause in the traffic. > > > > Solution is to send a dummy message with only the header, also with > > the SYN bit set, via broadcast/replicast. For the data message, the > > SYN bit set and sending via replicast/broadcast. > > > > Then, at receiving side any messages follow first SYN bit message > > (data or dummy), they will be hold in deferred queue until another > > pair (dummy or > > data) arrived. > > > > For compatibility reasons we have to introduce a new capability flag > > TIPC_MCAST_RBCTL to handle this new feature. Because of there is a > > dummy message sent out, then poll return empty at old machines. 
> > > > Signed-off-by: Jon Maloy <jon...@er...> > > Signed-off-by: Hoang Le <hoa...@de...> > > --- > > net/tipc/bcast.c | 116 > > +++++++++++++++++++++++++++++++++++++++++++++- > > net/tipc/bcast.h | 5 ++ > > net/tipc/core.c | 2 + > > net/tipc/core.h | 3 ++ > > net/tipc/msg.h | 10 ++++ > > net/tipc/node.c | 10 ++++ > > net/tipc/node.h | 6 ++- > > net/tipc/socket.c | 10 ++++ > > 8 files changed, 159 insertions(+), 3 deletions(-) > > > > diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index > > d8026543bf4c..e3a85227d4aa 100644 > > --- a/net/tipc/bcast.c > > +++ b/net/tipc/bcast.c > > @@ -295,11 +295,15 @@ int tipc_mcast_xmit(struct net *net, struct > > sk_buff_head *pkts, > > struct tipc_mc_method *method, struct tipc_nlist *dests, > > u16 *cong_link_cnt) > > { > > - struct sk_buff_head inputq, localq; > > + struct sk_buff_head inputq, localq, tmpq; > > + bool rcast = method->rcast; > > + struct sk_buff *skb, *_skb; > > + struct tipc_msg *hdr, *_hdr; > > int rc = 0; > > > > skb_queue_head_init(&inputq); > > skb_queue_head_init(&localq); > > + skb_queue_head_init(&tmpq); > > > > /* Clone packets before they are consumed by next call */ > > if (dests->local && !tipc_msg_reassemble(pkts, &localq)) { @@ - > > 309,6 +313,53 @@ int tipc_mcast_xmit(struct net *net, struct > > sk_buff_head *pkts, > > /* Send according to determined transmit method */ > > if (dests->remote) { > > tipc_bcast_select_xmit_method(net, dests->remote, > method); > > I would suggest that you move the whole code block below into a separate > function: > msg_set_is_rcast(hdr, method->rcast); > if (rcast != method->rcast) > tipc_mcast_send_sync(net, skb_peek(pkts)); > > This function sets the SYN bit in the packet header, then copies that header > into a dummy header, inverts the is_rcast bit in that header and sends it out > via the appropriate method. > Note that this involves a small change: the real message is sent out via the > selected method, the dummy message always via the other method, > whichever it is. 
> > > + > > + if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) { > > + skb = skb_peek(pkts); > > + hdr = buf_msg(skb); > > + > > + if (msg_user(hdr) == MSG_FRAGMENTER) > > + hdr = msg_get_wrapped(hdr); > > + if (msg_type(hdr) != TIPC_MCAST_MSG) > > + goto xmit; > > + > > + msg_set_syn(hdr, 0); > > + msg_set_is_rcast(hdr, method->rcast); > > + > > + /* switch mode */ > > + if (rcast != method->rcast) { > > + /* Build message's copied */ > > + _skb = tipc_buf_acquire(MCAST_H_SIZE, > > + GFP_KERNEL); > > + if (!skb) { > > + rc = -ENOMEM; > > + goto exit; > > + } > > + skb_orphan(_skb); > > + skb_copy_to_linear_data(_skb, hdr, > > + MCAST_H_SIZE); > > + > > + /* Build dummy header */ > > + _hdr = buf_msg(_skb); > > + msg_set_size(_hdr, MCAST_H_SIZE); > > + __skb_queue_tail(&tmpq, _skb); > > + > > + msg_set_syn(hdr, 1); > > + msg_set_syn(_hdr, 1); > > + msg_set_is_rcast(_hdr, rcast); > > + /* Prepare for 'synching' */ > > + if (rcast) > > + tipc_rcast_xmit(net, &tmpq, dests, > > + cong_link_cnt); > > + else > > + tipc_bcast_xmit(net, &tmpq, > > + cong_link_cnt); > > + > > + /* This queue should normally be empty by > > now */ > > + __skb_queue_purge(&tmpq); > > + } > > + } > > +xmit: > > if (method->rcast) > > rc = tipc_rcast_xmit(net, pkts, dests, cong_link_cnt); > > else > > @@ -576,3 +627,66 @@ void tipc_nlist_purge(struct tipc_nlist *nl) > > nl->remote = 0; > > nl->local = false; > > } > > + > > +void tipc_mcast_filter_msg(struct sk_buff_head *defq, > > + struct sk_buff_head *inputq) > > +{ > > + struct sk_buff *skb, *_skb; > > + struct tipc_msg *hdr, *_hdr; > > + u32 node, port, _node, _port; > > + bool match = false; > > If you put in the following lines here we will save some instruction cycles: > hdr = buf_msg(skb_peek(inputq)); > if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) > return; > > After all this will be the case in the vast majority of cases. It is safe to just > peek the queue and access, since inputq never can be empty. > > > + > > + skb = __skb_dequeue(inputq); > > + if (!skb) > > + return; > > + > > + hdr = buf_msg(skb); > > + node = msg_orignode(hdr); > > + port = msg_origport(hdr); > > + > > + /* Find a peer port if its existing in defer queue */ > > + while ((_skb = skb_peek(defq))) { > > + _hdr = buf_msg(_skb); > > + _node = msg_orignode(_hdr); > > + _port = msg_origport(_hdr); > > + > > + if (_node != node) > > + continue; > > + if (_port != port) > > + continue; > > + > > + if (!match) { > > + if (msg_is_syn(hdr) && > > + msg_is_rcast(hdr) != msg_is_rcast(_hdr)) { > > + __skb_dequeue(defq); > > + if (msg_data_sz(hdr)) { > > + __skb_queue_tail(inputq, skb); > > + kfree_skb(_skb); > > + } else { > > + __skb_queue_tail(inputq, _skb); > > + kfree_skb(skb); > > + } > > + match = true; > > + } else { > > + break; > > + } > > + } else { > > + if (msg_is_syn(_hdr)) > > + return; > > + /* Dequeued to receive buffer */ > > + __skb_dequeue(defq); > > + __skb_queue_tail(inputq, _skb); > > + } > > + } > > + > > + if (match) > > + return; > > + > > + if (msg_is_syn(hdr)) { > > + /* Enqueue and defer to next synching */ > > + __skb_queue_tail(defq, skb); > > + } else { > > + /* Direct enqueued */ > > + __skb_queue_tail(inputq, skb); > > + } > > +} > > Above function is hard to follow and does not convince me. What if there are > messages from many sources, and you find the match in the middle of the > queue? 
> I suggest you break down the logics to smaller tasks, e.g., as follows (code > compiles ok, but is untested): > > void tipc_mcast_filter_msg(struct sk_buff_head *defq, > struct sk_buff_head *inputq) { > struct sk_buff *skb, *_skb, *tmp; > struct tipc_msg *hdr, *_hdr; > bool match = false; > u32 node, port; > > skb = skb_peek(inputq); > hdr = buf_msg(skb); > if (likely(!msg_is_syn(hdr) && skb_queue_empty(defq))) > return; > > __skb_dequeue(inputq); > node = msg_orignode(hdr); > port = msg_origport(hdr); > > /* Has the twin SYN message already arrived ? */ > skb_queue_walk(defq, _skb) { > _hdr = buf_msg(_skb); > if (msg_orignode(_hdr) != node) > continue; > if (msg_origport(_hdr) != port) > continue; > match = true; > break; > } > > if (!match) > return __skb_queue_tail(defq, skb); > > if (!msg_is_syn(_hdr)) { > pr_warn_ratelimited("Non-sync mcast heads deferred queue\n"); > __skb_queue_purge(defq); > return __skb_queue_tail(inputq, skb); > } > > /* Non-SYN message from other link can be delivered right away */ > if (!msg_is_syn(hdr)) { > if (msg_is_rcast(hdr) != msg_is_rcast(_hdr)) > return __skb_queue_tail(inputq, skb); > else > return __skb_queue_tail(defq, skb); > } > > /* Matching SYN messages => return the one with data, if any */ > __skb_unlink(_skb, defq); > if (msg_data_sz(hdr)) { > kfree_skb(_skb); > __skb_queue_tail(inputq, skb); > } else { > kfree_skb(skb); > __skb_queue_tail(inputq, _skb); > } > /* Deliver subsequent non-SYN messages from same peer */ > skb_queue_walk_safe(defq, _skb, tmp) { > _hdr = buf_msg(_skb); > if (msg_orignode(_hdr) != node) > continue; > if (msg_origport(_hdr) != port) > continue; > if (msg_is_syn(_hdr)) > break; > __skb_unlink(_skb, defq); > __skb_queue_tail(inputq, _skb); > } > } > > > diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h index > > 751530ab0c49..165d88a503e4 100644 > > --- a/net/tipc/bcast.h > > +++ b/net/tipc/bcast.h > > @@ -63,11 +63,13 @@ void tipc_nlist_del(struct tipc_nlist *nl, u32 > > node); > > /* Cookie to be used between socket and broadcast layer > > * @rcast: replicast (instead of broadcast) was used at previous xmit > > * @mandatory: broadcast/replicast indication was set by user > > + * @deferredq: defer queue to make message in order > > * @expires: re-evaluate non-mandatory transmit method if we are past > this > > */ > > [...] > > > @@ -383,6 +383,11 @@ static struct tipc_node *tipc_node_create(struct > > net *net, u32 addr, > > tipc_link_update_caps(l, capabilities); > > } > > write_unlock_bh(&n->lock); > > + /* Calculate cluster capabilities */ > > + tn->capabilities = TIPC_NODE_CAPABILITIES; > > + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { > > + tn->capabilities &= temp_node->capabilities; > > + } > > Yes, you are right here. During a cluster upgrade a node can come back with > new capabilities which also must be reflected in the cluster capabilities field. > Actually, I think it would be a good idea to add cluster capabilities as a > separate patch. This makes this rather complex patch slightly smaller. > > > goto exit; > > } > > n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -433,6 +438,11 @@ static > > struct tipc_node *tipc_node_create(struct net *net, u32 addr, > > break; > > [...] 
> > > > > @@ -817,6 +819,11 @@ static int tipc_sendmcast(struct socket *sock, > > struct tipc_name_seq *seq, > > &tsk->cong_link_cnt); > > } > > > > + /* Broadcast link is now free to choose method for next broadcast */ > > + if (rc == 0) { > > + method->mandatory = false; > > + method->expires = jiffies; > > + } > > No, we should leave the socket code as is, so we are sure it works with legacy > nodes. > We should instead make tipc_bcast_select_input_method() slightly smarter: > > static void tipc_bcast_select_xmit_method(struct net *net, int dests, > struct tipc_mc_method *method) { > ....... > /* Can current method be changed ? */ > method->expires = jiffies + TIPC_METHOD_EXPIRE; > if (method->mandatory) > return; > if (!(tipc_net(net)->capabilities & TIPC_MCAST_RBCTL)) && > time_before(jiffies, exp)) > return; > > /* Determine method to use now */ > method->rcast = dests <= bb->bc_threshold; } > > I.e., we respect the 'mandatory' setting, because we need that for > group_cast wo work correctly, but we override 'method->expire' if the > cluster capabilities says that all nodes support MCAST_RBCTL. > Combined with the patch where we add forced BCAST or REPLICAST (make > sure you add this patch on top of that one) I think we have a achieved a > pretty smart and adaptive multicast subsystem. > > BR > ///jon > > > > tipc_nlist_purge(&dsts); > > > > return rc ? rc : dlen; > > @@ -2157,6 +2164,9 @@ static void tipc_sk_filter_rcv(struct sock *sk, > > struct sk_buff *skb, > > if (unlikely(grp)) > > tipc_group_filter_msg(grp, &inputq, xmitq); > > > > + if (msg_type(hdr) == TIPC_MCAST_MSG) > > + tipc_mcast_filter_msg(&tsk->mc_method.deferredq, > > &inputq); > > + > > /* Validate and add to receive buffer if there is space */ > > while ((skb = __skb_dequeue(&inputq))) { > > hdr = buf_msg(skb); > > -- > > 2.17.1 |
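Hoang's question at the top of this message, about dropping only the packets
from the offending source/peer instead of purging the whole deferred queue,
could be prototyped with a small helper along these lines. This is a
hypothetical sketch, not part of the posted patch; the helper name is
invented, and it relies only on the queue and message accessors already used
elsewhere in this thread.

static void tipc_mcast_purge_peer(struct sk_buff_head *defq,
                                  u32 node, u32 port)
{
        struct sk_buff *_skb, *tmp;
        struct tipc_msg *_hdr;

        /* Drop only the deferred packets originating from this peer */
        skb_queue_walk_safe(defq, _skb, tmp) {
                _hdr = buf_msg(_skb);
                if (msg_orignode(_hdr) != node)
                        continue;
                if (msg_origport(_hdr) != port)
                        continue;
                __skb_unlink(_skb, defq);
                kfree_skb(_skb);
        }
}

In tipc_mcast_filter_msg(), the warning branch would then call this helper
with the node/port of the offending peer instead of __skb_queue_purge(defq),
leaving deferred packets from other peers untouched.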
From: Tung N. <tun...@de...> - 2019-02-14 06:42:17
|
When sending multicast messages via a blocking socket, if the sending link is
congested (tsk->cong_link_cnt is set to 1), the sending thread is put into a
sleeping state. However, tipc_sk_filter_rcv() is called under the socket spin
lock but tipc_wait_for_cond() is not. So, there is no guarantee that the
setting of tsk->cong_link_cnt to 0 in tipc_sk_proto_rcv() on CPU-1 will be
perceived by CPU-0. If that is the case, the sending thread on CPU-0, after
being woken up, will continue to see tsk->cong_link_cnt as 1 and put itself
into the sleeping state again. The sending thread will then sleep forever.

CPU-0                                 | CPU-1
tipc_wait_for_cond()                  |
{                                     |
 // condition_ = !tsk->cong_link_cnt  |
 while ((rc_ = !(condition_))) {      |
  ...                                 |
  release_sock(sk_);                  |
  wait_woken();                       |
                                      | if (!sock_owned_by_user(sk))
                                      |  tipc_sk_filter_rcv()
                                      |  {
                                      |   ...
                                      |   tipc_sk_proto_rcv()
                                      |   {
                                      |    ...
                                      |    tsk->cong_link_cnt--;
                                      |    ...
                                      |    sk->sk_write_space(sk);
                                      |    ...
                                      |   }
                                      |   ...
                                      |  }
  sched_annotate_sleep();             |
  lock_sock(sk_);                     |
  remove_wait_queue();                |
 }                                    |
}                                     |

This commit fixes it by adding memory barriers to tipc_sk_proto_rcv() and
tipc_wait_for_cond().

Signed-off-by: Tung Nguyen <tun...@de...>
---
 net/tipc/socket.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1217c90a363b..d8f054d45941 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -383,6 +383,8 @@ static int tipc_sk_sock_err(struct socket *sock, long *timeout)
 	int rc_; \
 \
 	while ((rc_ = !(condition_))) { \
+		/* coupled with smp_wmb() in tipc_sk_proto_rcv() */ \
+		smp_rmb(); \
 		DEFINE_WAIT_FUNC(wait_, woken_wake_function); \
 		sk_ = (sock_)->sk; \
 		rc_ = tipc_sk_sock_err((sock_), timeo_); \
@@ -1982,6 +1984,8 @@ static void tipc_sk_proto_rcv(struct sock *sk,
 		return;
 	case SOCK_WAKEUP:
 		tipc_dest_del(&tsk->cong_links, msg_orignode(hdr), 0);
+		/* coupled with smp_rmb() in tipc_wait_for_cond() */
+		smp_wmb();
 		tsk->cong_link_cnt--;
 		wakeup = true;
 		break;
--
2.17.1 |
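The smp_wmb()/smp_rmb() pair added above follows the usual publish/consume
barrier pattern: the writer orders its stores before the flag update, and the
reader orders its loads after testing the flag. A generic, self-contained
illustration of that pairing; the names below are invented for the example
and are not TIPC code.

static int payload;
static int ready;

/* Writer: order the data store before the flag store */
static void publish(void)
{
        payload = 42;
        smp_wmb();              /* pairs with smp_rmb() in consume() */
        WRITE_ONCE(ready, 1);
}

/* Reader: order the flag load before the data load */
static int consume(void)
{
        if (!READ_ONCE(ready))
                return -1;
        smp_rmb();              /* pairs with smp_wmb() in publish() */
        return payload;         /* sees 42 once ready is observed as set */
}

The same discipline is what the patch relies on: the barrier on the writer
side only helps because the woken sender re-checks the condition behind a
matching smp_rmb().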