From: Xue, Y. <Yin...@wi...> - 2019-10-11 19:00:27
I can see this is a good improvement, except that the following switch cases on the return values of tipc_msg_try_bundle() are not very friendly to the code reader. Although I do understand their real meanings, I have to spend time checking the context back and forth. At the least we should replace the meaningless hard-coded case numbers, or else change the return value numbers of tipc_msg_try_bundle().

+		n = tipc_msg_try_bundle(&l->backlog[imp].target_bskb, skb,
+					mtu - INT_H_SIZE,
+					l->addr);
+		switch (n) {
+		case 0:
+			break;
+		case 1:
+			__skb_queue_tail(backlogq, skb);
 			l->backlog[imp].len++;
-			l->stats.sent_bundled++;
+			continue;
+		case 2:
 			l->stats.sent_bundles++;
+			l->stats.sent_bundled++;
+		default:
+			kfree_skb(skb);
+			l->stats.sent_bundled++;
 			continue;

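One way to address this readability concern, sketched below, is to give the return values of tipc_msg_try_bundle() symbolic names. This is only an illustration of the suggestion; the enum, its member names and the exact meaning attached to each value are guesses based on how tipc_msg_try_bundle() is described later in this thread, not something taken from the posted patches.

	enum tipc_bundle_res {
		TIPC_BUNDLE_UNSUITED = 0,	/* skb cannot or should not be bundled */
		TIPC_BUNDLE_KEPT     = 1,	/* skb kept as head of a possible future bundle */
		TIPC_BUNDLE_NEW      = 2,	/* a new bundle buffer was created around skb */
		TIPC_BUNDLE_ADDED    = 3,	/* skb was appended to the current target bundle */
	};

The switch in tipc_link_xmit() would then read "case TIPC_BUNDLE_KEPT:" instead of "case 1:", so the caller can be understood without cross-checking the callee's return statements.
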
From: Jon M. <jon...@er...> - 2019-10-11 16:19:03
Ying and Xin,
At netdev 0x13 in Prague last July there was presented a related proposal, https://netdevconf.info/0x13/session.html?talk-AF_GRAFT. I was there, and I cannot say there was any overwhelming approval of this proposal, but neither was it rejected out of hand.

First, I see TIPC as an IPC, not a network protocol, and anybody using TIPC inside a cluster has per definition been authenticated to start a node and connect to the cluster. Here, there is no change from current policies.

Once a node has been accepted in a cluster, possibly via encrypted discovery messages which have passed all policy checks, and we are 100% certain it is legitimate and located in the same kernel (as we are trying to ensure in this patch), I cannot see any reason why we should not be allowed to short-cut the stack the way we do. Security checks have already been done. Are we circumventing any other policies by doing this that must not be done?

Unless you strongly object I would suggest we send this to netdev as an RFC and observe the reactions. If David or Eric or any of the other heavyweights say flatly no, there is nothing we can do. But it might be worth a try.

////jon

> -----Original Message-----
> From: Xue, Ying <Yin...@wi...>
> Sent: 11-Oct-19 07:58
> To: Jon Maloy <jon...@er...>; Xin Long <lx...@re...>
> Subject: RE: [net-next] tipc: improve throughput between nodes in netns
>
> Exactly. I agree with Xin. The major purpose of namespaces is mainly to provide an isolated environment. But as this patch almost completely bypasses the security check points of the networking stack, the traffic between namespaces will be out of control. So I don't think this is a good idea.
>
> Thanks,
> Ying
>
> -----Original Message-----
> From: Jon Maloy [mailto:jon...@er...]
> Sent: Friday, October 11, 2019 2:14 AM
> To: Xin Long
> Cc: Xue, Ying
> Subject: RE: [net-next] tipc: improve throughput between nodes in netns
>
> Hi Xin,
> I am not surprised by your answer. Apart from concerns about security, this is the same objection I have heard from others when presenting this idea, and I suspect that this would also be the reaction if we try to deliver this to David.
> If we can achieve anything close to this by adding GSO to the veth interface I think that would be a safer approach.
> So, I suggest we put this one to rest for now, and I'll try to go ahead with the GSO approach instead.
>
> Sorry Hoang for making you waste your time.
>
> BR
> ///jon
>
> > -----Original Message-----
> > From: Xin Long <lx...@re...>
> > Sent: 10-Oct-19 07:14
> > To: Jon Maloy <jon...@er...>
> > Cc: Ying Xue <yin...@wi...>
> > Subject: Re: [net-next] tipc: improve throughput between nodes in netns
> >
> > ----- Original Message -----
> > > Ying and Xin,
> > > This is the "wormhole" functionality I have been suggesting since a while back.
> > > Basically, we send messages directly socket to socket between name spaces on the same host, not only between sockets within the same name space.
> > > As you will understand this might have a huge positive impact on performance between e.g., docker containers or containers inside Kubernetes pods.
> > >
> > > Please spend some time reviewing this, as it might be a controversial feature. It is imperative that we get security right here.
> > > > > If I understand it right: > > > > With this patch, TIPC packets will skip all lower layers protocol > > stack, like IP (udp media), ether link layer, which means all rules of > > like tc, ovs, netfiler/br_netfilter will be skipped. > > > > I don't think this could be endured, especially when it comes to a > > cloud environment where many rules are configured on those virtual > > NICs. Unless we have some special needs, I'm not sure if this > > performance improvement is worth a big protocol stack skip. > > > > Thanks. > > > > > BR > > > ///jon > > > > > > > > > -----Original Message----- > > > From: Hoang Le <hoa...@de...> > > > Sent: 2-Oct-19 06:26 > > > To: Jon Maloy <jon...@er...>; ma...@do...; > > > tip...@li... > > > Subject: [net-next] tipc: improve throughput between nodes in netns > > > > > > Introduce traffic cross namespaces transmission as local node. > > > By this way, throughput between nodes in namespace as fast as local. > > > > > > Testcase: > > > $ip netns exec 1 benchmark_client -c 100 $ip netns exec 2 > > > benchmark_server > > > > > > Before: > > > +--------------------------------------------------------------------------------------------- > + > > > | Msg Size | # | # Msgs/ | Elapsed | Throughput > > > | | > > > | [octets] | Conns | Conn | [ms] > > > | +------------------------------------------------+ > > > | | | | | Total [Msg/s] | Total [Mb/s] | > > > | | | | | Per Conn [Mb/s] | > > > +--------------------------------------------------------------------------------------------- > + > > > | 64 | 100 | 64000 | 13005 | 492103 | 251 | > > > | 2 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 256 | 100 | 32000 | 4964 | 644627 | 1320 | > > > | 13 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 1024 | 100 | 16000 | 4524 | 353612 | 2896 | > > > | 28 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 4096 | 100 | 8000 | 3675 | 217644 | 7131 | > > > | 71 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 16384 | 100 | 4000 | 7914 | 50540 | 6624 | > > > | 66 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 65536 | 100 | 2000 | 13000 | 15384 | 8065 | > > > | 80 | > > > +--------------------------------------------------------------------------------------------- > + > > > > > > After: > > > +--------------------------------------------------------------------------------------------- > + > > > | Msg Size | # | # Msgs/ | Elapsed | Throughput > > > | | > > > | [octets] | Conns | Conn | [ms] > > > | +------------------------------------------------+ > > > | | | | | Total [Msg/s] | Total [Mb/s] | > > > | | | | | Per Conn [Mb/s] | > > > +--------------------------------------------------------------------------------------------- > + > > > | 64 | 100 | 64000 | 7842 | 816090 | 417 | > > > | 4 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 256 | 100 | 32000 | 3593 | 890469 | 1823 | > > > | 18 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 1024 | 100 | 16000 | 1835 | 871828 | 7142 | > > > | 71 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 4096 | 100 | 
8000 | 1134 | 704904 | 23098 | > > > | 230 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 16384 | 100 | 4000 | 878 | 455295 | 59676 | > > > | 596 | > > > +--------------------------------------------------------------------------------------------- > + > > > | 65536 | 100 | 2000 | 1007 | 198487 | 104064 | > > > | 1040 | > > > +--------------------------------------------------------------------------------------------- > + > > > > > > Signed-off-by: Hoang Le <hoa...@de...> > > > --- > > > net/tipc/discover.c | 6 ++- > > > net/tipc/msg.h | 10 +++++ > > > net/tipc/name_distr.c | 2 +- > > > net/tipc/node.c | 94 > > +++++++++++++++++++++++++++++++++++++++++-- > > > net/tipc/node.h | 4 +- > > > net/tipc/socket.c | 6 +-- > > > 6 files changed, 111 insertions(+), 11 deletions(-) > > > > > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > > > c138d68e8a69..98d4eea97eb7 100644 > > > --- a/net/tipc/discover.c > > > +++ b/net/tipc/discover.c > > > @@ -38,6 +38,8 @@ > > > #include "node.h" > > > #include "discover.h" > > > > > > +#include <net/netns/hash.h> > > > + > > > /* min delay during bearer start up */ > > > #define TIPC_DISC_INIT msecs_to_jiffies(125) > > > /* max delay if bearer has no links */ @@ -94,6 +96,7 @@ static > > > void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, > > > msg_set_dest_domain(hdr, dest_domain); > > > msg_set_bc_netid(hdr, tn->net_id); > > > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > > > + msg_set_peer_net_hash(hdr, net_hash_mix(net)); > > > msg_set_node_id(hdr, tipc_own_id(net)); } > > > > > > @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct > > > sk_buff > > *skb, > > > u8 peer_id[NODE_ID_LEN] = {0,}; > > > u32 dst = msg_dest_domain(hdr); > > > u32 net_id = msg_bc_netid(hdr); > > > + u32 pnet_hash = msg_peer_net_hash(hdr); > > > struct tipc_media_addr maddr; > > > u32 src = msg_prevnode(hdr); > > > u32 mtyp = msg_type(hdr); > > > @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct > > > sk_buff > > *skb, > > > if (!tipc_in_scope(legacy, b->domain, src)) > > > return; > > > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > > > - &maddr, &respond, &dupl_addr); > > > + pnet_hash, &maddr, &respond, &dupl_addr); > > > if (dupl_addr) > > > disc_dupl_alert(b, src, &maddr); > > > if (!respond) > > > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > > > 0daa6f04ca81..a8d0f28094f2 > > > 100644 > > > --- a/net/tipc/msg.h > > > +++ b/net/tipc/msg.h > > > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > > > tipc_msg *m, u16 n) > > > msg_set_bits(m, 9, 16, 0xffff, n); } > > > > > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > > > + msg_set_word(m, 9, n); > > > +} > > > + > > > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > > > + return msg_word(m, 9); > > > +} > > > + > > > /* Word 10 > > > */ > > > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > > > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > > > 836e629e8f4a..5feaf3b67380 100644 > > > --- a/net/tipc/name_distr.c > > > +++ b/net/tipc/name_distr.c > > > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, > > > struct sk_buff_head *list, > > > struct publication *publ; > > > struct sk_buff *skb = NULL; > > > struct distr_item *item = NULL; > > > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > > > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - > > > 
+INT_H_SIZE) / > > > ITEM_SIZE) * ITEM_SIZE; > > > u32 msg_rem = msg_dsz; > > > > > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > > > c8f6177dd5a2..9a4ffd647701 100644 > > > --- a/net/tipc/node.c > > > +++ b/net/tipc/node.c > > > @@ -45,6 +45,8 @@ > > > #include "netlink.h" > > > #include "trace.h" > > > > > > +#include <net/netns/hash.h> > > > + > > > #define INVALID_NODE_SIG 0x10000 > > > #define NODE_CLEANUP_AFTER 300000 > > > > > > @@ -126,6 +128,7 @@ struct tipc_node { > > > struct timer_list timer; > > > struct rcu_head rcu; > > > unsigned long delete_at; > > > + struct net *pnet; > > > }; > > > > > > /* Node FSM states and events: > > > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > > > tipc_node *n, int sel) > > > return n->links[bearer_id].link; > > > } > > > > > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > > +connected) > > > { > > > struct tipc_node *n; > > > int bearer_id; > > > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 > > > addr, > > > u32 > > > sel) > > > if (unlikely(!n)) > > > return mtu; > > > > > > + /* Allow MAX_MSG_SIZE when building connection oriented message > > > + * if they are in the same core network > > > + */ > > > + if (n->pnet && connected) { > > > + tipc_node_put(n); > > > + return mtu; > > > + } > > > + > > > bearer_id = n->active_links[sel & 1]; > > > if (likely(bearer_id != INVALID_BEARER_ID)) > > > mtu = n->links[bearer_id].mtu; > > > @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct > > > tipc_node *n) } > > > > > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > > > - u8 *peer_id, u16 capabilities) > > > + u8 *peer_id, u16 capabilities, > > > + u32 signature, u32 pnet_hash) > > > { > > > struct tipc_net *tn = net_generic(net, tipc_net_id); > > > struct tipc_node *n, *temp_node; > > > + struct tipc_net *tn_peer; > > > struct tipc_link *l; > > > + struct net *tmp; > > > int bearer_id; > > > int i; > > > > > > @@ -400,6 +414,23 @@ static struct tipc_node > > > *tipc_node_create(struct net *net, u32 addr, > > > memcpy(&n->peer_id, peer_id, 16); > > > n->net = net; > > > n->capabilities = capabilities; > > > + n->pnet = NULL; > > > + for_each_net_rcu(tmp) { > > > + /* Integrity checking whether node exists in namespace or not */ > > > + if (net_hash_mix(tmp) != pnet_hash) > > > + continue; > > > + tn_peer = net_generic(tmp, tipc_net_id); > > > + if (!tn_peer) > > > + continue; > > > + > > > + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) > > > + continue; > > > + > > > + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { > > > + n->pnet = tmp; > > > + break; > > > + } > > > + } > > > kref_init(&n->kref); > > > rwlock_init(&n->lock); > > > INIT_HLIST_NODE(&n->hash); > > > @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > > > u32 > > > addr) > > > > > > void tipc_node_check_dest(struct net *net, u32 addr, > > > u8 *peer_id, struct tipc_bearer *b, > > > - u16 capabilities, u32 signature, > > > + u16 capabilities, u32 signature, u32 pnet_hash, > > > struct tipc_media_addr *maddr, > > > bool *respond, bool *dupl_addr) { @@ -998,7 +1029,8 > @@ void > > > tipc_node_check_dest(struct net *net, u32 > > addr, > > > *dupl_addr = false; > > > *respond = false; > > > > > > - n = tipc_node_create(net, addr, peer_id, capabilities); > > > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > > > + pnet_hash); > > > if (!n) > > > return; > 
> > > > > @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct > > > tipc_nl_msg *msg, struct tipc_node *node) > > > return -EMSGSIZE; > > > } > > > > > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head > > > +*list) { > > > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > > > + struct sk_buff_head inputq; > > > + > > > + switch (msg_user(hdr)) { > > > + case TIPC_LOW_IMPORTANCE: > > > + case TIPC_MEDIUM_IMPORTANCE: > > > + case TIPC_HIGH_IMPORTANCE: > > > + case TIPC_CRITICAL_IMPORTANCE: > > > + if (msg_connected(hdr) || msg_named(hdr)) { > > > + spin_lock_init(&list->lock); > > > + tipc_sk_rcv(pnet, list); > > > + return; > > > + } > > > + if (msg_mcast(hdr)) { > > > + skb_queue_head_init(&inputq); > > > + tipc_sk_mcast_rcv(pnet, list, &inputq); > > > + __skb_queue_purge(list); > > > + skb_queue_purge(&inputq); > > > + return; > > > + } > > > + return; > > > + case MSG_FRAGMENTER: > > > + if (tipc_msg_assemble(list)) { > > > + skb_queue_head_init(&inputq); > > > + tipc_sk_mcast_rcv(pnet, list, &inputq); > > > + __skb_queue_purge(list); > > > + skb_queue_purge(&inputq); > > > + } > > > + return; > > > + case LINK_PROTOCOL: > > > + case NAME_DISTRIBUTOR: > > > + case GROUP_PROTOCOL: > > > + case CONN_MANAGER: > > > + case TUNNEL_PROTOCOL: > > > + case BCAST_PROTOCOL: > > > + return; > > > + default: > > > + return; > > > + }; > > > +} > > > + > > > /** > > > * tipc_node_xmit() is the general link level function for message sending > > > * @net: the applicable net namespace @@ -1439,6 +1514,7 @@ int > > > tipc_node_xmit(struct net *net, struct sk_buff_head *list, > > > struct tipc_link_entry *le = NULL; > > > struct tipc_node *n; > > > struct sk_buff_head xmitq; > > > + bool node_up = false; > > > int bearer_id; > > > int rc; > > > > > > @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct > > > sk_buff_head *list, > > > return -EHOSTUNREACH; > > > } > > > > > > + node_up = node_is_up(n); > > > + if (node_up && n->pnet && check_net(n->pnet)) { > > > + /* xmit inner linux container */ > > > + tipc_lxc_xmit(n->pnet, list); > > > + if (likely(skb_queue_empty(list))) { > > > + tipc_node_put(n); > > > + return 0; > > > + } > > > + } > > > + > > > tipc_node_read_lock(n); > > > bearer_id = n->active_links[selector & 1]; > > > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > > > a/net/tipc/node.h b/net/tipc/node.h index > > 291d0ecd4101..11eb95ce358b > > > 100644 > > > --- a/net/tipc/node.h > > > +++ b/net/tipc/node.h > > > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > > > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > > > tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > > > struct tipc_bearer *bearer, > > > - u16 capabilities, u32 signature, > > > + u16 capabilities, u32 signature, u32 pnet_hash, > > > struct tipc_media_addr *maddr, > > > bool *respond, bool *dupl_addr); void > > > tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 > > > +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct > > > list_head *subscr, > > > u32 addr); void tipc_node_broadcast(struct net *net, struct > > > sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, > > > u32 port, > > > u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 > > > dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, > > > u32 sel); > > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > > +connected); > > > bool tipc_node_is_up(struct net *net, u32 addr); > > 
> u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > > > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); > > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c index > > > 3b9f8cc328f5..fb24df03da6c 100644 > > > --- a/net/tipc/socket.c > > > +++ b/net/tipc/socket.c > > > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > > > struct tipc_sock *tsk, > > > > > > /* Build message as chain of buffers */ > > > __skb_queue_head_init(&pkts); > > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > > if (unlikely(rc != dlen)) > > > return rc; > > > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, > > > struct msghdr *m, size_t dlen) > > > return rc; > > > > > > __skb_queue_head_init(&pkts); > > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > > if (unlikely(rc != dlen)) > > > return rc; > > > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct > > > tipc_sock *tsk, > > > u32 peer_port, > > > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > > > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > > > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > > > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > > > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, > > > +true); > > > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > > > __skb_queue_purge(&sk->sk_write_queue); > > > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > > > -- > > > 2.20.1 > > > > > > |
From: David M. <da...@da...> - 2019-10-02 15:03:24
From: Tuong Lien <tuo...@de...>
Date: Wed, 2 Oct 2019 18:49:43 +0700

> We have identified a problem with the "oversubscription" policy in the
> link transmission code.
 ...

Applied and queued up for -stable.

From: Tuong L. <tuo...@de...> - 2019-10-02 11:50:15
We have identified a problem with the "oversubscription" policy in the link transmission code.

When small messages are transmitted, and the sending link has reached the transmit window limit, those messages will be bundled and put into the link backlog queue. However, bundles of data messages are counted at the 'CRITICAL' level, so that the counter for that level, instead of the counter for the real, bundled message's level, is the one being increased. Subsequent to-be-bundled data messages at non-CRITICAL levels continue to be tested against the unchanged counter for their own level, while contributing to an unrestrained increase at the CRITICAL backlog level.

This leaves a gap in the congestion control algorithm for small messages that can result in starvation of other users or of a "real" CRITICAL user. Eventually it can even lead to buffer exhaustion and link reset.

We fix this by keeping a 'target_bskb' buffer pointer at each level, and when bundling we only bundle messages of the same importance level. This way, we know exactly how many slots a certain level has occupied in the queue, so level congestion can be managed accurately.

By bundling messages at the same level, we gain further benefits. Consider this:
- One socket sends 64-byte messages at the 'CRITICAL' level;
- Another sends 4096-byte messages at the 'LOW' level;

When a 64-byte message comes and is bundled for the first time, we pay the overhead of a message bundle (+ 40-byte header, data copy, etc.) for later use, but the next message can be a 4096-byte one that cannot be bundled with the previous one. This means the last bundle carries only one payload message, which is totally inefficient, for the receiver as well! Later on, another 64-byte message comes, now we make a new bundle and the same story repeats...

With the new bundling algorithm, this will not happen: the 64-byte messages will be bundled together even when the 4096-byte message(s) come in between. However, if the 4096-byte messages are sent at the same level, i.e. 'CRITICAL', the bundling algorithm will again cause the same overhead.

Also, the same will happen even with only one socket sending small messages at a rate close to the link transmit rate, so that, as soon as one message is bundled, it is transmitted. Then another message comes, a new bundle is created and so on... We will solve this issue radically in another patch.

Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Reported-by: Hoang Le <hoa...@de...>
Acked-by: Jon Maloy <jon...@er...>
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/link.c | 29 ++++++++++++++++++-----------
 net/tipc/msg.c  |  5 +----
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 6cc75ffd9e2c..999eab592de8 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -160,6 +160,7 @@ struct tipc_link {
 	struct {
 		u16 len;
 		u16 limit;
+		struct sk_buff *target_bskb;
 	} backlog[5];
 	u16 snd_nxt;
 	u16 window;
@@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l)
 void tipc_link_reset(struct tipc_link *l)
 {
 	struct sk_buff_head list;
+	u32 imp;
 
 	__skb_queue_head_init(&list);
 
@@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l)
 	__skb_queue_purge(&l->deferdq);
 	__skb_queue_purge(&l->backlogq);
 	__skb_queue_purge(&l->failover_deferdq);
-	l->backlog[TIPC_LOW_IMPORTANCE].len = 0;
-	l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0;
-	l->backlog[TIPC_HIGH_IMPORTANCE].len = 0;
-	l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0;
-	l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0;
+	for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) {
+		l->backlog[imp].len = 0;
+		l->backlog[imp].target_bskb = NULL;
+	}
 	kfree_skb(l->reasm_buf);
 	kfree_skb(l->reasm_tnlmsg);
 	kfree_skb(l->failover_reasm_skb);
@@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
 	u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
 	struct sk_buff_head *transmq = &l->transmq;
 	struct sk_buff_head *backlogq = &l->backlogq;
-	struct sk_buff *skb, *_skb, *bskb;
+	struct sk_buff *skb, *_skb, **tskb;
 	int pkt_cnt = skb_queue_len(list);
 	int rc = 0;
 
@@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
 			seqno++;
 			continue;
 		}
-		if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) {
+		tskb = &l->backlog[imp].target_bskb;
+		if (tipc_msg_bundle(*tskb, hdr, mtu)) {
 			kfree_skb(__skb_dequeue(list));
 			l->stats.sent_bundled++;
 			continue;
 		}
-		if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) {
+		if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) {
 			kfree_skb(__skb_dequeue(list));
-			__skb_queue_tail(backlogq, bskb);
-			l->backlog[msg_importance(buf_msg(bskb))].len++;
+			__skb_queue_tail(backlogq, *tskb);
+			l->backlog[imp].len++;
 			l->stats.sent_bundled++;
 			l->stats.sent_bundles++;
 			continue;
 		}
+		l->backlog[imp].target_bskb = NULL;
 		l->backlog[imp].len += skb_queue_len(list);
 		skb_queue_splice_tail_init(list, backlogq);
 	}
@@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
 	u16 seqno = l->snd_nxt;
 	u16 ack = l->rcv_nxt - 1;
 	u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1;
+	u32 imp;
 
 	while (skb_queue_len(&l->transmq) < l->window) {
 		skb = skb_peek(&l->backlogq);
@@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct tipc_link *l,
 			break;
 		__skb_dequeue(&l->backlogq);
 		hdr = buf_msg(skb);
-		l->backlog[msg_importance(hdr)].len--;
+		imp = msg_importance(hdr);
+		l->backlog[imp].len--;
+		if (unlikely(skb == l->backlog[imp].target_bskb))
+			l->backlog[imp].target_bskb = NULL;
 		__skb_queue_tail(&l->transmq, skb);
 		/* next retransmit attempt */
 		if (link_is_bc_sndlink(l))
diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index e6d49cdc61b4..922d262e153f 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -543,10 +543,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg,
 	bmsg = buf_msg(_skb);
 	tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0,
 		      INT_H_SIZE, dnode);
-	if (msg_isdata(msg))
-		msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE);
-	else
-		msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE);
+	msg_set_importance(bmsg, msg_importance(msg));
 	msg_set_seqno(bmsg, msg_seqno(msg));
 	msg_set_ack(bmsg, msg_ack(msg));
 	msg_set_bcast_ack(bmsg, msg_bcast_ack(msg));
-- 
2.13.7

From: Hoang Le <hoa...@de...> - 2019-10-02 10:25:37
Introduce traffic cross namespaces transmission as local node. By this way, throughput between nodes in namespace as fast as local. Testcase: $ip netns exec 1 benchmark_client -c 100 $ip netns exec 2 benchmark_server Before: +---------------------------------------------------------------------------------------------+ | Msg Size | # | # Msgs/ | Elapsed | Throughput | | [octets] | Conns | Conn | [ms] +------------------------------------------------+ | | | | | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] | +---------------------------------------------------------------------------------------------+ | 64 | 100 | 64000 | 13005 | 492103 | 251 | 2 | +---------------------------------------------------------------------------------------------+ | 256 | 100 | 32000 | 4964 | 644627 | 1320 | 13 | +---------------------------------------------------------------------------------------------+ | 1024 | 100 | 16000 | 4524 | 353612 | 2896 | 28 | +---------------------------------------------------------------------------------------------+ | 4096 | 100 | 8000 | 3675 | 217644 | 7131 | 71 | +---------------------------------------------------------------------------------------------+ | 16384 | 100 | 4000 | 7914 | 50540 | 6624 | 66 | +---------------------------------------------------------------------------------------------+ | 65536 | 100 | 2000 | 13000 | 15384 | 8065 | 80 | +---------------------------------------------------------------------------------------------+ After: +---------------------------------------------------------------------------------------------+ | Msg Size | # | # Msgs/ | Elapsed | Throughput | | [octets] | Conns | Conn | [ms] +------------------------------------------------+ | | | | | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] | +---------------------------------------------------------------------------------------------+ | 64 | 100 | 64000 | 7842 | 816090 | 417 | 4 | +---------------------------------------------------------------------------------------------+ | 256 | 100 | 32000 | 3593 | 890469 | 1823 | 18 | +---------------------------------------------------------------------------------------------+ | 1024 | 100 | 16000 | 1835 | 871828 | 7142 | 71 | +---------------------------------------------------------------------------------------------+ | 4096 | 100 | 8000 | 1134 | 704904 | 23098 | 230 | +---------------------------------------------------------------------------------------------+ | 16384 | 100 | 4000 | 878 | 455295 | 59676 | 596 | +---------------------------------------------------------------------------------------------+ | 65536 | 100 | 2000 | 1007 | 198487 | 104064 | 1040 | +---------------------------------------------------------------------------------------------+ Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/discover.c | 6 ++- net/tipc/msg.h | 10 +++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 94 +++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 4 +- net/tipc/socket.c | 6 +-- 6 files changed, 111 insertions(+), 11 deletions(-) diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..98d4eea97eb7 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -38,6 +38,8 @@ #include "node.h" #include "discover.h" +#include <net/netns/hash.h> + /* min delay during bearer start up */ #define TIPC_DISC_INIT msecs_to_jiffies(125) /* max delay if bearer has no links */ @@ -94,6 +96,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); 
msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + msg_set_peer_net_hash(hdr, net_hash_mix(net)); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, u8 peer_id[NODE_ID_LEN] = {0,}; u32 dst = msg_dest_domain(hdr); u32 net_id = msg_bc_netid(hdr); + u32 pnet_hash = msg_peer_net_hash(hdr); struct tipc_media_addr maddr; u32 src = msg_prevnode(hdr); u32 mtyp = msg_type(hdr); @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + pnet_hash, &maddr, &respond, &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..a8d0f28094f2 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct tipc_msg *m, u16 n) msg_set_bits(m, 9, 16, 0xffff, n); } +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 9, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 9); +} + /* Word 10 */ static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index c8f6177dd5a2..9a4ffd647701 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -45,6 +45,8 @@ #include "netlink.h" #include "trace.h" +#include <net/netns/hash.h> + #define INVALID_NODE_SIG 0x10000 #define NODE_CLEANUP_AFTER 300000 @@ -126,6 +128,7 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *pnet; }; /* Node FSM states and events: @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->pnet && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct tipc_node *n) } static struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 pnet_hash) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; + struct tipc_net *tn_peer; struct tipc_link *l; + struct net *tmp; int bearer_id; int i; @@ -400,6 +414,23 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, memcpy(&n->peer_id, peer_id, 16); n->net = net; 
n->capabilities = capabilities; + n->pnet = NULL; + for_each_net_rcu(tmp) { + /* Integrity checking whether node exists in namespace or not */ + if (net_hash_mix(tmp) != pnet_hash) + continue; + tn_peer = net_generic(tmp, tipc_net_id); + if (!tn_peer) + continue; + + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) + continue; + + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { + n->pnet = tmp; + break; + } + } kref_init(&n->kref); rwlock_init(&n->lock); INIT_HLIST_NODE(&n->hash); @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 pnet_hash, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1029,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + pnet_hash); if (!n) return; @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + spin_lock_init(&list->lock); + tipc_sk_rcv(pnet, list); + return; + } + if (msg_mcast(hdr)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: + case GROUP_PROTOCOL: + case CONN_MANAGER: + case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1514,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, return -EHOSTUNREACH; } + node_up = node_is_up(n); + if (node_up && n->pnet && check_net(n->pnet)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->pnet, list); + if (likely(skb_queue_empty(list))) { + tipc_node_put(n); + return 0; + } + } + tipc_node_read_lock(n); bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..11eb95ce358b 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 pnet_hash, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ 
void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
From: Tuong L. T. <tuo...@de...> - 2019-09-10 02:55:02
Hi Jon,

Agree, the patch doesn't show an improvement in throughput. However, I believe this is because our test tool, i.e. the benchmark, just runs tests with the same message size in each case. In practice, when applications send messages of different sizes at the same time, the new bundling strategy will take effect.

By the way, since we now limit exactly the number of packets in the backlog at each level, this obviously reduces the throughput of small messages (as bundles), but I wonder why we need to set the limits quite small? When trying a larger value, I have observed a significant improvement in throughput, for large messages as well. Our current approach at the link layer doesn't seem very good, as the limit is fixed without considering the actual number of users or connections...

BR/Tuong

-----Original Message-----
From: Jon Maloy <jon...@er...>
Sent: Saturday, August 31, 2019 3:57 AM
To: Tuong Tong Lien <tuo...@de...>; tip...@li...; ma...@do...; yin...@wi...
Subject: RE: [PATCH RFC 1/2] tipc: fix unlimited bundling of small messages

Hi Tuong,
Both patches are good with me. Unfortunately I could not register any measurable performance improvement, but I still think this is the right thing to do.

Acked-by: Jon

> -----Original Message-----
> From: Tuong Lien <tuo...@de...>
> Sent: 29-Aug-19 07:36
> To: tip...@li...; Jon Maloy <jon...@er...>; ma...@do...; yin...@wi...
> Subject: [PATCH RFC 1/2] tipc: fix unlimited bundling of small messages
>
> We have identified a problem with the "oversubscription" policy in the link transmission code.
>
> When small messages are transmitted, and the sending link has reached the transmit window limit, those messages will be bundled and put into the link backlog queue. However, bundles of data messages are counted at the 'CRITICAL' level, so that the counter for that level, instead of the counter for the real, bundled message's level is the one being increased. Subsequent, to-be-bundled data messages at non-CRITICAL levels continue to be tested against the unchanged counter for their own level, while contributing to an unrestrained increase at the CRITICAL backlog level.
>
> This leaves a gap in congestion control algorithm for small messages that can result in starvation for other users or a "real" CRITICAL user. Even that eventually can lead to buffer exhaustion & link reset.
>
> We fix this by keeping a 'target_bskb' buffer pointer at each levels, then when bundling, we only bundle messages at the same importance level only. This way, we know exactly how many slots a certain level have occupied in the queue, so can manage level congestion accurately.
>
> By bundling messages at the same level, we even have more benefits. Let consider this:
> - One socket sends 64-byte messages at the 'CRITICAL' level;
> - Another sends 4096-byte messages at the 'LOW' level;
>
> When a 64-byte message comes and is bundled the first time, we put the overhead of message bundle to it (+ 40-byte header, data copy, etc.) for later use, but the next message can be a 4096-byte one that cannot be bundled to the previous one. This means the last bundle carries only one payload message which is totally inefficient, as for the receiver also! Later on, another 64-byte message comes, now we make a new bundle and the same story repeats...
>
> With the new bundling algorithm, this will not happen, the 64-byte messages will be bundled together even when the 4096-byte message(s) comes in between.
However, if the 4096-byte messages are sent at the same level i.e. > 'CRITICAL', the bundling algorithm will again cause the same overhead. > > Also, the same will happen even with only one socket sending small messages > at a rate close to the link transmit's one, so that, when one message is > bundled, it's transmitted shortly. Then, another message comes, a new bundle > is created and so on... > > We will solve this issue radically by the 2nd patch of this series. > > Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link > congestion") > Reported-by: Hoang Le <hoa...@de...> > Acked-by: Jon Maloy <jon...@er...> > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 5 +---- > 2 files changed, 19 insertions(+), 15 deletions(-) > > diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ffd9e2c..999eab592de8 > 100644 > --- a/net/tipc/link.c > +++ b/net/tipc/link.c > @@ -160,6 +160,7 @@ struct tipc_link { > struct { > u16 len; > u16 limit; > + struct sk_buff *target_bskb; > } backlog[5]; > u16 snd_nxt; > u16 window; > @@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l) > void tipc_link_reset(struct tipc_link *l) { > struct sk_buff_head list; > + u32 imp; > > __skb_queue_head_init(&list); > > @@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l) > __skb_queue_purge(&l->deferdq); > __skb_queue_purge(&l->backlogq); > __skb_queue_purge(&l->failover_deferdq); > - l->backlog[TIPC_LOW_IMPORTANCE].len = 0; > - l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0; > - l->backlog[TIPC_HIGH_IMPORTANCE].len = 0; > - l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0; > - l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0; > + for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) { > + l->backlog[imp].len = 0; > + l->backlog[imp].target_bskb = NULL; > + } > kfree_skb(l->reasm_buf); > kfree_skb(l->reasm_tnlmsg); > kfree_skb(l->failover_reasm_skb); > @@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; > struct sk_buff_head *transmq = &l->transmq; > struct sk_buff_head *backlogq = &l->backlogq; > - struct sk_buff *skb, *_skb, *bskb; > + struct sk_buff *skb, *_skb, **tskb; > int pkt_cnt = skb_queue_len(list); > int rc = 0; > > @@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > seqno++; > continue; > } > - if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) { > + tskb = &l->backlog[imp].target_bskb; > + if (tipc_msg_bundle(*tskb, hdr, mtu)) { > kfree_skb(__skb_dequeue(list)); > l->stats.sent_bundled++; > continue; > } > - if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) { > + if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { > kfree_skb(__skb_dequeue(list)); > - __skb_queue_tail(backlogq, bskb); > - l->backlog[msg_importance(buf_msg(bskb))].len++; > + __skb_queue_tail(backlogq, *tskb); > + l->backlog[imp].len++; > l->stats.sent_bundled++; > l->stats.sent_bundles++; > continue; > } > + l->backlog[imp].target_bskb = NULL; > l->backlog[imp].len += skb_queue_len(list); > skb_queue_splice_tail_init(list, backlogq); > } > @@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct > tipc_link *l, > u16 seqno = l->snd_nxt; > u16 ack = l->rcv_nxt - 1; > u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; > + u32 imp; > > while (skb_queue_len(&l->transmq) < l->window) { > skb = skb_peek(&l->backlogq); > @@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct > tipc_link *l, > break; > 
__skb_dequeue(&l->backlogq); > hdr = buf_msg(skb); > - l->backlog[msg_importance(hdr)].len--; > + imp = msg_importance(hdr); > + l->backlog[imp].len--; > + if (unlikely(skb == l->backlog[imp].target_bskb)) > + l->backlog[imp].target_bskb = NULL; > __skb_queue_tail(&l->transmq, skb); > /* next retransmit attempt */ > if (link_is_bc_sndlink(l)) > diff --git a/net/tipc/msg.c b/net/tipc/msg.c index > e6d49cdc61b4..922d262e153f 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c > @@ -543,10 +543,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, > struct tipc_msg *msg, > bmsg = buf_msg(_skb); > tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, > INT_H_SIZE, dnode); > - if (msg_isdata(msg)) > - msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); > - else > - msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); > + msg_set_importance(bmsg, msg_importance(msg)); > msg_set_seqno(bmsg, msg_seqno(msg)); > msg_set_ack(bmsg, msg_ack(msg)); > msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); > -- > 2.13.7 |
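Regarding Tuong's question above about why the per-level backlog limits are quite small: the limits are derived from the configured link send window when the link is brought up, not from the number of users or connections. The sketch below shows roughly that derivation. The function name used here and the exact constants are illustrative assumptions, not taken from this thread; the authoritative code is tipc_link_set_queue_limits() in net/tipc/link.c.

	/* Illustrative sketch: per-importance backlog limits scale linearly
	 * with the link send window 'win', with small floor values.
	 * Constants are approximate and shown only to make the shape of the
	 * policy visible.
	 */
	static void sketch_set_queue_limits(struct tipc_link *l, u32 win)
	{
		l->backlog[TIPC_LOW_IMPORTANCE].limit      = max_t(u16, 50, win);
		l->backlog[TIPC_MEDIUM_IMPORTANCE].limit   = max_t(u16, 100, win * 2);
		l->backlog[TIPC_HIGH_IMPORTANCE].limit     = max_t(u16, 150, win * 3);
		l->backlog[TIPC_CRITICAL_IMPORTANCE].limit = max_t(u16, 200, win * 4);
		/* SYSTEM importance is sized separately, for bulk name-table
		 * distribution rather than for user data.
		 */
	}

Since the limits track the window, Tuong's observation that raising them improves throughput is expected; the trade-off is more data queued (and potentially going stale) per importance level.
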
From: David M. <da...@da...> - 2019-09-05 07:59:58
From: Xin Long <luc...@gm...>
Date: Tue, 3 Sep 2019 17:53:12 +0800

> Unlike kfree(p), kfree_rcu(p, rcu) won't do NULL pointer check. When
> tipc_nametbl_remove_publ returns NULL, the panic below happens:
 ...
> Fixes: 97ede29e80ee ("tipc: convert name table read-write lock to RCU")
> Reported-by: Li Shuang <sh...@re...>
> Signed-off-by: Xin Long <luc...@gm...>

Applied and queued up for -stable, thanks.

From: Xin L. <luc...@gm...> - 2019-09-03 09:53:28
Unlike kfree(p), kfree_rcu(p, rcu) won't do NULL pointer check. When tipc_nametbl_remove_publ returns NULL, the panic below happens:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
  RIP: 0010:__call_rcu+0x1d/0x290
  Call Trace:
   <IRQ>
   tipc_publ_notify+0xa9/0x170 [tipc]
   tipc_node_write_unlock+0x8d/0x100 [tipc]
   tipc_node_link_down+0xae/0x1d0 [tipc]
   tipc_node_check_dest+0x3ea/0x8f0 [tipc]
   ? tipc_disc_rcv+0x2c7/0x430 [tipc]
   tipc_disc_rcv+0x2c7/0x430 [tipc]
   ? tipc_rcv+0x6bb/0xf20 [tipc]
   tipc_rcv+0x6bb/0xf20 [tipc]
   ? ip_route_input_slow+0x9cf/0xb10
   tipc_udp_recv+0x195/0x1e0 [tipc]
   ? tipc_udp_is_known_peer+0x80/0x80 [tipc]
   udp_queue_rcv_skb+0x180/0x460
   udp_unicast_rcv_skb.isra.56+0x75/0x90
   __udp4_lib_rcv+0x4ce/0xb90
   ip_local_deliver_finish+0x11c/0x210
   ip_local_deliver+0x6b/0xe0
   ? ip_rcv_finish+0xa9/0x410
   ip_rcv+0x273/0x362

Fixes: 97ede29e80ee ("tipc: convert name table read-write lock to RCU")
Reported-by: Li Shuang <sh...@re...>
Signed-off-by: Xin Long <luc...@gm...>
---
 net/tipc/name_distr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 44abc8e..241ed22 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -223,7 +223,8 @@ static void tipc_publ_purge(struct net *net, struct publication *publ, u32 addr)
 			       publ->key);
 	}
 
-	kfree_rcu(p, rcu);
+	if (p)
+		kfree_rcu(p, rcu);
 }
 
 /**
-- 
2.1.0

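A note on the API difference this fix relies on: kfree(NULL) is a no-op, but, as the commit message states, kfree_rcu(p, rcu) performs no NULL check and ends up in __call_rcu() with a bad pointer, which is the oops shown above. Where this pattern recurs, a small NULL-safe wrapper along the lines below could be used; the macro name is purely hypothetical and is not part of the kernel or of this patch.

	#include <linux/rcupdate.h>

	/* Hypothetical NULL-safe wrapper around kfree_rcu(); illustration only.
	 * 'rcu_head_member' is the name of the struct rcu_head field inside
	 * the object being freed, exactly as for kfree_rcu() itself.
	 */
	#define kfree_rcu_safe(ptr, rcu_head_member)			\
		do {							\
			if (ptr)					\
				kfree_rcu(ptr, rcu_head_member);	\
		} while (0)
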
From: Jon M. <jon...@er...> - 2019-08-30 20:56:51
Hi Tuong, Both patches are good with me. Unfortunately I could not register any measurable performance improvement, but I still think this is the right thing to do. Acked-by: Jon > -----Original Message----- > From: Tuong Lien <tuo...@de...> > Sent: 29-Aug-19 07:36 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [PATCH RFC 1/2] tipc: fix unlimited bundling of small messages > > We have identified a problem with the "oversubscription" policy in the link > transmission code. > > When small messages are transmitted, and the sending link has reached the > transmit window limit, those messages will be bundled and put into the link > backlog queue. However, bundles of data messages are counted at the > 'CRITICAL' level, so that the counter for that level, instead of the counter for > the real, bundled message's level is the one being increased. > Subsequent, to-be-bundled data messages at non-CRITICAL levels continue to > be tested against the unchanged counter for their own level, while > contributing to an unrestrained increase at the CRITICAL backlog level. > > This leaves a gap in congestion control algorithm for small messages that can > result in starvation for other users or a "real" CRITICAL user. Even that > eventually can lead to buffer exhaustion & link reset. > > We fix this by keeping a 'target_bskb' buffer pointer at each levels, then when > bundling, we only bundle messages at the same importance level only. This > way, we know exactly how many slots a certain level have occupied in the > queue, so can manage level congestion accurately. > > By bundling messages at the same level, we even have more benefits. Let > consider this: > - One socket sends 64-byte messages at the 'CRITICAL' level; > - Another sends 4096-byte messages at the 'LOW' level; > > When a 64-byte message comes and is bundled the first time, we put the > overhead of message bundle to it (+ 40-byte header, data copy, etc.) for later > use, but the next message can be a 4096-byte one that cannot be bundled to > the previous one. This means the last bundle carries only one payload message > which is totally inefficient, as for the receiver also! Later on, another 64-byte > message comes, now we make a new bundle and the same story repeats... > > With the new bundling algorithm, this will not happen, the 64-byte messages > will be bundled together even when the 4096-byte message(s) comes in > between. However, if the 4096-byte messages are sent at the same level i.e. > 'CRITICAL', the bundling algorithm will again cause the same overhead. > > Also, the same will happen even with only one socket sending small messages > at a rate close to the link transmit's one, so that, when one message is > bundled, it's transmitted shortly. Then, another message comes, a new bundle > is created and so on... > > We will solve this issue radically by the 2nd patch of this series. 
> > Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link > congestion") > Reported-by: Hoang Le <hoa...@de...> > Acked-by: Jon Maloy <jon...@er...> > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 5 +---- > 2 files changed, 19 insertions(+), 15 deletions(-) > > diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ffd9e2c..999eab592de8 > 100644 > --- a/net/tipc/link.c > +++ b/net/tipc/link.c > @@ -160,6 +160,7 @@ struct tipc_link { > struct { > u16 len; > u16 limit; > + struct sk_buff *target_bskb; > } backlog[5]; > u16 snd_nxt; > u16 window; > @@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l) > void tipc_link_reset(struct tipc_link *l) { > struct sk_buff_head list; > + u32 imp; > > __skb_queue_head_init(&list); > > @@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l) > __skb_queue_purge(&l->deferdq); > __skb_queue_purge(&l->backlogq); > __skb_queue_purge(&l->failover_deferdq); > - l->backlog[TIPC_LOW_IMPORTANCE].len = 0; > - l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0; > - l->backlog[TIPC_HIGH_IMPORTANCE].len = 0; > - l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0; > - l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0; > + for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) { > + l->backlog[imp].len = 0; > + l->backlog[imp].target_bskb = NULL; > + } > kfree_skb(l->reasm_buf); > kfree_skb(l->reasm_tnlmsg); > kfree_skb(l->failover_reasm_skb); > @@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; > struct sk_buff_head *transmq = &l->transmq; > struct sk_buff_head *backlogq = &l->backlogq; > - struct sk_buff *skb, *_skb, *bskb; > + struct sk_buff *skb, *_skb, **tskb; > int pkt_cnt = skb_queue_len(list); > int rc = 0; > > @@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > seqno++; > continue; > } > - if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) { > + tskb = &l->backlog[imp].target_bskb; > + if (tipc_msg_bundle(*tskb, hdr, mtu)) { > kfree_skb(__skb_dequeue(list)); > l->stats.sent_bundled++; > continue; > } > - if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) { > + if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { > kfree_skb(__skb_dequeue(list)); > - __skb_queue_tail(backlogq, bskb); > - l->backlog[msg_importance(buf_msg(bskb))].len++; > + __skb_queue_tail(backlogq, *tskb); > + l->backlog[imp].len++; > l->stats.sent_bundled++; > l->stats.sent_bundles++; > continue; > } > + l->backlog[imp].target_bskb = NULL; > l->backlog[imp].len += skb_queue_len(list); > skb_queue_splice_tail_init(list, backlogq); > } > @@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct > tipc_link *l, > u16 seqno = l->snd_nxt; > u16 ack = l->rcv_nxt - 1; > u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; > + u32 imp; > > while (skb_queue_len(&l->transmq) < l->window) { > skb = skb_peek(&l->backlogq); > @@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct > tipc_link *l, > break; > __skb_dequeue(&l->backlogq); > hdr = buf_msg(skb); > - l->backlog[msg_importance(hdr)].len--; > + imp = msg_importance(hdr); > + l->backlog[imp].len--; > + if (unlikely(skb == l->backlog[imp].target_bskb)) > + l->backlog[imp].target_bskb = NULL; > __skb_queue_tail(&l->transmq, skb); > /* next retransmit attempt */ > if (link_is_bc_sndlink(l)) > diff --git a/net/tipc/msg.c b/net/tipc/msg.c index > e6d49cdc61b4..922d262e153f 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c 
> @@ -543,10 +543,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, > struct tipc_msg *msg, > bmsg = buf_msg(_skb); > tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, > INT_H_SIZE, dnode); > - if (msg_isdata(msg)) > - msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); > - else > - msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); > + msg_set_importance(bmsg, msg_importance(msg)); > msg_set_seqno(bmsg, msg_seqno(msg)); > msg_set_ack(bmsg, msg_ack(msg)); > msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); > -- > 2.13.7 |
From: Tuong L. <tuo...@de...> - 2019-08-29 11:36:35
|
As mentioned in the previous commit of this series, the current message bundling algorithm is inefficient that can generate bundles of only one payload message, that causes unnecessary overheads for both the sender and receiver. This commit re-designs the 'tipc_msg_make_bundle()' function (now named as 'tipc_msg_try_bundle()'), so that when a message comes at the first place, we will check if it is suitable for bundling or not. If yes, we keep a reference to it and do nothing i.e. the message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes, we will make a bundle with the first message if possible and so on... This way, a bundle if really needed will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need of bundling, so reduce the overheads to zero. Moreover, since now we have both the messages in hand, we can even optimize the 'tipc_msg_bundle()' function, make bundle of a very large (size ~ MSS) and small messages which is not with the current algorithm, e.g. [1400-byte message] + [10-byte message] (MTU = 1500). Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 45 ++++++++------- net/tipc/msg.c | 167 ++++++++++++++++++++++++++++++-------------------------- net/tipc/msg.h | 5 +- 3 files changed, 117 insertions(+), 100 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 999eab592de8..ff5980fc56dc 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -948,9 +948,10 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; struct sk_buff_head *transmq = &l->transmq; struct sk_buff_head *backlogq = &l->backlogq; - struct sk_buff *skb, *_skb, **tskb; + struct sk_buff *skb, *_skb; int pkt_cnt = skb_queue_len(list); int rc = 0; + u16 n; if (unlikely(msg_size(hdr) > mtu)) { pr_warn("Too large msg, purging xmit list %d %d %d %d %d!\n", @@ -975,20 +976,18 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, } /* Prepare each packet for sending, and add to relevant queue: */ - while (skb_queue_len(list)) { - skb = skb_peek(list); - hdr = buf_msg(skb); - msg_set_seqno(hdr, seqno); - msg_set_ack(hdr, ack); - msg_set_bcast_ack(hdr, bc_ack); - + while ((skb = __skb_dequeue(list))) { if (likely(skb_queue_len(transmq) < maxwin)) { + hdr = buf_msg(skb); + msg_set_seqno(hdr, seqno); + msg_set_ack(hdr, ack); + msg_set_bcast_ack(hdr, bc_ack); _skb = skb_clone(skb, GFP_ATOMIC); if (!_skb) { + kfree_skb(skb); __skb_queue_purge(list); return -ENOBUFS; } - __skb_dequeue(list); __skb_queue_tail(transmq, skb); /* next retransmit attempt */ if (link_is_bc_sndlink(l)) @@ -1000,22 +999,26 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, seqno++; continue; } - tskb = &l->backlog[imp].target_bskb; - if (tipc_msg_bundle(*tskb, hdr, mtu)) { - kfree_skb(__skb_dequeue(list)); - l->stats.sent_bundled++; - continue; - } - if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { - kfree_skb(__skb_dequeue(list)); - __skb_queue_tail(backlogq, *tskb); + n = tipc_msg_try_bundle(&l->backlog[imp].target_bskb, skb, + mtu - INT_H_SIZE, + l->addr); + switch (n) { + case 0: + break; + case 1: + __skb_queue_tail(backlogq, skb); l->backlog[imp].len++; - l->stats.sent_bundled++; + continue; + case 2: l->stats.sent_bundles++; + l->stats.sent_bundled++; + default: + kfree_skb(skb); + l->stats.sent_bundled++; continue; } - l->backlog[imp].target_bskb = NULL; - l->backlog[imp].len += skb_queue_len(list); + 
l->backlog[imp].len += (1 + skb_queue_len(list)); + __skb_queue_tail(backlogq, skb); skb_queue_splice_tail_init(list, backlogq); } l->snd_nxt = seqno; diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 922d262e153f..768e29d6599e 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -419,52 +419,110 @@ int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset, } /** - * tipc_msg_bundle(): Append contents of a buffer to tail of an existing one - * @skb: the buffer to append to ("bundle") - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer - * Consumes buffer if successful - * Returns true if bundling could be performed, otherwise false + * tipc_msg_bundle - Append contents of a buffer to tail of an existing one + * @bskb: the bundle buffer to append to + * @msg: message to be appended + * @max: max allowable size for the bundle buffer + * + * Returns "true" if bundling has been performed, otherwise "false" */ -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu) +static bool tipc_msg_bundle(struct sk_buff *bskb, struct tipc_msg *msg, + u32 max) { - struct tipc_msg *bmsg; - unsigned int bsz; - unsigned int msz = msg_size(msg); - u32 start, pad; - u32 max = mtu - INT_H_SIZE; + struct tipc_msg *bmsg = buf_msg(bskb); + u32 msz, bsz, offset, pad; - if (likely(msg_user(msg) == MSG_FRAGMENTER)) - return false; - if (!skb) - return false; - bmsg = buf_msg(skb); + msz = msg_size(msg); bsz = msg_size(bmsg); - start = align(bsz); - pad = start - bsz; + offset = align(bsz); + pad = offset - bsz; - if (unlikely(msg_user(msg) == TUNNEL_PROTOCOL)) - return false; - if (unlikely(msg_user(msg) == BCAST_PROTOCOL)) + if (unlikely(skb_tailroom(bskb) < (pad + msz))) return false; - if (unlikely(msg_user(bmsg) != MSG_BUNDLER)) - return false; - if (unlikely(skb_tailroom(skb) < (pad + msz))) - return false; - if (unlikely(max < (start + msz))) - return false; - if ((msg_importance(msg) < TIPC_SYSTEM_IMPORTANCE) && - (msg_importance(bmsg) == TIPC_SYSTEM_IMPORTANCE)) + if (unlikely(max < (offset + msz))) return false; - skb_put(skb, pad + msz); - skb_copy_to_linear_data_offset(skb, start, msg, msz); - msg_set_size(bmsg, start + msz); + skb_put(bskb, pad + msz); + skb_copy_to_linear_data_offset(bskb, offset, msg, msz); + msg_set_size(bmsg, offset + msz); msg_set_msgcnt(bmsg, msg_msgcnt(bmsg) + 1); return true; } /** + * tipc_msg_try_bundle - Try to bundle a new buffer to the last bundle or + * previously potential message if any + * @tskb: target buffer to bundle to (= a bundle or normal buffer or NULL) + * @skb: the new buffer to be bundled + * @mss: max message size (header inclusive) + * @dnode: destination node for the buffer + * + * Returns the number of bundled messages in the bundle if bundling has been + * performed (must be > 1). In case the new buffer is suitable for bundling + * but cannot be bundled this time, it becomes the new target for next time, + * returns 1! + * Otherwise, returns 0 shortly and the current target is flushed too. 
+ */ +u16 tipc_msg_try_bundle(struct sk_buff **tskb, struct sk_buff *skb, u32 mss, + u32 dnode) +{ + struct tipc_msg *msg = buf_msg(skb); + struct tipc_msg *bmsg, *omsg; + u32 bsz, msz = msg_size(msg); + + /* First, check if the new buffer is suitable for bundling */ + if (msg_user(msg) == MSG_FRAGMENTER) + goto flush; + if (msg_user(msg) == TUNNEL_PROTOCOL) + goto flush; + if (msg_user(msg) == BCAST_PROTOCOL) + goto flush; + if (mss <= INT_H_SIZE + msz) + goto flush; + + /* Ok, but target lost, make a new one */ + if (unlikely(!*tskb)) + goto new_target; + + bmsg = buf_msg(*tskb); + bsz = msg_size(bmsg); + + /* If target is a bundle already, try to bundle the new buffer to */ + if (msg_user(bmsg) == MSG_BUNDLER) + goto bundle; + + /* Target is not a bundle (i.e. the previously potential buffer), make + * a new bundle of the two buffers if possible + */ + if (unlikely(mss < align(INT_H_SIZE + bsz) + msz)) + goto new_target; + if (unlikely(pskb_expand_head(*tskb, INT_H_SIZE, mss - bsz, + GFP_ATOMIC))) + goto new_target; + + omsg = buf_msg(*tskb); + skb_push(*tskb, INT_H_SIZE); + bmsg = buf_msg(*tskb); + tipc_msg_init(msg_prevnode(omsg), bmsg, MSG_BUNDLER, 0, INT_H_SIZE, + dnode); + msg_set_importance(bmsg, msg_importance(omsg)); + msg_set_size(bmsg, INT_H_SIZE + bsz); + msg_set_msgcnt(bmsg, 1); + +bundle: + if (likely(tipc_msg_bundle(*tskb, msg, mss))) + return msg_msgcnt(bmsg); + +new_target: + *tskb = skb; + return 1; + +flush: + *tskb = NULL; + return 0; +} + +/** * tipc_msg_extract(): extract bundled inner packet from buffer * @skb: buffer to be extracted from. * @iskb: extracted inner buffer, to be returned @@ -510,49 +568,6 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos) } /** - * tipc_msg_make_bundle(): Create bundle buf and append message to its tail - * @list: the buffer chain, where head is the buffer to replace/append - * @skb: buffer to be created, appended to and returned in case of success - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer, inclusive header - * @dnode: destination node for message. 
(Not always present in header) - * Returns true if success, otherwise false - */ -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode) -{ - struct sk_buff *_skb; - struct tipc_msg *bmsg; - u32 msz = msg_size(msg); - u32 max = mtu - INT_H_SIZE; - - if (msg_user(msg) == MSG_FRAGMENTER) - return false; - if (msg_user(msg) == TUNNEL_PROTOCOL) - return false; - if (msg_user(msg) == BCAST_PROTOCOL) - return false; - if (msz > (max / 2)) - return false; - - _skb = tipc_buf_acquire(max, GFP_ATOMIC); - if (!_skb) - return false; - - skb_trim(_skb, INT_H_SIZE); - bmsg = buf_msg(_skb); - tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, - INT_H_SIZE, dnode); - msg_set_importance(bmsg, msg_importance(msg)); - msg_set_seqno(bmsg, msg_seqno(msg)); - msg_set_ack(bmsg, msg_ack(msg)); - msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); - tipc_msg_bundle(_skb, msg, mtu); - *skb = _skb; - return true; -} - -/** * tipc_msg_reverse(): swap source and destination addresses and add error code * @own_node: originating node id for reversed message * @skb: buffer containing message to be reversed; will be consumed diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..281fcda55311 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1057,9 +1057,8 @@ struct sk_buff *tipc_msg_create(uint user, uint type, uint hdr_sz, uint data_sz, u32 dnode, u32 onode, u32 dport, u32 oport, int errcode); int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf); -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu); -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode); +u16 tipc_msg_try_bundle(struct sk_buff **tskb, struct sk_buff *skb, u32 mss, + u32 dnode); bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos); int tipc_msg_fragment(struct sk_buff *skb, const struct tipc_msg *hdr, int pktmax, struct sk_buff_head *frags); -- 2.13.7 |
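The 0 / 1 / >1 return contract of tipc_msg_try_bundle() documented in the patch above can be condensed into a small stand-alone C model. Everything below (struct and function names, sizes) is invented for illustration, and it deliberately ignores skb ownership, header space and data copying; it only mirrors the semantics: 0 means the message is unsuitable and the current target is flushed, 1 means the message becomes the new bundling target, and a larger value is the number of messages now carried by the bundle.

/* Minimal stand-alone model of the 0 / 1 / >1 return contract; names,
 * sizes and the lack of any real data copying are simplifications. */
#include <stdio.h>
#include <stddef.h>

struct toy_msg { size_t size; int bundleable; };
struct toy_bundle { struct toy_msg *target; int msgcnt; size_t used; };

static int toy_try_bundle(struct toy_bundle *b, struct toy_msg *m, size_t mss)
{
	if (!m->bundleable || m->size >= mss) {	/* e.g. fragments, tunnel msgs */
		b->target = NULL;		/* flush the current target */
		return 0;
	}
	if (!b->target || b->used + m->size > mss) {
		b->target = m;			/* message becomes the new target */
		b->msgcnt = 1;
		b->used = m->size;
		return 1;
	}
	b->used += m->size;			/* append: bundle now has >= 2 msgs */
	return ++b->msgcnt;
}

int main(void)
{
	struct toy_bundle b = { NULL, 0, 0 };
	struct toy_msg m1 = { 100, 1 }, m2 = { 200, 1 };

	printf("%d\n", toy_try_bundle(&b, &m1, 1400));	/* 1: m1 kept as target */
	printf("%d\n", toy_try_bundle(&b, &m2, 1400));	/* 2: m2 bundled with m1 */
	return 0;
}

Run as-is it prints 1 and then 2: the first suitable message only becomes the target, and a real bundle is formed the moment a second one arrives, which is exactly the "a bundle always carries at least two payload messages" property the commit message argues for.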
From: Tuong L. <tuo...@de...> - 2019-08-29 11:36:30
|
We have identified a problem with the "oversubscription" policy in the link transmission code. When small messages are transmitted, and the sending link has reached the transmit window limit, those messages will be bundled and put into the link backlog queue. However, bundles of data messages are counted at the 'CRITICAL' level, so that the counter for that level, instead of the counter for the real, bundled message's level, is the one being increased. Subsequent to-be-bundled data messages at non-CRITICAL levels continue to be tested against the unchanged counter for their own level, while contributing to an unrestrained increase at the CRITICAL backlog level. This leaves a gap in the congestion control algorithm for small messages that can result in starvation of other users or of a "real" CRITICAL user, and can eventually lead to buffer exhaustion and link reset.

We fix this by keeping a 'target_bskb' buffer pointer per backlog level, and when bundling we only bundle messages of the same importance level. This way, we know exactly how many slots a certain level occupies in the queue, so we can manage per-level congestion accurately.

Bundling messages only at the same level brings further benefits. Consider this:
- One socket sends 64-byte messages at the 'CRITICAL' level;
- Another sends 4096-byte messages at the 'LOW' level;

When a 64-byte message comes in and is bundled for the first time, we pay the bundling overhead for it (an extra 40-byte header, data copy, etc.) in the hope of reusing the bundle later, but the next message may be a 4096-byte one that cannot be bundled with the previous one. The bundle then ends up carrying only one payload message, which is inefficient for the sender and the receiver alike. Later on, another 64-byte message comes in, a new bundle is made, and the same story repeats... With the new bundling algorithm this will not happen: the 64-byte messages will be bundled together even when 4096-byte messages arrive in between. However, if the 4096-byte messages are sent at the same level, i.e. 'CRITICAL', the bundling algorithm will still cause the same overhead. The same also happens with only one socket sending small messages at a rate close to the link transmit rate, so that as soon as a message is bundled it is transmitted shortly afterwards; then another message comes in, a new bundle is created, and so on... We will solve this issue more radically with the 2nd patch of this series. 
Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion") Reported-by: Hoang Le <hoa...@de...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 5 +---- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ffd9e2c..999eab592de8 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -160,6 +160,7 @@ struct tipc_link { struct { u16 len; u16 limit; + struct sk_buff *target_bskb; } backlog[5]; u16 snd_nxt; u16 window; @@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l) void tipc_link_reset(struct tipc_link *l) { struct sk_buff_head list; + u32 imp; __skb_queue_head_init(&list); @@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l) __skb_queue_purge(&l->deferdq); __skb_queue_purge(&l->backlogq); __skb_queue_purge(&l->failover_deferdq); - l->backlog[TIPC_LOW_IMPORTANCE].len = 0; - l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0; - l->backlog[TIPC_HIGH_IMPORTANCE].len = 0; - l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0; - l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0; + for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) { + l->backlog[imp].len = 0; + l->backlog[imp].target_bskb = NULL; + } kfree_skb(l->reasm_buf); kfree_skb(l->reasm_tnlmsg); kfree_skb(l->failover_reasm_skb); @@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; struct sk_buff_head *transmq = &l->transmq; struct sk_buff_head *backlogq = &l->backlogq; - struct sk_buff *skb, *_skb, *bskb; + struct sk_buff *skb, *_skb, **tskb; int pkt_cnt = skb_queue_len(list); int rc = 0; @@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, seqno++; continue; } - if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) { + tskb = &l->backlog[imp].target_bskb; + if (tipc_msg_bundle(*tskb, hdr, mtu)) { kfree_skb(__skb_dequeue(list)); l->stats.sent_bundled++; continue; } - if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) { + if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { kfree_skb(__skb_dequeue(list)); - __skb_queue_tail(backlogq, bskb); - l->backlog[msg_importance(buf_msg(bskb))].len++; + __skb_queue_tail(backlogq, *tskb); + l->backlog[imp].len++; l->stats.sent_bundled++; l->stats.sent_bundles++; continue; } + l->backlog[imp].target_bskb = NULL; l->backlog[imp].len += skb_queue_len(list); skb_queue_splice_tail_init(list, backlogq); } @@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l, u16 seqno = l->snd_nxt; u16 ack = l->rcv_nxt - 1; u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; + u32 imp; while (skb_queue_len(&l->transmq) < l->window) { skb = skb_peek(&l->backlogq); @@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct tipc_link *l, break; __skb_dequeue(&l->backlogq); hdr = buf_msg(skb); - l->backlog[msg_importance(hdr)].len--; + imp = msg_importance(hdr); + l->backlog[imp].len--; + if (unlikely(skb == l->backlog[imp].target_bskb)) + l->backlog[imp].target_bskb = NULL; __skb_queue_tail(&l->transmq, skb); /* next retransmit attempt */ if (link_is_bc_sndlink(l)) diff --git a/net/tipc/msg.c b/net/tipc/msg.c index e6d49cdc61b4..922d262e153f 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -543,10 +543,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, bmsg = buf_msg(_skb); tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, INT_H_SIZE, dnode); - if (msg_isdata(msg)) - 
msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); - else - msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); + msg_set_importance(bmsg, msg_importance(msg)); msg_set_seqno(bmsg, msg_seqno(msg)); msg_set_ack(bmsg, msg_ack(msg)); msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); -- 2.13.7 |
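To see concretely why counting bundles at the CRITICAL level defeats the per-level limits, here is a tiny stand-alone C model (levels, limits and counters are made-up names, not the kernel structures): it queues 100 bundles that each stand for ten small LOW-importance messages, and only the per-level accounting introduced by the patch makes the LOW backlog visible to its own limit check.

/* Toy model: 100 bundles, each carrying ten small LOW-importance messages.
 * Levels, limits and counters are illustrative, not the kernel structures. */
#include <stdio.h>

enum { TOY_LOW, TOY_MEDIUM, TOY_HIGH, TOY_CRITICAL, TOY_LEVELS };

int main(void)
{
	static const char * const name[TOY_LEVELS] = {
		"LOW", "MEDIUM", "HIGH", "CRITICAL"
	};
	int len_old[TOY_LEVELS] = { 0 };	/* bundles always counted at CRITICAL */
	int len_new[TOY_LEVELS] = { 0 };	/* bundles counted at their own level */
	int low_limit = 50;			/* illustrative backlog limit for LOW */
	int i;

	for (i = 0; i < 100; i++) {
		len_old[TOY_CRITICAL]++;	/* old: LOW's own counter never moves */
		len_new[TOY_LOW]++;		/* new: LOW congestion becomes visible */
	}

	for (i = 0; i < TOY_LEVELS; i++)
		printf("%-8s old=%-4d new=%-4d%s\n", name[i], len_old[i], len_new[i],
		       i == TOY_LOW && len_new[i] >= low_limit ? "  <- over LOW limit" : "");
	return 0;
}

With the old accounting the LOW counter stays at zero while CRITICAL climbs, so the congestion test against LOW's own limit never fires; with the per-level accounting the LOW limit is exceeded and the sender would be throttled.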
From: Tuong L. T. <tuo...@de...> - 2019-08-29 11:18:44
|
Hi Jon, Thanks, yes we don't need that if-clause anymore, I have updated it along with a commit message. Also, I will send you another patch as mentioned before. BR/Tuong -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Wednesday, August 28, 2019 9:29 PM To: Tuong Tong Lien <tuo...@de...>; ma...@do...; yin...@wi...; tip...@li... Subject: RE: [PATCH RFC] tipc: improve bundle algorithm > -----Original Message----- > From: Tuong Lien <tuo...@de...> > Sent: 26-Aug-19 07:46 > To: Jon Maloy <jon...@er...>; ma...@do...; > yin...@wi...; tip...@li... > Subject: [PATCH RFC] tipc: improve bundle algorithm > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 4 ++-- > 2 files changed, 20 insertions(+), 13 deletions(-) [...] > @@ -544,7 +544,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, > struct tipc_msg *msg, > tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, > INT_H_SIZE, dnode); > if (msg_isdata(msg)) > - msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); > + msg_set_importance(bmsg, msg_importance(msg)); > else > msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); The if-clause is not necessary any more. You can assign the importance of the inner message directly to the bundle. You can do that here, or inside the "tipc_make_bundle" branch of tipc_link_xmit(). Otherwise I think this is a smart, although not very elegant, solution to our problem. Maybe you could steal some of the log text from my first suggestion to this patch? I think that describes the problem well. Acked-by: Jon ///jon > msg_set_seqno(bmsg, msg_seqno(msg)); > -- > 2.13.7 |
From: Jon M. <jon...@er...> - 2019-08-28 14:28:54
|
> -----Original Message----- > From: Tuong Lien <tuo...@de...> > Sent: 26-Aug-19 07:46 > To: Jon Maloy <jon...@er...>; ma...@do...; > yin...@wi...; tip...@li... > Subject: [PATCH RFC] tipc: improve bundle algorithm > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 4 ++-- > 2 files changed, 20 insertions(+), 13 deletions(-) [...] > @@ -544,7 +544,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, > struct tipc_msg *msg, > tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, > INT_H_SIZE, dnode); > if (msg_isdata(msg)) > - msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); > + msg_set_importance(bmsg, msg_importance(msg)); > else > msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); The if-clause is not necessary any more. You can assign the importance of the inner message directly to the bundle. You can do that here, or inside the "tipc_make_bundle" branch of tipc_link_xmit(). Otherwise I think this is a smart, although not very elegant, solution to our problem. Maybe you could steal some of the log text from my first suggestion to this patch? I think that describes the problem well. Acked-by: Jon ///jon > msg_set_seqno(bmsg, msg_seqno(msg)); > -- > 2.13.7 |
From: Jon M. <jon...@er...> - 2019-08-26 14:49:44
|
> -----Original Message----- > From: Tuong Lien Tong <tuo...@de...> > Sent: 26-Aug-19 07:44 > To: Jon Maloy <jon...@er...>; 'Jon Maloy' > <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy > <moh...@er...>; > par...@gm...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; Gordan Mihaljevic > <gor...@de...>; yin...@wi...; tipc- > dis...@li... > Subject: RE: [net-next 1/1] tipc: reduce risk of wakeup queue starvation > > Hi Jon, > > Yes, you are right, my previous patch was not complete (sorry, I have not > verified it but just wanted to give a general idea...). Actually, we could still > preserve the necessary data/header of the original message for building the > wakeup message later as needed (look, just the message 'dport' > is enough). dport *and* the original importance. I was considering this myself, but I didn't like it. However, I don't really like this approach because there is still an > issue there (see below). > > Your patch can fix the bug I mentioned earlier (i.e. unlimited bundles of small > messages...), but looks like it has a side-effect. We may again encounter the > starvation issue that we have tried to overcome by the previous patches, that > is, a socket user with a certain importance level messages can make the others > starved, in this case it's the 'CRITICAL' > level? That is a side effect I was aware of, but I don't see it as a showstopper. Normally, only a small percentage of the sockets are using the CRITICAL level, and the backlog queue limit is very high for this level. So, if we ever run into this limit we can be pretty sure it is due to bundling, in which case I think it is only fair to throttle the lower levels too. A much worse side effect is that my patch would block even SYSTEM level messages, which of course is unacceptable. So my patch has to be improved in that respect. > With the additional condition at the link_xmit(), we will limit the other > level users (i.e. report link congestion & cause them to wait...) just due to the > congestion at the 'CRITICAL' level (i.e. not their own levels) of the backlog > queue. Even, a "true" CRITICAL user that wants to send a message will face the > issue because the bundles of small messages at lower levels occupy all the > 'CRITICAL' slots... > > Really, I don't understand the purpose we set the importance level of a bundle > to 'CRITICAL', which even gives more slots for the "less important" > users with small messages... We have to give it *some* importance, and since it very often contains CRITICAL messages it looks to me that is the correct level. This also means that we can bundle a lot more messages, even of lower importance, before having to put users to sleep. An alternative might be to give the bundle message the importance of the highest-importance message inside the bundle. That would be better than putting all at CRITICAL level, but doesn't really solve our problem. > Isn't it by dividing & increasing the backlog level > limits, we want to give more chances for higher level users in message > sending? I think we should improve the bundle algorithm a little bit to reflect > the backlog level usages accurately instead. I will send you another patch... I have received your patch, and will study it. 
///jon > > BR/Tuong > > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Saturday, August 24, 2019 9:29 PM > To: Jon Maloy <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy > <moh...@er...>; > par...@gm...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; Tuong Tong Lien > <tuo...@de...>; Gordan Mihaljevic > <gor...@de...>; yin...@wi...; tipc- > dis...@li... > Subject: RE: [net-next 1/1] tipc: reduce risk of wakeup queue starvation > > Hi Tuong, > While experimenting with byte-oriented flow control I realized that this is a > very real problem that has to be fixed. > I first tried your suggestion with putting the congestion test at the end of > tipc_link_xmit(), but realized that we need access to the original message > header when we are scheduling a user to the wakeup queue. But this header is > already gone if original the message was bundled and deleted! > Also, there is no more space in the CB area for storing the per-level counter in > the bundle packets, as was my first suggestion. > > So, this was the simplest solution I could come up with. It seems to work well, > but seems to give a slight performance degradation. I am afraid we will have > to accept that for now. > > Please give feedback. > > ///jon > > > > > -----Original Message----- > > From: Jon Maloy <jon...@er...> > > Sent: 24-Aug-19 10:19 > > To: Jon Maloy <jon...@er...>; Jon Maloy > <ma...@do...> > > Cc: Mohan Krishna Ghanta Krishnamurthy > > <moh...@er...>; > > par...@gm...; Tung Quang Nguyen > > <tun...@de...>; Hoang Huu Le > > <hoa...@de...>; Tuong Tong Lien > > <tuo...@de...>; Gordan Mihaljevic > > <gor...@de...>; yin...@wi...; tipc- > > dis...@li... > > Subject: [net-next 1/1] tipc: reduce risk of wakeup queue starvation > > > > We have identified a problem with the "oversubscription" policy in the > link > > transmission code. > > > > When small messages are transmitted, and the sending link has reached > > the transmit window limit, those messages will be bundled and put into > > the > link > > backlog queue. However, bundles of data messages are counted at the > > 'CRITICAL' level, so that the counter for that level, instead of the > counter for > > the real, bundled message's level is the one being increased. > > Subsequent, to-be-bundled data messagea at non-CRITICAL levels > > continue to be tested against the unchanged counter for their own > > level, while contributing to an unrestrained increase at the CRITICAL backlog > level. > > > > This leaves a gap in congestion control algorithm for small messages, > > and > may > > eventually lead to buffer exhaustion and link reset. > > > > We fix this by adding a test for congestion at the CRITICAL level, as > > well > as the > > existing testing for the message's own level, whenever a message is > > transmitted. We also refuse to notify any waiting users as long as > congestion at > > the CRITICAL level exists. > > > > Reported-by: Tuong Lien <tuo...@de...> > > Signed-off-by: Jon Maloy <jon...@er...> > > --- > > net/tipc/link.c | 18 +++++++++++++----- > > 1 file changed, 13 insertions(+), 5 deletions(-) > > > > diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ff..25a6acb > 100644 > > --- a/net/tipc/link.c > > +++ b/net/tipc/link.c > > @@ -77,6 +77,11 @@ struct tipc_stats { > > u32 msg_length_profile[7]; /* used for msg. 
length profiling */ }; > > > > +struct tipc_backlog { > > + u16 len; > > + u16 limit; > > +}; > > + > > /** > > * struct tipc_link - TIPC link data structure > > * @addr: network address of link's peer node @@ -157,10 +162,7 @@ > > struct tipc_link { > > /* Sending */ > > struct sk_buff_head transmq; > > struct sk_buff_head backlogq; > > - struct { > > - u16 len; > > - u16 limit; > > - } backlog[5]; > > + struct tipc_backlog backlog[5]; > > u16 snd_nxt; > > u16 window; > > > > @@ -862,6 +864,9 @@ static void link_prepare_wakeup(struct tipc_link *l) > > for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) > > avail[imp] = l->backlog[imp].limit - l->backlog[imp].len; > > > > + if (avail[TIPC_CRITICAL_IMPORTANCE] <= 0) > > + return; > > + > > skb_queue_walk_safe(wakeupq, skb, tmp) { > > imp = TIPC_SKB_CB(skb)->chain_imp; > > if (avail[imp] <= 0) > > @@ -949,6 +954,7 @@ int tipc_link_xmit(struct tipc_link *l, struct > > sk_buff_head *list, > > struct sk_buff_head *backlogq = &l->backlogq; > > struct sk_buff *skb, *_skb, *bskb; > > int pkt_cnt = skb_queue_len(list); > > + struct tipc_backlog *bklog = l->backlog; > > int rc = 0; > > > > if (unlikely(msg_size(hdr) > mtu)) { @@ -960,7 +966,9 @@ int > > tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, > > } > > > > /* Allow oversubscription of one data msg per source at congestion > */ > > - if (unlikely(l->backlog[imp].len >= l->backlog[imp].limit)) { > > + if (bklog[TIPC_CRITICAL_IMPORTANCE].len >= > > + bklog[TIPC_CRITICAL_IMPORTANCE].limit || > > + bklog[imp].len >= bklog[imp].limit) { > > if (imp == TIPC_SYSTEM_IMPORTANCE) { > > pr_warn("%s<%s>, link overflow", link_rst_msg, > l->name); > > return -ENOBUFS; > > -- > > 2.1.4 > |
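The alternative mentioned in the reply above, letting a bundle inherit the highest importance found among its member messages, is easy to sketch in isolation. The helper and type names below are hypothetical, not TIPC code:

/* Hypothetical helper: the bundle's importance becomes the maximum importance
 * of the messages placed into it. Not TIPC code, just the idea in isolation. */
#include <stdio.h>

enum { IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL };

struct mini_bundle { int importance; int msgcnt; };

static void mini_bundle_add(struct mini_bundle *b, int msg_importance)
{
	if (b->msgcnt == 0 || msg_importance > b->importance)
		b->importance = msg_importance;
	b->msgcnt++;
}

int main(void)
{
	struct mini_bundle b = { 0, 0 };

	mini_bundle_add(&b, IMP_LOW);
	mini_bundle_add(&b, IMP_HIGH);
	printf("importance=%d msgcnt=%d\n", b.importance, b.msgcnt);	/* 2 2 */
	return 0;
}

As the reply notes, this is milder than forcing every bundle to CRITICAL, but a mixed-level bundle is still charged against a single level, so the per-level counters remain approximate.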
From: Tuong L. <tuo...@de...> - 2019-08-26 11:46:18
|
Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 29 ++++++++++++++++++----------- net/tipc/msg.c | 4 ++-- 2 files changed, 20 insertions(+), 13 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ffd9e2c..22ba0c8af3a8 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -160,6 +160,7 @@ struct tipc_link { struct { u16 len; u16 limit; + struct sk_buff *tskb; } backlog[5]; u16 snd_nxt; u16 window; @@ -880,6 +881,7 @@ static void link_prepare_wakeup(struct tipc_link *l) void tipc_link_reset(struct tipc_link *l) { struct sk_buff_head list; + u32 imp; __skb_queue_head_init(&list); @@ -901,11 +903,10 @@ void tipc_link_reset(struct tipc_link *l) __skb_queue_purge(&l->deferdq); __skb_queue_purge(&l->backlogq); __skb_queue_purge(&l->failover_deferdq); - l->backlog[TIPC_LOW_IMPORTANCE].len = 0; - l->backlog[TIPC_MEDIUM_IMPORTANCE].len = 0; - l->backlog[TIPC_HIGH_IMPORTANCE].len = 0; - l->backlog[TIPC_CRITICAL_IMPORTANCE].len = 0; - l->backlog[TIPC_SYSTEM_IMPORTANCE].len = 0; + for (imp = 0; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) { + l->backlog[imp].len = 0; + l->backlog[imp].tskb = NULL; + } kfree_skb(l->reasm_buf); kfree_skb(l->reasm_tnlmsg); kfree_skb(l->failover_reasm_skb); @@ -947,7 +948,7 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; struct sk_buff_head *transmq = &l->transmq; struct sk_buff_head *backlogq = &l->backlogq; - struct sk_buff *skb, *_skb, *bskb; + struct sk_buff *skb, *_skb, **tskb; int pkt_cnt = skb_queue_len(list); int rc = 0; @@ -999,19 +1000,21 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, seqno++; continue; } - if (tipc_msg_bundle(skb_peek_tail(backlogq), hdr, mtu)) { + tskb = &l->backlog[imp].tskb; + if (tipc_msg_bundle(*tskb, hdr, mtu)) { kfree_skb(__skb_dequeue(list)); l->stats.sent_bundled++; continue; } - if (tipc_msg_make_bundle(&bskb, hdr, mtu, l->addr)) { + if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { kfree_skb(__skb_dequeue(list)); - __skb_queue_tail(backlogq, bskb); - l->backlog[msg_importance(buf_msg(bskb))].len++; + __skb_queue_tail(backlogq, *tskb); + l->backlog[imp].len++; l->stats.sent_bundled++; l->stats.sent_bundles++; continue; } + l->backlog[imp].tskb = NULL; l->backlog[imp].len += skb_queue_len(list); skb_queue_splice_tail_init(list, backlogq); } @@ -1027,6 +1030,7 @@ static void tipc_link_advance_backlog(struct tipc_link *l, u16 seqno = l->snd_nxt; u16 ack = l->rcv_nxt - 1; u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; + u32 imp; while (skb_queue_len(&l->transmq) < l->window) { skb = skb_peek(&l->backlogq); @@ -1037,7 +1041,10 @@ static void tipc_link_advance_backlog(struct tipc_link *l, break; __skb_dequeue(&l->backlogq); hdr = buf_msg(skb); - l->backlog[msg_importance(hdr)].len--; + imp = msg_importance(hdr); + l->backlog[imp].len--; + if (unlikely(skb == l->backlog[imp].tskb)) + l->backlog[imp].tskb = NULL; __skb_queue_tail(&l->transmq, skb); /* next retransmit attempt */ if (link_is_bc_sndlink(l)) diff --git a/net/tipc/msg.c b/net/tipc/msg.c index e6d49cdc61b4..25c884cc8b88 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -518,7 +518,7 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos) * @dnode: destination node for message. 
(Not always present in header) * Returns true if success, otherwise false */ -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, +bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, u32 mtu, u32 dnode) { struct sk_buff *_skb; @@ -544,7 +544,7 @@ bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, INT_H_SIZE, dnode); if (msg_isdata(msg)) - msg_set_importance(bmsg, TIPC_CRITICAL_IMPORTANCE); + msg_set_importance(bmsg, msg_importance(msg)); else msg_set_importance(bmsg, TIPC_SYSTEM_IMPORTANCE); msg_set_seqno(bmsg, msg_seqno(msg)); -- 2.13.7 |
From: Tuong L. T. <tuo...@de...> - 2019-08-26 11:44:17
|
Hi Jon, Yes, you are right, my previous patch was not complete (sorry, I have not verified it but just wanted to give a general idea...). Actually, we could still preserve the necessary data/header of the original message for building the wakeup message later as needed (look, just the message 'dport' is enough). However, I don't really like this approach because there is still an issue there (see below). Your patch can fix the bug I mentioned earlier (i.e. unlimited bundles of small messages...), but looks like it has a side-effect. We may again encounter the starvation issue that we have tried to overcome by the previous patches, that is, a socket user with a certain importance level messages can make the others starved, in this case it's the 'CRITICAL' level? With the additional condition at the link_xmit(), we will limit the other level users (i.e. report link congestion & cause them to wait...) just due to the congestion at the 'CRITICAL' level (i.e. not their own levels) of the backlog queue. Even, a "true" CRITICAL user that wants to send a message will face the issue because the bundles of small messages at lower levels occupy all the 'CRITICAL' slots... Really, I don't understand the purpose we set the importance level of a bundle to 'CRITICAL', which even gives more slots for the "less important" users with small messages... Isn't it by dividing & increasing the backlog level limits, we want to give more chances for higher level users in message sending? I think we should improve the bundle algorithm a little bit to reflect the backlog level usages accurately instead. I will send you another patch... BR/Tuong -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Saturday, August 24, 2019 9:29 PM To: Jon Maloy <ma...@do...> Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang Huu Le <hoa...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan Mihaljevic <gor...@de...>; yin...@wi...; tip...@li... Subject: RE: [net-next 1/1] tipc: reduce risk of wakeup queue starvation Hi Tuong, While experimenting with byte-oriented flow control I realized that this is a very real problem that has to be fixed. I first tried your suggestion with putting the congestion test at the end of tipc_link_xmit(), but realized that we need access to the original message header when we are scheduling a user to the wakeup queue. But this header is already gone if original the message was bundled and deleted! Also, there is no more space in the CB area for storing the per-level counter in the bundle packets, as was my first suggestion. So, this was the simplest solution I could come up with. It seems to work well, but seems to give a slight performance degradation. I am afraid we will have to accept that for now. Please give feedback. ///jon > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: 24-Aug-19 10:19 > To: Jon Maloy <jon...@er...>; Jon Maloy > <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy > <moh...@er...>; > par...@gm...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; Tuong Tong Lien > <tuo...@de...>; Gordan Mihaljevic > <gor...@de...>; yin...@wi...; tipc- > dis...@li... > Subject: [net-next 1/1] tipc: reduce risk of wakeup queue starvation > > We have identified a problem with the "oversubscription" policy in the link > transmission code. 
> > When small messages are transmitted, and the sending link has reached the > transmit window limit, those messages will be bundled and put into the link > backlog queue. However, bundles of data messages are counted at the > 'CRITICAL' level, so that the counter for that level, instead of the counter for > the real, bundled message's level is the one being increased. > Subsequent, to-be-bundled data messagea at non-CRITICAL levels continue to > be tested against the unchanged counter for their own level, while > contributing to an unrestrained increase at the CRITICAL backlog level. > > This leaves a gap in congestion control algorithm for small messages, and may > eventually lead to buffer exhaustion and link reset. > > We fix this by adding a test for congestion at the CRITICAL level, as well as the > existing testing for the message's own level, whenever a message is > transmitted. We also refuse to notify any waiting users as long as congestion at > the CRITICAL level exists. > > Reported-by: Tuong Lien <tuo...@de...> > Signed-off-by: Jon Maloy <jon...@er...> > --- > net/tipc/link.c | 18 +++++++++++++----- > 1 file changed, 13 insertions(+), 5 deletions(-) > > diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ff..25a6acb 100644 > --- a/net/tipc/link.c > +++ b/net/tipc/link.c > @@ -77,6 +77,11 @@ struct tipc_stats { > u32 msg_length_profile[7]; /* used for msg. length profiling */ }; > > +struct tipc_backlog { > + u16 len; > + u16 limit; > +}; > + > /** > * struct tipc_link - TIPC link data structure > * @addr: network address of link's peer node @@ -157,10 +162,7 @@ > struct tipc_link { > /* Sending */ > struct sk_buff_head transmq; > struct sk_buff_head backlogq; > - struct { > - u16 len; > - u16 limit; > - } backlog[5]; > + struct tipc_backlog backlog[5]; > u16 snd_nxt; > u16 window; > > @@ -862,6 +864,9 @@ static void link_prepare_wakeup(struct tipc_link *l) > for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) > avail[imp] = l->backlog[imp].limit - l->backlog[imp].len; > > + if (avail[TIPC_CRITICAL_IMPORTANCE] <= 0) > + return; > + > skb_queue_walk_safe(wakeupq, skb, tmp) { > imp = TIPC_SKB_CB(skb)->chain_imp; > if (avail[imp] <= 0) > @@ -949,6 +954,7 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > struct sk_buff_head *backlogq = &l->backlogq; > struct sk_buff *skb, *_skb, *bskb; > int pkt_cnt = skb_queue_len(list); > + struct tipc_backlog *bklog = l->backlog; > int rc = 0; > > if (unlikely(msg_size(hdr) > mtu)) { > @@ -960,7 +966,9 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > } > > /* Allow oversubscription of one data msg per source at congestion */ > - if (unlikely(l->backlog[imp].len >= l->backlog[imp].limit)) { > + if (bklog[TIPC_CRITICAL_IMPORTANCE].len >= > + bklog[TIPC_CRITICAL_IMPORTANCE].limit || > + bklog[imp].len >= bklog[imp].limit) { > if (imp == TIPC_SYSTEM_IMPORTANCE) { > pr_warn("%s<%s>, link overflow", link_rst_msg, l->name); > return -ENOBUFS; > -- > 2.1.4 |
From: Jon M. <jon...@er...> - 2019-08-24 14:44:40
|
Hi Tuong, While experimenting with byte-oriented flow control I realized that this is a very real problem that has to be fixed. I first tried your suggestion with putting the congestion test at the end of tipc_link_xmit(), but realized that we need access to the original message header when we are scheduling a user to the wakeup queue. But this header is already gone if original the message was bundled and deleted! Also, there is no more space in the CB area for storing the per-level counter in the bundle packets, as was my first suggestion. So, this was the simplest solution I could come up with. It seems to work well, but seems to give a slight performance degradation. I am afraid we will have to accept that for now. Please give feedback. ///jon > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: 24-Aug-19 10:19 > To: Jon Maloy <jon...@er...>; Jon Maloy > <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy > <moh...@er...>; > par...@gm...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; Tuong Tong Lien > <tuo...@de...>; Gordan Mihaljevic > <gor...@de...>; yin...@wi...; tipc- > dis...@li... > Subject: [net-next 1/1] tipc: reduce risk of wakeup queue starvation > > We have identified a problem with the "oversubscription" policy in the link > transmission code. > > When small messages are transmitted, and the sending link has reached the > transmit window limit, those messages will be bundled and put into the link > backlog queue. However, bundles of data messages are counted at the > 'CRITICAL' level, so that the counter for that level, instead of the counter for > the real, bundled message's level is the one being increased. > Subsequent, to-be-bundled data messagea at non-CRITICAL levels continue to > be tested against the unchanged counter for their own level, while > contributing to an unrestrained increase at the CRITICAL backlog level. > > This leaves a gap in congestion control algorithm for small messages, and may > eventually lead to buffer exhaustion and link reset. > > We fix this by adding a test for congestion at the CRITICAL level, as well as the > existing testing for the message's own level, whenever a message is > transmitted. We also refuse to notify any waiting users as long as congestion at > the CRITICAL level exists. > > Reported-by: Tuong Lien <tuo...@de...> > Signed-off-by: Jon Maloy <jon...@er...> > --- > net/tipc/link.c | 18 +++++++++++++----- > 1 file changed, 13 insertions(+), 5 deletions(-) > > diff --git a/net/tipc/link.c b/net/tipc/link.c index 6cc75ff..25a6acb 100644 > --- a/net/tipc/link.c > +++ b/net/tipc/link.c > @@ -77,6 +77,11 @@ struct tipc_stats { > u32 msg_length_profile[7]; /* used for msg. 
length profiling */ }; > > +struct tipc_backlog { > + u16 len; > + u16 limit; > +}; > + > /** > * struct tipc_link - TIPC link data structure > * @addr: network address of link's peer node @@ -157,10 +162,7 @@ > struct tipc_link { > /* Sending */ > struct sk_buff_head transmq; > struct sk_buff_head backlogq; > - struct { > - u16 len; > - u16 limit; > - } backlog[5]; > + struct tipc_backlog backlog[5]; > u16 snd_nxt; > u16 window; > > @@ -862,6 +864,9 @@ static void link_prepare_wakeup(struct tipc_link *l) > for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++) > avail[imp] = l->backlog[imp].limit - l->backlog[imp].len; > > + if (avail[TIPC_CRITICAL_IMPORTANCE] <= 0) > + return; > + > skb_queue_walk_safe(wakeupq, skb, tmp) { > imp = TIPC_SKB_CB(skb)->chain_imp; > if (avail[imp] <= 0) > @@ -949,6 +954,7 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > struct sk_buff_head *backlogq = &l->backlogq; > struct sk_buff *skb, *_skb, *bskb; > int pkt_cnt = skb_queue_len(list); > + struct tipc_backlog *bklog = l->backlog; > int rc = 0; > > if (unlikely(msg_size(hdr) > mtu)) { > @@ -960,7 +966,9 @@ int tipc_link_xmit(struct tipc_link *l, struct > sk_buff_head *list, > } > > /* Allow oversubscription of one data msg per source at congestion */ > - if (unlikely(l->backlog[imp].len >= l->backlog[imp].limit)) { > + if (bklog[TIPC_CRITICAL_IMPORTANCE].len >= > + bklog[TIPC_CRITICAL_IMPORTANCE].limit || > + bklog[imp].len >= bklog[imp].limit) { > if (imp == TIPC_SYSTEM_IMPORTANCE) { > pr_warn("%s<%s>, link overflow", link_rst_msg, l->name); > return -ENOBUFS; > -- > 2.1.4 |
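A stand-alone C sketch of the wakeup policy proposed in the patch above (limits, queue contents and names are invented): available slots are computed per importance level, and the extra test holds back all wakeups while the CRITICAL level is congested, which is also the behaviour the replies above question as a possible starvation source.

/* Toy model of the proposed wakeup policy; all numbers and names are made up. */
#include <stdio.h>

enum { LVL_LOW, LVL_MEDIUM, LVL_HIGH, LVL_CRITICAL, LVL_SYSTEM, LVL_CNT };

struct waiter { int level; const char *who; };

int main(void)
{
	int limit[LVL_CNT] = { 50, 100, 150, 600, 300 };	/* illustrative limits */
	int len[LVL_CNT]   = { 10,  20,  30, 600,   0 };	/* CRITICAL level is full */
	struct waiter wq[] = { { LVL_LOW, "low-importance sender" },
			       { LVL_CRITICAL, "critical sender" } };
	int avail[LVL_CNT];
	unsigned int i;

	for (i = 0; i < LVL_CNT; i++)
		avail[i] = limit[i] - len[i];

	/* The extra test from the patch: hold all wakeups while CRITICAL is congested */
	if (avail[LVL_CRITICAL] <= 0) {
		printf("CRITICAL congested: wake nobody this round\n");
		return 0;
	}

	/* Otherwise wake only senders whose own level still has room */
	for (i = 0; i < sizeof(wq) / sizeof(wq[0]); i++) {
		if (avail[wq[i].level] <= 0)
			continue;
		avail[wq[i].level]--;
		printf("wake %s\n", wq[i].who);
	}
	return 0;
}

With the illustrative numbers above nothing is woken at all, including the low-importance sender whose own level has plenty of room; that is the trade-off being debated in this thread.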
From: Andy S. <And...@in...> - 2019-08-20 21:42:02
|
VMware VM running on RHEL7.6 encountered a kernel panic when executing the below command: /usr/sbin/tipc-config -ls=1.1.10:eth0-1.1.1:eth0 Has anybody encountered this issue? Any kind of assistance would be appreciated. Please find below the version details for TIPC and RHEL of the Guest OS along with the stack trace. ==================== TIPC Version Details ==================== TIPC configuration tool - 2.0.6 TIPC Kernel Module - 2.0.0 TIPC Module vermagic - 3.10.0-862.14.4.el7.x86_64 SMP mod_unload modversions ====================== VM RHEL Release info ====================== Linux cm.mplab.ics.com 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux ============= Stack Trace ============= crash> sy CPUS: 8 [OFFLINE: 7] DATE: Sun Aug 18 20:15:31 2019 LOAD AVERAGE: 0.02, 0.16, 0.25 NODENAME: cm.mplab.ics.com RELEASE: 3.10.0-957.el7.x86_64 PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000068" crash> bt PID: 95723 TASK: ffff9fac8f728000 CPU: 4 COMMAND: "tipc-config" #0 [ffff9fad9af83808] panic at ffffffffa195b5ff #1 [ffff9fad9af83888] oops_end at ffffffffa196c783 #2 [ffff9fad9af838b0] no_context at ffffffffa195aa7e #3 [ffff9fad9af83900] __bad_area_nosemaphore at ffffffffa195ab15 #4 [ffff9fad9af83950] bad_area_nosemaphore at ffffffffa195ac86 #5 [ffff9fad9af83960] __do_page_fault at ffffffffa196f6b0 #6 [ffff9fad9af839d0] do_page_fault at ffffffffa196f915 #7 [ffff9fad9af83a00] page_fault at ffffffffa196b758 [exception RIP: tipc_cfg_do_cmd+0x2b0] RIP: ffffffffc04db5f0 RSP: ffff9fad9af83ab8 RFLAGS: 00010292 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000004e61450 RDX: 0000000000000000 RSI: ffff9fad97673700 RDI: 0000000000000000 RBP: ffff9fad9af83af0 R8: 000000000001f120 R9: ffffffffa1824371 R10: ffff9fad9c71f120 R11: ffffdb4ae05d9c00 R12: 000000000100100a R13: ffff9fad87f11c1c R14: 0000000000000040 R15: 000000000000001c ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff9fad9af83af8] handle_cmd at ffffffffc04e46ae [tipc] #9 [ffff9fad9af83b20] genl_family_rcv_msg at ffffffffa1877898 #10 [ffff9fad9af83be8] genl_rcv_msg at ffffffffa1877b1b #11 [ffff9fad9af83c18] netlink_rcv_skb at ffffffffa1875b2b #12 [ffff9fad9af83c40] genl_rcv at ffffffffa1876068 #13 [ffff9fad9af83c58] netlink_unicast at ffffffffa18754b0 #14 [ffff9fad9af83ca0] netlink_sendmsg at ffffffffa1875858 #15 [ffff9fad9af83d28] sock_aio_write at ffffffffa181857d #16 [ffff9fad9af83df0] do_sync_write at ffffffffa14405b3 #17 [ffff9fad9af83ec8] vfs_write at ffffffffa14411a5 #18 [ffff9fad9af83f08] sys_write at ffffffffa1441ebf #19 [ffff9fad9af83f50] system_call_fastpath at ffffffffa1974ddb RIP: 00007fd854967fd0 RSP: 00007ffd425410e0 RFLAGS: 00010206 RAX: 0000000000000001 RBX: 000000000000005c RCX: 000000000259e080 RDX: 000000000000005c RSI: 000000000259e010 RDI: 0000000000000003 <--- RBP: 000000000259e010 R8: 0000000000000004 R9: 0000000000008104 R10: 00007ffd4254120c R11: 0000000000000246 R12: 000000000000005c R13: 0000000000000003 R14: 000000000259e010 R15: 0000000000000008 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b crash> pd 0x3 $1 = 3 crash> net -s PID: 95723 TASK: ffff9fac8f728000 CPU: 4 COMMAND: "tipc-config" FD SOCKET SOCK FAMILY:TYPE SOURCE-PORT DESTINATION-PORT 3 ffff9fad77000a00 ffff9fad97b37800 NETLINK/ROUTE:DGRAM crash> socket.state,type ffff9fad77000a00 state = SS_UNCONNECTED type = 0x2 #define AF_INET 2 /* Internet IP Protocol */ crash> ps -a ffff9fac8f728000 PID: 95723 TASK: ffff9fac8f728000 CPU: 4 COMMAND: "tipc-config" ARG: /usr/sbin/tipc-config 
-ls=1.1.10:eth0-1.1.1:eth0 ENV: XDG_SESSION_ID=46274 MAILTO= SHELL=/bin/sh USER=mpug PATH=/bin:/usr/bin PWD=/home/mpug LANG=en_US.UTF-8 HOME=/home/mpug SHLVL=2 LOGNAME=mpug XDG_RUNTIME_DIR=/run/user/29702 _=/usr/sbin/tipc-config |
From: David M. <da...@da...> - 2019-08-18 21:01:34
|
From: Jon Maloy <jon...@er...> Date: Thu, 15 Aug 2019 16:42:50 +0200 > The policy for handling the skb list locks on the send and receive paths > is simple. > > - On the send path we never need to grab the lock on the 'xmitq' list > when the destination is an exernal node. > > - On the receive path we always need to grab the lock on the 'inputq' > list, irrespective of source node. > > However, when transmitting node local messages those will eventually > end up on the receive path of a local socket, meaning that the argument > 'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the > function tipc_sk_rcv(). This has been handled by always initializing > the spinlock of the 'xmitq' list at message creation, just in case it > may end up on the receive path later, and despite knowing that the lock > in most cases never will be used. > > This approach is inaccurate and confusing, and has also concealed the > fact that the stated 'no lock grabbing' policy for the send path is > violated in some cases. > > We now clean up this by never initializing the lock at message creation, > instead doing this at the moment we find that the message actually will > enter the receive path. At the same time we fix the four locations > where we incorrectly access the spinlock on the send/error path. > > This patch also reverts commit d12cffe9329f ("tipc: ensure head->lock > is initialised") which has now become redundant. > > CC: Eric Dumazet <edu...@go...> > Reported-by: Chris Packham <chr...@al...> > Acked-by: Ying Xue <yin...@wi...> > Signed-off-by: Jon Maloy <jon...@er...> Applied. |
From: David M. <da...@da...> - 2019-08-16 23:27:57
|
From: Tuong Lien <tuo...@de...> Date: Thu, 15 Aug 2019 10:24:08 +0700 > This commit eliminates the use of the link 'stale_limit' & 'prev_from' > (besides the already removed - 'stale_cnt') variables in the detection > of repeated retransmit failures as there is no proper way to initialize > them to avoid a false detection, i.e. it is not really a retransmission > failure but due to a garbage values in the variables. > > Instead, a jiffies variable will be added to individual skbs (like the > way we restrict the skb retransmissions) in order to mark the first skb > retransmit time. Later on, at the next retransmissions, the timestamp > will be checked to see if the skb in the link transmq is "too stale", > that is, the link tolerance time has passed, so that a link reset will > be ordered. Note, just checking on the first skb in the queue is fine > enough since it must be the oldest one. > A counter is also added to keep track the actual skb retransmissions' > number for later checking when the failure happens. > > The downside of this approach is that the skb->cb[] buffer is about to > be exhausted, however it is always able to allocate another memory area > and keep a reference to it when needed. > > Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria") > Reported-by: Hoang Le <hoa...@de...> > Acked-by: Ying Xue <yin...@wi...> > Acked-by: Jon Maloy <jon...@er...> > Signed-off-by: Tuong Lien <tuo...@de...> Applied, thank you. |
From: <joh...@de...> - 2019-08-16 01:09:57
|
From: John Rutherford <joh...@de...> In commit 4f07b80c9733 ("tipc: check msg->req data len in tipc_nl_compat_bearer_disable") the same patch code was copied into routines: tipc_nl_compat_bearer_disable(), tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats(). The two link routine occurrences should have been modified to check the maximum link name length and not bearer name length. Fixes: 4f07b80c9733 ("tipc: check msg->reg data len in tipc_nl_compat_bearer_disable") Signed-off-by: John Rutherford <joh...@de...> --- net/tipc/netlink_compat.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c index e135d4e..d4d2928 100644 --- a/net/tipc/netlink_compat.c +++ b/net/tipc/netlink_compat.c @@ -550,7 +550,7 @@ static int tipc_nl_compat_link_stat_dump(struct tipc_nl_compat_msg *msg, if (len <= 0) return -EINVAL; - len = min_t(int, len, TIPC_MAX_BEARER_NAME); + len = min_t(int, len, TIPC_MAX_LINK_NAME); if (!string_is_valid(name, len)) return -EINVAL; @@ -822,7 +822,7 @@ static int tipc_nl_compat_link_reset_stats(struct tipc_nl_compat_cmd_doit *cmd, if (len <= 0) return -EINVAL; - len = min_t(int, len, TIPC_MAX_BEARER_NAME); + len = min_t(int, len, TIPC_MAX_LINK_NAME); if (!string_is_valid(name, len)) return -EINVAL; -- 2.11.0 |
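The failure mode behind the constant swap above is that the length is clamped to the bearer-name maximum before the code looks for a NUL terminator, so a valid link name longer than that window is rejected. A stand-alone C sketch of the pattern follows; the constants and helper names are illustrative, not the uapi values or kernel helpers.

/* Illustrative constants and names, not the uapi values or kernel helpers. */
#include <stdio.h>
#include <string.h>

#define TOY_MAX_BEARER_NAME 32
#define TOY_MAX_LINK_NAME   68

static int toy_string_is_valid(const char *s, int len)
{
	return memchr(s, '\0', (size_t)len) != NULL;	/* NUL inside the window? */
}

static int toy_check_name(const char *name, int len, int max)
{
	if (len <= 0)
		return -1;
	len = len < max ? len : max;	/* clamp to the maximum for this name type */
	return toy_string_is_valid(name, len) ? 0 : -1;
}

int main(void)
{
	/* "<node>:<if>-<node>:<if>" easily exceeds a bearer-name-sized window */
	const char ln[] = "1.1.10:enp0s31f6np0-1.1.1:enp0s31f6np0";

	printf("bearer-sized window: %d\n",
	       toy_check_name(ln, (int)sizeof(ln), TOY_MAX_BEARER_NAME));
	printf("link-sized window:   %d\n",
	       toy_check_name(ln, (int)sizeof(ln), TOY_MAX_LINK_NAME));
	return 0;
}

The same link name validates cleanly with a link-name-sized window and fails with a bearer-name-sized one, which is the spurious -EINVAL the patch removes.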
From: Jon M. <jon...@er...> - 2019-08-15 14:10:22
|
> -----Original Message----- > From: net...@vg... <net...@vg...> On > Behalf Of Xin Long > Sent: 15-Aug-19 01:58 > To: Jon Maloy <jon...@er...> > Cc: da...@da...; ne...@vg...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; sh...@re...; ying xue > <yin...@wi...>; edu...@go...; tipc- > dis...@li... > Subject: Re: [net-next 1/1] tipc: clean up skb list lock handling on send path > > [...] > > /* Try again later if socket is busy */ > > -- > > 2.1.4 > > > > > Patch looks good, can you also check those tmp tx queues in: > > tipc_group_cong() > tipc_group_join() > tipc_link_create_dummy_tnl_msg() > tipc_link_tnl_prepare() > > which are using skb_queue_head_init() to init? > > Thanks. You are right. I missed those. I'll post a v2 of this patch. ///jon |
From: Tuong L. <tuo...@de...> - 2019-08-15 03:24:29
|
This commit eliminates the use of the link 'stale_limit' & 'prev_from' (besides the already removed - 'stale_cnt') variables in the detection of repeated retransmit failures as there is no proper way to initialize them to avoid a false detection, i.e. it is not really a retransmission failure but due to a garbage values in the variables. Instead, a jiffies variable will be added to individual skbs (like the way we restrict the skb retransmissions) in order to mark the first skb retransmit time. Later on, at the next retransmissions, the timestamp will be checked to see if the skb in the link transmq is "too stale", that is, the link tolerance time has passed, so that a link reset will be ordered. Note, just checking on the first skb in the queue is fine enough since it must be the oldest one. A counter is also added to keep track the actual skb retransmissions' number for later checking when the failure happens. The downside of this approach is that the skb->cb[] buffer is about to be exhausted, however it is always able to allocate another memory area and keep a reference to it when needed. Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria") Reported-by: Hoang Le <hoa...@de...> Acked-by: Ying Xue <yin...@wi...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 92 ++++++++++++++++++++++++++++++++------------------------- net/tipc/msg.h | 8 +++-- 2 files changed, 57 insertions(+), 43 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index dd3155b14654..01d76bf16e9d 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -106,8 +106,6 @@ struct tipc_stats { * @transmitq: queue for sent, non-acked messages * @backlogq: queue for messages waiting to be sent * @snt_nxt: next sequence number to use for outbound messages - * @prev_from: sequence number of most previous retransmission request - * @stale_limit: time when repeated identical retransmits must force link reset * @ackers: # of peers that needs to ack each packet before it can be released * @acked: # last packet acked by a certain peer. Used for broadcast. 
* @rcv_nxt: next sequence number to expect for inbound messages @@ -164,9 +162,7 @@ struct tipc_link { u16 limit; } backlog[5]; u16 snd_nxt; - u16 prev_from; u16 window; - unsigned long stale_limit; /* Reception */ u16 rcv_nxt; @@ -1063,47 +1059,53 @@ static void tipc_link_advance_backlog(struct tipc_link *l, * link_retransmit_failure() - Detect repeated retransmit failures * @l: tipc link sender * @r: tipc link receiver (= l in case of unicast) - * @from: seqno of the 1st packet in retransmit request * @rc: returned code * * Return: true if the repeated retransmit failures happens, otherwise * false */ static bool link_retransmit_failure(struct tipc_link *l, struct tipc_link *r, - u16 from, int *rc) + int *rc) { struct sk_buff *skb = skb_peek(&l->transmq); struct tipc_msg *hdr; if (!skb) return false; - hdr = buf_msg(skb); - /* Detect repeated retransmit failures on same packet */ - if (r->prev_from != from) { - r->prev_from = from; - r->stale_limit = jiffies + msecs_to_jiffies(r->tolerance); - } else if (time_after(jiffies, r->stale_limit)) { - pr_warn("Retransmission failure on link <%s>\n", l->name); - link_print(l, "State of link "); - pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n", - msg_user(hdr), msg_type(hdr), msg_size(hdr), - msg_errcode(hdr)); - pr_info("sqno %u, prev: %x, src: %x\n", - msg_seqno(hdr), msg_prevnode(hdr), msg_orignode(hdr)); - - trace_tipc_list_dump(&l->transmq, true, "retrans failure!"); - trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!"); - trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!"); + if (!TIPC_SKB_CB(skb)->retr_cnt) + return false; - if (link_is_bc_sndlink(l)) - *rc = TIPC_LINK_DOWN_EVT; + if (!time_after(jiffies, TIPC_SKB_CB(skb)->retr_stamp + + msecs_to_jiffies(r->tolerance))) + return false; + + hdr = buf_msg(skb); + if (link_is_bc_sndlink(l) && !less(r->acked, msg_seqno(hdr))) + return false; + pr_warn("Retransmission failure on link <%s>\n", l->name); + link_print(l, "State of link "); + pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n", + msg_user(hdr), msg_type(hdr), msg_size(hdr), msg_errcode(hdr)); + pr_info("sqno %u, prev: %x, dest: %x\n", + msg_seqno(hdr), msg_prevnode(hdr), msg_destnode(hdr)); + pr_info("retr_stamp %d, retr_cnt %d\n", + jiffies_to_msecs(TIPC_SKB_CB(skb)->retr_stamp), + TIPC_SKB_CB(skb)->retr_cnt); + + trace_tipc_list_dump(&l->transmq, true, "retrans failure!"); + trace_tipc_link_dump(l, TIPC_DUMP_NONE, "retrans failure!"); + trace_tipc_link_dump(r, TIPC_DUMP_NONE, "retrans failure!"); + + if (link_is_bc_sndlink(l)) { + r->state = LINK_RESET; + *rc = TIPC_LINK_DOWN_EVT; + } else { *rc = tipc_link_fsm_evt(l, LINK_FAILURE_EVT); - return true; } - return false; + return true; } /* tipc_link_bc_retrans() - retransmit zero or more packets @@ -1129,7 +1131,7 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r, trace_tipc_link_retrans(r, from, to, &l->transmq); - if (link_retransmit_failure(l, r, from, &rc)) + if (link_retransmit_failure(l, r, &rc)) return rc; skb_queue_walk(&l->transmq, skb) { @@ -1138,11 +1140,10 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r, continue; if (more(msg_seqno(hdr), to)) break; - if (link_is_bc_sndlink(l)) { - if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)) - continue; - TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM; - } + + if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)) + continue; + TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM; _skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, GFP_ATOMIC); if 
(!_skb) return 0; @@ -1152,6 +1153,10 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r, _skb->priority = TC_PRIO_CONTROL; __skb_queue_tail(xmitq, _skb); l->stats.retransmitted++; + + /* Increase actual retrans counter & mark first time */ + if (!TIPC_SKB_CB(skb)->retr_cnt++) + TIPC_SKB_CB(skb)->retr_stamp = jiffies; } return 0; } @@ -1393,12 +1398,10 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap, struct tipc_msg *hdr; u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; u16 ack = l->rcv_nxt - 1; + bool passed = false; u16 seqno, n = 0; int rc = 0; - if (gap && link_retransmit_failure(l, l, acked + 1, &rc)) - return rc; - skb_queue_walk_safe(&l->transmq, skb, tmp) { seqno = buf_seqno(skb); @@ -1408,12 +1411,17 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap, __skb_unlink(skb, &l->transmq); kfree_skb(skb); } else if (less_eq(seqno, acked + gap)) { - /* retransmit skb */ + /* First, check if repeated retrans failures occurs? */ + if (!passed && link_retransmit_failure(l, l, &rc)) + return rc; + passed = true; + + /* retransmit skb if unrestricted*/ if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)) continue; TIPC_SKB_CB(skb)->nxt_retr = TIPC_UC_RETR_TIME; - - _skb = __pskb_copy(skb, MIN_H_SIZE, GFP_ATOMIC); + _skb = __pskb_copy(skb, LL_MAX_HEADER + MIN_H_SIZE, + GFP_ATOMIC); if (!_skb) continue; hdr = buf_msg(_skb); @@ -1422,6 +1430,10 @@ static int tipc_link_advance_transmq(struct tipc_link *l, u16 acked, u16 gap, _skb->priority = TC_PRIO_CONTROL; __skb_queue_tail(xmitq, _skb); l->stats.retransmitted++; + + /* Increase actual retrans counter & mark first time */ + if (!TIPC_SKB_CB(skb)->retr_cnt++) + TIPC_SKB_CB(skb)->retr_stamp = jiffies; } else { /* retry with Gap ACK blocks if any */ if (!ga || n >= ga->gack_cnt) @@ -2681,7 +2693,7 @@ int tipc_link_dump(struct tipc_link *l, u16 dqueues, char *buf) i += scnprintf(buf + i, sz - i, " %x", l->peer_caps); i += scnprintf(buf + i, sz - i, " %u", l->silent_intv_cnt); i += scnprintf(buf + i, sz - i, " %u", l->rst_cnt); - i += scnprintf(buf + i, sz - i, " %u", l->prev_from); + i += scnprintf(buf + i, sz - i, " %u", 0); i += scnprintf(buf + i, sz - i, " %u", 0); i += scnprintf(buf + i, sz - i, " %u", l->acked); diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 1c8c8dd32a4e..0daa6f04ca81 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -102,13 +102,15 @@ struct plist; #define TIPC_MEDIA_INFO_OFFSET 5 struct tipc_skb_cb { - u32 bytes_read; - u32 orig_member; struct sk_buff *tail; unsigned long nxt_retr; - bool validated; + unsigned long retr_stamp; + u32 bytes_read; + u32 orig_member; u16 chain_imp; u16 ackers; + u16 retr_cnt; + bool validated; }; #define TIPC_SKB_CB(__skb) ((struct tipc_skb_cb *)&((__skb)->cb[0])) -- 2.13.7 |
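The core mechanism introduced by the patch above - stamp a packet the first time it is retransmitted, count its retransmissions, and declare a link failure only when the head-of-queue packet has been in retransmission for longer than the link tolerance - can be sketched in a few lines of self-contained userspace C. The names pkt, retr_cnt, retr_stamp, now_ms and tolerance_ms below are hypothetical stand-ins for the TIPC_SKB_CB() fields, jiffies and the link tolerance; this is only an illustration of the timing logic under those assumptions, not the actual kernel code.

/*
 * Minimal userspace sketch (not kernel code) of the per-packet stale
 * retransmit check: stamp the first retransmission, then compare the
 * age of the oldest packet's stamp against the link tolerance.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pkt {
    uint16_t retr_cnt;    /* how many times this packet was retransmitted */
    uint64_t retr_stamp;  /* time (ms) of its first retransmission */
};

/* Called on every retransmission of @p at time @now_ms. */
static void pkt_retransmit(struct pkt *p, uint64_t now_ms)
{
    if (!p->retr_cnt++)        /* first retransmission of this packet? */
        p->retr_stamp = now_ms;
}

/*
 * True if the oldest (head-of-queue) packet has been in retransmission
 * for longer than the tolerance, i.e. the link should be reset.
 */
static bool link_retransmit_failure(const struct pkt *oldest,
                                    uint64_t now_ms, uint64_t tolerance_ms)
{
    if (!oldest->retr_cnt)     /* never retransmitted: nothing to judge */
        return false;
    return now_ms > oldest->retr_stamp + tolerance_ms;
}

int main(void)
{
    struct pkt head = { 0 };
    const uint64_t tolerance_ms = 1500;

    pkt_retransmit(&head, 1000);  /* first retransmit at t = 1000 ms */
    pkt_retransmit(&head, 1400);  /* stamp is kept from the first one */

    printf("stale at t=2000? %d\n",
           link_retransmit_failure(&head, 2000, tolerance_ms)); /* 0 */
    printf("stale at t=2600? %d\n",
           link_retransmit_failure(&head, 2600, tolerance_ms)); /* 1 */
    return 0;
}

Note how the stamp is taken only on the first retransmission, so later retransmissions cannot push the deadline forward - the same property the kernel change relies on by checking only the oldest skb in the transmq.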
From: Jon M. <jon...@er...> - 2019-08-14 15:36:54
Acked-by: Jon > -----Original Message----- > From: Ying Xue <yin...@wi...> > Sent: 14-Aug-19 07:41 > To: Tung Quang Nguyen <tun...@de...>; tipc- > dis...@li...; Jon Maloy <jon...@er...>; > ma...@do... > Subject: Re: [tipc-discussion][net v1 1/3] tipc: fix potential memory leak in > __tipc_sendmsg() > > On 8/13/19 6:01 PM, Tung Nguyen wrote: > > When initiating a connection message to a server side, the connection > > message is cloned and added to the socket write queue. However, if the > > cloning is failed, only the socket write queue is purged. It causes > > memory leak because the original connection message is not freed. > > > > This commit fixes it by purging the list of connection message when it > > cannot be cloned. > > > > Fixes: 6787927475e5 ("tipc: buffer overflow handling in listener > > socket") > > Reported-by: Hoang Le <hoa...@de...> > > Signed-off-by: Tung Nguyen <tun...@de...> > > Acked-by: Ying Xue <yin...@wi...> > > > --- > > net/tipc/socket.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c index > > 83ae41d7e554..dcb8b6082757 100644 > > --- a/net/tipc/socket.c > > +++ b/net/tipc/socket.c > > @@ -1392,8 +1392,10 @@ static int __tipc_sendmsg(struct socket *sock, > struct msghdr *m, size_t dlen) > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > if (unlikely(rc != dlen)) > > return rc; > > - if (unlikely(syn && !tipc_msg_skb_clone(&pkts, &sk- > >sk_write_queue))) > > + if (unlikely(syn && !tipc_msg_skb_clone(&pkts, &sk- > >sk_write_queue))) { > > + __skb_queue_purge(&pkts); > > return -ENOMEM; > > + } > > > > trace_tipc_sk_sendmsg(sk, skb_peek(&pkts), TIPC_DUMP_SK_SNDQ, " > "); > > rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); > > |
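As a rough illustration of the leak pattern fixed in the patch quoted above - a message list that has already been built must be released on the path where cloning onto the write queue fails - the following self-contained userspace sketch uses hypothetical helpers (buf, list_purge(), buf_clone(), send_msg()) rather than the real skb/TIPC API; it only models the ownership rule, under the assumption that the transmit path would otherwise consume the list.

/*
 * Userspace sketch (hypothetical helpers, not the TIPC/skb API) of the
 * ownership rule: if cloning onto the retransmit queue fails, the list
 * that was just built must be purged before returning.
 */
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    struct buf *next;
    char data[64];
};

static void list_purge(struct buf **head)
{
    while (*head) {
        struct buf *b = *head;

        *head = b->next;
        free(b);
    }
}

static struct buf *buf_clone(const struct buf *src)
{
    struct buf *b = malloc(sizeof(*b));

    if (b)
        memcpy(b, src, sizeof(*b));
    return b;
}

/* Build one message; for connection setup (@is_syn), keep a clone on @rexmitq. */
static int send_msg(struct buf **rexmitq, int is_syn)
{
    struct buf *pkts = calloc(1, sizeof(*pkts));   /* the built message list */

    if (!pkts)
        return -ENOMEM;

    if (is_syn) {
        struct buf *clone = buf_clone(pkts);

        if (!clone) {
            list_purge(&pkts);   /* the fix: free what we built, don't leak it */
            return -ENOMEM;
        }
        clone->next = *rexmitq;
        *rexmitq = clone;
    }

    list_purge(&pkts);           /* stands in for the transmit path consuming it */
    return 0;
}

int main(void)
{
    struct buf *rexmitq = NULL;
    int rc = send_msg(&rexmitq, 1);

    list_purge(&rexmitq);
    return rc ? 1 : 0;
}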
From: Jon M. <jon...@er...> - 2019-08-14 15:16:12
Acked-by: Jon > -----Original Message----- > From: Tung Nguyen <tun...@de...> > Sent: 13-Aug-19 06:02 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi... > Subject: [tipc-discussion][net v1 2/3] tipc: fix wrong socket reference counter > after tipc_sk_timeout() returns > > When tipc_sk_timeout() is executed but user space is grabbing ownership, this > function rearms itself and returns. However, the socket reference counter is > not reduced. This causes potential unexpected behavior. > > This commit fixes it by calling sock_put() before tipc_sk_timeout() returns in > the above-mentioned case. > > Fixes: afe8792fec69 ("tipc: refactor function tipc_sk_timeout()") > Signed-off-by: Tung Nguyen <tun...@de...> > --- > net/tipc/socket.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/tipc/socket.c b/net/tipc/socket.c index > dcb8b6082757..9fd9a5727786 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -2683,6 +2683,7 @@ static void tipc_sk_timeout(struct timer_list *t) > if (sock_owned_by_user(sk)) { > sk_reset_timer(sk, &sk->sk_timer, jiffies + HZ / 20); > bh_unlock_sock(sk); > + sock_put(sk); > return; > } > > -- > 2.17.1 |
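The rule enforced by the fix quoted above - every early-return path of the timer handler must drop the socket reference it holds, not only the path that completes the work - can be shown with a minimal userspace sketch. obj, obj_get()/obj_put() and owned_by_user are hypothetical stand-ins for the socket, sock_hold()/sock_put() and sock_owned_by_user(); the sketch only demonstrates the reference-count balance, not the socket code itself.

/*
 * Userspace sketch (hypothetical names, not the socket API) of the
 * refcount rule: a timer handler that bails out early must still drop
 * the reference it holds, exactly like the path that does the work.
 */
#include <assert.h>
#include <stdbool.h>

struct obj {
    int refcnt;
    bool owned_by_user;   /* stands in for sock_owned_by_user() */
};

static void obj_get(struct obj *o) { o->refcnt++; }
static void obj_put(struct obj *o) { o->refcnt--; }

/* Timer handler; one reference was taken when the timer was armed. */
static void timeout_handler(struct obj *o)
{
    if (o->owned_by_user) {
        /* early-return path (the real code rearms the timer here):
         * the reference must still be dropped */
        obj_put(o);
        return;
    }
    /* ... normal timeout processing ... */
    obj_put(o);
}

int main(void)
{
    struct obj o = { .refcnt = 0, .owned_by_user = true };

    obj_get(&o);            /* reference taken when arming the timer */
    timeout_handler(&o);
    assert(o.refcnt == 0);  /* balanced even on the early-return path */
    return 0;
}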