From: Jon M. <jon...@er...> - 2019-10-29 11:25:13
|
Hi Ying,

You're right. I'll fix this. Anyway, I just realized another improvement I
could make, so I'll send a new version with both changes.

Regards
///jon

> -----Original Message-----
> From: Ying Xue <yin...@wi...>
> Sent: 29-Oct-19 06:24
> To: Jon Maloy <jon...@er...>; Jon Maloy <ma...@do...>
> Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>;
> par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang
> Huu Le <hoa...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan
> Mihaljevic <gor...@de...>; tip...@li...
> Subject: Re: [net-next v3 1/1] tipc: add smart nagle feature
>
> On 10/29/19 1:37 AM, Jon Maloy wrote:
> > @@ -3007,6 +3068,9 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
> > 	case TIPC_GROUP_LEAVE:
> > 		res = tipc_sk_leave(tsk);
> > 		break;
> > +	case TIPC_NODELAY:
> > +		tsk->nodelay = true;
> > +		break;
> > 	default:
> > 		res = -EINVAL;
> > 	}
>
> Once a user sets tsk->nodelay to true, there is no chance to set it back
> to false. Although this scenario rarely happens for us, it's better that
> we can provide the function.
>
> For example, below is how TCP supports the TCP_NODELAY option:
>
> 	case TCP_NODELAY:
> 		if (val) {
> 			/* TCP_NODELAY is weaker than TCP_CORK, so that
> 			 * this option on corked socket is remembered, but
> 			 * it is not activated until cork is cleared.
> 			 *
> 			 * However, when TCP_NODELAY is set we make
> 			 * an explicit push, which overrides even TCP_CORK
> 			 * for currently queued segments.
> 			 */
> 			tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
> 			tcp_push_pending_frames(sk);
> 		} else {
> 			tp->nonagle &= ~TCP_NAGLE_OFF;
> 		}
> 		break;
From: Ying X. <yin...@wi...> - 2019-10-29 10:36:47
|
On 10/29/19 1:37 AM, Jon Maloy wrote:
> @@ -3007,6 +3068,9 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
> 	case TIPC_GROUP_LEAVE:
> 		res = tipc_sk_leave(tsk);
> 		break;
> +	case TIPC_NODELAY:
> +		tsk->nodelay = true;
> +		break;
> 	default:
> 		res = -EINVAL;
> 	}

Once a user sets tsk->nodelay to true, there is no chance to set it back
to false. Although this scenario rarely happens for us, it's better that
we can provide the function.

For example, below is how TCP supports the TCP_NODELAY option:

	case TCP_NODELAY:
		if (val) {
			/* TCP_NODELAY is weaker than TCP_CORK, so that
			 * this option on corked socket is remembered, but
			 * it is not activated until cork is cleared.
			 *
			 * However, when TCP_NODELAY is set we make
			 * an explicit push, which overrides even TCP_CORK
			 * for currently queued segments.
			 */
			tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
			tcp_push_pending_frames(sk);
		} else {
			tp->nonagle &= ~TCP_NAGLE_OFF;
		}
		break;
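The symmetric set/clear behavior Ying asks for can be sketched in a small
userspace model. This is not the real kernel code: `toy_sock`, the flag
names, and `push_pending()` are simplified stand-ins modeled on TCP's
`nonagle` bits, just to show why a value-dependent handler (rather than an
unconditional `tsk->nodelay = true`) makes the option reversible.

```c
/* Toy model of a reversible NODELAY option, patterned after TCP's
 * nonagle handling. All types and names here are illustrative
 * stand-ins, not the real TIPC or TCP kernel structures.
 */
#include <assert.h>

#define NAGLE_OFF  0x1 /* nagle disabled (NODELAY set) */
#define NAGLE_PUSH 0x2 /* one-shot: flush whatever is queued now */

struct toy_sock {
	unsigned int nonagle;
	int pushed; /* counts explicit flushes of the send queue */
};

static void push_pending(struct toy_sock *sk)
{
	sk->pushed++;
	sk->nonagle &= ~NAGLE_PUSH; /* the PUSH bit is consumed by the flush */
}

/* Sketch of a value-aware NODELAY setsockopt handler */
static void set_nodelay(struct toy_sock *sk, int val)
{
	if (val) {
		sk->nonagle |= NAGLE_OFF | NAGLE_PUSH;
		push_pending(sk); /* flush already-queued messages at once */
	} else {
		sk->nonagle &= ~NAGLE_OFF; /* re-enable nagle bundling */
	}
}
```

Because the handler branches on `val`, calling it with 0 restores nagle
mode, which is exactly the path the original one-line patch lacked.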
From: Ying X. <yin...@wi...> - 2019-10-29 10:24:06
|
On 10/25/19 12:28 AM, Jon Maloy wrote:
> 1) TIPC_NODELAY might be a good option, although I fear people might
> misuse it in the belief that TIPC nagle has the same disadvantages as
> TCP nagle, which it doesn't. But ok, I'll add it.
>
> 2) CONN_PROBE/CONN_PROBE_REPLY are not considered simply because they
> are so rare (once per hour) that they won't make any difference.
>
> 3) We don't really have any tools to measure this. The latency
> measurement in our benchmark tool never triggers nagle mode, so we won't
> notice any difference. When nagle is enabled, we can only measure
> latency per direction, not round-trip delay (since there is no return
> message), but logically it works as follows:
>
> Scenario 1:
> 1) Socket goes to nagle mode. The message triggering this is not
> bundled, but just sent out with the 'response_req' bit set.
> 2) A number of messages and possibly skbs are added to the queue.
> 3) The ACK_MSG (response to msg 1) arrives after 1 RTT, and the
> accumulated messages are sent. So the first message, probably added just
> after the 'resp_req' message was sent, might have a delay of up to one
> RTT. The remaining messages in the queue will have a lower delay, and
> notably a message added just before the ACK_MSG arrives will have almost
> no delay.
>
> Scenario 2:
> 1) Socket is in nagle mode, and a number of messages are being
> accumulated. The last message in the queue always has the resp_req bit
> set.
> 2) The queue surpasses 64k, or a message larger than 'maxnagle' is being
> sent, so the whole send queue is sent out.
> 3) Obviously we didn't wait for the expected MSG_ACK in this case, so
> the delay for all messages is less than 1 RTT.
>
> It remains to know the size of the RTT, but in the type of clusters we
> are running this is rarely more than a few milliseconds, at most. This
> is in contrast to TCP, where the delay may be several hundred
> milliseconds.

Thank you for your clear explanation. It makes me fully understand why
you stated the delay is no more than one RTT after nagle is enabled.
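Jon's two scenarios amount to a simple timeline argument, which can be
checked with a toy calculation. The RTT value and queueing times below
are made-up numbers for illustration; the point is only that a message
queued at time t waits until the flush, and the flush happens no later
than one RTT after nagle mode began.

```c
/* Toy timeline check of the "delay is bounded by one RTT" claim.
 * RTT and all times are hypothetical values in milliseconds.
 */
#include <assert.h>

#define RTT 10 /* hypothetical round-trip time */

/* Scenario 1: the resp_req message goes out at t=0; everything queued
 * afterwards is flushed when its ACK arrives at t=RTT.
 */
static int nagle_delay(int queued_at)
{
	int flush_at = RTT; /* ACK-triggered flush */
	return flush_at - queued_at;
}

/* Scenario 2: the queue passes 64k (or an oversized message arrives)
 * at some flush_at < RTT, so the queue is sent without waiting for
 * the ACK; every message then waited strictly less than one RTT.
 */
static int early_flush_delay(int queued_at, int flush_at)
{
	return flush_at - queued_at;
}
```

With these stand-ins, the worst case is the message queued immediately
after the resp_req message (one full RTT), and a message queued just
before the ACK waits essentially zero, matching Jon's description.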
From: Hoang Le <hoa...@de...> - 2019-10-29 00:53:00
|
Currently, TIPC transports intra-node user data messages directly socket
to socket, hence shortcutting all the lower layers of the communication
stack. This gives TIPC very good intra-node performance, both regarding
throughput and latency.

We now introduce a similar mechanism for TIPC data traffic across network
namespaces located in the same kernel. On the send path, the call chain
is as always accompanied by the sending node's network namespace pointer.
However, once we have reliably established that the receiving node is
represented by a namespace on the same host, we just replace the
namespace pointer with the receiving node/namespace's ditto, and follow
the regular socket receive path through the receiving node. This
technique gives us a throughput similar to the node-internal throughput,
several times larger than if we let the traffic go through the full
network stacks. As a comparison, max throughput for 64k messages is four
times larger than TCP throughput for the same type of traffic.

To meet any security concerns, the following should be noted.

- All nodes joining a cluster are supposed to have been certified and
  authenticated by mechanisms outside TIPC. This is no different for
  nodes/namespaces on the same host; they have to auto-discover each
  other using the attached interfaces, and establish links which are
  supervised via the regular link monitoring mechanism. Hence, a
  kernel-local node has no other way to join a cluster than any other
  node, and has to obey the policies set in the IP or device layers of
  the stack.

- Only when a sender has established with 100% certainty that the peer
  node is located in a kernel-local namespace does it choose to let user
  data messages, and only those, take the crossover path to the
  receiving node/namespace.

- If the receiving node/namespace is removed, its namespace pointer is
  invalidated at all peer nodes, and their neighbor link monitoring will
  eventually note that this node is gone.

- To ensure the "100% certainty" criteria, and prevent any possible
  spoofing, received discovery messages must contain a proof that the
  sender knows a common secret. We use the hash mix of the sending
  node/namespace for this purpose, since it can be accessed directly by
  all other namespaces in the kernel. Upon reception of a discovery
  message, the receiver checks this proof against all the local
  namespaces' hash_mix:es. If it finds a match, that, along with a
  matching node id and cluster id, is deemed sufficient proof that the
  peer node in question is in a local namespace, and a wormhole can be
  opened.

- We should also consider that TIPC is intended to be a cluster-local
  IPC mechanism (just like e.g. UNIX sockets) rather than a network
  protocol, and hence we think it can be justified to allow it to
  shortcut the lower protocol layers.

Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the
node-internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces, their inter-namespace
messaging can easily be tracked.
v2: - update 'net' pointer when node left/rejoined v3: - grab read/write lock when using node ref obj v4: - clone traffics between netns to loopback Suggested-by: Jon Maloy <jon...@er...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 16 +++++ net/tipc/core.h | 6 ++ net/tipc/discover.c | 4 +- net/tipc/msg.h | 14 ++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 155 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 5 +- net/tipc/socket.c | 6 +- 8 files changed, 197 insertions(+), 11 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 23cb379a93d6..ab648dd150ee 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -105,6 +105,15 @@ static void __net_exit tipc_exit_net(struct net *net) tipc_sk_rht_destroy(net); } +static void __net_exit tipc_pernet_pre_exit(struct net *net) +{ + tipc_node_pre_cleanup_net(net); +} + +static struct pernet_operations tipc_pernet_pre_exit_ops = { + .pre_exit = tipc_pernet_pre_exit, +}; + static struct pernet_operations tipc_net_ops = { .init = tipc_init_net, .exit = tipc_exit_net, @@ -151,6 +160,10 @@ static int __init tipc_init(void) if (err) goto out_pernet_topsrv; + err = register_pernet_subsys(&tipc_pernet_pre_exit_ops); + if (err) + goto out_register_pernet_subsys; + err = tipc_bearer_setup(); if (err) goto out_bearer; @@ -158,6 +171,8 @@ static int __init tipc_init(void) pr_info("Started in single node mode\n"); return 0; out_bearer: + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); +out_register_pernet_subsys: unregister_pernet_device(&tipc_topsrv_net_ops); out_pernet_topsrv: tipc_socket_stop(); @@ -177,6 +192,7 @@ static int __init tipc_init(void) static void __exit tipc_exit(void) { tipc_bearer_cleanup(); + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); unregister_pernet_device(&tipc_topsrv_net_ops); tipc_socket_stop(); unregister_pernet_device(&tipc_net_ops); diff --git a/net/tipc/core.h b/net/tipc/core.h index 60d829581068..8776d32a4a47 
100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -59,6 +59,7 @@ #include <net/netns/generic.h> #include <linux/rhashtable.h> #include <net/genetlink.h> +#include <net/netns/hash.h> struct tipc_node; struct tipc_bearer; @@ -185,6 +186,11 @@ static inline int in_range(u16 val, u16 min, u16 max) return !less(val, min) && !more(val, max); } +static inline u32 tipc_net_hash_mixes(struct net *net, int tn_rand) +{ + return net_hash_mix(&init_net) ^ net_hash_mix(net) ^ tn_rand; +} + #ifdef CONFIG_SYSCTL int tipc_register_sysctl(void); void tipc_unregister_sysctl(void); diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..b043e8c6397a 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -94,6 +94,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + msg_set_peer_net_hash(hdr, tipc_net_hash_mixes(net, tn->random)); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +243,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..2d7cb66a6912 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1026,6 +1026,20 @@ static inline bool msg_is_reset(struct tipc_msg *hdr) return (msg_user(hdr) == LINK_PROTOCOL) && (msg_type(hdr) == RESET_MSG); } +/* Word 13 + */ +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 13, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 13); +} + +/* Word 14 + */ static inline u32 msg_sugg_node_addr(struct tipc_msg *m) { return msg_word(m, 14); diff --git 
a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index f2e3cf70c922..4b60928049ea 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -126,6 +126,8 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *peer_net; + u32 peer_hash_mix; }; /* Node FSM states and events: @@ -184,7 +186,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +196,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->peer_net && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -360,8 +370,37 @@ static void tipc_node_write_unlock(struct tipc_node *n) } } +static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) +{ + int net_id = tipc_netid(n->net); + struct tipc_net *tn_peer; + struct net *tmp; + u32 hash_chk; + + if (n->peer_net) + return; + + for_each_net_rcu(tmp) { + tn_peer = tipc_net(tmp); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id 
!= net_id) + continue; + if (memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) + continue; + hash_chk = tipc_net_hash_mixes(tmp, tn_peer->random); + if (hash_mixes ^ hash_chk) + continue; + n->peer_net = tmp; + n->peer_hash_mix = hash_mixes; + break; + } +} + static struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; @@ -372,6 +411,8 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, spin_lock_bh(&tn->node_list_lock); n = tipc_node_find(net, addr); if (n) { + if (n->peer_hash_mix ^ hash_mixes) + tipc_node_assign_peer_net(n, hash_mixes); if (n->capabilities == capabilities) goto exit; /* Same node may come back with new capabilities */ @@ -389,6 +430,7 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, list_for_each_entry_rcu(temp_node, &tn->node_list, list) { tn->capabilities &= temp_node->capabilities; } + goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -399,6 +441,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, n->addr = addr; memcpy(&n->peer_id, peer_id, 16); n->net = net; + n->peer_net = NULL; + n->peer_hash_mix = 0; + /* Assign kernel local namespace if exists */ + tipc_node_assign_peer_net(n, hash_mixes); n->capabilities = capabilities; kref_init(&n->kref); rwlock_init(&n->lock); @@ -426,6 +472,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_bc_sndlink(net), &n->bc_entry.link)) { pr_warn("Broadcast rcv link creation failed, no memory\n"); + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } kfree(n); n = NULL; goto exit; @@ -979,7 +1029,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 
signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1048,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1343,6 +1394,10 @@ static void node_lost_contact(struct tipc_node *n, /* Notify publications from this node */ n->action_flags |= TIPC_NOTIFY_NODE_DOWN; + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } /* Notify sockets connected to node */ list_for_each_entry_safe(conn, safe, conns, list) { skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG, @@ -1424,6 +1479,56 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *peer_net, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + tipc_loopback_trace(peer_net, list); + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + } + if (msg_mcast(hdr)) { + tipc_loopback_trace(peer_net, list); + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + tipc_loopback_trace(peer_net, list); + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + tipc_loopback_trace(peer_net, list); + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: 
+ case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1544,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1456,6 +1562,17 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, } tipc_node_read_lock(n); + node_up = node_is_up(n); + if (node_up && n->peer_net && check_net(n->peer_net)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->peer_net, list); + if (likely(skb_queue_empty(list))) { + tipc_node_read_unlock(n); + tipc_node_put(n); + return 0; + } + } + bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { tipc_node_read_unlock(n); @@ -2587,3 +2704,33 @@ int tipc_node_dump(struct tipc_node *n, bool more, char *buf) return i; } + +void tipc_node_pre_cleanup_net(struct net *exit_net) +{ + struct tipc_node *n; + struct tipc_net *tn; + struct net *tmp; + + rcu_read_lock(); + for_each_net_rcu(tmp) { + if (tmp == exit_net) + continue; + tn = tipc_net(tmp); + if (!tn) + continue; + spin_lock_bh(&tn->node_list_lock); + list_for_each_entry_rcu(n, &tn->node_list, list) { + if (!n->peer_net) + continue; + if (n->peer_net != exit_net) + continue; + tipc_node_write_lock(n); + n->peer_net = NULL; + n->peer_hash_mix = 0; + tipc_node_write_unlock_fast(n); + break; + } + spin_unlock_bh(&tn->node_list_lock); + } + rcu_read_unlock(); +} diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..30563c4f35d5 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 
signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); @@ -107,4 +107,5 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info); int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb); int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, struct netlink_callback *cb); +void tipc_node_pre_cleanup_net(struct net *exit_net); #endif diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ffc2b90..2bcacd6022d5 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static 
void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
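The discovery proof described in the commit message can be sketched in
isolation. In the patch the proof is `net_hash_mix(&init_net) ^
net_hash_mix(net) ^ tn->random`, and the receiver scans its local
namespaces for a matching value. In this userspace sketch,
`net_hash_mix()` is replaced by stand-in per-namespace numbers and
`toy_ns` is a simplification of `struct tipc_net`; only the XOR-proof
matching logic is faithful to the patch.

```c
/* Sketch of the "common secret" check used to detect a kernel-local
 * peer namespace. Hash values here are arbitrary stand-ins for
 * net_hash_mix(); toy_ns is a simplified stand-in for struct tipc_net.
 */
#include <assert.h>
#include <stddef.h>

struct toy_ns {
	unsigned int hash; /* stand-in for net_hash_mix(ns) */
	int random;        /* stand-in for tn->random */
};

static unsigned int init_hash = 0xabc; /* stand-in for net_hash_mix(&init_net) */

/* Sender side: the proof carried in the discovery message */
static unsigned int hash_mixes(const struct toy_ns *ns)
{
	return init_hash ^ ns->hash ^ (unsigned int)ns->random;
}

/* Receiver side: find the local namespace whose proof matches, if any.
 * A match opens the "wormhole"; no match means the sender is remote.
 */
static const struct toy_ns *find_peer_ns(const struct toy_ns *local,
					 size_t n, unsigned int proof)
{
	for (size_t i = 0; i < n; i++)
		if (hash_mixes(&local[i]) == proof)
			return &local[i];
	return NULL;
}
```

The patch expresses the same comparison as `if (hash_mixes ^ hash_chk)
continue;`, which is equivalent to the equality test above; a node in a
different kernel cannot produce a matching value because it cannot read
this kernel's `net_hash_mix()` results.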
From: Jon M. <jon...@er...> - 2019-10-28 22:23:49
|
Hi Hoang,

Sorry that my review took some time. To me this looks safe, as the
pointers now will be cleared before the namespace disappears. I suggest
you post this to netdev and then we'll see if Eric or anybody else still
has objections.

Acked-by: Jon Maloy <jon...@er...>

///jon

> -----Original Message-----
> From: Hoang Le <hoa...@de...>
> Sent: 28-Oct-19 06:41
> To: tip...@li...; Jon Maloy <jon...@er...>; ma...@do...
> Subject: [net-next v4] tipc: improve throughput between nodes in netns
>
> [quoted v4 patch trimmed]
+static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) > +{ > + int net_id = tipc_netid(n->net); > + struct tipc_net *tn_peer; > + struct net *tmp; > + u32 hash_chk; > + > + if (n->peer_net) > + return; > + > + for_each_net_rcu(tmp) { > + tn_peer = tipc_net(tmp); > + if (!tn_peer) > + continue; > + /* Integrity checking whether node exists in namespace or not */ > + if (tn_peer->net_id != net_id) > + continue; > + if (memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) > + continue; > + hash_chk = tipc_net_hash_mixes(tmp, tn_peer->random); > + if (hash_mixes ^ hash_chk) > + continue; > + n->peer_net = tmp; > + n->peer_hash_mix = hash_mixes; > + break; > + } > +} > + > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > - u8 *peer_id, u16 capabilities) > + u8 *peer_id, u16 capabilities, > + u32 signature, u32 hash_mixes) > { > struct tipc_net *tn = net_generic(net, tipc_net_id); > struct tipc_node *n, *temp_node; > @@ -372,6 +411,8 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > spin_lock_bh(&tn->node_list_lock); > n = tipc_node_find(net, addr); > if (n) { > + if (n->peer_hash_mix ^ hash_mixes) > + tipc_node_assign_peer_net(n, hash_mixes); > if (n->capabilities == capabilities) > goto exit; > /* Same node may come back with new capabilities */ > @@ -389,6 +430,7 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > list_for_each_entry_rcu(temp_node, &tn->node_list, list) { > tn->capabilities &= temp_node->capabilities; > } > + > goto exit; > } > n = kzalloc(sizeof(*n), GFP_ATOMIC); > @@ -399,6 +441,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > n->addr = addr; > memcpy(&n->peer_id, peer_id, 16); > n->net = net; > + n->peer_net = NULL; > + n->peer_hash_mix = 0; > + /* Assign kernel local namespace if exists */ > + tipc_node_assign_peer_net(n, hash_mixes); > n->capabilities = capabilities; > kref_init(&n->kref); > rwlock_init(&n->lock); > @@ -426,6 
+472,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > tipc_bc_sndlink(net), > &n->bc_entry.link)) { > pr_warn("Broadcast rcv link creation failed, no memory\n"); > + if (n->peer_net) { > + n->peer_net = NULL; > + n->peer_hash_mix = 0; > + } > kfree(n); > n = NULL; > goto exit; > @@ -979,7 +1029,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) > > void tipc_node_check_dest(struct net *net, u32 addr, > u8 *peer_id, struct tipc_bearer *b, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 hash_mixes, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr) > { > @@ -998,7 +1048,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, > *dupl_addr = false; > *respond = false; > > - n = tipc_node_create(net, addr, peer_id, capabilities); > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > + hash_mixes); > if (!n) > return; > > @@ -1343,6 +1394,10 @@ static void node_lost_contact(struct tipc_node *n, > /* Notify publications from this node */ > n->action_flags |= TIPC_NOTIFY_NODE_DOWN; > > + if (n->peer_net) { > + n->peer_net = NULL; > + n->peer_hash_mix = 0; > + } > /* Notify sockets connected to node */ > list_for_each_entry_safe(conn, safe, conns, list) { > skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG, > @@ -1424,6 +1479,56 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node > *node) > return -EMSGSIZE; > } > > +static void tipc_lxc_xmit(struct net *peer_net, struct sk_buff_head *list) > +{ > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > + struct sk_buff_head inputq; > + > + switch (msg_user(hdr)) { > + case TIPC_LOW_IMPORTANCE: > + case TIPC_MEDIUM_IMPORTANCE: > + case TIPC_HIGH_IMPORTANCE: > + case TIPC_CRITICAL_IMPORTANCE: > + if (msg_connected(hdr) || msg_named(hdr)) { > + tipc_loopback_trace(peer_net, list); > + spin_lock_init(&list->lock); > + tipc_sk_rcv(peer_net, list); > + return; > + } > + if (msg_mcast(hdr)) { > + 
tipc_loopback_trace(peer_net, list); > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(peer_net, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + return; > + } > + return; > + case MSG_FRAGMENTER: > + if (tipc_msg_assemble(list)) { > + tipc_loopback_trace(peer_net, list); > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(peer_net, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + } > + return; > + case GROUP_PROTOCOL: > + case CONN_MANAGER: > + tipc_loopback_trace(peer_net, list); > + spin_lock_init(&list->lock); > + tipc_sk_rcv(peer_net, list); > + return; > + case LINK_PROTOCOL: > + case NAME_DISTRIBUTOR: > + case TUNNEL_PROTOCOL: > + case BCAST_PROTOCOL: > + return; > + default: > + return; > + }; > +} > + > /** > * tipc_node_xmit() is the general link level function for message sending > * @net: the applicable net namespace > @@ -1439,6 +1544,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, > struct tipc_link_entry *le = NULL; > struct tipc_node *n; > struct sk_buff_head xmitq; > + bool node_up = false; > int bearer_id; > int rc; > > @@ -1456,6 +1562,17 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, > } > > tipc_node_read_lock(n); > + node_up = node_is_up(n); > + if (node_up && n->peer_net && check_net(n->peer_net)) { > + /* xmit inner linux container */ > + tipc_lxc_xmit(n->peer_net, list); > + if (likely(skb_queue_empty(list))) { > + tipc_node_read_unlock(n); > + tipc_node_put(n); > + return 0; > + } > + } > + > bearer_id = n->active_links[selector & 1]; > if (unlikely(bearer_id == INVALID_BEARER_ID)) { > tipc_node_read_unlock(n); > @@ -2587,3 +2704,33 @@ int tipc_node_dump(struct tipc_node *n, bool more, char *buf) > > return i; > } > + > +void tipc_node_pre_cleanup_net(struct net *exit_net) > +{ > + struct tipc_node *n; > + struct tipc_net *tn; > + struct net *tmp; > + > + rcu_read_lock(); > + for_each_net_rcu(tmp) { > + if (tmp == exit_net) > + 
continue; > + tn = tipc_net(tmp); > + if (!tn) > + continue; > + spin_lock_bh(&tn->node_list_lock); > + list_for_each_entry_rcu(n, &tn->node_list, list) { > + if (!n->peer_net) > + continue; > + if (n->peer_net != exit_net) > + continue; > + tipc_node_write_lock(n); > + n->peer_net = NULL; > + n->peer_hash_mix = 0; > + tipc_node_write_unlock_fast(n); > + break; > + } > + spin_unlock_bh(&tn->node_list_lock); > + } > + rcu_read_unlock(); > +} > diff --git a/net/tipc/node.h b/net/tipc/node.h > index 291d0ecd4101..30563c4f35d5 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); > void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > struct tipc_bearer *bearer, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 hash_mixes, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr); > void tipc_node_delete_links(struct net *net, int bearer_id); > @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); > void tipc_node_broadcast(struct net *net, struct sk_buff *skb); > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); > void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); > bool tipc_node_is_up(struct net *net, u32 addr); > u16 tipc_node_get_capabilities(struct net *net, u32 addr); > int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); > @@ -107,4 +107,5 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info); > int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb); > int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, > struct netlink_callback *cb); > +void tipc_node_pre_cleanup_net(struct net 
*exit_net); > #endif > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > index 35e32ffc2b90..2bcacd6022d5 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, > > /* Build message as chain of buffers */ > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) > return rc; > > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > -- > 2.20.1 |
From: David M. <da...@da...> - 2019-10-28 20:43:04
|
From: Geert Uytterhoeven <gee...@gl...> Date: Thu, 24 Oct 2019 17:30:43 +0200 > Fix misspelling of "endpoint". > > Signed-off-by: Geert Uytterhoeven <gee...@gl...> Applied to net-next. |
From: Hoang Le <hoa...@de...> - 2019-10-28 10:42:49
|
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency.

We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network namespace pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive path through the receiving node.

This technique gives us a throughput similar to the node-internal throughput, several times larger than if we let the traffic go through the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic.

To meet any security concerns, the following should be noted.

- All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto-discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel-local node has no other way to join a cluster than any other node, and has to obey policies set in the IP or device layers of the stack.

- Only when a sender has established with 100% certainty that the peer node is located in a kernel-local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace.

- If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces' hash_mix:es. If it finds a match, this, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened.

- We should also consider that TIPC is intended to be a cluster-local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can be justified to allow it to shortcut the lower protocol layers.

Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node-internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-namespace messaging can easily be tracked.
v2: - update 'net' pointer when node left/rejoined v3: - grab read/write lock when using node ref obj v4: - clone traffics between netns to loopback Suggested-by: Jon Maloy <jon...@er...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 16 +++++ net/tipc/core.h | 6 ++ net/tipc/discover.c | 4 +- net/tipc/msg.h | 14 ++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 155 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 5 +- net/tipc/socket.c | 6 +- 8 files changed, 197 insertions(+), 11 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 23cb379a93d6..ab648dd150ee 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -105,6 +105,15 @@ static void __net_exit tipc_exit_net(struct net *net) tipc_sk_rht_destroy(net); } +static void __net_exit tipc_pernet_pre_exit(struct net *net) +{ + tipc_node_pre_cleanup_net(net); +} + +static struct pernet_operations tipc_pernet_pre_exit_ops = { + .pre_exit = tipc_pernet_pre_exit, +}; + static struct pernet_operations tipc_net_ops = { .init = tipc_init_net, .exit = tipc_exit_net, @@ -151,6 +160,10 @@ static int __init tipc_init(void) if (err) goto out_pernet_topsrv; + err = register_pernet_subsys(&tipc_pernet_pre_exit_ops); + if (err) + goto out_register_pernet_subsys; + err = tipc_bearer_setup(); if (err) goto out_bearer; @@ -158,6 +171,8 @@ static int __init tipc_init(void) pr_info("Started in single node mode\n"); return 0; out_bearer: + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); +out_register_pernet_subsys: unregister_pernet_device(&tipc_topsrv_net_ops); out_pernet_topsrv: tipc_socket_stop(); @@ -177,6 +192,7 @@ static int __init tipc_init(void) static void __exit tipc_exit(void) { tipc_bearer_cleanup(); + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); unregister_pernet_device(&tipc_topsrv_net_ops); tipc_socket_stop(); unregister_pernet_device(&tipc_net_ops); diff --git a/net/tipc/core.h b/net/tipc/core.h index 60d829581068..8776d32a4a47 
100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -59,6 +59,7 @@ #include <net/netns/generic.h> #include <linux/rhashtable.h> #include <net/genetlink.h> +#include <net/netns/hash.h> struct tipc_node; struct tipc_bearer; @@ -185,6 +186,11 @@ static inline int in_range(u16 val, u16 min, u16 max) return !less(val, min) && !more(val, max); } +static inline u32 tipc_net_hash_mixes(struct net *net, int tn_rand) +{ + return net_hash_mix(&init_net) ^ net_hash_mix(net) ^ tn_rand; +} + #ifdef CONFIG_SYSCTL int tipc_register_sysctl(void); void tipc_unregister_sysctl(void); diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..b043e8c6397a 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -94,6 +94,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + msg_set_peer_net_hash(hdr, tipc_net_hash_mixes(net, tn->random)); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +243,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..2d7cb66a6912 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1026,6 +1026,20 @@ static inline bool msg_is_reset(struct tipc_msg *hdr) return (msg_user(hdr) == LINK_PROTOCOL) && (msg_type(hdr) == RESET_MSG); } +/* Word 13 + */ +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 13, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 13); +} + +/* Word 14 + */ static inline u32 msg_sugg_node_addr(struct tipc_msg *m) { return msg_word(m, 14); diff --git 
a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index f2e3cf70c922..4b60928049ea 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -126,6 +126,8 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *peer_net; + u32 peer_hash_mix; }; /* Node FSM states and events: @@ -184,7 +186,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +196,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->peer_net && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -360,8 +370,37 @@ static void tipc_node_write_unlock(struct tipc_node *n) } } +static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) +{ + int net_id = tipc_netid(n->net); + struct tipc_net *tn_peer; + struct net *tmp; + u32 hash_chk; + + if (n->peer_net) + return; + + for_each_net_rcu(tmp) { + tn_peer = tipc_net(tmp); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id 
!= net_id) + continue; + if (memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) + continue; + hash_chk = tipc_net_hash_mixes(tmp, tn_peer->random); + if (hash_mixes ^ hash_chk) + continue; + n->peer_net = tmp; + n->peer_hash_mix = hash_mixes; + break; + } +} + static struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; @@ -372,6 +411,8 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, spin_lock_bh(&tn->node_list_lock); n = tipc_node_find(net, addr); if (n) { + if (n->peer_hash_mix ^ hash_mixes) + tipc_node_assign_peer_net(n, hash_mixes); if (n->capabilities == capabilities) goto exit; /* Same node may come back with new capabilities */ @@ -389,6 +430,7 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, list_for_each_entry_rcu(temp_node, &tn->node_list, list) { tn->capabilities &= temp_node->capabilities; } + goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -399,6 +441,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, n->addr = addr; memcpy(&n->peer_id, peer_id, 16); n->net = net; + n->peer_net = NULL; + n->peer_hash_mix = 0; + /* Assign kernel local namespace if exists */ + tipc_node_assign_peer_net(n, hash_mixes); n->capabilities = capabilities; kref_init(&n->kref); rwlock_init(&n->lock); @@ -426,6 +472,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_bc_sndlink(net), &n->bc_entry.link)) { pr_warn("Broadcast rcv link creation failed, no memory\n"); + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } kfree(n); n = NULL; goto exit; @@ -979,7 +1029,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 
signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1048,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1343,6 +1394,10 @@ static void node_lost_contact(struct tipc_node *n, /* Notify publications from this node */ n->action_flags |= TIPC_NOTIFY_NODE_DOWN; + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } /* Notify sockets connected to node */ list_for_each_entry_safe(conn, safe, conns, list) { skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG, @@ -1424,6 +1479,56 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *peer_net, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + tipc_loopback_trace(peer_net, list); + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + } + if (msg_mcast(hdr)) { + tipc_loopback_trace(peer_net, list); + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + tipc_loopback_trace(peer_net, list); + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + tipc_loopback_trace(peer_net, list); + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: 
+ case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1544,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1456,6 +1562,17 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, } tipc_node_read_lock(n); + node_up = node_is_up(n); + if (node_up && n->peer_net && check_net(n->peer_net)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->peer_net, list); + if (likely(skb_queue_empty(list))) { + tipc_node_read_unlock(n); + tipc_node_put(n); + return 0; + } + } + bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { tipc_node_read_unlock(n); @@ -2587,3 +2704,33 @@ int tipc_node_dump(struct tipc_node *n, bool more, char *buf) return i; } + +void tipc_node_pre_cleanup_net(struct net *exit_net) +{ + struct tipc_node *n; + struct tipc_net *tn; + struct net *tmp; + + rcu_read_lock(); + for_each_net_rcu(tmp) { + if (tmp == exit_net) + continue; + tn = tipc_net(tmp); + if (!tn) + continue; + spin_lock_bh(&tn->node_list_lock); + list_for_each_entry_rcu(n, &tn->node_list, list) { + if (!n->peer_net) + continue; + if (n->peer_net != exit_net) + continue; + tipc_node_write_lock(n); + n->peer_net = NULL; + n->peer_hash_mix = 0; + tipc_node_write_unlock_fast(n); + break; + } + spin_unlock_bh(&tn->node_list_lock); + } + rcu_read_unlock(); +} diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..30563c4f35d5 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 
signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); @@ -107,4 +107,5 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info); int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb); int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, struct netlink_callback *cb); +void tipc_node_pre_cleanup_net(struct net *exit_net); #endif diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ffc2b90..2bcacd6022d5 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static 
void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
From: Hoang Le <hoa...@de...> - 2019-10-28 04:46:23
|
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency.

We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network namespace pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive path through the receiving node.

This technique gives us a throughput similar to the node-internal throughput, several times larger than if we let the traffic go through the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic.

To meet any security concerns, the following should be noted.

- All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto-discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel-local node has no other way to join a cluster than any other node, and has to obey policies set in the IP or device layers of the stack.

- Only when a sender has established with 100% certainty that the peer node is located in a kernel-local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace.

- If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone.
- To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces' hash_mix:es. If it finds a match, this, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened.

- We should also consider that TIPC is intended to be a cluster-local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can be justified to allow it to shortcut the lower protocol layers.

Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node-internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-namespace messaging can easily be tracked.
v2: - update 'net' pointer when node left/rejoined v3: - grab read/write lock when using node ref obj Suggested-by: Jon Maloy <jon...@er...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 16 +++++ net/tipc/core.h | 6 ++ net/tipc/discover.c | 4 +- net/tipc/msg.h | 14 ++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 151 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 5 +- net/tipc/socket.c | 6 +- 8 files changed, 193 insertions(+), 11 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 23cb379a93d6..ab648dd150ee 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -105,6 +105,15 @@ static void __net_exit tipc_exit_net(struct net *net) tipc_sk_rht_destroy(net); } +static void __net_exit tipc_pernet_pre_exit(struct net *net) +{ + tipc_node_pre_cleanup_net(net); +} + +static struct pernet_operations tipc_pernet_pre_exit_ops = { + .pre_exit = tipc_pernet_pre_exit, +}; + static struct pernet_operations tipc_net_ops = { .init = tipc_init_net, .exit = tipc_exit_net, @@ -151,6 +160,10 @@ static int __init tipc_init(void) if (err) goto out_pernet_topsrv; + err = register_pernet_subsys(&tipc_pernet_pre_exit_ops); + if (err) + goto out_register_pernet_subsys; + err = tipc_bearer_setup(); if (err) goto out_bearer; @@ -158,6 +171,8 @@ static int __init tipc_init(void) pr_info("Started in single node mode\n"); return 0; out_bearer: + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); +out_register_pernet_subsys: unregister_pernet_device(&tipc_topsrv_net_ops); out_pernet_topsrv: tipc_socket_stop(); @@ -177,6 +192,7 @@ static int __init tipc_init(void) static void __exit tipc_exit(void) { tipc_bearer_cleanup(); + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); unregister_pernet_device(&tipc_topsrv_net_ops); tipc_socket_stop(); unregister_pernet_device(&tipc_net_ops); diff --git a/net/tipc/core.h b/net/tipc/core.h index 60d829581068..8776d32a4a47 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h 
@@ -59,6 +59,7 @@ #include <net/netns/generic.h> #include <linux/rhashtable.h> #include <net/genetlink.h> +#include <net/netns/hash.h> struct tipc_node; struct tipc_bearer; @@ -185,6 +186,11 @@ static inline int in_range(u16 val, u16 min, u16 max) return !less(val, min) && !more(val, max); } +static inline u32 tipc_net_hash_mixes(struct net *net, int tn_rand) +{ + return net_hash_mix(&init_net) ^ net_hash_mix(net) ^ tn_rand; +} + #ifdef CONFIG_SYSCTL int tipc_register_sysctl(void); void tipc_unregister_sysctl(void); diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..b043e8c6397a 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -94,6 +94,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + msg_set_peer_net_hash(hdr, tipc_net_hash_mixes(net, tn->random)); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +243,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..2d7cb66a6912 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1026,6 +1026,20 @@ static inline bool msg_is_reset(struct tipc_msg *hdr) return (msg_user(hdr) == LINK_PROTOCOL) && (msg_type(hdr) == RESET_MSG); } +/* Word 13 + */ +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 13, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 13); +} + +/* Word 14 + */ static inline u32 msg_sugg_node_addr(struct tipc_msg *m) { return msg_word(m, 14); diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 
836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index f2e3cf70c922..62a636a09fe7 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -126,6 +126,8 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *peer_net; + u32 peer_hash_mix; }; /* Node FSM states and events: @@ -184,7 +186,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +196,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->peer_net && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -360,8 +370,37 @@ static void tipc_node_write_unlock(struct tipc_node *n) } } +static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) +{ + int net_id = tipc_netid(n->net); + struct tipc_net *tn_peer; + struct net *tmp; + u32 hash_chk; + + if (n->peer_net) + return; + + for_each_net_rcu(tmp) { + tn_peer = tipc_net(tmp); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id != net_id) + continue; + if (memcmp(n->peer_id, 
tn_peer->node_id, NODE_ID_LEN)) + continue; + hash_chk = tipc_net_hash_mixes(tmp, tn_peer->random); + if (hash_mixes ^ hash_chk) + continue; + n->peer_net = tmp; + n->peer_hash_mix = hash_mixes; + break; + } +} + static struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; @@ -372,6 +411,8 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, spin_lock_bh(&tn->node_list_lock); n = tipc_node_find(net, addr); if (n) { + if (n->peer_hash_mix ^ hash_mixes) + tipc_node_assign_peer_net(n, hash_mixes); if (n->capabilities == capabilities) goto exit; /* Same node may come back with new capabilities */ @@ -389,6 +430,7 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, list_for_each_entry_rcu(temp_node, &tn->node_list, list) { tn->capabilities &= temp_node->capabilities; } + goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -399,6 +441,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, n->addr = addr; memcpy(&n->peer_id, peer_id, 16); n->net = net; + n->peer_net = NULL; + n->peer_hash_mix = 0; + /* Assign kernel local namespace if exists */ + tipc_node_assign_peer_net(n, hash_mixes); n->capabilities = capabilities; kref_init(&n->kref); rwlock_init(&n->lock); @@ -426,6 +472,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_bc_sndlink(net), &n->bc_entry.link)) { pr_warn("Broadcast rcv link creation failed, no memory\n"); + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } kfree(n); n = NULL; goto exit; @@ -979,7 +1029,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct 
tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1048,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1343,6 +1394,10 @@ static void node_lost_contact(struct tipc_node *n, /* Notify publications from this node */ n->action_flags |= TIPC_NOTIFY_NODE_DOWN; + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } /* Notify sockets connected to node */ list_for_each_entry_safe(conn, safe, conns, list) { skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG, @@ -1424,6 +1479,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *peer_net, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + } + if (msg_mcast(hdr)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: + case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable 
net namespace @@ -1439,6 +1540,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1456,6 +1558,17 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, } tipc_node_read_lock(n); + node_up = node_is_up(n); + if (node_up && n->peer_net && check_net(n->peer_net)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->peer_net, list); + if (likely(skb_queue_empty(list))) { + tipc_node_read_unlock(n); + tipc_node_put(n); + return 0; + } + } + bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { tipc_node_read_unlock(n); @@ -2587,3 +2700,33 @@ int tipc_node_dump(struct tipc_node *n, bool more, char *buf) return i; } + +void tipc_node_pre_cleanup_net(struct net *exit_net) +{ + struct tipc_node *n; + struct tipc_net *tn; + struct net *tmp; + + rcu_read_lock(); + for_each_net_rcu(tmp) { + if (tmp == exit_net) + continue; + tn = tipc_net(tmp); + if (!tn) + continue; + spin_lock_bh(&tn->node_list_lock); + list_for_each_entry_rcu(n, &tn->node_list, list) { + if (!n->peer_net) + continue; + if (n->peer_net != exit_net) + continue; + tipc_node_write_lock(n); + n->peer_net = NULL; + n->peer_hash_mix = 0; + tipc_node_write_unlock_fast(n); + break; + } + spin_unlock_bh(&tn->node_list_lock); + } + rcu_read_unlock(); +} diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..30563c4f35d5 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 
+92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); @@ -107,4 +107,5 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info); int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb); int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, struct netlink_callback *cb); +void tipc_node_pre_cleanup_net(struct net *exit_net); #endif diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ffc2b90..2bcacd6022d5 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, 
peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
From: Jon M. <jon...@er...> - 2019-10-24 16:28:40
Hi Ying,

1) TIPC_NODELAY might be a good option, although I fear people might misuse it in the belief that TIPC nagle has the same disadvantages as TCP nagle, which it doesn't. But ok, I'll add it.

2) CONN_PROBE/CONN_PROBE_REPLY are not considered simply because they are so rare (once per hour) that they won't make any difference.

3) We don't really have any tools to measure this. The latency measurement in our benchmark tool never triggers nagle mode, so we won't notice any difference. When nagle is enabled, we can only measure latency per direction, not round-trip delay (since there is no return message), but logically it works as follows:

Scenario 1:
1) Socket goes to nagle mode. The message triggering this is not bundled, but just sent out with the 'resp_req' bit set.
2) A number of messages, and possibly several skbs, are added to the queue.
3) The ACK_MSG (response to msg 1) arrives after 1 RTT, and the accumulated messages are sent.
So, the first message, probably added just after the 'resp_req' message was sent, might have a delay of up to one RTT. The remaining messages in the queue will have a lower delay, and notably a message added just before the ACK_MSG arrives will have almost no delay.

Scenario 2:
1) Socket is in nagle mode, and a number of messages are being accumulated. The last message in the queue always has the resp_req bit set.
2) The queue surpasses 64k, or a message larger than 'maxnagle' is being sent, so the whole send queue is sent out.
3) Obviously we didn't wait for the expected MSG_ACK in this case, so the delay for all messages is less than 1 RTT.

It remains to know the size of the RTT, but in the type of clusters we are running this is rarely more than a few milliseconds, at most. This is in contrast to TCP, where the delay may be several hundred milliseconds.
///jon > -----Original Message----- > From: Xue, Ying <Yin...@wi...> > Sent: 24-Oct-19 11:28 > To: Jon Maloy <jon...@er...>; Jon Maloy > <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy > <moh...@er...>; > par...@gm...; Tung Quang Nguyen > <tun...@de...>; Hoang Huu Le > <hoa...@de...>; Tuong Tong Lien > <tuo...@de...>; Gordan Mihaljevic > <gor...@de...>; tip...@li... > Subject: RE: [net-next v2 1/1] tipc: add smart nagle feature > > Hi Jon, > > We have the following comments: > - Please consider to add TIPC_NODELAY option of tipc_setsockopt() so that > user has right to disable nagle algorithm. > - I don't understand why we don't transmit the accumulated contents of the > write queue when a CONN_PROBE message is received from the peer. Can > you please explain it? > - I am just curious what impact the nagle feature has on latency for > SOCK_STREAM socket. Did you ever measure latency after nagle feature is > enabled? > > Thanks, > Ying > > -----Original Message----- > From: Jon Maloy [mailto:jon...@er...] > Sent: Wednesday, October 23, 2019 3:53 PM > To: Jon Maloy; Jon Maloy > Cc: moh...@er...; > par...@gm...; tun...@de...; > hoa...@de...; tuo...@de...; > gor...@de...; Xue, Ying; tipc- > dis...@li... > Subject: [net-next v2 1/1] tipc: add smart nagle feature > > We introduce a Nagle-like algorithm for bundling small messages at the socket > level. > > - A socket enters nagle mode when more than 4 messages have been sent > out without receiving any data message from the peer. > - A socket leaves nagle mode whenever it receives a data message from > the peer. > > In nagle mode, small messages are accumulated in the socket write queue. > The last buffer in the queue is marked with a new 'ack_required' bit, which > forces the receiving peer to send a CONN_ACK message back to the sender. > > The accumulated contents of the write queue is transmitted when one of the > following events or conditions occur. > > - A CONN_ACK message is received from the peer. 
> - A data message is received from the peer. > - A SOCK_WAKEUP pseudo message is received from the link level. > - The write queue contains more than 64 1k blocks of data. > - The connection is being shut down. > - There is no CONN_ACK message to expect. I.e., there is currently > no outstanding message where the 'ack_required' bit was set. As a > consequence, the first message added after we enter nagle mode > is always sent directly with this bit set. > > This new feature gives a 50-100% improvement of throughput for small (i.e., > less than MTU size) messages, while it might add up to one RTT to latency time > when the socket is in nagle mode. > > Signed-off-by: Jon Maloy <jon...@er...> > > --- > v2: Increased max nagle size for UDP to 14k. This improves > throughput for messages 750-1500 bytes with ~50%. > --- > net/tipc/msg.c | 53 ++++++++++++++++++++++++++++++++ > net/tipc/msg.h | 12 ++++++++ > net/tipc/node.h | 7 +++-- > net/tipc/socket.c | 91 > +++++++++++++++++++++++++++++++++++++++++++++---------- > 4 files changed, 145 insertions(+), 18 deletions(-) > > diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 922d262..973795a 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c > @@ -190,6 +190,59 @@ int tipc_buf_append(struct sk_buff **headbuf, > struct sk_buff **buf) > return 0; > } > > +/** > + * tipc_msg_append(): Append data to tail of an existing buffer queue > + * @hdr: header to be used > + * @m: the data to be appended > + * @mss: max allowable size of buffer > + * @dlen: size of data to be appended > + * @txq: queue to appand to > + * Returns the number og 1k blocks appended or errno value */ int > +tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen, > + int mss, struct sk_buff_head *txq) { > + struct sk_buff *skb, *prev; > + int accounted, total, curr; > + int mlen, cpy, rem = dlen; > + struct tipc_msg *hdr; > + > + skb = skb_peek_tail(txq); > + accounted = skb ? 
msg_blocks(buf_msg(skb)) : 0; > + total = accounted; > + > + while (rem) { > + if (!skb || skb->len >= mss) { > + prev = skb; > + skb = tipc_buf_acquire(mss, GFP_KERNEL); > + if (unlikely(!skb)) > + return -ENOMEM; > + skb_orphan(skb); > + skb_trim(skb, MIN_H_SIZE); > + hdr = buf_msg(skb); > + skb_copy_to_linear_data(skb, _hdr, MIN_H_SIZE); > + msg_set_hdr_sz(hdr, MIN_H_SIZE); > + msg_set_size(hdr, MIN_H_SIZE); > + __skb_queue_tail(txq, skb); > + total += 1; > + if (prev) > + msg_set_ack_required(buf_msg(prev), 0); > + msg_set_ack_required(hdr, 1); > + } > + hdr = buf_msg(skb); > + curr = msg_blocks(hdr); > + mlen = msg_size(hdr); > + cpy = min_t(int, rem, mss - mlen); > + if (cpy != copy_from_iter(skb->data + mlen, cpy, &m->msg_iter)) > + return -EFAULT; > + msg_set_size(hdr, mlen + cpy); > + skb_put(skb, cpy); > + rem -= cpy; > + total += msg_blocks(hdr) - curr; > + } > + return total - accounted; > +} > + > /* tipc_msg_validate - validate basic format of received message > * > * This routine ensures a TIPC message has an acceptable header, and at least > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f0..b85b85a 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -290,6 +290,16 @@ static inline void msg_set_src_droppable(struct > tipc_msg *m, u32 d) > msg_set_bits(m, 0, 18, 1, d); > } > > +static inline int msg_ack_required(struct tipc_msg *m) { > + return msg_bits(m, 0, 18, 1); > +} > + > +static inline void msg_set_ack_required(struct tipc_msg *m, u32 d) { > + msg_set_bits(m, 0, 18, 1, d); > +} > + > static inline bool msg_is_rcast(struct tipc_msg *m) { > return msg_bits(m, 0, 18, 0x1); > @@ -1065,6 +1075,8 @@ int tipc_msg_fragment(struct sk_buff *skb, const > struct tipc_msg *hdr, > int pktmax, struct sk_buff_head *frags); int > tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, > int offset, int dsz, int mtu, struct sk_buff_head *list); > +int tipc_msg_append(struct tipc_msg *hdr, struct msghdr *m, int dlen, > + int mss, struct 
sk_buff_head *txq); > bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err); > bool tipc_msg_assemble(struct sk_buff_head *list); bool > tipc_msg_reassemble(struct sk_buff_head *list, struct sk_buff_head *rcvq); > diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ec..b9036f28 > 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -54,7 +54,8 @@ enum { > TIPC_LINK_PROTO_SEQNO = (1 << 6), > TIPC_MCAST_RBCTL = (1 << 7), > TIPC_GAP_ACK_BLOCK = (1 << 8), > - TIPC_TUNNEL_ENHANCED = (1 << 9) > + TIPC_TUNNEL_ENHANCED = (1 << 9), > + TIPC_NAGLE = (1 << 10) > }; > > #define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT | \ > @@ -66,7 +67,9 @@ enum { > TIPC_LINK_PROTO_SEQNO | \ > TIPC_MCAST_RBCTL | \ > TIPC_GAP_ACK_BLOCK | \ > - TIPC_TUNNEL_ENHANCED) > + TIPC_TUNNEL_ENHANCED | \ > + TIPC_NAGLE) > + > #define INVALID_BEARER_ID -1 > > void tipc_node_stop(struct net *net); > diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ff..1594a50 > 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -75,6 +75,7 @@ struct sockaddr_pair { > * @conn_instance: TIPC instance used when connection was established > * @published: non-zero if port has one or more associated names > * @max_pkt: maximum packet size "hint" used when building messages sent > by port > + * @maxnagle: maximum size of mmsg subject to nagle > * @portid: unique port identity in TIPC socket hash table > * @phdr: preformatted message header used when sending messages > * #cong_links: list of congested links @@ -97,6 +98,7 @@ struct tipc_sock { > u32 conn_instance; > int published; > u32 max_pkt; > + u32 maxnagle; > u32 portid; > struct tipc_msg phdr; > struct list_head cong_links; > @@ -116,6 +118,9 @@ struct tipc_sock { > struct tipc_mc_method mc_method; > struct rcu_head rcu; > struct tipc_group *group; > + u32 oneway; > + u16 snd_backlog; > + bool expect_ack; > bool group_is_open; > }; > > @@ -137,6 +142,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk); static > void 
tipc_sk_remove(struct tipc_sock *tsk); static int > __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz); static > int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz); > +static void tipc_sk_push_backlog(struct tipc_sock *tsk); > > static const struct proto_ops packet_ops; static const struct proto_ops > stream_ops; @@ -446,6 +452,7 @@ static int tipc_sk_create(struct net *net, > struct socket *sock, > > tsk = tipc_sk(sk); > tsk->max_pkt = MAX_PKT_DEFAULT; > + tsk->maxnagle = MAX_PKT_DEFAULT; > INIT_LIST_HEAD(&tsk->publications); > INIT_LIST_HEAD(&tsk->cong_links); > msg = &tsk->phdr; > @@ -512,8 +519,12 @@ static void __tipc_shutdown(struct socket *sock, int > error) > tipc_wait_for_cond(sock, &timeout, (!tsk->cong_link_cnt && > !tsk_conn_cong(tsk))); > > - /* Remove any pending SYN message */ > - __skb_queue_purge(&sk->sk_write_queue); > + /* Push out unsent messages or remove if pending SYN */ > + skb = skb_peek(&sk->sk_write_queue); > + if (skb && !msg_is_syn(buf_msg(skb))) > + tipc_sk_push_backlog(tsk); > + else > + __skb_queue_purge(&sk->sk_write_queue); > > /* Reject all unreceived messages, except on an active connection > * (which disconnects locally & sends a 'FIN+' to peer). 
> @@ -1208,6 +1219,27 @@ void tipc_sk_mcast_rcv(struct net *net, struct > sk_buff_head *arrvq, > tipc_sk_rcv(net, inputq); > } > > +/* tipc_sk_push_backlog(): send accumulated buffers in socket write queue > + * when socket is in Nagle mode > + */ > +static void tipc_sk_push_backlog(struct tipc_sock *tsk) { > + struct sk_buff_head *txq = &tsk->sk.sk_write_queue; > + struct net *net = sock_net(&tsk->sk); > + u32 dnode = tsk_peer_node(tsk); > + int rc; > + > + if (skb_queue_empty(txq) || tsk->cong_link_cnt) > + return; > + > + tsk->snt_unacked += tsk->snd_backlog; > + tsk->snd_backlog = 0; > + tsk->expect_ack = true; > + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > + if (rc == -ELINKCONG) > + tsk->cong_link_cnt = 1; > +} > + > /** > * tipc_sk_conn_proto_rcv - receive a connection mng protocol message > * @tsk: receiving socket > @@ -1221,7 +1253,7 @@ static void tipc_sk_conn_proto_rcv(struct > tipc_sock *tsk, struct sk_buff *skb, > u32 onode = tsk_own_node(tsk); > struct sock *sk = &tsk->sk; > int mtyp = msg_type(hdr); > - bool conn_cong; > + bool was_cong; > > /* Ignore if connection cannot be validated: */ > if (!tsk_peer_msg(tsk, hdr)) { > @@ -1254,11 +1286,13 @@ static void tipc_sk_conn_proto_rcv(struct > tipc_sock *tsk, struct sk_buff *skb, > __skb_queue_tail(xmitq, skb); > return; > } else if (mtyp == CONN_ACK) { > - conn_cong = tsk_conn_cong(tsk); > + was_cong = tsk_conn_cong(tsk); > + tsk->expect_ack = false; > + tipc_sk_push_backlog(tsk); > tsk->snt_unacked -= msg_conn_ack(hdr); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > tsk->snd_win = msg_adv_win(hdr); > - if (conn_cong) > + if (was_cong && !tsk_conn_cong(tsk)) > sk->sk_write_space(sk); > } else if (mtyp != CONN_PROBE_REPLY) { > pr_warn("Received unknown CONN_PROTO msg\n"); @@ - > 1437,16 +1471,17 @@ static int __tipc_sendstream(struct socket *sock, > struct msghdr *m, size_t dlen) > struct sock *sk = sock->sk; > DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name); > long timeout = 
sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT); > + struct sk_buff_head *txq = &sk->sk_write_queue; > struct tipc_sock *tsk = tipc_sk(sk); > struct tipc_msg *hdr = &tsk->phdr; > struct net *net = sock_net(sk); > - struct sk_buff_head pkts; > u32 dnode = tsk_peer_node(tsk); > + int blocks = tsk->snd_backlog; > + int maxnagle = tsk->maxnagle; > + int maxpkt = tsk->max_pkt; > int send, sent = 0; > int rc = 0; > > - __skb_queue_head_init(&pkts); > - > if (unlikely(dlen > INT_MAX)) > return -EMSGSIZE; > > @@ -1467,21 +1502,38 @@ static int __tipc_sendstream(struct socket > *sock, struct msghdr *m, size_t dlen) > tipc_sk_connected(sk))); > if (unlikely(rc)) > break; > - > send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); > - rc = tipc_msg_build(hdr, m, sent, send, tsk->max_pkt, &pkts); > - if (unlikely(rc != send)) > - break; > > - trace_tipc_sk_sendstream(sk, skb_peek(&pkts), > + if (tsk->oneway++ >= 4 && > + send <= maxnagle && > + tsk->peer_caps & TIPC_NAGLE && > + sock->type == SOCK_STREAM) { > + rc = tipc_msg_append(hdr, m, send, maxnagle, txq); > + if (rc < 0) > + break; > + blocks += rc; > + if (blocks <= 64 && tsk->expect_ack) { > + tsk->snd_backlog = blocks; > + sent += send; > + break; > + } > + tsk->expect_ack = true; > + } else { > + rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq); > + if (unlikely(rc != send)) > + break; > + blocks += tsk_inc(tsk, send + MIN_H_SIZE); > + } > + trace_tipc_sk_sendstream(sk, skb_peek(txq), > TIPC_DUMP_SK_SNDQ, " "); > - rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); > + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > if (unlikely(rc == -ELINKCONG)) { > tsk->cong_link_cnt = 1; > rc = 0; > } > if (likely(!rc)) { > - tsk->snt_unacked += tsk_inc(tsk, send + MIN_H_SIZE); > + tsk->snt_unacked += blocks; > + tsk->snd_backlog = 0; > sent += send; > } > } while (sent < dlen && !rc); > @@ -1527,6 +1579,7 @@ static void tipc_sk_finish_conn(struct tipc_sock > *tsk, u32 peer_port, > tipc_set_sk_state(sk, 
TIPC_ESTABLISHED); > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > + tsk->maxnagle = tsk->max_pkt == MAX_MSG_SIZE ? 1500 : tsk- > >max_pkt; > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) @@ -1848,6 +1901,7 > @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, > bool peek = flags & MSG_PEEK; > int offset, required, copy, copied = 0; > int hlen, dlen, err, rc; > + bool ack = false; > long timeout; > > /* Catch invalid receive attempts */ > @@ -1892,6 +1946,7 @@ static int tipc_recvstream(struct socket *sock, > struct msghdr *m, > > /* Copy data if msg ok, otherwise return error/partial data */ > if (likely(!err)) { > + ack = msg_ack_required(hdr); > offset = skb_cb->bytes_read; > copy = min_t(int, dlen - offset, buflen - copied); > rc = skb_copy_datagram_msg(skb, hlen + offset, m, copy); > @@ -1919,7 +1974,7 @@ static int tipc_recvstream(struct socket *sock, > struct msghdr *m, > > /* Send connection flow control advertisement when applicable > */ > tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen); > - if (unlikely(tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)) > + if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) > tipc_sk_send_ack(tsk); > > /* Exit if all requested data or FIN/error received */ @@ -1990,6 > +2045,7 @@ static void tipc_sk_proto_rcv(struct sock *sk, > smp_wmb(); > tsk->cong_link_cnt--; > wakeup = true; > + tipc_sk_push_backlog(tsk); > break; > case GROUP_PROTOCOL: > tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq); @@ - > 2029,6 +2085,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, > struct sk_buff *skb) > > if (unlikely(msg_mcast(hdr))) > return false; > + tsk->oneway = 0; > > switch (sk->sk_state) { > case TIPC_CONNECTING: > @@ -2074,6 +2131,8 @@ static bool tipc_sk_filter_connect(struct tipc_sock > *tsk, struct sk_buff 
*skb) > return true; > return false; > case TIPC_ESTABLISHED: > + if (!skb_queue_empty(&sk->sk_write_queue)) > + tipc_sk_push_backlog(tsk); > /* Accept only connection-based messages sent by peer */ > if (likely(con_msg && !err && pport == oport && pnode == > onode)) > return true; > -- > 2.1.4 |
From: Xue, Y. <Yin...@wi...> - 2019-10-24 15:28:40
Hi Jon, We have the following comments: - Please consider to add TIPC_NODELAY option of tipc_setsockopt() so that user has right to disable nagle algorithm. - I don't understand why we don't transmit the accumulated contents of the write queue when a CONN_PROBE message is received from the peer. Can you please explain it? - I am just curious what impact the nagle feature has on latency for SOCK_STREAM socket. Did you ever measure latency after nagle feature is enabled? Thanks, Ying -----Original Message----- From: Jon Maloy [mailto:jon...@er...] Sent: Wednesday, October 23, 2019 3:53 PM To: Jon Maloy; Jon Maloy Cc: moh...@er...; par...@gm...; tun...@de...; hoa...@de...; tuo...@de...; gor...@de...; Xue, Ying; tip...@li... Subject: [net-next v2 1/1] tipc: add smart nagle feature We introduce a Nagle-like algorithm for bundling small messages at the socket level. - A socket enters nagle mode when more than 4 messages have been sent out without receiving any data message from the peer. - A socket leaves nagle mode whenever it receives a data message from the peer. In nagle mode, small messages are accumulated in the socket write queue. The last buffer in the queue is marked with a new 'ack_required' bit, which forces the receiving peer to send a CONN_ACK message back to the sender. The accumulated contents of the write queue is transmitted when one of the following events or conditions occur. - A CONN_ACK message is received from the peer. - A data message is received from the peer. - A SOCK_WAKEUP pseudo message is received from the link level. - The write queue contains more than 64 1k blocks of data. - The connection is being shut down. - There is no CONN_ACK message to expect. I.e., there is currently no outstanding message where the 'ack_required' bit was set. As a consequence, the first message added after we enter nagle mode is always sent directly with this bit set. 
This new feature gives a 50-100% improvement in throughput for small (i.e., less than MTU size) messages, while it might add up to one RTT to latency time when the socket is in nagle mode. Signed-off-by: Jon Maloy <jon...@er...> --- v2: Increased max nagle size for UDP to 14k. This improves throughput for messages of 750-1500 bytes by ~50%. --- net/tipc/msg.c | 53 ++++++++++++++++++++++++++++++++ net/tipc/msg.h | 12 ++++++++ net/tipc/node.h | 7 +++-- net/tipc/socket.c | 91 +++++++++++++++++++++++++++++++++++++++++++++---------- 4 files changed, 145 insertions(+), 18 deletions(-) diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 922d262..973795a 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -190,6 +190,59 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf) return 0; } +/** + * tipc_msg_append(): Append data to tail of an existing buffer queue + * @hdr: header to be used + * @m: the data to be appended + * @mss: max allowable size of buffer + * @dlen: size of data to be appended + * @txq: queue to append to + * Returns the number of 1k blocks appended or errno value + */ +int tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen, + int mss, struct sk_buff_head *txq) +{ + struct sk_buff *skb, *prev; + int accounted, total, curr; + int mlen, cpy, rem = dlen; + struct tipc_msg *hdr; + + skb = skb_peek_tail(txq); + accounted = skb ? 
msg_blocks(buf_msg(skb)) : 0; + total = accounted; + + while (rem) { + if (!skb || skb->len >= mss) { + prev = skb; + skb = tipc_buf_acquire(mss, GFP_KERNEL); + if (unlikely(!skb)) + return -ENOMEM; + skb_orphan(skb); + skb_trim(skb, MIN_H_SIZE); + hdr = buf_msg(skb); + skb_copy_to_linear_data(skb, _hdr, MIN_H_SIZE); + msg_set_hdr_sz(hdr, MIN_H_SIZE); + msg_set_size(hdr, MIN_H_SIZE); + __skb_queue_tail(txq, skb); + total += 1; + if (prev) + msg_set_ack_required(buf_msg(prev), 0); + msg_set_ack_required(hdr, 1); + } + hdr = buf_msg(skb); + curr = msg_blocks(hdr); + mlen = msg_size(hdr); + cpy = min_t(int, rem, mss - mlen); + if (cpy != copy_from_iter(skb->data + mlen, cpy, &m->msg_iter)) + return -EFAULT; + msg_set_size(hdr, mlen + cpy); + skb_put(skb, cpy); + rem -= cpy; + total += msg_blocks(hdr) - curr; + } + return total - accounted; +} + /* tipc_msg_validate - validate basic format of received message * * This routine ensures a TIPC message has an acceptable header, and at least diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f0..b85b85a 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -290,6 +290,16 @@ static inline void msg_set_src_droppable(struct tipc_msg *m, u32 d) msg_set_bits(m, 0, 18, 1, d); } +static inline int msg_ack_required(struct tipc_msg *m) +{ + return msg_bits(m, 0, 18, 1); +} + +static inline void msg_set_ack_required(struct tipc_msg *m, u32 d) +{ + msg_set_bits(m, 0, 18, 1, d); +} + static inline bool msg_is_rcast(struct tipc_msg *m) { return msg_bits(m, 0, 18, 0x1); @@ -1065,6 +1075,8 @@ int tipc_msg_fragment(struct sk_buff *skb, const struct tipc_msg *hdr, int pktmax, struct sk_buff_head *frags); int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset, int dsz, int mtu, struct sk_buff_head *list); +int tipc_msg_append(struct tipc_msg *hdr, struct msghdr *m, int dlen, + int mss, struct sk_buff_head *txq); bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err); bool tipc_msg_assemble(struct 
sk_buff_head *list); bool tipc_msg_reassemble(struct sk_buff_head *list, struct sk_buff_head *rcvq); diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ec..b9036f28 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -54,7 +54,8 @@ enum { TIPC_LINK_PROTO_SEQNO = (1 << 6), TIPC_MCAST_RBCTL = (1 << 7), TIPC_GAP_ACK_BLOCK = (1 << 8), - TIPC_TUNNEL_ENHANCED = (1 << 9) + TIPC_TUNNEL_ENHANCED = (1 << 9), + TIPC_NAGLE = (1 << 10) }; #define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT | \ @@ -66,7 +67,9 @@ enum { TIPC_LINK_PROTO_SEQNO | \ TIPC_MCAST_RBCTL | \ TIPC_GAP_ACK_BLOCK | \ - TIPC_TUNNEL_ENHANCED) + TIPC_TUNNEL_ENHANCED | \ + TIPC_NAGLE) + #define INVALID_BEARER_ID -1 void tipc_node_stop(struct net *net); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ff..1594a50 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -75,6 +75,7 @@ struct sockaddr_pair { * @conn_instance: TIPC instance used when connection was established * @published: non-zero if port has one or more associated names * @max_pkt: maximum packet size "hint" used when building messages sent by port + * @maxnagle: maximum size of mmsg subject to nagle * @portid: unique port identity in TIPC socket hash table * @phdr: preformatted message header used when sending messages * #cong_links: list of congested links @@ -97,6 +98,7 @@ struct tipc_sock { u32 conn_instance; int published; u32 max_pkt; + u32 maxnagle; u32 portid; struct tipc_msg phdr; struct list_head cong_links; @@ -116,6 +118,9 @@ struct tipc_sock { struct tipc_mc_method mc_method; struct rcu_head rcu; struct tipc_group *group; + u32 oneway; + u16 snd_backlog; + bool expect_ack; bool group_is_open; }; @@ -137,6 +142,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk); static void tipc_sk_remove(struct tipc_sock *tsk); static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz); static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz); +static void tipc_sk_push_backlog(struct 
tipc_sock *tsk); static const struct proto_ops packet_ops; static const struct proto_ops stream_ops; @@ -446,6 +452,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, tsk = tipc_sk(sk); tsk->max_pkt = MAX_PKT_DEFAULT; + tsk->maxnagle = MAX_PKT_DEFAULT; INIT_LIST_HEAD(&tsk->publications); INIT_LIST_HEAD(&tsk->cong_links); msg = &tsk->phdr; @@ -512,8 +519,12 @@ static void __tipc_shutdown(struct socket *sock, int error) tipc_wait_for_cond(sock, &timeout, (!tsk->cong_link_cnt && !tsk_conn_cong(tsk))); - /* Remove any pending SYN message */ - __skb_queue_purge(&sk->sk_write_queue); + /* Push out unsent messages or remove if pending SYN */ + skb = skb_peek(&sk->sk_write_queue); + if (skb && !msg_is_syn(buf_msg(skb))) + tipc_sk_push_backlog(tsk); + else + __skb_queue_purge(&sk->sk_write_queue); /* Reject all unreceived messages, except on an active connection * (which disconnects locally & sends a 'FIN+' to peer). @@ -1208,6 +1219,27 @@ void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq, tipc_sk_rcv(net, inputq); } +/* tipc_sk_push_backlog(): send accumulated buffers in socket write queue + * when socket is in Nagle mode + */ +static void tipc_sk_push_backlog(struct tipc_sock *tsk) +{ + struct sk_buff_head *txq = &tsk->sk.sk_write_queue; + struct net *net = sock_net(&tsk->sk); + u32 dnode = tsk_peer_node(tsk); + int rc; + + if (skb_queue_empty(txq) || tsk->cong_link_cnt) + return; + + tsk->snt_unacked += tsk->snd_backlog; + tsk->snd_backlog = 0; + tsk->expect_ack = true; + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); + if (rc == -ELINKCONG) + tsk->cong_link_cnt = 1; +} + /** * tipc_sk_conn_proto_rcv - receive a connection mng protocol message * @tsk: receiving socket @@ -1221,7 +1253,7 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, u32 onode = tsk_own_node(tsk); struct sock *sk = &tsk->sk; int mtyp = msg_type(hdr); - bool conn_cong; + bool was_cong; /* Ignore if connection cannot be validated: */ 
if (!tsk_peer_msg(tsk, hdr)) { @@ -1254,11 +1286,13 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, __skb_queue_tail(xmitq, skb); return; } else if (mtyp == CONN_ACK) { - conn_cong = tsk_conn_cong(tsk); + was_cong = tsk_conn_cong(tsk); + tsk->expect_ack = false; + tipc_sk_push_backlog(tsk); tsk->snt_unacked -= msg_conn_ack(hdr); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) tsk->snd_win = msg_adv_win(hdr); - if (conn_cong) + if (was_cong && !tsk_conn_cong(tsk)) sk->sk_write_space(sk); } else if (mtyp != CONN_PROBE_REPLY) { pr_warn("Received unknown CONN_PROTO msg\n"); @@ -1437,16 +1471,17 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) struct sock *sk = sock->sk; DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name); long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT); + struct sk_buff_head *txq = &sk->sk_write_queue; struct tipc_sock *tsk = tipc_sk(sk); struct tipc_msg *hdr = &tsk->phdr; struct net *net = sock_net(sk); - struct sk_buff_head pkts; u32 dnode = tsk_peer_node(tsk); + int blocks = tsk->snd_backlog; + int maxnagle = tsk->maxnagle; + int maxpkt = tsk->max_pkt; int send, sent = 0; int rc = 0; - __skb_queue_head_init(&pkts); - if (unlikely(dlen > INT_MAX)) return -EMSGSIZE; @@ -1467,21 +1502,38 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) tipc_sk_connected(sk))); if (unlikely(rc)) break; - send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); - rc = tipc_msg_build(hdr, m, sent, send, tsk->max_pkt, &pkts); - if (unlikely(rc != send)) - break; - trace_tipc_sk_sendstream(sk, skb_peek(&pkts), + if (tsk->oneway++ >= 4 && + send <= maxnagle && + tsk->peer_caps & TIPC_NAGLE && + sock->type == SOCK_STREAM) { + rc = tipc_msg_append(hdr, m, send, maxnagle, txq); + if (rc < 0) + break; + blocks += rc; + if (blocks <= 64 && tsk->expect_ack) { + tsk->snd_backlog = blocks; + sent += send; + break; + } + tsk->expect_ack = true; + } else { + rc = 
tipc_msg_build(hdr, m, sent, send, maxpkt, txq); + if (unlikely(rc != send)) + break; + blocks += tsk_inc(tsk, send + MIN_H_SIZE); + } + trace_tipc_sk_sendstream(sk, skb_peek(txq), TIPC_DUMP_SK_SNDQ, " "); - rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); if (unlikely(rc == -ELINKCONG)) { tsk->cong_link_cnt = 1; rc = 0; } if (likely(!rc)) { - tsk->snt_unacked += tsk_inc(tsk, send + MIN_H_SIZE); + tsk->snt_unacked += blocks; + tsk->snd_backlog = 0; sent += send; } } while (sent < dlen && !rc); @@ -1527,6 +1579,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->maxnagle = tsk->max_pkt == MAX_MSG_SIZE ? 1500 : tsk->max_pkt; tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) @@ -1848,6 +1901,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, bool peek = flags & MSG_PEEK; int offset, required, copy, copied = 0; int hlen, dlen, err, rc; + bool ack = false; long timeout; /* Catch invalid receive attempts */ @@ -1892,6 +1946,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, /* Copy data if msg ok, otherwise return error/partial data */ if (likely(!err)) { + ack = msg_ack_required(hdr); offset = skb_cb->bytes_read; copy = min_t(int, dlen - offset, buflen - copied); rc = skb_copy_datagram_msg(skb, hlen + offset, m, copy); @@ -1919,7 +1974,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, /* Send connection flow control advertisement when applicable */ tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen); - if (unlikely(tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)) + if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) tipc_sk_send_ack(tsk); /* Exit if all requested data 
or FIN/error received */ @@ -1990,6 +2045,7 @@ static void tipc_sk_proto_rcv(struct sock *sk, smp_wmb(); tsk->cong_link_cnt--; wakeup = true; + tipc_sk_push_backlog(tsk); break; case GROUP_PROTOCOL: tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq); @@ -2029,6 +2085,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb) if (unlikely(msg_mcast(hdr))) return false; + tsk->oneway = 0; switch (sk->sk_state) { case TIPC_CONNECTING: @@ -2074,6 +2131,8 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb) return true; return false; case TIPC_ESTABLISHED: + if (!skb_queue_empty(&sk->sk_write_queue)) + tipc_sk_push_backlog(tsk); /* Accept only connection-based messages sent by peer */ if (likely(con_msg && !err && pport == oport && pnode == onode)) return true; -- 2.1.4 |
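[Editorial illustration] The accumulate-or-send rules in the commit message above can be condensed into a small user-space model. This is a sketch only: `struct nagle_sock` and the two helpers are simplified stand-ins for `struct tipc_sock` and the decision logic in `__tipc_sendstream()` from the patch, not actual kernel code; the thresholds (4 one-way messages, 64 blocks) are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the nagle-related fields of struct tipc_sock. */
struct nagle_sock {
	unsigned int oneway;      /* msgs sent without hearing from peer   */
	bool peer_supports_nagle; /* peer advertised the TIPC_NAGLE cap    */
	bool expect_ack;          /* an 'ack_required' msg is outstanding  */
	int  backlog_blocks;      /* 1k blocks accumulated in write queue  */
};

/* Mirrors the branch condition in __tipc_sendstream(): bundle only after
 * more than 4 one-way messages, only for small messages, and only if the
 * peer supports the feature. The counter increments on every send. */
static bool nagle_should_bundle(struct nagle_sock *s, int msg_len, int maxnagle)
{
	if (s->oneway++ < 4 || msg_len > maxnagle || !s->peer_supports_nagle)
		return false;
	return true;
}

/* After appending: hold the backlog only while it is small and an ACK is
 * still outstanding; otherwise transmit now with 'ack_required' set, so
 * the first message added in nagle mode always goes out directly. */
static bool nagle_should_hold(struct nagle_sock *s, int new_blocks)
{
	s->backlog_blocks += new_blocks;
	if (s->backlog_blocks <= 64 && s->expect_ack)
		return true;
	s->expect_ack = true;
	return false;
}
```

Walking a socket through these two helpers reproduces the behavior listed in the commit message: four immediate sends, then bundling, with the backlog flushed on the 64-block limit or when no ACK is pending.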
From: Hoang Le <hoa...@de...> - 2019-10-24 10:19:00
|
Currently, TIPC transports intra-node user data messages directly socket to socket, thereby shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network namespace pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with its counterpart in the receiving node/namespace, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node-internal throughput, several times larger than if we let the traffic go through the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto-discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel-local node has no other way to join a cluster than any other node, and has to obey the policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer node is located in a kernel-local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace.
- If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. 
- To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces' hash_mix values. If it finds a match, this, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened.
- We should also consider that TIPC is intended to be a cluster-local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it is justified to allow it to shortcut the lower protocol layers. Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it has been possible to follow the node-internal packet flow by just activating tcpdump on the loopback interface. This holds for this mechanism too; by activating tcpdump on the involved nodes' loopback interfaces, their inter-namespace messaging can easily be tracked. 
v2: - update 'net' pointer when node left/rejoined Suggested-by: Jon Maloy <jon...@er...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/core.c | 16 +++++ net/tipc/core.h | 6 ++ net/tipc/discover.c | 4 +- net/tipc/msg.h | 14 ++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 148 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 5 +- net/tipc/socket.c | 6 +- 8 files changed, 190 insertions(+), 11 deletions(-) diff --git a/net/tipc/core.c b/net/tipc/core.c index 23cb379a93d6..ab648dd150ee 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -105,6 +105,15 @@ static void __net_exit tipc_exit_net(struct net *net) tipc_sk_rht_destroy(net); } +static void __net_exit tipc_pernet_pre_exit(struct net *net) +{ + tipc_node_pre_cleanup_net(net); +} + +static struct pernet_operations tipc_pernet_pre_exit_ops = { + .pre_exit = tipc_pernet_pre_exit, +}; + static struct pernet_operations tipc_net_ops = { .init = tipc_init_net, .exit = tipc_exit_net, @@ -151,6 +160,10 @@ static int __init tipc_init(void) if (err) goto out_pernet_topsrv; + err = register_pernet_subsys(&tipc_pernet_pre_exit_ops); + if (err) + goto out_register_pernet_subsys; + err = tipc_bearer_setup(); if (err) goto out_bearer; @@ -158,6 +171,8 @@ static int __init tipc_init(void) pr_info("Started in single node mode\n"); return 0; out_bearer: + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); +out_register_pernet_subsys: unregister_pernet_device(&tipc_topsrv_net_ops); out_pernet_topsrv: tipc_socket_stop(); @@ -177,6 +192,7 @@ static int __init tipc_init(void) static void __exit tipc_exit(void) { tipc_bearer_cleanup(); + unregister_pernet_subsys(&tipc_pernet_pre_exit_ops); unregister_pernet_device(&tipc_topsrv_net_ops); tipc_socket_stop(); unregister_pernet_device(&tipc_net_ops); diff --git a/net/tipc/core.h b/net/tipc/core.h index 60d829581068..8776d32a4a47 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -59,6 +59,7 @@ #include <net/netns/generic.h> 
#include <linux/rhashtable.h> #include <net/genetlink.h> +#include <net/netns/hash.h> struct tipc_node; struct tipc_bearer; @@ -185,6 +186,11 @@ static inline int in_range(u16 val, u16 min, u16 max) return !less(val, min) && !more(val, max); } +static inline u32 tipc_net_hash_mixes(struct net *net, int tn_rand) +{ + return net_hash_mix(&init_net) ^ net_hash_mix(net) ^ tn_rand; +} + #ifdef CONFIG_SYSCTL int tipc_register_sysctl(void); void tipc_unregister_sysctl(void); diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..b043e8c6397a 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -94,6 +94,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + msg_set_peer_net_hash(hdr, tipc_net_hash_mixes(net, tn->random)); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +243,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..2d7cb66a6912 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1026,6 +1026,20 @@ static inline bool msg_is_reset(struct tipc_msg *hdr) return (msg_user(hdr) == LINK_PROTOCOL) && (msg_type(hdr) == RESET_MSG); } +/* Word 13 + */ +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 13, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 13); +} + +/* Word 14 + */ static inline u32 msg_sugg_node_addr(struct tipc_msg *m) { return msg_word(m, 14); diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- 
a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index f2e3cf70c922..cecb2fc3dc20 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -126,6 +126,8 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *peer_net; + u32 peer_hash_mix; }; /* Node FSM states and events: @@ -184,7 +186,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +196,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->peer_net && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -360,8 +370,37 @@ static void tipc_node_write_unlock(struct tipc_node *n) } } +static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) +{ + int net_id = tipc_netid(n->net); + struct tipc_net *tn_peer; + struct net *tmp; + u32 hash_chk; + + if (n->peer_net) + return; + + for_each_net_rcu(tmp) { + tn_peer = tipc_net(tmp); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id != net_id) + continue; + if (memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) + continue; 
+ hash_chk = tipc_net_hash_mixes(tmp, tn_peer->random); + if (hash_mixes ^ hash_chk) + continue; + n->peer_net = tmp; + n->peer_hash_mix = hash_mixes; + break; + } +} + static struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; @@ -372,6 +411,8 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, spin_lock_bh(&tn->node_list_lock); n = tipc_node_find(net, addr); if (n) { + if (n->peer_hash_mix ^ hash_mixes) + tipc_node_assign_peer_net(n, hash_mixes); if (n->capabilities == capabilities) goto exit; /* Same node may come back with new capabilities */ @@ -389,6 +430,7 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, list_for_each_entry_rcu(temp_node, &tn->node_list, list) { tn->capabilities &= temp_node->capabilities; } + goto exit; } n = kzalloc(sizeof(*n), GFP_ATOMIC); @@ -399,6 +441,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, n->addr = addr; memcpy(&n->peer_id, peer_id, 16); n->net = net; + n->peer_net = NULL; + n->peer_hash_mix = 0; + /* Assign kernel local namespace if exists */ + tipc_node_assign_peer_net(n, hash_mixes); n->capabilities = capabilities; kref_init(&n->kref); rwlock_init(&n->lock); @@ -426,6 +472,10 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, tipc_bc_sndlink(net), &n->bc_entry.link)) { pr_warn("Broadcast rcv link creation failed, no memory\n"); + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } kfree(n); n = NULL; goto exit; @@ -979,7 +1029,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { 
@@ -998,7 +1048,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1343,6 +1394,10 @@ static void node_lost_contact(struct tipc_node *n, /* Notify publications from this node */ n->action_flags |= TIPC_NOTIFY_NODE_DOWN; + if (n->peer_net) { + n->peer_net = NULL; + n->peer_hash_mix = 0; + } /* Notify sockets connected to node */ list_for_each_entry_safe(conn, safe, conns, list) { skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG, @@ -1424,6 +1479,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *peer_net, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + } + if (msg_mcast(hdr)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(peer_net, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + spin_lock_init(&list->lock); + tipc_sk_rcv(peer_net, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: + case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1540,7 @@ int 
tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1455,6 +1557,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, return -EHOSTUNREACH; } + node_up = node_is_up(n); + if (node_up && n->peer_net && check_net(n->peer_net)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->peer_net, list); + if (likely(skb_queue_empty(list))) { + tipc_node_put(n); + return 0; + } + } + tipc_node_read_lock(n); bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { @@ -2587,3 +2699,31 @@ int tipc_node_dump(struct tipc_node *n, bool more, char *buf) return i; } + +void tipc_node_pre_cleanup_net(struct net *exit_net) +{ + struct tipc_node *n; + struct tipc_net *tn; + struct net *tmp; + u32 hash_mix = net_hash_mix(exit_net); + + for_each_net_rcu(tmp) { + if (!(net_hash_mix(tmp) ^ hash_mix)) + continue; + tn = tipc_net(tmp); + if (!tn) + continue; + spin_lock_bh(&tn->node_list_lock); + list_for_each_entry_rcu(n, &tn->node_list, list) { + if (!n->peer_net) + continue; + if (net_hash_mix(n->peer_net) ^ hash_mix) + continue; + n->peer_net = NULL; + n->peer_hash_mix = 0; + break; + } + spin_unlock_bh(&tn->node_list_lock); + } +} + diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..30563c4f35d5 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void 
tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); @@ -107,4 +107,5 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info); int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb); int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, struct netlink_callback *cb); +void tipc_node_pre_cleanup_net(struct net *exit_net); #endif diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 35e32ffc2b90..2bcacd6022d5 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, 
tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
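[Editorial illustration] The hash-mix proof described in the commit message can be modeled in a few lines of user-space C. This is a hedged sketch: `struct toy_net`, `find_local_peer`, and the fake mix values are invented for illustration; only the XOR construction mirrors `tipc_net_hash_mixes()` from the patch (`net_hash_mix(&init_net) ^ net_hash_mix(net) ^ tn_rand`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a namespace: in the kernel, hash_mix comes from
 * net_hash_mix() and random from the per-namespace tipc_net state. */
struct toy_net {
	uint32_t hash_mix;
	uint32_t random;
};

/* Mirrors tipc_net_hash_mixes(): the sender publishes init_net's mix
 * XOR its own namespace's mix XOR a per-namespace random value. */
static uint32_t net_hash_mixes(uint32_t init_mix, const struct toy_net *n)
{
	return init_mix ^ n->hash_mix ^ n->random;
}

/* Receiver side of tipc_node_assign_peer_net(): scan all local
 * namespaces and see whether any could have produced the proof carried
 * in the discovery message. A match means the peer is kernel-local. */
static const struct toy_net *find_local_peer(uint32_t init_mix,
					     const struct toy_net *nets,
					     int n_nets, uint32_t proof)
{
	int i;

	for (i = 0; i < n_nets; i++)
		if (net_hash_mixes(init_mix, &nets[i]) == proof)
			return &nets[i];
	return NULL; /* no match: peer is not a kernel-local namespace */
}
```

The real code additionally requires matching net_id and node id before opening the wormhole; the sketch only shows the secret-proof step.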
From: Tuong L. <tuo...@de...> - 2019-10-23 08:21:58
|
When preparing tunnel packets for link failover or synchronization, the safe algorithm requires us to add a dummy packet on the pair link, which is never actually sent out. In the case of failover, the pair link will be reset anyway. But for link synching, it always results in retransmission of the dummy packet afterwards. We have also observed that such retransmissions at an early stage, when a new node joins a large cluster, take time and are hard to complete, leading to repeated retransmit failures and the link being reset. Since in commit 4929a932be33 ("tipc: optimize link synching mechanism") we already build a dummy 'TUNNEL_PROTOCOL' message on the new link for the synchronization, there is no need for the dummy on the pair link; this commit skips it when the new mechanism is in place. In case nothing exists in the pair link's transmq, the link synching will just start and stop shortly afterwards on the peer side. The patch is backward compatible. Acked-by: Jon Maloy <jon...@er...> Tested-by: Hoang Le <hoa...@de...> Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 999eab592de8..7e36b7ba61a9 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -1728,21 +1728,6 @@ void tipc_link_tnl_prepare(struct tipc_link *l, struct tipc_link *tnl, return; __skb_queue_head_init(&tnlq); - __skb_queue_head_init(&tmpxq); - __skb_queue_head_init(&frags); - - /* At least one packet required for safe algorithm => add dummy */ - skb = tipc_msg_create(TIPC_LOW_IMPORTANCE, TIPC_DIRECT_MSG, - BASIC_H_SIZE, 0, l->addr, tipc_own_addr(l->net), - 0, 0, TIPC_ERR_NO_PORT); - if (!skb) { - pr_warn("%sunable to create tunnel packet\n", link_co_err); - return; - } - __skb_queue_tail(&tnlq, skb); - tipc_link_xmit(l, &tnlq, &tmpxq); - __skb_queue_purge(&tmpxq); - /* Link Synching: * From now on, send only one single ("dummy") SYNCH message * to 
peer. The SYNCH message does not contain any data, just @@ -1768,6 +1753,20 @@ void tipc_link_tnl_prepare(struct tipc_link *l, struct tipc_link *tnl, return; } + __skb_queue_head_init(&tmpxq); + __skb_queue_head_init(&frags); + /* At least one packet required for safe algorithm => add dummy */ + skb = tipc_msg_create(TIPC_LOW_IMPORTANCE, TIPC_DIRECT_MSG, + BASIC_H_SIZE, 0, l->addr, tipc_own_addr(l->net), + 0, 0, TIPC_ERR_NO_PORT); + if (!skb) { + pr_warn("%sunable to create tunnel packet\n", link_co_err); + return; + } + __skb_queue_tail(&tnlq, skb); + tipc_link_xmit(l, &tnlq, &tmpxq); + __skb_queue_purge(&tmpxq); + /* Initialize reusable tunnel packet header */ tipc_msg_init(tipc_own_addr(l->net), &tnlhdr, TUNNEL_PROTOCOL, mtyp, INT_H_SIZE, l->addr); -- 2.13.7 |
From: Hoang L. <hoa...@de...> - 2019-10-22 03:35:24
|
Hi Eric, Thanks for the quick feedback. See my inline answers. Regards, Hoang -----Original Message----- From: Eric Dumazet <eri...@gm...> Sent: Tuesday, October 22, 2019 9:41 AM To: Hoang Le <hoa...@de...>; jon...@er...; ma...@do...; tip...@li...; ne...@vg... Subject: Re: [net-next] tipc: improve throughput between nodes in netns On 10/21/19 7:20 PM, Hoang Le wrote: > n->net = net; > n->capabilities = capabilities; > + n->pnet = NULL; > + for_each_net_rcu(tmp) { This does not scale well, if say you have a thousand netns ? [Hoang] This check executes only once, at node setup. So a large number of namespaces is not a problem. > + tn_peer = net_generic(tmp, tipc_net_id); > + if (!tn_peer) > + continue; > + /* Integrity checking whether node exists in namespace or not */ > + if (tn_peer->net_id != tn->net_id) > + continue; > + if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN)) > + continue; > + > + hash_chk = tn_peer->random; > + hash_chk ^= net_hash_mix(&init_net); Why is the xor with net_hash_mix(&init_net) needed ? [Hoang] We are trying to prevent sniffed discovery messages from being forged and re-injected. Mixing in as many hash sources as possible makes fake discovery messages harder to construct. > + hash_chk ^= net_hash_mix(tmp); > + if (hash_chk ^ hash_mixes) > + continue; > + n->pnet = tmp; > + break; > + } How can we set n->pnet without increasing netns ->count ? Using check_net() later might trigger a use-after-free. [Hoang] In this case the peer node is down. I assume the tipc xmit function has already bypassed these lines. |
From: Hoang Le <hoa...@de...> - 2019-10-22 02:22:12
|
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node-internal throughput, several times larger than if we let the traffic go through the full network stack. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel local node has no other way to join a cluster than any other node, and has to obey policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace. - If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. 
- To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local namespaces' hash_mix:es. If it finds a match, then that, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can be justified to allow it to shortcut the lower protocol layers. Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-namespace messaging can easily be tracked. 
Suggested-by: Jon Maloy <jon...@er...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/discover.c | 10 ++++- net/tipc/msg.h | 10 +++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 100 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 4 +- net/tipc/socket.c | 6 +-- 6 files changed, 121 insertions(+), 11 deletions(-) diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..338d402fcf39 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -38,6 +38,8 @@ #include "node.h" #include "discover.h" +#include <net/netns/hash.h> + /* min delay during bearer start up */ #define TIPC_DISC_INIT msecs_to_jiffies(125) /* max delay if bearer has no links */ @@ -83,6 +85,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, struct tipc_net *tn = tipc_net(net); u32 dest_domain = b->domain; struct tipc_msg *hdr; + u32 hash; hdr = buf_msg(skb); tipc_msg_init(tn->trial_addr, hdr, LINK_CONFIG, mtyp, @@ -94,6 +97,10 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + hash = tn->random; + hash ^= net_hash_mix(&init_net); + hash ^= net_hash_mix(net); + msg_set_peer_net_hash(hdr, hash); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +249,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..a8d0f28094f2 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct tipc_msg *m, u16 n) msg_set_bits(m, 9, 16, 0xffff, n); } +static inline void 
msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + msg_set_word(m, 9, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 9); +} + /* Word 10 */ static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index f2e3cf70c922..d830c2d1dbe3 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -45,6 +45,8 @@ #include "netlink.h" #include "trace.h" +#include <net/netns/hash.h> + #define INVALID_NODE_SIG 0x10000 #define NODE_CLEANUP_AFTER 300000 @@ -126,6 +128,7 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *pnet; }; /* Node FSM states and events: @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->pnet && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -361,12 +372,16 @@ static void tipc_node_write_unlock(struct tipc_node *n) } static 
struct tipc_node *tipc_node_create(struct net *net, u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; + struct tipc_net *tn_peer; struct tipc_link *l; + struct net *tmp; int bearer_id; + u32 hash_chk; int i; spin_lock_bh(&tn->node_list_lock); @@ -400,6 +415,25 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, memcpy(&n->peer_id, peer_id, 16); n->net = net; n->capabilities = capabilities; + n->pnet = NULL; + for_each_net_rcu(tmp) { + tn_peer = net_generic(tmp, tipc_net_id); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id != tn->net_id) + continue; + if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN)) + continue; + + hash_chk = tn_peer->random; + hash_chk ^= net_hash_mix(&init_net); + hash_chk ^= net_hash_mix(tmp); + if (hash_chk ^ hash_mixes) + continue; + n->pnet = tmp; + break; + } kref_init(&n->kref); rwlock_init(&n->lock); INIT_HLIST_NODE(&n->hash); @@ -979,7 +1013,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1032,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1424,6 +1459,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + 
case TIPC_LOW_IMPORTANCE: + case TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + spin_lock_init(&list->lock); + tipc_sk_rcv(pnet, list); + return; + } + if (msg_mcast(hdr)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + spin_lock_init(&list->lock); + tipc_sk_rcv(pnet, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: + case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1520,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1455,6 +1537,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, return -EHOSTUNREACH; } + node_up = node_is_up(n); + if (node_up && n->pnet && check_net(n->pnet)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->pnet, list); + if (likely(skb_queue_empty(list))) { + tipc_node_put(n); + return 0; + } + } + tipc_node_read_lock(n); bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..2557d40fd417 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 
capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index d579b64705b1..d34bd2e36050 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + 
tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
From: Jon M. <jon...@er...> - 2019-10-21 22:53:40
|
Hi Hoang, Just some improvements to (my own) log message text below. Then you can go ahead and add "acked-by" from me. ///jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 21-Oct-19 00:16 > To: tip...@li...; Jon Maloy > <jon...@er...>; ma...@do...; yin...@wi...; > lx...@re... > Subject: [net-next v2] tipc: improve throughput between nodes in netns > > Currently, TIPC transports intra-node user data messages directly socket to > socket, hence shortcutting all the lower layers of the communication stack. > This gives TIPC very good intra-node performance, both regarding throughput > and latency. > > We now introduce a similar mechanism for TIPC data traffic across network > name spaces located in the same kernel. On the send path, the call chain is as > always accompanied by the sending node's network name space pointer. > However, once we have reliably established that the receiving node is > represented by a name space on the same host, we just replace the name > space pointer with the receiving node/name space's ditto, and follow the > regular socket receive path through the receiving node. This technique gives > us a throughput similar to the node internal throughput, several times larger > than if we let the traffic go through the full network stack. As a comparison, > max throughput for 64k messages is four times larger than TCP throughput for > the same type of traffic in a similar environment. > > To meet any security concerns, the following should be noted. > > - All nodes joining a cluster are supposed to have been certified and > authenticated by mechanisms outside TIPC. This is no different for > nodes/name spaces on the same host; they have to auto discover each other > using the attached interfaces, and establish links which are supervised via the > regular link monitoring mechanism. 
Hence, a kernel local node has no other > way to join a cluster than any other node, and has to obey policies set in > the IP or device layers of the stack. > > - Only when a sender has established with 100% certainty that the peer node > is located in a kernel local name space does it choose to let user data messages, > and only those, take the crossover path to the receiving node/name space. > > - If the receiving node/name space is removed, its name space pointer is > invalidated at all peer nodes, and their neighbor link monitoring will eventually > note that this node is gone. > > - To ensure the "100% certainty" criterion, and prevent any possible spoofing, > received discovery messages must contain a proof that s/they know a common secret./the sender knows a common secret./g > We use the hash_mix of the sending node/name space for this > purpose, since it can be accessed directly by all other name spaces in the > kernel. Upon reception of a discovery message, the receiver checks this proof > against all the local name spaces' > hash_mix:es. If it finds a match, then that, along with a matching node id and > cluster id, is deemed sufficient proof that the peer node in question is in a > local name space, and a wormhole can be opened. > > - We should also consider that TIPC is intended to be a cluster local IPC > mechanism (just like e.g. UNIX sockets) rather than a network protocol, and > hence s/should be given more freedom to shortcut the lower protocol than other protocols/ we think it can be justified to allow it to shortcut the lower protocol layers./g > > Regarding traceability, we should notice that since commit 6c9081a3915d > ("tipc: add loopback device tracking") it is possible to follow the node internal > packet flow by just activating tcpdump on the loopback interface. This will be > true even for this mechanism; by activating tcpdump on the involved nodes' > loopback interfaces their inter-name space messaging can easily be tracked. 
> > Suggested-by: Jon Maloy <jon...@er...> > Signed-off-by: Hoang Le <hoa...@de...> > --- > net/tipc/discover.c | 10 ++++- > net/tipc/msg.h | 10 +++++ > net/tipc/name_distr.c | 2 +- > net/tipc/node.c | 100 > ++++++++++++++++++++++++++++++++++++++++-- > net/tipc/node.h | 4 +- > net/tipc/socket.c | 6 +-- > 6 files changed, 121 insertions(+), 11 deletions(-) > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > c138d68e8a69..338d402fcf39 100644 > --- a/net/tipc/discover.c > +++ b/net/tipc/discover.c > @@ -38,6 +38,8 @@ > #include "node.h" > #include "discover.h" > > +#include <net/netns/hash.h> > + > /* min delay during bearer start up */ > #define TIPC_DISC_INIT msecs_to_jiffies(125) > /* max delay if bearer has no links */ > @@ -83,6 +85,7 @@ static void tipc_disc_init_msg(struct net *net, struct > sk_buff *skb, > struct tipc_net *tn = tipc_net(net); > u32 dest_domain = b->domain; > struct tipc_msg *hdr; > + u32 hash; > > hdr = buf_msg(skb); > tipc_msg_init(tn->trial_addr, hdr, LINK_CONFIG, mtyp, @@ -94,6 > +97,10 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, > msg_set_dest_domain(hdr, dest_domain); > msg_set_bc_netid(hdr, tn->net_id); > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > + hash = tn->random; > + hash ^= net_hash_mix(&init_net); > + hash ^= net_hash_mix(net); > + msg_set_peer_net_hash(hdr, hash); > msg_set_node_id(hdr, tipc_own_id(net)); } > > @@ -242,7 +249,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > *skb, > if (!tipc_in_scope(legacy, b->domain, src)) > return; > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > - &maddr, &respond, &dupl_addr); > + msg_peer_net_hash(hdr), &maddr, &respond, > + &dupl_addr); > if (dupl_addr) > disc_dupl_alert(b, src, &maddr); > if (!respond) > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > 0daa6f04ca81..a8d0f28094f2 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > 
tipc_msg *m, u16 n) > msg_set_bits(m, 9, 16, 0xffff, n); > } > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > + msg_set_word(m, 9, n); > +} > + > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > + return msg_word(m, 9); > +} > + > /* Word 10 > */ > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > 836e629e8f4a..5feaf3b67380 100644 > --- a/net/tipc/name_distr.c > +++ b/net/tipc/name_distr.c > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct > sk_buff_head *list, > struct publication *publ; > struct sk_buff *skb = NULL; > struct distr_item *item = NULL; > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) > +/ > ITEM_SIZE) * ITEM_SIZE; > u32 msg_rem = msg_dsz; > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > c8f6177dd5a2..780b726041dd 100644 > --- a/net/tipc/node.c > +++ b/net/tipc/node.c > @@ -45,6 +45,8 @@ > #include "netlink.h" > #include "trace.h" > > +#include <net/netns/hash.h> > + > #define INVALID_NODE_SIG 0x10000 > #define NODE_CLEANUP_AFTER 300000 > > @@ -126,6 +128,7 @@ struct tipc_node { > struct timer_list timer; > struct rcu_head rcu; > unsigned long delete_at; > + struct net *pnet; > }; > > /* Node FSM states and events: > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > tipc_node *n, int sel) > return n->links[bearer_id].link; > } > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected) > { > struct tipc_node *n; > int bearer_id; > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, > u32 sel) > if (unlikely(!n)) > return mtu; > > + /* Allow MAX_MSG_SIZE when building connection oriented message > + * if they are in the same core network > + */ > + if (n->pnet && connected) { > + tipc_node_put(n); > 
+ return mtu; > + } > + > bearer_id = n->active_links[sel & 1]; > if (likely(bearer_id != INVALID_BEARER_ID)) > mtu = n->links[bearer_id].mtu; > @@ -361,12 +372,16 @@ static void tipc_node_write_unlock(struct > tipc_node *n) } > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > - u8 *peer_id, u16 capabilities) > + u8 *peer_id, u16 capabilities, > + u32 signature, u32 hash_mixes) > { > struct tipc_net *tn = net_generic(net, tipc_net_id); > struct tipc_node *n, *temp_node; > + struct tipc_net *tn_peer; > struct tipc_link *l; > + struct net *tmp; > int bearer_id; > + u32 hash_chk; > int i; > > spin_lock_bh(&tn->node_list_lock); > @@ -400,6 +415,25 @@ static struct tipc_node *tipc_node_create(struct net > *net, u32 addr, > memcpy(&n->peer_id, peer_id, 16); > n->net = net; > n->capabilities = capabilities; > + n->pnet = NULL; > + for_each_net_rcu(tmp) { > + tn_peer = net_generic(tmp, tipc_net_id); > + if (!tn_peer) > + continue; > + /* Integrity checking whether node exists in namespace or not */ > + if (tn_peer->net_id != tn->net_id) > + continue; > + if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN)) > + continue; > + > + hash_chk = tn_peer->random; > + hash_chk ^= net_hash_mix(&init_net); > + hash_chk ^= net_hash_mix(tmp); > + if (hash_chk ^ hash_mixes) > + continue; > + n->pnet = tmp; > + break; > + } > kref_init(&n->kref); > rwlock_init(&n->lock); > INIT_HLIST_NODE(&n->hash); > @@ -979,7 +1013,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > u32 addr) > > void tipc_node_check_dest(struct net *net, u32 addr, > u8 *peer_id, struct tipc_bearer *b, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 hash_mixes, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr) > { > @@ -998,7 +1032,8 @@ void tipc_node_check_dest(struct net *net, u32 > addr, > *dupl_addr = false; > *respond = false; > > - n = tipc_node_create(net, addr, peer_id, capabilities); > + n = tipc_node_create(net, addr, peer_id, 
capabilities, signature, > + hash_mixes); > if (!n) > return; > > @@ -1424,6 +1459,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg > *msg, struct tipc_node *node) > return -EMSGSIZE; > } > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) > +{ > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > + struct sk_buff_head inputq; > + > + switch (msg_user(hdr)) { > + case TIPC_LOW_IMPORTANCE: > + case TIPC_MEDIUM_IMPORTANCE: > + case TIPC_HIGH_IMPORTANCE: > + case TIPC_CRITICAL_IMPORTANCE: > + if (msg_connected(hdr) || msg_named(hdr)) { > + spin_lock_init(&list->lock); > + tipc_sk_rcv(pnet, list); > + return; > + } > + if (msg_mcast(hdr)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + return; > + } > + return; > + case MSG_FRAGMENTER: > + if (tipc_msg_assemble(list)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + } > + return; > + case GROUP_PROTOCOL: > + case CONN_MANAGER: > + spin_lock_init(&list->lock); > + tipc_sk_rcv(pnet, list); > + return; > + case LINK_PROTOCOL: > + case NAME_DISTRIBUTOR: > + case TUNNEL_PROTOCOL: > + case BCAST_PROTOCOL: > + return; > + default: > + return; > + }; > +} > + > /** > * tipc_node_xmit() is the general link level function for message sending > * @net: the applicable net namespace > @@ -1439,6 +1520,7 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > struct tipc_link_entry *le = NULL; > struct tipc_node *n; > struct sk_buff_head xmitq; > + bool node_up = false; > int bearer_id; > int rc; > > @@ -1455,6 +1537,16 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > return -EHOSTUNREACH; > } > > + node_up = node_is_up(n); > + if (node_up && n->pnet && check_net(n->pnet)) { > + /* xmit inner linux container */ > + tipc_lxc_xmit(n->pnet, list); > + if (likely(skb_queue_empty(list))) 
{ > + tipc_node_put(n); > + return 0; > + } > + } > + > tipc_node_read_lock(n); > bearer_id = n->active_links[selector & 1]; > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..2557d40fd417 > 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > struct tipc_bearer *bearer, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 hash_mixes, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr); > void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 > @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, > u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 > peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 > port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected); > bool tipc_node_is_up(struct net *net, u32 addr); > u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git > a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c > 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > struct tipc_sock *tsk, > > /* Build message as chain of buffers */ > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket 
*sock, > struct msghdr *m, size_t dlen) > return rc; > > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock > *tsk, u32 peer_port, > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > -- > 2.20.1 |
From: Hoang L. <hoa...@de...> - 2019-10-21 04:20:11
|
Hi Jon, I have submitted the new code change in a separate email. Please help to review it again. Thanks, Hoang -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Friday, October 18, 2019 9:21 PM To: Hoang Huu Le <hoa...@de...>; ma...@do...; tip...@de...; tip...@li... Subject: RE: [net-next] tipc: improve throughput between nodes in netns Hi Hoang, Our task is to establish that the message really came from the same node we have found in a local name space. Imagine somebody is sniffing on a network, and finds there is a remote peer with proof(hash_mix)= M, node id X and cluster id Y. He then creates an illegitimate local name space with proof(hash_mix)= N, node id X, but cluster id Z, so that all its discovery messages are dropped by the receiver. He may then create fake discovery messages with proof(hash_mix)= N, node id X and cluster id Y, which will be accepted by the receiver and compared to the fake node's data. Alas, they all match, and he has succeeded in hijacking traffic to the remote node, and this may happen even if the traffic was meant to be encrypted. Admittedly there are some weaknesses in this scenario, e.g., he cannot do this unless the remote node is temporarily down (maybe he can kill it with a fake RESET message?), and there are other reasons why this might be very hard to do. But better safe than sorry, if we can avoid this with just a simple extra test that costs nothing. Regards ///jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Oct-19 04:24 > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > de...@de...; tip...@li... > Subject: RE: [net-next] tipc: improve throughput between nodes in netns > > Hi Jon, > > Thanks for the good description. > However, w.r.t. your comment "We even need to verify cluster ids.", I'm still > unclear why we need to isolate cluster ids here. > I guess the node had already been accepted when it bypassed the function > tipc_disc_rcv. 
Then, we just check whether to apply the new mechanism for kernel local > namespaces. > > Regards, > Hoang > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Friday, October 18, 2019 2:20 AM > To: Hoang Huu Le <hoa...@de...>; ma...@do...; > tip...@de...; tip...@li... > Subject: RE: [net-next] tipc: improve throughput between nodes in netns > > Hi Hoang, > We need a very good log text to justify this. > > My proposal: > > "Currently, TIPC transports intra-node user data messages directly socket to > socket, hence shortcutting all the lower layers of the communication stack. > This gives TIPC very good intra-node performance, both regarding throughput > and latency. > > We now introduce a similar mechanism for TIPC data traffic across network > name spaces located in the same kernel. On the send path, the call chain is as > always accompanied by the sending node's network name space pointer. > However, once we have reliably established that the receiving node is > represented by a name space on the same host, we just replace the name > space pointer with the receiving node/name space's ditto, and follow the > regular socket receive path through the receiving node. This technique gives > us a throughput similar to the node internal throughput, several times larger > than if we let the traffic go through the full network stack. As a comparison, > max throughput for 64k messages is four times larger than TCP throughput for > the same type of traffic. > > To meet any security concerns, the following should be noted. > > - All nodes joining a cluster are supposed to have been certified and > authenticated by mechanisms outside TIPC. This is no different for > nodes/name spaces on the same host; they have to auto discover each other > using the attached interfaces, and establish links which are supervised via the > regular link monitoring mechanism. 
Hence, a kernel local node has no other > way to join a cluster than any other node, and has to obey the policies set in > the IP or device layers of the stack. > > - Only when a sender has established with 100% certainty that the peer node > is located in a kernel local name space does it choose to let user data messages, > and only those, take the crossover path to the receiving node/name space. > > - If the receiving node/name space is removed, its name space pointer is > invalidated at all peer nodes, and their neighbor link monitoring will eventually > note that this node is gone. > > - To ensure the "100% certainty" criterion, and prevent any possible spoofing, > received discovery messages must contain a proof that they know a common > secret. We use the hash_mix of the sending node/name space for this > purpose, since it can be accessed directly by all other name spaces in the > kernel. Upon reception of a discovery message, the receiver checks this proof > against all the local name spaces' hash_mix values. If it finds a match, then that, along > with a matching node id and cluster id, is deemed sufficient proof that the > peer node in question is in a local name space, and a wormhole can be > opened. > > - We should also consider that TIPC is intended to be a cluster local IPC > mechanism (just like e.g. UNIX sockets) rather than a network protocol, and > hence should be given more freedom to shortcut the lower protocol than > other protocols. > > Regarding traceability, we should notice that since commit 6c9081a3915d > ("tipc: add loopback device tracking") it is possible to follow the node internal packet > flow by just activating tcpdump on the loopback interface. This will be true > even for this mechanism; by activating tcpdump on the involved nodes' > loopback interfaces their inter-name space messaging can easily be tracked." > > I also think there should be a "Suggested-by: Jon Maloy > <jon...@er...>" at the bottom of the patch. > > See more comments below. 
> > > > -----Original Message----- > > From: Hoang Le <hoa...@de...> > > Sent: 17-Oct-19 06:10 > > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > > de...@de... > > Subject: [net-next] tipc: improve throughput between nodes in netns > > > > Introduce traffic cross namespaces transmission as intranode. > > By this way, throughput between nodes in namespace as fast as local. > > Looks though the architectural view of TIPC, the new TIPC mechanism > > for containers will not introduce any security or breaking the current > > policies at > > all: > > > > 1/ Extranode: > > > > Node A Node B > > +-----------------+ +-----------------+ > > | TIPC | | TIPC | > > | Application | | Application | > > |-----------------| |-----------------| > > | | | | > > | TIPC |TIPC address TIPC address| TIPC | > > | | | | > > |-----------------| |-----------------| > > | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | > > | Service | | Service | > > +-----------------+ +-----------------+ > > NIC NIC > > +---------------- Bearer Transport ----------------+ > > > > 2/ Intranode: > > Node A Node A > > +-----------------+ +-----------------+ > > | TIPC | | TIPC | > > | Application | | Application | > > |-----------------| |-----------------| > > | | | | > > | TIPC |TIPC address TIPC address| TIPC | > > | | | | > > +-------+---------+ +--------+--------+ > > +--------------------------------------------------+ > > > > 3/ For container (same as extranode): > > +-----------------------------------------------------------------------+ > > | Container Container | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | | TIPC | | TIPC | | > > | | Application | | Application | | > > | |-----------------| > > | |-----------------| |-----------------| > > | |-----------------| | > > | | | | | | > > | | TIPC |TIPC address TIPC address| TIPC | | > > | | | | | | > > | |-----------------| > > | |-----------------| |-----------------| > 
> | |-----------------| | > > | | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | | > > | | Service | | Service | | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | (vNIC) (vNIC) | > > | + Host Kernel (KVM, Native) + | > > | +----------------Bearer Transport-------------------+ | > > | (bridge, OpenVSwitch) | > > | + | > > | +-------+---------+ | > > | | L2 or L3 Bearer | | > > | | Service | | > > | |-----------------| | > > | | | | > > | | TIPC |TIPC address | > > | | | | > > | |-----------------| | > > | | TIPC | | > > | | Application | | > > | +-----------------+ | > > | > > | | > > +-----------------------------------------------------------------------+ > > > > 4/ New design for container (same as intranode): > > +-----------------------------------------------------------------------+ > > | Container Container | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | | TIPC | | TIPC | | > > | | Application | | Application | | > > | |-----------------| > > | |-----------------| |-----------------| > > | |-----------------| | > > | | | | | | > > | | TIPC |TIPC address TIPC address| TIPC | | > > | | | | | | > > | +-------+---------+ > > | +-------+---------+ +--------+--------+ > > | +-------+---------+ | > > | + Host Kernel (KVM, Native) + | > > | +-------------------------+------------------------+ | > > | +-------------+ | > > | +-----------------+ | | > > | | TIPC | | | > > | | Application | | | > > | |-----------------| | | > > | | +----+ | > > | | TIPC |TIPC address | > > | | | | > > | +-----------------+ | > > | > > | | > > +-----------------------------------------------------------------------+ > > > > TIPC is as an IPC and to designate the transport layer as an "L2.5" > > data link layer. 
When a TIPC node address has been accepted into a > > cluster and located in the same kernel (as we are trying to ensure in > > this patch), we are 100% certain it is legitimate and authentic. > > So, I cannot see any reason why we should not be allowed to short-cut > > for containers when security checks have already been done. > > Those drawings are nice, but unnecessary in my view. I think my text above is > sufficient as explanation of what we are doing. > > > > > Signed-off-by: Hoang Le <hoa...@de...> > > --- > > net/tipc/discover.c | 6 ++- > > net/tipc/msg.h | 10 +++++ > > net/tipc/name_distr.c | 2 +- > > net/tipc/node.c | 94 > > +++++++++++++++++++++++++++++++++++++++++-- > > net/tipc/node.h | 4 +- > > net/tipc/socket.c | 6 +-- > > 6 files changed, 111 insertions(+), 11 deletions(-) > > > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > > c138d68e8a69..98d4eea97eb7 100644 > > --- a/net/tipc/discover.c > > +++ b/net/tipc/discover.c > > @@ -38,6 +38,8 @@ > > #include "node.h" > > #include "discover.h" > > > > +#include <net/netns/hash.h> > > + > > /* min delay during bearer start up */ > > #define TIPC_DISC_INIT msecs_to_jiffies(125) > > /* max delay if bearer has no links */ @@ -94,6 +96,7 @@ static void > > tipc_disc_init_msg(struct net *net, struct sk_buff *skb, > > msg_set_dest_domain(hdr, dest_domain); > > msg_set_bc_netid(hdr, tn->net_id); > > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > > + msg_set_peer_net_hash(hdr, net_hash_mix(net)); > > We should not add the hash directly, since that would be exposing kernel > internal info to outside observers. > What we need to add is a *proof* that the sender knows the hash_mix in > question. So, it should XOR its hash_mix with a TIPC/kernel > global random value (also secret) and add the result to the message. The > receiver does XOR on the proof and the same random value, > and compares the result to the hash_mixes of the local name spaces to find a > match. 
> > > > msg_set_node_id(hdr, tipc_own_id(net)); } > > > > @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > > *skb, > > u8 peer_id[NODE_ID_LEN] = {0,}; > > u32 dst = msg_dest_domain(hdr); > > u32 net_id = msg_bc_netid(hdr); > > + u32 pnet_hash = msg_peer_net_hash(hdr); > > struct tipc_media_addr maddr; > > u32 src = msg_prevnode(hdr); > > u32 mtyp = msg_type(hdr); > > @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > > *skb, > > if (!tipc_in_scope(legacy, b->domain, src)) > > return; > > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > > - &maddr, &respond, &dupl_addr); > > + pnet_hash, &maddr, &respond, &dupl_addr); > > if (dupl_addr) > > disc_dupl_alert(b, src, &maddr); > > if (!respond) > > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > > 0daa6f04ca81..a8d0f28094f2 100644 > > --- a/net/tipc/msg.h > > +++ b/net/tipc/msg.h > > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > > tipc_msg *m, u16 n) > > msg_set_bits(m, 9, 16, 0xffff, n); > > } > > > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > > + msg_set_word(m, 9, n); > > +} > > + > > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > > + return msg_word(m, 9); > > +} > > + > > /* Word 10 > > */ > > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > > 836e629e8f4a..5feaf3b67380 100644 > > --- a/net/tipc/name_distr.c > > +++ b/net/tipc/name_distr.c > > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct > > sk_buff_head *list, > > struct publication *publ; > > struct sk_buff *skb = NULL; > > struct distr_item *item = NULL; > > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) > > +/ > > ITEM_SIZE) * ITEM_SIZE; > > u32 msg_rem = msg_dsz; > > > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > > 
c8f6177dd5a2..9a4ffd647701 100644 > > --- a/net/tipc/node.c > > +++ b/net/tipc/node.c > > @@ -45,6 +45,8 @@ > > #include "netlink.h" > > #include "trace.h" > > > > +#include <net/netns/hash.h> > > + > > #define INVALID_NODE_SIG 0x10000 > > #define NODE_CLEANUP_AFTER 300000 > > > > @@ -126,6 +128,7 @@ struct tipc_node { > > struct timer_list timer; > > struct rcu_head rcu; > > unsigned long delete_at; > > + struct net *pnet; > > }; > > > > /* Node FSM states and events: > > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > > tipc_node *n, int sel) > > return n->links[bearer_id].link; > > } > > > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > +connected) > > { > > struct tipc_node *n; > > int bearer_id; > > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, > > u32 sel) > > if (unlikely(!n)) > > return mtu; > > > > + /* Allow MAX_MSG_SIZE when building connection oriented message > > + * if they are in the same core network > > + */ > > + if (n->pnet && connected) { > > + tipc_node_put(n); > > + return mtu; > > + } > > + > > bearer_id = n->active_links[sel & 1]; > > if (likely(bearer_id != INVALID_BEARER_ID)) > > mtu = n->links[bearer_id].mtu; > > @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct > > tipc_node *n) } > > > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > > - u8 *peer_id, u16 capabilities) > > + u8 *peer_id, u16 capabilities, > > + u32 signature, u32 pnet_hash) > > { > > struct tipc_net *tn = net_generic(net, tipc_net_id); > > struct tipc_node *n, *temp_node; > > + struct tipc_net *tn_peer; > > struct tipc_link *l; > > + struct net *tmp; > > int bearer_id; > > int i; > > > > @@ -400,6 +414,23 @@ static struct tipc_node *tipc_node_create(struct > net > > *net, u32 addr, > > memcpy(&n->peer_id, peer_id, 16); > > n->net = net; > > n->capabilities = capabilities; > > + n->pnet = NULL; > > + 
for_each_net_rcu(tmp) { > > + /* Integrity checking whether node exists in namespace or not */ > > + if (net_hash_mix(tmp) != pnet_hash) > > + continue; > > See my comment above. > > > + tn_peer = net_generic(tmp, tipc_net_id); > > + if (!tn_peer) > > + continue; > > + > > + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) > > + continue; > > + > > + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { > > + n->pnet = tmp; > > + break; > > + } > > We even need to verify cluster ids. > > > + } > > kref_init(&n->kref); > > rwlock_init(&n->lock); > > INIT_HLIST_NODE(&n->hash); > > @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > > u32 addr) > > > > void tipc_node_check_dest(struct net *net, u32 addr, > > u8 *peer_id, struct tipc_bearer *b, > > - u16 capabilities, u32 signature, > > + u16 capabilities, u32 signature, u32 pnet_hash, > > struct tipc_media_addr *maddr, > > bool *respond, bool *dupl_addr) > > { > > @@ -998,7 +1029,8 @@ void tipc_node_check_dest(struct net *net, u32 > > addr, > > *dupl_addr = false; > > *respond = false; > > > > - n = tipc_node_create(net, addr, peer_id, capabilities); > > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > > + pnet_hash); > > if (!n) > > return; > > > > @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct > tipc_nl_msg > > *msg, struct tipc_node *node) > > return -EMSGSIZE; > > } > > > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) > > +{ > > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > > + struct sk_buff_head inputq; > > + > > + switch (msg_user(hdr)) { > > + case TIPC_LOW_IMPORTANCE: > > + case TIPC_MEDIUM_IMPORTANCE: > > + case TIPC_HIGH_IMPORTANCE: > > + case TIPC_CRITICAL_IMPORTANCE: > > + if (msg_connected(hdr) || msg_named(hdr)) { > > + spin_lock_init(&list->lock); > > + tipc_sk_rcv(pnet, list); > > + return; > > + } > > + if (msg_mcast(hdr)) { > > + skb_queue_head_init(&inputq); > > + tipc_sk_mcast_rcv(pnet, list, &inputq); 
> > + __skb_queue_purge(list); > > + skb_queue_purge(&inputq); > > + return; > > + } > > + return; > > + case MSG_FRAGMENTER: > > + if (tipc_msg_assemble(list)) { > > + skb_queue_head_init(&inputq); > > + tipc_sk_mcast_rcv(pnet, list, &inputq); > > + __skb_queue_purge(list); > > + skb_queue_purge(&inputq); > > + } > > + return; > > > + case LINK_PROTOCOL: > > + case NAME_DISTRIBUTOR: > > + case GROUP_PROTOCOL: > > + case CONN_MANAGER: > > GROUP_PROTOCOL and CONN_MANAGER messages must also follow the > wormhole path, otherwise they (e.g. CONN_ACK) will be out of synch > with the corresponding data messages, and probably result in poorer > throughput. > > Regards > ///jon > > > > + case TUNNEL_PROTOCOL: > > + case BCAST_PROTOCOL: > > + return; > > + default: > > + return; > > + }; > > +} > > + > > /** > > * tipc_node_xmit() is the general link level function for message sending > > * @net: the applicable net namespace > > @@ -1439,6 +1514,7 @@ int tipc_node_xmit(struct net *net, struct > > sk_buff_head *list, > > struct tipc_link_entry *le = NULL; > > struct tipc_node *n; > > struct sk_buff_head xmitq; > > + bool node_up = false; > > int bearer_id; > > int rc; > > > > @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct > > sk_buff_head *list, > > return -EHOSTUNREACH; > > } > > > > + node_up = node_is_up(n); > > + if (node_up && n->pnet && check_net(n->pnet)) { > > + /* xmit inner linux container */ > > + tipc_lxc_xmit(n->pnet, list); > > + if (likely(skb_queue_empty(list))) { > > + tipc_node_put(n); > > + return 0; > > + } > > + } > > + > > tipc_node_read_lock(n); > > bearer_id = n->active_links[selector & 1]; > > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > > a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..11eb95ce358b > > 100644 > > --- a/net/tipc/node.h > > +++ b/net/tipc/node.h > > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > > 
tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > > struct tipc_bearer *bearer, > > - u16 capabilities, u32 signature, > > + u16 capabilities, u32 signature, u32 pnet_hash, > > struct tipc_media_addr *maddr, > > bool *respond, bool *dupl_addr); > > void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 > +92,7 > > @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, > > u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); > > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 > > peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 > > port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > +connected); > > bool tipc_node_is_up(struct net *net, u32 addr); > > u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff -- > git > > a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c > > 100644 > > --- a/net/tipc/socket.c > > +++ b/net/tipc/socket.c > > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > > struct tipc_sock *tsk, > > > > /* Build message as chain of buffers */ > > __skb_queue_head_init(&pkts); > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > if (unlikely(rc != dlen)) > > return rc; > > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, > > struct msghdr *m, size_t dlen) > > return rc; > > > > __skb_queue_head_init(&pkts); > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > if (unlikely(rc != dlen)) > > return rc; > > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct 
tipc_sock > > *tsk, u32 peer_port, > > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > > __skb_queue_purge(&sk->sk_write_queue); > > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > > -- > > 2.20.1 > |
From: Hoang Le <hoa...@de...> - 2019-10-21 04:17:24
|
Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network name spaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a name space on the same host, we just replace the name space pointer with the receiving node/name space's ditto, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go through the full network stack. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/name spaces on the same host; they have to auto discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel local node has no other way to join a cluster than any other node, and has to obey the policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel local name space does it choose to let user data messages, and only those, take the crossover path to the receiving node/name space. - If the receiving node/name space is removed, its name space pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. 
- To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that they know a common secret. We use the hash_mix of the sending node/name space for this purpose, since it can be accessed directly by all other name spaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local name spaces' hash_mix values. If it finds a match, then that, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local name space, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence should be given more freedom to shortcut the lower protocol than other protocols. Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-name space messaging can easily be tracked. 
Suggested-by: Jon Maloy <jon...@er...> Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/discover.c | 10 ++++- net/tipc/msg.h | 10 +++++ net/tipc/name_distr.c | 2 +- net/tipc/node.c | 100 ++++++++++++++++++++++++++++++++++++++++-- net/tipc/node.h | 4 +- net/tipc/socket.c | 6 +-- 6 files changed, 121 insertions(+), 11 deletions(-) diff --git a/net/tipc/discover.c b/net/tipc/discover.c index c138d68e8a69..338d402fcf39 100644 --- a/net/tipc/discover.c +++ b/net/tipc/discover.c @@ -38,6 +38,8 @@ #include "node.h" #include "discover.h" +#include <net/netns/hash.h> + /* min delay during bearer start up */ #define TIPC_DISC_INIT msecs_to_jiffies(125) /* max delay if bearer has no links */ @@ -83,6 +85,7 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, struct tipc_net *tn = tipc_net(net); u32 dest_domain = b->domain; struct tipc_msg *hdr; + u32 hash; hdr = buf_msg(skb); tipc_msg_init(tn->trial_addr, hdr, LINK_CONFIG, mtyp, @@ -94,6 +97,10 @@ static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb, msg_set_dest_domain(hdr, dest_domain); msg_set_bc_netid(hdr, tn->net_id); b->media->addr2msg(msg_media_addr(hdr), &b->addr); + hash = tn->random; + hash ^= net_hash_mix(&init_net); + hash ^= net_hash_mix(net); + msg_set_peer_net_hash(hdr, hash); msg_set_node_id(hdr, tipc_own_id(net)); } @@ -242,7 +249,8 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb, if (!tipc_in_scope(legacy, b->domain, src)) return; tipc_node_check_dest(net, src, peer_id, b, caps, signature, - &maddr, &respond, &dupl_addr); + msg_peer_net_hash(hdr), &maddr, &respond, + &dupl_addr); if (dupl_addr) disc_dupl_alert(b, src, &maddr); if (!respond) diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..a8d0f28094f2 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct tipc_msg *m, u16 n) msg_set_bits(m, 9, 16, 0xffff, n); } +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) +{ + 
msg_set_word(m, 9, n); +} + +static inline u32 msg_peer_net_hash(struct tipc_msg *m) +{ + return msg_word(m, 9); +} + /* Word 10 */ static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c index 836e629e8f4a..5feaf3b67380 100644 --- a/net/tipc/name_distr.c +++ b/net/tipc/name_distr.c @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct sk_buff_head *list, struct publication *publ; struct sk_buff *skb = NULL; struct distr_item *item = NULL; - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) / ITEM_SIZE) * ITEM_SIZE; u32 msg_rem = msg_dsz; diff --git a/net/tipc/node.c b/net/tipc/node.c index c8f6177dd5a2..780b726041dd 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -45,6 +45,8 @@ #include "netlink.h" #include "trace.h" +#include <net/netns/hash.h> + #define INVALID_NODE_SIG 0x10000 #define NODE_CLEANUP_AFTER 300000 @@ -126,6 +128,7 @@ struct tipc_node { struct timer_list timer; struct rcu_head rcu; unsigned long delete_at; + struct net *pnet; }; /* Node FSM states and events: @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct tipc_node *n, int sel) return n->links[bearer_id].link; } -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected) { struct tipc_node *n; int bearer_id; @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) if (unlikely(!n)) return mtu; + /* Allow MAX_MSG_SIZE when building connection oriented message + * if they are in the same core network + */ + if (n->pnet && connected) { + tipc_node_put(n); + return mtu; + } + bearer_id = n->active_links[sel & 1]; if (likely(bearer_id != INVALID_BEARER_ID)) mtu = n->links[bearer_id].mtu; @@ -361,12 +372,16 @@ static void tipc_node_write_unlock(struct tipc_node *n) } static struct tipc_node *tipc_node_create(struct net *net, 
u32 addr, - u8 *peer_id, u16 capabilities) + u8 *peer_id, u16 capabilities, + u32 signature, u32 hash_mixes) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; + struct tipc_net *tn_peer; struct tipc_link *l; + struct net *tmp; int bearer_id; + u32 hash_chk; int i; spin_lock_bh(&tn->node_list_lock); @@ -400,6 +415,25 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, memcpy(&n->peer_id, peer_id, 16); n->net = net; n->capabilities = capabilities; + n->pnet = NULL; + for_each_net_rcu(tmp) { + tn_peer = net_generic(tmp, tipc_net_id); + if (!tn_peer) + continue; + /* Integrity checking whether node exists in namespace or not */ + if (tn_peer->net_id != tn->net_id) + continue; + if (memcmp(peer_id, tn_peer->node_id, NODE_ID_LEN)) + continue; + + hash_chk = tn_peer->random; + hash_chk ^= net_hash_mix(&init_net); + hash_chk ^= net_hash_mix(tmp); + if (hash_chk ^ hash_mixes) + continue; + n->pnet = tmp; + break; + } kref_init(&n->kref); rwlock_init(&n->lock); INIT_HLIST_NODE(&n->hash); @@ -979,7 +1013,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) void tipc_node_check_dest(struct net *net, u32 addr, u8 *peer_id, struct tipc_bearer *b, - u16 capabilities, u32 signature, + u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr) { @@ -998,7 +1032,8 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities); + n = tipc_node_create(net, addr, peer_id, capabilities, signature, + hash_mixes); if (!n) return; @@ -1424,6 +1459,52 @@ static int __tipc_nl_add_node(struct tipc_nl_msg *msg, struct tipc_node *node) return -EMSGSIZE; } +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) +{ + struct tipc_msg *hdr = buf_msg(skb_peek(list)); + struct sk_buff_head inputq; + + switch (msg_user(hdr)) { + case TIPC_LOW_IMPORTANCE: + case 
TIPC_MEDIUM_IMPORTANCE: + case TIPC_HIGH_IMPORTANCE: + case TIPC_CRITICAL_IMPORTANCE: + if (msg_connected(hdr) || msg_named(hdr)) { + spin_lock_init(&list->lock); + tipc_sk_rcv(pnet, list); + return; + } + if (msg_mcast(hdr)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + return; + } + return; + case MSG_FRAGMENTER: + if (tipc_msg_assemble(list)) { + skb_queue_head_init(&inputq); + tipc_sk_mcast_rcv(pnet, list, &inputq); + __skb_queue_purge(list); + skb_queue_purge(&inputq); + } + return; + case GROUP_PROTOCOL: + case CONN_MANAGER: + spin_lock_init(&list->lock); + tipc_sk_rcv(pnet, list); + return; + case LINK_PROTOCOL: + case NAME_DISTRIBUTOR: + case TUNNEL_PROTOCOL: + case BCAST_PROTOCOL: + return; + default: + return; + }; +} + /** * tipc_node_xmit() is the general link level function for message sending * @net: the applicable net namespace @@ -1439,6 +1520,7 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, struct tipc_link_entry *le = NULL; struct tipc_node *n; struct sk_buff_head xmitq; + bool node_up = false; int bearer_id; int rc; @@ -1455,6 +1537,16 @@ int tipc_node_xmit(struct net *net, struct sk_buff_head *list, return -EHOSTUNREACH; } + node_up = node_is_up(n); + if (node_up && n->pnet && check_net(n->pnet)) { + /* xmit inner linux container */ + tipc_lxc_xmit(n->pnet, list); + if (likely(skb_queue_empty(list))) { + tipc_node_put(n); + return 0; + } + } + tipc_node_read_lock(n); bearer_id = n->active_links[selector & 1]; if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..2557d40fd417 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, - u16 capabilities, u32 signature, + 
u16 capabilities, u32 signature, u32 hash_mixes, struct tipc_media_addr *maddr, bool *respond, bool *dupl_addr); void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected); bool tipc_node_is_up(struct net *net, u32 addr); u16 tipc_node_get_capabilities(struct net *net, u32 addr); int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, struct tipc_sock *tsk, /* Build message as chain of buffers */ __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen) return rc; __skb_queue_head_init(&pkts); - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); if (unlikely(rc != dlen)) return rc; @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); tipc_set_sk_state(sk, TIPC_ESTABLISHED); tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); + tsk->max_pkt = tipc_node_get_mtu(net, 
peer_node, tsk->portid, true); tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); __skb_queue_purge(&sk->sk_write_queue); if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) -- 2.20.1 |
From: Jon M. <jon...@er...> - 2019-10-18 15:53:37
|
Hi Hoang, Our task is to establish that the message really came from the same node we have found in a local name space. Imagine somebody is sniffing on a network, and finds there is a remote peer with proof(hash_mix) = M, node id X and cluster id Y. He then creates an illegitimate local name space with proof(hash_mix) = N, node id X, but cluster id Z, so that all its discovery messages are dropped by the receiver. He may then create fake discovery messages with proof(hash_mix) = N, node id X and cluster id Y, which will be accepted by the receiver and compared to the fake node's data. Alas, they all match, and he has succeeded in hijacking traffic to the remote node, and this may happen even if the traffic was meant to be encrypted. Admittedly there are some weaknesses in this scenario, e.g., he cannot do this unless the remote node is temporarily down (maybe he can kill it with a fake RESET message?), and there are other reasons why this might be very hard to do. But, better safe than sorry, if we can avoid this with just a simple extra test that costs nothing. Regards ///jon > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 18-Oct-19 04:24 > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > de...@de...; tip...@li... > Subject: RE: [net-next] tipc: improve throughput between nodes in netns > > Hi Jon, > > Thanks for the good description. > However, w.r.t. your comment "We even need to verify cluster ids.", I'm still > unclear why we need to isolate cluster ids here. > I guess the node has already been accepted when it passed the check in function > tipc_disc_rcv. Then, we just check to apply the new mechanism for kernel-local > namespaces. > > Regards, > Hoang > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Friday, October 18, 2019 2:20 AM > To: Hoang Huu Le <hoa...@de...>; ma...@do...; > tip...@de...; tip...@li... 
> Subject: RE: [net-next] tipc: improve throughput between nodes in netns > > Hi Hoang, > We need a very good log text to justify this. > > My proposal: > > "Currently, TIPC transports intra-node user data messages directly socket to > socket, hence shortcutting all the lower layers of the communication stack. > This gives TIPC very good intra-node performance, both regarding throughput > and latency. > > We now introduce a similar mechanism for TIPC data traffic across network > name spaces located in the same kernel. On the send path, the call chain is as > always accompanied by the sending node's network name space pointer. > However, once we have reliably established that the receiving node is > represented by a name space on the same host, we just replace the name > space pointer with the receiving node/name space's ditto, and follow the > regular socket receive path through the receiving node. This technique gives > us a throughput similar to the node internal throughput, several times larger > than if we let the traffic go through the full network stack. As a comparison, > max throughput for 64k messages is four times larger than TCP throughput for > the same type of traffic. > > To meet any security concerns, the following should be noted. > > - All nodes joining a cluster are supposed to have been certified and > authenticated by mechanisms outside TIPC. This is no different for > nodes/name spaces on the same host; they have to auto discover each other > using the attached interfaces, and establish links which are supervised via the > regular link monitoring mechanism. Hence, a kernel local node has no other > way to join a cluster than any other node, and has to obey the policies set in > the IP or device layers of the stack. 
> > - Only when a sender has established with 100% certainty that the peer node > is located in a kernel local name space does it choose to let user data messages, > and only those, take the crossover path to the receiving node/name space. > > - If the receiving node/name space is removed, its name space pointer is > invalidated at all peer nodes, and their neighbor link monitoring will eventually > note that this node is gone. > > - To ensure the "100% certainty" criterion, and prevent any possible spoofing, > received discovery messages must contain a proof that the sender knows a common > secret. We use the hash_mix of the sending node/name space for this > purpose, since it can be accessed directly by all other name spaces in the > kernel. Upon reception of a discovery message, the receiver checks this proof > against all the local name spaces' hash_mixes. If it finds a match, then that, along > with a matching node id and cluster id, is deemed sufficient proof that the > peer node in question is in a local name space, and a wormhole can be > opened. > > - We should also consider that TIPC is intended to be a cluster local IPC > mechanism (just like e.g. UNIX sockets) rather than a network protocol, and > hence should be given more freedom to shortcut the lower protocol layers than > other protocols. > > Regarding traceability, we should notice that since commit 6c9081a3915d > ("add loopback device tracing") it is possible to follow the node internal packet > flow by just activating tcpdump on the loopback interface. This will be true > even for this mechanism; by activating tcpdump on the involved nodes' > loopback interfaces their inter-name space messaging can easily be tracked." > > I also think there should be a "Suggested-by: Jon Maloy > <jon...@er...>" at the bottom of the patch. > > See more comments below. 
> > > > -----Original Message----- > > From: Hoang Le <hoa...@de...> > > Sent: 17-Oct-19 06:10 > > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > > de...@de... > > Subject: [net-next] tipc: improve throughput between nodes in netns > > > > Introduce traffic cross namespaces transmission as intranode. > > By this way, throughput between nodes in namespace as fast as local. > > Looks though the architectural view of TIPC, the new TIPC mechanism > > for containers will not introduce any security or breaking the current > > policies at > > all: > > > > 1/ Extranode: > > > > Node A Node B > > +-----------------+ +-----------------+ > > | TIPC | | TIPC | > > | Application | | Application | > > |-----------------| |-----------------| > > | | | | > > | TIPC |TIPC address TIPC address| TIPC | > > | | | | > > |-----------------| |-----------------| > > | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | > > | Service | | Service | > > +-----------------+ +-----------------+ > > NIC NIC > > +---------------- Bearer Transport ----------------+ > > > > 2/ Intranode: > > Node A Node A > > +-----------------+ +-----------------+ > > | TIPC | | TIPC | > > | Application | | Application | > > |-----------------| |-----------------| > > | | | | > > | TIPC |TIPC address TIPC address| TIPC | > > | | | | > > +-------+---------+ +--------+--------+ > > +--------------------------------------------------+ > > > > 3/ For container (same as extranode): > > +-----------------------------------------------------------------------+ > > | Container Container | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | | TIPC | | TIPC | | > > | | Application | | Application | | > > | |-----------------| > > | |-----------------| |-----------------| > > | |-----------------| | > > | | | | | | > > | | TIPC |TIPC address TIPC address| TIPC | | > > | | | | | | > > | |-----------------| > > | |-----------------| |-----------------| > 
> | |-----------------| | > > | | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | | > > | | Service | | Service | | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | (vNIC) (vNIC) | > > | + Host Kernel (KVM, Native) + | > > | +----------------Bearer Transport-------------------+ | > > | (bridge, OpenVSwitch) | > > | + | > > | +-------+---------+ | > > | | L2 or L3 Bearer | | > > | | Service | | > > | |-----------------| | > > | | | | > > | | TIPC |TIPC address | > > | | | | > > | |-----------------| | > > | | TIPC | | > > | | Application | | > > | +-----------------+ | > > | > > | | > > +-----------------------------------------------------------------------+ > > > > 4/ New design for container (same as intranode): > > +-----------------------------------------------------------------------+ > > | Container Container | > > | +-----------------+ > > | +-----------------+ +-----------------+ > > | +-----------------+ | > > | | TIPC | | TIPC | | > > | | Application | | Application | | > > | |-----------------| > > | |-----------------| |-----------------| > > | |-----------------| | > > | | | | | | > > | | TIPC |TIPC address TIPC address| TIPC | | > > | | | | | | > > | +-------+---------+ > > | +-------+---------+ +--------+--------+ > > | +-------+---------+ | > > | + Host Kernel (KVM, Native) + | > > | +-------------------------+------------------------+ | > > | +-------------+ | > > | +-----------------+ | | > > | | TIPC | | | > > | | Application | | | > > | |-----------------| | | > > | | +----+ | > > | | TIPC |TIPC address | > > | | | | > > | +-----------------+ | > > | > > | | > > +-----------------------------------------------------------------------+ > > > > TIPC is as an IPC and to designate the transport layer as an "L2.5" > > data link layer. 
When a TIPC node address has been accepted into a > > cluster and located in the same kernel (as we are trying to ensure in > > this patch), we are 100% certain it is legitimate and authentic. > > So, I cannot see any reason why we should not be allowed to short-cut > > for containers when security checks have already been done. > > Those drawings are nice, but unnecessary in my view. I think my text above is > sufficient as explanation of what we are doing. > > > > > Signed-off-by: Hoang Le <hoa...@de...> > > --- > > net/tipc/discover.c | 6 ++- > > net/tipc/msg.h | 10 +++++ > > net/tipc/name_distr.c | 2 +- > > net/tipc/node.c | 94 > > +++++++++++++++++++++++++++++++++++++++++-- > > net/tipc/node.h | 4 +- > > net/tipc/socket.c | 6 +-- > > 6 files changed, 111 insertions(+), 11 deletions(-) > > > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > > c138d68e8a69..98d4eea97eb7 100644 > > --- a/net/tipc/discover.c > > +++ b/net/tipc/discover.c > > @@ -38,6 +38,8 @@ > > #include "node.h" > > #include "discover.h" > > > > +#include <net/netns/hash.h> > > + > > /* min delay during bearer start up */ > > #define TIPC_DISC_INIT msecs_to_jiffies(125) > > /* max delay if bearer has no links */ @@ -94,6 +96,7 @@ static void > > tipc_disc_init_msg(struct net *net, struct sk_buff *skb, > > msg_set_dest_domain(hdr, dest_domain); > > msg_set_bc_netid(hdr, tn->net_id); > > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > > + msg_set_peer_net_hash(hdr, net_hash_mix(net)); > > We should not add the hash directly, since that would be exposing kernel > internal info to outside observers. > What we need to add is a *proof* that the sender knows the hash_mix in > question. So, it should XOR its hash_mix with a TIPC/kernel > global random value (also secret) and add the result to the message. The > receiver does XOR on the proof and the same random value, > and compares the result to the hash_mixes of the local name spaces to find a > match. 
> > > > msg_set_node_id(hdr, tipc_own_id(net)); } > > > > @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > > *skb, > > u8 peer_id[NODE_ID_LEN] = {0,}; > > u32 dst = msg_dest_domain(hdr); > > u32 net_id = msg_bc_netid(hdr); > > + u32 pnet_hash = msg_peer_net_hash(hdr); > > struct tipc_media_addr maddr; > > u32 src = msg_prevnode(hdr); > > u32 mtyp = msg_type(hdr); > > @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > > *skb, > > if (!tipc_in_scope(legacy, b->domain, src)) > > return; > > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > > - &maddr, &respond, &dupl_addr); > > + pnet_hash, &maddr, &respond, &dupl_addr); > > if (dupl_addr) > > disc_dupl_alert(b, src, &maddr); > > if (!respond) > > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > > 0daa6f04ca81..a8d0f28094f2 100644 > > --- a/net/tipc/msg.h > > +++ b/net/tipc/msg.h > > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > > tipc_msg *m, u16 n) > > msg_set_bits(m, 9, 16, 0xffff, n); > > } > > > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > > + msg_set_word(m, 9, n); > > +} > > + > > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > > + return msg_word(m, 9); > > +} > > + > > /* Word 10 > > */ > > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > > 836e629e8f4a..5feaf3b67380 100644 > > --- a/net/tipc/name_distr.c > > +++ b/net/tipc/name_distr.c > > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct > > sk_buff_head *list, > > struct publication *publ; > > struct sk_buff *skb = NULL; > > struct distr_item *item = NULL; > > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) > > +/ > > ITEM_SIZE) * ITEM_SIZE; > > u32 msg_rem = msg_dsz; > > > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > > 
c8f6177dd5a2..9a4ffd647701 100644 > > --- a/net/tipc/node.c > > +++ b/net/tipc/node.c > > @@ -45,6 +45,8 @@ > > #include "netlink.h" > > #include "trace.h" > > > > +#include <net/netns/hash.h> > > + > > #define INVALID_NODE_SIG 0x10000 > > #define NODE_CLEANUP_AFTER 300000 > > > > @@ -126,6 +128,7 @@ struct tipc_node { > > struct timer_list timer; > > struct rcu_head rcu; > > unsigned long delete_at; > > + struct net *pnet; > > }; > > > > /* Node FSM states and events: > > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > > tipc_node *n, int sel) > > return n->links[bearer_id].link; > > } > > > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > +connected) > > { > > struct tipc_node *n; > > int bearer_id; > > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, > > u32 sel) > > if (unlikely(!n)) > > return mtu; > > > > + /* Allow MAX_MSG_SIZE when building connection oriented message > > + * if they are in the same core network > > + */ > > + if (n->pnet && connected) { > > + tipc_node_put(n); > > + return mtu; > > + } > > + > > bearer_id = n->active_links[sel & 1]; > > if (likely(bearer_id != INVALID_BEARER_ID)) > > mtu = n->links[bearer_id].mtu; > > @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct > > tipc_node *n) } > > > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > > - u8 *peer_id, u16 capabilities) > > + u8 *peer_id, u16 capabilities, > > + u32 signature, u32 pnet_hash) > > { > > struct tipc_net *tn = net_generic(net, tipc_net_id); > > struct tipc_node *n, *temp_node; > > + struct tipc_net *tn_peer; > > struct tipc_link *l; > > + struct net *tmp; > > int bearer_id; > > int i; > > > > @@ -400,6 +414,23 @@ static struct tipc_node *tipc_node_create(struct > net > > *net, u32 addr, > > memcpy(&n->peer_id, peer_id, 16); > > n->net = net; > > n->capabilities = capabilities; > > + n->pnet = NULL; > > + 
for_each_net_rcu(tmp) { > > + /* Integrity checking whether node exists in namespace or not */ > > + if (net_hash_mix(tmp) != pnet_hash) > > + continue; > > See my comment above. > > > + tn_peer = net_generic(tmp, tipc_net_id); > > + if (!tn_peer) > > + continue; > > + > > + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) > > + continue; > > + > > + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { > > + n->pnet = tmp; > > + break; > > + } > > We even need to verify cluster ids. > > > + } > > kref_init(&n->kref); > > rwlock_init(&n->lock); > > INIT_HLIST_NODE(&n->hash); > > @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > > u32 addr) > > > > void tipc_node_check_dest(struct net *net, u32 addr, > > u8 *peer_id, struct tipc_bearer *b, > > - u16 capabilities, u32 signature, > > + u16 capabilities, u32 signature, u32 pnet_hash, > > struct tipc_media_addr *maddr, > > bool *respond, bool *dupl_addr) > > { > > @@ -998,7 +1029,8 @@ void tipc_node_check_dest(struct net *net, u32 > > addr, > > *dupl_addr = false; > > *respond = false; > > > > - n = tipc_node_create(net, addr, peer_id, capabilities); > > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > > + pnet_hash); > > if (!n) > > return; > > > > @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct > tipc_nl_msg > > *msg, struct tipc_node *node) > > return -EMSGSIZE; > > } > > > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) > > +{ > > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > > + struct sk_buff_head inputq; > > + > > + switch (msg_user(hdr)) { > > + case TIPC_LOW_IMPORTANCE: > > + case TIPC_MEDIUM_IMPORTANCE: > > + case TIPC_HIGH_IMPORTANCE: > > + case TIPC_CRITICAL_IMPORTANCE: > > + if (msg_connected(hdr) || msg_named(hdr)) { > > + spin_lock_init(&list->lock); > > + tipc_sk_rcv(pnet, list); > > + return; > > + } > > + if (msg_mcast(hdr)) { > > + skb_queue_head_init(&inputq); > > + tipc_sk_mcast_rcv(pnet, list, &inputq); 
> > + __skb_queue_purge(list); > > + skb_queue_purge(&inputq); > > + return; > > + } > > + return; > > + case MSG_FRAGMENTER: > > + if (tipc_msg_assemble(list)) { > > + skb_queue_head_init(&inputq); > > + tipc_sk_mcast_rcv(pnet, list, &inputq); > > + __skb_queue_purge(list); > > + skb_queue_purge(&inputq); > > + } > > + return; > > > + case LINK_PROTOCOL: > > + case NAME_DISTRIBUTOR: > > + case GROUP_PROTOCOL: > > + case CONN_MANAGER: > > GROUP_PROTOCOL and CONN_MANAGER messages must also follow the > wormhole path, otherwise they (e.g. CONN_ACK) will be out of synch > with the corresponding data messages, and probably result in poorer > throughput. > > Regards > ///jon > > > > + case TUNNEL_PROTOCOL: > > + case BCAST_PROTOCOL: > > + return; > > + default: > > + return; > > + }; > > +} > > + > > /** > > * tipc_node_xmit() is the general link level function for message sending > > * @net: the applicable net namespace > > @@ -1439,6 +1514,7 @@ int tipc_node_xmit(struct net *net, struct > > sk_buff_head *list, > > struct tipc_link_entry *le = NULL; > > struct tipc_node *n; > > struct sk_buff_head xmitq; > > + bool node_up = false; > > int bearer_id; > > int rc; > > > > @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct > > sk_buff_head *list, > > return -EHOSTUNREACH; > > } > > > > + node_up = node_is_up(n); > > + if (node_up && n->pnet && check_net(n->pnet)) { > > + /* xmit inner linux container */ > > + tipc_lxc_xmit(n->pnet, list); > > + if (likely(skb_queue_empty(list))) { > > + tipc_node_put(n); > > + return 0; > > + } > > + } > > + > > tipc_node_read_lock(n); > > bearer_id = n->active_links[selector & 1]; > > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > > a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..11eb95ce358b > > 100644 > > --- a/net/tipc/node.h > > +++ b/net/tipc/node.h > > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > > 
tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > > struct tipc_bearer *bearer, > > - u16 capabilities, u32 signature, > > + u16 capabilities, u32 signature, u32 pnet_hash, > > struct tipc_media_addr *maddr, > > bool *respond, bool *dupl_addr); > > void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 > +92,7 > > @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, > > u32 addr); void tipc_node_broadcast(struct net *net, struct sk_buff *skb); > > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 > > peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 > > port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > > +connected); > > bool tipc_node_is_up(struct net *net, u32 addr); > > u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff -- > git > > a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c > > 100644 > > --- a/net/tipc/socket.c > > +++ b/net/tipc/socket.c > > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > > struct tipc_sock *tsk, > > > > /* Build message as chain of buffers */ > > __skb_queue_head_init(&pkts); > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > if (unlikely(rc != dlen)) > > return rc; > > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, > > struct msghdr *m, size_t dlen) > > return rc; > > > > __skb_queue_head_init(&pkts); > > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > > if (unlikely(rc != dlen)) > > return rc; > > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct 
tipc_sock > > *tsk, u32 peer_port, > > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > > __skb_queue_purge(&sk->sk_write_queue); > > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > > -- > > 2.20.1 > |
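The XOR-based proof scheme Jon describes in the review above can be sketched in plain userspace C. This is only an illustration of the idea, not the patch code: the secret value, structure and function names below are hypothetical stand-ins for what would in reality be a boot-time kernel random value, `struct net`, and `net_hash_mix()`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical kernel-global secret; in a real implementation this
 * would be a random value generated once at boot and shared by all
 * name spaces in the same kernel. */
static const uint32_t tipc_ns_secret = 0x5eed5eed;

/* Stand-in for a local network name space. */
struct fake_ns {
	uint32_t hash_mix;	/* what net_hash_mix(net) would return */
	uint8_t node_id[16];
};

/* Sender: expose proof = hash_mix ^ secret, never the raw hash_mix. */
static uint32_t make_ns_proof(uint32_t hash_mix)
{
	return hash_mix ^ tipc_ns_secret;
}

/* Receiver: recover the claimed hash_mix and accept the peer as
 * kernel-local only if both the hash_mix and the node id match one
 * of our local name spaces. */
static const struct fake_ns *find_local_peer(const struct fake_ns *nss,
					     int count, uint32_t proof,
					     const uint8_t peer_id[16])
{
	uint32_t claimed = proof ^ tipc_ns_secret;
	int i;

	for (i = 0; i < count; i++) {
		if (nss[i].hash_mix == claimed &&
		    !memcmp(nss[i].node_id, peer_id, 16))
			return &nss[i];	/* wormhole may be opened */
	}
	return NULL;	/* treat peer as a normal remote node */
}
```

An off-host observer who sees the proof cannot recover the hash_mix without the secret, and a guessed proof is only accepted if it XOR-decodes to the hash_mix of an existing local name space with a matching node id, which addresses the spoofing scenario discussed earlier in the thread.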
From: Rune T. <ru...@in...> - 2019-10-18 14:27:55
|
And looking at Ubuntu's git repo for xenial, that patch was never backported. -----Original Message----- From: Rune Torgersen <ru...@in...> Sent: Friday, October 18, 2019 08:28 cat /proc/buddyinfo Node 0, zone DMA 2 2 1 1 3 0 1 0 1 1 3 Node 0, zone DMA32 9275 11572 137 6 0 0 0 0 0 0 0 Node 0, zone Normal 35213 15049 476 11 1 0 1 1 1 0 0 Node 1, zone Normal 5917 25209 490 8 6 3 1 1 0 0 0 And I'm aware of the checkin, as I reported it. I was under the impression that that was backported to the tipc driver in the Ubuntu 16.04 LTS 4.4.0 branch (around 4.4.0-110 I think). Either the fix was never incorporated in the 4.4.0 branch, or was reverted recently. -----Original Message----- From: Partha <par...@gm...> Sent: Friday, October 18, 2019 08:17 Hi Rune, Your system's memory seems to be fragmented, and you need to perform forced reclaim. Can you check the buddy for higher order allocations? cat /proc/buddyinfo BTW, I fixed this in: 57d5f64d83ab tipc: allocate user memory with GFP_KERNEL flag And it was Reported-by: Rune Torgersen <ru...@in...> It's in upstream v4.10-rc3-167-g57d5f64d83ab regards Partha On 2019-10-17 22:08, Rune Torgersen wrote: > Looks like I can kind of make it happen on one system now. > Stopping some programs (no pattern in which ones) makes it work, and starting some back up again makes it fail. > > Tipc nametable has 231 entries when failing and 183 entries when succeeding (however on a different system the nametable has 251 entries and it is not failing). > > How do I look for memory used by TIPC in the kernel? > > -----Original Message----- > From: Rune Torgersen <ru...@in...> > Sent: Thursday, October 17, 2019 14:53 > > > I will have to look for leaks next time I can make it happen. > I was trying stuff and shut down a different program that was unrelated (but had some TIPC sockets open on a different address (104)), and as soon as I did, the sends started working again. 
> > It is possible that one of those unrelated sockets has something stuck (as one of them was only ever used to send RDM messages but nothing ever reads it). > > Any suggestions as to what to start looking at (netstat, tipc, tipc_config or kernel params) to try to track it down? > > The problem with testing a patch (or using Ubuntu 18 LTS) is that we cannot reliably make it happen. > > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Thursday, October 17, 2019 14:35 > > > Hi Rune, > > Do you see any signs of general memory leak ("free") on your node? > > Anyway there can be no doubt that this happens because the big buffer pool is running empty. > > We fixed that in commit 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails") which was delivered to Linux 4.16. > > Do you have any opportunity to apply that patch and try it? > > BR > ///jon > >> -----Original Message----- >> From: Rune Torgersen <ru...@in...> >> Sent: 17-Oct-19 12:38 >> To: 'tip...@li...' <tipc- >> dis...@li...> >> Subject: [tipc-discussion] Error allocating memory error when sending RDM >> message >> >> Hi. >> >> I am running into an issue when sending SOCK_RDM or SOCK_DGRAM >> messages. On a system that has been up for a time (120+ days in this case), I >> cannot send any RDM/DGRAM type TIPC messages that are larger than about >> 16000 bytes (16033+ fails, 15100 and smaller still works). >> Any larger messages fail with error code 12: "Cannot allocate memory". >> >> The really odd thing is that it only happens on some connections and not others, >> on the same system (for example, sending to tipc node 103:1003 gets no error, >> while sending to 103:3 gets an error). >> When it gets into this state, it seems to happen forever on the same >> destination address, and not on others until the system is rebooted (restarting the >> server side application makes no difference). >> The sends are done on the same node as the receiver is on. 
>> >> Kernel is Ubuntu 16.04 LTS 4.4.0-150 in this case, also seen on 161. >> >> Nametable for 103: >> 103 2 2 <1.1.1:2328193343> 2328193344 cluster >> 103 3 3 <1.1.2:3153441800> 3153441801 cluster >> 103 5 5 <1.1.4:269294867> 269294868 cluster >> 103 1002 1002 <1.1.1:490133365> 490133366 cluster >> 103 1003 1003 <1.1.2:2552019732> 2552019733 cluster >> 103 1005 1005 <1.1.4:625110186> 625110187 cluster >> >> _______________________________________________ >> tipc-discussion mailing list >> tip...@li... >> https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > _______________________________________________ tipc-discussion mailing list tip...@li... https://lists.sourceforge.net/lists/listinfo/tipc-discussion |
From: Rune T. <ru...@in...> - 2019-10-18 13:28:44
|
cat /proc/buddyinfo Node 0, zone DMA 2 2 1 1 3 0 1 0 1 1 3 Node 0, zone DMA32 9275 11572 137 6 0 0 0 0 0 0 0 Node 0, zone Normal 35213 15049 476 11 1 0 1 1 1 0 0 Node 1, zone Normal 5917 25209 490 8 6 3 1 1 0 0 0 And I'm aware of the checkin, as I reported it. I was under the impression that that was backported to the tipc driver in the Ubuntu 16.04 LTS 4.4.0 branch (around 4.4.0-110 I think). Either the fix was never incorporated in the 4.4.0 branch, or was reverted recently. -----Original Message----- From: Partha <par...@gm...> Sent: Friday, October 18, 2019 08:17 Hi Rune, Your system's memory seems to be fragmented, and you need to perform forced reclaim. Can you check the buddy for higher order allocations? cat /proc/buddyinfo BTW, I fixed this in: 57d5f64d83ab tipc: allocate user memory with GFP_KERNEL flag And it was Reported-by: Rune Torgersen <ru...@in...> It's in upstream v4.10-rc3-167-g57d5f64d83ab regards Partha On 2019-10-17 22:08, Rune Torgersen wrote: > Looks like I can kind of make it happen on one system now. > Stopping some programs (no pattern in which ones) makes it work, and starting some back up again makes it fail. > > Tipc nametable has 231 entries when failing and 183 entries when succeeding (however on a different system the nametable has 251 entries and it is not failing). > > How do I look for memory used by TIPC in the kernel? > > -----Original Message----- > From: Rune Torgersen <ru...@in...> > Sent: Thursday, October 17, 2019 14:53 > > > I will have to look for leaks next time I can make it happen. > I was trying stuff and shut down a different program that was unrelated (but had some TIPC sockets open on a different address (104)), and as soon as I did, the sends started working again. > > It is possible that one of those unrelated sockets has something stuck (as one of them was only ever used to send RDM messages but nothing ever reads it). 
> > Any suggestions as to what to start looking at (netstat, tipc, tipc_config or kernel params) to try to track it down? > > The problem with testing a patch (or using Ubuntu 18 LTS) is that we cannot reliably make it happen. > > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Thursday, October 17, 2019 14:35 > > > Hi Rune, > > Do you see any signs of general memory leak ("free") on your node? > > Anyway there can be no doubt that this happens because the big buffer pool is running empty. > > We fixed that in commit 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails") which was delivered to Linux 4.16. > > Do you have any opportunity to apply that patch and try it? > > BR > ///jon > >> -----Original Message----- >> From: Rune Torgersen <ru...@in...> >> Sent: 17-Oct-19 12:38 >> To: 'tip...@li...' <tipc- >> dis...@li...> >> Subject: [tipc-discussion] Error allocating memory error when sending RDM >> message >> >> Hi. >> >> I am running into an issue when sending SOCK_RDM or SOCK_DGRAM >> messages. On a system that has been up for a time (120+ days in this case), I >> cannot send any RDM/DGRAM type TIPC messages that are larger than about >> 16000 bytes (16033+ fails, 15100 and smaller still works). >> Any larger messages fail with error code 12: "Cannot allocate memory". >> >> The really odd thing is that it only happens on some connections and not others, >> on the same system (for example, sending to tipc node 103:1003 gets no error, >> while sending to 103:3 gets an error). >> When it gets into this state, it seems to happen forever on the same >> destination address, and not on others until the system is rebooted (restarting the >> server side application makes no difference). >> The sends are done on the same node as the receiver is on. >> >> Kernel is Ubuntu 16.04 LTS 4.4.0-150 in this case, also seen on 161. 
>> >> Nametable for 103: >> 103 2 2 <1.1.1:2328193343> 2328193344 cluster >> 103 3 3 <1.1.2:3153441800> 3153441801 cluster >> 103 5 5 <1.1.4:269294867> 269294868 cluster >> 103 1002 1002 <1.1.1:490133365> 490133366 cluster >> 103 1003 1003 <1.1.2:2552019732> 2552019733 cluster >> 103 1005 1005 <1.1.4:625110186> 625110187 cluster >> >> _______________________________________________ >> tipc-discussion mailing list >> tip...@li... >> https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > |
From: Partha <par...@gm...> - 2019-10-18 13:17:45
|
Hi Rune, Your system's memory seems to be fragmented, and you need to perform forced reclaim. Can you check the buddy for higher order allocations? cat /proc/buddyinfo BTW, I fixed this in: 57d5f64d83ab tipc: allocate user memory with GFP_KERNEL flag And it was Reported-by: Rune Torgersen <ru...@in...> It's in upstream v4.10-rc3-167-g57d5f64d83ab regards Partha On 2019-10-17 22:08, Rune Torgersen wrote: > Looks like I can kind of make it happen on one system now. > Stopping some programs (no pattern in which ones) makes it work, and starting some back up again makes it fail. > > Tipc nametable has 231 entries when failing and 183 entries when succeeding (however on a different system the nametable has 251 entries and it is not failing). > > How do I look for memory used by TIPC in the kernel? > > -----Original Message----- > From: Rune Torgersen <ru...@in...> > Sent: Thursday, October 17, 2019 14:53 > > > I will have to look for leaks next time I can make it happen. > I was trying stuff and shut down a different program that was unrelated (but had some TIPC sockets open on a different address (104)), and as soon as I did, the sends started working again. > > It is possible that one of those unrelated sockets has something stuck (as one of them was only ever used to send RDM messages but nothing ever reads it). > > Any suggestions as to what to start looking at (netstat, tipc, tipc_config or kernel params) to try to track it down? > > The problem with testing a patch (or using Ubuntu 18 LTS) is that we cannot reliably make it happen. > > -----Original Message----- > From: Jon Maloy <jon...@er...> > Sent: Thursday, October 17, 2019 14:35 > > > Hi Rune, > > Do you see any signs of general memory leak ("free") on your node? > > Anyway there can be no doubt that this happens because the big buffer pool is running empty. > > We fixed that in commit 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails") which was delivered to Linux 4.16. 
> > Do you have any opportunity to apply that patch and try it? > > BR > ///jon > >> -----Original Message----- >> From: Rune Torgersen <ru...@in...> >> Sent: 17-Oct-19 12:38 >> To: 'tip...@li...' <tipc- >> dis...@li...> >> Subject: [tipc-discussion] Error allocating memory error when sending RDM >> message >> >> Hi. >> >> I am running into an issue when sending SOCK_RDM or SOCK_DGRAM >> messages. On a system that has been up for a time (120+ days in this case), I >> cannot send any RDM/DGRAM type TIPC messages that are larger than about >> 16000 bytes (16033+ fails, 15100 and smaller still works). >> Any larger messages fail with error code 12: "Cannot allocate memory". >> >> The really odd thing is that it only happens on some connections and not others, >> on the same system (for example, sending to tipc node 103:1003 gets no error, >> while sending to 103:3 gets an error). >> When it gets into this state, it seems to happen forever on the same >> destination address, and not on others until the system is rebooted (restarting the >> server side application makes no difference). >> The sends are done on the same node as the receiver is on. >> >> Kernel is Ubuntu 16.04 LTS 4.4.0-150 in this case, also seen on 161. 
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion > |
From: Hoang L. <hoa...@de...> - 2019-10-18 08:25:26
|
Hi Jon, Thanks for the good description. However, w.r.t. your comment "We even need to verify cluster ids.", I'm still unclear why we need to isolate cluster ids here. I guess the node had already been accepted when it bypassed the function tipc_disc_rcv. Then, we just check to apply the new mechanism for kernel-local namespaces. Regards, Hoang -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Friday, October 18, 2019 2:20 AM To: Hoang Huu Le <hoa...@de...>; ma...@do...; tip...@de...; tip...@li... Subject: RE: [net-next] tipc: improve throughput between nodes in netns Hi Hoang, We need a very good log text to justify this. My proposal: "Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network name spaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a name space on the same host, we just replace the name space pointer with the receiving node/name space's ditto, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go through the full network stack. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. 
This is no different for nodes/name spaces on the same host; they have to auto-discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel-local node has no other way to join a cluster than any other node, and has to obey the policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel-local name space does it choose to let user data messages, and only those, take the crossover path to the receiving node/name space. - If the receiving node/name space is removed, its name space pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. - To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash_mix of the sending node/name space for this purpose, since it can be accessed directly by all other name spaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local name spaces' hash_mixes. If it finds a match, that, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local name space, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster-local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence should be given more freedom to shortcut the lower protocol layers than other protocols. Regarding traceability, we should note that since commit 6c9081a3915d ("add loopback device tracing") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. 
This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-name space messaging can easily be tracked." I also think there should be a "Suggested-by: Jon Maloy <jon...@er...>" at the bottom of the patch. See more comments below. > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 17-Oct-19 06:10 > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > de...@de... > Subject: [net-next] tipc: improve throughput between nodes in netns > > Introduce traffic cross namespaces transmission as intranode. > By this way, throughput between nodes in namespace as fast as local. > Looks though the architectural view of TIPC, the new TIPC mechanism for > containers will not introduce any security or breaking the current policies at > all: > > 1/ Extranode: > > Node A Node B > +-----------------+ +-----------------+ > | TIPC | | TIPC | > | Application | | Application | > |-----------------| |-----------------| > | | | | > | TIPC |TIPC address TIPC address| TIPC | > | | | | > |-----------------| |-----------------| > | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | > | Service | | Service | > +-----------------+ +-----------------+ > NIC NIC > +---------------- Bearer Transport ----------------+ > > 2/ Intranode: > Node A Node A > +-----------------+ +-----------------+ > | TIPC | | TIPC | > | Application | | Application | > |-----------------| |-----------------| > | | | | > | TIPC |TIPC address TIPC address| TIPC | > | | | | > +-------+---------+ +--------+--------+ > +--------------------------------------------------+ > > 3/ For container (same as extranode): > +-----------------------------------------------------------------------+ > | Container Container | > | +-----------------+ +-----------------+ > | +-----------------+ | > | | TIPC | | TIPC | | > | | Application | | Application | | > | |-----------------| |-----------------| > | |-----------------| | > | | | | | | > | | TIPC 
|TIPC address TIPC address| TIPC | | > | | | | | | > | |-----------------| |-----------------| > | |-----------------| | > | | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | | > | | Service | | Service | | > | +-----------------+ +-----------------+ > | +-----------------+ | > | (vNIC) (vNIC) | > | + Host Kernel (KVM, Native) + | > | +----------------Bearer Transport-------------------+ | > | (bridge, OpenVSwitch) | > | + | > | +-------+---------+ | > | | L2 or L3 Bearer | | > | | Service | | > | |-----------------| | > | | | | > | | TIPC |TIPC address | > | | | | > | |-----------------| | > | | TIPC | | > | | Application | | > | +-----------------+ | > | > | | > +-----------------------------------------------------------------------+ > > 4/ New design for container (same as intranode): > +-----------------------------------------------------------------------+ > | Container Container | > | +-----------------+ +-----------------+ > | +-----------------+ | > | | TIPC | | TIPC | | > | | Application | | Application | | > | |-----------------| |-----------------| > | |-----------------| | > | | | | | | > | | TIPC |TIPC address TIPC address| TIPC | | > | | | | | | > | +-------+---------+ +--------+--------+ > | +-------+---------+ | > | + Host Kernel (KVM, Native) + | > | +-------------------------+------------------------+ | > | +-------------+ | > | +-----------------+ | | > | | TIPC | | | > | | Application | | | > | |-----------------| | | > | | +----+ | > | | TIPC |TIPC address | > | | | | > | +-----------------+ | > | > | | > +-----------------------------------------------------------------------+ > > TIPC is as an IPC and to designate the transport layer as an "L2.5" > data link layer. When a TIPC node address has been accepted into a cluster > and located in the same kernel (as we are trying to ensure in this patch), we > are 100% certain it is legitimate and authentic. 
> So, I cannot see any reason why we should not be allowed to short-cut for > containers when security checks have already been done. Those drawings are nice, but unnecessary in my view. I think my text above is sufficient as explanation of what we are doing. > > Signed-off-by: Hoang Le <hoa...@de...> > --- > net/tipc/discover.c | 6 ++- > net/tipc/msg.h | 10 +++++ > net/tipc/name_distr.c | 2 +- > net/tipc/node.c | 94 > +++++++++++++++++++++++++++++++++++++++++-- > net/tipc/node.h | 4 +- > net/tipc/socket.c | 6 +-- > 6 files changed, 111 insertions(+), 11 deletions(-) > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > c138d68e8a69..98d4eea97eb7 100644 > --- a/net/tipc/discover.c > +++ b/net/tipc/discover.c > @@ -38,6 +38,8 @@ > #include "node.h" > #include "discover.h" > > +#include <net/netns/hash.h> > + > /* min delay during bearer start up */ > #define TIPC_DISC_INIT msecs_to_jiffies(125) > /* max delay if bearer has no links */ > @@ -94,6 +96,7 @@ static void tipc_disc_init_msg(struct net *net, struct > sk_buff *skb, > msg_set_dest_domain(hdr, dest_domain); > msg_set_bc_netid(hdr, tn->net_id); > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > + msg_set_peer_net_hash(hdr, net_hash_mix(net)); We should not add the hash directly, since that would be exposing kernel internal info to outside observers. What we need to add is a *proof* that the sender knows the hash_mix in question. So, it should XOR its hash_mix with a TIPC/kernel global random value (also secret) and add the result to the message. The receiver does XOR on the proof and the same random value, and compares the result to the hash_mixes of the local name spaces to find a match. 
> msg_set_node_id(hdr, tipc_own_id(net)); } > > @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > *skb, > u8 peer_id[NODE_ID_LEN] = {0,}; > u32 dst = msg_dest_domain(hdr); > u32 net_id = msg_bc_netid(hdr); > + u32 pnet_hash = msg_peer_net_hash(hdr); > struct tipc_media_addr maddr; > u32 src = msg_prevnode(hdr); > u32 mtyp = msg_type(hdr); > @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > *skb, > if (!tipc_in_scope(legacy, b->domain, src)) > return; > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > - &maddr, &respond, &dupl_addr); > + pnet_hash, &maddr, &respond, &dupl_addr); > if (dupl_addr) > disc_dupl_alert(b, src, &maddr); > if (!respond) > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > 0daa6f04ca81..a8d0f28094f2 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > tipc_msg *m, u16 n) > msg_set_bits(m, 9, 16, 0xffff, n); > } > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > + msg_set_word(m, 9, n); > +} > + > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > + return msg_word(m, 9); > +} > + > /* Word 10 > */ > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > 836e629e8f4a..5feaf3b67380 100644 > --- a/net/tipc/name_distr.c > +++ b/net/tipc/name_distr.c > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct > sk_buff_head *list, > struct publication *publ; > struct sk_buff *skb = NULL; > struct distr_item *item = NULL; > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) > +/ > ITEM_SIZE) * ITEM_SIZE; > u32 msg_rem = msg_dsz; > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > c8f6177dd5a2..9a4ffd647701 100644 > --- a/net/tipc/node.c > +++ b/net/tipc/node.c > @@ -45,6 +45,8 @@ > #include "netlink.h" > 
#include "trace.h" > > +#include <net/netns/hash.h> > + > #define INVALID_NODE_SIG 0x10000 > #define NODE_CLEANUP_AFTER 300000 > > @@ -126,6 +128,7 @@ struct tipc_node { > struct timer_list timer; > struct rcu_head rcu; > unsigned long delete_at; > + struct net *pnet; > }; > > /* Node FSM states and events: > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > tipc_node *n, int sel) > return n->links[bearer_id].link; > } > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected) > { > struct tipc_node *n; > int bearer_id; > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, > u32 sel) > if (unlikely(!n)) > return mtu; > > + /* Allow MAX_MSG_SIZE when building connection oriented message > + * if they are in the same core network > + */ > + if (n->pnet && connected) { > + tipc_node_put(n); > + return mtu; > + } > + > bearer_id = n->active_links[sel & 1]; > if (likely(bearer_id != INVALID_BEARER_ID)) > mtu = n->links[bearer_id].mtu; > @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct > tipc_node *n) } > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > - u8 *peer_id, u16 capabilities) > + u8 *peer_id, u16 capabilities, > + u32 signature, u32 pnet_hash) > { > struct tipc_net *tn = net_generic(net, tipc_net_id); > struct tipc_node *n, *temp_node; > + struct tipc_net *tn_peer; > struct tipc_link *l; > + struct net *tmp; > int bearer_id; > int i; > > @@ -400,6 +414,23 @@ static struct tipc_node *tipc_node_create(struct net > *net, u32 addr, > memcpy(&n->peer_id, peer_id, 16); > n->net = net; > n->capabilities = capabilities; > + n->pnet = NULL; > + for_each_net_rcu(tmp) { > + /* Integrity checking whether node exists in namespace or not */ > + if (net_hash_mix(tmp) != pnet_hash) > + continue; See my comment above. 
> + tn_peer = net_generic(tmp, tipc_net_id); > + if (!tn_peer) > + continue; > + > + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) > + continue; > + > + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { > + n->pnet = tmp; > + break; > + } We even need to verify cluster ids. > + } > kref_init(&n->kref); > rwlock_init(&n->lock); > INIT_HLIST_NODE(&n->hash); > @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > u32 addr) > > void tipc_node_check_dest(struct net *net, u32 addr, > u8 *peer_id, struct tipc_bearer *b, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 pnet_hash, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr) > { > @@ -998,7 +1029,8 @@ void tipc_node_check_dest(struct net *net, u32 > addr, > *dupl_addr = false; > *respond = false; > > - n = tipc_node_create(net, addr, peer_id, capabilities); > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > + pnet_hash); > if (!n) > return; > > @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct tipc_nl_msg > *msg, struct tipc_node *node) > return -EMSGSIZE; > } > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) > +{ > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > + struct sk_buff_head inputq; > + > + switch (msg_user(hdr)) { > + case TIPC_LOW_IMPORTANCE: > + case TIPC_MEDIUM_IMPORTANCE: > + case TIPC_HIGH_IMPORTANCE: > + case TIPC_CRITICAL_IMPORTANCE: > + if (msg_connected(hdr) || msg_named(hdr)) { > + spin_lock_init(&list->lock); > + tipc_sk_rcv(pnet, list); > + return; > + } > + if (msg_mcast(hdr)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + return; > + } > + return; > + case MSG_FRAGMENTER: > + if (tipc_msg_assemble(list)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + } > + return; > 
+ case LINK_PROTOCOL: > + case NAME_DISTRIBUTOR: > + case GROUP_PROTOCOL: > + case CONN_MANAGER: GROUP_PROTOCOL and CONN_MANAGER messages must also follow the wormhole path, otherwise they (e.g. CONN_ACK) will be out of synch with the corresponding data messages, and probably result in poorer throughput. Regards ///jon > + case TUNNEL_PROTOCOL: > + case BCAST_PROTOCOL: > + return; > + default: > + return; > + }; > +} > + > /** > * tipc_node_xmit() is the general link level function for message sending > * @net: the applicable net namespace > @@ -1439,6 +1514,7 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > struct tipc_link_entry *le = NULL; > struct tipc_node *n; > struct sk_buff_head xmitq; > + bool node_up = false; > int bearer_id; > int rc; > > @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > return -EHOSTUNREACH; > } > > + node_up = node_is_up(n); > + if (node_up && n->pnet && check_net(n->pnet)) { > + /* xmit inner linux container */ > + tipc_lxc_xmit(n->pnet, list); > + if (likely(skb_queue_empty(list))) { > + tipc_node_put(n); > + return 0; > + } > + } > + > tipc_node_read_lock(n); > bearer_id = n->active_links[selector & 1]; > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..11eb95ce358b > 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > struct tipc_bearer *bearer, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 pnet_hash, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr); > void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 > @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, > u32 addr); void tipc_node_broadcast(struct net *net, 
struct sk_buff *skb); > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 > peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 > port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected); > bool tipc_node_is_up(struct net *net, u32 addr); > u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git > a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c > 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > struct tipc_sock *tsk, > > /* Build message as chain of buffers */ > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, > struct msghdr *m, size_t dlen) > return rc; > > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock > *tsk, u32 peer_port, > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > -- > 2.20.1 |
From: Jon M. <jon...@er...> - 2019-10-17 21:53:33
|
Hi Hoang, We need a very good log text to justify this. My proposal: "Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra-node performance, both regarding throughput and latency. We now introduce a similar mechanism for TIPC data traffic across network name spaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network name space pointer. However, once we have reliably established that the receiving node is represented by a name space on the same host, we just replace the name space pointer with the receiving node/name space's ditto, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go through the full network stack. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic. To meet any security concerns, the following should be noted. - All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/name spaces on the same host; they have to auto-discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel-local node has no other way to join a cluster than any other node, and has to obey the policies set in the IP or device layers of the stack. - Only when a sender has established with 100% certainty that the peer node is located in a kernel-local name space does it choose to let user data messages, and only those, take the crossover path to the receiving node/name space. 
- If the receiving node/name space is removed, its name space pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone. - To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash_mix of the sending node/name space for this purpose, since it can be accessed directly by all other name spaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against all the local name spaces' hash_mixes. If it finds a match, that, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local name space, and a wormhole can be opened. - We should also consider that TIPC is intended to be a cluster-local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence should be given more freedom to shortcut the lower protocol layers than other protocols. Regarding traceability, we should note that since commit 6c9081a3915d ("add loopback device tracing") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-name space messaging can easily be tracked." I also think there should be a "Suggested-by: Jon Maloy <jon...@er...>" at the bottom of the patch. See more comments below. > -----Original Message----- > From: Hoang Le <hoa...@de...> > Sent: 17-Oct-19 06:10 > To: Jon Maloy <jon...@er...>; ma...@do...; tipc- > de...@de... > Subject: [net-next] tipc: improve throughput between nodes in netns > > Introduce traffic cross namespaces transmission as intranode. > By this way, throughput between nodes in namespace as fast as local. 
> Looks though the architectural view of TIPC, the new TIPC mechanism for > containers will not introduce any security or breaking the current policies at > all: > > 1/ Extranode: > > Node A Node B > +-----------------+ +-----------------+ > | TIPC | | TIPC | > | Application | | Application | > |-----------------| |-----------------| > | | | | > | TIPC |TIPC address TIPC address| TIPC | > | | | | > |-----------------| |-----------------| > | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | > | Service | | Service | > +-----------------+ +-----------------+ > NIC NIC > +---------------- Bearer Transport ----------------+ > > 2/ Intranode: > Node A Node A > +-----------------+ +-----------------+ > | TIPC | | TIPC | > | Application | | Application | > |-----------------| |-----------------| > | | | | > | TIPC |TIPC address TIPC address| TIPC | > | | | | > +-------+---------+ +--------+--------+ > +--------------------------------------------------+ > > 3/ For container (same as extranode): > +-----------------------------------------------------------------------+ > | Container Container | > | +-----------------+ +-----------------+ > | +-----------------+ | > | | TIPC | | TIPC | | > | | Application | | Application | | > | |-----------------| |-----------------| > | |-----------------| | > | | | | | | > | | TIPC |TIPC address TIPC address| TIPC | | > | | | | | | > | |-----------------| |-----------------| > | |-----------------| | > | | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer | | > | | Service | | Service | | > | +-----------------+ +-----------------+ > | +-----------------+ | > | (vNIC) (vNIC) | > | + Host Kernel (KVM, Native) + | > | +----------------Bearer Transport-------------------+ | > | (bridge, OpenVSwitch) | > | + | > | +-------+---------+ | > | | L2 or L3 Bearer | | > | | Service | | > | |-----------------| | > | | | | > | | TIPC |TIPC address | > | | | | > | |-----------------| | > | | TIPC | | > | | Application | 
| > | +-----------------+ | > | > | | > +-----------------------------------------------------------------------+ > > 4/ New design for container (same as intranode): > +-----------------------------------------------------------------------+ > | Container Container | > | +-----------------+ +-----------------+ > | +-----------------+ | > | | TIPC | | TIPC | | > | | Application | | Application | | > | |-----------------| |-----------------| > | |-----------------| | > | | | | | | > | | TIPC |TIPC address TIPC address| TIPC | | > | | | | | | > | +-------+---------+ +--------+--------+ > | +-------+---------+ | > | + Host Kernel (KVM, Native) + | > | +-------------------------+------------------------+ | > | +-------------+ | > | +-----------------+ | | > | | TIPC | | | > | | Application | | | > | |-----------------| | | > | | +----+ | > | | TIPC |TIPC address | > | | | | > | +-----------------+ | > | > | | > +-----------------------------------------------------------------------+ > > TIPC is as an IPC and to designate the transport layer as an "L2.5" > data link layer. When a TIPC node address has been accepted into a cluster > and located in the same kernel (as we are trying to ensure in this patch), we > are 100% certain it is legitimate and authentic. > So, I cannot see any reason why we should not be allowed to short-cut for > containers when security checks have already been done. Those drawings are nice, but unnecessary in my view. I think my text above is sufficient as explanation of what we are doing. 
> > Signed-off-by: Hoang Le <hoa...@de...> > --- > net/tipc/discover.c | 6 ++- > net/tipc/msg.h | 10 +++++ > net/tipc/name_distr.c | 2 +- > net/tipc/node.c | 94 > +++++++++++++++++++++++++++++++++++++++++-- > net/tipc/node.h | 4 +- > net/tipc/socket.c | 6 +-- > 6 files changed, 111 insertions(+), 11 deletions(-) > > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index > c138d68e8a69..98d4eea97eb7 100644 > --- a/net/tipc/discover.c > +++ b/net/tipc/discover.c > @@ -38,6 +38,8 @@ > #include "node.h" > #include "discover.h" > > +#include <net/netns/hash.h> > + > /* min delay during bearer start up */ > #define TIPC_DISC_INIT msecs_to_jiffies(125) > /* max delay if bearer has no links */ > @@ -94,6 +96,7 @@ static void tipc_disc_init_msg(struct net *net, struct > sk_buff *skb, > msg_set_dest_domain(hdr, dest_domain); > msg_set_bc_netid(hdr, tn->net_id); > b->media->addr2msg(msg_media_addr(hdr), &b->addr); > + msg_set_peer_net_hash(hdr, net_hash_mix(net)); We should not add the hash directly, since that would be exposing kernel internal info to outside observers. What we need to add is a *proof* that the sender knows the hash_mix in question. So, it should XOR its hash_mix with a TIPC/kernel global random value (also secret) and add the result to the message. The receiver does XOR on the proof and the same random value, and compares the result to the hash_mixes of the local name spaces to find a match. 
> msg_set_node_id(hdr, tipc_own_id(net)); } > > @@ -200,6 +203,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > *skb, > u8 peer_id[NODE_ID_LEN] = {0,}; > u32 dst = msg_dest_domain(hdr); > u32 net_id = msg_bc_netid(hdr); > + u32 pnet_hash = msg_peer_net_hash(hdr); > struct tipc_media_addr maddr; > u32 src = msg_prevnode(hdr); > u32 mtyp = msg_type(hdr); > @@ -242,7 +246,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff > *skb, > if (!tipc_in_scope(legacy, b->domain, src)) > return; > tipc_node_check_dest(net, src, peer_id, b, caps, signature, > - &maddr, &respond, &dupl_addr); > + pnet_hash, &maddr, &respond, &dupl_addr); > if (dupl_addr) > disc_dupl_alert(b, src, &maddr); > if (!respond) > diff --git a/net/tipc/msg.h b/net/tipc/msg.h index > 0daa6f04ca81..a8d0f28094f2 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -973,6 +973,16 @@ static inline void msg_set_grp_remitted(struct > tipc_msg *m, u16 n) > msg_set_bits(m, 9, 16, 0xffff, n); > } > > +static inline void msg_set_peer_net_hash(struct tipc_msg *m, u32 n) { > + msg_set_word(m, 9, n); > +} > + > +static inline u32 msg_peer_net_hash(struct tipc_msg *m) { > + return msg_word(m, 9); > +} > + > /* Word 10 > */ > static inline u16 msg_grp_evt(struct tipc_msg *m) diff --git > a/net/tipc/name_distr.c b/net/tipc/name_distr.c index > 836e629e8f4a..5feaf3b67380 100644 > --- a/net/tipc/name_distr.c > +++ b/net/tipc/name_distr.c > @@ -146,7 +146,7 @@ static void named_distribute(struct net *net, struct > sk_buff_head *list, > struct publication *publ; > struct sk_buff *skb = NULL; > struct distr_item *item = NULL; > - u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0) - INT_H_SIZE) / > + u32 msg_dsz = ((tipc_node_get_mtu(net, dnode, 0, false) - INT_H_SIZE) > +/ > ITEM_SIZE) * ITEM_SIZE; > u32 msg_rem = msg_dsz; > > diff --git a/net/tipc/node.c b/net/tipc/node.c index > c8f6177dd5a2..9a4ffd647701 100644 > --- a/net/tipc/node.c > +++ b/net/tipc/node.c > @@ -45,6 +45,8 @@ > #include "netlink.h" > 
#include "trace.h" > > +#include <net/netns/hash.h> > + > #define INVALID_NODE_SIG 0x10000 > #define NODE_CLEANUP_AFTER 300000 > > @@ -126,6 +128,7 @@ struct tipc_node { > struct timer_list timer; > struct rcu_head rcu; > unsigned long delete_at; > + struct net *pnet; > }; > > /* Node FSM states and events: > @@ -184,7 +187,7 @@ static struct tipc_link *node_active_link(struct > tipc_node *n, int sel) > return n->links[bearer_id].link; > } > > -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel) > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected) > { > struct tipc_node *n; > int bearer_id; > @@ -194,6 +197,14 @@ int tipc_node_get_mtu(struct net *net, u32 addr, > u32 sel) > if (unlikely(!n)) > return mtu; > > + /* Allow MAX_MSG_SIZE when building connection oriented message > + * if they are in the same core network > + */ > + if (n->pnet && connected) { > + tipc_node_put(n); > + return mtu; > + } > + > bearer_id = n->active_links[sel & 1]; > if (likely(bearer_id != INVALID_BEARER_ID)) > mtu = n->links[bearer_id].mtu; > @@ -361,11 +372,14 @@ static void tipc_node_write_unlock(struct > tipc_node *n) } > > static struct tipc_node *tipc_node_create(struct net *net, u32 addr, > - u8 *peer_id, u16 capabilities) > + u8 *peer_id, u16 capabilities, > + u32 signature, u32 pnet_hash) > { > struct tipc_net *tn = net_generic(net, tipc_net_id); > struct tipc_node *n, *temp_node; > + struct tipc_net *tn_peer; > struct tipc_link *l; > + struct net *tmp; > int bearer_id; > int i; > > @@ -400,6 +414,23 @@ static struct tipc_node *tipc_node_create(struct net > *net, u32 addr, > memcpy(&n->peer_id, peer_id, 16); > n->net = net; > n->capabilities = capabilities; > + n->pnet = NULL; > + for_each_net_rcu(tmp) { > + /* Integrity checking whether node exists in namespace or not */ > + if (net_hash_mix(tmp) != pnet_hash) > + continue; See my comment above. 
> + tn_peer = net_generic(tmp, tipc_net_id); > + if (!tn_peer) > + continue; > + > + if ((tn_peer->random & 0x7fff) != (signature & 0x7fff)) > + continue; > + > + if (!memcmp(n->peer_id, tn_peer->node_id, NODE_ID_LEN)) { > + n->pnet = tmp; > + break; > + } We even need to verify cluster ids. > + } > kref_init(&n->kref); > rwlock_init(&n->lock); > INIT_HLIST_NODE(&n->hash); > @@ -979,7 +1010,7 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, > u32 addr) > > void tipc_node_check_dest(struct net *net, u32 addr, > u8 *peer_id, struct tipc_bearer *b, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 pnet_hash, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr) > { > @@ -998,7 +1029,8 @@ void tipc_node_check_dest(struct net *net, u32 > addr, > *dupl_addr = false; > *respond = false; > > - n = tipc_node_create(net, addr, peer_id, capabilities); > + n = tipc_node_create(net, addr, peer_id, capabilities, signature, > + pnet_hash); > if (!n) > return; > > @@ -1424,6 +1456,49 @@ static int __tipc_nl_add_node(struct tipc_nl_msg > *msg, struct tipc_node *node) > return -EMSGSIZE; > } > > +static void tipc_lxc_xmit(struct net *pnet, struct sk_buff_head *list) > +{ > + struct tipc_msg *hdr = buf_msg(skb_peek(list)); > + struct sk_buff_head inputq; > + > + switch (msg_user(hdr)) { > + case TIPC_LOW_IMPORTANCE: > + case TIPC_MEDIUM_IMPORTANCE: > + case TIPC_HIGH_IMPORTANCE: > + case TIPC_CRITICAL_IMPORTANCE: > + if (msg_connected(hdr) || msg_named(hdr)) { > + spin_lock_init(&list->lock); > + tipc_sk_rcv(pnet, list); > + return; > + } > + if (msg_mcast(hdr)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + return; > + } > + return; > + case MSG_FRAGMENTER: > + if (tipc_msg_assemble(list)) { > + skb_queue_head_init(&inputq); > + tipc_sk_mcast_rcv(pnet, list, &inputq); > + __skb_queue_purge(list); > + skb_queue_purge(&inputq); > + } > + return; > 
+ case LINK_PROTOCOL: > + case NAME_DISTRIBUTOR: > + case GROUP_PROTOCOL: > + case CONN_MANAGER: GROUP_PROTOCOL and CONN_MANAGER messages must also follow the wormhole path; otherwise they (e.g. CONN_ACK) will be out of sync with the corresponding data messages, and probably result in poorer throughput. Regards ///jon > + case TUNNEL_PROTOCOL: > + case BCAST_PROTOCOL: > + return; > + default: > + return; > + }; > +} > + > /** > * tipc_node_xmit() is the general link level function for message sending > * @net: the applicable net namespace > @@ -1439,6 +1514,7 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > struct tipc_link_entry *le = NULL; > struct tipc_node *n; > struct sk_buff_head xmitq; > + bool node_up = false; > int bearer_id; > int rc; > > @@ -1455,6 +1531,16 @@ int tipc_node_xmit(struct net *net, struct > sk_buff_head *list, > return -EHOSTUNREACH; > } > > + node_up = node_is_up(n); > + if (node_up && n->pnet && check_net(n->pnet)) { > + /* xmit inner linux container */ > + tipc_lxc_xmit(n->pnet, list); > + if (likely(skb_queue_empty(list))) { > + tipc_node_put(n); > + return 0; > + } > + } > + > tipc_node_read_lock(n); > bearer_id = n->active_links[selector & 1]; > if (unlikely(bearer_id == INVALID_BEARER_ID)) { diff --git > a/net/tipc/node.h b/net/tipc/node.h index 291d0ecd4101..11eb95ce358b > 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -75,7 +75,7 @@ u32 tipc_node_get_addr(struct tipc_node *node); > u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void > tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, > struct tipc_bearer *bearer, > - u16 capabilities, u32 signature, > + u16 capabilities, u32 signature, u32 pnet_hash, > struct tipc_media_addr *maddr, > bool *respond, bool *dupl_addr); > void tipc_node_delete_links(struct net *net, int bearer_id); @@ -92,7 +92,7 > @@ void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, > u32 addr); void tipc_node_broadcast(struct net *net, 
struct sk_buff *skb); > int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 > peer_port); void tipc_node_remove_conn(struct net *net, u32 dnode, u32 > port); -int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel); > +int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool > +connected); > bool tipc_node_is_up(struct net *net, u32 addr); > u16 tipc_node_get_capabilities(struct net *net, u32 addr); int > tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git > a/net/tipc/socket.c b/net/tipc/socket.c index 3b9f8cc328f5..fb24df03da6c > 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -854,7 +854,7 @@ static int tipc_send_group_msg(struct net *net, > struct tipc_sock *tsk, > > /* Build message as chain of buffers */ > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1388,7 +1388,7 @@ static int __tipc_sendmsg(struct socket *sock, > struct msghdr *m, size_t dlen) > return rc; > > __skb_queue_head_init(&pkts); > - mtu = tipc_node_get_mtu(net, dnode, tsk->portid); > + mtu = tipc_node_get_mtu(net, dnode, tsk->portid, false); > rc = tipc_msg_build(hdr, m, 0, dlen, mtu, &pkts); > if (unlikely(rc != dlen)) > return rc; > @@ -1526,7 +1526,7 @@ static void tipc_sk_finish_conn(struct tipc_sock > *tsk, u32 peer_port, > sk_reset_timer(sk, &sk->sk_timer, jiffies + CONN_PROBING_INTV); > tipc_set_sk_state(sk, TIPC_ESTABLISHED); > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > - tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > + tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid, true); > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > -- > 2.20.1 |
From: Rune T. <ru...@in...> - 2019-10-17 20:09:12
|
Looks like I can kind of make it happen on one system now. Stopping some programs (no pattern in which ones) makes it work, and starting some back up again makes it fail. The TIPC nametable has 231 entries when failing and 183 entries when succeeding (however, on a different system the nametable has 251 entries and it is not failing). How do I look for memory used by TIPC in the kernel? -----Original Message----- From: Rune Torgersen <ru...@in...> Sent: Thursday, October 17, 2019 14:53 I will have to look for leaks next time I can make it happen. I was trying stuff and shut down a different program that was unrelated (but had some TIPC sockets open on a different address (104)), and as soon as I did, the sends started working again. It is possible that one of those unrelated sockets has something stuck (as one of them was only ever used to send RDM messages but nothing ever reads it). Any suggestions as to what to start looking at (netstat, tipc, tipc_config or kernel params) to try to track it down? The problem with testing a patch (or using Ubuntu 18 LTS) is that we cannot reliably make it happen. -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Thursday, October 17, 2019 14:35 Hi Rune, Do you see any signs of a general memory leak ("free") on your node? Anyway, there can be no doubt that this happens because the big buffer pool is running empty. We fixed that in commit 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails"), which was delivered in Linux 4.16. Do you have any opportunity to apply that patch and try it? BR ///jon > -----Original Message----- > From: Rune Torgersen <ru...@in...> > Sent: 17-Oct-19 12:38 > To: 'tip...@li...' <tipc- > dis...@li...> > Subject: [tipc-discussion] Error allocating memory error when sending RDM > message > > Hi. > > I am running into an issue when sending SOCK_RDM or SOCK_DGRAM > messages. 
On a system that has been up for a while (120+ days in this case), I > cannot send any RDM/DGRAM type TIPC messages that are larger than about > 16000 bytes (16033+ fails, 15100 and smaller still works). > Any larger message fails with error code 12: "Cannot allocate memory". > > The really odd thing is that it only happens on some connections and not others > on the same system (for example, sending to TIPC node 103:1003 gets no error, > while sending to 103:3 gets an error). > When it gets into this state, it seems to happen forever on the same > destination address, and not on others, until the system is rebooted (restarting the > server-side application makes no difference). > The sends are done on the same node as the receiver. > > The kernel is Ubuntu 16.04 LTS 4.4.0-150 in this case; also seen on 161. > > Nametable for 103: > 103 2 2 <1.1.1:2328193343> 2328193344 cluster > 103 3 3 <1.1.2:3153441800> 3153441801 cluster > 103 5 5 <1.1.4:269294867> 269294868 cluster > 103 1002 1002 <1.1.1:490133365> 490133366 cluster > 103 1003 1003 <1.1.2:2552019732> 2552019733 cluster > 103 1005 1005 <1.1.4:625110186> 625110187 cluster > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion _______________________________________________ tipc-discussion mailing list tip...@li... https://lists.sourceforge.net/lists/listinfo/tipc-discussion |
From: Rune T. <ru...@in...> - 2019-10-17 19:53:07
|
I will have to look for leaks next time I can make it happen. I was trying stuff and shut down a different program that was unrelated (but had some TIPC sockets open on a different address (104)), and as soon as I did, the sends started working again. It is possible that one of those unrelated sockets has something stuck (as one of them was only ever used to send RDM messages but nothing ever reads it). Any suggestions as to what to start looking at (netstat, tipc, tipc_config or kernel params) to try to track it down? The problem with testing a patch (or using Ubuntu 18 LTS) is that we cannot reliably make it happen. -----Original Message----- From: Jon Maloy <jon...@er...> Sent: Thursday, October 17, 2019 14:35 Hi Rune, Do you see any signs of a general memory leak ("free") on your node? Anyway, there can be no doubt that this happens because the big buffer pool is running empty. We fixed that in commit 4c94cc2d3d57 ("tipc: fall back to smaller MTU if allocation of local send skb fails"), which was delivered in Linux 4.16. Do you have any opportunity to apply that patch and try it? BR ///jon > -----Original Message----- > From: Rune Torgersen <ru...@in...> > Sent: 17-Oct-19 12:38 > To: 'tip...@li...' <tipc- > dis...@li...> > Subject: [tipc-discussion] Error allocating memory error when sending RDM > message > > Hi. > > I am running into an issue when sending SOCK_RDM or SOCK_DGRAM > messages. On a system that has been up for a while (120+ days in this case), I > cannot send any RDM/DGRAM type TIPC messages that are larger than about > 16000 bytes (16033+ fails, 15100 and smaller still works). > Any larger message fails with error code 12: "Cannot allocate memory". > > The really odd thing is that it only happens on some connections and not others > on the same system (for example, sending to TIPC node 103:1003 gets no error, > while sending to 103:3 gets an error). 
> When it gets into this state, it seems to happen forever on the same > destination address, and not on others, until the system is rebooted (restarting the > server-side application makes no difference). > The sends are done on the same node as the receiver. > > The kernel is Ubuntu 16.04 LTS 4.4.0-150 in this case; also seen on 161. > > Nametable for 103: > 103 2 2 <1.1.1:2328193343> 2328193344 cluster > 103 3 3 <1.1.2:3153441800> 3153441801 cluster > 103 5 5 <1.1.4:269294867> 269294868 cluster > 103 1002 1002 <1.1.1:490133365> 490133366 cluster > 103 1003 1003 <1.1.2:2552019732> 2552019733 cluster > 103 1005 1005 <1.1.4:625110186> 625110187 cluster > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion |