From: Jon M. <jon...@er...> - 2019-11-05 13:26:02

Acked, ///jon

> -----Original Message-----
> From: Hoang Le <hoa...@de...>
> Sent: 30-Oct-19 05:38
> To: Jon Maloy <jon...@er...>; ma...@do...; tip...@li...;
> yin...@wi...
> Subject: [net-next] tipc: update cluster capabilities if node deleted
>
> There are two improvements when re-calculating cluster capabilities:
>
> - When deleting a specific down node, we need to re-calculate.
> - In tipc_node_cleanup(), there is no need to re-calculate if the node
>   still exists in the cluster.
>
> Signed-off-by: Hoang Le <hoa...@de...>
> ---
>  net/tipc/node.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index 4b60928049ea..1f1584518221 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -667,6 +667,11 @@ static bool tipc_node_cleanup(struct tipc_node *peer)
>          }
>          tipc_node_write_unlock(peer);
>
> +        if (!deleted) {
> +                spin_unlock_bh(&tn->node_list_lock);
> +                return deleted;
> +        }
> +
>          /* Calculate cluster capabilities */
>          tn->capabilities = TIPC_NODE_CAPABILITIES;
>          list_for_each_entry_rcu(temp_node, &tn->node_list, list) {
> @@ -2043,7 +2048,7 @@ int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info)
>          struct net *net = sock_net(skb->sk);
>          struct tipc_net *tn = net_generic(net, tipc_net_id);
>          struct nlattr *attrs[TIPC_NLA_NET_MAX + 1];
> -        struct tipc_node *peer;
> +        struct tipc_node *peer, *temp_node;
>          u32 addr;
>          int err;
>
> @@ -2084,6 +2089,11 @@ int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info)
>          tipc_node_write_unlock(peer);
>          tipc_node_delete(peer);
>
> +        /* Calculate cluster capabilities */
> +        tn->capabilities = TIPC_NODE_CAPABILITIES;
> +        list_for_each_entry_rcu(temp_node, &tn->node_list, list) {
> +                tn->capabilities &= temp_node->capabilities;
> +        }
>          err = 0;
> err_out:
>          tipc_node_put(peer);
> --
> 2.20.1
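
The recalculation in both hunks is a bitwise-AND fold over the capability
bitmaps of the remaining nodes: the cluster only advertises what every member
supports, so removing the least capable node can raise the result. Below is a
minimal standalone sketch of that idea, using made-up capability flags and a
hypothetical recalc_cluster_caps() helper rather than the kernel's
TIPC_NODE_CAPABILITIES definitions.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative capability flags only -- not the kernel's definitions */
    #define CAP_A (1 << 0)
    #define CAP_B (1 << 1)
    #define CAP_C (1 << 2)
    #define ALL_CAPS (CAP_A | CAP_B | CAP_C)

    /* Start from the full mask and AND in each remaining node, the same
     * pattern the patch repeats after tipc_node_delete().
     */
    static uint16_t recalc_cluster_caps(const uint16_t *node_caps, int n)
    {
            uint16_t caps = ALL_CAPS;

            for (int i = 0; i < n; i++)
                    caps &= node_caps[i];
            return caps;
    }

    int main(void)
    {
            uint16_t with_old_node[] = { ALL_CAPS, CAP_A | CAP_C, ALL_CAPS };
            uint16_t after_removal[] = { ALL_CAPS, ALL_CAPS };

            /* 0x5: the cluster is limited by its least capable member */
            printf("before: 0x%x\n", recalc_cluster_caps(with_old_node, 3));
            /* 0x7: deleting that node must trigger a re-calculation */
            printf("after : 0x%x\n", recalc_cluster_caps(after_removal, 2));
            return 0;
    }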
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:32

This commit adds two netlink commands to TIPC in order for the user to be
able to set or remove AEAD keys:
- TIPC_NL_KEY_SET
- TIPC_NL_KEY_FLUSH

When 'KEY_SET' is given along with the key data, the key will be initiated
and attached to TIPC crypto. The 'KEY_FLUSH' command, on the other hand,
removes all existing keys, if any.

v2: rebase, add new kernel option ("TIPC_CRYPTO")

Signed-off-by: Tuong Lien <tuo...@de...>
---
 include/uapi/linux/tipc_netlink.h |   8 +++
 net/tipc/netlink.c                |  20 +++++-
 net/tipc/node.c                   | 135 ++++++++++++++++++++++++++++++++++++++
 net/tipc/node.h                   |   4 ++
 4 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h
index efb958fd167d..d712e5d7b3e1 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -63,6 +63,10 @@ enum {
         TIPC_NL_PEER_REMOVE,
         TIPC_NL_BEARER_ADD,
         TIPC_NL_UDP_GET_REMOTEIP,
+#ifdef CONFIG_TIPC_CRYPTO
+        TIPC_NL_KEY_SET,
+        TIPC_NL_KEY_FLUSH,
+#endif

         __TIPC_NL_CMD_MAX,
         TIPC_NL_CMD_MAX = __TIPC_NL_CMD_MAX - 1
@@ -160,6 +164,10 @@ enum {
         TIPC_NLA_NODE_UNSPEC,
         TIPC_NLA_NODE_ADDR,             /* u32 */
         TIPC_NLA_NODE_UP,               /* flag */
+#ifdef CONFIG_TIPC_CRYPTO
+        TIPC_NLA_NODE_ID,               /* data */
+        TIPC_NLA_NODE_KEY,              /* data */
+#endif

         __TIPC_NLA_NODE_MAX,
         TIPC_NLA_NODE_MAX = __TIPC_NLA_NODE_MAX - 1
diff --git a/net/tipc/netlink.c b/net/tipc/netlink.c
index d32bbd0f5e46..b63461c6db02 100644
--- a/net/tipc/netlink.c
+++ b/net/tipc/netlink.c
@@ -102,7 +102,13 @@ const struct nla_policy tipc_nl_link_policy[TIPC_NLA_LINK_MAX + 1] = {
 const struct nla_policy tipc_nl_node_policy[TIPC_NLA_NODE_MAX + 1] = {
         [TIPC_NLA_NODE_UNSPEC]          = { .type = NLA_UNSPEC },
         [TIPC_NLA_NODE_ADDR]            = { .type = NLA_U32 },
-        [TIPC_NLA_NODE_UP]              = { .type = NLA_FLAG }
+        [TIPC_NLA_NODE_UP]              = { .type = NLA_FLAG },
+#ifdef CONFIG_TIPC_CRYPTO
+        [TIPC_NLA_NODE_ID]              = { .type = NLA_BINARY,
+                                            .len = TIPC_NODEID_LEN},
+        [TIPC_NLA_NODE_KEY]             = { .type = NLA_BINARY,
+                                            .len = TIPC_AEAD_KEY_SIZE_MAX},
+#endif
 };

 /* Properties valid for media, bearer and link */
@@ -257,6 +263,18 @@ static const struct genl_ops tipc_genl_v2_ops[] = {
                 .dumpit = tipc_udp_nl_dump_remoteip,
         },
 #endif
+#ifdef CONFIG_TIPC_CRYPTO
+        {
+                .cmd    = TIPC_NL_KEY_SET,
+                .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
+                .doit   = tipc_nl_node_set_key,
+        },
+        {
+                .cmd    = TIPC_NL_KEY_FLUSH,
+                .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
+                .doit   = tipc_nl_node_flush_key,
+        },
+#endif
 };

 struct genl_family tipc_genl_family __ro_after_init = {
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 61d4656c7ccd..710315e778f8 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2784,6 +2784,141 @@ int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb,
         return skb->len;
 }

+#ifdef CONFIG_TIPC_CRYPTO
+static int tipc_nl_retrieve_key(struct nlattr **attrs,
+                                struct tipc_aead_key **key)
+{
+        struct nlattr *attr = attrs[TIPC_NLA_NODE_KEY];
+
+        if (!attr)
+                return -ENODATA;
+
+        *key = (struct tipc_aead_key *)nla_data(attr);
+        if (nla_len(attr) < tipc_aead_key_size(*key))
+                return -EINVAL;
+
+        return 0;
+}
+
+static int tipc_nl_retrieve_nodeid(struct nlattr **attrs, u8 **node_id)
+{
+        struct nlattr *attr = attrs[TIPC_NLA_NODE_ID];
+
+        if (!attr)
+                return -ENODATA;
+
+        if (nla_len(attr) < TIPC_NODEID_LEN)
+                return -EINVAL;
+
+        *node_id = (u8 *)nla_data(attr);
+        return 0;
+}
+
+int __tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
+{
+        struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1];
+        struct net *net = sock_net(skb->sk);
+        struct tipc_net *tn = tipc_net(net);
+        struct tipc_node *n = NULL;
+        struct tipc_aead_key *ukey;
+        struct tipc_crypto *c;
+        u8 *id, *own_id;
+        int rc = 0;
+
+        if (!info->attrs[TIPC_NLA_NODE])
+                return -EINVAL;
+
+        rc = nla_parse_nested(attrs, TIPC_NLA_NODE_MAX,
+                              info->attrs[TIPC_NLA_NODE],
+                              tipc_nl_node_policy, info->extack);
+        if (rc)
+                goto exit;
+
+        own_id = tipc_own_id(net);
+        if (!own_id) {
+                rc = -EPERM;
+                goto exit;
+        }
+
+        rc = tipc_nl_retrieve_key(attrs, &ukey);
+        if (rc)
+                goto exit;
+
+        rc = tipc_aead_key_validate(ukey);
+        if (rc)
+                goto exit;
+
+        rc = tipc_nl_retrieve_nodeid(attrs, &id);
+        switch (rc) {
+        case -ENODATA:
+                /* Cluster key mode */
+                rc = tipc_crypto_key_init(tn->crypto_tx, ukey, CLUSTER_KEY);
+                break;
+        case 0:
+                /* Per-node key mode */
+                if (!memcmp(id, own_id, NODE_ID_LEN)) {
+                        c = tn->crypto_tx;
+                } else {
+                        n = tipc_node_find_by_id(net, id) ?:
+                                tipc_node_create(net, 0, id, 0xffffu, 0, true);
+                        if (unlikely(!n)) {
+                                rc = -ENOMEM;
+                                break;
+                        }
+                        c = n->crypto_rx;
+                }
+
+                rc = tipc_crypto_key_init(c, ukey, PER_NODE_KEY);
+                if (n)
+                        tipc_node_put(n);
+                break;
+        default:
+                break;
+        }
+
+exit:
+        return (rc < 0) ? rc : 0;
+}
+
+int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info)
+{
+        int err;
+
+        rtnl_lock();
+        err = __tipc_nl_node_set_key(skb, info);
+        rtnl_unlock();
+
+        return err;
+}
+
+int __tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info)
+{
+        struct net *net = sock_net(skb->sk);
+        struct tipc_net *tn = tipc_net(net);
+        struct tipc_node *n;
+
+        tipc_crypto_key_flush(tn->crypto_tx);
+        rcu_read_lock();
+        list_for_each_entry_rcu(n, &tn->node_list, list)
+                tipc_crypto_key_flush(n->crypto_rx);
+        rcu_read_unlock();
+
+        pr_info("All keys are flushed!\n");
+        return 0;
+}
+
+int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info)
+{
+        int err;
+
+        rtnl_lock();
+        err = __tipc_nl_node_flush_key(skb, info);
+        rtnl_unlock();
+
+        return err;
+}
+#endif
+
 /**
  * tipc_node_dump - dump TIPC node data
  * @n: tipc node to be dumped
diff --git a/net/tipc/node.h b/net/tipc/node.h
index 1a15cf82cb11..a6803b449a2c 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -119,5 +119,9 @@ int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info);
 int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb);
 int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb,
                                    struct netlink_callback *cb);
+#ifdef CONFIG_TIPC_CRYPTO
+int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
+int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
+#endif
 void tipc_node_pre_cleanup_net(struct net *exit_net);
 #endif
--
2.13.7
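
For completeness, here is a rough userspace sketch of what a key-setting
program would send over generic netlink with libnl-3: it nests a
TIPC_NLA_NODE_KEY attribute inside TIPC_NLA_NODE, which is what
__tipc_nl_node_set_key() above parses, and omits TIPC_NLA_NODE_ID to select
cluster-key mode. This is only an illustration under assumptions: it needs
uapi headers carrying this series (the struct tipc_aead_key layout comes
from the companion crypto patch), the tipc_set_cluster_key() name is made
up, and in practice the iproute2 'tipc' tool is the intended front end.

    #include <string.h>
    #include <netlink/netlink.h>
    #include <netlink/genl/genl.h>
    #include <netlink/genl/ctrl.h>
    #include <linux/tipc.h>          /* struct tipc_aead_key (patched header) */
    #include <linux/tipc_netlink.h>  /* TIPC_NL_KEY_SET, TIPC_NLA_NODE_* */

    /* Push one cluster-wide "gcm(aes)" key; keylen includes the 4-byte salt,
     * so 20/28/36 bytes map to AES-128/192/256 per tipc_aead_key_validate().
     */
    static int tipc_set_cluster_key(const void *key, unsigned int keylen)
    {
            unsigned char buf[sizeof(struct tipc_aead_key) + 64] = { 0 };
            struct tipc_aead_key *ukey = (struct tipc_aead_key *)buf;
            struct nl_sock *sk = nl_socket_alloc();
            struct nl_msg *msg = nlmsg_alloc();
            struct nlattr *node;
            int family, err;

            strncpy(ukey->alg_name, "gcm(aes)", sizeof(ukey->alg_name) - 1);
            ukey->keylen = keylen;
            memcpy(ukey->key, key, keylen);

            genl_connect(sk);
            family = genl_ctrl_resolve(sk, TIPC_GENL_V2_NAME);

            genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
                        TIPC_NL_KEY_SET, TIPC_GENL_V2_VERSION);

            /* No TIPC_NLA_NODE_ID attribute => cluster key mode */
            node = nla_nest_start(msg, TIPC_NLA_NODE);
            nla_put(msg, TIPC_NLA_NODE_KEY, sizeof(*ukey) + keylen, ukey);
            nla_nest_end(msg, node);

            err = nl_send_auto(sk, msg);
            if (err >= 0)
                    err = nl_wait_for_ack(sk);

            nlmsg_free(msg);
            nl_socket_free(sk);
            return err;
    }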
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:29

This commit offers an option to encrypt and authenticate all messaging,
including the neighbor discovery messages. The currently most advanced
algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All
encryption/decryption is done at the bearer layer, just before leaving or
after entering TIPC.

Supported features:
- Encryption & authentication of all TIPC messages (header + data);
- Two symmetric-key modes: Cluster and Per-node;
- Automatic key switching;
- Key-expired revoking (sequence number wrapped);
- Lock-free encryption/decryption (RCU);
- Asynchronous crypto, Intel AES-NI supported;
- Multiple cipher transforms;
- Logs & statistics;

Two key modes:
- Cluster key mode: One single key is used for both TX & RX in all nodes
  in the cluster.
- Per-node key mode: Each node in the cluster has one specific TX key. For
  RX, a node requires its peers' TX keys to be able to decrypt the messages
  from those peers.

Key setting from user-space is performed via netlink by a user program
(e.g. the iproute2 'tipc' tool).

Internal key state machine:

                                        Attach    Align(RX)
                                            +-+   +-+
                                            | V   | V
        +---------+      Attach     +---------+
        |  IDLE   |---------------->| PENDING |(user = 0)
        +---------+                 +---------+
           A   A                   Switch|  A
           |   |                         |  |
           |   | Free(switch/revoked)    |  |
         (Free)|       +----------------------+
           |   |       |              Timeout |  (TX)
           |   |       |                 (RX) |
           |   |       |                      |
           |   |       v                      |
        +---------+      Switch     +---------+
        | PASSIVE |<----------------|  ACTIVE |
        +---------+       (RX)      +---------+
         (user = 1)                 (user >= 1)

The number of TFMs is 10 by default and can be changed via the procfs
'net/tipc/max_tfms'. At this moment, for simplicity, this file is also used
to print the crypto statistics at runtime:

    echo 0xfff1 > /proc/sys/net/tipc/max_tfms

The patch defines a new TIPC version (v7) for the encryption message (for
backward compatibility as well). The message is basically encapsulated as
follows:

   +----------------------------------------------------------+
   | TIPCv7 encryption | Original TIPCv2    | Authentication   |
   | header            | packet (encrypted) | Tag              |
   +----------------------------------------------------------+

The throughput is about ~40% for small messages (compared with
non-encryption) and ~9% for large messages. With the support from hardware
crypto, i.e. the Intel AES-NI CPU instructions, the throughput increases up
to ~85% for small messages and ~55% for large messages.

By default, the new feature is inactive (i.e. no encryption) until the user
sets a key for TIPC. There is, however, also a new option - "TIPC_CRYPTO" -
in the kernel configuration to enable/disable the new code when needed.
v2: rebase, add new kernel option ("TIPC_CRYPTO") MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/Kconfig | 12 + net/tipc/Makefile | 1 + net/tipc/bcast.c | 2 +- net/tipc/bearer.c | 40 +- net/tipc/bearer.h | 8 +- net/tipc/core.c | 18 + net/tipc/core.h | 8 + net/tipc/crypto.c | 1986 ++++++++++++++++++++++++++++++++++++++++++++++++++ net/tipc/crypto.h | 167 +++++ net/tipc/link.c | 19 +- net/tipc/link.h | 1 + net/tipc/msg.c | 15 +- net/tipc/msg.h | 46 +- net/tipc/node.c | 117 ++- net/tipc/node.h | 8 + net/tipc/sysctl.c | 11 + net/tipc/udp_media.c | 1 + 17 files changed, 2422 insertions(+), 38 deletions(-) create mode 100644 net/tipc/crypto.c create mode 100644 net/tipc/crypto.h diff --git a/net/tipc/Kconfig b/net/tipc/Kconfig index b83e16ade4d2..6d94ccf7a9e9 100644 --- a/net/tipc/Kconfig +++ b/net/tipc/Kconfig @@ -35,6 +35,18 @@ config TIPC_MEDIA_UDP Saying Y here will enable support for running TIPC over IP/UDP bool default y +config TIPC_CRYPTO + bool "TIPC encryption support" + depends on TIPC + help + Saying Y here will enable support for TIPC encryption. + All TIPC messages will be encrypted/decrypted by using the currently most + advanced algorithm: AEAD AES-GCM (like IPSec or TLS) before leaving/ + entering the TIPC stack. + Key setting from user-space is performed via netlink by a user program + (e.g. the iproute2 'tipc' tool). + bool + default y config TIPC_DIAG tristate "TIPC: socket monitoring interface" diff --git a/net/tipc/Makefile b/net/tipc/Makefile index c86aba0282af..11255e970dd4 100644 --- a/net/tipc/Makefile +++ b/net/tipc/Makefile @@ -16,6 +16,7 @@ CFLAGS_trace.o += -I$(src) tipc-$(CONFIG_TIPC_MEDIA_UDP) += udp_media.o tipc-$(CONFIG_TIPC_MEDIA_IB) += ib_media.o tipc-$(CONFIG_SYSCTL) += sysctl.o +tipc-$(CONFIG_TIPC_CRYPTO) += crypto.o obj-$(CONFIG_TIPC_DIAG) += diag.o diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c index 6ef1abdd525f..f41096a759fa 100644 --- a/net/tipc/bcast.c +++ b/net/tipc/bcast.c @@ -84,7 +84,7 @@ static struct tipc_bc_base *tipc_bc_base(struct net *net) */ int tipc_bcast_get_mtu(struct net *net) { - return tipc_link_mtu(tipc_bc_sndlink(net)) - INT_H_SIZE; + return tipc_link_mss(tipc_bc_sndlink(net)); } void tipc_bcast_disable_rcast(struct net *net) diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c index 6e15b9b1f1ef..eeb5721f7bf4 100644 --- a/net/tipc/bearer.c +++ b/net/tipc/bearer.c @@ -44,6 +44,7 @@ #include "netlink.h" #include "udp_media.h" #include "trace.h" +#include "crypto.h" #define MAX_ADDR_STR 60 @@ -516,10 +517,15 @@ void tipc_bearer_xmit_skb(struct net *net, u32 bearer_id, rcu_read_lock(); b = bearer_get(net, bearer_id); - if (likely(b && (test_bit(0, &b->up) || msg_is_reset(hdr)))) - b->media->send_msg(net, skb, b, dest); - else + if (likely(b && (test_bit(0, &b->up) || msg_is_reset(hdr)))) { +#ifdef CONFIG_TIPC_CRYPTO + tipc_crypto_xmit(net, &skb, b, dest, NULL); + if (skb) +#endif + b->media->send_msg(net, skb, b, dest); + } else { kfree_skb(skb); + } rcu_read_unlock(); } @@ -527,7 +533,13 @@ void tipc_bearer_xmit_skb(struct net *net, u32 bearer_id, */ void tipc_bearer_xmit(struct net *net, u32 bearer_id, struct sk_buff_head *xmitq, - struct tipc_media_addr *dst) +#ifdef CONFIG_TIPC_CRYPTO + struct tipc_media_addr *dst, + struct tipc_node *__dnode +#else + struct tipc_media_addr *dst +#endif + ) { struct tipc_bearer *b; struct sk_buff *skb, *tmp; @@ -541,10 +553,15 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id, __skb_queue_purge(xmitq); 
skb_queue_walk_safe(xmitq, skb, tmp) { __skb_dequeue(xmitq); - if (likely(test_bit(0, &b->up) || msg_is_reset(buf_msg(skb)))) - b->media->send_msg(net, skb, b, dst); - else + if (likely(test_bit(0, &b->up) || msg_is_reset(buf_msg(skb)))) { +#ifdef CONFIG_TIPC_CRYPTO + tipc_crypto_xmit(net, &skb, b, dst, __dnode); + if (skb) +#endif + b->media->send_msg(net, skb, b, dst); + } else { kfree_skb(skb); + } } rcu_read_unlock(); } @@ -555,6 +572,7 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id, struct sk_buff_head *xmitq) { struct tipc_net *tn = tipc_net(net); + struct tipc_media_addr *dst; int net_id = tn->net_id; struct tipc_bearer *b; struct sk_buff *skb, *tmp; @@ -569,7 +587,12 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id, msg_set_non_seq(hdr, 1); msg_set_mc_netid(hdr, net_id); __skb_dequeue(xmitq); - b->media->send_msg(net, skb, b, &b->bcast_addr); + dst = &b->bcast_addr; +#ifdef CONFIG_TIPC_CRYPTO + tipc_crypto_xmit(net, &skb, b, dst, NULL); + if (skb) +#endif + b->media->send_msg(net, skb, b, dst); } rcu_read_unlock(); } @@ -596,6 +619,7 @@ static int tipc_l2_rcv_msg(struct sk_buff *skb, struct net_device *dev, if (likely(b && test_bit(0, &b->up) && (skb->pkt_type <= PACKET_MULTICAST))) { skb_mark_not_on_list(skb); + TIPC_SKB_CB(skb)->flags = 0; tipc_rcv(dev_net(b->pt.dev), skb, b); rcu_read_unlock(); return NET_RX_SUCCESS; diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h index faca696d422f..110754c333b4 100644 --- a/net/tipc/bearer.h +++ b/net/tipc/bearer.h @@ -232,7 +232,13 @@ void tipc_bearer_xmit_skb(struct net *net, u32 bearer_id, struct tipc_media_addr *dest); void tipc_bearer_xmit(struct net *net, u32 bearer_id, struct sk_buff_head *xmitq, - struct tipc_media_addr *dst); +#ifdef CONFIG_TIPC_CRYPTO + struct tipc_media_addr *dst, + struct tipc_node *__dnode +#else + struct tipc_media_addr *dst +#endif + ); void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id, struct sk_buff_head *xmitq); void tipc_clone_to_loopback(struct net *net, struct sk_buff_head *pkts); diff --git a/net/tipc/core.c b/net/tipc/core.c index ab648dd150ee..73fccf48aa34 100644 --- a/net/tipc/core.c +++ b/net/tipc/core.c @@ -44,6 +44,7 @@ #include "socket.h" #include "bcast.h" #include "node.h" +#include "crypto.h" #include <linux/module.h> @@ -68,6 +69,11 @@ static int __net_init tipc_init_net(struct net *net) INIT_LIST_HEAD(&tn->node_list); spin_lock_init(&tn->node_list_lock); +#ifdef CONFIG_TIPC_CRYPTO + err = tipc_crypto_start(&tn->crypto_tx, net, NULL); + if (err) + goto out_crypto; +#endif err = tipc_sk_rht_init(net); if (err) goto out_sk_rht; @@ -93,16 +99,28 @@ static int __net_init tipc_init_net(struct net *net) out_nametbl: tipc_sk_rht_destroy(net); out_sk_rht: + +#ifdef CONFIG_TIPC_CRYPTO + tipc_crypto_stop(&tn->crypto_tx); +out_crypto: +#endif return err; } static void __net_exit tipc_exit_net(struct net *net) { +#ifdef CONFIG_TIPC_CRYPTO + struct tipc_net *tn = tipc_net(net); +#endif + tipc_detach_loopback(net); tipc_net_stop(net); tipc_bcast_stop(net); tipc_nametbl_stop(net); tipc_sk_rht_destroy(net); +#ifdef CONFIG_TIPC_CRYPTO + tipc_crypto_stop(&tn->crypto_tx); +#endif } static void __net_exit tipc_pernet_pre_exit(struct net *net) diff --git a/net/tipc/core.h b/net/tipc/core.h index 8776d32a4a47..775848a5f27e 100644 --- a/net/tipc/core.h +++ b/net/tipc/core.h @@ -68,6 +68,9 @@ struct tipc_link; struct tipc_name_table; struct tipc_topsrv; struct tipc_monitor; +#ifdef CONFIG_TIPC_CRYPTO +struct tipc_crypto; +#endif #define TIPC_MOD_VER "2.0.0" @@ -129,6 +132,11 @@ struct 
tipc_net { /* Tracing of node internal messages */ struct packet_type loopback_pt; + +#ifdef CONFIG_TIPC_CRYPTO + /* TX crypto handler */ + struct tipc_crypto *crypto_tx; +#endif }; static inline struct tipc_net *tipc_net(struct net *net) diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c new file mode 100644 index 000000000000..05f7ca76e8ce --- /dev/null +++ b/net/tipc/crypto.c @@ -0,0 +1,1986 @@ +// SPDX-License-Identifier: GPL-2.0 +/** + * net/tipc/crypto.c: TIPC crypto for key handling & packet en/decryption + * + * Copyright (c) 2019, Ericsson AB + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the names of the copyright holders nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * Alternatively, this software may be distributed under the terms of the + * GNU General Public License ("GPL") version 2 as published by the Free + * Software Foundation. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#include <crypto/aead.h> +#include <crypto/aes.h> +#include "crypto.h" + +#define TIPC_TX_PROBE_LIM msecs_to_jiffies(1000) /* > 1s */ +#define TIPC_TX_LASTING_LIM msecs_to_jiffies(120000) /* 2 mins */ +#define TIPC_RX_ACTIVE_LIM msecs_to_jiffies(3000) /* 3s */ +#define TIPC_RX_PASSIVE_LIM msecs_to_jiffies(180000) /* 3 mins */ +#define TIPC_MAX_TFMS_DEF 10 +#define TIPC_MAX_TFMS_LIM 1000 + +/** + * TIPC Key ids + */ +enum { + KEY_UNUSED = 0, + KEY_MIN, + KEY_1 = KEY_MIN, + KEY_2, + KEY_3, + KEY_MAX = KEY_3, +}; + +/** + * TIPC Crypto statistics + */ +enum { + STAT_OK, + STAT_NOK, + STAT_ASYNC, + STAT_ASYNC_OK, + STAT_ASYNC_NOK, + STAT_BADKEYS, /* tx only */ + STAT_BADMSGS = STAT_BADKEYS, /* rx only */ + STAT_NOKEYS, + STAT_SWITCHES, + + MAX_STATS, +}; + +/* TIPC crypto statistics' header */ +static const char *hstats[MAX_STATS] = {"ok", "nok", "async", "async_ok", + "async_nok", "badmsgs", "nokeys", + "switches"}; + +/* Max TFMs number per key */ +int sysctl_tipc_max_tfms __read_mostly = TIPC_MAX_TFMS_DEF; + +/** + * struct tipc_key - TIPC keys' status indicator + * + * 7 6 5 4 3 2 1 0 + * +-----+-----+-----+-----+-----+-----+-----+-----+ + * key: | (reserved)|passive idx| active idx|pending idx| + * +-----+-----+-----+-----+-----+-----+-----+-----+ + */ +struct tipc_key { +#define KEY_BITS (2) +#define KEY_MASK ((1 << KEY_BITS) - 1) + union { + struct { +#if defined(__LITTLE_ENDIAN_BITFIELD) + u8 pending:2, + active:2, + passive:2, /* rx only */ + reserved:2; +#elif defined(__BIG_ENDIAN_BITFIELD) + u8 reserved:2, + passive:2, /* rx only */ + active:2, + pending:2; +#else +#error "Please fix <asm/byteorder.h>" +#endif + } __packed; + u8 keys; + }; +}; + +/** + * struct tipc_tfm - TIPC TFM structure to form a list of TFMs + */ +struct tipc_tfm { + struct crypto_aead *tfm; + struct list_head list; +}; + +/** + * struct tipc_aead - TIPC AEAD key structure + * @tfm_entry: per-cpu pointer to one entry in TFM list + * @crypto: TIPC crypto owns this key + * @cloned: reference to the source key in case cloning + * @users: the number of the key users (TX/RX) + * @salt: the key's SALT value + * @authsize: authentication tag size (max = 16) + * @mode: crypto mode is applied to the key + * @hint[]: a hint for user key + * @rcu: struct rcu_head + * @seqno: the key seqno (cluster scope) + * @refcnt: the key reference counter + */ +struct tipc_aead { +#define TIPC_AEAD_HINT_LEN (5) + struct tipc_tfm * __percpu *tfm_entry; + struct tipc_crypto *crypto; + struct tipc_aead *cloned; + atomic_t users; + u32 salt; + u8 authsize; + u8 mode; + char hint[TIPC_AEAD_HINT_LEN + 1]; + struct rcu_head rcu; + + atomic64_t seqno ____cacheline_aligned; + refcount_t refcnt ____cacheline_aligned; + +} ____cacheline_aligned; + +/** + * struct tipc_crypto_stats - TIPC Crypto statistics + */ +struct tipc_crypto_stats { + unsigned int stat[MAX_STATS]; +}; + +/** + * struct tipc_crypto - TIPC TX/RX crypto structure + * @net: struct net + * @node: TIPC node (RX) + * @aead: array of pointers to AEAD keys for encryption/decryption + * @peer_rx_active: replicated peer RX active key index + * @key: the key states + * @working: the crypto is working or not + * @stats: the crypto statistics + * @sndnxt: the per-peer sndnxt (TX) + * @timer1: general timer 1 (jiffies) + * @timer2: general timer 1 (jiffies) + * @lock: tipc_key lock + */ +struct tipc_crypto { + struct net *net; + struct tipc_node *node; + struct tipc_aead __rcu *aead[KEY_MAX + 1]; /* key[0] is UNUSED */ + atomic_t peer_rx_active; + struct tipc_key key; + u8 
working:1; + struct tipc_crypto_stats __percpu *stats; + + atomic64_t sndnxt ____cacheline_aligned; + unsigned long timer1; + unsigned long timer2; + spinlock_t lock; /* crypto lock */ + +} ____cacheline_aligned; + +/* struct tipc_crypto_tx_ctx - TX context for callbacks */ +struct tipc_crypto_tx_ctx { + struct tipc_aead *aead; + struct tipc_bearer *bearer; + struct tipc_media_addr dst; +}; + +/* struct tipc_crypto_rx_ctx - RX context for callbacks */ +struct tipc_crypto_rx_ctx { + struct tipc_aead *aead; + struct tipc_bearer *bearer; +}; + +static struct tipc_aead *tipc_aead_get(struct tipc_aead __rcu *aead); +static inline void tipc_aead_put(struct tipc_aead *aead); +static void tipc_aead_free(struct rcu_head *rp); +static int tipc_aead_users(struct tipc_aead __rcu *aead); +static void tipc_aead_users_inc(struct tipc_aead __rcu *aead, int lim); +static void tipc_aead_users_dec(struct tipc_aead __rcu *aead, int lim); +static void tipc_aead_users_set(struct tipc_aead __rcu *aead, int val); +static struct crypto_aead *tipc_aead_tfm_next(struct tipc_aead *aead); +static int tipc_aead_init(struct tipc_aead **aead, struct tipc_aead_key *ukey, + u8 mode); +static int tipc_aead_clone(struct tipc_aead **dst, struct tipc_aead *src); +static void *tipc_aead_mem_alloc(struct crypto_aead *tfm, + unsigned int crypto_ctx_size, + u8 **iv, struct aead_request **req, + struct scatterlist **sg, int nsg); +static int tipc_aead_encrypt(struct tipc_aead *aead, struct sk_buff *skb, + struct tipc_bearer *b, + struct tipc_media_addr *dst, + struct tipc_node *__dnode); +static void tipc_aead_encrypt_done(struct crypto_async_request *base, int err); +static int tipc_aead_decrypt(struct net *net, struct tipc_aead *aead, + struct sk_buff *skb, struct tipc_bearer *b); +static void tipc_aead_decrypt_done(struct crypto_async_request *base, int err); +static inline int tipc_ehdr_size(struct tipc_ehdr *ehdr); +static int tipc_ehdr_build(struct net *net, struct tipc_aead *aead, + u8 tx_key, struct sk_buff *skb, + struct tipc_crypto *__rx); +static inline void tipc_crypto_key_set_state(struct tipc_crypto *c, + u8 new_passive, + u8 new_active, + u8 new_pending); +static int tipc_crypto_key_attach(struct tipc_crypto *c, + struct tipc_aead *aead, u8 pos); +static bool tipc_crypto_key_try_align(struct tipc_crypto *rx, u8 new_pending); +static struct tipc_aead *tipc_crypto_key_pick_tx(struct tipc_crypto *tx, + struct tipc_crypto *rx, + struct sk_buff *skb); +static void tipc_crypto_key_synch(struct tipc_crypto *rx, u8 new_rx_active, + struct tipc_msg *hdr); +static int tipc_crypto_key_revoke(struct net *net, u8 tx_key); +static void tipc_crypto_rcv_complete(struct net *net, struct tipc_aead *aead, + struct tipc_bearer *b, + struct sk_buff **skb, int err); +static void tipc_crypto_do_cmd(struct net *net, int cmd); +static char *tipc_crypto_key_dump(struct tipc_crypto *c, char *buf); +#ifdef TIPC_CRYPTO_DEBUG +static char *tipc_key_change_dump(struct tipc_key old, struct tipc_key new, + char *buf); +#endif + +#define key_next(cur) ((cur) % KEY_MAX + 1) + +#define tipc_aead_rcu_ptr(rcu_ptr, lock) \ + rcu_dereference_protected((rcu_ptr), lockdep_is_held(lock)) + +#define tipc_aead_rcu_swap(rcu_ptr, ptr, lock) \ + rcu_swap_protected((rcu_ptr), (ptr), lockdep_is_held(lock)) + +#define tipc_aead_rcu_replace(rcu_ptr, ptr, lock) \ +do { \ + typeof(rcu_ptr) __tmp = rcu_dereference_protected((rcu_ptr), \ + lockdep_is_held(lock)); \ + rcu_assign_pointer((rcu_ptr), (ptr)); \ + tipc_aead_put(__tmp); \ +} while (0) + +#define 
tipc_crypto_key_detach(rcu_ptr, lock) \ + tipc_aead_rcu_replace((rcu_ptr), NULL, lock) + +/** + * tipc_aead_key_validate - Validate a AEAD user key + */ +int tipc_aead_key_validate(struct tipc_aead_key *ukey) +{ + int keylen; + + /* Check if algorithm exists */ + if (unlikely(!crypto_has_alg(ukey->alg_name, 0, 0))) { + pr_info("Not found cipher: \"%s\"!\n", ukey->alg_name); + return -ENODEV; + } + + /* Currently, we only support the "gcm(aes)" cipher algorithm */ + if (strcmp(ukey->alg_name, "gcm(aes)")) + return -ENOTSUPP; + + /* Check if key size is correct */ + keylen = ukey->keylen - TIPC_AES_GCM_SALT_SIZE; + if (unlikely(keylen != TIPC_AES_GCM_KEY_SIZE_128 && + keylen != TIPC_AES_GCM_KEY_SIZE_192 && + keylen != TIPC_AES_GCM_KEY_SIZE_256)) + return -EINVAL; + + return 0; +} + +static struct tipc_aead *tipc_aead_get(struct tipc_aead __rcu *aead) +{ + struct tipc_aead *tmp; + + rcu_read_lock(); + tmp = rcu_dereference(aead); + if (unlikely(!tmp || !refcount_inc_not_zero(&tmp->refcnt))) + tmp = NULL; + rcu_read_unlock(); + + return tmp; +} + +static inline void tipc_aead_put(struct tipc_aead *aead) +{ + if (aead && refcount_dec_and_test(&aead->refcnt)) + call_rcu(&aead->rcu, tipc_aead_free); +} + +/** + * tipc_aead_free - Release AEAD key incl. all the TFMs in the list + * @rp: rcu head pointer + */ +static void tipc_aead_free(struct rcu_head *rp) +{ + struct tipc_aead *aead = container_of(rp, struct tipc_aead, rcu); + struct tipc_tfm *tfm_entry, *head, *tmp; + + if (aead->cloned) { + tipc_aead_put(aead->cloned); + } else { + head = *this_cpu_ptr(aead->tfm_entry); + list_for_each_entry_safe(tfm_entry, tmp, &head->list, list) { + crypto_free_aead(tfm_entry->tfm); + list_del(&tfm_entry->list); + kfree(tfm_entry); + } + /* Free the head */ + crypto_free_aead(head->tfm); + list_del(&head->list); + kfree(head); + } + free_percpu(aead->tfm_entry); + kfree(aead); +} + +static int tipc_aead_users(struct tipc_aead __rcu *aead) +{ + struct tipc_aead *tmp; + int users = 0; + + rcu_read_lock(); + tmp = rcu_dereference(aead); + if (tmp) + users = atomic_read(&tmp->users); + rcu_read_unlock(); + + return users; +} + +static void tipc_aead_users_inc(struct tipc_aead __rcu *aead, int lim) +{ + struct tipc_aead *tmp; + + rcu_read_lock(); + tmp = rcu_dereference(aead); + if (tmp) + atomic_add_unless(&tmp->users, 1, lim); + rcu_read_unlock(); +} + +static void tipc_aead_users_dec(struct tipc_aead __rcu *aead, int lim) +{ + struct tipc_aead *tmp; + + rcu_read_lock(); + tmp = rcu_dereference(aead); + if (tmp) + atomic_add_unless(&rcu_dereference(aead)->users, -1, lim); + rcu_read_unlock(); +} + +static void tipc_aead_users_set(struct tipc_aead __rcu *aead, int val) +{ + struct tipc_aead *tmp; + int cur; + + rcu_read_lock(); + tmp = rcu_dereference(aead); + if (tmp) { + do { + cur = atomic_read(&tmp->users); + if (cur == val) + break; + } while (atomic_cmpxchg(&tmp->users, cur, val) != cur); + } + rcu_read_unlock(); +} + +/** + * tipc_aead_tfm_next - Move TFM entry to the next one in list and return it + */ +static struct crypto_aead *tipc_aead_tfm_next(struct tipc_aead *aead) +{ + struct tipc_tfm **tfm_entry = this_cpu_ptr(aead->tfm_entry); + + *tfm_entry = list_next_entry(*tfm_entry, list); + return (*tfm_entry)->tfm; +} + +/** + * tipc_aead_init - Initiate TIPC AEAD + * @aead: returned new TIPC AEAD key handle pointer + * @ukey: pointer to user key data + * @mode: the key mode + * + * Allocate a (list of) new cipher transformation (TFM) with the specific user + * key data if valid. 
The number of the allocated TFMs can be set via the sysfs + * "net/tipc/max_tfms" first. + * Also, all the other AEAD data are also initialized. + * + * Return: 0 if the initiation is successful, otherwise: < 0 + */ +static int tipc_aead_init(struct tipc_aead **aead, struct tipc_aead_key *ukey, + u8 mode) +{ + struct tipc_tfm *tfm_entry, *head; + struct crypto_aead *tfm; + struct tipc_aead *tmp; + int keylen, err, cpu; + int tfm_cnt = 0; + + if (unlikely(*aead)) + return -EEXIST; + + /* Allocate a new AEAD */ + tmp = kzalloc(sizeof(*tmp), GFP_ATOMIC); + if (unlikely(!tmp)) + return -ENOMEM; + + /* The key consists of two parts: [AES-KEY][SALT] */ + keylen = ukey->keylen - TIPC_AES_GCM_SALT_SIZE; + + /* Allocate per-cpu TFM entry pointer */ + tmp->tfm_entry = alloc_percpu(struct tipc_tfm *); + if (!tmp->tfm_entry) { + kzfree(tmp); + return -ENOMEM; + } + + /* Make a list of TFMs with the user key data */ + do { + tfm = crypto_alloc_aead(ukey->alg_name, 0, 0); + if (IS_ERR(tfm)) { + err = PTR_ERR(tfm); + break; + } + + if (unlikely(!tfm_cnt && + crypto_aead_ivsize(tfm) != TIPC_AES_GCM_IV_SIZE)) { + crypto_free_aead(tfm); + err = -ENOTSUPP; + break; + } + + err |= crypto_aead_setauthsize(tfm, TIPC_AES_GCM_TAG_SIZE); + err |= crypto_aead_setkey(tfm, ukey->key, keylen); + if (unlikely(err)) { + crypto_free_aead(tfm); + break; + } + + tfm_entry = kmalloc(sizeof(*tfm_entry), GFP_KERNEL); + if (unlikely(!tfm_entry)) { + crypto_free_aead(tfm); + err = -ENOMEM; + break; + } + INIT_LIST_HEAD(&tfm_entry->list); + tfm_entry->tfm = tfm; + + /* First entry? */ + if (!tfm_cnt) { + head = tfm_entry; + for_each_possible_cpu(cpu) { + *per_cpu_ptr(tmp->tfm_entry, cpu) = head; + } + } else { + list_add_tail(&tfm_entry->list, &head->list); + } + + } while (++tfm_cnt < sysctl_tipc_max_tfms); + + /* Not any TFM is allocated? */ + if (!tfm_cnt) { + free_percpu(tmp->tfm_entry); + kzfree(tmp); + return err; + } + + /* Copy some chars from the user key as a hint */ + memcpy(tmp->hint, ukey->key, TIPC_AEAD_HINT_LEN); + tmp->hint[TIPC_AEAD_HINT_LEN] = '\0'; + + /* Initialize the other data */ + tmp->mode = mode; + tmp->cloned = NULL; + tmp->authsize = TIPC_AES_GCM_TAG_SIZE; + memcpy(&tmp->salt, ukey->key + keylen, TIPC_AES_GCM_SALT_SIZE); + atomic_set(&tmp->users, 0); + atomic64_set(&tmp->seqno, 0); + refcount_set(&tmp->refcnt, 1); + + *aead = tmp; + return 0; +} + +/** + * tipc_aead_clone - Clone a TIPC AEAD key + * @dst: dest key for the cloning + * @src: source key to clone from + * + * Make a "copy" of the source AEAD key data to the dest, the TFMs list is + * common for the keys. + * A reference to the source is hold in the "cloned" pointer for the later + * freeing purposes. + * + * Note: this must be done in cluster-key mode only! 
+ * Return: 0 in case of success, otherwise < 0 + */ +static int tipc_aead_clone(struct tipc_aead **dst, struct tipc_aead *src) +{ + struct tipc_aead *aead; + int cpu; + + if (!src) + return -ENOKEY; + + if (src->mode != CLUSTER_KEY) + return -EINVAL; + + if (unlikely(*dst)) + return -EEXIST; + + aead = kzalloc(sizeof(*aead), GFP_ATOMIC); + if (unlikely(!aead)) + return -ENOMEM; + + aead->tfm_entry = alloc_percpu_gfp(struct tipc_tfm *, GFP_ATOMIC); + if (unlikely(!aead->tfm_entry)) { + kzfree(aead); + return -ENOMEM; + } + + for_each_possible_cpu(cpu) { + *per_cpu_ptr(aead->tfm_entry, cpu) = + *per_cpu_ptr(src->tfm_entry, cpu); + } + + memcpy(aead->hint, src->hint, sizeof(src->hint)); + aead->mode = src->mode; + aead->salt = src->salt; + aead->authsize = src->authsize; + atomic_set(&aead->users, 0); + atomic64_set(&aead->seqno, 0); + refcount_set(&aead->refcnt, 1); + + WARN_ON(!refcount_inc_not_zero(&src->refcnt)); + aead->cloned = src; + + *dst = aead; + return 0; +} + +/** + * tipc_aead_mem_alloc - Allocate memory for AEAD request operations + * @tfm: cipher handle to be registered with the request + * @crypto_ctx_size: size of crypto context for callback + * @iv: returned pointer to IV data + * @req: returned pointer to AEAD request data + * @sg: returned pointer to SG lists + * @nsg: number of SG lists to be allocated + * + * Allocate memory to store the crypto context data, AEAD request, IV and SG + * lists, the memory layout is as follows: + * crypto_ctx || iv || aead_req || sg[] + * + * Return: the pointer to the memory areas in case of success, otherwise NULL + */ +static void *tipc_aead_mem_alloc(struct crypto_aead *tfm, + unsigned int crypto_ctx_size, + u8 **iv, struct aead_request **req, + struct scatterlist **sg, int nsg) +{ + unsigned int iv_size, req_size; + unsigned int len; + u8 *mem; + + iv_size = crypto_aead_ivsize(tfm); + req_size = sizeof(**req) + crypto_aead_reqsize(tfm); + + len = crypto_ctx_size; + len += iv_size; + len += crypto_aead_alignmask(tfm) & ~(crypto_tfm_ctx_alignment() - 1); + len = ALIGN(len, crypto_tfm_ctx_alignment()); + len += req_size; + len = ALIGN(len, __alignof__(struct scatterlist)); + len += nsg * sizeof(**sg); + + mem = kmalloc(len, GFP_ATOMIC); + if (!mem) + return NULL; + + *iv = (u8 *)PTR_ALIGN(mem + crypto_ctx_size, + crypto_aead_alignmask(tfm) + 1); + *req = (struct aead_request *)PTR_ALIGN(*iv + iv_size, + crypto_tfm_ctx_alignment()); + *sg = (struct scatterlist *)PTR_ALIGN((u8 *)*req + req_size, + __alignof__(struct scatterlist)); + + return (void *)mem; +} + +/** + * tipc_aead_encrypt - Encrypt a message + * @aead: TIPC AEAD key for the message encryption + * @skb: the input/output skb + * @b: TIPC bearer where the message will be delivered after the encryption + * @dst: the destination media address + * @__dnode: TIPC dest node if "known" + * + * Return: + * 0 : if the encryption has completed + * -EINPROGRESS/-EBUSY : if a callback will be performed + * < 0 : the encryption has failed + */ +static int tipc_aead_encrypt(struct tipc_aead *aead, struct sk_buff *skb, + struct tipc_bearer *b, + struct tipc_media_addr *dst, + struct tipc_node *__dnode) +{ + struct crypto_aead *tfm = tipc_aead_tfm_next(aead); + struct tipc_crypto_tx_ctx *tx_ctx; + struct aead_request *req; + struct sk_buff *trailer; + struct scatterlist *sg; + struct tipc_ehdr *ehdr; + int ehsz, len, tailen, nsg, rc; + void *ctx; + u32 salt; + u8 *iv; + + /* Make sure message len at least 4-byte aligned */ + len = ALIGN(skb->len, 4); + tailen = len - skb->len + 
aead->authsize; + + /* Expand skb tail for authentication tag: + * As for simplicity, we'd have made sure skb having enough tailroom + * for authentication tag @skb allocation. Even when skb is nonlinear + * but there is no frag_list, it should be still fine! + * Otherwise, we must cow it to be a writable buffer with the tailroom. + */ +#ifdef TIPC_CRYPTO_DEBUG + SKB_LINEAR_ASSERT(skb); + if (tailen > skb_tailroom(skb)) { + pr_warn("TX: skb tailroom is not enough: %d, requires: %d\n", + skb_tailroom(skb), tailen); + } +#endif + + if (unlikely(!skb_cloned(skb) && tailen <= skb_tailroom(skb))) { + nsg = 1; + trailer = skb; + } else { + /* TODO: We could avoid skb_cow_data() if skb has no frag_list + * e.g. by skb_fill_page_desc() to add another page to the skb + * with the wanted tailen... However, page skbs look not often, + * so take it easy now! + * Cloned skbs e.g. from link_xmit() seems no choice though :( + */ + nsg = skb_cow_data(skb, tailen, &trailer); + if (unlikely(nsg < 0)) { + pr_err("TX: skb_cow_data() returned %d\n", nsg); + return nsg; + } + } + + pskb_put(skb, trailer, tailen); + + /* Allocate memory for the AEAD operation */ + ctx = tipc_aead_mem_alloc(tfm, sizeof(*tx_ctx), &iv, &req, &sg, nsg); + if (unlikely(!ctx)) + return -ENOMEM; + TIPC_SKB_CB(skb)->crypto_ctx = ctx; + + /* Map skb to the sg lists */ + sg_init_table(sg, nsg); + rc = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(rc < 0)) { + pr_err("TX: skb_to_sgvec() returned %d, nsg %d!\n", rc, nsg); + goto exit; + } + + /* Prepare IV: [SALT (4 octets)][SEQNO (8 octets)] + * In case we're in cluster-key mode, SALT is varied by xor-ing with + * the source address (or w0 of id), otherwise with the dest address + * if dest is known. + */ + ehdr = (struct tipc_ehdr *)skb->data; + salt = aead->salt; + if (aead->mode == CLUSTER_KEY) + salt ^= ehdr->addr; /* __be32 */ + else if (__dnode) + salt ^= tipc_node_get_addr(__dnode); + memcpy(iv, &salt, 4); + memcpy(iv + 4, (u8 *)&ehdr->seqno, 8); + + /* Prepare request */ + ehsz = tipc_ehdr_size(ehdr); + aead_request_set_tfm(req, tfm); + aead_request_set_ad(req, ehsz); + aead_request_set_crypt(req, sg, sg, len - ehsz, iv); + + /* Set callback function & data */ + aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, + tipc_aead_encrypt_done, skb); + tx_ctx = (struct tipc_crypto_tx_ctx *)ctx; + tx_ctx->aead = aead; + tx_ctx->bearer = b; + memcpy(&tx_ctx->dst, dst, sizeof(*dst)); + + /* Hold bearer */ + if (unlikely(!tipc_bearer_hold(b))) { + rc = -ENODEV; + goto exit; + } + + /* Now, do encrypt */ + rc = crypto_aead_encrypt(req); + if (rc == -EINPROGRESS || rc == -EBUSY) + return rc; + + tipc_bearer_put(b); + +exit: + kfree(ctx); + TIPC_SKB_CB(skb)->crypto_ctx = NULL; + return rc; +} + +static void tipc_aead_encrypt_done(struct crypto_async_request *base, int err) +{ + struct sk_buff *skb = base->data; + struct tipc_crypto_tx_ctx *tx_ctx = TIPC_SKB_CB(skb)->crypto_ctx; + struct tipc_bearer *b = tx_ctx->bearer; + struct tipc_aead *aead = tx_ctx->aead; + struct tipc_crypto *tx = aead->crypto; + struct net *net = tx->net; + + switch (err) { + case 0: + this_cpu_inc(tx->stats->stat[STAT_ASYNC_OK]); + if (likely(test_bit(0, &b->up))) + b->media->send_msg(net, skb, b, &tx_ctx->dst); + else + kfree_skb(skb); + break; + case -EINPROGRESS: + return; + default: + this_cpu_inc(tx->stats->stat[STAT_ASYNC_NOK]); + kfree_skb(skb); + break; + } + + kfree(tx_ctx); + tipc_bearer_put(b); + tipc_aead_put(aead); +} + +/** + * tipc_aead_decrypt - Decrypt an encrypted message + * @net: struct 
net + * @aead: TIPC AEAD for the message decryption + * @skb: the input/output skb + * @b: TIPC bearer where the message has been received + * + * Return: + * 0 : if the decryption has completed + * -EINPROGRESS/-EBUSY : if a callback will be performed + * < 0 : the decryption has failed + */ +static int tipc_aead_decrypt(struct net *net, struct tipc_aead *aead, + struct sk_buff *skb, struct tipc_bearer *b) +{ + struct tipc_crypto_rx_ctx *rx_ctx; + struct aead_request *req; + struct crypto_aead *tfm; + struct sk_buff *unused; + struct scatterlist *sg; + struct tipc_ehdr *ehdr; + int ehsz, nsg, rc; + void *ctx; + u32 salt; + u8 *iv; + + if (unlikely(!aead)) + return -ENOKEY; + + /* Cow skb data if needed */ + if (likely(!skb_cloned(skb) && + (!skb_is_nonlinear(skb) || !skb_has_frag_list(skb)))) { + nsg = 1 + skb_shinfo(skb)->nr_frags; + } else { + nsg = skb_cow_data(skb, 0, &unused); + if (unlikely(nsg < 0)) { + pr_err("RX: skb_cow_data() returned %d\n", nsg); + return nsg; + } + } + + /* Allocate memory for the AEAD operation */ + tfm = tipc_aead_tfm_next(aead); + ctx = tipc_aead_mem_alloc(tfm, sizeof(*rx_ctx), &iv, &req, &sg, nsg); + if (unlikely(!ctx)) + return -ENOMEM; + TIPC_SKB_CB(skb)->crypto_ctx = ctx; + + /* Map skb to the sg lists */ + sg_init_table(sg, nsg); + rc = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(rc < 0)) { + pr_err("RX: skb_to_sgvec() returned %d, nsg %d\n", rc, nsg); + goto exit; + } + + /* Reconstruct IV: */ + ehdr = (struct tipc_ehdr *)skb->data; + salt = aead->salt; + if (aead->mode == CLUSTER_KEY) + salt ^= ehdr->addr; /* __be32 */ + else if (ehdr->destined) + salt ^= tipc_own_addr(net); + memcpy(iv, &salt, 4); + memcpy(iv + 4, (u8 *)&ehdr->seqno, 8); + + /* Prepare request */ + ehsz = tipc_ehdr_size(ehdr); + aead_request_set_tfm(req, tfm); + aead_request_set_ad(req, ehsz); + aead_request_set_crypt(req, sg, sg, skb->len - ehsz, iv); + + /* Set callback function & data */ + aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, + tipc_aead_decrypt_done, skb); + rx_ctx = (struct tipc_crypto_rx_ctx *)ctx; + rx_ctx->aead = aead; + rx_ctx->bearer = b; + + /* Hold bearer */ + if (unlikely(!tipc_bearer_hold(b))) { + rc = -ENODEV; + goto exit; + } + + /* Now, do decrypt */ + rc = crypto_aead_decrypt(req); + if (rc == -EINPROGRESS || rc == -EBUSY) + return rc; + + tipc_bearer_put(b); + +exit: + kfree(ctx); + TIPC_SKB_CB(skb)->crypto_ctx = NULL; + return rc; +} + +static void tipc_aead_decrypt_done(struct crypto_async_request *base, int err) +{ + struct sk_buff *skb = base->data; + struct tipc_crypto_rx_ctx *rx_ctx = TIPC_SKB_CB(skb)->crypto_ctx; + struct tipc_bearer *b = rx_ctx->bearer; + struct tipc_aead *aead = rx_ctx->aead; + struct tipc_crypto_stats __percpu *stats = aead->crypto->stats; + struct net *net = aead->crypto->net; + + switch (err) { + case 0: + this_cpu_inc(stats->stat[STAT_ASYNC_OK]); + break; + case -EINPROGRESS: + return; + default: + this_cpu_inc(stats->stat[STAT_ASYNC_NOK]); + break; + } + + kfree(rx_ctx); + tipc_crypto_rcv_complete(net, aead, b, &skb, err); + if (likely(skb)) { + if (likely(test_bit(0, &b->up))) + tipc_rcv(net, skb, b); + else + kfree_skb(skb); + } + + tipc_bearer_put(b); +} + +static inline int tipc_ehdr_size(struct tipc_ehdr *ehdr) +{ + return (ehdr->user != LINK_CONFIG) ? 
EHDR_SIZE : EHDR_CFG_SIZE; +} + +/** + * tipc_ehdr_validate - Validate an encryption message + * @skb: the message buffer + * + * Returns "true" if this is a valid encryption message, otherwise "false" + */ +bool tipc_ehdr_validate(struct sk_buff *skb) +{ + struct tipc_ehdr *ehdr; + int ehsz; + + if (unlikely(!pskb_may_pull(skb, EHDR_MIN_SIZE))) + return false; + + ehdr = (struct tipc_ehdr *)skb->data; + if (unlikely(ehdr->version != TIPC_EVERSION)) + return false; + ehsz = tipc_ehdr_size(ehdr); + if (unlikely(!pskb_may_pull(skb, ehsz))) + return false; + if (unlikely(skb->len <= ehsz + TIPC_AES_GCM_TAG_SIZE)) + return false; + if (unlikely(!ehdr->tx_key)) + return false; + + return true; +} + +/** + * tipc_ehdr_build - Build TIPC encryption message header + * @net: struct net + * @aead: TX AEAD key to be used for the message encryption + * @tx_key: key id used for the message encryption + * @skb: input/output message skb + * @__rx: RX crypto handle if dest is "known" + * + * Return: the header size if the building is successful, otherwise < 0 + */ +static int tipc_ehdr_build(struct net *net, struct tipc_aead *aead, + u8 tx_key, struct sk_buff *skb, + struct tipc_crypto *__rx) +{ + struct tipc_msg *hdr = buf_msg(skb); + struct tipc_ehdr *ehdr; + u32 user = msg_user(hdr); + u64 seqno; + int ehsz; + + /* Make room for encryption header */ + ehsz = (user != LINK_CONFIG) ? EHDR_SIZE : EHDR_CFG_SIZE; + WARN_ON(skb_headroom(skb) < ehsz); + ehdr = (struct tipc_ehdr *)skb_push(skb, ehsz); + + /* Obtain a seqno first: + * Use the key seqno (= cluster wise) if dest is unknown or we're in + * cluster key mode, otherwise it's better for a per-peer seqno! + */ + if (!__rx || aead->mode == CLUSTER_KEY) + seqno = atomic64_inc_return(&aead->seqno); + else + seqno = atomic64_inc_return(&__rx->sndnxt); + + /* Revoke the key if seqno is wrapped around */ + if (unlikely(!seqno)) + return tipc_crypto_key_revoke(net, tx_key); + + /* Word 1-2 */ + ehdr->seqno = cpu_to_be64(seqno); + + /* Words 0, 3- */ + ehdr->version = TIPC_EVERSION; + ehdr->user = 0; + ehdr->keepalive = 0; + ehdr->tx_key = tx_key; + ehdr->destined = (__rx) ? 1 : 0; + ehdr->rx_key_active = (__rx) ? __rx->key.active : 0; + ehdr->reserved_1 = 0; + ehdr->reserved_2 = 0; + + switch (user) { + case LINK_CONFIG: + ehdr->user = LINK_CONFIG; + memcpy(ehdr->id, tipc_own_id(net), NODE_ID_LEN); + break; + default: + if (user == LINK_PROTOCOL && msg_type(hdr) == STATE_MSG) { + ehdr->user = LINK_PROTOCOL; + ehdr->keepalive = msg_is_keepalive(hdr); + } + ehdr->addr = hdr->hdr[3]; + break; + } + + return ehsz; +} + +static inline void tipc_crypto_key_set_state(struct tipc_crypto *c, + u8 new_passive, + u8 new_active, + u8 new_pending) +{ +#ifdef TIPC_CRYPTO_DEBUG + struct tipc_key old = c->key; + char buf[32]; +#endif + + c->key.keys = ((new_passive & KEY_MASK) << (KEY_BITS * 2)) | + ((new_active & KEY_MASK) << (KEY_BITS)) | + ((new_pending & KEY_MASK)); + +#ifdef TIPC_CRYPTO_DEBUG + pr_info("%s(%s): key changing %s ::%pS\n", + (c->node) ? "RX" : "TX", + (c->node) ? tipc_node_get_id_str(c->node) : + tipc_own_id_string(c->net), + tipc_key_change_dump(old, c->key, buf), + __builtin_return_address(0)); +#endif +} + +/** + * tipc_crypto_key_init - Initiate a new user / AEAD key + * @c: TIPC crypto to which new key is attached + * @ukey: the user key + * @mode: the key mode (CLUSTER_KEY or PER_NODE_KEY) + * + * A new TIPC AEAD key will be allocated and initiated with the specified user + * key, then attached to the TIPC crypto. 
+ * + * Return: new key id in case of success, otherwise: < 0 + */ +int tipc_crypto_key_init(struct tipc_crypto *c, struct tipc_aead_key *ukey, + u8 mode) +{ + struct tipc_aead *aead = NULL; + int rc = 0; + + /* Initiate with the new user key */ + rc = tipc_aead_init(&aead, ukey, mode); + + /* Attach it to the crypto */ + if (likely(!rc)) { + rc = tipc_crypto_key_attach(c, aead, 0); + if (rc < 0) + tipc_aead_free(&aead->rcu); + } + + pr_info("%s(%s): key initiating, rc %d!\n", + (c->node) ? "RX" : "TX", + (c->node) ? tipc_node_get_id_str(c->node) : + tipc_own_id_string(c->net), + rc); + + return rc; +} + +/** + * tipc_crypto_key_attach - Attach a new AEAD key to TIPC crypto + * @c: TIPC crypto to which the new AEAD key is attached + * @aead: the new AEAD key pointer + * @pos: desired slot in the crypto key array, = 0 if any! + * + * Return: new key id in case of success, otherwise: -EBUSY + */ +static int tipc_crypto_key_attach(struct tipc_crypto *c, + struct tipc_aead *aead, u8 pos) +{ + u8 new_pending, new_passive, new_key; + struct tipc_key key; + int rc = -EBUSY; + + spin_lock_bh(&c->lock); + key = c->key; + if (key.active && key.passive) + goto exit; + if (key.passive && !tipc_aead_users(c->aead[key.passive])) + goto exit; + if (key.pending) { + if (pos) + goto exit; + if (tipc_aead_users(c->aead[key.pending]) > 0) + goto exit; + /* Replace it */ + new_pending = key.pending; + new_passive = key.passive; + new_key = new_pending; + } else { + if (pos) { + if (key.active && pos != key_next(key.active)) { + new_pending = key.pending; + new_passive = pos; + new_key = new_passive; + goto attach; + } else if (!key.active && !key.passive) { + new_pending = pos; + new_passive = key.passive; + new_key = new_pending; + goto attach; + } + } + new_pending = key_next(key.active ?: key.passive); + new_passive = key.passive; + new_key = new_pending; + } + +attach: + aead->crypto = c; + tipc_crypto_key_set_state(c, new_passive, key.active, new_pending); + tipc_aead_rcu_replace(c->aead[new_key], aead, &c->lock); + + c->working = 1; + c->timer1 = jiffies; + c->timer2 = jiffies; + rc = new_key; + +exit: + spin_unlock_bh(&c->lock); + return rc; +} + +void tipc_crypto_key_flush(struct tipc_crypto *c) +{ + int k; + + spin_lock_bh(&c->lock); + c->working = 0; + tipc_crypto_key_set_state(c, 0, 0, 0); + for (k = KEY_MIN; k <= KEY_MAX; k++) + tipc_crypto_key_detach(c->aead[k], &c->lock); + atomic_set(&c->peer_rx_active, 0); + atomic64_set(&c->sndnxt, 0); + spin_unlock_bh(&c->lock); +} + +/** + * tipc_crypto_key_try_align - Align RX keys if possible + * @rx: RX crypto handle + * @new_pending: new pending slot if aligned (= TX key from peer) + * + * Peer has used an unknown key slot, this only happens when peer has left and + * rejoned, or we are newcomer. + * That means, there must be no active key but a pending key at unaligned slot. + * If so, we try to move the pending key to the new slot. + * Note: A potential passive key can exist, it will be shifted correspondingly! 
+ * + * Return: "true" if key is successfully aligned, otherwise "false" + */ +static bool tipc_crypto_key_try_align(struct tipc_crypto *rx, u8 new_pending) +{ + struct tipc_aead *tmp1, *tmp2 = NULL; + struct tipc_key key; + bool aligned = false; + u8 new_passive = 0; + int x; + + spin_lock(&rx->lock); + key = rx->key; + if (key.pending == new_pending) { + aligned = true; + goto exit; + } + if (key.active) + goto exit; + if (!key.pending) + goto exit; + if (tipc_aead_users(rx->aead[key.pending]) > 0) + goto exit; + + /* Try to "isolate" this pending key first */ + tmp1 = tipc_aead_rcu_ptr(rx->aead[key.pending], &rx->lock); + if (!refcount_dec_if_one(&tmp1->refcnt)) + goto exit; + rcu_assign_pointer(rx->aead[key.pending], NULL); + + /* Move passive key if any */ + if (key.passive) { + tipc_aead_rcu_swap(rx->aead[key.passive], tmp2, &rx->lock); + x = (key.passive - key.pending + new_pending) % KEY_MAX; + new_passive = (x <= 0) ? x + KEY_MAX : x; + } + + /* Re-allocate the key(s) */ + tipc_crypto_key_set_state(rx, new_passive, 0, new_pending); + rcu_assign_pointer(rx->aead[new_pending], tmp1); + if (new_passive) + rcu_assign_pointer(rx->aead[new_passive], tmp2); + refcount_set(&tmp1->refcnt, 1); + aligned = true; + pr_info("RX(%s): key is aligned!\n", tipc_node_get_id_str(rx->node)); + +exit: + spin_unlock(&rx->lock); + return aligned; +} + +/** + * tipc_crypto_key_pick_tx - Pick one TX key for message decryption + * @tx: TX crypto handle + * @rx: RX crypto handle (can be NULL) + * @skb: the message skb which will be decrypted later + * + * This function looks up the existing TX keys and pick one which is suitable + * for the message decryption, that must be a cluster key and not used before + * on the same message (i.e. recursive). + * + * Return: the TX AEAD key handle in case of success, otherwise NULL + */ +static struct tipc_aead *tipc_crypto_key_pick_tx(struct tipc_crypto *tx, + struct tipc_crypto *rx, + struct sk_buff *skb) +{ + struct tipc_skb_cb *skb_cb = TIPC_SKB_CB(skb); + struct tipc_aead *aead = NULL; + struct tipc_key key = tx->key; + u8 k, i = 0; + + /* Initialize data if not yet */ + if (!skb_cb->tx_clone_deferred) { + skb_cb->tx_clone_deferred = 1; + memset(&skb_cb->tx_clone_ctx, 0, sizeof(skb_cb->tx_clone_ctx)); + } + + skb_cb->tx_clone_ctx.rx = rx; + if (++skb_cb->tx_clone_ctx.recurs > 2) + return NULL; + + /* Pick one TX key */ + spin_lock(&tx->lock); + do { + k = (i == 0) ? key.pending : + ((i == 1) ? key.active : key.passive); + if (!k) + continue; + aead = tipc_aead_rcu_ptr(tx->aead[k], &tx->lock); + if (!aead) + continue; + if (aead->mode != CLUSTER_KEY || + aead == skb_cb->tx_clone_ctx.last) { + aead = NULL; + continue; + } + /* Ok, found one cluster key */ + skb_cb->tx_clone_ctx.last = aead; + WARN_ON(skb->next); + skb->next = skb_clone(skb, GFP_ATOMIC); + if (unlikely(!skb->next)) + pr_warn("Failed to clone skb for next round if any\n"); + WARN_ON(!refcount_inc_not_zero(&aead->refcnt)); + break; + } while (++i < 3); + spin_unlock(&tx->lock); + + return aead; +} + +/** + * tipc_crypto_key_synch: Synch own key data according to peer key status + * @rx: RX crypto handle + * @new_rx_active: latest RX active key from peer + * @hdr: TIPCv2 message + * + * This function updates the peer node related data as the peer RX active key + * has changed, so the number of TX keys' users on this node are increased and + * decreased correspondingly. + * + * The "per-peer" sndnxt is also reset when the peer key has switched. 
+ */ +static void tipc_crypto_key_synch(struct tipc_crypto *rx, u8 new_rx_active, + struct tipc_msg *hdr) +{ + struct net *net = rx->net; + struct tipc_crypto *tx = tipc_net(net)->crypto_tx; + u8 cur_rx_active; + + /* TX might be even not ready yet */ + if (unlikely(!tx->key.active && !tx->key.pending)) + return; + + cur_rx_active = atomic_read(&rx->peer_rx_active); + if (likely(cur_rx_active == new_rx_active)) + return; + + /* Make sure this message destined for this node */ + if (unlikely(msg_short(hdr) || + msg_destnode(hdr) != tipc_own_addr(net))) + return; + + /* Peer RX active key has changed, try to update owns' & TX users */ + if (atomic_cmpxchg(&rx->peer_rx_active, + cur_rx_active, + new_rx_active) == cur_rx_active) { + if (new_rx_active) + tipc_aead_users_inc(tx->aead[new_rx_active], INT_MAX); + if (cur_rx_active) + tipc_aead_users_dec(tx->aead[cur_rx_active], 0); + + atomic64_set(&rx->sndnxt, 0); + /* Mark the point TX key users changed */ + tx->timer1 = jiffies; + +#ifdef TIPC_CRYPTO_DEBUG + pr_info("TX(%s): key users changed %d-- %d++, peer RX(%s)\n", + tipc_own_id_string(net), cur_rx_active, + new_rx_active, tipc_node_get_id_str(rx->node)); +#endif + } +} + +static int tipc_crypto_key_revoke(struct net *net, u8 tx_key) +{ + struct tipc_crypto *tx = tipc_net(net)->crypto_tx; + struct tipc_key key; + + spin_lock(&tx->lock); + key = tx->key; + WARN_ON(!key.active || tx_key != key.active); + + /* Free the active key */ + tipc_crypto_key_set_state(tx, key.passive, 0, key.pending); + tipc_crypto_key_detach(tx->aead[key.active], &tx->lock); + spin_unlock(&tx->lock); + + pr_warn("TX(%s): key is revoked!\n", tipc_own_id_string(net)); + return -EKEYREVOKED; +} + +int tipc_crypto_start(struct tipc_crypto **crypto, struct net *net, + struct tipc_node *node) +{ + struct tipc_crypto *c; + + if (*crypto) + return -EEXIST; + + /* Allocate crypto */ + c = kzalloc(sizeof(*c), GFP_ATOMIC); + if (!c) + return -ENOMEM; + + /* Allocate statistic structure */ + c->stats = alloc_percpu_gfp(struct tipc_crypto_stats, GFP_ATOMIC); + if (!c->stats) { + kzfree(c); + return -ENOMEM; + } + + c->working = 0; + c->net = net; + c->node = node; + tipc_crypto_key_set_state(c, 0, 0, 0); + atomic_set(&c->peer_rx_active, 0); + atomic64_set(&c->sndnxt, 0); + c->timer1 = jiffies; + c->timer2 = jiffies; + spin_lock_init(&c->lock); + *crypto = c; + + return 0; +} + +void tipc_crypto_stop(struct tipc_crypto **crypto) +{ + struct tipc_crypto *c, *tx, *rx; + bool is_rx; + u8 k; + + if (!*crypto) + return; + + rcu_read_lock(); + /* RX stopping? => decrease TX key users if any */ + is_rx = !!((*crypto)->node); + if (is_rx) { + rx = *crypto; + tx = tipc_net(rx->net)->crypto_tx; + k = atomic_read(&rx->peer_rx_active); + if (k) { + tipc_aead_users_dec(tx->aead[k], 0); + /* Mark the point TX key users changed */ + tx->timer1 = jiffies; + } + } + + /* Release AEAD keys */ + c = *crypto; + for (k = KEY_MIN; k <= KEY_MAX; k++) + tipc_aead_put(rcu_dereference(c->aead[k])); + rcu_read_unlock(); + + pr_warn("%s(%s) has been purged, node left!\n", + (is_rx) ? "RX" : "TX", + (is_rx) ? 
tipc_node_get_id_str((*crypto)->node) : + tipc_own_id_string((*crypto)->net)); + + /* Free this crypto statistics */ + free_percpu(c->stats); + + *crypto = NULL; + kzfree(c); +} + +void tipc_crypto_timeout(struct tipc_crypto *rx) +{ + struct tipc_net *tn = tipc_net(rx->net); + struct tipc_crypto *tx = tn->crypto_tx; + struct tipc_key key; + u8 new_pending, new_passive; + int cmd; + + /* TX key activating: + * The pending key (users > 0) -> active + * The active key if any (users == 0) -> free + */ + spin_lock(&tx->lock); + key = tx->key; + if (key.active && tipc_aead_users(tx->aead[key.active]) > 0) + goto s1; + if (!key.pending || tipc_aead_users(tx->aead[key.pending]) <= 0) + goto s1; + if (time_before(jiffies, tx->timer1 + TIPC_TX_LASTING_LIM)) + goto s1; + + tipc_crypto_key_set_state(tx, key.passive, key.pending, 0); + if (key.active) + tipc_crypto_key_detach(tx->aead[key.active], &tx->lock); + this_cpu_inc(tx->stats->stat[STAT_SWITCHES]); + pr_info("TX(%s): key %d is activated!\n", tipc_own_id_string(tx->net), + key.pending); + +s1: + spin_unlock(&tx->lock); + + /* RX key activating: + * The pending key (users > 0) -> active + * The active key if any -> passive, freed later + */ + spin_lock(&rx->lock); + key = rx->key; + if (!key.pending || tipc_aead_users(rx->aead[key.pending]) <= 0) + goto s2; + + new_pending = (key.passive && + !tipc_aead_users(rx->aead[key.passive])) ? + key.passive : 0; + new_passive = (key.active) ?: ((new_pending) ? 0 : key.passive); + tipc_crypto_key_set_state(rx, new_passive, key.pending, new_pending); + this_cpu_inc(rx->stats->stat[STAT_SWITCHES]); + pr_info("RX(%s): key %d is activated!\n", + tipc_node_get_id_str(rx->node), key.pending); + goto s5; + +s2: + /* RX key "faulty" switching: + * The faulty pending key (users < -30) -> passive + * The passive key (users = 0) -> pending + * Note: This only happens after RX deactivated - s3! + */ + key = rx->key; + if (!key.pending || tipc_aead_users(rx->aead[key.pending]) > -30) + goto s3; + if (!key.passive || tipc_aead_users(rx->aead[key.passive]) != 0) + goto s3; + + new_pending = key.passive; + new_passive = key.pending; + tipc_crypto_key_set_state(rx, new_passive, key.active, new_pending); + goto s5; + +s3: + /* RX key deactivating: + * The passive key if any -> pending + * The active key -> passive (users = 0) / pending + * The pending key if any -> passive (users = 0) + */ + key = rx->key; + if (!key.active) + goto s4; + if (time_before(jiffies, rx->timer1 + TIPC_RX_ACTIVE_LIM)) + goto s4; + + new_pending = (key.passive) ?: key.active; + new_passive = (key.passive) ? 
key.active : key.pending; + tipc_aead_users_set(rx->aead[new_pending], 0); + if (new_passive) + tipc_aead_users_set(rx->aead[new_passive], 0); + tipc_crypto_key_set_state(rx, new_passive, 0, new_pending); + pr_info("RX(%s): key %d is deactivated!\n", + tipc_node_get_id_str(rx->node), key.active); + goto s5; + +s4: + /* RX key passive -> freed: */ + key = rx->key; + if (!key.passive || !tipc_aead_users(rx->aead[key.passive])) + goto s5; + if (time_before(jiffies, rx->timer2 + TIPC_RX_PASSIVE_LIM)) + goto s5; + + tipc_crypto_key_set_state(rx, 0, key.active, key.pending); + tipc_crypto_key_detach(rx->aead[key.passive], &rx->lock); + pr_info("RX(%s): key %d is freed!\n", tipc_node_get_id_str(rx->node), + key.passive); + +s5: + spin_unlock(&rx->lock); + + /* Limit max_tfms & do debug commands if needed */ + if (likely(sysctl_tipc_max_tfms <= TIPC_MAX_TFMS_LIM)) + return; + + cmd = sysctl_tipc_max_tfms; + sysctl_tipc_max_tfms = TIPC_MAX_TFMS_DEF; + tipc_crypto_do_cmd(rx->net, cmd); +} + +/** + * tipc_crypto_xmit - Build & encrypt TIPC message for xmit + * @net: struct net + * @skb: input/output message skb pointer + * @b: bearer used for xmit later + * @dst: destination media address + * @__dnode: destination node for reference if any + * + * First, build an encryption message header on the top of the message, then + * encrypt the original TIPC message by using the active or pending TX key. + * If the encryption is successful, the encrypted skb is returned directly or + * via the callback. + * Otherwise, the skb is freed! + * + * Return: + * 0 : the encryption has succeeded (or no encryption) + * -EINPROGRESS/-EBUSY : the encryption is ongoing, a callback will be made + * -ENOKEK : the encryption has failed due to no key + * -EKEYREVOKED : the encryption has failed due to key revoked + * -ENOMEM : the encryption has failed due to no memory + * < 0 : the encryption has failed due to other reasons + */ +int tipc_crypto_xmit(struct net *net, struct sk_buff **skb, + struct tipc_bearer *b, struct tipc_media_addr *dst, + struct tipc_node *__dnode) +{ + struct tipc_crypto *__rx = tipc_node_crypto_rx(__dnode); + struct tipc_crypto *tx = tipc_net(net)->crypto_tx; + struct tipc_crypto_stats __percpu *stats = tx->stats; + struct tipc_key key = tx->key; + struct tipc_aead *aead = NULL; + struct sk_buff *probe; + int rc = -ENOKEY; + u8 tx_key; + + /* No encryption? */ + if (!tx->working) + return 0; + + /* Try with the pending key if available and: + * 1) This is the only choice (i.e. 
no active key) or; + * 2) Peer has switched to this key (unicast only) or; + * 3) It is time to do a pending key probe; + */ + if (unlikely(key.pending)) { + tx_key = key.pending; + if (!key.active) + goto encrypt; + if (__rx && atomic_read(&__rx->peer_rx_active) == tx_key) + goto encrypt; + if (TIPC_SKB_CB(*skb)->probe) + goto encrypt; + if (!__rx && + time_after(jiffies, tx->timer2 + TIPC_TX_PROBE_LIM)) { + tx->timer2 = jiffies; + probe = skb_clone(*skb, GFP_ATOMIC); + if (probe) { + TIPC_SKB_CB(probe)->probe = 1; + tipc_crypto_xmit(net, &probe, b, dst, __dnode); + if (probe) + b->media->send_msg(net, probe, b, dst); + } + } + } + /* Else, use the active key if any */ + if (likely(key.active)) { + tx_key = key.active; + goto encrypt; + } + goto exit; + +encrypt: + aead = tipc_aead_get(tx->aead[tx_key]); + if (unlikely(!aead)) + goto exit; + rc = tipc_ehdr_build(net, aead, tx_key, *skb, __rx); + if (likely(rc > 0)) + rc = tipc_aead_encrypt(aead, *skb, b, dst, __dnode); + +exit: + switch (rc) { + case 0: + this_cpu_inc(stats->stat[STAT_OK]); + break; + case -EINPROGRESS: + case -EBUSY: + this_cpu_inc(stats->stat[STAT_ASYNC]); + *skb = NULL; + return rc; + default: + this_cpu_inc(stats->stat[STAT_NOK]); + if (rc == -ENOKEY) + this_cpu_inc(stats->stat[STAT_NOKEYS]); + else if (rc == -EKEYREVOKED) + this_cpu_inc(stats->stat[STAT_BADKEYS]); + kfree_skb(*skb); + *skb = NULL; + break; + } + + tipc_aead_put(aead); + return rc; +} + +/** + * tipc_crypto_rcv - Decrypt an encrypted TIPC message from peer + * @net: struct net + * @rx: RX crypto handle + * @skb: input/output message skb pointer + * @b: bearer where the message has been received + * + * If the decryption is successful, the decrypted skb is returned directly or + * as the callback, the encryption header and auth tag will be trimed out + * before forwarding to tipc_rcv() via the tipc_crypto_rcv_complete(). + * Otherwise, the skb will be freed! + * Note: RX key(s) can be re-aligned, or in case of no key suitable, TX + * cluster key(s) can be taken for decryption (- recursive). + * + * Return: + * 0 : the decryption has successfully completed + * -EINPROGRESS/-EBUSY : the decryption is ongoing, a callback will be made + * -ENOKEY : the decryption has failed due to no key + * -EBADMSG : the decryption has failed due to bad message + * -ENOMEM : the decryption has failed due to no memory + * < 0 : the decryption has failed due to other reasons + */ +int tipc_crypto_rcv(struct net *net, struct tipc_crypto *rx, + struct sk_buff **skb, struct tipc_bearer *b) +{ + struct tipc_crypto *tx = tipc_net(net)->crypto_tx; + struct tipc_crypto_stats __percpu *stats; + struct tipc_aead *aead = NULL; + struct tipc_key key; + int rc = -ENOKEY; + u8 tx_key = 0; + + /* New peer? + * Let's try with TX key (i.e. cluster mode) & verify the skb first! + */ + if (unlikely(!rx)) + goto pick_tx; + + /* Pick RX key according to TX key, three cases are possible: + * 1) The current active key (likely) or; + * 2) The pending (new or deactivated) key (if any) or; + * 3) The passive or old active key (i.e. users > 0); + */ + tx_key = ((struct tipc_ehdr *)(*skb)->data)->tx_key; + key = rx->key; + if (likely(tx_key == key.active)) + goto decrypt; + if (tx_key == key.pending) + goto decrypt; + if (tx_key == key.passive) { + rx->timer2 = jiffies; + if (tipc_aead_users(rx->aead[key.passive]) > 0) + goto decrypt; + } + + /* Unknown key, let's try to align RX key(s) */ + if (tipc_crypto_key_try_align(rx, tx_key)) + goto decrypt; + +pick_tx: + /* No key suitable? 
Try to pick one from TX... */ + aead = tipc_crypto_key_pick_tx(tx, rx, *skb); + if (aead) + goto decrypt; + goto exit; + +decrypt: + rcu_read_lock(); + if (!aead) + aead = tipc_aead_get(rx->aead[tx_key]); + rc = tipc_aead_decrypt(net, aead, *skb, b); + rcu_read_unlock(); + +exit: + stats = ((rx) ?: tx)->stats; + switch (rc) { + case 0: + this_cpu_inc(stats->stat[STAT_OK]); + break; + case -EINPROGRESS: + case -EBUSY: + this_cpu_inc(stats->stat[STAT_ASYNC]); + *skb = NULL; + return rc; + default: + this_cpu_inc(stats->stat[STAT_NOK]); + if (rc == -ENOKEY) { + kfree_skb(*skb); + *skb = NULL; + if (rx) + tipc_node_put(rx->node); + this_cpu_inc(stats->stat[STAT_NOKEYS]); + return rc; + } else if (rc == -EBADMSG) { + this_cpu_inc(stats->stat[STAT_BADMSGS]); + } + break; + } + + tipc_crypto_rcv_complete(net, aead, b, skb, rc); + return rc; +} + +static void tipc_crypto_rcv_complete(struct net *net, struct tipc_aead *aead, + struct tipc_bearer *b, + struct sk_buff **skb, int err) +{ + struct tipc_skb_cb *skb_cb = TIPC_SKB_CB(*skb); + struct tipc_crypto *rx = aead->crypto; + struct tipc_aead *tmp = NULL; + struct t... [truncated message content] |
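For readers tracing the key handling in the excerpt above: it works with three key slots (pending/active/passive) and a key_next() helper whose definition is not part of this mail. The snippet below is a small userspace sketch of the slot-selection step in tipc_crypto_key_attach() for the "pos == 0" case only; the KEY_MIN/KEY_MAX values and key_next() here are assumptions chosen to match how they are used above, not copies of the real definitions.

#include <stdio.h>

#define KEY_MIN 1
#define KEY_MAX 3

struct tipc_key { unsigned char pending, active, passive; };

/* assumed helper: next slot after k, wrapping from KEY_MAX back to KEY_MIN */
static unsigned char key_next(unsigned char k)
{
        return (k == KEY_MAX) ? KEY_MIN : k + 1;
}

/* which slot a new user key would occupy as "pending" (pos == 0 case) */
static unsigned char pick_pending_slot(struct tipc_key key)
{
        if (key.pending)        /* an unused pending key is simply replaced */
                return key.pending;
        return key_next(key.active ? key.active : key.passive);
}

int main(void)
{
        struct tipc_key key = { .pending = 0, .active = 2, .passive = 0 };

        printf("new key becomes pending in slot %u\n",
               (unsigned int)pick_pending_slot(key));
        return 0;
}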
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:28
|
The new structure 'tipc_aead_key' is added to the 'tipc.h' for user to be able to transfer a key to TIPC in kernel. Netlink will be used for this purpose in the later commits. Signed-off-by: Tuong Lien <tuo...@de...> --- include/uapi/linux/tipc.h | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h index 76421b878767..add01db1daef 100644 --- a/include/uapi/linux/tipc.h +++ b/include/uapi/linux/tipc.h @@ -233,6 +233,27 @@ struct tipc_sioc_nodeid_req { char node_id[TIPC_NODEID_LEN]; }; +/* + * TIPC Crypto, AEAD + */ +#define TIPC_AEAD_ALG_NAME (32) + +struct tipc_aead_key { + char alg_name[TIPC_AEAD_ALG_NAME]; + unsigned int keylen; /* in bytes */ + char key[]; +}; + +#define TIPC_AEAD_KEYLEN_MIN (16 + 4) +#define TIPC_AEAD_KEYLEN_MAX (32 + 4) +#define TIPC_AEAD_KEY_SIZE_MAX (sizeof(struct tipc_aead_key) + \ + TIPC_AEAD_KEYLEN_MAX) + +static inline int tipc_aead_key_size(struct tipc_aead_key *key) +{ + return sizeof(*key) + key->keylen; +} + /* The macros and functions below are deprecated: */ -- 2.13.7 |
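A quick userspace illustration of how a 'struct tipc_aead_key' from the patch above could be filled in. It assumes installed headers that already contain this patch; the algorithm string "gcm(aes)" and the reading of the minimum length as a 16-byte AES key plus 4-byte salt are assumptions made for the example, and the netlink transfer itself belongs to a later patch in the series, so it is not shown.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/tipc.h>         /* needs headers with this patch applied */

int main(void)
{
        unsigned char raw[TIPC_AEAD_KEYLEN_MIN] = { 0 };  /* 16B key + 4B salt */
        struct tipc_aead_key *key;

        key = calloc(1, sizeof(*key) + sizeof(raw));
        if (!key)
                return 1;

        strncpy(key->alg_name, "gcm(aes)", TIPC_AEAD_ALG_NAME - 1);
        key->keylen = sizeof(raw);
        memcpy(key->key, raw, sizeof(raw));

        printf("netlink attribute payload: %d bytes\n", tipc_aead_key_size(key));
        free(key);
        return 0;
}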
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:25
|
This series provides TIPC encryption feature, kernel part. There will be another one in the 'iproute2/tipc' for user space to set key. Tuong Lien (5): tipc: add reference counter to bearer tipc: enable creating a "preliminary" node tipc: add new AEAD key structure for user API tipc: introduce TIPC encryption & authentication tipc: add support for AEAD key setting via netlink include/uapi/linux/tipc.h | 21 + include/uapi/linux/tipc_netlink.h | 8 + net/tipc/Kconfig | 12 + net/tipc/Makefile | 1 + net/tipc/bcast.c | 2 +- net/tipc/bearer.c | 54 +- net/tipc/bearer.h | 11 +- net/tipc/core.c | 18 + net/tipc/core.h | 8 + net/tipc/crypto.c | 1986 +++++++++++++++++++++++++++++++++++++ net/tipc/crypto.h | 167 ++++ net/tipc/link.c | 19 +- net/tipc/link.h | 1 + net/tipc/msg.c | 15 +- net/tipc/msg.h | 46 +- net/tipc/netlink.c | 20 +- net/tipc/node.c | 347 ++++++- net/tipc/node.h | 13 + net/tipc/sysctl.c | 11 + net/tipc/udp_media.c | 1 + 20 files changed, 2694 insertions(+), 67 deletions(-) create mode 100644 net/tipc/crypto.c create mode 100644 net/tipc/crypto.h -- 2.13.7 |
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:25
|
When user sets RX key for a peer not existing on the own node, a new node entry is needed to which the RX key will be attached. However, since the peer node address (& capabilities) is unknown at that moment, only the node-ID is provided, this commit allows the creation of a node with only the data that we call as “preliminary”. A preliminary node is not the object of the “tipc_node_find()” but the “tipc_node_find_by_id()”. Once the first message i.e. LINK_CONFIG comes from that peer, and is successfully decrypted by the own node, the actual peer node data will be properly updated and the node will function as usual. In addition, the node timer always starts when a node object is created so if a preliminary node is not used, it will be cleaned up. The later encryption functions will also use the node timer and be able to create a preliminary node automatically when needed. v2: rebase Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/node.c | 103 +++++++++++++++++++++++++++++++++++++++----------------- net/tipc/node.h | 1 + 2 files changed, 73 insertions(+), 31 deletions(-) diff --git a/net/tipc/node.c b/net/tipc/node.c index 4b60928049ea..5d5c95c9b4e6 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -89,6 +89,7 @@ struct tipc_bclink_entry { * @links: array containing references to all links to node * @action_flags: bit mask of different types of node actions * @state: connectivity state vs peer node + * @preliminary: a preliminary node or not * @sync_point: sequence number where synch/failover is finished * @list: links to adjacent nodes in sorted list of cluster's nodes * @working_links: number of working links to node (both active and standby) @@ -112,6 +113,7 @@ struct tipc_node { int action_flags; struct list_head list; int state; + bool preliminary; bool failover_sent; u16 sync_point; int link_cnt; @@ -120,6 +122,7 @@ struct tipc_node { u32 signature; u32 link_id; u8 peer_id[16]; + char peer_id_string[NODE_ID_STR_LEN]; struct list_head publ_list; struct list_head conn_sks; unsigned long keepalive_intv; @@ -245,6 +248,16 @@ u16 tipc_node_get_capabilities(struct net *net, u32 addr) return caps; } +u32 tipc_node_get_addr(struct tipc_node *node) +{ + return (node) ? 
node->addr : 0; +} + +char *tipc_node_get_id_str(struct tipc_node *node) +{ + return node->peer_id_string; +} + static void tipc_node_kref_release(struct kref *kref) { struct tipc_node *n = container_of(kref, struct tipc_node, kref); @@ -274,7 +287,7 @@ static struct tipc_node *tipc_node_find(struct net *net, u32 addr) rcu_read_lock(); hlist_for_each_entry_rcu(node, &tn->node_htable[thash], hash) { - if (node->addr != addr) + if (node->addr != addr || node->preliminary) continue; if (!kref_get_unless_zero(&node->kref)) node = NULL; @@ -400,17 +413,39 @@ static void tipc_node_assign_peer_net(struct tipc_node *n, u32 hash_mixes) static struct tipc_node *tipc_node_create(struct net *net, u32 addr, u8 *peer_id, u16 capabilities, - u32 signature, u32 hash_mixes) + u32 hash_mixes, bool preliminary) { struct tipc_net *tn = net_generic(net, tipc_net_id); struct tipc_node *n, *temp_node; struct tipc_link *l; + unsigned long intv; int bearer_id; int i; spin_lock_bh(&tn->node_list_lock); - n = tipc_node_find(net, addr); + n = tipc_node_find(net, addr) ?: + tipc_node_find_by_id(net, peer_id); if (n) { + if (!n->preliminary) + goto update; + if (preliminary) + goto exit; + /* A preliminary node becomes "real" now, refresh its data */ + tipc_node_write_lock(n); + n->preliminary = false; + n->addr = addr; + hlist_del_rcu(&n->hash); + hlist_add_head_rcu(&n->hash, + &tn->node_htable[tipc_hashfn(addr)]); + list_del_rcu(&n->list); + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + if (n->addr < temp_node->addr) + break; + } + list_add_tail_rcu(&n->list, &temp_node->list); + tipc_node_write_unlock_fast(n); + +update: if (n->peer_hash_mix ^ hash_mixes) tipc_node_assign_peer_net(n, hash_mixes); if (n->capabilities == capabilities) @@ -438,7 +473,9 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, pr_warn("Node creation failed, no memory\n"); goto exit; } + tipc_nodeid2string(n->peer_id_string, peer_id); n->addr = addr; + n->preliminary = preliminary; memcpy(&n->peer_id, peer_id, 16); n->net = net; n->peer_net = NULL; @@ -463,26 +500,14 @@ static struct tipc_node *tipc_node_create(struct net *net, u32 addr, n->signature = INVALID_NODE_SIG; n->active_links[0] = INVALID_BEARER_ID; n->active_links[1] = INVALID_BEARER_ID; - if (!tipc_link_bc_create(net, tipc_own_addr(net), - addr, U16_MAX, - tipc_link_window(tipc_bc_sndlink(net)), - n->capabilities, - &n->bc_entry.inputq1, - &n->bc_entry.namedq, - tipc_bc_sndlink(net), - &n->bc_entry.link)) { - pr_warn("Broadcast rcv link creation failed, no memory\n"); - if (n->peer_net) { - n->peer_net = NULL; - n->peer_hash_mix = 0; - } - kfree(n); - n = NULL; - goto exit; - } + n->bc_entry.link = NULL; tipc_node_get(n); timer_setup(&n->timer, tipc_node_timeout, 0); - n->keepalive_intv = U32_MAX; + /* Start a slow timer anyway, crypto needs it */ + n->keepalive_intv = 10000; + intv = jiffies + msecs_to_jiffies(n->keepalive_intv); + if (!mod_timer(&n->timer, intv)) + tipc_node_get(n); hlist_add_head_rcu(&n->hash, &tn->node_htable[tipc_hashfn(addr)]); list_for_each_entry_rcu(temp_node, &tn->node_list, list) { if (n->addr < temp_node->addr) @@ -1000,6 +1025,8 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) { struct tipc_net *tn = tipc_net(net); struct tipc_node *n; + bool preliminary; + u32 sugg_addr; /* Suggest new address if some other peer is using this one */ n = tipc_node_find(net, addr); @@ -1015,9 +1042,11 @@ u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr) /* Suggest previously used address if peer is known */ n = 
tipc_node_find_by_id(net, id); if (n) { - addr = n->addr; + sugg_addr = n->addr; + preliminary = n->preliminary; tipc_node_put(n); - return addr; + if (!preliminary) + return sugg_addr; } /* Even this node may be in conflict */ @@ -1034,7 +1063,7 @@ void tipc_node_check_dest(struct net *net, u32 addr, bool *respond, bool *dupl_addr) { struct tipc_node *n; - struct tipc_link *l; + struct tipc_link *l, *snd_l; struct tipc_link_entry *le; bool addr_match = false; bool sign_match = false; @@ -1048,12 +1077,27 @@ void tipc_node_check_dest(struct net *net, u32 addr, *dupl_addr = false; *respond = false; - n = tipc_node_create(net, addr, peer_id, capabilities, signature, - hash_mixes); + n = tipc_node_create(net, addr, peer_id, capabilities, hash_mixes, + false); if (!n) return; tipc_node_write_lock(n); + if (unlikely(!n->bc_entry.link)) { + snd_l = tipc_bc_sndlink(net); + if (!tipc_link_bc_create(net, tipc_own_addr(net), + addr, U16_MAX, + tipc_link_window(snd_l), + n->capabilities, + &n->bc_entry.inputq1, + &n->bc_entry.namedq, snd_l, + &n->bc_entry.link)) { + pr_warn("Broadcast rcv link creation failed, no mem\n"); + tipc_node_write_unlock_fast(n); + tipc_node_put(n); + return; + } + } le = &n->links[b->identity]; @@ -2128,6 +2172,8 @@ int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb) } list_for_each_entry_rcu(node, &tn->node_list, list) { + if (node->preliminary) + continue; if (last_addr) { if (node->addr == last_addr) last_addr = 0; @@ -2643,11 +2689,6 @@ int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb, return skb->len; } -u32 tipc_node_get_addr(struct tipc_node *node) -{ - return (node) ? node->addr : 0; -} - /** * tipc_node_dump - dump TIPC node data * @n: tipc node to be dumped diff --git a/net/tipc/node.h b/net/tipc/node.h index c39cd861c07d..50f8838b32c2 100644 --- a/net/tipc/node.h +++ b/net/tipc/node.h @@ -75,6 +75,7 @@ enum { void tipc_node_stop(struct net *net); bool tipc_node_get_id(struct net *net, u32 addr, u8 *id); u32 tipc_node_get_addr(struct tipc_node *node); +char *tipc_node_get_id_str(struct tipc_node *node); u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr); void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128, struct tipc_bearer *bearer, -- 2.13.7 |
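The core of the change above is the two-stage lookup plus the later "promotion" of a preliminary entry once the real address is learned. The following userspace sketch models only that decision flow with stub structures; it is a simplified illustration, not the kernel code (the real lookups are tipc_node_find()/tipc_node_find_by_id() with RCU and reference counting).

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct node {
        unsigned int addr;        /* 32-bit address, unknown while preliminary */
        unsigned char id[16];     /* 128-bit node identity */
        bool preliminary;
};

/* stand-in for tipc_node_find(): skips preliminary entries, as in the patch */
static struct node *find_by_addr(struct node *n, unsigned int addr)
{
        return (n && !n->preliminary && n->addr == addr) ? n : NULL;
}

/* stand-in for tipc_node_find_by_id() */
static struct node *find_by_id(struct node *n, const unsigned char *id)
{
        return (n && !memcmp(n->id, id, 16)) ? n : NULL;
}

/* mirrors "n = tipc_node_find(...) ?: tipc_node_find_by_id(...)" plus the
 * promotion of a preliminary node once its real address is known
 */
static struct node *lookup(struct node *n, unsigned int addr,
                           const unsigned char *id)
{
        struct node *hit = find_by_addr(n, addr);

        if (!hit)
                hit = find_by_id(n, id);
        if (hit && hit->preliminary && addr) {
                hit->addr = addr;         /* first real contact: promote it */
                hit->preliminary = false;
        }
        return hit;
}

int main(void)
{
        unsigned char peer_id[16] = { 0xaa };
        struct node prelim = { .addr = 0, .preliminary = true };

        memcpy(prelim.id, peer_id, 16);
        lookup(&prelim, 0x01001002, peer_id);   /* promotes the entry */
        return prelim.preliminary ? 1 : 0;
}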
From: Tuong L. <tuo...@de...> - 2019-11-05 10:51:24
|
As a need to support the crypto asynchronous operations in the later commits, apart from the current RCU mechanism for bearer pointer, we add a 'refcnt' to the bearer object as well. So, a bearer can be hold via 'tipc_bearer_hold()' without being freed even though the bearer or interface can be disabled in the meanwhile. If that happens, the bearer will be released then when the crypto operation is completed and 'tipc_bearer_put()' is called. Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/bearer.c | 14 +++++++++++++- net/tipc/bearer.h | 3 +++ 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c index 0214aa1c4427..6e15b9b1f1ef 100644 --- a/net/tipc/bearer.c +++ b/net/tipc/bearer.c @@ -315,6 +315,7 @@ static int tipc_enable_bearer(struct net *net, const char *name, b->net_plane = bearer_id + 'A'; b->priority = prio; test_and_set_bit_lock(0, &b->up); + refcount_set(&b->refcnt, 1); res = tipc_disc_create(net, b, &b->bcast_addr, &skb); if (res) { @@ -351,6 +352,17 @@ static int tipc_reset_bearer(struct net *net, struct tipc_bearer *b) return 0; } +bool tipc_bearer_hold(struct tipc_bearer *b) +{ + return (b && refcount_inc_not_zero(&b->refcnt)); +} + +void tipc_bearer_put(struct tipc_bearer *b) +{ + if (b && refcount_dec_and_test(&b->refcnt)) + kfree_rcu(b, rcu); +} + /** * bearer_disable * @@ -369,7 +381,7 @@ static void bearer_disable(struct net *net, struct tipc_bearer *b) if (b->disc) tipc_disc_delete(b->disc); RCU_INIT_POINTER(tn->bearer_list[bearer_id], NULL); - kfree_rcu(b, rcu); + tipc_bearer_put(b); tipc_mon_delete(net, bearer_id); } diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h index ea0f3c49cbed..faca696d422f 100644 --- a/net/tipc/bearer.h +++ b/net/tipc/bearer.h @@ -165,6 +165,7 @@ struct tipc_bearer { struct tipc_discoverer *disc; char net_plane; unsigned long up; + refcount_t refcnt; }; struct tipc_bearer_names { @@ -210,6 +211,8 @@ int tipc_media_set_window(const char *name, u32 new_value); int tipc_media_addr_printf(char *buf, int len, struct tipc_media_addr *a); int tipc_enable_l2_media(struct net *net, struct tipc_bearer *b, struct nlattr *attrs[]); +bool tipc_bearer_hold(struct tipc_bearer *b); +void tipc_bearer_put(struct tipc_bearer *b); void tipc_disable_l2_media(struct tipc_bearer *b); int tipc_l2_send_msg(struct net *net, struct sk_buff *buf, struct tipc_bearer *b, struct tipc_media_addr *dest); -- 2.13.7 |
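To make the hold/put semantics introduced above concrete, here is a small userspace model: the bearer can only be grabbed while its count is non-zero, and the final put frees it. It deliberately uses a plain int instead of the kernel's refcount_t and RCU machinery, so it illustrates the lifetime rule only.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct bearer { int refcnt; };

/* models tipc_bearer_hold(): succeeds only while the bearer is still alive */
static bool bearer_hold(struct bearer *b)
{
        if (!b || b->refcnt == 0)
                return false;
        b->refcnt++;
        return true;
}

/* models tipc_bearer_put(): the last reference frees the bearer */
static void bearer_put(struct bearer *b)
{
        if (b && --b->refcnt == 0) {
                printf("last reference dropped, bearer freed\n");
                free(b);
        }
}

int main(void)
{
        struct bearer *b = calloc(1, sizeof(*b));

        if (!b)
                return 1;
        b->refcnt = 1;      /* initial ref, as set in tipc_enable_bearer() */
        bearer_hold(b);     /* async crypto user takes an extra reference */
        bearer_put(b);      /* bearer_disable() drops the initial reference */
        bearer_put(b);      /* async completion: last put actually frees it */
        return 0;
}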
From: Hoang Le <hoa...@de...> - 2019-11-05 10:01:40
|
We add the support to remove a specific node down with 128bit node identifier, as an alternative to legacy 32-bit node address. v2: improve usage for 'tipc peer remove' command Signed-off-by: Hoang Le <hoa...@de...> --- tipc/peer.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/tipc/peer.c b/tipc/peer.c index f6380777033d..f14ec35e6f71 100644 --- a/tipc/peer.c +++ b/tipc/peer.c @@ -59,17 +59,68 @@ static int cmd_peer_rm_addr(struct nlmsghdr *nlh, const struct cmd *cmd, return msg_doit(nlh, NULL, NULL); } +static int cmd_peer_rm_nodeid(struct nlmsghdr *nlh, const struct cmd *cmd, + struct cmdl *cmdl, void *data) +{ + char buf[MNL_SOCKET_BUFFER_SIZE]; + __u8 id[16] = {0,}; + __u64 *w0 = (__u64 *)&id[0]; + __u64 *w1 = (__u64 *)&id[8]; + struct nlattr *nest; + char *str; + + if (cmdl->argc != cmdl->optind + 1) { + fprintf(stderr, "Usage: %s peer remove identity NODEID\n", + cmdl->argv[0]); + return -EINVAL; + } + + str = shift_cmdl(cmdl); + if (str2nodeid(str, id)) { + fprintf(stderr, "Invalid node identity\n"); + return -EINVAL; + } + + nlh = msg_init(buf, TIPC_NL_PEER_REMOVE); + if (!nlh) { + fprintf(stderr, "error, message initialisation failed\n"); + return -1; + } + + nest = mnl_attr_nest_start(nlh, TIPC_NLA_NET); + mnl_attr_put_u64(nlh, TIPC_NLA_NET_NODEID, *w0); + mnl_attr_put_u64(nlh, TIPC_NLA_NET_NODEID_W1, *w1); + mnl_attr_nest_end(nlh, nest); + + return msg_doit(nlh, NULL, NULL); +} + static void cmd_peer_rm_help(struct cmdl *cmdl) +{ + fprintf(stderr, "Usage: %s peer remove PROPERTY\n\n" + "PROPERTIES\n" + " identity NODEID - Remove peer node identity\n", + cmdl->argv[0]); +} + +static void cmd_peer_rm_addr_help(struct cmdl *cmdl) { fprintf(stderr, "Usage: %s peer remove address ADDRESS\n", cmdl->argv[0]); } +static void cmd_peer_rm_nodeid_help(struct cmdl *cmdl) +{ + fprintf(stderr, "Usage: %s peer remove identity NODEID\n", + cmdl->argv[0]); +} + static int cmd_peer_rm(struct nlmsghdr *nlh, const struct cmd *cmd, struct cmdl *cmdl, void *data) { const struct cmd cmds[] = { - { "address", cmd_peer_rm_addr, cmd_peer_rm_help }, + { "address", cmd_peer_rm_addr, cmd_peer_rm_addr_help }, + { "identity", cmd_peer_rm_nodeid, cmd_peer_rm_nodeid_help }, { NULL } }; -- 2.20.1 |
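With this applied, a down peer can be removed by its 128-bit identity as an alternative to the legacy 32-bit address, e.g.:

        tipc peer remove identity 1001002

The identity value is only an example, borrowed from the node listing in the neighbouring patch; 'tipc peer remove address ADDRESS' remains available for legacy addressing.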
From: Hoang Le <hoa...@de...> - 2019-11-05 10:01:38
|
Example: Node Identity Hash Is container? State 1001002 01001002 no up 1001010 31000101 no up 1001011 31010101 no up 1001012 31020101 no up 1001003 31030001 yes up 1001013 31030101 no up 1001004 31040001 yes up 1001014 31040101 no up 1001015 31050101 no up 1001006 31060001 yes up 1001016 31060101 no up 1001007 31070001 yes up 1001008 31080001 yes up 1001009 31090001 yes up 100100a 31510001 yes up 100100b 31520001 yes up 100100c 31530001 yes up 100100d 31540001 no up 100100e 31550001 no up 100100f 31560001 no up Signed-off-by: Hoang Le <hoa...@de...> --- include/uapi/linux/tipc_netlink.h | 1 + tipc/node.c | 7 ++++++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/tipc_netlink.h b/include/uapi/linux/tipc_netlink.h index efb958fd167d..1a071268bf5d 100644 --- a/include/uapi/linux/tipc_netlink.h +++ b/include/uapi/linux/tipc_netlink.h @@ -160,6 +160,7 @@ enum { TIPC_NLA_NODE_UNSPEC, TIPC_NLA_NODE_ADDR, /* u32 */ TIPC_NLA_NODE_UP, /* flag */ + TIPC_NLA_NODE_LOCAL, /* flag */ __TIPC_NLA_NODE_MAX, TIPC_NLA_NODE_MAX = __TIPC_NLA_NODE_MAX - 1 diff --git a/tipc/node.c b/tipc/node.c index 2fec6753c974..b4203af014d3 100644 --- a/tipc/node.c +++ b/tipc/node.c @@ -42,6 +42,11 @@ static int node_list_cb(const struct nlmsghdr *nlh, void *data) addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]); hash2nodestr(addr, str); printf("%-32s %08x ", str, addr); + if (attrs[TIPC_NLA_NODE_LOCAL]) + printf("%-12s ", "yes"); + else + printf("%-12s ", "no"); + if (attrs[TIPC_NLA_NODE_UP]) printf("up\n"); else @@ -63,7 +68,7 @@ static int cmd_node_list(struct nlmsghdr *nlh, const struct cmd *cmd, fprintf(stderr, "error, message initialisation failed\n"); return -1; } - printf("Node Identity Hash State\n"); + printf("Node Identity Hash Is container? State\n"); return msg_dumpit(nlh, node_list_cb, NULL); } -- 2.20.1 |
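For reference, the listing above is what 'tipc node list' prints once this patch is applied (cmd_node_list in tipc/node.c is the handler being changed); the new "Is container?" column is driven by the presence of the TIPC_NLA_NODE_LOCAL flag added on the kernel side.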
From: Tuong L. T. <tuo...@de...> - 2019-11-05 08:45:40
|
Hi Ying, Yes, that makes sense. I have updated the patches with your suggestion and am going to send them now. As for patch #2, I guess you're referring to creating a preliminary node. Let's consider it as separated from the tipc crypto as well. In fact, we had a boolean flag to control it, also in addition to the new kernel option, if the tipc crypto is disabled, it will be no meaningful at all but just fallback to the original one, so that the node must work exactly the same as usual (i.e. preliminary = false)... I think we can keep it without the "ifdef else" instructions, otherwise we might need to duplicate the code for two cases. Do you agree? BR/Tuong -----Original Message----- From: Ying Xue <yin...@wi...> Sent: Tuesday, November 5, 2019 3:02 PM To: Jon Maloy <jon...@er...>; Tuong Tong Lien <tuo...@de...>; tip...@li...; ma...@do... Subject: Re: [PATCH RFC 0/5] TIPC encryption On 11/5/19 5:33 AM, Jon Maloy wrote: > Tuong, Ying > I am ok with a kernel option, as long as it is enabled by default. I can imagine smaller embedded systems where the deployer want a small module, and encryption anyway is managed differently, or not at all. > > ///jon When I gave the suggestion to add a new kernel option, I also ever figured out how to reach this goal based on this series of patches. In my opinion, we don't need to modify the patches of making some preparation things. For example, we don't need to change patch #1 at all. Patch #3, #4 and #5 are almost completely separate with current code. The only thing is that we need to consider how to modify patch #2. In sum, it looks like it's not very difficult to introduce the new kernel configuration option. The kernel option not only can help us keep TIPC module size smaller, but also can help us maintain it easily particularly with TIPC becoming more and more complex. Lastly, probably it can help to make it possible that the series of patches can be easily accepted by upstream if user can disable the feature with a kernel option. |
From: Ying X. <yin...@wi...> - 2019-11-05 08:03:24
|
On 11/5/19 5:33 AM, Jon Maloy wrote: > Tuong, Ying > I am ok with a kernel option, as long as it is enabled by default. I can imagine smaller embedded systems where the deployer want a small module, and encryption anyway is managed differently, or not at all. > > ///jon When I suggested adding a new kernel option, I had also worked out how to reach this goal based on this series of patches. In my opinion, we don't need to modify the patches that only do preparation work. For example, we don't need to change patch #1 at all. Patches #3, #4 and #5 are almost completely separate from the current code. The only thing we need to consider is how to modify patch #2. In sum, it looks like it's not very difficult to introduce the new kernel configuration option. The kernel option not only helps us keep the TIPC module size smaller, but also helps us maintain the code more easily, particularly with TIPC becoming more and more complex. Lastly, it can probably make it easier for the series to be accepted upstream if users can disable the feature with a kernel option. |
From: Jon M. <jon...@er...> - 2019-11-04 21:33:55
|
Tuong, Ying I am ok with a kernel option, as long as it is enabled by default. I can imagine smaller embedded systems where the deployer want a small module, and encryption anyway is managed differently, or not at all. ///jon > -----Original Message----- > From: Tuong Lien Tong <tuo...@de...> > Sent: 4-Nov-19 06:30 > To: 'Xue, Ying' <Yin...@wi...>; tip...@li...; Jon Maloy > <jon...@er...>; ma...@do... > Subject: RE: [PATCH RFC 0/5] TIPC encryption > > Hi Ying, > > Thanks a lot for reviewing the series! > Your idea of a new kernel option is fine, but I'm not sure what its goal is. The new code is already "disabled" > by default unless there's a key set by user, so it's generally still under user's control... The advantage I can > see is the module's size but it is not that much (compared to the whole kernel). On the other hand, we will > need to custom the kernel to get the feature on and some additional code for the "ifdef...else..." > instructions. Do we really need the option? > > @Jon: What is your opinion about this? > > BR/Tuong > > -----Original Message----- > From: Xue, Ying <Yin...@wi...> > Sent: Friday, November 1, 2019 9:20 PM > To: Tuong Lien <tuo...@de...>; tip...@li...; > jon...@er...; ma...@do... > Subject: RE: [PATCH RFC 0/5] TIPC encryption > > Good job. > > This is a big and complex feature. Particularly for most of users who might not consider to use this feature, > please consider to give them a choice to completely disable it by adding a new kernel option like > TIPC_CRYPTO. > > Thanks, > Ying > > -----Original Message----- > From: Tuong Lien [mailto:tuo...@de...] > Sent: Monday, October 14, 2019 7:07 PM > To: tip...@li...; jon...@er...; ma...@do...; Xue, Ying > Subject: [PATCH RFC 0/5] TIPC encryption > > This series provides TIPC encryption feature, kernel part. There will be > another one in the 'iproute2/tipc' for user space to set key. > > Tuong Lien (5): > tipc: add reference counter to bearer > tipc: enable creating a "preliminary" node > tipc: add new AEAD key structure for user API > tipc: introduce TIPC encryption & authentication > tipc: add support for AEAD key setting via netlink > > include/uapi/linux/tipc.h | 21 + > include/uapi/linux/tipc_netlink.h | 4 + > net/tipc/Makefile | 2 +- > net/tipc/bcast.c | 2 +- > net/tipc/bearer.c | 52 +- > net/tipc/bearer.h | 6 +- > net/tipc/core.c | 10 + > net/tipc/core.h | 4 + > net/tipc/crypto.c | 1986 +++++++++++++++++++++++++++++++++++++ > net/tipc/crypto.h | 166 ++++ > net/tipc/link.c | 16 +- > net/tipc/link.h | 1 + > net/tipc/msg.c | 24 +- > net/tipc/msg.h | 44 +- > net/tipc/netlink.c | 16 +- > net/tipc/node.c | 314 +++++- > net/tipc/node.h | 10 + > net/tipc/sysctl.c | 9 + > net/tipc/udp_media.c | 1 + > 19 files changed, 2604 insertions(+), 84 deletions(-) > create mode 100644 net/tipc/crypto.c > create mode 100644 net/tipc/crypto.h > > -- > 2.13.7 > |
From: Tuong L. T. <tuo...@de...> - 2019-11-04 11:29:52
|
Hi Ying, Thanks a lot for reviewing the series! Your idea of a new kernel option is fine, but I'm not sure what its goal is. The new code is already "disabled" by default unless there's a key set by user, so it's generally still under user's control... The advantage I can see is the module's size but it is not that much (compared to the whole kernel). On the other hand, we will need to custom the kernel to get the feature on and some additional code for the "ifdef...else..." instructions. Do we really need the option? @Jon: What is your opinion about this? BR/Tuong -----Original Message----- From: Xue, Ying <Yin...@wi...> Sent: Friday, November 1, 2019 9:20 PM To: Tuong Lien <tuo...@de...>; tip...@li...; jon...@er...; ma...@do... Subject: RE: [PATCH RFC 0/5] TIPC encryption Good job. This is a big and complex feature. Particularly for most of users who might not consider to use this feature, please consider to give them a choice to completely disable it by adding a new kernel option like TIPC_CRYPTO. Thanks, Ying -----Original Message----- From: Tuong Lien [mailto:tuo...@de...] Sent: Monday, October 14, 2019 7:07 PM To: tip...@li...; jon...@er...; ma...@do...; Xue, Ying Subject: [PATCH RFC 0/5] TIPC encryption This series provides TIPC encryption feature, kernel part. There will be another one in the 'iproute2/tipc' for user space to set key. Tuong Lien (5): tipc: add reference counter to bearer tipc: enable creating a "preliminary" node tipc: add new AEAD key structure for user API tipc: introduce TIPC encryption & authentication tipc: add support for AEAD key setting via netlink include/uapi/linux/tipc.h | 21 + include/uapi/linux/tipc_netlink.h | 4 + net/tipc/Makefile | 2 +- net/tipc/bcast.c | 2 +- net/tipc/bearer.c | 52 +- net/tipc/bearer.h | 6 +- net/tipc/core.c | 10 + net/tipc/core.h | 4 + net/tipc/crypto.c | 1986 +++++++++++++++++++++++++++++++++++++ net/tipc/crypto.h | 166 ++++ net/tipc/link.c | 16 +- net/tipc/link.h | 1 + net/tipc/msg.c | 24 +- net/tipc/msg.h | 44 +- net/tipc/netlink.c | 16 +- net/tipc/node.c | 314 +++++- net/tipc/node.h | 10 + net/tipc/sysctl.c | 9 + net/tipc/udp_media.c | 1 + 19 files changed, 2604 insertions(+), 84 deletions(-) create mode 100644 net/tipc/crypto.c create mode 100644 net/tipc/crypto.h -- 2.13.7 |
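For context on the "ifdef...else..." wrapping discussed in this thread: it is usually handled with static inline stubs in a header rather than duplicated code paths. Below is a sketch of that pattern, assuming the option ends up named CONFIG_TIPC_CRYPTO as suggested; the two prototypes match tipc_crypto_start()/tipc_crypto_stop() from the RFC, while the stub bodies are illustrative only and not taken from the v2 series.

struct tipc_crypto;
struct net;
struct tipc_node;

#ifdef CONFIG_TIPC_CRYPTO
int tipc_crypto_start(struct tipc_crypto **crypto, struct net *net,
                      struct tipc_node *node);
void tipc_crypto_stop(struct tipc_crypto **crypto);
#else
static inline int tipc_crypto_start(struct tipc_crypto **crypto,
                                    struct net *net, struct tipc_node *node)
{
        return 0;       /* crypto compiled out: behave as if no key is set */
}

static inline void tipc_crypto_stop(struct tipc_crypto **crypto) {}
#endif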
From: David M. <da...@da...> - 2019-11-04 01:27:48
|
From: Tuong Lien <tuo...@de...> Date: Fri, 1 Nov 2019 09:58:57 +0700 > As mentioned in commit e95584a889e1 ("tipc: fix unlimited bundling of > small messages"), the current message bundling algorithm is inefficient > that can generate bundles of only one payload message, that causes > unnecessary overheads for both the sender and receiver. > > This commit re-designs the 'tipc_msg_make_bundle()' function (now named > as 'tipc_msg_try_bundle()'), so that when a message comes at the first > place, we will just check & keep a reference to it if the message is > suitable for bundling. The message buffer will be put into the link > backlog queue and processed as normal. Later on, when another one comes > we will make a bundle with the first message if possible and so on... > This way, a bundle if really needed will always consist of at least two > payload messages. Otherwise, we let the first buffer go its way without > any need of bundling, so reduce the overheads to zero. > > Moreover, since now we have both the messages in hand, we can even > optimize the 'tipc_msg_bundle()' function, make bundle of a very large > (size ~ MSS) and small messages which is not with the current algorithm > e.g. [1400-byte message] + [10-byte message] (MTU = 1500). > > Acked-by: Ying Xue <yin...@wi...> > Acked-by: Jon Maloy <jon...@er...> > Signed-off-by: Tuong Lien <tuo...@de...> Applied. |
From: Xue, Y. <Yin...@wi...> - 2019-11-01 19:00:18
|
Good job. This is a big and complex feature. Particularly for most of users who might not consider to use this feature, please consider to give them a choice to completely disable it by adding a new kernel option like TIPC_CRYPTO. Thanks, Ying -----Original Message----- From: Tuong Lien [mailto:tuo...@de...] Sent: Monday, October 14, 2019 7:07 PM To: tip...@li...; jon...@er...; ma...@do...; Xue, Ying Subject: [PATCH RFC 0/5] TIPC encryption This series provides TIPC encryption feature, kernel part. There will be another one in the 'iproute2/tipc' for user space to set key. Tuong Lien (5): tipc: add reference counter to bearer tipc: enable creating a "preliminary" node tipc: add new AEAD key structure for user API tipc: introduce TIPC encryption & authentication tipc: add support for AEAD key setting via netlink include/uapi/linux/tipc.h | 21 + include/uapi/linux/tipc_netlink.h | 4 + net/tipc/Makefile | 2 +- net/tipc/bcast.c | 2 +- net/tipc/bearer.c | 52 +- net/tipc/bearer.h | 6 +- net/tipc/core.c | 10 + net/tipc/core.h | 4 + net/tipc/crypto.c | 1986 +++++++++++++++++++++++++++++++++++++ net/tipc/crypto.h | 166 ++++ net/tipc/link.c | 16 +- net/tipc/link.h | 1 + net/tipc/msg.c | 24 +- net/tipc/msg.h | 44 +- net/tipc/netlink.c | 16 +- net/tipc/node.c | 314 +++++- net/tipc/node.h | 10 + net/tipc/sysctl.c | 9 + net/tipc/udp_media.c | 1 + 19 files changed, 2604 insertions(+), 84 deletions(-) create mode 100644 net/tipc/crypto.c create mode 100644 net/tipc/crypto.h -- 2.13.7 |
From: Tuong L. <tuo...@de...> - 2019-11-01 02:59:24
|
As mentioned in commit e95584a889e1 ("tipc: fix unlimited bundling of small messages"), the current message bundling algorithm is inefficient that can generate bundles of only one payload message, that causes unnecessary overheads for both the sender and receiver. This commit re-designs the 'tipc_msg_make_bundle()' function (now named as 'tipc_msg_try_bundle()'), so that when a message comes at the first place, we will just check & keep a reference to it if the message is suitable for bundling. The message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes we will make a bundle with the first message if possible and so on... This way, a bundle if really needed will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need of bundling, so reduce the overheads to zero. Moreover, since now we have both the messages in hand, we can even optimize the 'tipc_msg_bundle()' function, make bundle of a very large (size ~ MSS) and small messages which is not with the current algorithm e.g. [1400-byte message] + [10-byte message] (MTU = 1500). Acked-by: Ying Xue <yin...@wi...> Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 59 +++++++++++----------- net/tipc/msg.c | 153 +++++++++++++++++++++++++++++--------------------------- net/tipc/msg.h | 5 +- 3 files changed, 113 insertions(+), 104 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 7d7a66178607..038861bad72b 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -940,16 +940,17 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, struct sk_buff_head *xmitq) { struct tipc_msg *hdr = buf_msg(skb_peek(list)); - unsigned int maxwin = l->window; - int imp = msg_importance(hdr); - unsigned int mtu = l->mtu; + struct sk_buff_head *backlogq = &l->backlogq; + struct sk_buff_head *transmq = &l->transmq; + struct sk_buff *skb, *_skb; + u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; u16 ack = l->rcv_nxt - 1; u16 seqno = l->snd_nxt; - u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; - struct sk_buff_head *transmq = &l->transmq; - struct sk_buff_head *backlogq = &l->backlogq; - struct sk_buff *skb, *_skb, **tskb; int pkt_cnt = skb_queue_len(list); + int imp = msg_importance(hdr); + unsigned int maxwin = l->window; + unsigned int mtu = l->mtu; + bool new_bundle; int rc = 0; if (unlikely(msg_size(hdr) > mtu)) { @@ -975,20 +976,18 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, } /* Prepare each packet for sending, and add to relevant queue: */ - while (skb_queue_len(list)) { - skb = skb_peek(list); - hdr = buf_msg(skb); - msg_set_seqno(hdr, seqno); - msg_set_ack(hdr, ack); - msg_set_bcast_ack(hdr, bc_ack); - + while ((skb = __skb_dequeue(list))) { if (likely(skb_queue_len(transmq) < maxwin)) { + hdr = buf_msg(skb); + msg_set_seqno(hdr, seqno); + msg_set_ack(hdr, ack); + msg_set_bcast_ack(hdr, bc_ack); _skb = skb_clone(skb, GFP_ATOMIC); if (!_skb) { + kfree_skb(skb); __skb_queue_purge(list); return -ENOBUFS; } - __skb_dequeue(list); __skb_queue_tail(transmq, skb); /* next retransmit attempt */ if (link_is_bc_sndlink(l)) @@ -1000,22 +999,26 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, seqno++; continue; } - tskb = &l->backlog[imp].target_bskb; - if (tipc_msg_bundle(*tskb, hdr, mtu)) { - kfree_skb(__skb_dequeue(list)); - l->stats.sent_bundled++; - continue; - } - if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { - kfree_skb(__skb_dequeue(list)); 
- __skb_queue_tail(backlogq, *tskb); - l->backlog[imp].len++; - l->stats.sent_bundled++; - l->stats.sent_bundles++; + if (tipc_msg_try_bundle(l->backlog[imp].target_bskb, &skb, + mtu - INT_H_SIZE, l->addr, + &new_bundle)) { + if (skb) { + /* Keep a ref. to the skb for next try */ + l->backlog[imp].target_bskb = skb; + l->backlog[imp].len++; + __skb_queue_tail(backlogq, skb); + } else { + if (new_bundle) { + l->stats.sent_bundles++; + l->stats.sent_bundled++; + } + l->stats.sent_bundled++; + } continue; } l->backlog[imp].target_bskb = NULL; - l->backlog[imp].len += skb_queue_len(list); + l->backlog[imp].len += (1 + skb_queue_len(list)); + __skb_queue_tail(backlogq, skb); skb_queue_splice_tail_init(list, backlogq); } l->snd_nxt = seqno; diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 973795a1a968..acb7be592fb1 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -472,48 +472,98 @@ int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset, } /** - * tipc_msg_bundle(): Append contents of a buffer to tail of an existing one - * @skb: the buffer to append to ("bundle") - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer - * Consumes buffer if successful - * Returns true if bundling could be performed, otherwise false + * tipc_msg_bundle - Append contents of a buffer to tail of an existing one + * @bskb: the bundle buffer to append to + * @msg: message to be appended + * @max: max allowable size for the bundle buffer + * + * Returns "true" if bundling has been performed, otherwise "false" */ -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu) +static bool tipc_msg_bundle(struct sk_buff *bskb, struct tipc_msg *msg, + u32 max) { - struct tipc_msg *bmsg; - unsigned int bsz; - unsigned int msz = msg_size(msg); - u32 start, pad; - u32 max = mtu - INT_H_SIZE; + struct tipc_msg *bmsg = buf_msg(bskb); + u32 msz, bsz, offset, pad; - if (likely(msg_user(msg) == MSG_FRAGMENTER)) - return false; - if (!skb) - return false; - bmsg = buf_msg(skb); + msz = msg_size(msg); bsz = msg_size(bmsg); - start = align(bsz); - pad = start - bsz; + offset = align(bsz); + pad = offset - bsz; - if (unlikely(msg_user(msg) == TUNNEL_PROTOCOL)) + if (unlikely(skb_tailroom(bskb) < (pad + msz))) return false; - if (unlikely(msg_user(msg) == BCAST_PROTOCOL)) + if (unlikely(max < (offset + msz))) return false; - if (unlikely(msg_user(bmsg) != MSG_BUNDLER)) + + skb_put(bskb, pad + msz); + skb_copy_to_linear_data_offset(bskb, offset, msg, msz); + msg_set_size(bmsg, offset + msz); + msg_set_msgcnt(bmsg, msg_msgcnt(bmsg) + 1); + return true; +} + +/** + * tipc_msg_try_bundle - Try to bundle a new message to the last one + * @tskb: the last/target message to which the new one will be appended + * @skb: the new message skb pointer + * @mss: max message size (header inclusive) + * @dnode: destination node for the message + * @new_bundle: if this call made a new bundle or not + * + * Return: "true" if the new message skb is potential for bundling this time or + * later, in the case a bundling has been done this time, the skb is consumed + * (the skb pointer = NULL). + * Otherwise, "false" if the skb cannot be bundled at all. 
+ */ +bool tipc_msg_try_bundle(struct sk_buff *tskb, struct sk_buff **skb, u32 mss, + u32 dnode, bool *new_bundle) +{ + struct tipc_msg *msg, *inner, *outer; + u32 tsz; + + /* First, check if the new buffer is suitable for bundling */ + msg = buf_msg(*skb); + if (msg_user(msg) == MSG_FRAGMENTER) return false; - if (unlikely(skb_tailroom(skb) < (pad + msz))) + if (msg_user(msg) == TUNNEL_PROTOCOL) return false; - if (unlikely(max < (start + msz))) + if (msg_user(msg) == BCAST_PROTOCOL) return false; - if ((msg_importance(msg) < TIPC_SYSTEM_IMPORTANCE) && - (msg_importance(bmsg) == TIPC_SYSTEM_IMPORTANCE)) + if (mss <= INT_H_SIZE + msg_size(msg)) return false; - skb_put(skb, pad + msz); - skb_copy_to_linear_data_offset(skb, start, msg, msz); - msg_set_size(bmsg, start + msz); - msg_set_msgcnt(bmsg, msg_msgcnt(bmsg) + 1); + /* Ok, but the last/target buffer can be empty? */ + if (unlikely(!tskb)) + return true; + + /* Is it a bundle already? Try to bundle the new message to it */ + if (msg_user(buf_msg(tskb)) == MSG_BUNDLER) { + *new_bundle = false; + goto bundle; + } + + /* Make a new bundle of the two messages if possible */ + tsz = msg_size(buf_msg(tskb)); + if (unlikely(mss < align(INT_H_SIZE + tsz) + msg_size(msg))) + return true; + if (unlikely(pskb_expand_head(tskb, INT_H_SIZE, mss - tsz - INT_H_SIZE, + GFP_ATOMIC))) + return true; + inner = buf_msg(tskb); + skb_push(tskb, INT_H_SIZE); + outer = buf_msg(tskb); + tipc_msg_init(msg_prevnode(inner), outer, MSG_BUNDLER, 0, INT_H_SIZE, + dnode); + msg_set_importance(outer, msg_importance(inner)); + msg_set_size(outer, INT_H_SIZE + tsz); + msg_set_msgcnt(outer, 1); + *new_bundle = true; + +bundle: + if (likely(tipc_msg_bundle(tskb, msg, mss))) { + consume_skb(*skb); + *skb = NULL; + } return true; } @@ -563,49 +613,6 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos) } /** - * tipc_msg_make_bundle(): Create bundle buf and append message to its tail - * @list: the buffer chain, where head is the buffer to replace/append - * @skb: buffer to be created, appended to and returned in case of success - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer, inclusive header - * @dnode: destination node for message. 
(Not always present in header) - * Returns true if success, otherwise false - */ -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode) -{ - struct sk_buff *_skb; - struct tipc_msg *bmsg; - u32 msz = msg_size(msg); - u32 max = mtu - INT_H_SIZE; - - if (msg_user(msg) == MSG_FRAGMENTER) - return false; - if (msg_user(msg) == TUNNEL_PROTOCOL) - return false; - if (msg_user(msg) == BCAST_PROTOCOL) - return false; - if (msz > (max / 2)) - return false; - - _skb = tipc_buf_acquire(max, GFP_ATOMIC); - if (!_skb) - return false; - - skb_trim(_skb, INT_H_SIZE); - bmsg = buf_msg(_skb); - tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, - INT_H_SIZE, dnode); - msg_set_importance(bmsg, msg_importance(msg)); - msg_set_seqno(bmsg, msg_seqno(msg)); - msg_set_ack(bmsg, msg_ack(msg)); - msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); - tipc_msg_bundle(_skb, msg, mtu); - *skb = _skb; - return true; -} - -/** * tipc_msg_reverse(): swap source and destination addresses and add error code * @own_node: originating node id for reversed message * @skb: buffer containing message to be reversed; will be consumed diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0435dda4b90c..14697e6c995e 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1081,9 +1081,8 @@ struct sk_buff *tipc_msg_create(uint user, uint type, uint hdr_sz, uint data_sz, u32 dnode, u32 onode, u32 dport, u32 oport, int errcode); int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf); -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu); -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode); +bool tipc_msg_try_bundle(struct sk_buff *tskb, struct sk_buff **skb, u32 mss, + u32 dnode, bool *new_bundle); bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos); int tipc_msg_fragment(struct sk_buff *skb, const struct tipc_msg *hdr, int pktmax, struct sk_buff_head *frags); -- 2.13.7 |
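The size test at the heart of tipc_msg_try_bundle() above can be checked in isolation. The small program below reproduces that arithmetic for the "1400-byte + 10-byte at MTU 1500" example from the commit message; the 40-byte INT_H_SIZE and the 4-byte align() are assumptions taken from TIPC's msg.h/msg.c rather than from this mail.

#include <stdbool.h>
#include <stdio.h>

#define INT_H_SIZE 40                           /* assumed TIPC internal header size */

static unsigned int align4(unsigned int i)      /* models align() in msg.c */
{
        return (i + 3) & ~3u;
}

/* mirrors the check "mss < align(INT_H_SIZE + tsz) + msg_size(msg)":
 * bundling works if the first message (aligned, behind the outer bundle
 * header) plus the new message still fits within the link MSS.
 */
static bool can_bundle(unsigned int first_sz, unsigned int new_sz,
                       unsigned int mss)
{
        return align4(INT_H_SIZE + first_sz) + new_sz <= mss;
}

int main(void)
{
        unsigned int mss = 1500 - INT_H_SIZE;   /* mtu - INT_H_SIZE, as in link.c */

        printf("1400B + 10B at MTU 1500: %s\n",
               can_bundle(1400, 10, mss) ? "can be bundled" : "cannot be bundled");
        return 0;
}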
From: Xue, Y. <Yin...@wi...> - 2019-10-31 12:51:48
|
Sorry for late response to this commit. Great change! Acked-by: Ying Xue <yin...@wi...> -----Original Message----- From: Tuong Lien [mailto:tuo...@de...] Sent: Tuesday, October 15, 2019 12:59 PM To: tip...@li...; jon...@er...; ma...@do...; Xue, Ying Subject: [net-next] tipc: improve message bundling algorithm As mentioned in commit e95584a889e1 ("tipc: fix unlimited bundling of small messages"), the current message bundling algorithm is inefficient that can generate bundles of only one payload message, that causes unnecessary overheads for both the sender and receiver. This commit re-designs the 'tipc_msg_make_bundle()' function (now named as 'tipc_msg_try_bundle()'), so that when a message comes at the first place, we will just check & keep a reference to it if the message is suitable for bundling. The message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes we will make a bundle with the first message if possible and so on... This way, a bundle if really needed will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need of bundling, so reduce the overheads to zero. Moreover, since now we have both the messages in hand, we can even optimize the 'tipc_msg_bundle()' function, make bundle of a very large (size ~ MSS) and small messages which is not with the current algorithm e.g. [1400-byte message] + [10-byte message] (MTU = 1500). Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/link.c | 60 +++++++++++----------- net/tipc/msg.c | 153 +++++++++++++++++++++++++++++--------------------------- net/tipc/msg.h | 5 +- 3 files changed, 114 insertions(+), 104 deletions(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 999eab592de8..3bd60bdbf56c 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -940,16 +940,17 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, struct sk_buff_head *xmitq) { struct tipc_msg *hdr = buf_msg(skb_peek(list)); - unsigned int maxwin = l->window; - int imp = msg_importance(hdr); - unsigned int mtu = l->mtu; + struct sk_buff_head *backlogq = &l->backlogq; + struct sk_buff_head *transmq = &l->transmq; + struct sk_buff *skb, *_skb; + u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; u16 ack = l->rcv_nxt - 1; u16 seqno = l->snd_nxt; - u16 bc_ack = l->bc_rcvlink->rcv_nxt - 1; - struct sk_buff_head *transmq = &l->transmq; - struct sk_buff_head *backlogq = &l->backlogq; - struct sk_buff *skb, *_skb, **tskb; int pkt_cnt = skb_queue_len(list); + int imp = msg_importance(hdr); + unsigned int maxwin = l->window; + unsigned int mtu = l->mtu; + bool new_bundle; int rc = 0; if (unlikely(msg_size(hdr) > mtu)) { @@ -975,20 +976,18 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, } /* Prepare each packet for sending, and add to relevant queue: */ - while (skb_queue_len(list)) { - skb = skb_peek(list); - hdr = buf_msg(skb); - msg_set_seqno(hdr, seqno); - msg_set_ack(hdr, ack); - msg_set_bcast_ack(hdr, bc_ack); - + while ((skb = __skb_dequeue(list))) { if (likely(skb_queue_len(transmq) < maxwin)) { + hdr = buf_msg(skb); + msg_set_seqno(hdr, seqno); + msg_set_ack(hdr, ack); + msg_set_bcast_ack(hdr, bc_ack); _skb = skb_clone(skb, GFP_ATOMIC); if (!_skb) { + kfree_skb(skb); __skb_queue_purge(list); return -ENOBUFS; } - __skb_dequeue(list); __skb_queue_tail(transmq, skb); /* next retransmit attempt */ if (link_is_bc_sndlink(l)) @@ -1000,22 +999,27 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list, seqno++; continue; } - 
tskb = &l->backlog[imp].target_bskb; - if (tipc_msg_bundle(*tskb, hdr, mtu)) { - kfree_skb(__skb_dequeue(list)); - l->stats.sent_bundled++; - continue; - } - if (tipc_msg_make_bundle(tskb, hdr, mtu, l->addr)) { - kfree_skb(__skb_dequeue(list)); - __skb_queue_tail(backlogq, *tskb); - l->backlog[imp].len++; - l->stats.sent_bundled++; - l->stats.sent_bundles++; + if (tipc_msg_try_bundle(l->backlog[imp].target_bskb, &skb, + mtu - INT_H_SIZE, l->addr, + &new_bundle)) { + if (skb) { + /* Keep a ref. to the skb for next try */ + l->backlog[imp].target_bskb = skb; + l->backlog[imp].len++; + __skb_queue_tail(backlogq, skb); + } else { + if (new_bundle) { + l->stats.sent_bundles++; + l->stats.sent_bundled++; + } + l->stats.sent_bundled++; + } continue; } + /* Let's the last skb (if any) go its way! */ l->backlog[imp].target_bskb = NULL; - l->backlog[imp].len += skb_queue_len(list); + l->backlog[imp].len += (1 + skb_queue_len(list)); + __skb_queue_tail(backlogq, skb); skb_queue_splice_tail_init(list, backlogq); } l->snd_nxt = seqno; diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 922d262e153f..a2f3582cf9fe 100644 --- a/net/tipc/msg.c +++ b/net/tipc/msg.c @@ -419,48 +419,98 @@ int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, int offset, } /** - * tipc_msg_bundle(): Append contents of a buffer to tail of an existing one - * @skb: the buffer to append to ("bundle") - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer - * Consumes buffer if successful - * Returns true if bundling could be performed, otherwise false + * tipc_msg_bundle - Append contents of a buffer to tail of an existing one + * @bskb: the bundle buffer to append to + * @msg: message to be appended + * @max: max allowable size for the bundle buffer + * + * Returns "true" if bundling has been performed, otherwise "false" */ -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu) +static bool tipc_msg_bundle(struct sk_buff *bskb, struct tipc_msg *msg, + u32 max) { - struct tipc_msg *bmsg; - unsigned int bsz; - unsigned int msz = msg_size(msg); - u32 start, pad; - u32 max = mtu - INT_H_SIZE; + struct tipc_msg *bmsg = buf_msg(bskb); + u32 msz, bsz, offset, pad; - if (likely(msg_user(msg) == MSG_FRAGMENTER)) - return false; - if (!skb) - return false; - bmsg = buf_msg(skb); + msz = msg_size(msg); bsz = msg_size(bmsg); - start = align(bsz); - pad = start - bsz; + offset = align(bsz); + pad = offset - bsz; - if (unlikely(msg_user(msg) == TUNNEL_PROTOCOL)) + if (unlikely(skb_tailroom(bskb) < (pad + msz))) return false; - if (unlikely(msg_user(msg) == BCAST_PROTOCOL)) + if (unlikely(max < (offset + msz))) return false; - if (unlikely(msg_user(bmsg) != MSG_BUNDLER)) + + skb_put(bskb, pad + msz); + skb_copy_to_linear_data_offset(bskb, offset, msg, msz); + msg_set_size(bmsg, offset + msz); + msg_set_msgcnt(bmsg, msg_msgcnt(bmsg) + 1); + return true; +} + +/** + * tipc_msg_try_bundle - Try to bundle a new message to the last one + * @tskb: the last/target message to which the new one will be appended + * @skb: the new message skb pointer + * @mss: max message size (header inclusive) + * @dnode: destination node for the message + * @new_bundle: if this call made a new bundle or not + * + * Return: "true" if the new message skb is potential for bundling this time or + * later, in the case a bundling has been done this time, the skb is consumed + * (the skb pointer = NULL). + * Otherwise, "false" if the skb cannot be bundled at all. 
+ */ +bool tipc_msg_try_bundle(struct sk_buff *tskb, struct sk_buff **skb, u32 mss, + u32 dnode, bool *new_bundle) +{ + struct tipc_msg *msg, *inner, *outer; + u32 bsz; + + /* First, check if the new buffer is suitable for bundling */ + msg = buf_msg(*skb); + if (msg_user(msg) == MSG_FRAGMENTER) return false; - if (unlikely(skb_tailroom(skb) < (pad + msz))) + if (msg_user(msg) == TUNNEL_PROTOCOL) return false; - if (unlikely(max < (start + msz))) + if (msg_user(msg) == BCAST_PROTOCOL) return false; - if ((msg_importance(msg) < TIPC_SYSTEM_IMPORTANCE) && - (msg_importance(bmsg) == TIPC_SYSTEM_IMPORTANCE)) + if (mss <= INT_H_SIZE + msg_size(msg)) return false; - skb_put(skb, pad + msz); - skb_copy_to_linear_data_offset(skb, start, msg, msz); - msg_set_size(bmsg, start + msz); - msg_set_msgcnt(bmsg, msg_msgcnt(bmsg) + 1); + /* Ok, but the last/target buffer can be empty? */ + if (unlikely(!tskb)) + return true; + + /* Is it a bundle already? Try to bundle the new message to it */ + if (msg_user(buf_msg(tskb)) == MSG_BUNDLER) { + *new_bundle = false; + goto bundle; + } + + /* Make a new bundle of the two messages if possible */ + bsz = msg_size(buf_msg(tskb)); + if (unlikely(mss < align(INT_H_SIZE + bsz) + msg_size(msg))) + return true; + if (unlikely(pskb_expand_head(tskb, INT_H_SIZE, mss - bsz, + GFP_ATOMIC))) + return true; + inner = buf_msg(tskb); + skb_push(tskb, INT_H_SIZE); + outer = buf_msg(tskb); + tipc_msg_init(msg_prevnode(inner), outer, MSG_BUNDLER, 0, INT_H_SIZE, + dnode); + msg_set_importance(outer, msg_importance(inner)); + msg_set_size(outer, INT_H_SIZE + bsz); + msg_set_msgcnt(outer, 1); + *new_bundle = true; + +bundle: + if (likely(tipc_msg_bundle(tskb, msg, mss))) { + consume_skb(*skb); + *skb = NULL; + } return true; } @@ -510,49 +560,6 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos) } /** - * tipc_msg_make_bundle(): Create bundle buf and append message to its tail - * @list: the buffer chain, where head is the buffer to replace/append - * @skb: buffer to be created, appended to and returned in case of success - * @msg: message to be appended - * @mtu: max allowable size for the bundle buffer, inclusive header - * @dnode: destination node for message. 
(Not always present in header) - * Returns true if success, otherwise false - */ -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode) -{ - struct sk_buff *_skb; - struct tipc_msg *bmsg; - u32 msz = msg_size(msg); - u32 max = mtu - INT_H_SIZE; - - if (msg_user(msg) == MSG_FRAGMENTER) - return false; - if (msg_user(msg) == TUNNEL_PROTOCOL) - return false; - if (msg_user(msg) == BCAST_PROTOCOL) - return false; - if (msz > (max / 2)) - return false; - - _skb = tipc_buf_acquire(max, GFP_ATOMIC); - if (!_skb) - return false; - - skb_trim(_skb, INT_H_SIZE); - bmsg = buf_msg(_skb); - tipc_msg_init(msg_prevnode(msg), bmsg, MSG_BUNDLER, 0, - INT_H_SIZE, dnode); - msg_set_importance(bmsg, msg_importance(msg)); - msg_set_seqno(bmsg, msg_seqno(msg)); - msg_set_ack(bmsg, msg_ack(msg)); - msg_set_bcast_ack(bmsg, msg_bcast_ack(msg)); - tipc_msg_bundle(_skb, msg, mtu); - *skb = _skb; - return true; -} - -/** * tipc_msg_reverse(): swap source and destination addresses and add error code * @own_node: originating node id for reversed message * @skb: buffer containing message to be reversed; will be consumed diff --git a/net/tipc/msg.h b/net/tipc/msg.h index 0daa6f04ca81..4d4ed4bd058a 100644 --- a/net/tipc/msg.h +++ b/net/tipc/msg.h @@ -1057,9 +1057,8 @@ struct sk_buff *tipc_msg_create(uint user, uint type, uint hdr_sz, uint data_sz, u32 dnode, u32 onode, u32 dport, u32 oport, int errcode); int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf); -bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg *msg, u32 mtu); -bool tipc_msg_make_bundle(struct sk_buff **skb, struct tipc_msg *msg, - u32 mtu, u32 dnode); +bool tipc_msg_try_bundle(struct sk_buff *tskb, struct sk_buff **skb, u32 mss, + u32 dnode, bool *new_bundle); bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos); int tipc_msg_fragment(struct sk_buff *skb, const struct tipc_msg *hdr, int pktmax, struct sk_buff_head *frags); -- 2.13.7 |
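To make the hold-then-bundle idea in the patch above easier to follow outside the kernel sources, here is a simplified user-space model. It only mimics the decision flow (keep the first suitable message as a bundling candidate, build a real bundle once a second one fits); names such as enqueue(), HDR and MSS are illustrative, and the outer bundle header and alignment handling of the real tipc_msg_try_bundle() are deliberately omitted.

#include <stdio.h>

#define MSS 1500   /* max bundle size incl. headers (illustrative) */
#define HDR 24     /* pretend per-message header size */

struct msg {
    int len;       /* bytes this entry occupies in the queue */
    int cnt;       /* number of messages folded into this entry */
};

static struct msg queue[64];   /* stand-in for the link backlog queue */
static int qlen;
static struct msg *pending;    /* last queued entry, kept as bundling candidate */

static void enqueue(int payload)
{
    int need = payload + HDR;

    if (!pending || pending->len + need > MSS) {
        /* nothing suitable to bundle with: queue it and remember it */
        queue[qlen] = (struct msg){ .len = need, .cnt = 1 };
        pending = &queue[qlen++];
        printf("queued %d-byte message on its own\n", payload);
        return;
    }
    /* a second message fits: the pending entry becomes/extends a bundle */
    pending->len += need;
    pending->cnt++;
    printf("bundled %d bytes, entry now %d msgs / %d bytes\n",
           payload, pending->cnt, pending->len);
}

int main(void)
{
    enqueue(1400);   /* kept as candidate, goes out as-is if nothing follows */
    enqueue(10);     /* fits next to the 1400-byte message -> real bundle */
    enqueue(10);     /* grows the same bundle */
    return 0;
}

Run as-is, this reproduces the commit's own example: the 1400-byte message is first queued alone, and the small messages are folded into it instead of forcing a one-message bundle.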
From: David M. <da...@da...> - 2019-10-30 19:17:09
|
From: Jon Maloy <jon...@er...> Date: Wed, 30 Oct 2019 14:00:41 +0100 > We introduce a feature that works like a combination of TCP_NAGLE and > TCP_CORK, but without some of the weaknesses of those. In particular, > we will not observe long delivery delays because of delayed acks, since > the algorithm itself decides if and when acks are to be sent from the > receiving peer. > > - The nagle property as such is determined by manipulating a new > 'maxnagle' field in struct tipc_sock. If certain conditions are met, > 'maxnagle' will define max size of the messages which can be bundled. > If it is set to zero no messages are ever bundled, implying that the > nagle property is disabled. > - A socket with the nagle property enabled enters nagle mode when more > than 4 messages have been sent out without receiving any data message > from the peer. > - A socket leaves nagle mode whenever it receives a data message from > the peer. > > In nagle mode, messages smaller than 'maxnagle' are accumulated in the > socket write queue. The last buffer in the queue is marked with a new > 'ack_required' bit, which forces the receiving peer to send a CONN_ACK > message back to the sender upon reception. > > The accumulated contents of the write queue is transmitted when one of > the following events or conditions occur. > > - A CONN_ACK message is received from the peer. > - A data message is received from the peer. > - A SOCK_WAKEUP pseudo message is received from the link level. > - The write queue contains more than 64 1k blocks of data. > - The connection is being shut down. > - There is no CONN_ACK message to expect. I.e., there is currently > no outstanding message where the 'ack_required' bit was set. As a > consequence, the first message added after we enter nagle mode > is always sent directly with this bit set. > > This new feature gives a 50-100% improvement of throughput for small > (i.e., less than MTU size) messages, while it might add up to one RTT > to latency time when the socket is in nagle mode. > > Acked-by: Ying Xue <yin...@wi...> > Signed-off-by: Jon Maloy <jon...@er...> Applied, thanks Jon. |
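The flush conditions listed in the quoted commit message can be read as one small predicate. The sketch below is only a user-space model of that decision; struct nagle_state, must_flush() and MAX_BACKLOG_BLOCKS are invented names, not fields of the kernel's struct tipc_sock, and the CONN_ACK / peer-data / SOCK_WAKEUP events are assumed to be handled by the caller (by clearing expect_ack or triggering the flush directly) before this check.

#include <stdbool.h>
#include <stdio.h>

#define MAX_BACKLOG_BLOCKS 64   /* ~64 x 1k blocks before a forced flush */

struct nagle_state {
    bool nagle_enabled;   /* maxnagle != 0 */
    bool in_nagle_mode;   /* more than 4 messages sent without peer data */
    bool expect_ack;      /* an 'ack_required' message is outstanding */
    bool shutting_down;
    int  backlog_blocks;  /* 1k blocks accumulated in the write queue */
};

/* Decide whether the accumulated write queue must be transmitted now,
 * mirroring the bullet list in the commit message.
 */
static bool must_flush(const struct nagle_state *s)
{
    if (!s->nagle_enabled || !s->in_nagle_mode)
        return true;                     /* nagle off: send immediately */
    if (s->shutting_down)
        return true;
    if (s->backlog_blocks > MAX_BACKLOG_BLOCKS)
        return true;
    if (!s->expect_ack)
        return true;                     /* no ack outstanding: send now,
                                            with 'ack_required' set */
    return false;                        /* keep accumulating */
}

int main(void)
{
    struct nagle_state s = { .nagle_enabled = true, .in_nagle_mode = true,
                             .expect_ack = true, .backlog_blocks = 3 };

    printf("flush? %d\n", must_flush(&s));  /* 0: keep bundling */
    s.backlog_blocks = 70;
    printf("flush? %d\n", must_flush(&s));  /* 1: backlog limit reached */
    return 0;
}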
From: Jon M. <jon...@er...> - 2019-10-30 13:03:03
|
> -----Original Message----- > From: Tuong Lien Tong <tuo...@de...> > Sent: 30-Oct-19 07:39 > To: Jon Maloy <jon...@er...>; 'Jon Maloy' <ma...@do...> > Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; > par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang > Huu Le <hoa...@de...>; Gordan Mihaljevic <gor...@de...>; > yin...@wi...; tip...@li... > Subject: RE: [net-next v3 1/1] tipc: add smart nagle feature > > Hi Jon, > > Please see below one comment from my side. > > BR/Tuong > > @@ -1437,16 +1492,17 @@ static int __tipc_sendstream(struct socket *sock, > struct msghdr *m, size_t dlen) > struct sock *sk = sock->sk; > DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name); > long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT); > + struct sk_buff_head *txq = &sk->sk_write_queue; > struct tipc_sock *tsk = tipc_sk(sk); > struct tipc_msg *hdr = &tsk->phdr; > struct net *net = sock_net(sk); > - struct sk_buff_head pkts; > u32 dnode = tsk_peer_node(tsk); > + int blocks = tsk->snd_backlog; > + int maxnagle = tsk->maxnagle; > + int maxpkt = tsk->max_pkt; > int send, sent = 0; > int rc = 0; > > - __skb_queue_head_init(&pkts); > - > if (unlikely(dlen > INT_MAX)) > return -EMSGSIZE; > > @@ -1467,21 +1523,35 @@ static int __tipc_sendstream(struct socket *sock, > struct msghdr *m, size_t dlen) > tipc_sk_connected(sk))); > if (unlikely(rc)) > break; > - > send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); > - rc = tipc_msg_build(hdr, m, sent, send, tsk->max_pkt, > &pkts); > - if (unlikely(rc != send)) > - break; > > - trace_tipc_sk_sendstream(sk, skb_peek(&pkts), > > [Tuong]: Should we set the 'blocks' here instead i.e. blocks = > tsk->snd_backlog' as it can be changed if we have to release the sock & > sleep in advance (e.g. tipc_wait_for_cond), also the 'while' statement can > be re-run? Yes, you are right. I'll fix this. ///jon > > + if (tsk->oneway++ >= 4 && send <= maxnagle) { > + rc = tipc_msg_append(hdr, m, send, maxnagle, txq); > + if (rc < 0) > + break; > + blocks += rc; > + if (blocks <= 64 && tsk->expect_ack) { > + tsk->snd_backlog = blocks; > + sent += send; > + break; > + } > + tsk->expect_ack = true; > + } else { > + rc = tipc_msg_build(hdr, m, sent, send, maxpkt, > txq); > + if (unlikely(rc != send)) > + break; > + blocks += tsk_inc(tsk, send + MIN_H_SIZE); > + } > + trace_tipc_sk_sendstream(sk, skb_peek(txq), > TIPC_DUMP_SK_SNDQ, " "); > - rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); > + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > if (unlikely(rc == -ELINKCONG)) { > tsk->cong_link_cnt = 1; > rc = 0; > } > if (likely(!rc)) { > - tsk->snt_unacked += tsk_inc(tsk, send + MIN_H_SIZE); > + tsk->snt_unacked += blocks; > + tsk->snd_backlog = 0; > sent += send; > } > } while (sent < dlen && !rc); > @@ -1528,6 +1598,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, > u32 peer_port, > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > |
From: Tuong L. T. <tuo...@de...> - 2019-10-30 11:40:30
|
Hi Jon, Please see below one comment from my side. BR/Tuong @@ -1437,16 +1492,17 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) struct sock *sk = sock->sk; DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name); long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT); + struct sk_buff_head *txq = &sk->sk_write_queue; struct tipc_sock *tsk = tipc_sk(sk); struct tipc_msg *hdr = &tsk->phdr; struct net *net = sock_net(sk); - struct sk_buff_head pkts; u32 dnode = tsk_peer_node(tsk); + int blocks = tsk->snd_backlog; + int maxnagle = tsk->maxnagle; + int maxpkt = tsk->max_pkt; int send, sent = 0; int rc = 0; - __skb_queue_head_init(&pkts); - if (unlikely(dlen > INT_MAX)) return -EMSGSIZE; @@ -1467,21 +1523,35 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) tipc_sk_connected(sk))); if (unlikely(rc)) break; - send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); - rc = tipc_msg_build(hdr, m, sent, send, tsk->max_pkt, &pkts); - if (unlikely(rc != send)) - break; - trace_tipc_sk_sendstream(sk, skb_peek(&pkts), [Tuong]: Should we set the 'blocks' here instead i.e. blocks = tsk->snd_backlog' as it can be changed if we have to release the sock & sleep in advance (e.g. tipc_wait_for_cond), also the 'while' statement can be re-run? + if (tsk->oneway++ >= 4 && send <= maxnagle) { + rc = tipc_msg_append(hdr, m, send, maxnagle, txq); + if (rc < 0) + break; + blocks += rc; + if (blocks <= 64 && tsk->expect_ack) { + tsk->snd_backlog = blocks; + sent += send; + break; + } + tsk->expect_ack = true; + } else { + rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq); + if (unlikely(rc != send)) + break; + blocks += tsk_inc(tsk, send + MIN_H_SIZE); + } + trace_tipc_sk_sendstream(sk, skb_peek(txq), TIPC_DUMP_SK_SNDQ, " "); - rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); if (unlikely(rc == -ELINKCONG)) { tsk->cong_link_cnt = 1; rc = 0; } if (likely(!rc)) { - tsk->snt_unacked += tsk_inc(tsk, send + MIN_H_SIZE); + tsk->snt_unacked += blocks; + tsk->snd_backlog = 0; sent += send; } } while (sent < dlen && !rc); @@ -1528,6 +1598,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); |
From: Tuong L. <tuo...@de...> - 2019-10-30 10:42:31
|
It is observed that TIPC service binding order will not be kept in the publication event report to user if the service is subscribed after the bindings. For example, services are bound by application in the following order: Server: bound port A to {18888,66,66} scope 2 Server: bound port A to {18888,33,33} scope 2 Now, if a client subscribes to the service range (e.g. {18888, 0-100}), it will get the 'TIPC_PUBLISHED' events in that binding order only when the subscription is started before the bindings. Otherwise, if started after the bindings, the events will arrive in the opposite order: Client: received event for published {18888,33,33} Client: received event for published {18888,66,66} For the latter case, it is clear that the bindings have existed in the name table already, so when reported, the events' order will follow the order of the rbtree binding nodes (- a node with lesser 'lower'/'upper' range value will be first). This is correct as we provide the tracking on a specific service status (available or not), not the relationship between multiple services. However, some users expect to see the same order of arriving events irrespective of when the subscription is issued. This turns out to be easy to fix. We now add functionality to ensure that publication events always are issued in the same temporal order as the corresponding bindings were performed. Acked-by: Jon Maloy <jon...@er...> Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/name_table.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c index 66a65c2cdb23..a98cfba7ed93 100644 --- a/net/tipc/name_table.c +++ b/net/tipc/name_table.c @@ -35,6 +35,7 @@ */ #include <net/sock.h> +#include <linux/list_sort.h> #include "core.h" #include "netlink.h" #include "name_table.h" @@ -54,6 +55,8 @@ * Used by closest_first lookup and multicast lookup algorithm * @all_publ: all publications identical to this one, whatever node and scope * Used by round-robin lookup algorithm + * @list: to form a list of interested service ranges + * @id: the service range's ID in the same tipc service */ struct service_range { u32 lower; @@ -61,11 +64,14 @@ struct service_range { struct rb_node tree_node; struct list_head local_publ; struct list_head all_publ; + struct list_head list; + u16 id; }; /** * struct tipc_service - container for all published instances of a service type * @type: 32 bit 'type' value for service + * @range_cnt: increasing counter for service ranges in this service * @ranges: rb tree containing all service ranges for this service * @service_list: links to adjacent name ranges in hash chain * @subscriptions: list of subscriptions for this service type @@ -74,6 +80,7 @@ struct service_range { */ struct tipc_service { u32 type; + u16 range_cnt; struct rb_root ranges; struct hlist_node service_list; struct list_head subscriptions; @@ -209,6 +216,7 @@ static struct service_range *tipc_service_create_range(struct tipc_service *sc, return NULL; sr->lower = lower; sr->upper = upper; + sr->id = sc->range_cnt++; INIT_LIST_HEAD(&sr->local_publ); INIT_LIST_HEAD(&sr->all_publ); rb_link_node(&sr->tree_node, parent, n); @@ -277,6 +285,17 @@ static struct publication *tipc_service_remove_publ(struct service_range *sr, return NULL; } +static int service_range_cmp(void *priv, struct list_head *a, + struct list_head *b) +{ + struct service_range *sra, *srb; + + sra = container_of(a, struct service_range, list); + srb = container_of(b, struct service_range, 
list); + + return more(sra->id, srb->id); +} + /** * tipc_service_subscribe - attach a subscription, and optionally * issue the prescribed number of events if there is any service @@ -287,6 +306,7 @@ static void tipc_service_subscribe(struct tipc_service *service, { struct tipc_subscr *sb = &sub->evt.s; struct service_range *sr; + struct list_head sr_list; struct tipc_name_seq ns; struct publication *p; struct rb_node *n; @@ -302,14 +322,19 @@ static void tipc_service_subscribe(struct tipc_service *service, if (tipc_sub_read(sb, filter) & TIPC_SUB_NO_STATUS) return; + INIT_LIST_HEAD(&sr_list); for (n = rb_first(&service->ranges); n; n = rb_next(n)) { sr = container_of(n, struct service_range, tree_node); if (sr->lower > ns.upper) break; if (!tipc_sub_check_overlap(&ns, sr->lower, sr->upper)) continue; - first = true; + list_add(&sr->list, &sr_list); + } + list_sort(NULL, &sr_list, service_range_cmp); + list_for_each_entry(sr, &sr_list, list) { + first = true; list_for_each_entry(p, &sr->all_publ, all_publ) { tipc_sub_report_overlap(sub, sr->lower, sr->upper, TIPC_PUBLISHED, p->port, -- 2.13.7 |
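The fix above boils down to tagging each service range with a creation counter and sorting the matching ranges on that counter before the events are generated. A plain qsort() illustration of the same idea follows; struct range and the hard-coded matches are illustrative only, while the kernel uses list_sort() with a wrap-safe more() comparison on a u16 id.

#include <stdio.h>
#include <stdlib.h>

struct range {
    unsigned int lower, upper;
    unsigned short id;   /* increasing counter assigned when the range is bound */
};

/* report matches in binding order (ascending id), regardless of how the
 * lookup structure (an rbtree keyed on 'lower' in TIPC) returned them
 */
static int cmp_id(const void *a, const void *b)
{
    const struct range *ra = a, *rb = b;

    return (int)ra->id - (int)rb->id;
}

int main(void)
{
    /* as produced by an in-order tree walk: sorted on 'lower', not on id */
    struct range matches[] = {
        { 33, 33, 1 },   /* bound second */
        { 66, 66, 0 },   /* bound first  */
    };
    int n = sizeof(matches) / sizeof(matches[0]);

    qsort(matches, n, sizeof(matches[0]), cmp_id);

    for (int i = 0; i < n; i++)
        printf("TIPC_PUBLISHED {18888,%u,%u}\n",
               matches[i].lower, matches[i].upper);
    return 0;
}

With the commit's example bindings, this prints {18888,66,66} before {18888,33,33}, i.e. the original binding order rather than the rbtree order.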
From: Hoang Le <hoa...@de...> - 2019-10-30 09:39:24
|
There are two improvements when re-calculate cluster capabilities: - When deleting a specific down node, need to re-calculate. - In tipc_node_cleanup(), do not need to re-calculate if node is still existing in cluster. Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/node.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/net/tipc/node.c b/net/tipc/node.c index 4b60928049ea..1f1584518221 100644 --- a/net/tipc/node.c +++ b/net/tipc/node.c @@ -667,6 +667,11 @@ static bool tipc_node_cleanup(struct tipc_node *peer) } tipc_node_write_unlock(peer); + if (!deleted) { + spin_unlock_bh(&tn->node_list_lock); + return deleted; + } + /* Calculate cluster capabilities */ tn->capabilities = TIPC_NODE_CAPABILITIES; list_for_each_entry_rcu(temp_node, &tn->node_list, list) { @@ -2043,7 +2048,7 @@ int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info) struct net *net = sock_net(skb->sk); struct tipc_net *tn = net_generic(net, tipc_net_id); struct nlattr *attrs[TIPC_NLA_NET_MAX + 1]; - struct tipc_node *peer; + struct tipc_node *peer, *temp_node; u32 addr; int err; @@ -2084,6 +2089,11 @@ int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info) tipc_node_write_unlock(peer); tipc_node_delete(peer); + /* Calculate cluster capabilities */ + tn->capabilities = TIPC_NODE_CAPABILITIES; + list_for_each_entry_rcu(temp_node, &tn->node_list, list) { + tn->capabilities &= temp_node->capabilities; + } err = 0; err_out: tipc_node_put(peer); -- 2.20.1 |
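The re-calculation added in both places is just an intersection of the capability bitmaps of the peers still present. A user-space model of that loop is shown below; the peer values and the TIPC_NODE_CAPABILITIES constant here are made up for illustration and are not the real kernel defines.

#include <stdio.h>

#define TIPC_NODE_CAPABILITIES 0x07ff   /* illustrative value only */

int main(void)
{
    /* capability words advertised by the peers still on the node list */
    unsigned int peers[] = { 0x07ff, 0x03ff, 0x07ff };
    unsigned int caps = TIPC_NODE_CAPABILITIES;
    unsigned int i;

    /* same shape as the list_for_each_entry_rcu() loop in the patch:
     * cluster capabilities are the AND of every remaining peer's bits
     */
    for (i = 0; i < sizeof(peers) / sizeof(peers[0]); i++)
        caps &= peers[i];

    printf("cluster capabilities: 0x%04x\n", caps);   /* 0x03ff here */
    return 0;
}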
From: Hoang Le <hoa...@de...> - 2019-10-30 06:27:44
|
In a huge cluster (e.g. >200 nodes), the flow 'gap -> retransmit packet -> acked' can take a long time when STATE_MSG packets are dropped or delayed by heavy traffic. With the current 1.5 sec tolerance criterion this makes the link prone to failure already around the 2nd or 3rd failed retransmission attempt. Instead of re-introducing the old criterion of 99 failed retransmissions to fix the issue, we increase the failure detection timer to ten times the tolerance value. Fixes: 77cf8edbc0e7 ("tipc: simplify stale link failure criteria") Signed-off-by: Hoang Le <hoa...@de...> --- net/tipc/link.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/tipc/link.c b/net/tipc/link.c index 7d7a66178607..9f524c325c0d 100644 --- a/net/tipc/link.c +++ b/net/tipc/link.c @@ -1084,7 +1084,7 @@ static bool link_retransmit_failure(struct tipc_link *l, struct tipc_link *r, return false; if (!time_after(jiffies, TIPC_SKB_CB(skb)->retr_stamp + - msecs_to_jiffies(r->tolerance))) + msecs_to_jiffies(r->tolerance * 10))) return false; hdr = buf_msg(skb); -- 2.20.1 |
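The practical effect of this one-liner is a wider grace period before a stale retransmission is declared a link failure. A rough user-space model of the timing check follows, with milliseconds standing in for jiffies and the 1500 ms default link tolerance assumed; the function and parameter names are placeholders.

#include <stdbool.h>
#include <stdio.h>

/* Declare failure only if the oldest unacked packet has waited for
 * retransmission longer than 10 x link tolerance, mirroring the
 * time_after(jiffies, retr_stamp + msecs_to_jiffies(tolerance * 10)) test.
 */
static bool retransmit_failure(unsigned long now_ms,
                               unsigned long retr_stamp_ms,
                               unsigned long tolerance_ms)
{
    return now_ms > retr_stamp_ms + tolerance_ms * 10;
}

int main(void)
{
    unsigned long tolerance = 1500;   /* default link tolerance, ms */

    /* 4 s after the first retransmission: the old 1 x tolerance window
     * would already have failed the link, the 10 x window has not
     */
    printf("failed after 4s:  %d\n", retransmit_failure(4000, 0, tolerance));
    printf("failed after 16s: %d\n", retransmit_failure(16000, 0, tolerance));
    return 0;
}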
From: Ying X. <yin...@wi...> - 2019-10-30 05:01:44
|
On 10/30/19 4:11 AM, Jon Maloy wrote: > We introduce a feature that works like a combination of TCP_NAGLE and > TCP_CORK, but without some of the weaknesses of those. In particular, > we will not observe long delivery delays because of delayed acks, since > the algorithm itself decides if and when acks are to be sent from the > receiving peer. > > - The nagle property as such is determined by manipulating a new > 'maxnagle' field in struct tipc_sock. If certain conditions are met, > 'maxnagle' will define max size of the messages which can be bundled. > If it is set to zero no messages are ever bundled, implying that the > nagle property is disabled. > - A socket with the nagle property enabled enters nagle mode when more > than 4 messages have been sent out without receiving any data message > from the peer. > - A socket leaves nagle mode whenever it receives a data message from > the peer. > > In nagle mode, messages smaller than 'maxnagle' are accumulated in the > socket write queue. The last buffer in the queue is marked with a new > 'ack_required' bit, which forces the receiving peer to send a CONN_ACK > message back to the sender upon reception. > > The accumulated contents of the write queue is transmitted when one of > the following events or conditions occur. > > - A CONN_ACK message is received from the peer. > - A data message is received from the peer. > - A SOCK_WAKEUP pseudo message is received from the link level. > - The write queue contains more than 64 1k blocks of data. > - The connection is being shut down. > - There is no CONN_ACK message to expect. I.e., there is currently > no outstanding message where the 'ack_required' bit was set. As a > consequence, the first message added after we enter nagle mode > is always sent directly with this bit set. > > This new feature gives a 50-100% improvement of throughput for small > (i.e., less than MTU size) messages, while it might add up to one RTT > to latency time when the socket is in nagle mode. > > Signed-off-by: Jon Maloy <jon...@er...> Acked-by: Ying Xue <yin...@wi...> > > v2: Added TIPC_NODELAY socket option > v3: Given TIPC_NODELAY a parameter as suggested by Ying, and simplified > test for when we can bundle messages. 
> --- > include/uapi/linux/tipc.h | 1 + > net/tipc/msg.c | 53 +++++++++++++++++++++ > net/tipc/msg.h | 12 +++++ > net/tipc/node.h | 7 ++- > net/tipc/socket.c | 114 +++++++++++++++++++++++++++++++++++++++------- > 5 files changed, 169 insertions(+), 18 deletions(-) > > diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h > index 7df026e..76421b8 100644 > --- a/include/uapi/linux/tipc.h > +++ b/include/uapi/linux/tipc.h > @@ -191,6 +191,7 @@ struct sockaddr_tipc { > #define TIPC_GROUP_JOIN 135 /* Takes struct tipc_group_req* */ > #define TIPC_GROUP_LEAVE 136 /* No argument */ > #define TIPC_SOCK_RECVQ_USED 137 /* Default: none (read only) */ > +#define TIPC_NODELAY 138 /* Default: false */ > > /* > * Flag values > diff --git a/net/tipc/msg.c b/net/tipc/msg.c > index 922d262..973795a 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c > @@ -190,6 +190,59 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf) > return 0; > } > > +/** > + * tipc_msg_append(): Append data to tail of an existing buffer queue > + * @hdr: header to be used > + * @m: the data to be appended > + * @mss: max allowable size of buffer > + * @dlen: size of data to be appended > + * @txq: queue to appand to > + * Returns the number og 1k blocks appended or errno value > + */ > +int tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen, > + int mss, struct sk_buff_head *txq) > +{ > + struct sk_buff *skb, *prev; > + int accounted, total, curr; > + int mlen, cpy, rem = dlen; > + struct tipc_msg *hdr; > + > + skb = skb_peek_tail(txq); > + accounted = skb ? msg_blocks(buf_msg(skb)) : 0; > + total = accounted; > + > + while (rem) { > + if (!skb || skb->len >= mss) { > + prev = skb; > + skb = tipc_buf_acquire(mss, GFP_KERNEL); > + if (unlikely(!skb)) > + return -ENOMEM; > + skb_orphan(skb); > + skb_trim(skb, MIN_H_SIZE); > + hdr = buf_msg(skb); > + skb_copy_to_linear_data(skb, _hdr, MIN_H_SIZE); > + msg_set_hdr_sz(hdr, MIN_H_SIZE); > + msg_set_size(hdr, MIN_H_SIZE); > + __skb_queue_tail(txq, skb); > + total += 1; > + if (prev) > + msg_set_ack_required(buf_msg(prev), 0); > + msg_set_ack_required(hdr, 1); > + } > + hdr = buf_msg(skb); > + curr = msg_blocks(hdr); > + mlen = msg_size(hdr); > + cpy = min_t(int, rem, mss - mlen); > + if (cpy != copy_from_iter(skb->data + mlen, cpy, &m->msg_iter)) > + return -EFAULT; > + msg_set_size(hdr, mlen + cpy); > + skb_put(skb, cpy); > + rem -= cpy; > + total += msg_blocks(hdr) - curr; > + } > + return total - accounted; > +} > + > /* tipc_msg_validate - validate basic format of received message > * > * This routine ensures a TIPC message has an acceptable header, and at least > diff --git a/net/tipc/msg.h b/net/tipc/msg.h > index 0daa6f0..b85b85a 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -290,6 +290,16 @@ static inline void msg_set_src_droppable(struct tipc_msg *m, u32 d) > msg_set_bits(m, 0, 18, 1, d); > } > > +static inline int msg_ack_required(struct tipc_msg *m) > +{ > + return msg_bits(m, 0, 18, 1); > +} > + > +static inline void msg_set_ack_required(struct tipc_msg *m, u32 d) > +{ > + msg_set_bits(m, 0, 18, 1, d); > +} > + > static inline bool msg_is_rcast(struct tipc_msg *m) > { > return msg_bits(m, 0, 18, 0x1); > @@ -1065,6 +1075,8 @@ int tipc_msg_fragment(struct sk_buff *skb, const struct tipc_msg *hdr, > int pktmax, struct sk_buff_head *frags); > int tipc_msg_build(struct tipc_msg *mhdr, struct msghdr *m, > int offset, int dsz, int mtu, struct sk_buff_head *list); > +int tipc_msg_append(struct tipc_msg *hdr, struct msghdr *m, 
int dlen, > + int mss, struct sk_buff_head *txq); > bool tipc_msg_lookup_dest(struct net *net, struct sk_buff *skb, int *err); > bool tipc_msg_assemble(struct sk_buff_head *list); > bool tipc_msg_reassemble(struct sk_buff_head *list, struct sk_buff_head *rcvq); > diff --git a/net/tipc/node.h b/net/tipc/node.h > index 291d0ec..b9036f28 100644 > --- a/net/tipc/node.h > +++ b/net/tipc/node.h > @@ -54,7 +54,8 @@ enum { > TIPC_LINK_PROTO_SEQNO = (1 << 6), > TIPC_MCAST_RBCTL = (1 << 7), > TIPC_GAP_ACK_BLOCK = (1 << 8), > - TIPC_TUNNEL_ENHANCED = (1 << 9) > + TIPC_TUNNEL_ENHANCED = (1 << 9), > + TIPC_NAGLE = (1 << 10) > }; > > #define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT | \ > @@ -66,7 +67,9 @@ enum { > TIPC_LINK_PROTO_SEQNO | \ > TIPC_MCAST_RBCTL | \ > TIPC_GAP_ACK_BLOCK | \ > - TIPC_TUNNEL_ENHANCED) > + TIPC_TUNNEL_ENHANCED | \ > + TIPC_NAGLE) > + > #define INVALID_BEARER_ID -1 > > void tipc_node_stop(struct net *net); > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > index 35e32ff..20bad67 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -75,6 +75,7 @@ struct sockaddr_pair { > * @conn_instance: TIPC instance used when connection was established > * @published: non-zero if port has one or more associated names > * @max_pkt: maximum packet size "hint" used when building messages sent by port > + * @maxnagle: maximum size of msg which can be subject to nagle > * @portid: unique port identity in TIPC socket hash table > * @phdr: preformatted message header used when sending messages > * #cong_links: list of congested links > @@ -97,6 +98,7 @@ struct tipc_sock { > u32 conn_instance; > int published; > u32 max_pkt; > + u32 maxnagle; > u32 portid; > struct tipc_msg phdr; > struct list_head cong_links; > @@ -116,6 +118,10 @@ struct tipc_sock { > struct tipc_mc_method mc_method; > struct rcu_head rcu; > struct tipc_group *group; > + u32 oneway; > + u16 snd_backlog; > + bool expect_ack; > + bool nodelay; > bool group_is_open; > }; > > @@ -137,6 +143,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk); > static void tipc_sk_remove(struct tipc_sock *tsk); > static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz); > static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz); > +static void tipc_sk_push_backlog(struct tipc_sock *tsk); > > static const struct proto_ops packet_ops; > static const struct proto_ops stream_ops; > @@ -227,6 +234,26 @@ static u16 tsk_inc(struct tipc_sock *tsk, int msglen) > return 1; > } > > +/* tsk_set_nagle - enable/disable nagle property by manipulating maxnagle > + */ > +static void tsk_set_nagle(struct tipc_sock *tsk) > +{ > + struct sock *sk = &tsk->sk; > + > + tsk->maxnagle = 0; > + if (sk->sk_type != SOCK_STREAM) > + return; > + if (tsk->nodelay) > + return; > + if (!(tsk->peer_caps & TIPC_NAGLE)) > + return; > + /* Limit node local buffer size to avoid receive queue overflow */ > + if (tsk->max_pkt == MAX_MSG_SIZE) > + tsk->maxnagle = 1500; > + else > + tsk->maxnagle = tsk->max_pkt; > +} > + > /** > * tsk_advance_rx_queue - discard first buffer in socket receive queue > * > @@ -446,6 +473,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, > > tsk = tipc_sk(sk); > tsk->max_pkt = MAX_PKT_DEFAULT; > + tsk->maxnagle = 0; > INIT_LIST_HEAD(&tsk->publications); > INIT_LIST_HEAD(&tsk->cong_links); > msg = &tsk->phdr; > @@ -512,8 +540,12 @@ static void __tipc_shutdown(struct socket *sock, int error) > tipc_wait_for_cond(sock, &timeout, (!tsk->cong_link_cnt && > !tsk_conn_cong(tsk))); > > - /* Remove 
any pending SYN message */ > - __skb_queue_purge(&sk->sk_write_queue); > + /* Push out unsent messages or remove if pending SYN */ > + skb = skb_peek(&sk->sk_write_queue); > + if (skb && !msg_is_syn(buf_msg(skb))) > + tipc_sk_push_backlog(tsk); > + else > + __skb_queue_purge(&sk->sk_write_queue); > > /* Reject all unreceived messages, except on an active connection > * (which disconnects locally & sends a 'FIN+' to peer). > @@ -1208,6 +1240,27 @@ void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq, > tipc_sk_rcv(net, inputq); > } > > +/* tipc_sk_push_backlog(): send accumulated buffers in socket write queue > + * when socket is in Nagle mode > + */ > +static void tipc_sk_push_backlog(struct tipc_sock *tsk) > +{ > + struct sk_buff_head *txq = &tsk->sk.sk_write_queue; > + struct net *net = sock_net(&tsk->sk); > + u32 dnode = tsk_peer_node(tsk); > + int rc; > + > + if (skb_queue_empty(txq) || tsk->cong_link_cnt) > + return; > + > + tsk->snt_unacked += tsk->snd_backlog; > + tsk->snd_backlog = 0; > + tsk->expect_ack = true; > + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > + if (rc == -ELINKCONG) > + tsk->cong_link_cnt = 1; > +} > + > /** > * tipc_sk_conn_proto_rcv - receive a connection mng protocol message > * @tsk: receiving socket > @@ -1221,7 +1274,7 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, > u32 onode = tsk_own_node(tsk); > struct sock *sk = &tsk->sk; > int mtyp = msg_type(hdr); > - bool conn_cong; > + bool was_cong; > > /* Ignore if connection cannot be validated: */ > if (!tsk_peer_msg(tsk, hdr)) { > @@ -1254,11 +1307,13 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, > __skb_queue_tail(xmitq, skb); > return; > } else if (mtyp == CONN_ACK) { > - conn_cong = tsk_conn_cong(tsk); > + was_cong = tsk_conn_cong(tsk); > + tsk->expect_ack = false; > + tipc_sk_push_backlog(tsk); > tsk->snt_unacked -= msg_conn_ack(hdr); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > tsk->snd_win = msg_adv_win(hdr); > - if (conn_cong) > + if (was_cong && !tsk_conn_cong(tsk)) > sk->sk_write_space(sk); > } else if (mtyp != CONN_PROBE_REPLY) { > pr_warn("Received unknown CONN_PROTO msg\n"); > @@ -1437,16 +1492,17 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) > struct sock *sk = sock->sk; > DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name); > long timeout = sock_sndtimeo(sk, m->msg_flags & MSG_DONTWAIT); > + struct sk_buff_head *txq = &sk->sk_write_queue; > struct tipc_sock *tsk = tipc_sk(sk); > struct tipc_msg *hdr = &tsk->phdr; > struct net *net = sock_net(sk); > - struct sk_buff_head pkts; > u32 dnode = tsk_peer_node(tsk); > + int blocks = tsk->snd_backlog; > + int maxnagle = tsk->maxnagle; > + int maxpkt = tsk->max_pkt; > int send, sent = 0; > int rc = 0; > > - __skb_queue_head_init(&pkts); > - > if (unlikely(dlen > INT_MAX)) > return -EMSGSIZE; > > @@ -1467,21 +1523,35 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) > tipc_sk_connected(sk))); > if (unlikely(rc)) > break; > - > send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); > - rc = tipc_msg_build(hdr, m, sent, send, tsk->max_pkt, &pkts); > - if (unlikely(rc != send)) > - break; > > - trace_tipc_sk_sendstream(sk, skb_peek(&pkts), > + if (tsk->oneway++ >= 4 && send <= maxnagle) { > + rc = tipc_msg_append(hdr, m, send, maxnagle, txq); > + if (rc < 0) > + break; > + blocks += rc; > + if (blocks <= 64 && tsk->expect_ack) { > + tsk->snd_backlog = blocks; > + sent += send; > + 
break; > + } > + tsk->expect_ack = true; > + } else { > + rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq); > + if (unlikely(rc != send)) > + break; > + blocks += tsk_inc(tsk, send + MIN_H_SIZE); > + } > + trace_tipc_sk_sendstream(sk, skb_peek(txq), > TIPC_DUMP_SK_SNDQ, " "); > - rc = tipc_node_xmit(net, &pkts, dnode, tsk->portid); > + rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > if (unlikely(rc == -ELINKCONG)) { > tsk->cong_link_cnt = 1; > rc = 0; > } > if (likely(!rc)) { > - tsk->snt_unacked += tsk_inc(tsk, send + MIN_H_SIZE); > + tsk->snt_unacked += blocks; > + tsk->snd_backlog = 0; > sent += send; > } > } while (sent < dlen && !rc); > @@ -1528,6 +1598,7 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port, > tipc_node_add_conn(net, peer_node, tsk->portid, peer_port); > tsk->max_pkt = tipc_node_get_mtu(net, peer_node, tsk->portid); > tsk->peer_caps = tipc_node_get_capabilities(net, peer_node); > + tsk_set_nagle(tsk); > __skb_queue_purge(&sk->sk_write_queue); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > return; > @@ -1848,6 +1919,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, > bool peek = flags & MSG_PEEK; > int offset, required, copy, copied = 0; > int hlen, dlen, err, rc; > + bool ack = false; > long timeout; > > /* Catch invalid receive attempts */ > @@ -1892,6 +1964,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, > > /* Copy data if msg ok, otherwise return error/partial data */ > if (likely(!err)) { > + ack = msg_ack_required(hdr); > offset = skb_cb->bytes_read; > copy = min_t(int, dlen - offset, buflen - copied); > rc = skb_copy_datagram_msg(skb, hlen + offset, m, copy); > @@ -1919,7 +1992,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, > > /* Send connection flow control advertisement when applicable */ > tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen); > - if (unlikely(tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE)) > + if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) > tipc_sk_send_ack(tsk); > > /* Exit if all requested data or FIN/error received */ > @@ -1990,6 +2063,7 @@ static void tipc_sk_proto_rcv(struct sock *sk, > smp_wmb(); > tsk->cong_link_cnt--; > wakeup = true; > + tipc_sk_push_backlog(tsk); > break; > case GROUP_PROTOCOL: > tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq); > @@ -2029,6 +2103,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb) > > if (unlikely(msg_mcast(hdr))) > return false; > + tsk->oneway = 0; > > switch (sk->sk_state) { > case TIPC_CONNECTING: > @@ -2074,6 +2149,8 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb) > return true; > return false; > case TIPC_ESTABLISHED: > + if (!skb_queue_empty(&sk->sk_write_queue)) > + tipc_sk_push_backlog(tsk); > /* Accept only connection-based messages sent by peer */ > if (likely(con_msg && !err && pport == oport && pnode == onode)) > return true; > @@ -2959,6 +3036,7 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt, > case TIPC_SRC_DROPPABLE: > case TIPC_DEST_DROPPABLE: > case TIPC_CONN_TIMEOUT: > + case TIPC_NODELAY: > if (ol < sizeof(value)) > return -EINVAL; > if (get_user(value, (u32 __user *)ov)) > @@ -3007,6 +3085,10 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt, > case TIPC_GROUP_LEAVE: > res = tipc_sk_leave(tsk); > break; > + case TIPC_NODELAY: > + tsk->nodelay = !!value; > + tsk_set_nagle(tsk); > + break; > default: > res = -EINVAL; > } > |
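From user space, the new socket option is meant to be used much like TCP_NODELAY. Below is a minimal sketch of how an application could disable nagle on a TIPC stream socket; the fallback defines assume the uapi values introduced by this patch (TIPC_NODELAY = 138) and by linux/socket.h (SOL_TIPC = 271) in case the installed headers predate them.

#include <stdio.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#ifndef SOL_TIPC
#define SOL_TIPC 271         /* from linux/socket.h */
#endif
#ifndef TIPC_NODELAY
#define TIPC_NODELAY 138     /* value added by this patch */
#endif

int main(void)
{
    int sd = socket(AF_TIPC, SOCK_STREAM, 0);
    unsigned int on = 1;     /* any non-zero value disables nagle */

    if (sd < 0) {
        perror("socket(AF_TIPC)");
        return 1;
    }
    /* the option takes a u32; the kernel side does tsk->nodelay = !!value */
    if (setsockopt(sd, SOL_TIPC, TIPC_NODELAY, &on, sizeof(on)) < 0)
        perror("setsockopt(TIPC_NODELAY)");

    return 0;
}

Left at the default (nodelay off), a stream socket gets the throughput gain described above at the cost of up to one extra RTT of latency while in nagle mode; setting the option trades that back.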
From: David M. <da...@da...> - 2019-10-30 00:56:22
|
From: Hoang Le <hoa...@de...> Date: Tue, 29 Oct 2019 07:51:21 +0700 > Currently, TIPC transports intra-node user data messages directly > socket to socket, hence shortcutting all the lower layers of the > communication stack. This gives TIPC very good intra node performance, > both regarding throughput and latency. > > We now introduce a similar mechanism for TIPC data traffic across > network namespaces located in the same kernel. On the send path, the > call chain is as always accompanied by the sending node's network name > space pointer. However, once we have reliably established that the > receiving node is represented by a namespace on the same host, we just > replace the namespace pointer with the receiving node/namespace's > ditto, and follow the regular socket receive path through the receiving > node. This technique gives us a throughput similar to the node internal > throughput, several times larger than if we let the traffic go through > the full network stacks. As a comparison, max throughput for 64k > messages is four times larger than TCP throughput for the same type of > traffic. > > To meet any security concerns, the following should be noted. ... > Regarding traceability, we should notice that since commit 6c9081a3915d > ("tipc: add loopback device tracking") it is possible to follow the node > internal packet flow by just activating tcpdump on the loopback > interface. This will be true even for this mechanism; by activating > tcpdump on the involved nodes' loopback interfaces their inter-name > space messaging can easily be tracked. > > v2: > - update 'net' pointer when node left/rejoined > v3: > - grab read/write lock when using node ref obj > v4: > - clone traffics between netns to loopback > > Suggested-by: Jon Maloy <jon...@er...> > Acked-by: Jon Maloy <jon...@er...> > Signed-off-by: Hoang Le <hoa...@de...> Applied to net-next.
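The mechanism described in the quoted commit message is essentially a dispatch decision on the transmit path: if the peer node is known to be another namespace on the same host, hand the buffer to that namespace's receive path instead of going through the full stack. The following is a schematic, non-kernel sketch of that decision only; all types and helpers are placeholders, not the actual tipc_node_xmit() internals.

#include <stdio.h>

struct netns {
    const char *name;
};

struct node {
    struct netns *peer_net;   /* set once the peer is known to be a local namespace */
};

/* placeholder delivery paths */
static void deliver_local(struct netns *net)
{
    printf("short-circuit into %s (and clone to loopback for tcpdump)\n",
           net->name);
}

static void deliver_bearer(void)
{
    printf("send via bearer / full network stack\n");
}

static void node_xmit(struct node *n)
{
    /* the core idea of the patch set: swap in the peer namespace's net
     * pointer and take the short path whenever it is available
     */
    if (n->peer_net)
        deliver_local(n->peer_net);
    else
        deliver_bearer();
}

int main(void)
{
    struct netns ns2 = { "netns2" };
    struct node remote = { NULL };          /* genuine remote node */
    struct node co_located = { &ns2 };      /* namespace on the same host */

    node_xmit(&remote);
    node_xmit(&co_located);
    return 0;
}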