From: Jon M. <jm...@re...> - 2020-05-12 21:21:42
cc to tipc-discussion and others. ///jon

-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Tue, 12 May 2020 14:38:29 -0400
From: Jon Maloy <jm...@re...>
To: Hoang Huu Le <hoa...@de...>, ma...@do... <ma...@do...>

Hi Hoang,

I had to re-read the discussion we had back in November to refresh my memory about this. To me v2 looks like a better base for further development than the most recent version, so I am sorry to have wasted your time here.

I am still convinced that the mcast/rcast synch protocol we developed can solve almost all our problems if we understand how to use it right. What about the following:

Sending side:

- Even the initial bulk is sent via tipc_mcast_xmit(), but with only one destination node and as "mandatory rcast". The broadcast link may still choose to send it as bcast in case rcast has been manually disabled by the user, but the surplus messages will be dropped on the non-destination nodes. (In the case of legacy nodes this is certain; in the case of newer nodes we must make sure that it happens by setting the 'destnode' field in the header and always checking for this in tipc_named_rcv().) I think this is a weird and rare case where we can accept the extra overhead of broadcast. Under all normal conditions rcast will be used here.

- Regular PUBLISH/WITHDRAW messages are sent as "mandatory bcast", or, if some nodes don't support TIPC_MCAST_RBCTL, as "mandatory rcast". The broadcast link may still choose to send it as rcast, but this is nothing we need to worry about as long as the synchronization mechanism is active. Under all normal conditions bcast will be selected when possible, since all nodes are destinations.

Receiving side:

- We need a deferred queue, but last time I missed that we don't need one per node; it is sufficient to have a global one, as you were suggesting. So to me the code is ok here.

- As mentioned earlier, tipc_named_rcv() only accepts messages where destnode is "self" (bulk) or zero (publish/withdraw).

The point of all this is that all messages will now be subject to the synchronization mechanism, so the bulk messages are guaranteed to be delivered in order and ahead of individual publications/withdraws, and even topology changes will almost never lead to a change of sending method and a need for synchronization. We must make one change to the broadcast protocol though: even "mandatory" messages must be made subject to the synchronization mechanism, something that doesn't seem to be the case now. (Or maybe we don't need to set "mandatory" at all? This needs to be checked.)

The advantage of this method is simplicity:

- We don't need any separate synchronization mechanism for NAME_DISTR messages at all; we just use what is already there.

The disadvantage is that a lot of SYN messages will be sent if we are not careful:

- Every time a new node comes up, the other nodes will send it a bcast SYN before they send the bulk, because they believe they have just switched from bcast (normal mode) to rcast (bulk mode). This one is in reality unnecessary, since we can be sure that the new node has not been sent any previous bcast that needs to be synched with.

- At the first publication/withdraw sent after the bulk, the neighbor nodes will send an rcast SYN to all other nodes, because they just "switched" back from rcast to bcast. In this case we really do need to send an rcast SYN to the new node that came up, since this is the only one where there is a risk of a race. This message in reality serves as an "end-of-bulk" message but is handled like any other SYN by the receiving tipc_mcast_filter_rcv().

So, to avoid this the broadcast protocol needs to be able to recognize NAME_DISTR messages and treat them slightly differently from the others, or it needs to be told what to do by the name_distr.c code via the API. Maybe a couple of new fields in struct tipc_mc_method?

What do you think?

Regards
///jon

On 5/12/20 6:22 AM, Hoang Huu Le wrote:
> Just forward the patch I mentioned.
>
> -----Original Message-----
> From: Hoang Le <hoa...@de...>
> Sent: Tuesday, November 19, 2019 5:01 PM
> To: jon...@er...; ma...@do...; tip...@de...
> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Currently, updating the binding table (adding a service binding to the
> name table/withdrawing a service binding) is done via replicast. However,
> if we scale clusters up to > 100 nodes/containers this method is less
> efficient, because it loops through the nodes in a cluster one by one.
>
> It is worthwhile to use broadcast to update a binding service, so the
> binding table is updated on all nodes in one shot.
>
> The mechanism is backwards compatible, as sync messages are silently
> dropped.
>
> v2: resolve synchronization problem when switching from unicast to
> broadcast
>
> Signed-off-by: Hoang Le <hoa...@de...>
> ---
> net/tipc/bcast.c      |  3 ++-
> net/tipc/link.c       |  6 ++++++
> net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
> net/tipc/name_table.h |  4 ++++
> net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
> net/tipc/node.h       |  2 ++
> 6 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index e06f05d55534..44ed481fec47 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>     hdr = buf_msg(skb);
>     if (msg_user(hdr) == MSG_FRAGMENTER)
>         hdr = msg_inner_hdr(hdr);
> -   if (msg_type(hdr) != TIPC_MCAST_MSG)
> +   if (msg_user(hdr) != NAME_DISTRIBUTOR &&
> +       msg_type(hdr) != TIPC_MCAST_MSG)
>         return 0;
>
>     /* Allocate dummy message */
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index fb72031228c9..a2e9a64d5a0f 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>                 struct sk_buff_head *inputq)
> {
>     struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
> +   struct name_table *nt = tipc_name_table(l->net);
> +   struct sk_buff_head *defnq = &nt->defer_namedq;
>     struct tipc_msg *hdr = buf_msg(skb);
>
>     switch (msg_user(hdr)) {
> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>     case NAME_DISTRIBUTOR:
>         l->bc_rcvlink->state = LINK_ESTABLISHED;
>         skb_queue_tail(l->namedq, skb);
> +
> +       spin_lock_bh(&defnq->lock);
> +       tipc_mcast_filter_msg(l->net, defnq, l->namedq);
> +       spin_unlock_bh(&defnq->lock);
>         return true;
>     case MSG_BUNDLER:
>     case TUNNEL_PROTOCOL:
> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
> index 66a65c2cdb23..593dcd11357f 100644
> --- a/net/tipc/name_table.c
> +++ b/net/tipc/name_table.c
> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>     struct tipc_net *tn = tipc_net(net);
>     struct publication *p = NULL;
>     struct sk_buff *skb = NULL;
> +   bool rcast;
>
>     spin_lock_bh(&tn->nametbl_lock);
> +   rcast = nt->rcast;
>     if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>         pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>         goto exit;
> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
> exit:
>     spin_unlock_bh(&tn->nametbl_lock);
>
> -   if (skb)
> -       tipc_node_broadcast(net, skb);
> +   if (skb) {
> +       /* Use broadcast if all nodes support broadcast NAME_DISTR */
> +       if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +           tipc_node_broadcast_named_publish(net, skb, &rcast);
> +           spin_lock_bh(&tn->nametbl_lock);
> +           nt->rcast = rcast;
> +           spin_unlock_bh(&tn->nametbl_lock);
> +       } else {
> +           /* Otherwise, be backwards compatible */
> +           tipc_node_broadcast(net, skb);
> +       }
> +   }
>     return p;
> }
> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>     u32 self = tipc_own_addr(net);
>     struct sk_buff *skb = NULL;
>     struct publication *p;
> +   bool rcast;
>
>     spin_lock_bh(&tn->nametbl_lock);
> +   rcast = nt->rcast;
>     p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>     if (p) {
> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>     spin_unlock_bh(&tn->nametbl_lock);
>
>     if (skb) {
> -       tipc_node_broadcast(net, skb);
> +       /* Use broadcast if all nodes support broadcast NAME_DISTR */
> +       if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +           tipc_node_broadcast_named_publish(net, skb, &rcast);
> +           spin_lock_bh(&tn->nametbl_lock);
> +           nt->rcast = rcast;
> +           spin_unlock_bh(&tn->nametbl_lock);
> +       } else {
> +           /* Otherwise, be backwards compatible */
> +           tipc_node_broadcast(net, skb);
> +       }
>         return 1;
>     }
>     return 0;
> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>     INIT_LIST_HEAD(&nt->cluster_scope);
>     rwlock_init(&nt->cluster_scope_lock);
>     tn->nametbl = nt;
> +   /* 'bulk' updated messages via unicast */
> +   nt->rcast = true;
> +   skb_queue_head_init(&nt->defer_namedq);
>     spin_lock_init(&tn->nametbl_lock);
>     return 0;
> }
> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>      * publications, then release the name table
>      */
>     spin_lock_bh(&tn->nametbl_lock);
> +   skb_queue_purge(&nt->defer_namedq);
>     for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>         if (hlist_empty(&nt->services[i]))
>             continue;
> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
> index f79066334cc8..b8cdf2a29d48 100644
> --- a/net/tipc/name_table.h
> +++ b/net/tipc/name_table.h
> @@ -95,6 +95,8 @@ struct publication {
>   * - used by name_distr to send bulk updates to new nodes
>   * - used by name_distr during re-init of name table
>   * @local_publ_count: number of publications issued by this node
> + * @defer_namedq: temporary queue for 'synching' updates
> + * @rcast: previous method used to publish/withdraw a service
>   */
> struct name_table {
>     struct hlist_head services[TIPC_NAMETBL_SIZE];
> @@ -102,6 +104,8 @@ struct name_table {
>     struct list_head cluster_scope;
>     rwlock_t cluster_scope_lock;
>     u32 local_publ_count;
> +   struct sk_buff_head defer_namedq;
> +   bool rcast;
> };
>
> int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index aaf595613e6e..b058647fa78b 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>     }
>     rcu_read_unlock();
> }
> +
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +                                     bool *rcast)
> +{
> +   struct tipc_mc_method method = {.rcast = *rcast};
> +   struct sk_buff_head xmitq;
> +   struct tipc_nlist dests;
> +   struct tipc_node *n;
> +   u16 cong_link_cnt;
> +   int rc = 0;
> +
> +   __skb_queue_head_init(&xmitq);
> +   __skb_queue_tail(&xmitq, skb);
> +
> +   tipc_nlist_init(&dests, tipc_own_addr(net));
> +   rcu_read_lock();
> +   list_for_each_entry_rcu(n, tipc_nodes(net), list) {
> +       if (in_own_node(net, n->addr))
> +           continue;
> +       if (!node_is_up(n))
> +           continue;
> +       tipc_nlist_add(&dests, n->addr);
> +   }
> +   rcu_read_unlock();
> +
> +   rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
> +   *rcast = method.rcast;
> +
> +   tipc_nlist_purge(&dests);
> +   __skb_queue_purge(&xmitq);
> +   return rc;
> +}
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index a6803b449a2c..d7d19f9932b1 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
> int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
> #endif
> void tipc_node_pre_cleanup_net(struct net *exit_net);
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +                                     bool *rcast);
> #endif
From: Jon M. <jm...@re...> - 2020-05-12 20:39:17
Hi Ying,

I have for several years claimed that TIPC and RAFT would be the perfect combination. But what I had in mind was a more general-purpose user-land RAFT consensus service on top of TIPC, where TIPC provides properties that make RAFT easier to implement and give it shorter discovery/convergence times during topology changes.

By adding RAFT as a service *provided by* TIPC, I imagine you mean something similar to (or even an extension of) the topology server we have now. That had not occurred to me, but it might not be a bad idea. It all boils down to how complex it would be, and how much more code we would need to add to TIPC, vs. the benefit we would get. Your colleague Allan Stevens used to say that TIPC needs a new "unique selling point" to reach wider adoption than it has now, and this might indeed be it. Maybe a prototype to be done by somebody at WindRiver, DEK or RH?

What I don't quite understand is that you present this as an alternative to the 128-bit address scheme. We need that anyway as I see it. Uniqueness for service types is only part of the reason for that proposal. It is just as much about practicality of use, e.g., letting users use strings, database keys, disk partitions etc. directly as service instances and hence avoiding any lookup/translation steps. I did not imagine that TIPC itself would generate any UUIDs or service type/instance values; that part is still left to the user. I think the risk of service type collisions inside a cluster, if the types are generated properly by the user, is so low that it alone would not be enough to justify such a large extension to TIPC. But if we could have 100% guaranteed uniqueness as a side effect of such a new service it would of course be nice. I assume you have more than just TIPC address uniqueness in mind with this proposal?

Regards
///jon

On 5/12/20 5:45 AM, Xue, Ying wrote:
> Hi Jon,
>
> Sorry for the late response.
>
> Before I reply to your comments below, I want to discuss a more general question:
>
> When you posted the idea that we use UUIDs to resolve the service name
> conflict issue, I have in recent days been wondering whether we could
> implement the Raft consensus algorithm (https://raft.github.io/) inside
> the TIPC module. In my opinion, the two approaches have different
> advantages and disadvantages:
>
> UUID:
> Advantages:
> - Its generation algorithm is straightforward, and the Linux kernel has
>   interfaces available for generating UUID numbers.
> Disadvantages:
> - Protocol backwards compatibility
> - UUID generation algorithms cannot 100% guarantee that UUID numbers are
>   never repeated, particularly in a distributed environment, although the
>   probability of a repeated UUID is very small.
>
> Raft:
> Advantages:
> - Can 100% guarantee that service names don't conflict with each other in
>   a distributed environment
> - No protocol backwards compatibility issue
> Disadvantages:
> - Compared to the changes brought by UUID, the changes might not be very
>   big, but it's quite complex, particularly if we try to implement the
>   algorithm in kernel space.
>
> I'd love to hear your opinion.
>
> Thanks,
> Ying
>
> -----Original Message-----
> From: Jon Maloy [mailto:jm...@re...]
> Sent: Tuesday, May 5, 2020 8:15 AM
> To: tip...@li...
> Cc: tun...@de...; hoa...@de...; tuo...@de...; ma...@do...; xi...@re...; Xue, Ying; par...@gm...
> Subject: Re: [RFC PATCH] tipc: define TIPC version 3 address types
>
> Hi all,
>
> I was pondering a little more about this possible feature. First of all,
> I realized that the following test
>
> bool tipc_msg_validate(struct sk_buff **_skb)
> {
>     [...]
>     if (unlikely(msg_version(hdr) != TIPC_VERSION))
>         return false;
>     [...]
> }
>
> makes it very hard to update the version number in a backwards compatible
> way. Even discovery messages will be rejected by v2 nodes, and we don't
> get around that unless we do discovery with v2 messages, or send out a
> duplicate set (v2 + v3) of discovery messages. And we can actually
> achieve exactly what we want by just setting another capability bit.
>
> So, we set bit #12 to mean "TIPC_EXTENDED", and also to mean "all
> previous capabilities are valid if this bit is set, no need to test for
> them". That way, we can zero out these bits and start reusing them for
> new capabilities when we need to.
>
> AF_TIPC3 now becomes AF_TIPCE, tipc_addr becomes tipce_addr, etc.
>
> The binding table needs to be updated in the following way:
>
> union publ_item {
>     struct {
>         __be32 type;
>         __be32 lower;
>         __be32 upper;
>     } legacy;
>     struct {
>         u8 type[16];
>         u8 lower[16];
>         u8 upper[16];
>     } extended;
> };
>
> struct publication {
>     u8 extended;
>     u8 scope;      /* This can only take values [0:3] */
>     u8 spare[2];
>     union publ_item publ;
>     u8 node[16];
>     u32 port;
>     u32 key;
>     struct list_head binding_node;
>     struct list_head binding_sock;
>     struct list_head local_publ;
>     struct list_head all_publ;
>     struct rcu_head rcu;
> };
>
> struct distr_item {
>     union publ_item;
>     __be32 port;
>     __be32 key;
> };
>
> The NAME_DISTR protocol must be extended with a field indicating whether
> it contains legacy publication(s) or extended publication(s). 'Extended'
> nodes receive separate bulks for legacy and extended publications, since
> it is hard to mix them in the same message. Legacy nodes only receive
> legacy publications, so in this case the distributor just sends a bulk
> for those.
>
> The topology subscriber must be updated in a similar manner, but we can
> assume that the same socket cannot issue two types of subscriptions and
> receive two types of events; it has to be one or the other. This should
> simplify the task somewhat.
>
> The user message header format needs to be changed for Service Address
> (Port Name) messages:
> - Type occupies words [8:B], i.e. bytes [32:47]
> - Instance occupies words [C:F], i.e. bytes [48:63]
>
> This is where it gets tricky. The 'header size' field is only 4 bits and
> counts 32-bit words. This means that the current maximum header size that
> can be indicated is 60 bytes. A simple way might be to just extend the
> field with one of the three unused bits [16:18] in word 1 as msb. That
> would be backwards compatible, since those bits are currently 0, and no
> special tricks are needed.
>
> Regarding TIPC_MCAST_MSG, we need yet another 16 bytes, [64:79], if we
> want to preserve the current semantics of [lower,upper]. However, I am
> highly uncertain whether that feature is ever used and needed. We may be
> fine just keeping one 'instance' field, as in NAMED messages.
>
> The group cast protocol could be left for later, once we understand the
> consequences better than we do now, but semantically it should work just
> like now, except with a longer header and type/instance fields.
>
> It would also be nice if the 16-byte node identity replaced the current
> 4-byte node address/number in all interaction with user land, including
> the presentation of the neighbor monitoring status in monitor.c. That can
> possibly also be left for later.
>
> Finally, would it be possible to mark a socket as 'legacy' or 'extended'
> without adding a new AF_TIPCE value? If this can be done in a
> not-too-ugly way it might be worth considering.
>
> ///jon
>
> On 4/27/20 9:53 PM, jm...@re... wrote:
>> From: Jon Maloy <jm...@re...>
>>
>> TIPC would be more attractive in a modern user environment such as
>> Kubernetes if it could provide a larger address range.
>>
>> Advantages:
>> - Users could directly use UUIDs, strings or other values as service
>>   types and instances.
>> - No more risk of collisions between randomly selected service types
>>
>> The effect on the TIPC implementation and protocol would be significant,
>> but this is still worth considering.
>> ---
>>  include/linux/socket.h     |  5 ++-
>>  include/uapi/linux/tipc3.h | 79 ++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 82 insertions(+), 2 deletions(-)
>>  create mode 100644 include/uapi/linux/tipc3.h
>>
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 54338fac45cb..ff2268ceedaf 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -209,8 +209,8 @@ struct ucred {
>>   * reuses AF_INET address family
>>   */
>>  #define AF_XDP 44 /* XDP sockets */
>> -
>> -#define AF_MAX 45 /* For now.. */
>> +#define AF_TIPC3 45 /* TIPC version 3 sockets */
>> +#define AF_MAX 46 /* For now.. */
>>
>>  /* Protocol families, same as address families. */
>>  #define PF_UNSPEC AF_UNSPEC
>> @@ -260,6 +260,7 @@ struct ucred {
>>  #define PF_QIPCRTR AF_QIPCRTR
>>  #define PF_SMC AF_SMC
>>  #define PF_XDP AF_XDP
>> +#define PF_TIPC3 AF_TIPC3
>>  #define PF_MAX AF_MAX
>>
>>  /* Maximum queue length specifiable by listen. */
>> diff --git a/include/uapi/linux/tipc3.h b/include/uapi/linux/tipc3.h
>> new file mode 100644
>> index 000000000000..0d385bc41b66
>> --- /dev/null
>> +++ b/include/uapi/linux/tipc3.h
>> @@ -0,0 +1,79 @@
>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>> +/*
>> + * include/uapi/linux/tipc3.h: Header for TIPC v3 socket interface
>> + *
>> + * Copyright (c) 2020 Red Hat Inc
>> + * All rights reserved.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions are met:
>> + *
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions and the following disclaimer.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + * 3. Neither the names of the copyright holders nor the names of its
>> + *    contributors may be used to endorse or promote products derived from
>> + *    this software without specific prior written permission.
>> + *
>> + * Alternatively, this software may be distributed under the terms of the
>> + * GNU General Public License ("GPL") version 2 as published by the Free
>> + * Software Foundation.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
>> + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>> + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>> + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> + * POSSIBILITY OF SUCH DAMAGE.
>> + */
>> +
>> +#ifndef _LINUX_TIPC3_H_
>> +#define _LINUX_TIPC3_H_
>> +
>> +#include <linux/types.h>
>> +#include <linux/sockios.h>
>> +#include <linux/tipc.h>
>> +
>> +struct tipc3_addr {
>> +	__u8 type[16];     /* zero if socket address */
>> +	__u8 instance[16]; /* port if socket address */
>> +	__u8 node[16];     /* zero if whole cluster */
>> +};
>> +
>> +struct tipc3_subscr {
>> +	__u8 type[16];
>> +	__u8 lower[16];
>> +	__u8 upper[16];
>> +	__u8 node[16];
>> +	__u32 timeout;       /* subscription duration (in ms) */
>> +	__u32 filter;        /* bitmask of filter options */
>> +	__u8 usr_handle[16]; /* available for subscriber use */
>> +};
>> +
>> +struct tipc3_event {
>> +	__u8 lower[16];           /* matching range */
>> +	__u8 upper[16];           /*  "      "      */
>> +	struct tipc3_addr socket; /* associated socket */
>> +	struct tipc3_subscr sub;  /* associated subscription */
>> +	__u32 event;              /* event type */
>> +};
>> +
>> +struct sockaddr_tipc3 {
>> +	unsigned short family;
>> +	bool mcast;
>> +	struct tipc3_addr addr;
>> +};
>> +
>> +struct tipc3_group_req {
>> +	struct tipc3_addr addr;
>> +	__u32 flags;
>> +};
>> +
>> +#endif
From: Ying X. <yin...@wi...> - 2020-05-12 10:37:39
On 5/11/20 11:59 AM, Tuong Lien wrote:
> When a service subscription is expired or canceled by the user, it needs
> to be deleted from the subscription list, so that new subscriptions can
> be registered (max = 65535 per net). However, there are two issues in the
> code that can cause such an unused subscription to persist:
>
> 1) 'tipc_conn_delete_sub()' has a loop over the subscription list, but it
> breaks prematurely when the 1st subscription differs from the one
> specified, so the subscription will not be deleted.
>
> 2) In case a subscription is canceled, the code that removes the
> 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it
> is a local subscription (i.e. no little-endian swap is involved). So
> there will be no match when looking for the subscription to delete later.
>
> The subscription(s) will be removed eventually when the user terminates
> its topology connection, but that could be a long time later. Meanwhile,
> the number of available subscriptions may be exhausted.
>
> This commit fixes the two issues above, so that a subscription can be
> deleted correctly when needed.
>
> v2: define a new macro to write sub field value (- Jon's comment)
> v3: break if the sub to be deleted has been found
>
> Signed-off-by: Tuong Lien <tuo...@de...>

Acked-by: Ying Xue <yin...@wi...>

> ---
> net/tipc/subscr.h | 10 ++++++++++
> net/tipc/topsrv.c |  9 +++++----
> 2 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/subscr.h b/net/tipc/subscr.h
> index aa015c233898..6ebbec1bedd1 100644
> --- a/net/tipc/subscr.h
> +++ b/net/tipc/subscr.h
> @@ -96,6 +96,16 @@ void tipc_sub_get(struct tipc_subscription *subscription);
>     (swap_ ? swab32(val__) : val__);    \
> })
>
> +/* tipc_sub_write - write val_ to field_ of struct sub_ in user endian format
> + */
> +#define tipc_sub_write(sub_, field_, val_)  \
> +   ({                                       \
> +       struct tipc_subscr *sub__ = sub_;    \
> +       u32 val__ = val_;                    \
> +       int swap_ = !((sub__)->filter & TIPC_FILTER_MASK); \
> +       (sub__)->field_ = swap_ ? swab32(val__) : val__;   \
> +   })
> +
> /* tipc_evt_write - write val_ to field_ of struct evt_ in user endian format
>  */
> #define tipc_evt_write(evt_, field_, val_)  \
> diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
> index 931c426673c0..446af7bbd13e 100644
> --- a/net/tipc/topsrv.c
> +++ b/net/tipc/topsrv.c
> @@ -237,8 +237,8 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s)
>         if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) {
>             tipc_sub_unsubscribe(sub);
>             atomic_dec(&tn->subscription_count);
> -       } else if (s) {
> -           break;
> +           if (s)
> +               break;
>         }
>     }
>     spin_unlock_bh(&con->sub_lock);
> @@ -362,9 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv,
> {
>     struct tipc_net *tn = tipc_net(srv->net);
>     struct tipc_subscription *sub;
> +   u32 s_filter = tipc_sub_read(s, filter);
>
> -   if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) {
> -       s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL);
> +   if (s_filter & TIPC_SUB_CANCEL) {
> +       tipc_sub_write(s, filter, s_filter & ~TIPC_SUB_CANCEL);
>         tipc_conn_delete_sub(con, s);
>         return 0;
>     }
From: Xue, Y. <Yin...@wi...> - 2020-05-12 09:45:46
Hi Jon, Sorry for late response. Before I reply to your comments below, I want to discuss a more general question: When you posted the idea that we use UUID to resolve service name conflict issue, in the recent days I was always wondering whether we could implement Raft consensus algorithm (https://raft.github.io/) in internal TIPC module. In my opinion, there are different advantages and disadvantages respectively: UUID: Advantages: - Its generation algorithm is straightforward and Linux kernel has the interfaces available to generate UUID numbers. Disadvantages: - Protocol backwards compatibility - UUID generation algorithms cannot 100% guarantee that UUID numbers are not repeated particularly in distribution environment although the probability of UUID repeated occurrence is very little. Raft: Advantages: - Can 100% guarantee that service name doesn't conflict each other in in distribution environment - No protocol backwards compatibility issue Disadvantages: - Compared to the changes brought by UUID, the changes might not be very big, but it's quite complex particularly when we try to implement its algorithm in kernel space. I love to hear your opinion. Thanks., Ying -----Original Message----- From: Jon Maloy [mailto:jm...@re...] Sent: Tuesday, May 5, 2020 8:15 AM To: tip...@li... Cc: tun...@de...; hoa...@de...; tuo...@de...; ma...@do...; xi...@re...; Xue, Ying; par...@gm... Subject: Re: [RFC PATCH] tipc: define TIPC version 3 address types Hi all, I was pondering a little more about this possible feature. First of all, I realized that the following test bool tipc_msg_validate(struct sk_buff **_skb) { [...] if (unlikely(msg_version(hdr) != TIPC_VERSION)) return false; [...] } makes it very hard to update the version number in a backwards compatible way. Even discovery messages will be rejected by v2 nodes, and we don't get around that unless we do discovery with v2 messages, or send out a duplicate set (v2 +v3) of discovery messages. 
And, we can actually achieve exactly what we want with just setting another capability bit. So, we set bit #12 to mean "TIPC_EXTENDED", and also to mean "all previous capabilities are valid if this bit is set, no need to test for them". That way, we can zero out these bits and start reusing them for new capabilities when we need to. AF_TIPC3 now becomes AF_TIPCE, tipc_addr becomes tipce_addr etc. The binding table needs to be updated the following way: union publ_item { struct { __be32 type; __be32 lower; __be32 upper; } legacy; struct { u8 type[16]; u8 lower[16]; u8 upper[16]; } extended; }; struct publication { u8 extended; u8 scope; /* This can only take values [0:3] */ u8 spare[2]; union publ_item publ; u8 node[16]; u32 port; u32 key; struct list_head binding_node; struct list_head binding_sock; struct list_head local_publ; struct list_head all_publ; struct rcu_head rcu; }; struct distr_item { union publ_item publ; __be32 port; __be32 key; }; The NAME_DISTR protocol must be extended with a field indicating if it contains legacy publication(s) or extended publication(s). 'Extended' nodes receive separate bulks for legacy and extended publications, since it is hard to mix them in the same message. Legacy nodes only receive legacy publications, so in this case the distributor just sends a bulk for those. The topology subscriber must be updated in a similar manner, but we can assume that the same socket cannot issue two types of subscriptions and receive two types of events; it has to be one or the other. This should simplify the task somewhat. User message header format needs to be changed for Service Address (Port Name) messages: - Type occupies word [8:B], i.e. bytes [32:47] - Instance occupies word [C:F], i.e. bytes [48:63] This is where it gets tricky. The 'header size' field is only 4 bits and counts 32-bit words. This means that the current max header size that can be indicated is 60 bytes. 
A simple way might be to just extend the field with one of the three unused bits [16:18] in word 1 as msb. That would be backwards compatible since those bits are currently 0, and no special tricks are needed. Regarding TIPC_MCAST_MSG we need yet another 16 bytes, [64:79], if we want to preserve the current semantics on [lower,upper]. However, I am highly uncertain whether that feature is ever used or needed. We may be fine with just keeping one 'instance' field, as in NAMED messages. The group cast protocol could be left for later, once we understand the consequences better than now, but semantically it should work just like now, except with a longer header and type/instance fields. It would also be nice if the 16 byte node identity replaces the current 4 byte node address/number in all interaction with user land, including the presentation of the neighbor monitoring status in monitor.c. That can possibly also be left for later. Finally, would it be possible to mark a socket as 'legacy' or 'extended' without adding a new AF_TIPCE value? If this can be done in a not-too-ugly way it might be worth considering. ///jon On 4/27/20 9:53 PM, jm...@re... wrote: > From: Jon Maloy <jm...@re...> > > TIPC would be more attractive in a modern user environment such > as Kubernetes if it could provide a larger address range. > > Advantages: > - Users could directly use UUIDs, strings or other values as service > types and instances. > - No more risk of collisions between randomly selected service types > > The effect on the TIPC implementation and protocol would be significant, > but this is still worth considering. 
> --- > include/linux/socket.h | 5 ++- > include/uapi/linux/tipc3.h | 79 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 82 insertions(+), 2 deletions(-) > create mode 100644 include/uapi/linux/tipc3.h > > diff --git a/include/linux/socket.h b/include/linux/socket.h > index 54338fac45cb..ff2268ceedaf 100644 > --- a/include/linux/socket.h > +++ b/include/linux/socket.h > @@ -209,8 +209,8 @@ struct ucred { > * reuses AF_INET address family > */ > #define AF_XDP 44 /* XDP sockets */ > - > -#define AF_MAX 45 /* For now.. */ > +#define AF_TIPC3 45 /* TIPC version 3 sockets */ > +#define AF_MAX 46 /* For now.. */ > > /* Protocol families, same as address families. */ > #define PF_UNSPEC AF_UNSPEC > @@ -260,6 +260,7 @@ struct ucred { > #define PF_QIPCRTR AF_QIPCRTR > #define PF_SMC AF_SMC > #define PF_XDP AF_XDP > +#define PF_TIPC3 AF_TIPC3 > #define PF_MAX AF_MAX > > /* Maximum queue length specifiable by listen. */ > diff --git a/include/uapi/linux/tipc3.h b/include/uapi/linux/tipc3.h > new file mode 100644 > index 000000000000..0d385bc41b66 > --- /dev/null > +++ b/include/uapi/linux/tipc3.h > @@ -0,0 +1,79 @@ > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ > +/* > + * include/uapi/linux/tipc3.h: Header for TIPC v3 socket interface > + * > + * Copyright (c) 2020 Red Hat Inc > + * All rights reserved. > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions are met: > + * > + * 1. Redistributions of source code must retain the above copyright > + * notice, this list of conditions and the following disclaimer. > + * 2. Redistributions in binary form must reproduce the above copyright > + * notice, this list of conditions and the following disclaimer in the > + * documentation and/or other materials provided with the distribution. > + * 3. 
Neither the names of the copyright holders nor the names of its > + * contributors may be used to endorse or promote products derived from > + * this software without specific prior written permission. > + * > + * Alternatively, this software may be distributed under the terms of the > + * GNU General Public License ("GPL") version 2 as published by the Free > + * Software Foundation. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE > + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR > + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS > + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN > + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) > + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE > + * POSSIBILITY OF SUCH DAMAGE. 
> + */ > + > +#ifndef _LINUX_TIPC3_H_ > +#define _LINUX_TIPC3_H_ > + > +#include <linux/types.h> > +#include <linux/sockios.h> > +#include <linux/tipc.h> > + > +struct tipc3_addr { > + __u8 type[16]; /* zero if socket address */ > + __u8 instance[16]; /* port if socket address */ > + __u8 node[16]; /* zero if whole cluster */ > +}; > + > +struct tipc3_subscr { > + __u8 type[16]; > + __u8 lower[16]; > + __u8 upper[16]; > + __u8 node[16]; > + __u32 timeout; /* subscription duration (in ms) */ > + __u32 filter; /* bitmask of filter options */ > + __u8 usr_handle[16]; /* available for subscriber use */ > +}; > + > +struct tipc3_event { > + __u8 lower[16]; /* matching range */ > + __u8 upper[16]; /* " " */ > + struct tipc3_addr socket; /* associated socket */ > + struct tipc3_subscr sub; /* associated subscription */ > + __u32 event; /* event type */ > +}; > + > +struct sockaddr_tipc3 { > + unsigned short family; > + bool mcast; > + struct tipc3_addr addr; > +}; > + > +struct tipc3_group_req { > + struct tipc3_addr addr; > + __u32 flags; > +}; > + > +#endif |
From: Tuong T. L. <tuo...@de...> - 2020-05-12 01:53:03
|
-----Original Message----- From: Ying Xue <yin...@wi...> Sent: Monday, May 11, 2020 8:23 PM To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; tip...@li... Cc: tipc-dek <tip...@de...> Subject: Re: [net 2/2] tipc: fix failed service subscription deletion On 5/7/20 7:12 PM, Tuong Lien wrote: > When a service subscription is expired or canceled by user, it needs to > be deleted from the subscription list, so that new subscriptions can be > registered (max = 65535 per net). However, there are two issues in code > that can cause such an unused subscription to persist: > > 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but > it makes a break shortly when the 1st subscription differs from the one > specified, so the subscription will not be deleted. > > 2) In case a subscription is canceled, the code to remove the > 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it > is a local subscription (i.e. the little endian isn't involved). So, it > will be no matches when looking for the subscription to delete later. > > The subscription(s) will be removed eventually when the user terminates > its topology connection but that could be a long time later. Meanwhile, > the number of available subscriptions may be exhausted. > > This commit fixes the two issues above, so as needed a subscription can > be deleted correctly. > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/topsrv.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 399a89f6f1bf..44609238849e 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -237,8 +237,6 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) > if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { > tipc_sub_unsubscribe(sub); > atomic_dec(&tn->subscription_count); > - } else if (s) { > - break; I am quite surprised at this issue. 
The issue really exists as you describe above, but it's better to fix it as follows: @@ -237,8 +237,8 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { tipc_sub_unsubscribe(sub); atomic_dec(&tn->subscription_count); - } else if (s) { - break; + if (s) + break; } } [Tuong]: Yes, this has been included in v3 of the patch along with the change for the sub write below, please see from there. BR/Tuong > } > } > spin_unlock_bh(&con->sub_lock); > @@ -364,7 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, > struct tipc_subscription *sub; > > if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { > - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + if (!(s->filter & TIPC_FILTER_MASK)) > + s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + else > + s->filter &= ~TIPC_SUB_CANCEL; > tipc_conn_delete_sub(con, s); > return 0; > } > |
From: Jon M. <jm...@re...> - 2020-05-11 16:02:11
|
On 5/10/20 11:59 PM, Tuong Lien wrote: > When a service subscription is expired or canceled by user, it needs to > be deleted from the subscription list, so that new subscriptions can be > registered (max = 65535 per net). However, there are two issues in code > that can cause such an unused subscription to persist: > > 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but > it makes a break shortly when the 1st subscription differs from the one > specified, so the subscription will not be deleted. > > 2) In case a subscription is canceled, the code to remove the > 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it > is a local subscription (i.e. the little endian isn't involved). So, it > will be no matches when looking for the subscription to delete later. > > The subscription(s) will be removed eventually when the user terminates > its topology connection but that could be a long time later. Meanwhile, > the number of available subscriptions may be exhausted. > > This commit fixes the two issues above, so as needed a subscription can > be deleted correctly. > > v2: define a new macro to write sub field value (- Jon's comment) > v3: break if the sub to be deleted has been found > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/subscr.h | 10 ++++++++++ > net/tipc/topsrv.c | 9 +++++---- > 2 files changed, 15 insertions(+), 4 deletions(-) > > diff --git a/net/tipc/subscr.h b/net/tipc/subscr.h > index aa015c233898..6ebbec1bedd1 100644 > --- a/net/tipc/subscr.h > +++ b/net/tipc/subscr.h > @@ -96,6 +96,16 @@ void tipc_sub_get(struct tipc_subscription *subscription); > (swap_ ? swab32(val__) : val__); \ > }) > > +/* tipc_sub_write - write val_ to field_ of struct sub_ in user endian format > + */ > +#define tipc_sub_write(sub_, field_, val_) \ > + ({ \ > + struct tipc_subscr *sub__ = sub_; \ > + u32 val__ = val_; \ > + int swap_ = !((sub__)->filter & TIPC_FILTER_MASK); \ > + (sub__)->field_ = swap_ ? 
swab32(val__) : val__; \ > + }) > + > /* tipc_evt_write - write val_ to field_ of struct evt_ in user endian format > */ > #define tipc_evt_write(evt_, field_, val_) \ > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 931c426673c0..446af7bbd13e 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -237,8 +237,8 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) > if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { > tipc_sub_unsubscribe(sub); > atomic_dec(&tn->subscription_count); > - } else if (s) { > - break; > + if (s) > + break; > } > } > spin_unlock_bh(&con->sub_lock); > @@ -362,9 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, > { > struct tipc_net *tn = tipc_net(srv->net); > struct tipc_subscription *sub; > + u32 s_filter = tipc_sub_read(s, filter); > > - if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { > - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + if (s_filter & TIPC_SUB_CANCEL) { > + tipc_sub_write(s, filter, s_filter & ~TIPC_SUB_CANCEL); > tipc_conn_delete_sub(con, s); > return 0; > } Now it looks good. Acked-by: Jon Maloy <jm...@re...> |
From: Ying X. <yin...@wi...> - 2020-05-11 14:10:40
|
On 5/7/20 7:12 PM, Tuong Lien wrote: > When a service subscription is expired or canceled by user, it needs to > be deleted from the subscription list, so that new subscriptions can be > registered (max = 65535 per net). However, there are two issues in code > that can cause such an unused subscription to persist: > > 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but > it makes a break shortly when the 1st subscription differs from the one > specified, so the subscription will not be deleted. > > 2) In case a subscription is canceled, the code to remove the > 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it > is a local subscription (i.e. the little endian isn't involved). So, it > will be no matches when looking for the subscription to delete later. > > The subscription(s) will be removed eventually when the user terminates > its topology connection but that could be a long time later. Meanwhile, > the number of available subscriptions may be exhausted. > > This commit fixes the two issues above, so as needed a subscription can > be deleted correctly. > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/topsrv.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 399a89f6f1bf..44609238849e 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -237,8 +237,6 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) > if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { > tipc_sub_unsubscribe(sub); > atomic_dec(&tn->subscription_count); > - } else if (s) { > - break; I am quite surprised at this issue. 
The issue really exists as you describe above, but it's better to fix it as follows: @@ -237,8 +237,8 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { tipc_sub_unsubscribe(sub); atomic_dec(&tn->subscription_count); - } else if (s) { - break; + if (s) + break; } } > } > } > spin_unlock_bh(&con->sub_lock); > @@ -364,7 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, > struct tipc_subscription *sub; > > if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { > - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + if (!(s->filter & TIPC_FILTER_MASK)) > + s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + else > + s->filter &= ~TIPC_SUB_CANCEL; > tipc_conn_delete_sub(con, s); > return 0; > } > |
From: Ying X. <yin...@wi...> - 2020-05-11 13:52:56
|
Please check whether the issue still appears with the latest kernel. If you cannot, please provide more detailed info: 1. Capture TIPC packets with the tcpdump tool; 2. Provide the dmesg log, but be sure each line of the dmesg log is prefixed with a timestamp. 3. Also print a timestamp when you receive TIPC_WITHDRAW in user land. On 5/8/20 11:01 AM, Hung Ta wrote: > Hi TIPC experts, > > I'm using the TIPC library in my project which needs to be aware of a node > leaves the cluster. > > So, I use socket type AF_TIPC and SOCK_SEQPACKET and connect it > to TIPC_TOP_SRV. > > I tried to get several nodes up and then make one of them leave and then I > can see the event TIPC_WITHDRAW seen, but the issue is it comes very late, > in particular, it comes about 10 seconds late after a node left the > cluster. > I'm using Ubuntu 16.04, kernel 4.4.0-87-generic. > The same issue also occurs in Ubuntu 14.04. > > Why is it too late? > Appreciate your help. > > Thanks and regards, > Hung > > _______________________________________________ > tipc-discussion mailing list > tip...@li... > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > |
From: Ying X. <yin...@wi...> - 2020-05-11 13:16:39
|
Good catch! On 5/7/20 7:12 PM, Tuong Lien wrote: > Upon receipt of a service subscription request from user via a topology > connection, one 'sub' object will be allocated in kernel, so it will be > able to send an event of the service if any to the user correspondingly > then. Also, in case of any failure, the connection will be shutdown and > all the pertaining 'sub' objects will be freed. > > However, there is a race condition as follows resulting in memory leak: > > receive-work connection send-work > | | | > sub-1 |<------//-------| | > sub-2 |<------//-------| | > | |<---------------| evt for sub-x > sub-3 |<------//-------| | > : : : > : : : > | /--------| | > | | * peer closed | > | | | | > | | |<-------X-------| evt for sub-y > | | |<===============| > sub-n |<------/ X shutdown | > -> orphan | | > > That is, the 'receive-work' may get the last subscription request while > the 'send-work' is shutting down the connection due to peer close. > > We had a 'lock' on the connection, so the two actions cannot be carried > out simultaneously. If the last subscription is allocated e.g. 'sub-n', > before the 'send-work' closes the connection, there will be no issue at > all, the 'sub' objects will be freed. In contrast the last subscription > will become orphan since the connection was closed, and we released all > references. > > This commit fixes the issue by simply adding one test if the connection > remains in 'connected' state soon after we obtain the connection's lock > then a subscription object can be created as usual, otherwise we ignore > it. 
> > Reported-by: Thang Ngo <tha...@de...> > Signed-off-by: Tuong Lien <tuo...@de...> Acked-by: Ying Xue <yin...@wi...> > --- > net/tipc/topsrv.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 73dbed0c4b6b..399a89f6f1bf 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -400,7 +400,9 @@ static int tipc_conn_rcv_from_sock(struct tipc_conn *con) > return -EWOULDBLOCK; > if (ret == sizeof(s)) { > read_lock_bh(&sk->sk_callback_lock); > - ret = tipc_conn_rcv_sub(srv, con, &s); > + /* RACE: the connection can be closed in meanwhile */ > + if (likely(connected(con))) > + ret = tipc_conn_rcv_sub(srv, con, &s); > read_unlock_bh(&sk->sk_callback_lock); > if (!ret) > return 0; > |
From: Ying X. <yin...@wi...> - 2020-05-11 12:48:09
|
> You mean for SOCK_SEQPACKET? But we don't apply smart Nagle to that sock type (only SOCK_STREAM), nor is there such code in tipc_recvmsg(). Thanks for the reminder. Yes, you are right. Please ignore my previous comment. |
From: Xue, Y. <Yin...@wi...> - 2020-05-11 12:45:54
|
Acked-by: Ying Xue <yin...@wi...> -----Original Message----- From: Tuong Lien [mailto:tuo...@de...] Sent: Monday, May 4, 2020 7:28 PM To: jm...@re...; ma...@do...; Xue, Ying; tip...@li... Cc: tip...@de... Subject: [RFC PATCH 0/2] tipc: patches for Nagle algorithm issues Hi Jon, all, Here are the patches for the Nagle issues that I mentioned in the last meeting, please take a look and give me feedback. Thanks a lot! BR/Tuong Tuong Lien (2): tipc: fix large latency in smart Nagle streaming tipc: add test for Nagle algorithm effectiveness net/tipc/msg.c | 3 -- net/tipc/msg.h | 14 ++++++-- net/tipc/socket.c | 101 ++++++++++++++++++++++++++++++++++++++++++------------ 3 files changed, 91 insertions(+), 27 deletions(-) -- 2.13.7 |
From: Tuong L. <tuo...@de...> - 2020-05-11 04:00:26
|
When a service subscription is expired or canceled by user, it needs to be deleted from the subscription list, so that new subscriptions can be registered (max = 65535 per net). However, there are two issues in code that can cause such an unused subscription to persist: 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but it breaks prematurely when the 1st subscription differs from the one specified, so the subscription will not be deleted. 2) In case a subscription is canceled, the code to remove the 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it is a local subscription (i.e. no endian conversion is involved). So, there will be no matches when looking up the subscription to delete later. The subscription(s) will be removed eventually when the user terminates its topology connection but that could be a long time later. Meanwhile, the number of available subscriptions may be exhausted. This commit fixes the two issues above, so that a subscription can be deleted correctly when needed. v2: define a new macro to write sub field value (- Jon's comment) v3: break if the sub to be deleted has been found Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/subscr.h | 10 ++++++++++ net/tipc/topsrv.c | 9 +++++---- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/net/tipc/subscr.h b/net/tipc/subscr.h index aa015c233898..6ebbec1bedd1 100644 --- a/net/tipc/subscr.h +++ b/net/tipc/subscr.h @@ -96,6 +96,16 @@ void tipc_sub_get(struct tipc_subscription *subscription); (swap_ ? swab32(val__) : val__); \ }) +/* tipc_sub_write - write val_ to field_ of struct sub_ in user endian format + */ +#define tipc_sub_write(sub_, field_, val_) \ + ({ \ + struct tipc_subscr *sub__ = sub_; \ + u32 val__ = val_; \ + int swap_ = !((sub__)->filter & TIPC_FILTER_MASK); \ + (sub__)->field_ = swap_ ? 
swab32(val__) : val__; \ + }) + /* tipc_evt_write - write val_ to field_ of struct evt_ in user endian format */ #define tipc_evt_write(evt_, field_, val_) \ diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index 931c426673c0..446af7bbd13e 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -237,8 +237,8 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { tipc_sub_unsubscribe(sub); atomic_dec(&tn->subscription_count); - } else if (s) { - break; + if (s) + break; } } spin_unlock_bh(&con->sub_lock); @@ -362,9 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, { struct tipc_net *tn = tipc_net(srv->net); struct tipc_subscription *sub; + u32 s_filter = tipc_sub_read(s, filter); - if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); + if (s_filter & TIPC_SUB_CANCEL) { + tipc_sub_write(s, filter, s_filter & ~TIPC_SUB_CANCEL); tipc_conn_delete_sub(con, s); return 0; } -- 2.13.7 |
From: Tuong T. L. <tuo...@de...> - 2020-05-11 03:06:59
|
Hi Ying, You mean for SOCK_SEQPACKET? But we don't apply smart Nagle to that sock type (only SOCK_STREAM), nor is there such code in tipc_recvmsg(). BR/Tuong -----Original Message----- From: Xue, Ying <Yin...@wi...> Sent: Friday, May 8, 2020 9:44 PM To: Tuong Tong Lien <tuo...@de...>; jm...@re...; ma...@do...; tip...@li... Cc: tipc-dek <tip...@de...> Subject: RE: [RFC PATCH 1/2] tipc: fix large latency in smart Nagle streaming @@ -2011,7 +2021,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, /* Send connection flow control advertisement when applicable */ tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen); - if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) + if (tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) tipc_sk_send_ack(tsk); Beside tipc_recvstream(), we also need to make the same change in tipc_recvmsg(). |
From: Tuong L. <tuo...@de...> - 2020-05-11 02:59:22
|
When a service subscription is expired or canceled by user, it needs to be deleted from the subscription list, so that new subscriptions can be registered (max = 65535 per net). However, there are two issues in code that can cause such an unused subscription to persist: 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but it breaks prematurely when the 1st subscription differs from the one specified, so the subscription will not be deleted. 2) In case a subscription is canceled, the code to remove the 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it is a local subscription (i.e. no endian conversion is involved). So, there will be no matches when looking up the subscription to delete later. The subscription(s) will be removed eventually when the user terminates its topology connection but that could be a long time later. Meanwhile, the number of available subscriptions may be exhausted. This commit fixes the two issues above, so that a subscription can be deleted correctly when needed. v2: define a new macro to write sub field value (- Jon's comment) Signed-off-by: Tuong Lien <tuo...@de...> --- net/tipc/subscr.h | 10 ++++++++++ net/tipc/topsrv.c | 7 +++---- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/net/tipc/subscr.h b/net/tipc/subscr.h index aa015c233898..6ebbec1bedd1 100644 --- a/net/tipc/subscr.h +++ b/net/tipc/subscr.h @@ -96,6 +96,16 @@ void tipc_sub_get(struct tipc_subscription *subscription); (swap_ ? swab32(val__) : val__); \ }) +/* tipc_sub_write - write val_ to field_ of struct sub_ in user endian format + */ +#define tipc_sub_write(sub_, field_, val_) \ + ({ \ + struct tipc_subscr *sub__ = sub_; \ + u32 val__ = val_; \ + int swap_ = !((sub__)->filter & TIPC_FILTER_MASK); \ + (sub__)->field_ = swap_ ? 
swab32(val__) : val__; \ + }) + /* tipc_evt_write - write val_ to field_ of struct evt_ in user endian format */ #define tipc_evt_write(evt_, field_, val_) \ diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index 931c426673c0..fb6d06eee29f 100644 --- a/net/tipc/topsrv.c +++ b/net/tipc/topsrv.c @@ -237,8 +237,6 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { tipc_sub_unsubscribe(sub); atomic_dec(&tn->subscription_count); - } else if (s) { - break; } } spin_unlock_bh(&con->sub_lock); @@ -362,9 +360,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, { struct tipc_net *tn = tipc_net(srv->net); struct tipc_subscription *sub; + u32 s_filter = tipc_sub_read(s, filter); - if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); + if (s_filter & TIPC_SUB_CANCEL) { + tipc_sub_write(s, filter, s_filter & ~TIPC_SUB_CANCEL); tipc_conn_delete_sub(con, s); return 0; } -- 2.13.7 |
From: Xue, Y. <Yin...@wi...> - 2020-05-08 15:17:20
|
@@ -2011,7 +2021,7 @@ static int tipc_recvstream(struct socket *sock, struct msghdr *m, /* Send connection flow control advertisement when applicable */ tsk->rcv_unacked += tsk_inc(tsk, hlen + dlen); - if (ack || tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) + if (tsk->rcv_unacked >= tsk->rcv_win / TIPC_ACK_RATE) tipc_sk_send_ack(tsk); Besides tipc_recvstream(), we also need to make the same change in tipc_recvmsg(). |
From: Hung Ta <hu...@op...> - 2020-05-08 04:41:03
|
Hi TIPC experts, I'm using the TIPC library in my project, which needs to be aware when a node leaves the cluster. So, I use socket type AF_TIPC and SOCK_SEQPACKET and connect it to TIPC_TOP_SRV. I brought several nodes up and then made one of them leave, and I can see the TIPC_WITHDRAW event, but the issue is that it comes very late, in particular about 10 seconds after the node left the cluster. I'm using Ubuntu 16.04, kernel 4.4.0-87-generic. The same issue also occurs in Ubuntu 14.04. Why is it so late? Appreciate your help. Thanks and regards, Hung |
From: Jon M. <jm...@re...> - 2020-05-07 13:57:15
|
On 5/7/20 7:12 AM, Tuong Lien wrote: > When a service subscription is expired or canceled by user, it needs to > be deleted from the subscription list, so that new subscriptions can be > registered (max = 65535 per net). However, there are two issues in code > that can cause such an unused subscription to persist: > > 1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but > it makes a break shortly when the 1st subscription differs from the one > specified, so the subscription will not be deleted. > > 2) In case a subscription is canceled, the code to remove the > 'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it > is a local subscription (i.e. the little endian isn't involved). So, it > will be no matches when looking for the subscription to delete later. > > The subscription(s) will be removed eventually when the user terminates > its topology connection but that could be a long time later. Meanwhile, > the number of available subscriptions may be exhausted. > > This commit fixes the two issues above, so as needed a subscription can > be deleted correctly. 
> > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/topsrv.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 399a89f6f1bf..44609238849e 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -237,8 +237,6 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s) > if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) { > tipc_sub_unsubscribe(sub); > atomic_dec(&tn->subscription_count); > - } else if (s) { > - break; ok > } > } > spin_unlock_bh(&con->sub_lock); > @@ -364,7 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv, > struct tipc_subscription *sub; > > if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) { > - s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + if (!(s->filter & TIPC_FILTER_MASK)) > + s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL); > + else > + s->filter &= ~TIPC_SUB_CANCEL; There is value in keeping the little-endian/big-endian issues isolated in separate code, as I did with the macros tipc_sub_read() and tipc_evt_write(). In this case we need another macro, tipc_sub_write(), despite the fact that it is used only once. BR ///jon > tipc_conn_delete_sub(con, s); > return 0; > } |
From: Jon M. <jm...@re...> - 2020-05-07 13:51:01
|
On 5/7/20 7:12 AM, Tuong Lien wrote: > Upon receipt of a service subscription request from user via a topology > connection, one 'sub' object will be allocated in kernel, so it will be > able to send an event of the service if any to the user correspondingly > then. Also, in case of any failure, the connection will be shutdown and > all the pertaining 'sub' objects will be freed. > > However, there is a race condition as follows resulting in memory leak: > > receive-work connection send-work > | | | > sub-1 |<------//-------| | > sub-2 |<------//-------| | > | |<---------------| evt for sub-x > sub-3 |<------//-------| | > : : : > : : : > | /--------| | > | | * peer closed | > | | | | > | | |<-------X-------| evt for sub-y > | | |<===============| > sub-n |<------/ X shutdown | > -> orphan | | > > That is, the 'receive-work' may get the last subscription request while > the 'send-work' is shutting down the connection due to peer close. > > We had a 'lock' on the connection, so the two actions cannot be carried > out simultaneously. If the last subscription is allocated e.g. 'sub-n', > before the 'send-work' closes the connection, there will be no issue at > all, the 'sub' objects will be freed. In contrast the last subscription > will become orphan since the connection was closed, and we released all > references. > > This commit fixes the issue by simply adding one test if the connection > remains in 'connected' state soon after we obtain the connection's lock > then a subscription object can be created as usual, otherwise we ignore > it. 
> > Reported-by: Thang Ngo <tha...@de...> > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/topsrv.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c > index 73dbed0c4b6b..399a89f6f1bf 100644 > --- a/net/tipc/topsrv.c > +++ b/net/tipc/topsrv.c > @@ -400,7 +400,9 @@ static int tipc_conn_rcv_from_sock(struct tipc_conn *con) > return -EWOULDBLOCK; > if (ret == sizeof(s)) { > read_lock_bh(&sk->sk_callback_lock); > - ret = tipc_conn_rcv_sub(srv, con, &s); > + /* RACE: the connection can be closed in meanwhile */ s/in meanwhile/in the meantime/ > + if (likely(connected(con))) > + ret = tipc_conn_rcv_sub(srv, con, &s); > read_unlock_bh(&sk->sk_callback_lock); > if (!ret) > return 0; Acked-by: Jon Maloy <jm...@re...> |
From: Tuong L. <tuo...@de...> - 2020-05-07 11:13:23
When a service subscription is expired or canceled by the user, it needs
to be deleted from the subscription list, so that new subscriptions can
be registered (max = 65535 per net). However, there are two issues in
the code that can cause such an unused subscription to persist:

1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but
it breaks out as soon as the first subscription differs from the one
specified, so the subscription will not be deleted.

2) In case a subscription is canceled, the code to remove the
'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it
is a local subscription (i.e. the little endian isn't involved). So,
there will be no matches when looking for the subscription to delete
later.

The subscription(s) will be removed eventually when the user terminates
its topology connection, but that could be a long time later. Meanwhile,
the number of available subscriptions may be exhausted.

This commit fixes the two issues above, so that a subscription can be
deleted correctly when needed.

Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/topsrv.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 399a89f6f1bf..44609238849e 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -237,8 +237,6 @@ static void tipc_conn_delete_sub(struct tipc_conn *con, struct tipc_subscr *s)
 		if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) {
 			tipc_sub_unsubscribe(sub);
 			atomic_dec(&tn->subscription_count);
-		} else if (s) {
-			break;
 		}
 	}
 	spin_unlock_bh(&con->sub_lock);
@@ -364,7 +362,10 @@ static int tipc_conn_rcv_sub(struct tipc_topsrv *srv,
 	struct tipc_subscription *sub;

 	if (tipc_sub_read(s, filter) & TIPC_SUB_CANCEL) {
-		s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL);
+		if (!(s->filter & TIPC_FILTER_MASK))
+			s->filter &= __constant_ntohl(~TIPC_SUB_CANCEL);
+		else
+			s->filter &= ~TIPC_SUB_CANCEL;
 		tipc_conn_delete_sub(con, s);
 		return 0;
 	}
--
2.13.7
From: Tuong L. <tuo...@de...> - 2020-05-07 11:13:21
Upon receipt of a service subscription request from a user via a
topology connection, one 'sub' object is allocated in the kernel, so
that an event of the service, if any, can later be sent to the user.
Also, in case of any failure, the connection will be shut down and all
the pertaining 'sub' objects will be freed.

However, there is a race condition as follows resulting in memory leak:

  receive-work      connection        send-work
        |                |                |
sub-1   |<------//-------|                |
sub-2   |<------//-------|                |
        |                |<---------------| evt for sub-x
sub-3   |<------//-------|                |
        :                :                :
        :                :                :
        |       /--------|                |
        |       |  * peer closed          |
        |       |        |                |
        |       |        |<-------X-------| evt for sub-y
        |       |        |<===============|
sub-n   |<------/        X shutdown       |
  -> orphan              |                |

That is, the 'receive-work' may get the last subscription request while
the 'send-work' is shutting down the connection due to peer close.

We have a 'lock' on the connection, so the two actions cannot be carried
out simultaneously. If the last subscription, e.g. 'sub-n', is allocated
before the 'send-work' closes the connection, there is no issue at all:
the 'sub' objects will be freed. Otherwise, the last subscription
becomes an orphan, since the connection was closed and all references
were released.

This commit fixes the issue by simply adding one test: if the connection
remains in 'connected' state right after we obtain the connection's
lock, a subscription object can be created as usual; otherwise, we
ignore it.
Reported-by: Thang Ngo <tha...@de...>
Signed-off-by: Tuong Lien <tuo...@de...>
---
 net/tipc/topsrv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 73dbed0c4b6b..399a89f6f1bf 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -400,7 +400,9 @@ static int tipc_conn_rcv_from_sock(struct tipc_conn *con)
 		return -EWOULDBLOCK;
 	if (ret == sizeof(s)) {
 		read_lock_bh(&sk->sk_callback_lock);
-		ret = tipc_conn_rcv_sub(srv, con, &s);
+		/* RACE: the connection can be closed in meanwhile */
+		if (likely(connected(con)))
+			ret = tipc_conn_rcv_sub(srv, con, &s);
 		read_unlock_bh(&sk->sk_callback_lock);
 		if (!ret)
 			return 0;
--
2.13.7
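[Editor's note] The essence of the fix above — re-checking the connection state before registering a subscription — can be sketched in isolation. This is a single-threaded approximation with illustrative names, not the kernel code: in the kernel, both the receive path and the shutdown path run under sk->sk_callback_lock, which is what makes the re-check race-free; the lock is elided here.

```c
#include <stdbool.h>
#include <assert.h>

struct conn_sketch {
	bool connected;
	int sub_cnt;	/* live subscriptions; a leak would leave this > 0 */
};

/* Receive path: create a subscription only while still connected. */
static int rcv_sub_checked(struct conn_sketch *c)
{
	if (!c->connected)
		return -1;	/* peer already gone: ignore, no orphan */
	c->sub_cnt++;		/* stands in for tipc_conn_rcv_sub() */
	return 0;
}

/* Send path on peer close: tear down and free all subscriptions. */
static void conn_shutdown(struct conn_sketch *c)
{
	c->connected = false;
	c->sub_cnt = 0;		/* stands in for tipc_conn_delete_sub() */
}
```

Without the `connected` test, a request arriving after `conn_shutdown()` would still bump `sub_cnt`, leaving exactly the orphan the commit message's diagram shows.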
From: Tuong L. <tuo...@de...> - 2020-05-07 11:13:19
Hi Jon, all,

There are a couple of issues with our service subscriptions and here
are the bug fixes. Please help review before I send them to net.
Thanks a lot!

BR/Tuong

Tuong Lien (2):
  tipc: fix memory leak in service subscripting
  tipc: fix failed service subscription deletion

 net/tipc/topsrv.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--
2.13.7
From: Tuong T. L. <tuo...@de...> - 2020-05-05 11:36:09
-----Original Message----- From: Jon Maloy <jm...@re...> Sent: Tuesday, May 5, 2020 1:13 AM To: Tuong Tong Lien <tuo...@de...>; ma...@do...; yin...@wi...; tip...@li... Cc: tipc-dek <tip...@de...> Subject: Re: [RFC PATCH 2/2] tipc: add test for Nagle algorithm effectiveness On 5/4/20 7:28 AM, Tuong Lien wrote: > When streaming in Nagle mode, we try to bundle small messages from user > as many as possible if there is one outstanding buffer, i.e. not ACK-ed > by the receiving side, which helps boost up the overall throughput. So, > the algorithm's effectiveness really depends on when Nagle ACK comes or > what the specific network latency (RTT) is, compared to the user's > message sending rate. > > In a bad case, the user's sending rate is low or the network latency is > small, there will not be many bundles, so making a Nagle ACK or waiting > for it is not meaningful. > For example: a user sends its messages every 100ms and the RTT is 50ms, > then for each messages, we require one Nagle ACK but then there is only > one user message sent without any bundles. > > In a better case, even if we have a few bundles (e.g. the RTT = 300ms), > but now the user sends messages in medium size, then there will not be > any difference at all, that says 3 x 1000-byte data messages if bundled > will still result in 3 bundles with MTU = 1500. > > When Nagle is ineffective, the delay in user message sending is clearly > wasted instead of sending directly. > > Besides, adding Nagle ACKs will consume some processor load on both the > sending and receiving sides. > > This commit adds a test on the effectiveness of the Nagle algorithm for > an individual connection in the network on which it actually runs. > Particularly, upon receipt of a Nagle ACK we will compare the number of > bundles in the backlog queue to the number of user messages which would > be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle > mode will be kept for further message sending. 
Otherwise, we will leave > Nagle and put a 'penalty' on the connection, so it will have to spend > more 'one-way' messages before being able to re-enter Nagle. > > In addition, the 'ack-required' bit is only set when really needed that > the number of Nagle ACKs will be reduced during Nagle mode. > > Testing with benchmark showed that with the patch, there was not much > difference in throughput for small messages since the tool continuously > sends messages without a break, so Nagle would still take in effect. > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/msg.c | 3 --- > net/tipc/msg.h | 14 +++++++++++-- > net/tipc/socket.c | 60 ++++++++++++++++++++++++++++++++++++++++++++----------- > 3 files changed, 60 insertions(+), 17 deletions(-) > > diff --git a/net/tipc/msg.c b/net/tipc/msg.c > index 69d68512300a..732cd95b5c75 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c > @@ -235,9 +235,6 @@ int tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen, > msg_set_size(hdr, MIN_H_SIZE); > __skb_queue_tail(txq, skb); > total += 1; > - if (prev) > - msg_set_ack_required(buf_msg(prev), 0); > - msg_set_ack_required(hdr, 1); > } > hdr = buf_msg(skb); > curr = msg_blocks(hdr); > diff --git a/net/tipc/msg.h b/net/tipc/msg.h > index 5f37a611e8c9..44543892af11 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -340,9 +340,19 @@ static inline int msg_ack_required(struct tipc_msg *m) > return msg_bits(m, 0, 18, 1); > } > > -static inline void msg_set_ack_required(struct tipc_msg *m, u32 d) > +static inline void msg_set_ack_required(struct tipc_msg *m) > { > - msg_set_bits(m, 0, 18, 1, d); > + msg_set_bits(m, 0, 18, 1, 1); > +} > + > +static inline int msg_nagle_ack(struct tipc_msg *m) > +{ > + return msg_bits(m, 0, 18, 1); > +} > + > +static inline void msg_set_nagle_ack(struct tipc_msg *m) > +{ > + msg_set_bits(m, 0, 18, 1, 1); > } > > static inline bool msg_is_rcast(struct tipc_msg *m) > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > 
index 4e71774528ad..93b0a6159cb1 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -119,7 +119,10 @@ struct tipc_sock { > struct rcu_head rcu; > struct tipc_group *group; > u32 oneway; > + u32 nagle_start; > u16 snd_backlog; > + u16 msg_acc; > + u16 pkt_cnt; > bool expect_ack; > bool nodelay; > bool group_is_open; > @@ -143,7 +146,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk); > static void tipc_sk_remove(struct tipc_sock *tsk); > static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz); > static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz); > -static void tipc_sk_push_backlog(struct tipc_sock *tsk); > +static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack); > > static const struct proto_ops packet_ops; > static const struct proto_ops stream_ops; > @@ -474,6 +477,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, > tsk = tipc_sk(sk); > tsk->max_pkt = MAX_PKT_DEFAULT; > tsk->maxnagle = 0; > + tsk->nagle_start = 4; > INIT_LIST_HEAD(&tsk->publications); > INIT_LIST_HEAD(&tsk->cong_links); > msg = &tsk->phdr; > @@ -541,7 +545,7 @@ static void __tipc_shutdown(struct socket *sock, int error) > !tsk_conn_cong(tsk))); > > /* Push out delayed messages if in Nagle mode */ > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > /* Remove pending SYN */ > __skb_queue_purge(&sk->sk_write_queue); > > @@ -1252,14 +1256,37 @@ void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq, > /* tipc_sk_push_backlog(): send accumulated buffers in socket write queue > * when socket is in Nagle mode > */ > -static void tipc_sk_push_backlog(struct tipc_sock *tsk) > +static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack) > { > struct sk_buff_head *txq = &tsk->sk.sk_write_queue; > + struct sk_buff *skb = skb_peek_tail(txq); > struct net *net = sock_net(&tsk->sk); > u32 dnode = tsk_peer_node(tsk); > - struct sk_buff *skb = skb_peek(txq); > int rc; > 
> + if (nagle_ack) { > + tsk->pkt_cnt += skb_queue_len(txq); > + if (!tsk->pkt_cnt || tsk->msg_acc / tsk->pkt_cnt < 2) { > + tsk->oneway = 0; > + if (tsk->nagle_start < 1000) > + tsk->nagle_start *= 2; > + tsk->expect_ack = false; > + pr_debug("tsk %10u: bad nagle %u -> %u, next start %u!\n", > + tsk->portid, tsk->msg_acc, tsk->pkt_cnt, > + tsk->nagle_start); > + } else { > + tsk->nagle_start = 4; > + if (skb) { > + msg_set_ack_required(buf_msg(skb)); > + tsk->expect_ack = true; > + } else { > + tsk->expect_ack = false; > + } > + } > + tsk->msg_acc = 0; > + tsk->pkt_cnt = 0; > + } > + > if (!skb || tsk->cong_link_cnt) > return; > > @@ -1267,9 +1294,10 @@ static void tipc_sk_push_backlog(struct tipc_sock *tsk) > if (msg_is_syn(buf_msg(skb))) > return; > > + if (tsk->msg_acc) > + tsk->pkt_cnt += skb_queue_len(txq); > tsk->snt_unacked += tsk->snd_backlog; > tsk->snd_backlog = 0; > - tsk->expect_ack = true; > rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > if (rc == -ELINKCONG) > tsk->cong_link_cnt = 1; > @@ -1322,8 +1350,7 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, > return; > } else if (mtyp == CONN_ACK) { > was_cong = tsk_conn_cong(tsk); > - tsk->expect_ack = false; > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, msg_nagle_ack(hdr)); > tsk->snt_unacked -= msg_conn_ack(hdr); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > tsk->snd_win = msg_adv_win(hdr); > @@ -1544,17 +1571,24 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) > break; > send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); > blocks = tsk->snd_backlog; > - if (tsk->oneway++ >= 4 && send <= maxnagle) { > + if (tsk->oneway++ >= tsk->nagle_start && send <= maxnagle) { > rc = tipc_msg_append(hdr, m, send, maxnagle, txq); > if (unlikely(rc < 0)) > break; > blocks += rc; > + tsk->msg_acc++; > if (blocks <= 64 && tsk->expect_ack) { > tsk->snd_backlog = blocks; > sent += send; > break; > + } else if (blocks > 64) { > 
+ tsk->pkt_cnt += skb_queue_len(txq); > + } else { > + msg_set_ack_required(buf_msg(skb_peek_tail(txq))); > + tsk->expect_ack = true; > + tsk->msg_acc = 0; > + tsk->pkt_cnt = 0; > } > - tsk->expect_ack = true; > } else { > rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq); > if (unlikely(rc != send)) > @@ -2092,7 +2126,7 @@ static void tipc_sk_proto_rcv(struct sock *sk, > smp_wmb(); > tsk->cong_link_cnt--; > wakeup = true; > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > break; > case GROUP_PROTOCOL: > tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq); > @@ -2181,7 +2215,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb, > return false; > case TIPC_ESTABLISHED: > if (!skb_queue_empty(&sk->sk_write_queue)) > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > /* Accept only connection-based messages sent by peer */ > if (likely(con_msg && !err && pport == oport && > pnode == onode)) { > @@ -2189,8 +2223,10 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb, > struct sk_buff *skb; > > skb = tipc_sk_build_ack(tsk); > - if (skb) > + if (skb) { > + msg_set_nagle_ack(buf_msg(skb)); > __skb_queue_tail(xmitq, skb); > + } > } > return true; > } Nice job. Does this even solve the latency problem you had observed at customer site? [Tuong]: Yes, after applying these two patches, they didn't observe the increase in latency or CPU load in their tests (even though Nagle was still enabled in some way). BR/Tuong Acked-by: Jon Maloy <jm...@re...> |
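[Editor's note] The effectiveness test discussed in this thread boils down to a little per-socket state and one decision taken on each Nagle ACK. A userspace sketch following the patch's logic in tipc_sk_push_backlog() (field names follow the patch; the skb/ack_required handling is omitted, so this is an approximation, not the kernel function):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

struct nagle_state {
	uint32_t nagle_start;	/* one-way msgs required before entering Nagle */
	uint32_t oneway;	/* one-way messages seen so far */
	uint16_t msg_acc;	/* user messages accepted into the backlog */
	uint16_t pkt_cnt;	/* packets actually put on the wire */
	bool expect_ack;
};

/* On receipt of a Nagle ACK: if fewer than 2 user messages were bundled
 * per sent packet, Nagle saved little -- leave Nagle mode and double the
 * re-entry threshold (capped at 1000); otherwise keep it and reset the
 * threshold to its default of 4. */
static void nagle_ack_rcvd(struct nagle_state *st, unsigned int txq_len)
{
	st->pkt_cnt += txq_len;
	if (!st->pkt_cnt || st->msg_acc / st->pkt_cnt < 2) {
		st->oneway = 0;			/* leave Nagle mode */
		if (st->nagle_start < 1000)
			st->nagle_start *= 2;	/* penalty before re-entry */
		st->expect_ack = false;
	} else {
		st->nagle_start = 4;		/* good ratio: keep Nagle */
	}
	st->msg_acc = 0;
	st->pkt_cnt = 0;
}
```

So the 100 ms-sender/50 ms-RTT example from the commit message (one message per ACK, ratio 1) quickly backs the socket out of Nagle, while a sender that genuinely bundles keeps the low entry threshold.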
From: Jon M. <jm...@re...> - 2020-05-05 00:15:16
Hi all, I was pondering a little more about this possible feature. First of all, I realized that the following test bool tipc_msg_validate(struct sk_buff **_skb) { [...] if (unlikely(msg_version(hdr) != TIPC_VERSION)) return false; [...] } makes it very hard to update the version number in a backwards compatible way. Even discovery messages will be rejected by v2 nodes, and we don't get around that unless we do discovery with v2 messages, or send out a duplicate set (v2 +v3) of discovery messages. And, we can actually achieve exactly what we want with just setting another capability bit. So, we set bit #12 to mean "TIPC_EXTENDED", to also to mean "all previous capabilities are valid if this bit is set, no need to test for it" That way, we can zero out these bits and start reusing them for new capabilities when we need to. AF_TIPC3 now becomes AF_TIPCE, tipc_addr becomes tipce_addr etc. The binding table needs to be updated the following way: union publ_item { struct { __be32 type; __be32 lower; __be32 upper; } legacy; struct { u8 type[16]; u8 lower[16]; u8 upper[16]; } extended; }; struct publication { u8 extended; u8 scope; /* This can only take values [0:3] */ u8 spare[2]; union publ_item publ; u8 node[16]; u32 port; u32 key; struct list_head binding_node; struct list_head binding_sock; struct list_head local_publ; struct list_head all_publ; struct rcu_head rcu; }; struct distr_item { union publ_item; __be32 port; __be32 key; }; The NAME_DISTR protocol must be extended with a field indicating if it contains legacy publication(s) or extended publication(s). 'Extended' nodes receive separate bulks for legacy and extended publications, since it is hard to mix them in the same message. Legacy nodes only receive legacy publications, so in this case the distributor just send a bulk for those. 
The topology subscriber must be updated in a similar manner, but we can assume that the same socket cannot issue two types of subscriptions and receive two types of events; it has to be on or the other. This should simplify the task somewhat. User message header format needs to be changed for Service Address (Port Name) messages: - Type occupies word [8:B], i.e. bytes [32:47] - Instance occupies word [C:F], i.e. bytes [48:64] This is where it gets tricky. The 'header size' field is only 4 bits and counts 32-bit words. This means that current max header size that can be indicated is 60 bytes. A simple way might be to just extend the field with one of the tree unused bits [16:18] in word 1 as msb. That would be backwards compatible since those bits are currently 0, and no special tricks are needed. Regarding TIPC_MCAST_MSG we need yet another 16 bytes, [65:80] if we want to preserve the current semantics on [lower,upper]. However, I am highly uncertain if that feature is ever used and needed. We may be good by just keeping one 'instance' field just as in NAMED messages. The group cast protocol could be left for later, once we understand the consequences better than now, but semantically it should work just like now, except with a longer header and type/instance fields. It would also be nice if the 16 byte node identity replaces the current 4 byte node address/number in all interaction with user land, inclusive the presentation of the neighbor monitoring status in monitor.c. That can possibly also be left for later. Finally, would it be possible to mark a socket at 'legacy' or 'extended' without adding a new AF_TIPCE value? If this can be done in a not-too-ugly way it might be worth considering. ///jon On 4/27/20 9:53 PM, jm...@re... wrote: > From: Jon Maloy <jm...@re...> > > TIPC would be more attractive in a modern user environment such > as Kubernetes if it could provide a larger address range. 
> > Advantages: > - Users could directly use UUIDs, strings or other values as service > instances types and instances. > - No more risk of collisions between randomly selected service types > > The effect on the TIPC implementation and protocol would be significant, > but this is still worth considering. > --- > include/linux/socket.h | 5 ++- > include/uapi/linux/tipc3.h | 79 ++++++++++++++++++++++++++++++++++++++ > 2 files changed, 82 insertions(+), 2 deletions(-) > create mode 100644 include/uapi/linux/tipc3.h > > diff --git a/include/linux/socket.h b/include/linux/socket.h > index 54338fac45cb..ff2268ceedaf 100644 > --- a/include/linux/socket.h > +++ b/include/linux/socket.h > @@ -209,8 +209,8 @@ struct ucred { > * reuses AF_INET address family > */ > #define AF_XDP 44 /* XDP sockets */ > - > -#define AF_MAX 45 /* For now.. */ > +#define AF_TIPC3 45 /* TIPC version 3 sockets */ > +#define AF_MAX 46 /* For now.. */ > > /* Protocol families, same as address families. */ > #define PF_UNSPEC AF_UNSPEC > @@ -260,6 +260,7 @@ struct ucred { > #define PF_QIPCRTR AF_QIPCRTR > #define PF_SMC AF_SMC > #define PF_XDP AF_XDP > +#define PF_TIPC3 AF_TIPC3 > #define PF_MAX AF_MAX > > /* Maximum queue length specifiable by listen. */ > diff --git a/include/uapi/linux/tipc3.h b/include/uapi/linux/tipc3.h > new file mode 100644 > index 000000000000..0d385bc41b66 > --- /dev/null > +++ b/include/uapi/linux/tipc3.h > @@ -0,0 +1,79 @@ > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ > +/* > + * include/uapi/linux/tipc3.h: Header for TIPC v3 socket interface > + * > + * Copyright (c) 2020 Red Hat Inc > + * All rights reserved. > + * > + * Redistribution and use in source and binary forms, with or without > + * modification, are permitted provided that the following conditions are met: > + * > + * 1. Redistributions of source code must retain the above copyright > + * notice, this list of conditions and the following disclaimer. > + * 2. 
Redistributions in binary form must reproduce the above copyright > + * notice, this list of conditions and the following disclaimer in the > + * documentation and/or other materials provided with the distribution. > + * 3. Neither the names of the copyright holders nor the names of its > + * contributors may be used to endorse or promote products derived from > + * this software without specific prior written permission. > + * > + * Alternatively, this software may be distributed under the terms of the > + * GNU General Public License ("GPL") version 2 as published by the Free > + * Software Foundation. > + * > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE > + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR > + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF > + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS > + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN > + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) > + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE > + * POSSIBILITY OF SUCH DAMAGE. 
> + */ > + > +#ifndef _LINUX_TIPC3_H_ > +#define _LINUX_TIPC3_H_ > + > +#include <linux/types.h> > +#include <linux/sockios.h> > +#include <linux/tipc.h> > + > +struct tipc3_addr { > + __u8[16] type; /* zero if socket address */ > + __u8[16] instance; /* port if socket address */ > + __u8[16] node; /* zero if whole cluster */ > +}; > + > +struct tipc3_subscr { > + __u8[16] type; > + __u8[16] lower; > + __u8[16] upper; > + __u8[16] node; > + __u32 timeout; /* subscription duration (in ms) */ > + __u32 filter; /* bitmask of filter options */ > + __u8 usr_handle[16]; /* available for subscriber use */ > +}; > + > +struct tipc3_event { > + __u8[16] lower; /* matching range */ > + __u8[16] upper; /* " " */ > + struct tipc3_addr socket; /* associated socket */ > + struct tipc2_subscr sub; /* associated subscription */ > + __u32 event; /* event type */ > +}; > + > +struct sockaddr_tipc3 { > + unsigned short family; > + bool mcast; > + struct tipc3_addr addr; > +}; > + > +struct tipc3_group_req { > + struct tipc3_addr addr; > + __u32 flags; > +}; > + > +#endif |
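[Editor's note] For reference, a compilable sketch of the two central structures proposed above. Declarations like `__u8[16] type;` in the RFC are not valid C — the array size belongs on the declarator (`__u8 type[16];`) — and `struct tipc3_event` refers to `tipc2_subscr`, presumably meaning `tipc3_subscr`. Kernel `__u8`/`__u32` are mapped to stdint types here, so the layout shown is an assumption about intent, not the final uapi:

```c
#include <stdint.h>
#include <assert.h>

struct tipc3_addr {
	uint8_t type[16];	/* zero if socket address */
	uint8_t instance[16];	/* port if socket address */
	uint8_t node[16];	/* zero if whole cluster */
};

struct tipc3_subscr {
	uint8_t type[16];
	uint8_t lower[16];
	uint8_t upper[16];
	uint8_t node[16];
	uint32_t timeout;	/* subscription duration (in ms) */
	uint32_t filter;	/* bitmask of filter options */
	uint8_t usr_handle[16];	/* available for subscriber use */
};
```

With everything byte-arrays plus two naturally aligned 32-bit fields, the structures pack without padding on common ABIs: 48 bytes for the address and 88 bytes for the subscription record.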
From: Jon M. <jm...@re...> - 2020-05-04 18:13:37
On 5/4/20 7:28 AM, Tuong Lien wrote: > When streaming in Nagle mode, we try to bundle small messages from user > as many as possible if there is one outstanding buffer, i.e. not ACK-ed > by the receiving side, which helps boost up the overall throughput. So, > the algorithm's effectiveness really depends on when Nagle ACK comes or > what the specific network latency (RTT) is, compared to the user's > message sending rate. > > In a bad case, the user's sending rate is low or the network latency is > small, there will not be many bundles, so making a Nagle ACK or waiting > for it is not meaningful. > For example: a user sends its messages every 100ms and the RTT is 50ms, > then for each messages, we require one Nagle ACK but then there is only > one user message sent without any bundles. > > In a better case, even if we have a few bundles (e.g. the RTT = 300ms), > but now the user sends messages in medium size, then there will not be > any difference at all, that says 3 x 1000-byte data messages if bundled > will still result in 3 bundles with MTU = 1500. > > When Nagle is ineffective, the delay in user message sending is clearly > wasted instead of sending directly. > > Besides, adding Nagle ACKs will consume some processor load on both the > sending and receiving sides. > > This commit adds a test on the effectiveness of the Nagle algorithm for > an individual connection in the network on which it actually runs. > Particularly, upon receipt of a Nagle ACK we will compare the number of > bundles in the backlog queue to the number of user messages which would > be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle > mode will be kept for further message sending. Otherwise, we will leave > Nagle and put a 'penalty' on the connection, so it will have to spend > more 'one-way' messages before being able to re-enter Nagle. 
> > In addition, the 'ack-required' bit is only set when really needed that > the number of Nagle ACKs will be reduced during Nagle mode. > > Testing with benchmark showed that with the patch, there was not much > difference in throughput for small messages since the tool continuously > sends messages without a break, so Nagle would still take in effect. > > Signed-off-by: Tuong Lien <tuo...@de...> > --- > net/tipc/msg.c | 3 --- > net/tipc/msg.h | 14 +++++++++++-- > net/tipc/socket.c | 60 ++++++++++++++++++++++++++++++++++++++++++++----------- > 3 files changed, 60 insertions(+), 17 deletions(-) > > diff --git a/net/tipc/msg.c b/net/tipc/msg.c > index 69d68512300a..732cd95b5c75 100644 > --- a/net/tipc/msg.c > +++ b/net/tipc/msg.c > @@ -235,9 +235,6 @@ int tipc_msg_append(struct tipc_msg *_hdr, struct msghdr *m, int dlen, > msg_set_size(hdr, MIN_H_SIZE); > __skb_queue_tail(txq, skb); > total += 1; > - if (prev) > - msg_set_ack_required(buf_msg(prev), 0); > - msg_set_ack_required(hdr, 1); > } > hdr = buf_msg(skb); > curr = msg_blocks(hdr); > diff --git a/net/tipc/msg.h b/net/tipc/msg.h > index 5f37a611e8c9..44543892af11 100644 > --- a/net/tipc/msg.h > +++ b/net/tipc/msg.h > @@ -340,9 +340,19 @@ static inline int msg_ack_required(struct tipc_msg *m) > return msg_bits(m, 0, 18, 1); > } > > -static inline void msg_set_ack_required(struct tipc_msg *m, u32 d) > +static inline void msg_set_ack_required(struct tipc_msg *m) > { > - msg_set_bits(m, 0, 18, 1, d); > + msg_set_bits(m, 0, 18, 1, 1); > +} > + > +static inline int msg_nagle_ack(struct tipc_msg *m) > +{ > + return msg_bits(m, 0, 18, 1); > +} > + > +static inline void msg_set_nagle_ack(struct tipc_msg *m) > +{ > + msg_set_bits(m, 0, 18, 1, 1); > } > > static inline bool msg_is_rcast(struct tipc_msg *m) > diff --git a/net/tipc/socket.c b/net/tipc/socket.c > index 4e71774528ad..93b0a6159cb1 100644 > --- a/net/tipc/socket.c > +++ b/net/tipc/socket.c > @@ -119,7 +119,10 @@ struct tipc_sock { > struct rcu_head rcu; > 
struct tipc_group *group; > u32 oneway; > + u32 nagle_start; > u16 snd_backlog; > + u16 msg_acc; > + u16 pkt_cnt; > bool expect_ack; > bool nodelay; > bool group_is_open; > @@ -143,7 +146,7 @@ static int tipc_sk_insert(struct tipc_sock *tsk); > static void tipc_sk_remove(struct tipc_sock *tsk); > static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dsz); > static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dsz); > -static void tipc_sk_push_backlog(struct tipc_sock *tsk); > +static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack); > > static const struct proto_ops packet_ops; > static const struct proto_ops stream_ops; > @@ -474,6 +477,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, > tsk = tipc_sk(sk); > tsk->max_pkt = MAX_PKT_DEFAULT; > tsk->maxnagle = 0; > + tsk->nagle_start = 4; > INIT_LIST_HEAD(&tsk->publications); > INIT_LIST_HEAD(&tsk->cong_links); > msg = &tsk->phdr; > @@ -541,7 +545,7 @@ static void __tipc_shutdown(struct socket *sock, int error) > !tsk_conn_cong(tsk))); > > /* Push out delayed messages if in Nagle mode */ > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > /* Remove pending SYN */ > __skb_queue_purge(&sk->sk_write_queue); > > @@ -1252,14 +1256,37 @@ void tipc_sk_mcast_rcv(struct net *net, struct sk_buff_head *arrvq, > /* tipc_sk_push_backlog(): send accumulated buffers in socket write queue > * when socket is in Nagle mode > */ > -static void tipc_sk_push_backlog(struct tipc_sock *tsk) > +static void tipc_sk_push_backlog(struct tipc_sock *tsk, bool nagle_ack) > { > struct sk_buff_head *txq = &tsk->sk.sk_write_queue; > + struct sk_buff *skb = skb_peek_tail(txq); > struct net *net = sock_net(&tsk->sk); > u32 dnode = tsk_peer_node(tsk); > - struct sk_buff *skb = skb_peek(txq); > int rc; > > + if (nagle_ack) { > + tsk->pkt_cnt += skb_queue_len(txq); > + if (!tsk->pkt_cnt || tsk->msg_acc / tsk->pkt_cnt < 2) { > + tsk->oneway = 0; > + if 
(tsk->nagle_start < 1000) > + tsk->nagle_start *= 2; > + tsk->expect_ack = false; > + pr_debug("tsk %10u: bad nagle %u -> %u, next start %u!\n", > + tsk->portid, tsk->msg_acc, tsk->pkt_cnt, > + tsk->nagle_start); > + } else { > + tsk->nagle_start = 4; > + if (skb) { > + msg_set_ack_required(buf_msg(skb)); > + tsk->expect_ack = true; > + } else { > + tsk->expect_ack = false; > + } > + } > + tsk->msg_acc = 0; > + tsk->pkt_cnt = 0; > + } > + > if (!skb || tsk->cong_link_cnt) > return; > > @@ -1267,9 +1294,10 @@ static void tipc_sk_push_backlog(struct tipc_sock *tsk) > if (msg_is_syn(buf_msg(skb))) > return; > > + if (tsk->msg_acc) > + tsk->pkt_cnt += skb_queue_len(txq); > tsk->snt_unacked += tsk->snd_backlog; > tsk->snd_backlog = 0; > - tsk->expect_ack = true; > rc = tipc_node_xmit(net, txq, dnode, tsk->portid); > if (rc == -ELINKCONG) > tsk->cong_link_cnt = 1; > @@ -1322,8 +1350,7 @@ static void tipc_sk_conn_proto_rcv(struct tipc_sock *tsk, struct sk_buff *skb, > return; > } else if (mtyp == CONN_ACK) { > was_cong = tsk_conn_cong(tsk); > - tsk->expect_ack = false; > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, msg_nagle_ack(hdr)); > tsk->snt_unacked -= msg_conn_ack(hdr); > if (tsk->peer_caps & TIPC_BLOCK_FLOWCTL) > tsk->snd_win = msg_adv_win(hdr); > @@ -1544,17 +1571,24 @@ static int __tipc_sendstream(struct socket *sock, struct msghdr *m, size_t dlen) > break; > send = min_t(size_t, dlen - sent, TIPC_MAX_USER_MSG_SIZE); > blocks = tsk->snd_backlog; > - if (tsk->oneway++ >= 4 && send <= maxnagle) { > + if (tsk->oneway++ >= tsk->nagle_start && send <= maxnagle) { > rc = tipc_msg_append(hdr, m, send, maxnagle, txq); > if (unlikely(rc < 0)) > break; > blocks += rc; > + tsk->msg_acc++; > if (blocks <= 64 && tsk->expect_ack) { > tsk->snd_backlog = blocks; > sent += send; > break; > + } else if (blocks > 64) { > + tsk->pkt_cnt += skb_queue_len(txq); > + } else { > + msg_set_ack_required(buf_msg(skb_peek_tail(txq))); > + tsk->expect_ack = true; > + 
tsk->msg_acc = 0; > + tsk->pkt_cnt = 0; > } > - tsk->expect_ack = true; > } else { > rc = tipc_msg_build(hdr, m, sent, send, maxpkt, txq); > if (unlikely(rc != send)) > @@ -2092,7 +2126,7 @@ static void tipc_sk_proto_rcv(struct sock *sk, > smp_wmb(); > tsk->cong_link_cnt--; > wakeup = true; > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > break; > case GROUP_PROTOCOL: > tipc_group_proto_rcv(grp, &wakeup, hdr, inputq, xmitq); > @@ -2181,7 +2215,7 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb, > return false; > case TIPC_ESTABLISHED: > if (!skb_queue_empty(&sk->sk_write_queue)) > - tipc_sk_push_backlog(tsk); > + tipc_sk_push_backlog(tsk, false); > /* Accept only connection-based messages sent by peer */ > if (likely(con_msg && !err && pport == oport && > pnode == onode)) { > @@ -2189,8 +2223,10 @@ static bool tipc_sk_filter_connect(struct tipc_sock *tsk, struct sk_buff *skb, > struct sk_buff *skb; > > skb = tipc_sk_build_ack(tsk); > - if (skb) > + if (skb) { > + msg_set_nagle_ack(buf_msg(skb)); > __skb_queue_tail(xmitq, skb); > + } > } > return true; > } Nice job. Does this even solve the latency problem you had observed at customer site? Acked-by: Jon Maloy <jm...@re...> |
From: David M. <da...@da...> - 2020-05-04 17:41:40
From: Tuong Lien <tuo...@de...> Date: Mon, 4 May 2020 11:15:54 +0700 > When an application connects to the TIPC topology server and subscribes > to some services, a new connection is created along with some objects - > 'tipc_subscription' to store related data correspondingly... > However, there is one omission in the connection handling that when the > connection or application is orderly shutdown (e.g. via SIGQUIT, etc.), > the connection is not closed in kernel, the 'tipc_subscription' objects > are not freed too. > This results in: > - The maximum number of subscriptions (65535) will be reached soon, new > subscriptions will be rejected; > - TIPC module cannot be removed (unless the objects are somehow forced > to release first); > > The commit fixes the issue by closing the connection if the 'recvmsg()' > returns '0' i.e. when the peer is shutdown gracefully. It also includes > the other unexpected cases. > > Acked-by: Jon Maloy <jm...@re...> > Acked-by: Ying Xue <yin...@wi...> > Signed-off-by: Tuong Lien <tuo...@de...> Applied, thanks. |