From: Jon M. <jm...@re...> - 2020-05-21 15:04:05
Hi Ying,

Thinking more about this, isn't the perfect solution 64-bit addresses,
*plus* a topology server that delivers events according to the consensus
model, so that all users receive a consistent view of the cluster and
application topology? Then we could go even further and add virtual
synchrony on top of that, but that would have to be a selectable
property, I think.

I had a quick look at Canonical's RAFT implementation
(https://github.com/canonical/raft.git). It is 18,000 lines of code,
i.e., larger than the whole of TIPC (14,000 lines), so it is a big chunk.
I also found another one at https://github.com/willemt/raft.git that is
only 3,500 lines of code, although of uncertain quality, of course.
I suspect that a large part of that code deals with neighbor monitoring,
failure detection etc., i.e., services that the topology server already
gets for free. It would be very interesting to see whether this code
could be reduced to an acceptable volume for inclusion in the TIPC
module.

///jon


On 5/12/20 4:38 PM, Jon Maloy wrote:
> Hi Ying,
> I have for several years claimed that TIPC and RAFT would be the
> perfect combination.
> But what I had in mind was a more general-purpose user-land RAFT
> consensus service on top of TIPC, where TIPC provides properties that
> make RAFT easier to implement and give it shorter
> discovery/convergence times during topology changes.
>
> By adding RAFT as a service *provided by* TIPC, I imagine you mean
> something similar to (or even an extension of) the topology server we
> have now. This had not occurred to me, but it might not be a bad idea.
> It all boils down to how complex it would be, and how much more code
> we would need to add to TIPC, vs. the benefit we get. Your colleague
> Allan Stevens used to say that TIPC needs a new "unique selling point"
> to reach a wider adoption than it has now, and this might indeed be
> it.
> Maybe a prototype to be done by somebody at WindRiver, DEK or RH?
>
> What I don't quite understand is that you present this as an
> alternative to the 128-bit address scheme. We need that anyway as I
> see it. Uniqueness for service types is only part of the reason for
> that proposal. It is also about practicality of use, e.g., letting
> users use strings, database keys, disk partitions etc. directly as
> service instances and hence avoiding any lookup/translation steps. I
> did not imagine that TIPC itself would generate any UUIDs or service
> type/instance values; that part is still left to the user.
>
> I think the risk of service type collisions inside a cluster, if
> generated properly by the user, is so low that it alone would not be
> enough to justify such a large extension to TIPC. But if the user could
> have 100% guaranteed uniqueness as a side effect of such a new service
> it would of course be nice. I assume you have more than just TIPC
> address uniqueness in mind with this proposal?
>
> Regards
> ///jon
>
>
> On 5/12/20 5:45 AM, Xue, Ying wrote:
>> Hi Jon,
>>
>> Sorry for the late response.
>>
>> Before I reply to your comments below, I want to discuss a more
>> general question:
>>
>> Since you posted the idea of using UUIDs to resolve the service name
>> conflict issue, I have been wondering whether we could implement the
>> Raft consensus algorithm (https://raft.github.io/) inside the TIPC
>> module. In my opinion, each approach has its own advantages and
>> disadvantages:
>>
>> UUID:
>> Advantages:
>> - Its generation algorithm is straightforward, and the Linux kernel
>>   already has interfaces for generating UUIDs.
>> Disadvantages:
>> - Protocol backwards compatibility
>> - UUID generation algorithms cannot 100% guarantee that UUIDs are
>>   never repeated, particularly in a distributed environment,
>>   although the probability of a repeated UUID is very small.
>>
>> Raft:
>> Advantages:
>> - Can 100% guarantee that service names don't conflict with each
>>   other in a distributed environment
>> - No protocol backwards compatibility issue
>> Disadvantages:
>> - Compared to the changes brought by UUID, the changes might not be
>>   very big, but it is quite complex, particularly when we try to
>>   implement the algorithm in kernel space.
>>
>> I would love to hear your opinion.
>>
>> Thanks,
>> Ying
>>
>> -----Original Message-----
>> From: Jon Maloy [mailto:jm...@re...]
>> Sent: Tuesday, May 5, 2020 8:15 AM
>> To: tip...@li...
>> Cc: tun...@de...; hoa...@de...;
>> tuo...@de...; ma...@do...; xi...@re...; Xue,
>> Ying; par...@gm...
>> Subject: Re: [RFC PATCH] tipc: define TIPC version 3 address types
>>
>> Hi all,
>> I was pondering a little more about this possible feature.
>> First of all, I realized that the following test
>>
>> bool tipc_msg_validate(struct sk_buff **_skb)
>> {
>>         [...]
>>         if (unlikely(msg_version(hdr) != TIPC_VERSION))
>>                 return false;
>>         [...]
>> }
>> makes it very hard to update the version number in a
>> backwards compatible way. Even discovery messages
>> will be rejected by v2 nodes, and we don't get around
>> that unless we do discovery with v2 messages, or send
>> out a duplicate set (v2 + v3) of discovery messages.
>> And, we can actually achieve exactly what we want
>> by just setting another capability bit.
>> So, we set bit #12 to mean "TIPC_EXTENDED", and also to
>> mean "all previous capabilities are valid if this bit is set,
>> no need to test for them".
>> That way, we can zero out these bits and start reusing them
>> for new capabilities when we need to.
>>
>> AF_TIPC3 now becomes AF_TIPCE, tipc_addr becomes
>> tipce_addr etc.
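
To illustrate the capability-bit idea quoted above, here is a minimal
user-space sketch. Only bit #12 and its "implies all previous
capabilities" meaning come from the mail; the names TIPC_EXTENDED_BIT,
LEGACY_CAPS_MASK and peer_has_cap() are hypothetical and are not taken
from the TIPC sources. The sketch covers only the first phase, before
any of the low bits are reused for new capabilities.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only "bit #12" is taken from the discussion. */
#define TIPC_EXTENDED_BIT  12u
#define TIPC_EXTENDED      (1u << TIPC_EXTENDED_BIT)
/* Bits 0..11: the capabilities defined before TIPC_EXTENDED. */
#define LEGACY_CAPS_MASK   (TIPC_EXTENDED - 1u)

/*
 * Return true if the peer supports capability 'cap' (a single-bit mask).
 * A peer advertising TIPC_EXTENDED is assumed to support every legacy
 * capability, so bits 0..11 are not tested for such a peer.
 */
static bool peer_has_cap(uint32_t peer_caps, uint32_t cap)
{
	assert(cap != 0);

	if ((peer_caps & TIPC_EXTENDED) && (cap & LEGACY_CAPS_MASK))
		return true;
	return (peer_caps & cap) != 0;
}
```

With this rule an extended node can send its capability word with the
legacy bits cleared, and a legacy node is still tested bit by bit as
today.
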
>>
>> The binding table needs to be updated the following way:
>>
>> union publ_item {
>>         struct {
>>                 __be32 type;
>>                 __be32 lower;
>>                 __be32 upper;
>>         } legacy;
>>         struct {
>>                 u8 type[16];
>>                 u8 lower[16];
>>                 u8 upper[16];
>>         } extended;
>> };
>>
>> struct publication {
>>         u8 extended;
>>         u8 scope; /* This can only take values [0:3] */
>>         u8 spare[2];
>>         union publ_item publ;
>>         u8 node[16];
>>         u32 port;
>>         u32 key;
>>         struct list_head binding_node;
>>         struct list_head binding_sock;
>>         struct list_head local_publ;
>>         struct list_head all_publ;
>>         struct rcu_head rcu;
>> };
>>
>> struct distr_item {
>>         union publ_item publ;
>>         __be32 port;
>>         __be32 key;
>> };
>>
>> The NAME_DISTR protocol must be extended with a field
>> indicating whether it contains legacy publication(s) or extended
>> publication(s).
>> 'Extended' nodes receive separate bulks for legacy and
>> extended publications, since it is hard to mix them in the
>> same message.
>> Legacy nodes only receive legacy publications, so in this
>> case the distributor just sends a bulk for those.
>>
>> The topology subscriber must be updated in a similar
>> manner, but we can assume that the same socket cannot
>> issue two types of subscriptions and receive two types
>> of events; it has to be one or the other. This should
>> simplify the task somewhat.
>>
>> User message header format needs to be changed for
>> Service Address (Port Name) messages:
>> - Type occupies words [8:B], i.e. bytes [32:47]
>> - Instance occupies words [C:F], i.e. bytes [48:63]
>>
>> This is where it gets tricky. The 'header size' field is only 4
>> bits and counts 32-bit words. This means that the current
>> max header size that can be indicated is 60 bytes.
>> A simple way might be to just extend the field with one of
>> the three unused bits [16:18] in word 1 as msb. That would
>> be backwards compatible since those bits are currently 0,
>> and no special tricks are needed.
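
The header-size arithmetic quoted above can be sketched as follows. A
header that carries a 16-byte type and a 16-byte instance ends at byte
63, i.e. it is 64 bytes or 16 words long, which no longer fits in a
4-bit word count (maximum 15 words = 60 bytes). With one extra bit as
msb the count becomes 5 bits, so up to 31 words = 124 bytes. The field
positions below are assumptions for illustration: the mail only gives
the width of the existing field and says the extra bit is one of bits
[16:18] in word 1. All names are hypothetical, and the words are treated
as host-order integers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define HDR_SZ_SHIFT      21u   /* assumed position of the 4-bit field in word 0 */
#define HDR_SZ_MASK       0xfu
#define HDR_SZ_MSB_SHIFT  16u   /* one of the unused bits [16:18] in word 1 */

/* Header size in bytes as a legacy node reads it: a 4-bit word count. */
static uint32_t hdr_sz_legacy(const uint32_t *w)
{
	return ((w[0] >> HDR_SZ_SHIFT) & HDR_SZ_MASK) << 2;
}

/* Header size in bytes with the extra bit in word 1 used as msb. */
static uint32_t hdr_sz_extended(const uint32_t *w)
{
	uint32_t words = (w[0] >> HDR_SZ_SHIFT) & HDR_SZ_MASK;

	words |= ((w[1] >> HDR_SZ_MSB_SHIFT) & 1u) << 4;
	return words << 2;
}

/* Store a header size in bytes; must be a multiple of 4 and at most 124. */
static void set_hdr_sz_extended(uint32_t *w, uint32_t bytes)
{
	uint32_t words = bytes >> 2;

	assert((bytes & 3u) == 0 && words <= 31u);
	w[0] &= ~(HDR_SZ_MASK << HDR_SZ_SHIFT);
	w[0] |= (words & HDR_SZ_MASK) << HDR_SZ_SHIFT;
	w[1] &= ~(1u << HDR_SZ_MSB_SHIFT);
	w[1] |= (words >> 4) << HDR_SZ_MSB_SHIFT;
}
```

For any size up to 60 bytes the msb stays 0 and both readers agree,
which is the backwards-compatibility property the mail relies on. A
64-byte header only decodes correctly with the extended reader.
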
>> Regarding TIPC_MCAST_MSG we need yet another 16 bytes,
>> [64:79], if we want to preserve the current semantics of
>> [lower,upper]. However, I am highly uncertain whether that feature
>> is ever used or needed. We may be fine just keeping
>> one 'instance' field, just as in NAMED messages.
>>
>> The group cast protocol could be left for later, once we understand
>> the consequences better than now, but semantically it should
>> work just like now, except with a longer header and type/instance
>> fields.
>>
>> It would also be nice if the 16-byte node identity replaced the current
>> 4-byte node address/number in all interaction with user land, including
>> the presentation of the neighbor monitoring status in monitor.c.
>> That can possibly also be left for later.
>>
>> Finally, would it be possible to mark a socket as 'legacy' or 'extended'
>> without adding a new AF_TIPCE value? If this can be done in a
>> not-too-ugly way it might be worth considering.
>>
>> ///jon
>>
>>
>> On 4/27/20 9:53 PM, jm...@re... wrote:
>>> From: Jon Maloy <jm...@re...>
>>>
>>> TIPC would be more attractive in a modern user environment such
>>> as Kubernetes if it could provide a larger address range.
>>>
>>> Advantages:
>>> - Users could directly use UUIDs, strings or other values as service
>>>   types and instances.
>>> - No more risk of collisions between randomly selected service types
>>>
>>> The effect on the TIPC implementation and protocol would be
>>> significant, but this is still worth considering.
>>> ---
>>>  include/linux/socket.h     |  5 ++-
>>>  include/uapi/linux/tipc3.h | 79 ++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 82 insertions(+), 2 deletions(-)
>>>  create mode 100644 include/uapi/linux/tipc3.h
>>>
>>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>>> index 54338fac45cb..ff2268ceedaf 100644
>>> --- a/include/linux/socket.h
>>> +++ b/include/linux/socket.h
>>> @@ -209,8 +209,8 @@ struct ucred {
>>>   * reuses AF_INET address family
>>>   */
>>>  #define AF_XDP 44 /* XDP sockets */
>>> -
>>> -#define AF_MAX 45 /* For now.. */
>>> +#define AF_TIPC3 45 /* TIPC version 3 sockets */
>>> +#define AF_MAX 46 /* For now.. */
>>>  /* Protocol families, same as address families. */
>>>  #define PF_UNSPEC AF_UNSPEC
>>> @@ -260,6 +260,7 @@ struct ucred {
>>>  #define PF_QIPCRTR AF_QIPCRTR
>>>  #define PF_SMC AF_SMC
>>>  #define PF_XDP AF_XDP
>>> +#define PF_TIPC3 AF_TIPC3
>>>  #define PF_MAX AF_MAX
>>>  /* Maximum queue length specifiable by listen. */
>>> diff --git a/include/uapi/linux/tipc3.h b/include/uapi/linux/tipc3.h
>>> new file mode 100644
>>> index 000000000000..0d385bc41b66
>>> --- /dev/null
>>> +++ b/include/uapi/linux/tipc3.h
>>> @@ -0,0 +1,79 @@
>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
>>> +/*
>>> + * include/uapi/linux/tipc3.h: Header for TIPC v3 socket interface
>>> + *
>>> + * Copyright (c) 2020 Red Hat Inc
>>> + * All rights reserved.
>>> + *
>>> + * Redistribution and use in source and binary forms, with or without
>>> + * modification, are permitted provided that the following conditions are met:
>>> + *
>>> + * 1. Redistributions of source code must retain the above copyright
>>> + *    notice, this list of conditions and the following disclaimer.
>>> + * 2. Redistributions in binary form must reproduce the above copyright
>>> + *    notice, this list of conditions and the following disclaimer in the
>>> + *    documentation and/or other materials provided with the distribution.
>>> + * 3. Neither the names of the copyright holders nor the names of its
>>> + *    contributors may be used to endorse or promote products derived from
>>> + *    this software without specific prior written permission.
>>> + *
>>> + * Alternatively, this software may be distributed under the terms of the
>>> + * GNU General Public License ("GPL") version 2 as published by the Free
>>> + * Software Foundation.
>>> + *
>>> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
>>> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>> + * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
>>> + * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>>> + * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>>> + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>>> + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>>> + * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>>> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> + * POSSIBILITY OF SUCH DAMAGE.
>>> + */
>>> +
>>> +#ifndef _LINUX_TIPC3_H_
>>> +#define _LINUX_TIPC3_H_
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/sockios.h>
>>> +#include <linux/tipc.h>
>>> +
>>> +struct tipc3_addr {
>>> +        __u8 type[16];     /* zero if socket address */
>>> +        __u8 instance[16]; /* port if socket address */
>>> +        __u8 node[16];     /* zero if whole cluster */
>>> +};
>>> +
>>> +struct tipc3_subscr {
>>> +        __u8 type[16];
>>> +        __u8 lower[16];
>>> +        __u8 upper[16];
>>> +        __u8 node[16];
>>> +        __u32 timeout;       /* subscription duration (in ms) */
>>> +        __u32 filter;        /* bitmask of filter options */
>>> +        __u8 usr_handle[16]; /* available for subscriber use */
>>> +};
>>> +
>>> +struct tipc3_event {
>>> +        __u8 lower[16];           /* matching range */
>>> +        __u8 upper[16];           /* "      " */
>>> +        struct tipc3_addr socket; /* associated socket */
>>> +        struct tipc3_subscr sub;  /* associated subscription */
>>> +        __u32 event;              /* event type */
>>> +};
>>> +
>>> +struct sockaddr_tipc3 {
>>> +        unsigned short family;
>>> +        bool mcast;
>>> +        struct tipc3_addr addr;
>>> +};
>>> +
>>> +struct tipc3_group_req {
>>> +        struct tipc3_addr addr;
>>> +        __u32 flags;
>>> +};
>>> +
>>> +#endif
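
As a rough illustration of how the 16-byte fields in the RFC header
above might be used from user space, the sketch below fills an instance
directly from a string and checks it against a [lower, upper] range.
Everything here is an assumption for illustration: the struct is a local
mirror of the proposed tipc3_addr, the helper names are made up, and the
byte-wise memcmp() ordering is just one possible definition of "lower"
and "upper" for 128-bit values; the thread does not specify one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TIPC3_ID_LEN 16

/* Local mirror of the proposed address layout (hypothetical name). */
struct tipc3_addr_sketch {
	uint8_t type[TIPC3_ID_LEN];
	uint8_t instance[TIPC3_ID_LEN];
	uint8_t node[TIPC3_ID_LEN];
};

/* Copy up to 16 bytes of 'name' into 'id' and zero-pad the remainder. */
static void id_from_string(uint8_t id[TIPC3_ID_LEN], const char *name)
{
	size_t len;

	assert(id != NULL && name != NULL);
	len = strlen(name);
	if (len > TIPC3_ID_LEN)
		len = TIPC3_ID_LEN;
	memset(id, 0, TIPC3_ID_LEN);
	memcpy(id, name, len);
}

/* True if lower <= instance <= upper under byte-wise comparison. */
static bool id_in_range(const uint8_t *instance, const uint8_t *lower,
			const uint8_t *upper)
{
	return memcmp(lower, instance, TIPC3_ID_LEN) <= 0 &&
	       memcmp(instance, upper, TIPC3_ID_LEN) <= 0;
}
```

A string such as a disk partition name can then serve as a service
instance without any lookup or translation step, which is the
practicality argument made earlier in the thread. Names longer than 16
bytes are silently truncated in this sketch, so a real interface would
need to define how such names are handled.
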