Re: [mpls-linux-devel] Jamal's MPLS design document
From: Ramon C. <cas...@in...> - 2003-12-07 14:09:38
James/Jamal/All,

Thank you for submitting this document for review/comments. Some thoughts
right away (more to follow). For the moment, I'll be giving my impressions
w.r.t. the existing implementation and the patches posted on this list.
Disclaimer: these are my own opinions, and in no way are they written in
stone. You'll see that I do not agree with some aspects of the document,
but I am open to discussion.

A general comment (maybe I'm wrong, let me dig a little more into this) is
that it is too hasty to discard the existing implementation without taking
the time to understand it (which is logical, since there was zero
documentation, not even code comments, but that is something I'm working
on). I would like to know in which particular parts the existing
implementation is flawed and cannot be corrected/extended/reviewed. For
the moment, I have not seen a design flaw or a major valid reason to start
from scratch.

> 2. Tables involved:
> We cant ignore these table names because tons of SNMP MIBs exist
> which at least talk about them; implementation is a different
> issue but at least we should be able to somehow semantically match
> them. The tables are the NHLFE, FTN and ILM.
> The code should use similar names when possible.

Agreed. One of the first things I noticed was the (apparent) lack of these
tables in Jim's implementation. Hopefully, the devel-guide will explain
this. What I have understood:

- The NHLFE is in fact split (which I consider a good design) into two
  parts: MPLS Incoming Information (MII for short), a struct that holds
  the set of opcodes to execute upon arrival of the incoming packet, and
  MPLS Outgoing Information (MOI for short), containing the set of opcodes
  to execute when forwarding the packet. This also decouples the case of
  labelled packets that are delivered to the host.

- The ILM table exists (it is in fact the Incoming Radix Tree). The "ILM"
  lookup is indeed mpls_get_mii_by_label. It looks up the corresponding
  NHLFE entry for a label (which is the MII object, see my previous
  paragraph).

So, yes:

- Some existing symbols should be renamed. For example, the function
  mpls_opcode_peek has been renamed (in my tree anyway) to
  mpls_label_entry_peek. In a similar way, we should define a high level
  function like "mpls_ilm_lookup" which takes the topmost label entry and
  gets the corresponding MII (see the sketch after this list). These
  changes are trivial.

- The FTN table is indeed somehow missing, and is basically done in parts
  using mplsadm: add an outgoing label (with an associated MOI, see below)
  and then map some traffic to that label (this was mainly done by hacking
  some net/core parts). So the big missing part is FEC management (at the
  per-MPLS-domain ingress node). We *should not* assume that a FEC is
  always an IPv4/IPv6 address prefix:
  * We need a "generic" FEC encoding, so later we can also integrate
    L2 LSPs, EoMPLS, etc.
  * A new "classifier" part that maps data objects (and not only address
    prefixes) to FECs.
  These points are discussed later in this document.
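Something like the following (a purely illustrative, self-contained
userspace sketch, not the actual kernel code; the struct layouts and the
labelspace argument are my assumptions) is the kind of split and naming I
have in mind:

/* Conceptual sketch only -- not the real mpls-linux structures.
 * The point is the naming: an "ILM lookup" maps the topmost label
 * (+ labelspace) to a MII object, which may reference a MOI when the
 * packet has to be forwarded rather than delivered locally.
 */
#include <stdint.h>
#include <stdio.h>

struct mpls_moi;                        /* outgoing half: opcodes to forward */

struct mpls_mii {                       /* incoming half: opcodes on arrival */
    uint32_t         label;
    int              labelspace;
    struct mpls_moi *moi;               /* NULL when locally delivered */
};

/* Stand-in for the existing radix tree lookup (mpls_get_mii_by_label). */
static struct mpls_mii *mpls_get_mii_by_label(uint32_t label, int labelspace)
{
    (void)label; (void)labelspace;
    return NULL;                        /* the real code walks the radix tree */
}

/* Proposed higher level name: topmost label entry -> corresponding MII. */
static struct mpls_mii *mpls_ilm_lookup(uint32_t topmost_label, int labelspace)
{
    return mpls_get_mii_by_label(topmost_label, labelspace);
}

int main(void)
{
    struct mpls_mii *mii = mpls_ilm_lookup(1000, 0);

    printf("ILM lookup for label 1000: %s\n", mii ? "hit" : "miss");
    return 0;
}

The renaming itself is trivial; what matters is that the lookup key is the
label (+ labelspace), not a FECid.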
>
> ILM and FTN derive a FECid from their respective lookups
> The result (FECid) is then used to lookup
> the NHLFE to determine how to forward the packet.
>
> 2.1 Next Hop Label Forwarding Entry (NHLFE) Table:
> This table is looked up using the FEC as the key (maybe
> + label space) although label spaces are still in the TODO below.
>
> A standard structure for NHLFE contains:
> - FEC id

See my comments (*) below.

> - neighbor information (IPV4/6 + egress interface)

Yes. It is/should be in the MOI part (that is, the "second half" of the
NHLFE).

> - MPLS operations to perform

The MII/MOI opcodes. With the benefit that if the packet is locally
delivered, there is no need to check the MOI.

(*) Comments: I'm afraid I don't agree here. IMHO, we should not add the
"FECid" indirection here. It has several drawbacks:

- NHLFEs are FEC agnostic. The same NHLFE could be reused for different
  FECs. This is necessary, for example, for LSP merging.

- The notion of FEC should only be defined at ingress LSRs.

- W.r.t. the ILM table, the FECid *is* the topmost label itself!
  Explicitly, the label represents the FEC w.r.t. a pair of
  upstream/downstream LSRs. The lookup should be label -> NHLFE (MII
  object). No need to manage FECids (allocation/removal/etc.).

- In some cases it is necessary to establish cross-connects without
  knowing the FEC that will be transported over the LSP (e.g. when
  working at more than 2 hierarchy levels): e.g. incoming label
  (+ labelspace + interface) -> outgoing label + outgoing interface. No
  need to know the FEC here.

With the notion of FECid you have two issues: label management and FECid
management. Let me explain myself a little more here: imagine we have a
simple, well defined FEC F, "all packets with @IPdest = A.B.C.D/N", so it
should have a "locally" unique FECid. This FECid cannot unambiguously be
used to look up a NHLFE (e.g. when packets are received over two
different interfaces). Of course the same argument applies to labels, but
my point is: let the label itself identify the FEC, do not add another
indirection.

> 2.1.1 NHLFE Configuration:
> The way i see it being setup is via netlink (this way we can take
> advantage of distributed architectures later).

Definitely :). For updates/requests. However, this does not preclude the
use of procfs/sysfs for exposing read-only simple objects like
attributes/labelspaces/etc.

> tc l2conf <cmd> dev <devname>
> mpls nhlfe index <val> proto <ipv4|ipv6> nh <neighbor>

It is too soon to define a grammar for a userspace application. I would
first work on the protocol between the kernel MPLS subsystem and
userspace, defining which information objects are required and the
netlink datagram format, and only then define a userspace app. Moreover,
I would propose having a new userspace app (something like mplsadm)
rather than patching tc, ip, route, etc., given the fact that most users
won't be using MPLS at all.
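Just to make the "information objects first, grammar later" point
concrete, here is the kind of netlink payload I have in mind for "add an
outgoing label (MOI)". This is only a rough sketch for discussion: the
message type and field names below do not exist anywhere, they are my
assumptions.

/* Illustrative only: a hypothetical netlink request for "add an outgoing
 * label (MOI)".  Neither MPLS_NL_MOI_ADD nor struct mpls_nl_moi exist;
 * the point is to agree on the information objects before arguing about
 * the userspace grammar.
 */
#include <sys/socket.h>
#include <linux/netlink.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MPLS_NL_MOI_ADD 0x10            /* hypothetical message type */

struct mpls_nl_moi {                    /* hypothetical "MOI add" object */
    uint32_t       out_label;           /* label to push/swap to */
    uint32_t       out_ifindex;         /* egress interface */
    struct in_addr nexthop;             /* IPv4 next hop (IPv6 analogous) */
    uint8_t        exp;                 /* EXP bits, if any */
};

int main(void)
{
    struct {
        struct nlmsghdr    nlh;
        struct mpls_nl_moi moi;
    } req;

    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(req.moi));
    req.nlh.nlmsg_type  = MPLS_NL_MOI_ADD;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;

    req.moi.out_label   = 1000;
    req.moi.out_ifindex = 2;
    inet_pton(AF_INET, "10.0.0.2", &req.moi.nexthop);

    printf("built a %u byte MOI_ADD request\n", req.nlh.nlmsg_len);
    return 0;
}

Once objects like these are agreed upon, any userspace front end (tc
based, or a standalone mplsadm/mplsnl) can be layered on top of the same
protocol.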
> 2.2 FEC to NHLFE mapping (FTN) Table
>
> I dont see this table existing by itself.

It doesn't :)

> Each MPLS interfacing component will derive a FECid which is used
> to search the NHLFE table.

See my previous comment. The topmost label + the labelspace (or incoming
device) + optionally the upstream router (in the case that the downstream
cannot know the upstream router *and* has allocated the same label for
the same FEC) should suffice to derive the NHLFE (MII) to apply.

> 2.2.1 IPV4/6 route component FTN
> Typically, the FEC will be in the IPV4/6 FIB nexthop entry.
> This way we can have equal cost multi path entries
> with different FECids.

I'll discuss load sharing later in this mail.

> 2.2.2 ingress classification component:
> This has nothing to do with FTN rather it provides another mapping to
> the NHLFE table.
> (when i port tc extension code to 2.6 - we will need a new
> skb field called FECid);
> *ingress code matches a packet description and then sets the skb->FECid
> as an action. We could use the skb->FECid to overrule the FIB FEC
> when we are selecting the fast path.

Good point. But it should not manage FECids, it should manage NHLFE
entries.

> [The u32 classifier could be used to map based on any header bits and select
> the FECid.]

Semantically, it does map the "data object" to a FEC, but that does not
mean it needs to explicitly manage FECids. FECids are the labels. If you
add the FECid notion you have two problems: label management and FECid
management. At most, "FECids" (if you really, really want to add them)
should only be managed at the per-domain I-LSR.

> 2.2.3 Tunneling and L2 technologies FTN
> Revisit this later.

Yes! :) Ethernet over MPLS should now be a primary objective.

> 2.2.4 NHLFE packet path:
>
> As in standard Linux, the fast path is first accessed. Two
> results:
> 1) On success a MPLS cache entry is found and attached to the skb->dst
> the skb->dst is used to forward.
> 2) On failure a slow path is exercised and a new dst cache is created
> from the NHLFE table.

Agreed. Nothing to add here.

> the FECid used to lookup the NHLFE for the cache entry creation.

:) Nope!! The "label" should be used.

> 2.2.5 Configuration IPV4/6 routing:
> The ip tool should allow you specify route you want then
> specify the FECid for that route, i.e:

Again, too soon to focus on userspace / the control plane.

> [??? What would happen if the route nexthop entry and the NHLFE point
> to different egress devices?]

The NHLFE overrides the route nexthop. This is the basis of MPLS Traffic
Engineering: LSPs are not IGP constrained.

> 2.2.6 Configuration for others
>
> They need to be netlink enabled. At the moment only ipsec is.

What others? Anyway, a well defined netlink based protocol is, as you
state, much needed.

> 2.3 ILM (incoming label mapping):
>
> Typical entries for this table are: label, ingress dev, FECid
> Lookup is based on label.

And this is the Radix Tree.

> ILM is used by both LSR or egress LER.
>
> 2.3.1 ILM packet processing:
>
> Incoming packets:
> - use label to lookup the dst cache via route_input()

The label is used to look up a MII object from a Radix Tree; the MII
holds the information about what to do with the packet. Only in the case
that the packet needs to be forwarded do we obtain a MOI object. It is
then that we could check the dst_cache as you state. Right, this fits
well in the mpls_output* family.

> 3.2 Label action opcodes

What's wrong with the existing opcodes? I see little performance gain in
having X_AND_Y opcodes rather than X, Y executed sequentially. Atomic
opcodes are (IMHO) the way to go; otherwise we'll end up with
POP_AND_SWAP_AND_PUSH_AND_MAPEXP. I do not want to say that we need not
try to optimize performance later, but chaining opcodes adds great
flexibility (see the sketch after the opcode list).

> - POP_AND_LOOKUP

POP & DLV?

> - POP_AND_FORWARD
> - NO_POP_AND_FORWARD

FWD

> - DISCARD

DROP
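To illustrate what I mean by chaining atomic opcodes, a small
self-contained sketch (the opcode set and the dispatch loop below are
illustrative only, not the actual mpls-linux opcode engine):

/* Atomic opcodes chained sequentially instead of fused X_AND_Y opcodes.
 * Terminal opcodes (FWD/DLV/DROP) end the chain, so the common fused
 * cases stay only one or two entries long and no combinatorial opcode
 * explosion is needed.
 */
#include <stdio.h>

enum mpls_opcode { MPLS_OP_POP, MPLS_OP_PUSH, MPLS_OP_FWD,
                   MPLS_OP_DLV, MPLS_OP_DROP, MPLS_OP_END };

struct mpls_instr {
    enum mpls_opcode op;
    unsigned int     label;             /* used by PUSH only */
};

static void mpls_run_chain(const struct mpls_instr *chain)
{
    for (; chain->op != MPLS_OP_END; chain++) {
        switch (chain->op) {
        case MPLS_OP_POP:  printf("pop topmost label\n");           break;
        case MPLS_OP_PUSH: printf("push label %u\n", chain->label); break;
        case MPLS_OP_FWD:  printf("forward via MOI\n");             return;
        case MPLS_OP_DLV:  printf("deliver locally\n");             return;
        case MPLS_OP_DROP: printf("drop packet\n");                 return;
        default:                                                    return;
        }
    }
}

int main(void)
{
    /* A swap plus an extra (tunnel) label, expressed as a chain: pop the
     * old top label, push the swapped label, push the tunnel label, then
     * forward.
     */
    const struct mpls_instr swap_push_fwd[] = {
        { MPLS_OP_POP, 0 }, { MPLS_OP_PUSH, 2000 },
        { MPLS_OP_PUSH, 3000 }, { MPLS_OP_FWD, 0 }, { MPLS_OP_END, 0 }
    };

    mpls_run_chain(swap_push_fwd);
    return 0;
}

POP_AND_FORWARD is then simply the chain { POP, FWD }, and DISCARD is
{ DROP }.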
> TODO:
> 1. look into multi next hop for loadbalancing

For LSRs.

> Is this necessary? If yes, there has to be multiple FECids
> in the ILM table.

I've been working on load balancing in MPLS networks. The "right"
approach is, as you state, to have several NHLFEs pointed to from the ILM
table for a given label (+ labelspace + ...). However, another nice
approach is to set up tunnels ("mpls%d") and then use an equalizer
algorithm to split the load; this decouples the algorithm from the
implementation. To test this, I played with having two mpls%d tunnels and
used teql, which is interface "agnostic", and it worked well. In academic
research and in IETF working groups we have discussed load sharing
several times, and the "current" consensus is that it is difficult to
implement load sharing at split points other than the per-domain ingress
LSRs. Since intermediate LSRs are not allowed to look at the IP header,
no hash techniques (or none with enough granularity) can be used to make
sure that packets belonging to the same microflow are forwarded over the
same physical route.

In my opinion, what needs to be done:

- Define a complete framework for FEC management at ingress LSRs, with
  policies to define:
  - How to classify "L2/L3 data objects" into FECs, without limiting
    FECs to IPvX address prefixes only. How do we encode "if the Ethernet
    dst address is a:b:c:d:e:f and the ethertype is IPX then this data
    object belongs to FEC F"? Define the related FEC encodings (see the
    sketch in the P.S. below). Let the control plane apps distribute FEC
    information *as labels*: the non-ambiguous incoming label conveys all
    FEC information, do not add the FECid indirection.

- Define the protocol between the MPLS subsystem and userspace.

- Develop (or adapt) a new userspace app (mplsnl in my previous mail)
  that communicates with the MPLS kernel subsystem in order to get/update
  the MPLS tables.

- Rewrite the dst/neigh/hh cache management (I agree with previous
  comments that this is the least elegant part of the existing
  implementation), once we have the whole outgoing MPLS PDU rebuilt.

- Multicast "barebones" support (or, more adequately, point-to-multipoint
  LSP support), which is conceptually similar to load sharing: the
  incoming label + labelspace + interface + ... (all of which means "a
  non-ambiguous incoming label") should determine a set of MIIs to apply.
  The implementation dependent details are, for example: are the MIIs to
  be applied iteratively (e.g. point-to-multipoint) or one among all
  (load sharing)?

Things that I don't like about the existing implementation:

- The Radix Tree / label space implementation. Since mpls_recv_pckt (the
  callback registered as the packet handler) receives the incoming
  device, I'm still analyzing the drawbacks/advantages of having the "ILM
  global table" split into "per interface" ILM tables. That is, the
  incoming interface + the topmost incoming label would be the "key" to
  find the MII object in a hash table.

Thank you for reading.

Best regards,
Ramon
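P.S. For the "generic FEC encoding" point above, something along these
lines (purely illustrative; the type and field names are mine and are not
part of any implementation or draft) is what I have in mind:

/* A FEC is not necessarily an IPvX prefix: a tagged union lets the
 * classifier and the control plane apps share one encoding for L3 and
 * L2 FECs alike.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum fec_type { FEC_IPV4_PREFIX, FEC_IPV6_PREFIX, FEC_L2 };

struct fec {
    enum fec_type type;
    union {
        struct { uint32_t addr;       uint8_t  prefixlen; } ipv4; /* A.B.C.D/N */
        struct { uint8_t  addr[16];   uint8_t  prefixlen; } ipv6;
        struct { uint8_t  dst_mac[6]; uint16_t ethertype; } l2;
    } u;
};

int main(void)
{
    /* "If the Ethernet dst address is a:b:c:d:e:f and the ethertype is
     * IPX then this data object belongs to FEC F."
     */
    const uint8_t mac[6] = { 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
    struct fec f;

    memset(&f, 0, sizeof(f));
    f.type = FEC_L2;
    memcpy(f.u.l2.dst_mac, mac, sizeof(mac));
    f.u.l2.ethertype = 0x8137;          /* IPX */

    printf("L2 FEC, ethertype 0x%04x\n", f.u.l2.ethertype);
    return 0;
}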