Re: [mpls-linux-devel] Jamal's MPLS design document

See my comments within:

On Sun, Dec 07, 2003 at 03:08:44PM +0100, Ramon Casellas wrote:
> 
> 
> James/Jamal/All,
> 
> Thank you for submitting this document for review/comments.
> 
> Some thoughts right away (more to follow). For the moment, I'll be giving
> my impressions w.r.t. the existing implementation and the patches posted
> on this list.
> 
> Disclaimer: These are the opinions of my own, and in no way they are
> written in stone. You'll see that I do not agree with some aspects of the
> document, but I am open to discussion.
> 
> 
> A general comment (maybe I'm wrong, let me dig a little more into this)
> is that it is too hasty to discard the existing implementation without
> taking the time to understand it (which is logical, since there was 0
> documentation, not even code comments, but it is something I'm
> working on), but I would like to know in which particular parts the
> existing implementation is flawed and cannot be corrected / extended /
> reviewed.

I agree.  Thank you for being the one to state this.

> For the moment, I have not seen a design flaw or major valid reason to
> start from scratch.
> 
> 
> 
> > 2. Tables involved:
> > We can't ignore these table names because tons of SNMP MIBs exist
> > which at least talk about them; implementation is a different
> > issue but at least we should be able to somehow semantically match
> > them. The tables are the NHLFE, FTN and ILM.
> > The code should use similar names when possible.
> 

One thing to remember is that the names NHLFE, ILM and FTN are terms
used when talking about MPLS architectures.  These are just logical
'tables' that encompass required functionality.  A great example of
this is that the LSR MIB does not have ILM or NHLFE tables, but refers
to these logical entities in the explanation of the insegment and
outsegment tables.

So to use the names ILM and NHLFE in any implementation is misleading
unless you limit the implementation to be _only_ what is referred to
by the MPLS architecture RFC.  As everyone knows, the architecture
RFC is way too generic to try and use as the sole basis of a forwarding
plane.

> Agreed. One of the first things I noticed was the (apparent) lack of these
> tables in Jim's implementation. Hopefully, the devel-guide will explain
> this. What I have understood:

<snip>

> * A new part "classifier" that maps data objects (and not only address
> prefixes) to FECs. These points are discussed later in this document.

The FTN again is a logical block of functionality.  The only reason an
FTN table should exist is so that a particular service can register
a binding with it.  In other words, it is informational only.  I do not
think we want to implement a generic FEC registration mechanism because
no matter what we do it will never be flexible enough to handle new FEC
definitions without rewriting all of the old FEC definitions.  Leave FEC
binding to be extensions to existing tools which already deal with that
type of traffic (i.e. iproute2, iptables, tc, brctl etc).

> > ILM and FTN derive a FECid from their respective lookups
> > The result (FECid) is then used to lookup
> > the NHLFE to determine how to forward the packet.
> > 2.1 Next Hop Label Forwarding Entry (NHLFE) Table:
> > This table is looked up using the FEC as the key (maybe
> > + label space) although label spaces are still in the TODO below.
> >
> > A standard structure for NHLFE contains:
> > - FEC id
> 
> See my comments (*) below.
> 
> 
> 
> > - neighbor information (IPV4/6 + egress interface)
> 
> Yes. It is/should be in the MOI part (that is, the "second half" of the
> NHLFE)
> 
> 
> > - MPLS operations to perform
> 
> The MII/MOI opcodes. With the benefit that if it is locally delivered,
> there's no need to check the MOI.
> 
> 
> 
> (*) Comments:
> I'm afraid that I don't agree here. IMHO, I think we should not add the
> "FECid" indirection here. It has several drawbacks:
> 
> - NHLFE are FEC agnostic. The same NHLFE could be reused for different
> FECs. This is necessary for example, for LSP merging.
> 
> - The notion of FEC should only be defined at Ingress LSRs.
> 
> - W.r.t. the ILM table the FECid *is* the topmost label itself!
> Explicitly, the label represents the FEC w.r.t. a pair of
> upstream/downstream LSRs. The lookup should be label -> NHLFE (MII
> object). No need to manage FECids (allocation/removal/etc)
> 
> - In some cases, it is necessary to establish cross-connects without
> knowing the FEC that will be transported over the LSP (e.g. when working
> at > 2 hierarchy levels): e.g. Incoming label (+labelspace+interface) ->
> Outgoing label + outgoing interface. No need to know the FEC here.
> 
> With the notion of FECid, you have two issues: Label management and FECId
> management. Let me explain myself a little more here: imagine we have a
> simple FEC F, 'all packets with @IPdest = A.B.C.D/N', well defined, so it
> should have a 'locally' unique FECid. This FECid cannot "non ambiguously"
> be used to look up a NHLFE (e.g. when received over two different
> interfaces for example). Of course the same argument applies to labels,
> but my point is "let the label only identify itself the FEC", do not add
> another indirection.

I agree.  This is kind of the point I was getting at in my previous e-mail.
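
To make the "label -> NHLFE, no FECid" point concrete, here is a minimal
sketch in C.  It is only an illustration under stated assumptions: the names
(struct nhlfe, struct ilm_table, ilm_add, ilm_lookup) are hypothetical and
are not taken from the existing implementation, and the fixed-size
open-addressed table is just the simplest thing that shows the idea.  The
key is (labelspace, label); the same NHLFE may be referenced from several
entries, which is what LSP merging needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical NHLFE: FEC agnostic, only says how to forward. */
struct nhlfe {
    uint32_t out_label;
    int      out_ifindex;
};

/* Hypothetical ILM entry: keyed by (labelspace, label), no FECid. */
struct ilm_entry {
    int           in_use;
    int           labelspace;
    uint32_t      label;   /* 20-bit incoming label */
    struct nhlfe *nhlfe;   /* may be shared by several entries (merging) */
};

#define ILM_SIZE 64

struct ilm_table {
    struct ilm_entry slot[ILM_SIZE];
};

static unsigned ilm_hash(int labelspace, uint32_t label)
{
    return ((label * 2654435761u) ^ (uint32_t)labelspace) % ILM_SIZE;
}

/* Bind (labelspace, label) to an NHLFE. Returns 0, or -1 if the table is full. */
static int ilm_add(struct ilm_table *t, int labelspace, uint32_t label,
                   struct nhlfe *n)
{
    unsigned h = ilm_hash(labelspace, label);

    assert(label < (1u << 20));   /* MPLS labels are 20 bits wide */

    for (unsigned i = 0; i < ILM_SIZE; i++) {
        struct ilm_entry *e = &t->slot[(h + i) % ILM_SIZE];

        if (!e->in_use || (e->labelspace == labelspace && e->label == label)) {
            e->in_use = 1;
            e->labelspace = labelspace;
            e->label = label;
            e->nhlfe = n;
            return 0;
        }
    }
    return -1;
}

/* Look up the NHLFE for an incoming label. Returns NULL if there is no binding. */
static struct nhlfe *ilm_lookup(const struct ilm_table *t, int labelspace,
                                uint32_t label)
{
    unsigned h = ilm_hash(labelspace, label);

    for (unsigned i = 0; i < ILM_SIZE; i++) {
        const struct ilm_entry *e = &t->slot[(h + i) % ILM_SIZE];

        if (!e->in_use)
            return NULL;   /* entries are never deleted, so an empty slot ends the probe */
        if (e->labelspace == labelspace && e->label == label)
            return e->nhlfe;
    }
    return NULL;
}
```

Note that nothing here allocates or frees a FEC identifier: the only state
to manage is the label binding itself.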

<snip>

> > 2.2.3 Tunneling and L2 technologies FTN
> > Revisit this later.
> Yes! :) Ethernet over MPLS should now be a primary objective

Take a look at my l2cc code (in my p4 tree).  If you ask me, this is
the correct separation.  l2cc is more than just being able to transport
L2 frames over MPLS.  It is a generic mechanism for implementing L2 switching
and splicing for Linux.  If you look at any mature L2 over MPLS implementation,
it also has the ability to do local L2 switching/splicing.

<snip>

> > 3.2 Label action opcodes
> 
> What's wrong with the existing opcodes? I see little performance gain in
> having X_AND_Y opcodes rather than X Y, sequentially. Atomic opcodes are
> (IMHO) the way to go. Otherwise we'll end up with
> POP_AND_SWAP_AND_PUSH_AND_MAPEXP. However, I do not want to state that we
> need not try to optimize performance later, but chaining opcodes adds
> great flexibility.

I agree that trying to make single OPs that do multiple things is a little
silly.  I guess you can just call me a RISC type of guy.
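
As a rough illustration of the "atomic opcodes, chained" style, here is a
small C sketch.  The opcode names, struct layouts and run_instrs() are
hypothetical, made up for this example rather than copied from the existing
code.  A swap is simply POP followed by PUSH, so no X_AND_Y opcode is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical atomic opcodes: each one does exactly one thing. */
enum mpls_op { OP_POP, OP_PUSH, OP_FWD, OP_DROP };

struct mpls_instr {
    enum mpls_op op;
    uint32_t     arg;   /* label for OP_PUSH, interface index for OP_FWD */
};

#define MAX_DEPTH 8

struct label_stack {
    uint32_t label[MAX_DEPTH];   /* label[depth - 1] is the top of the stack */
    int      depth;
};

enum verdict { V_CONTINUE, V_FORWARD, V_DROP };

/* Run a chain of atomic instructions against a label stack. */
static enum verdict run_instrs(struct label_stack *s,
                               const struct mpls_instr *prog, int n,
                               uint32_t *out_if)
{
    assert(s != NULL && prog != NULL && out_if != NULL);

    for (int i = 0; i < n; i++) {
        switch (prog[i].op) {
        case OP_POP:
            if (s->depth == 0)
                return V_DROP;          /* nothing left to pop */
            s->depth--;
            break;
        case OP_PUSH:
            if (s->depth == MAX_DEPTH)
                return V_DROP;          /* stack would overflow */
            s->label[s->depth++] = prog[i].arg;
            break;
        case OP_FWD:
            *out_if = prog[i].arg;      /* hand the packet to the egress interface */
            return V_FORWARD;
        case OP_DROP:
            return V_DROP;
        }
    }
    return V_CONTINUE;                  /* chain ended without a final verdict */
}
```

With this shape, "swap label 16 for 42 and forward on interface 3" is the
chain { POP, PUSH 42, FWD 3 }, and new behaviour is added by adding one small
opcode rather than a new combination.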

> > - POP_AND_LOOKUP
> 
> POP & DLV ?

POP and PEEK

> 
> 
> > - POP_AND_FORWARD
> 
> 
> > - NO_POP_AND_FORWARD
> FWD
> 
> 
> > - DISCARD
> DROP
>
> > TODO:
> > 1.  look into multi next hop for loadbalancing For LSRs.
> > Is this necessary? If yes, there has to be multiple FECids
> > in the ILM table.

> 
> I've been working on load balancing in MPLS networks. The "right" approach
> is as you state, to have several pointed NHLFEs in the ILM table for a
> given label(+labelspace+..). However, another nice approach is to set up
> tunnels ("mpls%d") and then use an equalizer algorithm to split the load.
> This decouples the algorithm from the implementation. To test this, I played
> with having two mpls%d tunnels and using teql, which is interface "agnostic",
> and it worked well.

This approach only works for end-to-end LSP load balancing, which should
not even be an issue if we expose each possible LSP as a nexthop.  Then
standard techniques for load balancing can be used.

> In academic research and IETF w.g. we have discussed Load Sharing several
> times, and the "current" consensus is that it is difficult to implement
> L.S. in split points other than the "per domain" Ingress LSRs. Since
> intermediate LSRs are not allowed to look at the IP header, no hash
> techniques (or not with enough granularity) can be used to make sure that
> packets belonging to the same microflow are forwarded over the same
> physical route.

This has been a pretty hot debate as of late on the MPLS-WG mailing list
(as part of the OAM framework draft).

In general the accepted technique is to commit a layer violation and
look past the MPLS shim.  If the first nibble is 4 or 6, then assume it's IP
and do typical microflow load balancing; otherwise use the label stack
to create a hash.

I have some interesting ideas for areas of experimentation with regard to
this.  For now I think we should make the statement that end-to-end and
mid-stream load balancing is needed, and should be configured by having
multiple outgoing labels configured (perhaps the FWD instruction could
be an array of outgoing labels?)
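
A rough C sketch of the "look past the shim" hash, combined with picking one
entry from an array of outgoing labels.  Everything named here
(mpls_flow_hash, pick_out_label, the use of FNV-1a) is an assumption made for
illustration, not an existing API.  It walks the label stack to the
bottom-of-stack bit, peeks at the first nibble of the payload, hashes the IP
source/destination addresses when that nibble is 4 or 6, and otherwise hashes
only the 20-bit label fields.  The nibble check is a heuristic: a non-IP
payload can also start with 4 or 6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, chosen only because it is short; any decent hash would do. */
static uint32_t fnv1a(uint32_t h, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * pkt points at the first MPLS shim entry, len is the number of bytes
 * available.  Each entry is 4 bytes: label(20) | EXP(3) | S(1) | TTL(8).
 */
static uint32_t mpls_flow_hash(const uint8_t *pkt, size_t len)
{
    uint32_t h = 2166136261u;
    size_t off = 0;
    int bottom = 0;

    assert(pkt != NULL);

    /* Walk the stack until the bottom-of-stack (S) bit is set. */
    while (!bottom && off + 4 <= len) {
        bottom = pkt[off + 2] & 0x01;
        off += 4;
    }

    /* Layer violation: peek at the first nibble after the shim. */
    if (bottom && len - off >= 20 && (pkt[off] >> 4) == 4)
        return fnv1a(h, pkt + off + 12, 8);    /* IPv4 src + dst */
    if (bottom && len - off >= 40 && (pkt[off] >> 4) == 6)
        return fnv1a(h, pkt + off + 8, 32);    /* IPv6 src + dst */

    /* Not IP (or truncated): hash the label fields only, not EXP/S/TTL. */
    for (size_t i = 0; i + 4 <= off; i += 4) {
        uint8_t lbl[3] = { pkt[i], pkt[i + 1], (uint8_t)(pkt[i + 2] & 0xF0) };
        h = fnv1a(h, lbl, 3);
    }
    return h;
}

/* FWD with an array of outgoing labels: the flow hash selects one of them. */
static uint32_t pick_out_label(uint32_t hash, const uint32_t *out_labels,
                               size_t n)
{
    assert(out_labels != NULL && n > 0);
    return out_labels[hash % n];
}
```

Because the TTL and EXP bits are left out of the hash, packets of the same
microflow keep mapping to the same outgoing label even as the TTL changes.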

> In my opinion, what needs to be done
> 
> - Define a complete framework for FEC management at ingress LSRs, with
> policies to define:
>       - How to classify "L2/L3 data objects" into FECs, without limiting
> only to IPvX address prefixes as FECs. How do we encode "if the Ethernet
> dst address is a:b:c:d:e:f and the ethertype is IPX then this data object
> belongs to FEC F". Define related FEC encodings. Let the control plane
> apps distribute FEC information *as labels*. The non-ambiguous incoming
> label conveys all FEC information, do not add the FECid indirection.

See my above comment on this.  I don't think we want to do this.

>       - Define the protocol between MPLS subsystem and userspace

Agreed.

>       - Develop (or adapt) a new userspace app (mplsnl in my previous
> mail) that communicates with the MPLS kernel subsystem in order to
> Get/Update MPLS tables.
> 
>       - Rewrite (I agree with previous comments that this is the least
> elegant part of the existing implementation) dst/neigh/hh cache
> management, Once we have the whole outgoing MPLS PDU rebuilt.

I agree that the dst/neigh/hh stuff is not pretty.  I think this is
definitely a place where some kernel gurus could provide some help.

> 	- Multicast "barebones" support (or, more adequately, point to
> multipoint LSP support), conceptually similar to Load Sharing: the
> incoming label + labelspaces + interface + ...(this all means "a non
> ambiguous incoming label") should determine a set of MIIs to apply. The
> implementation dependent details are, for example: Are the MIIs to be
> applied iteratively (e.g. point to multipoint) or one among all (Load
> Sharing).
> 
> Things that I don't like from the existing implementation:
> 
> - The Radix Tree/Label space implementation. since mpls_recv_pckt (the
> callback when registering the packet handler) contains the incoming
> device, I'm still analyzing the drawbacks /advantages of having the "ILM
> global table" split in "per interface" ilm tables . That is, the incoming
> interface + the topmost incoming table are the "key" to find the MII
> object in a hash table.

The ramification here is whether or not you want to expose all of
the 'label programming' to the applications.  Just think of a signaling
protocol that allocates a label in labelspace 0.  Or think about how you
go about allocating an application label (think of the 2nd label used with
L2CC a la Martini).

I'm not saying the technique I used is the best, but it at least handles
all of the possible uses of labels.

This is a good conversation we have going.  I hope others join in ;-)

-- 
James R. Leu
jl...@mi...