Re: [mpls-linux-devel] Jamal's MPLS design document
From: Ramon C. <cas...@in...> - 2003-12-07 14:09:38
James/Jamal/All,

Thank you for submitting this document for review/comments. Some thoughts
right away (more to follow). For the moment, I'll be giving my impressions
w.r.t. the existing implementation and the patches posted on this list.
Disclaimer: these are my own opinions, and in no way are they written in
stone. You'll see that I do not agree with some aspects of the document,
but I am open to discussion.

A general comment (maybe I'm wrong, let me dig a little more into this) is
that it is too hasty to discard the existing implementation without taking
the time to understand it (which is logical, since there was zero
documentation, not even code comments, but that is something I'm working
on). I would like to know in which particular parts the existing
implementation is flawed and cannot be corrected/extended/reviewed. For
the moment, I have not seen a design flaw or a major valid reason to start
from scratch.

> 2. Tables involved:
> We cant ignore these table names because tons of SNMP MIBs exist
> which at least talk about them; implementation is a different
> issue but at least we should be able to somehow semantically match
> them. The tables are the NHLFE, FTN and ILM.
> The code should use similar names when possible.

Agreed. One of the first things I noticed was the (apparent) lack of these
tables in Jim's implementation. Hopefully, the devel-guide will explain
this. What I have understood:

- The NHLFE is in fact split (which I consider a good design) into two
  parts: MPLS Incoming Information (MII for short), a struct that holds
  the set of opcodes to execute upon arrival of the incoming packet, and
  MPLS Outgoing Information (MOI for short), containing the set of opcodes
  to execute when forwarding the packet. This also decouples the case of
  labelled packets that are delivered to the host.

- The ILM table exists (it is in fact the Incoming Radix Tree). The "ILM"
  lookup is indeed mpls_get_mii_by_label. It looks up the corresponding
  NHLFE entry for a label (which is the MII object, see my previous
  paragraph).

So, yes:

- Some existing symbols should be renamed. For example, the function
  mpls_opcode_peek has been renamed (in my tree anyway) to
  mpls_label_entry_peek. In a similar way, we should define a high level
  function like "mpls_ilm_lookup" which takes the topmost label entry and
  gets the corresponding MII (see the sketch after this list). These
  changes are trivial.

- The FTN table is indeed somehow missing, and is basically done in parts
  using mplsadm: add an outgoing label (with an associated MOI, see below)
  and then map some traffic to that label (this was mainly done by hacking
  some net/core parts). So the big missing part is FEC management (at the
  per-MPLS-domain ingress node). We *should not* assume that a FEC is
  always an IPv4/IPv6 address prefix:
  * We need a "generic" FEC encoding, so later we can also integrate
    L2 LSPs, EoMPLS, etc.
  * A new "classifier" part that maps data objects (and not only address
    prefixes) to FECs.
  These points are discussed later in this document.
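Something like the following (a purely illustrative, self-contained
userspace sketch, not the actual kernel code; the struct layouts and the
labelspace argument are my assumptions) is the kind of split and naming I
have in mind:

/* Conceptual sketch only -- not the real mpls-linux structures.
 * The point is the naming: an "ILM lookup" maps the topmost label
 * (+ labelspace) to a MII object, which may reference a MOI when the
 * packet has to be forwarded rather than delivered locally.
 */
#include <stdint.h>
#include <stdio.h>

struct mpls_moi;                        /* outgoing half: opcodes to forward */

struct mpls_mii {                       /* incoming half: opcodes on arrival */
    uint32_t         label;
    int              labelspace;
    struct mpls_moi *moi;               /* NULL when locally delivered */
};

/* Stand-in for the existing radix tree lookup (mpls_get_mii_by_label). */
static struct mpls_mii *mpls_get_mii_by_label(uint32_t label, int labelspace)
{
    (void)label; (void)labelspace;
    return NULL;                        /* the real code walks the radix tree */
}

/* Proposed higher level name: topmost label entry -> corresponding MII. */
static struct mpls_mii *mpls_ilm_lookup(uint32_t topmost_label, int labelspace)
{
    return mpls_get_mii_by_label(topmost_label, labelspace);
}

int main(void)
{
    struct mpls_mii *mii = mpls_ilm_lookup(1000, 0);

    printf("ILM lookup for label 1000: %s\n", mii ? "hit" : "miss");
    return 0;
}

The renaming itself is trivial; what matters is that the lookup key is the
label (+ labelspace), not a FECid.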
>
> ILM and FTN derive a FECid from their respective lookups
> The result (FECid) is then used to lookup
> the NHLFE to determine how to forward the packet.
>
> 2.1 Next Hop Label Forwarding Entry (NHLFE) Table:
> This table is looked up using the FEC as the key (maybe
> + label space) although label spaces are still in the TODO below.
>
> A standard structure for NHLFE contains:
> - FEC id

See my comments (*) below.

> - neighbor information (IPV4/6 + egress interface)

Yes. It is/should be in the MOI part (that is, the "second half" of the
NHLFE).

> - MPLS operations to perform

The MII/MOI opcodes. With the benefit that if the packet is locally
delivered, there is no need to check the MOI.

(*) Comments: I'm afraid I don't agree here. IMHO, we should not add the
"FECid" indirection here. It has several drawbacks:

- NHLFEs are FEC agnostic. The same NHLFE could be reused for different
  FECs. This is necessary, for example, for LSP merging.

- The notion of FEC should only be defined at ingress LSRs.

- W.r.t. the ILM table, the FECid *is* the topmost label itself!
  Explicitly, the label represents the FEC w.r.t. a pair of
  upstream/downstream LSRs. The lookup should be label -> NHLFE (MII
  object). No need to manage FECids (allocation/removal/etc.).

- In some cases it is necessary to establish cross-connects without
  knowing the FEC that will be transported over the LSP (e.g. when
  working at more than 2 hierarchy levels): e.g. incoming label
  (+ labelspace + interface) -> outgoing label + outgoing interface. No
  need to know the FEC here.

With the notion of FECid you have two issues: label management and FECid
management. Let me explain myself a little more here: imagine we have a
simple, well defined FEC F, "all packets with @IPdest = A.B.C.D/N", so it
should have a "locally" unique FECid. This FECid cannot unambiguously be
used to look up a NHLFE (e.g. when packets are received over two
different interfaces). Of course the same argument applies to labels, but
my point is: let the label itself identify the FEC, do not add another
indirection.

> 2.1.1 NHLFE Configuration:
> The way i see it being setup is via netlink (this way we can take
> advantage of distributed architectures later).

Definitely :). For updates/requests. However, this does not preclude the
use of procfs/sysfs for exposing read-only simple objects like
attributes/labelspaces/etc.

> tc l2conf <cmd> dev <devname>
> mpls nhlfe index <val> proto <ipv4|ipv6> nh <neighbor>

It is too soon to define a grammar for a userspace application. I would
first work on the protocol between the kernel MPLS subsystem and
userspace, defining which information objects are required and the
netlink datagram format, and only then define a userspace app. Moreover,
I would propose having a new userspace app (something like mplsadm)
rather than patching tc, ip, route, etc., given the fact that most users
won't be using MPLS at all.
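Just to make the "information objects first, grammar later" point
concrete, here is the kind of netlink payload I have in mind for "add an
outgoing label (MOI)". This is only a rough sketch for discussion: the
message type and field names below do not exist anywhere, they are my
assumptions.

/* Illustrative only: a hypothetical netlink request for "add an outgoing
 * label (MOI)".  Neither MPLS_NL_MOI_ADD nor struct mpls_nl_moi exist;
 * the point is to agree on the information objects before arguing about
 * the userspace grammar.
 */
#include <sys/socket.h>
#include <linux/netlink.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MPLS_NL_MOI_ADD 0x10            /* hypothetical message type */

struct mpls_nl_moi {                    /* hypothetical "MOI add" object */
    uint32_t       out_label;           /* label to push/swap to */
    uint32_t       out_ifindex;         /* egress interface */
    struct in_addr nexthop;             /* IPv4 next hop (IPv6 analogous) */
    uint8_t        exp;                 /* EXP bits, if any */
};

int main(void)
{
    struct {
        struct nlmsghdr    nlh;
        struct mpls_nl_moi moi;
    } req;

    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(req.moi));
    req.nlh.nlmsg_type  = MPLS_NL_MOI_ADD;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;

    req.moi.out_label   = 1000;
    req.moi.out_ifindex = 2;
    inet_pton(AF_INET, "10.0.0.2", &req.moi.nexthop);

    printf("built a %u byte MOI_ADD request\n", req.nlh.nlmsg_len);
    return 0;
}

Once objects like these are agreed upon, any userspace front end (tc
based, or a standalone mplsadm/mplsnl) can be layered on top of the same
protocol.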
> 2.2 FEC to NHLFE mapping (FTN) Table
>
> I dont see this table existing by itself.

It doesn't :)

> Each MPLS interfacing component will derive a FECid which is used
> to search the NHLFE table.

See my previous comment. The topmost label + the labelspace (or incoming
device) + optionally the upstream router (in the case that the downstream
cannot know the upstream router *and* has allocated the same label for
the same FEC) should suffice to derive the NHLFE (MII) to apply.

> 2.2.1 IPV4/6 route component FTN
> Typically, the FEC will be in the IPV4/6 FIB nexthop entry.
> This way we can have equal cost multi path entries
> with different FECids.

I'll discuss load sharing later in this mail.

> 2.2.2 ingress classification component:
> This has nothing to do with FTN rather it provides another mapping to
> the NHLFE table.
> (when i port tc extension code to 2.6 - we will need a new
> skb field called FECid);
> *ingress code matches a packet description and then sets the skb->FECid
> as an action. We could use the skb->FECid to overrule the FIB FEC
> when we are selecting the fast path.

Good point. But it should not manage FECids, it should manage NHLFE
entries.

> [The u32 classifier could be used to map based on any header bits and select
> the FECid.]

Semantically, it does map the "data object" to a FEC, but that does not
mean it needs to explicitly manage FECids. FECids are the labels. If you
add the FECid notion you have two problems: label management and FECid
management. At most, "FECids" (if you really, really want to add them)
should only be managed at the per-domain I-LSR.

> 2.2.3 Tunneling and L2 technologies FTN
> Revisit this later.

Yes! :) Ethernet over MPLS should now be a primary objective.

> 2.2.4 NHLFE packet path:
>
> As in standard Linux, the fast path is first accessed. Two
> results:
> 1) On success a MPLS cache entry is found and attached to the skb->dst
> the skb->dst is used to forward.
> 2) On failure a slow path is exercised and a new dst cache is created
> from the NHLFE table.

Agreed. Nothing to add here.

> the FECid used to lookup the NHLFE for the cache entry creation.

:) Nope!! The "label" should be used.

> 2.2.5 Configuration IPV4/6 routing:
> The ip tool should allow you specify route you want then
> specify the FECid for that route, i.e:

Again, too soon to focus on userspace / the control plane.

> [??? What would happen if the route nexthop entry and the NHLFE point
> to different egress devices?]

The NHLFE overrides the route nexthop. This is the basis of MPLS Traffic
Engineering: LSPs are not IGP constrained.

> 2.2.6 Configuration for others
>
> They need to be netlink enabled. At the moment only ipsec is.

What others? Anyway, a well defined netlink based protocol is, as you
state, much needed.

> 2.3 ILM (incoming label mapping):
>
> Typical entries for this table are: label, ingress dev, FECid
> Lookup is based on label.

And this is the Radix Tree.

> ILM is used by both LSR or egress LER.
>
> 2.3.1 ILM packet processing:
>
> Incoming packets:
> - use label to lookup the dst cache via route_input()

The label is used to look up a MII object from a Radix Tree; the MII
holds the information about what to do with the packet. Only in the case
that the packet needs to be forwarded do we obtain a MOI object. It is
then that we could check the dst_cache as you state. Right, this fits
well in the mpls_output* family.

> 3.2 Label action opcodes

What's wrong with the existing opcodes? I see little performance gain in
having X_AND_Y opcodes rather than X, Y executed sequentially. Atomic
opcodes are (IMHO) the way to go; otherwise we'll end up with
POP_AND_SWAP_AND_PUSH_AND_MAPEXP. I do not want to say that we need not
try to optimize performance later, but chaining opcodes adds great
flexibility (see the sketch after the opcode list).

> - POP_AND_LOOKUP

POP & DLV?

> - POP_AND_FORWARD
> - NO_POP_AND_FORWARD

FWD

> - DISCARD

DROP
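To illustrate what I mean by chaining atomic opcodes, a small
self-contained sketch (the opcode set and the dispatch loop below are
illustrative only, not the actual mpls-linux opcode engine):

/* Atomic opcodes chained sequentially instead of fused X_AND_Y opcodes.
 * Terminal opcodes (FWD/DLV/DROP) end the chain, so the common fused
 * cases stay only one or two entries long and no combinatorial opcode
 * explosion is needed.
 */
#include <stdio.h>

enum mpls_opcode { MPLS_OP_POP, MPLS_OP_PUSH, MPLS_OP_FWD,
                   MPLS_OP_DLV, MPLS_OP_DROP, MPLS_OP_END };

struct mpls_instr {
    enum mpls_opcode op;
    unsigned int     label;             /* used by PUSH only */
};

static void mpls_run_chain(const struct mpls_instr *chain)
{
    for (; chain->op != MPLS_OP_END; chain++) {
        switch (chain->op) {
        case MPLS_OP_POP:  printf("pop topmost label\n");           break;
        case MPLS_OP_PUSH: printf("push label %u\n", chain->label); break;
        case MPLS_OP_FWD:  printf("forward via MOI\n");             return;
        case MPLS_OP_DLV:  printf("deliver locally\n");             return;
        case MPLS_OP_DROP: printf("drop packet\n");                 return;
        default:                                                    return;
        }
    }
}

int main(void)
{
    /* A swap plus an extra (tunnel) label, expressed as a chain: pop the
     * old top label, push the swapped label, push the tunnel label, then
     * forward.
     */
    const struct mpls_instr swap_push_fwd[] = {
        { MPLS_OP_POP, 0 }, { MPLS_OP_PUSH, 2000 },
        { MPLS_OP_PUSH, 3000 }, { MPLS_OP_FWD, 0 }, { MPLS_OP_END, 0 }
    };

    mpls_run_chain(swap_push_fwd);
    return 0;
}

POP_AND_FORWARD is then simply the chain { POP, FWD }, and DISCARD is
{ DROP }.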
> TODO:
> 1. look into multi next hop for loadbalancing

For LSRs.

> Is this necessary? If yes, there has to be multiple FECids
> in the ILM table.

I've been working on load balancing in MPLS networks. The "right"
approach is, as you state, to have several NHLFEs pointed to from the ILM
table for a given label (+ labelspace + ...). However, another nice
approach is to set up tunnels ("mpls%d") and then use an equalizer
algorithm to split the load; this decouples the algorithm from the
implementation. To test this, I played with having two mpls%d tunnels and
used teql, which is interface "agnostic", and it worked well. In academic
research and in IETF working groups we have discussed load sharing
several times, and the "current" consensus is that it is difficult to
implement load sharing at split points other than the per-domain ingress
LSRs. Since intermediate LSRs are not allowed to look at the IP header,
no hash techniques (or none with enough granularity) can be used to make
sure that packets belonging to the same microflow are forwarded over the
same physical route.

In my opinion, what needs to be done:

- Define a complete framework for FEC management at ingress LSRs, with
  policies to define:
  - How to classify "L2/L3 data objects" into FECs, without limiting
    FECs to IPvX address prefixes only. How do we encode "if the Ethernet
    dst address is a:b:c:d:e:f and the ethertype is IPX then this data
    object belongs to FEC F"? Define the related FEC encodings (see the
    sketch in the P.S. below). Let the control plane apps distribute FEC
    information *as labels*: the non-ambiguous incoming label conveys all
    FEC information, do not add the FECid indirection.

- Define the protocol between the MPLS subsystem and userspace.

- Develop (or adapt) a new userspace app (mplsnl in my previous mail)
  that communicates with the MPLS kernel subsystem in order to get/update
  the MPLS tables.

- Rewrite the dst/neigh/hh cache management (I agree with previous
  comments that this is the least elegant part of the existing
  implementation), once we have the whole outgoing MPLS PDU rebuilt.

- Multicast "barebones" support (or, more adequately, point-to-multipoint
  LSP support), which is conceptually similar to load sharing: the
  incoming label + labelspace + interface + ... (all of which means "a
  non-ambiguous incoming label") should determine a set of MIIs to apply.
  The implementation dependent details are, for example: are the MIIs to
  be applied iteratively (e.g. point-to-multipoint) or one among all
  (load sharing)?

Things that I don't like about the existing implementation:

- The Radix Tree / label space implementation. Since mpls_recv_pckt (the
  callback registered as the packet handler) receives the incoming
  device, I'm still analyzing the drawbacks/advantages of having the "ILM
  global table" split into "per interface" ILM tables. That is, the
  incoming interface + the topmost incoming label would be the "key" to
  find the MII object in a hash table.

Thank you for reading.

Best regards,
Ramon
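P.S. For the "generic FEC encoding" point above, something along these
lines (purely illustrative; the type and field names are mine and are not
part of any implementation or draft) is what I have in mind:

/* A FEC is not necessarily an IPvX prefix: a tagged union lets the
 * classifier and the control plane apps share one encoding for L3 and
 * L2 FECs alike.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum fec_type { FEC_IPV4_PREFIX, FEC_IPV6_PREFIX, FEC_L2 };

struct fec {
    enum fec_type type;
    union {
        struct { uint32_t addr;       uint8_t  prefixlen; } ipv4; /* A.B.C.D/N */
        struct { uint8_t  addr[16];   uint8_t  prefixlen; } ipv6;
        struct { uint8_t  dst_mac[6]; uint16_t ethertype; } l2;
    } u;
};

int main(void)
{
    /* "If the Ethernet dst address is a:b:c:d:e:f and the ethertype is
     * IPX then this data object belongs to FEC F."
     */
    const uint8_t mac[6] = { 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
    struct fec f;

    memset(&f, 0, sizeof(f));
    f.type = FEC_L2;
    memcpy(f.u.l2.dst_mac, mac, sizeof(mac));
    f.u.l2.ethertype = 0x8137;          /* IPX */

    printf("L2 FEC, ethertype 0x%04x\n", f.u.l2.ethertype);
    return 0;
}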