Re: [mpls-linux-devel] Merging into the kernel?


Hi,

On Wed, Feb 15, 2006 at 09:40:57PM -0600, James R. Leu wrote:
> Hey there Steve,
>
[some things cut to avoid reposting too much] 
> 
> > One question occurs to me though... did you consider using the xfrm code
> > in order to interface with the higher layer protocols? I took a look at
> > this recently since it seems to be in the right place in the stack to
> > do this. It's fairly complicated and there seems to be at least a loose
> > fit but I can see that some changes would be required. The issue of selecting
> > a forwarding class being the biggest potential issue. Still it appears to me
> > to be a lot cleaner to modify xfrm a bit and I suspect more likely to meet
> > with more general approval.
> 
> Yes.  I recently spent a significant amount of time understanding the XFRM
> code just to realize that it cannot be tied to a specific route.  The
> selector mechanism is much like netfilter, ie it does not do longest prefix
> match.  In fact, implementing an XFRM shim module would bring route based
> IPSEC VPNs to linux (without having to use a virtual interface).
> 
> I also looked at tc actions, but there too, tc is more like netfilter than
> a LPM.
>
Yes - I wonder though whether we could use a different selector mechanism
but keeping some of the general framework. When I looked at it, the main
thing which struck me was that the difficulty in changing the selector
mechanism was mostly down to the interface (via netlink) to userland.
Actually changing it on the kernel side is not impossible I think.
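
To make the distinction concrete, here is a minimal sketch of the two lookup styles being compared: an ordered "first hit wins" selector list versus a longest prefix match. It is a simplification for illustration only; the function names and the "nhlfe-A"/"nhlfe-B" targets are made up and are not taken from the xfrm or routing code.

```python
import ipaddress

def first_match(rules, addr):
    """Selector-list style lookup: rules are tried in order, first hit wins."""
    ip = ipaddress.ip_address(addr)
    for prefix, target in rules:
        if ip in ipaddress.ip_network(prefix):
            return target
    return None

def longest_prefix_match(routes, addr):
    """Routing-table style lookup: the most specific matching prefix wins."""
    ip = ipaddress.ip_address(addr)
    best = None  # (prefix length, target) of the best match so far
    for prefix, target in routes:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, target)
    return best[1] if best else None

# Hypothetical table: the less specific entry happens to be listed first.
table = [("10.0.0.0/8", "nhlfe-A"), ("10.1.0.0/16", "nhlfe-B")]

print(first_match(table, "10.1.2.3"))           # nhlfe-A
print(longest_prefix_match(table, "10.1.2.3"))  # nhlfe-B
```

With the same table the two lookups disagree for 10.1.2.3, which is why a selector that is not LPM-based cannot simply stand in for a route.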

> 
> > This also brings me neatly on to the forwarding code. I see that this has
> > been implemented in three parts, which if you'll excuse my ascii artistry,
> > or lack of it, interact as follows:
> > 
> > 
> >                 /-----\   /----\   /-------\
> >  input from ----| ILM |---| XC |---| nhlfe |---- output to
> >  netdevice      \-----/   \----/   \-------/     netdevice
> >                    |                   |
> >                    |                   |
> >                  to higher          from higher
> >                  protocols          protocols
> >
> > Now both the ILM and nhlfe are composed of radix tree tables and I'm
> > curious as to why you chose this particular system over (say, for example)
> > a plain hash table. The cross connect interface which updates the final
> > instruction in the ILM entry to point to an nhlfe entry is also confusing
> > me slightly as I don't see why the ILM can't have a forwarding instruction
> > added to it directly when it's created via the netlink message. Why the
> > extra interface?
> 
> The XC netlink interface is there to assist signaling protocols.  It is
> very common to create an ILM that terminates locally and then at a later
> time XCs to a NHLFE, and at even a later time, swing the XC to a different
> NHLFE.  With that being said, there is nothing that the XC netlink interface
> does that cannot be done by just modifying the instructions via the ILM
> netlink interface.
> 
> Why use a radix tree?  Originally it was just for ease of implementation.
> Now it is because the radix tree lookup for the ILM provides deterministic
> search times.  That being said, I have no problem with changing to a
> multi tier hash scheme as long as it can provide better performance (including
> corner cases).
>
And of course it occurred to me that in order to find this out we'd need some
tools to test against. Please find attached a patch for pktgen (as
current in davem's net-2.6.17 git tree at kernel.org) to generate
MPLS packets.

The extension allows you to add a stack of labels onto the packets it's
sending out. There is one extra hack which I included: since we know
how many labels there are in the stack, I've used the bottom of stack
bit to indicate whether the label should be randomly generated or not.

You can thus push a stack of up to 16 labels, where each label in the
stack is either a fixed value or random.

pgset "mpls 0001000a,0002000a,0000000a"

for example pushes labels 16, 32 and 0 (ipv4 null) each with a ttl of 10.
If you set the bottom of stack bit in one of the labels it will turn on
the MPLS_RND flag. You can also set or reset that flag in the
normal way.
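
For anyone who wants to check the hex values by hand, here is a rough sketch of how each 32-bit entry breaks down (20-bit label, 3-bit EXP, 1-bit bottom of stack, 8-bit TTL, as in RFC 3032) and of the behaviour described above. It is not the patch code itself: the function names are made up, and the assumption that a random entry keeps its EXP and TTL fields is mine.

```python
import random

BOS_BIT = 0x100  # bottom-of-stack (S) bit of a 32-bit label stack entry

def decode_entry(word):
    """Split a 32-bit MPLS label stack entry into its fields."""
    return {
        "label": (word >> 12) & 0xFFFFF,
        "exp": (word >> 9) & 0x7,
        "bos": (word >> 8) & 0x1,
        "ttl": word & 0xFF,
    }

def parse_mpls_arg(arg, max_labels=16):
    """Parse a comma-separated list of hex entries such as '0001000a,0002000a'."""
    words = [int(tok, 16) for tok in arg.split(",")]
    if len(words) > max_labels:
        raise ValueError("too many labels")
    return words

def build_stack(words, rng=random):
    """Build the stack to transmit from the configured entries.

    An entry with the S bit set is treated as "randomise this label"
    (assumption: EXP and TTL are kept). The real S bit is then set only
    on the last entry, since the stack depth is known.
    """
    out = []
    last = len(words) - 1
    for i, word in enumerate(words):
        if word & BOS_BIT:
            label = rng.randrange(1 << 20)
            word = (word & 0x00000FFF) | (label << 12)
        word &= ~BOS_BIT
        if i == last:
            word |= BOS_BIT
        out.append(word)
    return out

words = parse_mpls_arg("0001000a,0002000a,0000000a")
print([decode_entry(w)["label"] for w in words])  # [16, 32, 0]
print([decode_entry(w)["ttl"] for w in words])    # [10, 10, 10]
```

Changing the first entry to 0001010a sets its S bit, so in this sketch that label would be drawn at random while the other two stay fixed.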

Patches to pktgen have become very popular of late it seems
so I'm going to wait until the latest set which are pending at the
moment have made it into Dave's tree before making a final diff to send
to Robert Olsson, the maintainer of pktgen.

Also if anyone has feedback about this feature, please let me know.

> > I have been giving some thought as to the efficiency of the forwarding
> > process itself recently, with the idea of "transcoding" the instructions
> > as provided via netlink into an efficient byte code to allow faster
> > execution. There would appear to be considerable scope for merging certain
> > instructions (e.g. a pull followed by a push) into one internal instruction
> > (i.e. the interface would be the same and the effect the same so it
> > wouldn't break the protocol at all).
> 
> I like the idea.  This is much like what I'm used to in the hardware
> forwarding world. What you're kind of hinting at is a packet translation
> engine, this would make it easier to map the forwarding of packets onto FPGA
> or ASIC based hardware (isn't there a couple of projects doing this
> for packet filtering? nf-HIPAC)
>
It's possible it might make it easier. I have to say that although I'm a
hardware engineer by training I've never really got into details of
network interfaces and what it's possible to do on the cards. I wouldn't
be at all surprised if it was the case though and it would be nice to
do :-)
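
As a rough illustration of the instruction merging mentioned above, here is a minimal peephole pass that folds a pop followed by a push into a single swap. The instruction names and tuple encoding are hypothetical and not the actual netlink instruction set, and TTL/EXP handling is ignored.

```python
def merge_instructions(instrs):
    """Peephole pass: fold ('pop',) followed by ('push', label) into ('swap', label)."""
    out = []
    for ins in instrs:
        if ins[0] == "push" and out and out[-1] == ("pop",):
            out[-1] = ("swap", ins[1])
        else:
            out.append(ins)
    return out

def run(instrs, stack):
    """Apply instructions to a label stack (top of stack is the last element)."""
    stack = list(stack)
    for ins in instrs:
        if ins[0] == "pop":
            stack.pop()
        elif ins[0] == "push":
            stack.append(ins[1])
        elif ins[0] == "swap":
            stack[-1] = ins[1]
        else:
            raise ValueError("unknown instruction: %r" % (ins,))
    return stack

prog = [("pop",), ("push", 32), ("push", 48)]
merged = merge_instructions(prog)
print(merged)             # [('swap', 32), ('push', 48)]
print(run(prog, [16]))    # [32, 48]
print(run(merged, [16]))  # [32, 48]
```

The externally visible instruction list stays the same; only the internal form that the forwarding path executes gets shorter.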

> 
> > The various instructions to set/get tcindex and nfmark seem like a
> > very good plan. I'm considering writing a patch to add setting nfmark
> > through the ipv4/6/decnet routing tables which I think would be a
> > generally useful plan. I wonder also if using one or the other or both
> > of nfmark and/or tcindex as a key in looking up the nhlfe and/or ilm
> > isn't a bad idea either.
> 
> That might be against the RFCs.  I know I'm already overstepping the
> RFCs by allowing the EXP bits to determine a NHLFE.
>
I wouldn't worry too much about overstepping what the RFCs say so long
as the result makes sense and the stack can still comply with them on
all the required points. The main worry with schemes like this is really
just a question of forwarding speed and whether it will slow things down
too much.

> > If nfmark could be 1:1 with mpls fec, then it might be possible to use
> > it together with xfrm as the interface for higher level protocols.
> 
> Not sure I follow you here.  Currently with the shim setup there is no
> NHLFE lookup in the forward path, the NHLFE is bound to the IPv4|6 route or
> the eb|iptables rule.
>
Ok, let me explain a bit more then.... I'm assuming a scenario where the
NHLFE is determined based upon nfmark and nfmark is set in the route
(of whatever protocol). If nfmark were also a key for xfrm then it
should be possible to "bundle" a set of dst_entry with the MPLS nhlfe
as the last entry in the stack.

I haven't got any further with the DECnet interface since I last posted
but I may well make that my next project.

Steve.