Re: [mpls-linux-devel] Merging into the kernel?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,

Sorry for taking so long to respond, here are a number of comments and
questions.... this is also an opportunity for me to express all the ideas
I've had over the last few weeks, so please excuse me if I ramble on too
much :-)

Firstly thanks for sending me the two patches. I've spoken to Jamal and
the genetlink solution is certainly the right way to go, so I'll talk
only about that particular patch from now on. You mentioned that RCU
needed looking at in a previous email and I've taken a quick look over
it and I'd agree that it looks like the current code is part way between
RCU and a fully locked solution.

Probably the simplest thing to do is just to change the list_add_rcu()
in shim.c:shim_proto_add() back to a standard list_add() until the list
can be converted completely (and likewise list_del_rcu() in 
shim.c:shim_proto_remove()) although I'd agree that an RCU implementation
would be much preferable, and possibly even a prerequisite for kernel
inclusion due to the already (mostly) lockless routing code.

One question occurs to me though... did you consider using the xfrm code
in order to interface with the higher layer protocols? I took a look at
this recently since it seems to be in the right place in the stack to
do this. Its fairly complicated and there seems to be at least a loose
fit but I can see that some changes would be required. The issue of selecting
a forwarding class being the biggest potential issue. Still it appears to me
to be a lot cleaner to modify xfrm a bit and I suspect more likely to meet
with more general approval.

One thing which struck me about the patch was although it does put
in place a lot of the groundwork, I think a more ambitious approach would
be more likely to be accepted. Going for small patches is good, but also
the initial patch should add some (standalone) functionality to the kernel,
so there is also "too small". I'd suggest going for the minimum sized patch
which can add a useful feature, say for example, adding just enough code to do
MPLS forwarding and leaving other bits (the tunnel device which makes a nice
patch on its own and the interface with the higher level protocols) until
later.

This also brings me neatly on to the forwarding code. I see that this has
been implemented in three parts, which if you'll excuse my ascii artistry,
or lack of it, interact as follows:

                /-----\   /----\   /-------\
 input from ----| ILM |---| XC |---| nhlfe |---- output to
 netdevice      \-----/   \----/   \-------/     netdevice
                   |                   |
                   |                   |
                 to higher          from higher
                 protocols          protocols

Now both the ILM and nhlfe are composed of radix tree tables and I'm
curious as to why you chose this particular system over (say, for example)
a plain hash table. The cross connect interface which updates the final
instruction in the ILM entry to point to an nhlfe entry is also confusing
me slightly as I don't see why the ILM can't have a forwarding instruction
added to it directly when its created via the netlink message. Why the
extra interface?

I have been giving some thought as to the efficiency of the forwarding
process itself recently, with the idea of "transcoding" the instructions
as provided via netlink into an efficient byte code to allow faster
execution. The would appear to be considerable scope for merging certain
instructions (e.g. a pull followed by a push) into one internal instruction
(i.e. the interface would be the same and the effect the same so it
wouldn't break the protocol at all).

It should be possible to calculate, for any given instruction sequence
the maximum headroom required (in the skb) to execute it in advance
such that it would be possible to guarantee that only a single call
to skb_realloc_headroom would be required for any particular packet.
A future development might even request that device drivers always send
packets with a certain amount of headroom to prevent even this call
being required. Currently I see that its assumed that 32 bytes will
cover all eventualities, which seems a reasonable bet for most uses.

Another feature of transcoding would be to remove useless instructions
(NOP) from the execution path. Is there actually any practical purpose
for this instruction? I'm afraid I can't see one.

The various instructions to set/get tcindex and nfmark seem like a
very good plan. I'm considering writing a patch to add setting nfmark
through the ipv4/6/decnet routing tables which I think would be a
generally useful plan. I wonder also if using one or the other or both
of nfmark and/or tcindex as a key in looking up the nhlfe and/or ilm
isn't a bad idea either.

If nfmark could be 1:1 with mpls fec, then it might be possible to use
it together with xfrm as the interface for higher level protocols.

I also have the basics of a DECnet interface to mpls (not tested yet,
I'm afraid) which I've roughed out. I'll send that to you if you are
interested. It doesn't look a lot different to the ipv4 one which is
no surprise since thats where DECnet's routing code comes from.

Steve.