Re: [mpls-linux-devel] Merging into the kernel?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hey there Steve,

On Wed, Feb 15, 2006 at 09:07:50PM +0000, root wrote:
>=20
> Hi,
>=20
> Sorry for taking so long to respond, here are a number of comments and
> questions.... this is also an opportunity for me to express all the ideas
> I've had over the last few weeks, so please excuse me if I ramble on too
> much :-)
>=20
> Firstly thanks for sending me the two patches. I've spoken to Jamal and
> the genetlink solution is certainly the right way to go, so I'll talk
> only about that particular patch from now on. You mentioned that RCU
> needed looking at in a previous email and I've taken a quick look over
> it and I'd agree that it looks like the current code is part way between
> RCU and a fully locked solution.

Agreed.  My understand of RCU is not complete.  I've been trying to
use the existing implementations throughout the kernel as examples,
but none seem to fit my scenario exactly.  I thought I had the shim_*
stuff correct (I used the dev_remove_pack/dev_add_pack as a guide).

> Probably the simplest thing to do is just to change the list_add_rcu()
> in shim.c:shim_proto_add() back to a standard list_add() until the list
> can be converted completely (and likewise list_del_rcu() in=20
> shim.c:shim_proto_remove()) although I'd agree that an RCU implementation
> would be much preferable, and possibly even a prerequisite for kernel
> inclusion due to the already (mostly) lockless routing code.

Understood.  I've converted back to list_del() and list_add().

> One question occurs to me though... did you consider using the xfrm code
> in order to interface with the higher layer protocols? I took a look at
> this recently since it seems to be in the right place in the stack to
> do this. Its fairly complicated and there seems to be at least a loose
> fit but I can see that some changes would be required. The issue of selec=
ting
> a forwarding class being the biggest potential issue. Still it appears to=
 me
> to be a lot cleaner to modify xfrm a bit and I suspect more likely to meet
> with more general approval.

Yes.  I recently spent a significant amount of time understanding the XFRM
code just to realize that it cannot be tied to a specific route.  The
selector mechanism is much like netfilter, ie it does not do longest prefix
match.  Infact implementing a XFRM shim module would bring route based
IPSEC VPNs to linux (without having to use a virtual interface).

I also looked at tc actions, but there too, tc is more like netfilter then
a LPM.

> One thing which struck me about the patch was although it does put
> in place a lot of the groundwork, I think a more ambitious approach would
> be more likely to be accepted. Going for small patches is good, but also
> the initial patch should add some (standalone) functionality to the kerne=
l,
> so there is also "too small". I'd suggest going for the minimum sized pat=
ch
> which can add a useful feature, say for example, adding just enough code =
to do
> MPLS forwarding and leaving other bits (the tunnel device which makes a n=
ice
> patch on its own and the interface with the higher level protocols) until
> later.

Agreed.  The first patch should contain the minimal MPLS implementation.

> This also brings me neatly on to the forwarding code. I see that this has
> been implemented in three parts, which if you'll excuse my ascii artistry,
> or lack of it, interact as follows:
>=20
>=20
>                 /-----\   /----\   /-------\
>  input from ----| ILM |---| XC |---| nhlfe |---- output to
>  netdevice      \-----/   \----/   \-------/     netdevice
>                    |                   |
>                    |                   |
>                  to higher          from higher
>                  protocols          protocols
>
> Now both the ILM and nhlfe are composed of radix tree tables and I'm
> curious as to why you chose this particular system over (say, for example)
> a plain hash table. The cross connect interface which updates the final
> instruction in the ILM entry to point to an nhlfe entry is also confusing
> me slightly as I don't see why the ILM can't have a forwarding instruction
> added to it directly when its created via the netlink message. Why the
> extra interface?

The XC netlink interface is there to assist signaling protocols.  It is
very common to create an ILM that terminates locally and then at a later
time XCs to a NHLFE, and at even a later time, swing the XC to a different
NHLFE.  With that being said, there is nothing that the XC netlink interface
does that cannot be done by just modifying the instructions via the ILM
netlink interface.

Why use a radix tree?  Originally it was just for ease of implementation.
Now it is because the radix tree lookup for the ILM provides deterministic
search times.  That being said, I have no problem with changing to a
multi tier hash scheme as long as it can provide better performance (includ=
ing
corner cases).

> I have been giving some thought as to the efficiency of the forwarding
> process itself recently, with the idea of "transcoding" the instructions
> as provided via netlink into an efficient byte code to allow faster
> execution. The would appear to be considerable scope for merging certain
> instructions (e.g. a pull followed by a push) into one internal instructi=
on
> (i.e. the interface would be the same and the effect the same so it
> wouldn't break the protocol at all).

I like the idea.  This is much like what I'm used to in the hardware
forwarding world. What you're kind of hinting at it a packet translation
engine, this would make it easier to map the forwarding of packets onto FPGA
or ASIC based hardware (isn't there a couple of projects doing this
for packet filtering? nf-HIPAC)

> It should be possible to calculate, for any given instruction sequence
> the maximum headroom required (in the skb) to execute it in advance
> such that it would be possible to guarantee that only a single call
> to skb_realloc_headroom would be required for any particular packet.
> A future development might even request that device drivers always send
> packets with a certain amount of headroom to prevent even this call
> being required. Currently I see that its assumed that 32 bytes will
> cover all eventualities, which seems a reasonable bet for most uses.

Currently the size of the stack is being tracked to handle MTU issues.
The really painful case for head allocation is ethernet over MPLS (22 bytes=
).

> Another feature of transcoding would be to remove useless instructions
> (NOP) from the execution path. Is there actually any practical purpose
> for this instruction? I'm afraid I can't see one.

There is no purpose to the NOOP code.  It is left over cruft that
can be removed.

> The various instructions to set/get tcindex and nfmark seem like a
> very good plan. I'm considering writing a patch to add setting nfmark
> through the ipv4/6/decnet routing tables which I think would be a
> generally useful plan. I wonder also if using one or the other or both
> of nfmark and/or tcindex as a key in looking up the nhlfe and/or ilm
> isn't a bad idea either.

That might be against the RFCs.  I know I'm already overstepping the
RFCs by allowing the EXP bits to determine a NHLFE.

> If nfmark could be 1:1 with mpls fec, then it might be possible to use
> it together with xfrm as the interface for higher level protocols.

Not sure I follow you here.  Currently with the shim setup there is no
NHLFE lookup in the forward path, the NHLFE is bound to the IPv4|6 route or
the eb|iptables rule.

> I also have the basics of a DECnet interface to mpls (not tested yet,
> I'm afraid) which I've roughed out. I'll send that to you if you are
> interested. It doesn't look a lot different to the ipv4 one which is
> no surprise since thats where DECnet's routing code comes from.

Cool.  Having another person implement to the shim interface is a good test
of the functionality.  Like I mentioned above, I have the beginning of a
XFRM shim implementation for the purpose of route based IPSEC VPNs.

> Steve.
>=20
>=20
>=20
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log fi=
les
> for problems?  Stop!  Download the new AJAX search engine that makes
> searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=3Dlnk&kid=3D103432&bid=3D230486&dat=
=3D121642
> _______________________________________________
> mpls-linux-devel mailing list
> mpl...@li...
> https://lists.sourceforge.net/lists/listinfo/mpls-linux-devel

--=20
James R. Leu
jl...@mi...