Re: [mpls-linux-devel] Re: 2.6 Spec: Random comments.
From: James R. L. <jl...@mi...> - 2004-02-13 17:14:39
On Fri, Feb 13, 2004 at 11:20:06AM -0500, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 09:39, James R. Leu wrote:
> > > > Given the above info, suggest a new name. Maybe NHid?
> >
> > Much better than FECid ;-) (although it is just a name ...)
>
> True, but it has to map to the semantics;
> Ok, NHid for now until something with a better ring shows up.
>
> > > Ok, that is useful. I have not tested multipath but it should
> > > work with Linux routing ECMP at least.
> > > I wouldn't call it NHLFE indices, rather these identifiers so far
> > > called fecid;
> > > Also I would think most of these lists would contain a single entry.
> > > BTW, the ILM is not multipath ready. We should be able to add that easily.
> > > Also there is no control on how the multipath selection is done with
> > > the Linux routing table - whatever Linux ECMP does goes.
> > > We should be able to fix the ILM with an algorithm selector.
> >
> > After looking at the code I would agree that whatever Linux multipath does
> > at the ingress LER, this code will follow. The real question is how to
> > go about supporting multipath as an LSR (one ILM needs to load balance over
> > multiple NHLFEs). Or dare I suggest p-mp LSPs?
>
> I've actually done some background compute on this, in my head at least.
> Here are my thoughts on paper, or electrons:
> An ILM table entry (struct ltable in the code) should have a new structure,
> call it nh_choice, which has the following entries:
>
>     function selector();
>     struct nh_info nh_list;
>
> nh_list would look like:
>
>     struct gnet_stats stats;  /* stats */
>     u32 lt_fecid;             /* change that to ilm_nhid */
>
> Note the above two entries currently reside in struct ltable.
>
> A packet coming in will have the usual lookup; then the entry's
> nh_choice->selector() will be invoked. It will return the nhid.
> The idea behind the selector() is that we can attach different algorithms
> via policy and make them take care of things like paths being down etc.
> I can think of two simple algorithms right away: random selection and
> RR. The idea is to open these algorithms to innovation.
>
> From user space this would look like:
>
> l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4

What about adding a new func ptr to the protocol driver? Then we could do
protocol dependent stuff like hashing the IPv4|6 header or the ethernet
header (ethernet over MPLS). The task is trivial if the stack only has one
label; for more than one label we would have to be creative: hash the label
stack, or use the PW ID (a suggestion in the PWE3 WG which adds a word after
the label stack to indicate what protocol lies below). The PW ID could be
used to look up the protocol driver to generate the hash. Or of course we
could just add an option for which algorithm to use. (A rough sketch of such
a selector appears further below.)

> etc.
>
> Thoughts?
>
> > I know you mentioned it is "not an index" but to me it seems like it really
> > _is_ an index for the NHLFE. Can multiple NHids correspond to the same
> > NHLFE? If it is a 1 to 1 mapping, for all intents and purposes it is an
> > index :-)
>
> Ok ;->
> How about NHkey? Maybe a prefix of mpls_ would also be good.
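To make the selector() idea quoted above a little more concrete, here is a
rough user-space sketch of a pluggable next-hop selector. It is only an
illustration of the technique: the names nh_choice, nh_info, rr_selector and
random_selector are invented for this sketch and do not exist in mpls-linux,
and a real in-kernel version would live in struct ltable and use the kernel's
stats and locking.

/*
 * Illustrative mock-up of the proposed nh_choice with pluggable
 * next-hop selection algorithms (round robin and random).
 */
#include <stdio.h>
#include <stdlib.h>

struct nh_info {
	unsigned int nhid;       /* would be ilm_nhid in the proposal */
	unsigned long packets;   /* stand-in for struct gnet_stats    */
};

struct nh_choice {
	/* selector() returns an index into nh[], i.e. which next hop to use */
	unsigned int (*selector)(struct nh_choice *nhc);
	unsigned int nh_count;
	unsigned int rr_next;    /* private state for the RR algorithm */
	struct nh_info nh[4];
};

/* Round robin: rotate through the configured next hops */
static unsigned int rr_selector(struct nh_choice *nhc)
{
	unsigned int idx = nhc->rr_next;

	nhc->rr_next = (idx + 1) % nhc->nh_count;
	return idx;
}

/* Random selection: pick any configured next hop */
static unsigned int random_selector(struct nh_choice *nhc)
{
	return (unsigned int)rand() % nhc->nh_count;
}

int main(void)
{
	/* Mirrors:
	 * l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4
	 */
	struct nh_choice nhc = {
		.selector = rr_selector,
		.nh_count = 3,
		.nh       = { { .nhid = 2 }, { .nhid = 3 }, { .nhid = 4 } },
	};
	int i;

	for (i = 0; i < 6; i++) {
		unsigned int idx = nhc.selector(&nhc);

		nhc.nh[idx].packets++;
		printf("packet %d -> nhid %u\n", i, nhc.nh[idx].nhid);
	}

	/* Switching policy (e.g. to a hash over the IP header or label
	 * stack) is just a matter of pointing selector at another function. */
	nhc.selector = random_selector;
	return 0;
}

The point is the same one made above: the per-ILM state and the choice of
algorithm are separated, so new selectors can be added without touching the
forwarding path.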
> > > dsts are still managed from the MPLS code. There is some generic stuff
> > > (create, destroy, gc etc) for which there is no point in recreating in
> > > the MPLS code.
> > > The way it is right now works fine. What could probably have been a
> > > better approach is to stack dsts. It would require some surgery and I am
> > > not sure I have the patience for it. Maybe we can ask Dave on his
> > > thoughts on this.
> >
> > Currently we use dst stacking. The 'child' dst is actually a static member
> > of the 'outgoing label info' (NHLFE). So when the skb reaches the exit of
> > IPv4|6, a check for the child is done. The skb->dst is replaced with the
> > child dst and the child output function is called (which sends it into
> > MPLS land). The entrance to MPLS land uses the "container_of" macro to get
> > the NHLFE used to forward the packet. How the stacked dst is created is
> > similar to your scheme. I was wondering if XFRM is a better scheme to use
> > for all of this?
>
> Sorry, I meant XFRM.
> I am indifferent whether we change it to your scheme or leave it as is.
> I will have to look at your code to make a better judgement. My thinking
> would be that the end goal should be NOT to touch the IPv4/6 code with
> ifdefs unless necessary. If there is not a huge difference in terms of
> efficiency or code beautification I would rather stick to the current code.
> BTW if you point me to the latest code I will print it and read offline
> over the weekend if possible.

Here are some snippets. I think XFRM may remove the need for these, but for
now it works.

Setup the dst stacking
----------------------
net/mpls/mpls_output.c

int mpls_set_nexthop (struct dst_entry *dst, u32 nh_data, struct spec_nh *spec)
{
	struct mpls_out_info *moi = NULL;
	MPLS_ENTER;
	moi = mpls_get_moi(nh_data);
	if (unlikely(!moi))
		return -1;

	dst->metrics[RTAX_MTU-1] = moi->moi_mtu;
	dst->child = dst_clone(&moi->moi_dst);

	MPLS_DEBUG("moi: %p mtu: %d dst: %p\n", moi, moi->moi_mtu, &moi->moi_dst);
	MPLS_EXIT;
	return 0;
}

mpls_set_nexthop is called from ipv4:rt_set_nexthop and from
ipv6:ip6_route_add (I have a 'special nexthop' system developed which would
be replaced by XFRM). It is very similar to your RTA_MPLS_FEC, but has two
pieces of data, a RTA_SPEC_PROTO and a RTA_SPEC_DATA. It is intended for
multiple protocols to be able to register a special nexthop. Right now only
MPLS registers :-) Again, I have every intention of ripping it out in favor
of XFRM.

Using the dst stack
-------------------
net/ipv4/ip_output.c

static inline int ip_finish_output2(struct sk_buff *skb)
{
	struct dst_entry *dst = skb->dst;
	struct hh_cache *hh = dst->hh;
	struct net_device *dev = dst->dev;
	int hh_len = LL_RESERVED_SPACE(dev);

	if (dst->child) {
		skb->dst = dst_pop(skb->dst);
		return skb->dst->output(skb);
	}
	...

Something very similar exists in net/ipv6/ip6_output.c ip6_output_finish().

> I may be a bit slow responding now since I am at work.
>
> cheers,
> jamal

-- 
James R. Leu
jl...@mi...
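The "entrance to MPLS land" half is only described in prose above, so here is
a minimal user-space sketch of the container_of() trick it relies on. The
struct and field names (mpls_out_info, moi_dst) are taken from the snippets
above, but the tiny dst_entry and the mpls_output() function here are
stand-ins invented for the sketch; the real output hook takes an sk_buff.

#include <stddef.h>
#include <stdio.h>

/* Same idea as the kernel's container_of(): given a pointer to a member,
 * recover a pointer to the structure that embeds it.                     */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct dst_entry {
	int (*output)(struct dst_entry *dst);  /* stand-in for dst->output(skb) */
};

struct mpls_out_info {                 /* the NHLFE */
	unsigned int moi_label;
	unsigned int moi_mtu;
	struct dst_entry moi_dst;      /* the embedded 'child' dst */
};

/* What the MPLS output path does after ip_finish_output2() has popped the
 * stacked dst: recover the NHLFE that embeds this dst and forward with it. */
static int mpls_output(struct dst_entry *dst)
{
	struct mpls_out_info *moi =
		container_of(dst, struct mpls_out_info, moi_dst);

	printf("forwarding with label %u, mtu %u\n",
	       moi->moi_label, moi->moi_mtu);
	return 0;
}

int main(void)
{
	struct mpls_out_info moi = {
		.moi_label = 22,
		.moi_mtu   = 1500,
		.moi_dst   = { .output = mpls_output },
	};
	struct dst_entry *child = &moi.moi_dst;

	/* ip_finish_output2() effectively does: skb->dst = dst_pop(skb->dst);
	 * return skb->dst->output(skb);  -- here reduced to:                  */
	return child->output(child);
}

Because the child dst is a static member of the NHLFE, no extra lookup is
needed on the MPLS side; the dst pointer itself carries the forwarding state.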