Re: [mpls-linux-devel] Re: 2.6 Spec: Random comments.
From: James R. L. <jl...@mi...> - 2004-02-15 07:28:25
On Fri, Feb 13, 2004 at 05:58:03PM -0500, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 12:12, James R. Leu wrote:
> >
> > > From user space this would look like:
> > >
> > > l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4
> >
> > What about adding a new func ptr to the protocol driver.  Then we could
> > do protocol dependent stuff like hashing the IPv4|6 header or ethernet
> > header (ethernet over MPLS).
>
> Ok, so you are looking at only IP packets at the edge of an MPLS
> network. Describe a little packet walk. Are you planning to
> not use the ECMP features?

It could be any protocol we map onto an LSP (ie ethernet/atm/fr over MPLS);
you just have to add a protocol driver for it.  The ECMP feature only helps
you at the ingress LER.  You need something to handle load balancing in the
core of the MPLS domain.

ECMP example:

                     -------             -------
                    |       |           |       |
           .--1G----| LSR 1 |---100M----| LSR 2 |----1G---.
          /         |       |           |       |          \
 ---------/          -------             -------            \---------
| Ingress |                                                 | Egress  |
|   LER   |                                                 |   LER   |
 ---------\          -------             -------            /---------
          \         |       |           |       |          /
           `--1G----| LSR 3 |---100M----| LSR 4 |----1G---'
                    |       |           |       |
                     -------             -------

In the above case ECMP will allow a max traffic of 200M between ingress
and egress.

Load balancing example:

 ---------           -------             -------            ---------
|         |         |       |---100M----|       |          |         |
| Ingress |----1G---| LSR 1 |---100M----| LSR 2 |----1G----| Egress  |
|   LER   |         |       |---100M----|       |          |   LER   |
 ---------           -------             -------            ---------

Without load balancing LDP would create 1 LSP for traffic going from
ingress to egress, so the max traffic you could send from ingress to
egress is 100M.  With load balancing LDP still sets up 1 LSP from ingress
to egress, but when LSR 2 advertises a label to LSR 1, LSR 1 realizes it
has 3 adjacencies to LSR 2 and creates 3 NHLFEs, one on each of the links.
It then uses some mechanism to load balance traffic arriving on its 1 ILM
onto the 3 NHLFEs.
In the single label case, by looking at the protocol ID associated with
the ILM, and doing a little layer violation ;-), we can do per-flow
hashing and map flows to the various NHLFEs.  Now the max traffic between
ingress and egress is 300M.

> > The task is trivial if the stack only has one label; for more than one
> > label we would have to be creative.  Hashing the label stack, or using
> > the PW ID (a suggestion in the PWE3 WG which adds a word after the
> > label stack to indicate what protocol lies below).  The PW ID could be
> > used to look up the protocol driver to generate the hash.
>
> Point me to some doc if you dont mind. Is this for some of the VPN
> encapsulations?

http://www.ietf.org/internet-drafts/draft-allan-mpls-pid-00.txt

> > Or of course we could just add an option for which algo to use.
>
> Note what i suggested is only for ILM level; And there you could add any
> algorithms you want. With the protocol driver are you suggesting to do
> something at the IPV4/6 FTN level only?

To be able to load balance and guarantee packet order, you need to know
what is underneath the label stack.  With just one label it is trivial to
figure out what is under the label stack.  With more than one, it isn't so
easy (the LSR that needs to do the load balancing was not involved in the
signaling of any of the labels past the first one).  Currently vendors do
some nasty hacking: they look at the first nibble after the label stack,
and if it is a 4, they assume IPv4.  They build the appropriate hash and
use that to select the outgoing NHLFE.

> > Here are some snippets.  I think XFRM may remove the need for these,
> > but for now it works.
> >
> > Setup the dst stacking
> > ----------------------
> >
> > net/mpls/mpls_output.c
> >
> > int
> > mpls_set_nexthop (struct dst_entry *dst, u32 nh_data, struct spec_nh *spec)
> > {
> >     struct mpls_out_info *moi = NULL;
>
> I take it mpls_out_info is an nhlfe entry?
> >     MPLS_ENTER;
> >     moi = mpls_get_moi(nh_data);
> >     if (unlikely(!moi))
> >         return -1;
> >
> >     dst->metrics[RTAX_MTU-1] = moi->moi_mtu;
> >     dst->child = dst_clone(&moi->moi_dst);
> >     MPLS_DEBUG("moi: %p mtu: %d dst: %p\n", moi, moi->moi_mtu,
> >                &moi->moi_dst);
> >     MPLS_EXIT;
> >     return 0;
> > }
> >
> > mpls_set_nexthop is called from ipv4:rt_set_nexthop and from
> > ipv6:ip6_route_add (I have a 'special nexthop' system developed which
> > would be replaced by XFRM).  It is very similar to your RTA_MPLS_FEC,
> > but has 2 pieces of data, a RTA_SPEC_PROTO and a RTA_SPEC_DATA.  It is
> > intended for multiple protocols to be able to register special nexthops.
> > Right now only MPLS registers :-)  Again I have every intention of
> > ripping it out in favor of XFRM.
> >
> > Using the dst stack
> > -------------------
> >
> > net/ipv4/ip_output.c
> >
> > static inline int ip_finish_output2(struct sk_buff *skb)
> > {
> >     struct dst_entry *dst = skb->dst;
> >     struct hh_cache *hh = dst->hh;
> >     struct net_device *dev = dst->dev;
> >     int hh_len = LL_RESERVED_SPACE(dev);
> >
> >     if (dst->child) {
> >         skb->dst = dst_pop(skb->dst);
> >         return skb->dst->output(skb);
> >     }
> >     ...
> >
> > Something very similar exists in net/ipv6/ip6_output.c ip6_output_finish()
>
> On the outset this does look a bit cleaner but i would have to ping my
> brain on Daves approach. Take a look at his code.
> Q: Can you stack more than one of those dsts? If yes, then it may be
> even safer to have the nhlfe_route in the dst instead, no?
> i.e how sure can you be that child will be MPLS related; in other case
> it is guaranteed to (it does say dst->xxmplsxx).

Since we use the child's output pointer, IPv4|6 don't care if it is MPLS.
I suppose the same check for child could be made in the MPLS output path;
then yes, you could have more than one child stacked.  I'm not sure this
would be very optimal for creating hierarchical LSPs (I think that is
what you're alluding to).
> There are a few pieces for the current approach that i didnt like;
> example the net_output_maybe_reroute() thing. Or having to mod dst.c
> to add ifdefs for MPLS. There could be a marriage of the two approaches
> maybe?

After getting the feedback from David, XFRM will have to wait, and I think
the dst stacking is cleaner.

> cheers,
> jamal

-- 
James R. Leu
jl...@mi...