mpls-linux-devel Mailing List for MPLS for Linux (Page 29)
Status: Beta
Brought to you by:
jleu
From: Jamal H. S. <ha...@zn...> - 2004-02-16 21:45:33
BTW, I am fine with whatever you guys end up picking for the code
repository; you will have to teach me about its usage. p4 sounds good.

Also James, i know you are a big fan of UML - i am trying to see if it is
valuable - getting tired of hanging my laptop (though i run ext3 these
days ;->); so if you can share your setup for a test environment i would
appreciate it. I looked at Qemu and it does look very interesting; any
thoughts on that?

On Mon, 2004-02-16 at 12:47, Ramon Casellas wrote:
> On 16 Feb 2004, Jamal Hadi Salim wrote:
>
> > Ok. So we may need some extra specialized NHLFE entries. I am not a big
> > fan of the two step process unless you guys really insist - then we can
> > go and convince davem.
>
> Well, the problem with CR-LDP and/or RSVP is that it is a 'ping-pong' set
> up process, and you usually need to define a 'prestate'. Another
> possibility is to consider RSVP as using the unsolicited downstream
> label distribution and only process the RSVP-RESV message from control
> space (when the message comes up from your downstream router), I am not
> sure about this though.

Could something in user space be responsible for maintaining the
prestate? When full state is available, it gets downloaded to the kernel.

> > My opinion is lets have 3 new special NHLFEs:
> >
> > - something that sends the packet to a blackhole which will work for
> > such a scenario as above.
>
> A 'disabled' NHLFE. I think that this can be useful, for example for
> liberal retention mode.

Ok, so we could add something like this:

  l2c mpls nhlfe add dev eth0 proto ipv4 nhlfeid 3 blackhole

> > - Another one will send the packet to user space via netlink. This may
> > also be used for resolving what you have above.
>
> So we can conform to the RFC (although sometimes it is just IETF jargon)
> But the question is 'which packet?' I assume that it is the first packet
> that according to the FIB_RES should be mapped to a NHLFEid that just does
> not exist. Don't we risk flooding userspace? Should it be only the first
> packet? what about a single netlink event (in plain english: hey, I don't
> know what to do with this FEC, can you do something about it?)

Well, something along the same lines. Example:

  l2c mpls nhlfe add dev eth0 proto ipv4 nhlfeid 4 control-redirect

The above could be a result of intentional policy, such as being preceded
by:

  l2c mpls ilm add dev eth0 label 9 nhlfeid 4

or a result of it being the default NHLFE rule which gets consulted
because nothing else was found, example:

  l2c mpls nhlfe add dev eth0 nhlfeid 4 default control-redirect

Thoughts?

> > - A third one is for locally destined packets. I was not sure whether
> > this should just be a flag which says neighbor = local or not.
>
> IIRC, locally destined packets means that the LSR is egress (for all
> hierarchical levels) and pops the last packet. As one possibility, the
> default action should be just call IP module packet reception if we just
> popped the last label, so the packet is locally delivered or forwarded per
> dest address.

If the last label has been popped then it would make sense to redirect to
the stack. The one that i was worried about is it having a stack of labels
hiding a local host IP packet. Can we assume that the user can shoot
themselves in the foot and we wouldn't care?

cheers,
jamal
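Editor's sketch: the three special NHLFE kinds proposed in this message (blackhole, control-redirect, local delivery) plus the "default" fallback rule can be pictured as a small dispatch step. This is only an illustration under stated assumptions: every name below (nhlfe_kind, verdict, nhlfe_lookup, nhlfe_dispatch) is hypothetical and is not taken from the kernel patch or from the l2c tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical NHLFE kinds mirroring the proposal in the thread. */
enum nhlfe_kind {
    NHLFE_FORWARD,          /* ordinary entry: apply label ops, send to neighbour */
    NHLFE_BLACKHOLE,        /* 'disabled' entry: drop silently */
    NHLFE_CONTROL_REDIRECT, /* hand the packet (or an event) to user space */
    NHLFE_LOCAL             /* last label popped: deliver to the local IP stack */
};

/* What the forwarding path should do with the packet. */
enum verdict { V_FORWARD, V_DROP, V_TO_USERSPACE, V_LOCAL_DELIVER };

struct nhlfe {
    uint32_t id;            /* the nhlfeid given on the command line */
    enum nhlfe_kind kind;
};

/* Linear lookup by id; falls back to the default entry (may be NULL)
 * when nothing else was found, as in the "default control-redirect" rule. */
const struct nhlfe *nhlfe_lookup(const struct nhlfe *tbl, size_t n,
                                 uint32_t id, const struct nhlfe *dflt)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].id == id)
            return &tbl[i];
    return dflt;
}

/* Map an entry to a verdict; a missing entry is treated as a drop. */
enum verdict nhlfe_dispatch(const struct nhlfe *e)
{
    if (e == NULL)
        return V_DROP;
    switch (e->kind) {
    case NHLFE_BLACKHOLE:        return V_DROP;
    case NHLFE_CONTROL_REDIRECT: return V_TO_USERSPACE;
    case NHLFE_LOCAL:            return V_LOCAL_DELIVER;
    case NHLFE_FORWARD:          return V_FORWARD;
    }
    return V_DROP;
}
```

With a table holding id 3 as blackhole and id 4 as the default control-redirect entry, a lookup of an unknown id ends up at user space, which is the flooding concern Ramon raises; rate limiting or a single netlink event would have to sit behind that verdict.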
From: Ramon C. <cas...@in...> - 2004-02-16 17:52:01
On 16 Feb 2004, Jamal Hadi Salim wrote:
> On Sun, 2004-02-15 at 05:36, Ramon Casellas wrote:
> Some comments:

Ok. I'll do it asap. (nhlfeid ;) so when we prefix it there is just one
underscore... agreed? :)

> What would be useful in this document as well is to describe the
> interface between the kernel and user space. i.e describe the packets

Agreed. I'll take care of this.

> For how we document this typically look at:
> http://www.faqs.org/rfcs/rfc3549.html

ok.

> Ok. So we may need some extra specialized NHLFE entries. I am not a big
> fan of the two step process unless you guys really insist - then we can
> go and convince davem.

Well, the problem with CR-LDP and/or RSVP is that it is a 'ping-pong' set
up process, and you usually need to define a 'prestate'. Another
possibility is to consider RSVP as using the unsolicited downstream
label distribution and only process the RSVP-RESV message from control
space (when the message comes up from your downstream router), I am not
sure about this though.

> My opinion is lets have 3 new special NHLFEs:
> - something that sends the packet to a blackhole which will work for
> such a scenario as above.

A 'disabled' NHLFE. I think that this can be useful, for example for
liberal retention mode.

> - Another one will send the packet to user space via netlink. This may
> also be used for resolving what you have above.

So we can conform to the RFC (although sometimes it is just IETF jargon)
But the question is 'which packet?' I assume that it is the first packet
that according to the FIB_RES should be mapped to a NHLFEid that just does
not exist. Don't we risk flooding userspace? Should it be only the first
packet? what about a single netlink event (in plain english: hey, I don't
know what to do with this FEC, can you do something about it?)

> - A third one is for locally destined packets. I was not sure whether
> this should just be a flag which says neighbor = local or not.

IIRC, locally destined packets means that the LSR is egress (for all
hierarchical levels) and pops the last packet. As one possibility, the
default action should be just call IP module packet reception if we just
popped the last label, so the packet is locally delivered or forwarded per
dest address.

Thanks,
R.
From: Ramon C. <cas...@in...> - 2004-02-16 17:27:46
On Mon, 16 Feb 2004, James R. Leu wrote:
> Just an question I'd like to pose to the group. I will setup
> I already have a working p4 based system which Ramon and myself have been
> I prefer P4, because of it's branching/integration and because I have
> a bunch of scripts for helping me manage a P4 tree.

I'm fine with that... Anyway my algorithm is quite simple actually:

  cd workplace
  p4 sync
  p4 edit file.c
  add nasty bug to file.c
  p4 submit file.c

But seriously, I can use any other system that you may choose.

R.
From: James R. L. <jl...@mi...> - 2004-02-16 17:14:24
Just a question I'd like to pose to the group. I will set up a revision
control system for us to work from; the question is which one? I already
have a working p4 based system which Ramon and myself have been working
from, or I could set up TLA (Arch). Both are very capable systems, but
have pros/cons. Here is my take on them:

P4
--
-fast (provided there is enough bandwidth from the server :-)
-very scalable
-good permission system
-I'm very familiar with managing a p4 depot
-proprietary system :-(
-they have granted me an open source user's licence

Arch
----
-Open source
-poor permission system
-support for distributed trees (if we can figure it out)
-I use it in my day to day job, so I have some good mileage with it

I prefer P4, because of its branching/integration and because I have
a bunch of scripts for helping me manage a P4 tree.

Let me know what you guys think.

--
James R. Leu
jl...@mi...
From: Jamal H. S. <ha...@zn...> - 2004-02-16 15:54:15
On Sun, 2004-02-15 at 05:36, Ramon Casellas wrote:
> re-hi,
>
> FYI:
> http://www.enst.fr/~casellas/mpls-linux-2.6/spec/spec.pdf
> http://www.enst.fr/~casellas/mpls-linux-2.6/spec/index.html
>
> Work in progress. Things to fix.

That document looks pretty now ;->
Some comments:

- Names: List by last name first (Casellas, Hadi Salim, Leu, Miller).
  This way no one interprets the documentation to mean it is listed by
  contribution.
- General comments: Originally the doc was written informally with "I"
  meaning myself. It's all over the doc. You may wanna fix that.
- 1.2.1: The TODO is a separate document now.
- 1.2.2.5: You still have that fecid in there. We may also need to
  provide an example on ECMP using "ip route nexthop .."
- General: there should be consistency with the name for nhid - at times
  it reads nhlfe_id and others nhlfeid.
- 1.2.3.1: Your comment on downstream on demand; i will respond below
  since you have that comment in this email as well.
- figure 1.1: Logically you should draw an arrow from the route cache to
  the NHLFE entry without anything in between. Implementation wise at the
  moment there is a dst->mpls_route; but lets ignore that since we are
  not talking implementation here.

What would be useful in this document as well is to describe the
interface between the kernel and user space, i.e. describe the packets
used and events generated (at the moment any addition to NHLFE or ILM
will generate an event); start by looking at include/linux/l2cnetlink.h;
to cut-n-paste from there:

----------------------------------
/* ILM related */
struct ilmmsg {
	__u32 in_fecid;
	__u32 in_ifindex;
	__u32 in_space;
	__u32 in_label;
	__u8  in_owner;
};

/* ILM attributes */
enum {
	ILM_UNSPEC,
	ILM_STATS,
};

/* NHLFE related */
struct nhlfemsg {
	__u32 nh_fecid;
	__u32 nh_index;
	__u32 nh_ifindex;
	__u32 nh_space;
	__u32 nh_class;
	__u32 nh_flags;
	__u8  nh_owner;
	__u8  nh_proto;
	__u8  nh_dscp;
	__u8  nh_ttl;
	__u32 nh_ltype;
};

/* owner - who installed the rule */
enum {
	L2C,	/* the l2c tool */
};

/* nh_proto choices */
enum {
	MPLS_IPV4,
	MPLS_IPV6,
};

/* nh_flags */
#define MPLS_FLAG_I_TC_INDEX	0x01	/* Input: Classify packet */
#define MPLS_FLAG_I_DIFFSERV	0x02	/* Input: Propagate diffserv bits */
#define MPLS_FLAG_O_TC_INDEX	0x04	/* Output: Classify packet */
#define MPLS_FLAG_O_DIFFSERV	0x08	/* Output: Propagate diffserv bits */
#define MPLS_FLAG_TTL_PROPAGATE	0x10	/* Input/Output: TTL propagation */
#define MIR_FLAG_TTL_PROPAGATE	MPLS_FLAG_TTL_PROPAGATE

/* NHLFE attributes */
enum {
	NH_UNSPEC,
	NH_OP_INS,
	NH_STATS,
	NH_NEIGH_IP,
};

struct mpls_op_u {
	__u32 op;
	__u32 operand;
};
----------------------------------------

For how we document this typically look at:
http://www.faqs.org/rfcs/rfc3549.html

> w.r.t Downstream on demand:
>
> I *do* think it's valuable and we *must* support it (e.g RSVP-TE). If I
> cannot use RSVP-TE to setup LSPs in Linux I'm going back right now to
> James implementation ;-)
>
> RCAS: I think we are confusing label distribution modes with
> implementation details. RSVP-TE uses downstream on demand as a label
> distribution mode, and it could be implemented as a two step process where
> a dummy NHLFE is created during the RSVP_PATH message so the ILM may point
> to it and then replaced by the right one upon reception of the RSVP_RESV.

Ok. So we may need some extra specialized NHLFE entries. I am not a big
fan of the two step process unless you guys really insist - then we can
go and convince davem.
My opinion is lets have 3 new special NHLFEs:

- something that sends the packet to a blackhole which will work for
  such a scenario as above.
- Another one will send the packet to user space via netlink. This may
  also be used for resolving what you have above.
- A third one is for locally destined packets. I was not sure whether
  this should just be a flag which says neighbor = local or not.

> What we do not support is sending orphan packets to userspace, and that
> when an entry is added in the ILM or FTN there must be an existing
> NHLFEid. I'm not saying that we need to match the exact words of the RFC,
> but we can (and must) support downstream on demand.

sure. Let me know what you think of the above.

> > him, if theres a > 20% improvement we have more strength);-> I wanna see
> > MPLS in 2.6 soon and i think this is the fastest way to get there.
>
> Well, me too :) but I'd rather see it in 2.6 when it's ready. Most
> probably you're right and it is....

As you can see, we are fixing things; good it didn't go in right away.

> > For example for something like the ILM, where lookup is based on a 12
> > bit label, then i would think making it anything more than a hash and
> (...)
> > make it 256 buckets and suddenly you are looking at 16 worst case.
>
> Where do you get this numbers ? :)
> I thought the label was 20 bits, and
> with 256 buckets (2**8) you have 2**12 = 4096 worst case.
> even with 1024 buckets you get 1024 worst case.

Never mind - too many things being computed in my brain. I was thinking
of VLAN tags.

> am I missing something? Are these values acceptable? In that case I'll
> shut up :)

Well, have some student do a project ;-> Let them measure the performance
differences under different scenarios with hash-and-walk vs radix tree or
another funky lookup scheme for say many many entries.. With data we can
challenge the current scheme.

cheers,
jamal
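Editor's sketch: the bucket arithmetic traded in this exchange can be checked with a few lines of C. The helper name worst_case_chain is hypothetical, and the model is an assumption: every possible key value is installed and the hash spreads keys evenly over the buckets.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case entries per bucket when all 2^key_bits keys are installed
 * and spread evenly over n_buckets (assumed even distribution). */
uint32_t worst_case_chain(unsigned key_bits, uint32_t n_buckets)
{
    assert(key_bits < 32 && n_buckets > 0);
    uint32_t keys = (uint32_t)1 << key_bits;
    return (keys + n_buckets - 1) / n_buckets;   /* round up */
}
```

Under that model a 20-bit MPLS label gives 4096 entries per bucket with 256 buckets and 1024 with 1024 buckets, matching Ramon's figures; the 64 and 16 quoted earlier follow from a 12-bit key such as a VLAN tag.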
From: Jamal H. S. <ha...@zn...> - 2004-02-16 14:25:38
On Sun, 2004-02-15 at 05:36, Ramon Casellas wrote:
> re-hi,
>
> FYI:
> http://www.enst.fr/~casellas/mpls-linux-2.6/spec/spec.pdf
> http://www.enst.fr/~casellas/mpls-linux-2.6/spec/index.html
>
> Work in progress. Things to fix.

I will print this and look at it then respond to the rest of your email.
Off to the office so response may be a little slow.

cheers,
jamal
From: Jamal H. S. <ha...@zn...> - 2004-02-16 14:23:34
On Sun, 2004-02-15 at 02:25, James R. Leu wrote:
> It could be any protocol we map onto an LSP (ie ethernet/atm/fr over MPLS),
> you just have to add a protocol driver for it.

And the reason you want to do it at the protocol level is because you can
classify better?

> The ECMP feature only helps you at the ingress LER. You need something
> to handle load balancing in the core of the MPLS domain.

Agreed, so in my earlier email i said we had no control over ECMP, i.e. we
are at the mercy of linux V4/6 ECMP. At the ILM level on the other hand
(for LSRs) we do have more control.

> ECMP example:
>
>                   -------            -------
>                  |       |          |       |
>        .--1G-----| LSR 1 |---100M---| LSR 2 |----1G---.
>       /          |       |          |       |          \
>  ---------/       -------            -------       \--------
> | Ingress |                                        | Egress  |
> |   LER   |                                        |  LER    |
>  ---------\       -------            -------       /--------
>       \          |       |          |       |          /
>        `--1G-----| LSR 3 |---100M---| LSR 4 |----1G---'
>                  |       |          |       |
>                   -------            -------
>
> In the above case ECMP will allow a max traffic of 200M between
> ingress and egress.

Ok

> Load balancing example:
>
>  ---------            -------            -------            --------
> |         |          |       |---100M---|       |          |        |
> | Ingress |----1G----| LSR 1 |---100M---| LSR 2 |----1G----| Egress |
> |   LER   |          |       |---100M---|       |          |  LER   |
>  ---------            -------            -------            --------
>
> Without load balancing LDP would create 1 LSP for traffic going
> from ingress to egress. The max traffic you could send from ingress
> to egress is 100M. With load balancing LDP still sets up 1 LSP from
> ingress to egress, but when LSR2 advertises a label to LSR1, LSR1 realizes
> it has 3 adj to LSR2 and creates 3 NHLFEs, one on each of the links. It then
> uses some mechanism to load balance traffic arriving on its 1 ILM onto
> the 3 NHLFEs. In the single label case, looking at the protocol ID
> associated with the ILM and doing a little layer violation ;-) and we
> can do per flow hashing and map flows to the various NHLFEs. Now the
> max traffic between ingress and egress is 300M.

Gotcha. so that balancing is done at the ILM level, correct?
So that little violation or peeking is, i take it, the reason you want the
protocol extension to be added?

> > > The task is trivial if the stack only has one label, for more than one label
> > > we would have to be creative. Hashing the label stack, or use the PW ID
> > > (suggestion in PWE3 WG which adds a word after the labelstack to indicate
> > > what protocol lies below.) The PW ID could be used to lookup the protocol
> > > driver to generate the hash.
> >
> > Point me to some doc if you dont mind. Is this for some of the VPN
> > encapsulations?
>
> http://www.ietf.org/internet-drafts/draft-allan-mpls-pid-00.txt

I'll read the draft; i know the author from my nortel days.
If i understood correctly, this is now introducing an extra piece of data
in the packet?
Note, as i described earlier, we should be able to just look at anything
on the packet with the u32 classifier which can be activated before MPLS
ILM is consulted. Also based on the top label we can do a classification
again to peek into further packet data before making a decision on the
next hop.

> > > Or of course we could just add an options for which algo to use.
> >
> > Note what i suggested is only for ILM level; And there you could add any
> > algorithms you want. With the protocol driver are you suggesting to do
> > something at the IPV4/6 FTN level only?
>
> To be able to load balance and guarantee packet order, you need to know
> what is underneath the label stack. With just one label it is trivial to
> figure out what is under the label stack. With more than one, it isn't
> so easy (the LSR that needs to do the load balancing was not involved in the
> signaling of any of the labels past the first one). Currently vendors do
> some nasty hacking. Look at the first nibble after the label stack, if it
> is a 4, they assume IPv4. They build the appropriate hash and use that
> to select the outgoing NHLFE.

Why can't you look? Is this because ASICs are already built?
You know precisely where the label stack is going to end, no? Can you not
then offset to that position and figure out what the next data level is?

> Since we use the childs output pointer, IPv4|6 don't care if it is MPLS.
> I suppose the same check for child could be made in MPLS output, then yes
> you could have more than one child stacked. I'm not sure if this would
> be very optimal for creating hierarchical LSPs (I think that is what
> you're alluding to).

Ok, that sounds reasonable. For starters dont even talk about
hierarchical LSPs ;->
Our challenge is to get rid of dst->mpls .. then go to David with this
one change - I think its above 5% value add;->.
Are you going to make the change?

cheers,
jamal
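Editor's sketch: Jamal's point that "you know precisely where the label stack is going to end" rests on the bottom-of-stack (S) bit of the standard 4-byte label stack entry (20-bit label, 3-bit Exp, 1-bit S, 8-bit TTL). A minimal walk to the payload offset might look like this; the function name mpls_payload_offset is hypothetical and the code is not taken from either implementation discussed in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the label stack starting at pkt and return the byte offset of the
 * first payload byte, or -1 if no bottom-of-stack entry is found within
 * len bytes. Entries are 4 bytes in network byte order; the S bit is the
 * least significant bit of the third byte. */
long mpls_payload_offset(const uint8_t *pkt, size_t len)
{
    size_t off = 0;

    while (off + 4 <= len) {
        int bottom = pkt[off + 2] & 0x01;   /* S bit of this entry */
        off += 4;
        if (bottom)
            return (long)off;
    }
    return -1;
}
```

Finding the offset is the easy part; what the walk cannot tell a transit LSR is which protocol starts at that offset, which is the gap the first-nibble guess and the protocol-ID draft try to fill.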
From: Ramon C. <cas...@in...> - 2004-02-15 10:39:44
re-hi,

FYI:
http://www.enst.fr/~casellas/mpls-linux-2.6/spec/spec.pdf
http://www.enst.fr/~casellas/mpls-linux-2.6/spec/index.html

Work in progress. Things to fix.

Regards,
Ramon

w.r.t Downstream on demand:

I *do* think it's valuable and we *must* support it (e.g RSVP-TE). If I
cannot use RSVP-TE to setup LSPs in Linux I'm going back right now to
James implementation ;-)

RCAS: I think we are confusing label distribution modes with
implementation details. RSVP-TE uses downstream on demand as a label
distribution mode, and it could be implemented as a two step process where
a dummy NHLFE is created during the RSVP_PATH message so the ILM may point
to it and then replaced by the right one upon reception of the RSVP_RESV.
What we do not support is sending orphan packets to userspace, and that
when an entry is added in the ILM or FTN there must be an existing
NHLFEid. I'm not saying that we need to match the exact words of the RFC,
but we can (and must) support downstream on demand.

On 14 Feb 2004, Jamal Hadi Salim wrote:
> As i said before this is NOT my implementation. I tried to document and
> sanitize what it does - mostly so we can have a useful discussion. My

right, sorry. It's DaveM's implementation.

> him, if theres a > 20% improvement we have more strength);-> I wanna see
> MPLS in 2.6 soon and i think this is the fastest way to get there.

Well, me too :) but I'd rather see it in 2.6 when it's ready. Most
probably you're right and it is....

> For example for something like the ILM, where lookup is based on a 12
> bit label, then i would think making it anything more than a hash and
(...)
> make it 256 buckets and suddenly you are looking at 16 worst case.

Where do you get these numbers? :)
I thought the label was 20 bits, and
with 256 buckets (2**8) you have 2**12 = 4096 worst case.
even with 1024 buckets you get 1024 worst case.

am I missing something? Are these values acceptable? In that case I'll
shut up :)

> So evaluate for each table what needs to be done then make a call.

You are right. A choice should be made with performance numbers around.

R.
From: Ramon C. <cas...@in...> - 2004-02-15 07:45:32
Jamal,

Glad to know your wife is pregnant. Best wishes :)

On 14 Feb 2004, Jamal Hadi Salim wrote:
> IIRC correclty from those old days, there exists some form of LDP
> implementation. That could be part of the userspace tools or safer
> separate.

James, How do you see porting LDP portable to this version?

> You have access to the patches. What more do you want?
> OK, Set up a CVS repository. I am old fashioned and dont use it very
> much. I still refuse to use bitkeeper. The easiest thing for me is
> people send me patches and i merge them. Again i dont care if its CVS.
> Maybe we can try something more exciting like that competition to
> bitkeeper;->

James, What about adding a new kernel version to the p4 repository?
something like mpls-kernel-dm with the docs and patches, giving write
access to J.R.L., J.H.S, D.S.M, R.C ?

I am working on the spec doc and other docs. Later I want to start
documenting DaveM with kerneldoc.

Thoughts?

> Edit the doc and send an update ;->

Working on it

R.
From: James R. L. <jl...@mi...> - 2004-02-15 07:28:25
On Fri, Feb 13, 2004 at 05:58:03PM -0500, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 12:12, James R. Leu wrote:
>
> > > From user space this would look like:
> > >
> > > l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4
> >
> > What about adding a new func ptr to the protocol driver. Then we could
> > do protocol dependent stuff like hashing the IPv4|6 header or ethernet
> > header (ethernet over MPLS).
>
> Ok, so you are looking at only IP packets at the edge of an MPLS
> network. Describe a little packet walk. Are you planning to
> not use the ECMP features?

It could be any protocol we map onto an LSP (ie ethernet/atm/fr over MPLS),
you just have to add a protocol driver for it.

The ECMP feature only helps you at the ingress LER. You need something
to handle load balancing in the core of the MPLS domain.

ECMP example:

                  -------            -------
                 |       |          |       |
       .--1G-----| LSR 1 |---100M---| LSR 2 |----1G---.
      /          |       |          |       |          \
 ---------/       -------            -------       \--------
| Ingress |                                        | Egress  |
|   LER   |                                        |  LER    |
 ---------\       -------            -------       /--------
      \          |       |          |       |          /
       `--1G-----| LSR 3 |---100M---| LSR 4 |----1G---'
                 |       |          |       |
                  -------            -------

In the above case ECMP will allow a max traffic of 200M between
ingress and egress.

Load balancing example:

 ---------            -------            -------            --------
|         |          |       |---100M---|       |          |        |
| Ingress |----1G----| LSR 1 |---100M---| LSR 2 |----1G----| Egress |
|   LER   |          |       |---100M---|       |          |  LER   |
 ---------            -------            -------            --------

Without load balancing LDP would create 1 LSP for traffic going
from ingress to egress. The max traffic you could send from ingress
to egress is 100M. With load balancing LDP still sets up 1 LSP from
ingress to egress, but when LSR2 advertises a label to LSR1, LSR1 realizes
it has 3 adj to LSR2 and creates 3 NHLFEs, one on each of the links. It then
uses some mechanism to load balance traffic arriving on its 1 ILM onto
the 3 NHLFEs. In the single label case, by looking at the protocol ID
associated with the ILM and doing a little layer violation ;-) we
can do per flow hashing and map flows to the various NHLFEs. Now the
max traffic between ingress and egress is 300M.

> > The task is trivial if the stack only has one label, for more than one label
> > we would have to be creative. Hashing the label stack, or use the PW ID
> > (suggestion in PWE3 WG which adds a word after the labelstack to indicate
> > what protocol lies below.) The PW ID could be used to lookup the protocol
> > driver to generate the hash.
>
> Point me to some doc if you dont mind. Is this for some of the VPN
> encapsulations?

http://www.ietf.org/internet-drafts/draft-allan-mpls-pid-00.txt

> > Or of course we could just add an options for which algo to use.
>
> Note what i suggested is only for ILM level; And there you could add any
> algorithms you want. With the protocol driver are you suggesting to do
> something at the IPV4/6 FTN level only?

To be able to load balance and guarantee packet order, you need to know
what is underneath the label stack. With just one label it is trivial to
figure out what is under the label stack. With more than one, it isn't
so easy (the LSR that needs to do the load balancing was not involved in the
signaling of any of the labels past the first one). Currently vendors do
some nasty hacking. Look at the first nibble after the label stack, if it
is a 4, they assume IPv4. They build the appropriate hash and use that
to select the outgoing NHLFE.

> > Here are some snippets. I think XFRM may remove the need for these,
> > but for now it works.
> >
> > Setup the dst stacking
> > ----------------------
> >
> > net/mpls/mpls_output.c
> >
> > int
> > mpls_set_nexthop (struct dst_entry *dst, u32 nh_data, struct spec_nh *spec)
> > {
> > 	struct mpls_out_info *moi = NULL;
>
> I take it mpls_out_info is an nhlfe entry?
>
> > 	MPLS_ENTER;
> > 	moi = mpls_get_moi(nh_data);
> > 	if (unlikely(!moi))
> > 		return -1;
> >
> > 	dst->metrics[RTAX_MTU-1] = moi->moi_mtu;
> > 	dst->child = dst_clone(&moi->moi_dst);
> > 	MPLS_DEBUG("moi: %p mtu: %d dst: %p\n", moi, moi->moi_mtu,
> > 		&moi->moi_dst);
> > 	MPLS_EXIT;
> > 	return 0;
> > }
> >
> > mpls_set_nexthop is called from ipv4:rt_set_nexthop and from
> > ipv6:ip6_route_add (I have a 'special nexthop' system developed which
> > would be replaced by XFRM). It is very similar to your RTA_MPLS_FEC,
> > but has 2 pieces of data, a RTA_SPEC_PROTO and RTA_SPEC_DATA. It is
> > intended for multiple protocols to be able to register special nexthop.
> > Right now only MPLS registers :-) Again I have every intention of
> > ripping it out in favor of XFRM.
> >
> > Using the dst stack
> > -------------------
> >
> > net/ipv4/ip_output.c
> >
> > static inline int ip_finish_output2(struct sk_buff *skb)
> > {
> > 	struct dst_entry *dst = skb->dst;
> > 	struct hh_cache *hh = dst->hh;
> > 	struct net_device *dev = dst->dev;
> > 	int hh_len = LL_RESERVED_SPACE(dev);
> >
> > 	if (dst->child) {
> > 		skb->dst = dst_pop(skb->dst);
> > 		return skb->dst->output(skb);
> > 	}
> > 	...
> >
> > Something very similar exists in net/ipv6/ip6_output.c ip6_output_finish()
>
> On the outset this does look a bit cleaner but i would have to ping my
> brain on Daves approach. Take a look at his code.
> Q: Can you stack more than one of those dsts? If yes, then it may be
> even safer to have the nhlfe_route in the dst instead, no?
> i.e how sure can you be that child will be MPLS related; in other case
> it is guaranteed to (it does say dst->xxmplsxx).

Since we use the child's output pointer, IPv4|6 don't care if it is MPLS.
I suppose the same check for child could be made in MPLS output, then yes
you could have more than one child stacked. I'm not sure if this would
be very optimal for creating hierarchical LSPs (I think that is what
you're alluding to).

> There are a few pieces for the current approach that i didnt like ;
> example the net_output_maybe_reroute() thing. Or having to mod dst.c
> to add ifdefs for MPLS. There could be a marriage of the two approaches
> maybe?

After getting the feedback from David, XFRM will have to wait and I think
the dst stacking is cleaner.

> cheers,
> jamal

--
James R. Leu
jl...@mi...
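Editor's sketch: the per-flow load balancing James describes (peek at the first nibble after the label stack, assume IPv4 when it is 4, hash the flow, pick one of the NHLFEs) could look roughly like the following. It is an illustration only: the function name pick_nhlfe is hypothetical, the FNV-1a hash is a stand-in chosen for brevity rather than anything named in the thread, and only the IPv4 source and destination addresses are hashed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Choose one of n_nhlfe outgoing entries for the payload that follows the
 * label stack. Packets of the same IPv4 flow always map to the same index,
 * which preserves per-flow packet order. Anything not recognised as IPv4
 * falls back to index 0. */
unsigned pick_nhlfe(const uint8_t *payload, size_t len, unsigned n_nhlfe)
{
    assert(n_nhlfe > 0);

    /* The "first nibble" guess: version field 4 means IPv4. */
    if (len >= 20 && (payload[0] >> 4) == 4) {
        uint32_t h = 2166136261u;            /* FNV-1a offset basis */
        for (size_t i = 12; i < 20; i++) {   /* bytes 12..19: src + dst */
            h ^= payload[i];
            h *= 16777619u;                  /* FNV-1a prime */
        }
        return h % n_nhlfe;
    }
    return 0;
}
```

In the three-link example above, n_nhlfe would be 3. The weakness James points out remains: a non-IP payload whose first nibble happens to be 4 is hashed as if it were IPv4.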
From: Jamal H. S. <ha...@zn...> - 2004-02-15 02:38:17
On Sat, 2004-02-14 at 17:04, Ramon Casellas wrote:
>
> Although I think that we are indeed on the right path, IMHO, for the
> moment, your implementation is lagging functionality w.r.t. James' one
> (available userspace apps, diffserv mapping, tunnels, procfs, sysfs,etc.

User space app for static management is already there - thats what "l2c
mpls" is. Maybe i didnt make myself clear before, diffserv and more is
there; they are just not locked into mpls; they can be associated but are
independent apps.
Tunnels, whatever James has can be merged.
Procfs or sysfs i care less about and wouldnt lose sleep if they didnt
exist - we use netlink which should cover most of what these things try
to do. If someone wants to add that go ahead.
I apologize i never got around to looking at James code (and weekend is
mostly for a pregnant woman who occasionally looks away and i sneak to
check mail), but theres a lot of stuff that could be merged in (i noticed
PPP, ATM, FR for example).
IIRC correctly from those old days, there exists some form of LDP
implementation. That could be part of the userspace tools or safer
separate.

> although some are not strictly required), although I admit that there are
> still serious issues with James' impl (locking and SMP safeness are the
> most notorious ones), and that it is just a matter of time and work.
>
> Given the fact that DaveM explicitly supports yours, it seems clear to me

As i said before this is NOT my implementation. I tried to document and
sanitize what it does - mostly so we can have a useful discussion. My
piece is user space to kernel. If you look at the code you will see my
name appearing in only about two files or so.
My preference is to use this implementation and to have as little fight
with Dave as possible and only make changes to the base when we see it
appropriate (if theres a 5% improvement, we could think about talking to
him, if theres a > 20% improvement we have more strength);-> I wanna see
MPLS in 2.6 soon and i think this is the fastest way to get there.

> that we should focus on it (James? yours is the last word... I have
> been working on yours for only several months, you have spent the last
> five years), and avoid any other fork. So the question is "what can we do
> now?". When I started working on James implementation, I appreciated being
> almost immediately given write access, so I could do some documenting
> tasks while I was understanding the inner works..., and I was trusted. I
> understand that you may see things differently. What is your position on
> this? Neither James nor I have (for the moment?) access to the patch/CVS
> repository, l2c userspace application.... Somehow I feel hand tied :)

You have access to the patches. What more do you want?

> I
> have spare time and I'm afraid that you may want a centralized approach,
> which may have some inconveniences (although you have all right to).

OK, Set up a CVS repository. I am old fashioned and dont use it very
much. I still refuse to use bitkeeper. The easiest thing for me is
people send me patches and i merge them. Again i dont care if its CVS.
Maybe we can try something more exciting like that competition to
bitkeeper;->

> In other words, if you were to be this project manager ;). How would you define
> the tasks so everyone may contribute to the project, see the others recognize
> his work, etc? Personally, I am interested and I would like to play a nice part
> on this. What is missing?

I think we need to discuss, then someone codes or merges.
For example, we need to settle on the multihop; i think the idea i
suggested is the way to go for ILM - i dont care who codes; i could.
Also we need to settle the dst issue and that may result in coding.
Like i pointed out i think the ATM, PPP and FR features are missing.
Someone adventurous could get some LDP code ported over or document the
API so LDP porters could run over it. Look at the TODO list and maybe add
more to it - and let's start there.

> QUESTIONS
> ########################################################
>
> question: Are there performance studies regarding radix trees w.r.t Hash
> buckets and linked lists? If the number of labels is large, isn't the O(N) walk
> op going to slow things down? How many labels are managed in average? for
> example, if we assume 100000 BGP prefixes and (why not) a label per prefix,
> with hash&walk (1024 hash buckets) it makes 100 entries (average) per bucket vs
> approx log2(N) with binary trees?

I am indifferent and frankly dont care how it is done. If someone needs
to change code like that (which is Daves) just come with some
justification. I will support it if it looks valuable. Like i said hit
that 20% threshold.

> what about other advanced ADT, like Hash buckets and radix trees or similar?
> Thoughts?

For example for something like the ILM, where lookup is based on a 12
bit label, then i would think making it anything more than a hash and
walk is overkill. If you can put 64 hash buckets thats already taking off
6 bits; which means worst case you will walk is 64.
make it 256 buckets and suddenly you are looking at 16 worst case.
So evaluate for each table what needs to be done then make a call.

> question: regarding dst management. Maybe Alexey could enlighten us. It may be
> interesting to know his point of view about adding a specific mpls ptr to the
> generic dst struct, or he may even propose alternate solutions...

I think dst is the way to go; whether we end up using a child or a ptr
is something we need to settle first. James hasnt responded to my last
email. I am also confident Davem knows this space well. Alexey we can use
at the end so he can spit at the code.

> COMMENTS ON USERSPACE APP
> ########################################################
>
> Jamal's proposal:
>
> l2c mpls nhlfe <cmd> dev <devname>
>     index <val> proto <ipv4|ipv6> nh <neighbor>
>     <operation set> fec <FECid>
> operation set := (op <operation>)*
>
> * cmd is one of: <add | del | replace | get>
>
> 20040214-RCAS- We should work on both grammars. I understand that they are a
> work in progress, but they are imprecise and inconvenient. for example the
> "del" operation should not require the user to give the neighbour. OTOH, I
> think we don't need replace (simple remove and add)

You can leave out the other parts on del and it would work. That idea is
consistent with tc structure.
Replace is an atomic del/add. A lot of table management has it. Imagine
many applications trying to manage the same table.

> * index could be used to store the LSPid
> 20040214-RCAS- (I don't understand this. ???)

NHLFE_id aka fecid is local, i.e. not spreadable over LDP for example.
A management application such as a dynamic daemon which has a bigger view
of the world may wish to identify further by LSPid - hence the existence
of "index". If it doesnt make sense we could remove it; right now i see
no harm in it. If you dont specify it, it gets zeroed.

> * FECid is the FEC identifier to be used as the key for searching.
>
> 20040214-RCAS- nhlfe_id :)

yep ;->

> Well, IMHO, I think the grammar can be improved. All the opcodes
> need to be defined, with their arguments (get them from James
> Implementation. They are quite complete and comprehensive)

Do you see anything in the base set other than push and pop? Everything
else is a combination of these. I am trying to remember what James did -
i think he had opcodes like multi-push (in the above case you just
specify as many pushes as you want).
Actually i am guilty of influencing this piece in Daves code.
I was influenced by what i saw from the ASICs i looked at and the two implementations. Lets discuss. > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > SUGGESTION What about this ? > > l2c mpls nhlfe COMMAND <nhlfe_id> > > COMMAND := [add | del | get | SETCOMMAND] > SETCOMMAND := set proto <ipv4|ipv6> nh <neighbour> "OPERATIONSET" > OPERATIONSET := [swap SWAPARGS | pop | dlv | mapexp....],+ > SWAPARGS := labelvalue[:labelspace].. > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Examples > > # l2c mpls nhlfe add <nhlfe_id> > Add an empty entry (default, drop). Error if exists. > > # l2c mpls nhlfe del <nhlfe_id> > Remove the given entry. Ignore if not exists. > > # l2c mpls nhlfe get <nhlfe_id> > Dump entry. Ignore if not exists. > > # l2c mpls nhlfe set proto ipv4 nh 10.0.0.2 "swap 20,push 50" <nhlfe_id> > Is there a reason to make it two separate updates? Is the command too long maybe? > > > ############################################################## > The ip tool should allow you specify route you want then > specify the FECid for that route, i.e: > ip route ... FECid <FECid> > where FECid is the NHLFE keyid we want to use > Example: > ip route add 10.0.0.21/32 via 10.0.0.9 dev eth0 fecid 1 > > 20040214-RCAS all occurrences of FECId should be changed to nhlfe_id. > :) > Edit the doc and send an update ;-> > ############################################################################## > JAMAL: > l2c mpls ilm <cmd> dev <devname> > index <val> label fec <FECid> > > > RCAS : This should be <label> otherwise it looks like a keyword. thats a typo; should be: index <val> label <labelvalue> nhlfe_id <nhval> > * cmd is one of: <add | del | replace | get> > > RCAS: I think we don't need replace. Let the user del and add. Too > many commands are cumbersome. Like i said replace is there for atomicity of the two operations. All database operations typically have the above four commands. 
Look at this as a table that will be manipulated by many users concurently. > RCAS: Let's work on the grammar. The user should only need to give the > incoming label to remove, not the nhlfe_id that it points to. ?? The nhlfe_id must exist before the entry is allowed. Look at the architecture of the tables in the doc. All roads lead to the NHLFE table. > * devname is the input device to be used > RCAS: right, but we need more flexibility. > RCAS: one option would be to use wilcards, e.g. > RCAS: l2c mpls ilm add "ethO:15" > RCAS: l2c mpls ilm add "*:15" > RCAS: but, I do think that the labelspace approach in james impl. > RCAS: is better. Let the user set a labelspace as a netdevice > RCAS: attribute and let the user define ILM entries as > RCAS: labelspace+value. The labelsapce issue is still open. I can see its value in the L2VPN where an additional VPNid comes in with the labelspace. I am really struggling trying to see its value here. James and I had a small discussion we need to revive that. If you look at the code you will see, at the moment the label space is zero always. If there is something clear in the incoming packet that can be used to map to a device, then using labelspace becomes valuable. > > * Index is an additional identifier that could be used to > store LSP info. > > RCAS : What is val? > RCAS: I still don't understand this. Could you please give examples? Same idea as in the NHLFE. If it doesnt prove valuable we could remove it. > * FECid is the FECid to be used for searching the NHLFE. > > RCAS: nhlfe_id is the NHLFE id to use. > yep. > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > SUGGESTION What about this ? > > l2c mpls ilm COMMAND <ls:label> > > COMMAND := [add | del | get | BINDCOMMAND ] > BINDCOMMAND := bind <nhlfe_id> to > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > well, or something like that > I have issues with the labelspace as i described above. 
> > ###################################################################### > > > 3.0 Allowed OPCODEs > > 20040214-RCAS: We need more advanced (DiffServ, etc.) opcodes. We can leverage > James implementation for this. > > > > > 3.1 Modifiying opcodes > > - REDIRECT: redirect a packet to a different LSP > (useful for testing or redirecting to a control plane) > > 20040214-RCAS: this can be useful. Nice. > > > > - MIRROR: send a copy of a packet somewhere else for further > processing (useful for LSP pings, traceroute, debug etc) > > 20040214-RCAS: Idem. > > > 3.2 Label action opcodes > > > 20040214-RCAS: Why do we need to introduce two concepts "Label action opcodes" > and "Atomic operations"? aren't all "Atomic operations" "label actions" and > viceversa??? The way i saw it (or was influenced to think of it is as follows): - there are three basic operations (just like there basic types in C prgramming eg integer, char, short). Then you can build complex compositions from the rest of them (just like you can build data structures in C froim teh atomic data types) > > The atomic operations are:" > > - POP > - PUSH > - REPLACE > > 20040214-RCAS: Whats wrong with the standard name "SWAP"??? sure, swap it is. > > Note: > a stack of consisting atomic operations can be implemented; example: > a pop followed by several pushes. > > 20040214-RCAS: Ein??? Be more specific. > The analogy of atomic data types and structures i described above applies. > > > Well, I have some urgent boring things to do, more comments to follow, And i have someone who is looking for me right now - i took too long to go to the washroom ;-> cheers, jamal |
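The bucket arithmetic traded in this exchange can be checked with a few lines of C. This is only a sketch: `avg_chain` is a hypothetical helper, not something from either implementation, and it assumes labels spread evenly over the buckets. Note that the 64 and 16 figures above follow from a 12-bit key; the label field defined in RFC 3032 is 20 bits wide, so a fully populated label space would give much longer chains.

```c
#include <assert.h>

/*
 * Hypothetical helper: average number of entries per hash bucket,
 * rounded up, assuming `entries` keys are spread evenly over
 * `buckets` buckets. This is also the walk length one would expect
 * in a hash-and-walk table under that assumption.
 */
static unsigned long avg_chain(unsigned long entries, unsigned long buckets)
{
    assert(buckets > 0);
    return (entries + buckets - 1) / buckets;
}
```

With the numbers from the thread, `avg_chain(100000, 1024)` gives 98 (the "about 100 per bucket" case), `avg_chain(1UL << 12, 64)` gives 64 and `avg_chain(1UL << 12, 256)` gives 16. For a full 20-bit label space, `avg_chain(1UL << 20, 256)` gives 4096, which is the sort of figure that would be needed to argue for a different structure.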
From: Ramon C. <cas...@in...> - 2004-02-14 22:08:51
|
Jamal, All, Please, find my comments inline below. They concern userspace app grammar and syntax, as well as discussing opcodes. GENERIC COMMENTS (no flaming intented). ######################################################## Although I think that we are indeed on the right path, IMHO, for the moment, your implementation is lagging functionnality w.r.t. James' one (available userspace apps, diffserv mapping, tunnels, procfs, sysfs,etc. although some are not strictly required), although I admit that there are still serious issues with James' impl (locking and SMP safeness are the most notorious ones), and that it is just a matter of time and work. Given the fact that DaveM explicitely supports yours, it seems clear to me that we should focus on it (James? yours is the last word... I have been working on yours for only several months, you have spent the last five years), and avoid any other fork. So the question is "what can we do now?". When I started working on James implementation, I appreciated being almost immediately given write access, so I could do some documenting tasks while I was understanding the inner works..., and I was trusted. I understand that you may see things differently. What is your position on this? Neither James nor I have (for the moment?) access to the patch/CVS repository, l2c userspace application.... Somehow I feel hand tied :) I have spare time and I'm afraid that you may want a centralized approach, which may have some inconvenients (although you have all right to). In other words, if you were to be this project manager ;). How would you define the tasks so everyone may contribute to the project, see the others recognize his work, etc? Personally, I am interested and I would like to play a nice part on this. What is missing? QUESTIONS ######################################################## question: Are there performance studies regarding radix trees w.r.t Hash buckets and linked lists? 
If the number of labels is large, isn't the O(N) walk op going to slow things down? How many labels are managed in average? for example, if we assume 100000 BGP prefixes and (why not) a label per prefix, with hash&walk (1024 hash buckets) it makes 100 entries (average) per bucket vs approx log2(N) with binary trees? what about other advanced ADT, like Hash buckets and radix trees or similar? Thoughts? question: regarding dst management. Maybe Alexey could enlighten us. It may be interesting to know his point of view about adding a specific mpls ptr to the generic dst struct, or he may even propose alternate solutions... COMMENTS ON USERSPACE APP ######################################################## Jamal's proposal: l2c mpls nhlfe <cmd> dev <devname> index <val> proto <ipv4|ipv6> nh <neighbor> <operation set> fec <FECid> operation set := (op <operation>)* * cmd is one of: <add | del | replace | get> 20040214-RCAS- We should work on both grammars. I understand that they are a work in progress, but they are imprecise and inconvenient. for example the "del" operation should not require the user to give the neighbour. OTOH, I think we don't need replace (simple remove and add) * index could be used to store the LSPid 20040214-RCAS- (I don't understand this. ???) * FECid is the FEC identifier to be used as the key for searching. 20040214-RCAS- nhlfe_id :) Well, IMHO, I think the grammar can be improved. All the opcodes need to be defined, with their arguments (get them from James Implementation. They are quite complete and comprehensive) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SUGGESTION What about this ? l2c mpls nhlfe COMMAND <nhlfe_id> COMMAND := [add | del | get | SETCOMMAND] SETCOMMAND := set proto <ipv4|ipv6> nh <neighbour> "OPERATIONSET" OPERATIONSET := [swap SWAPARGS | pop | dlv | mapexp....],+ SWAPARGS := labelvalue[:labelspace].. 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Examples # l2c mpls nhlfe add <nhlfe_id> Add an empty entry (default, drop). Error if exists. # l2c mpls nhlfe del <nhlfe_id> Remove the given entry. Ignore if not exists. # l2c mpls nhlfe get <nhlfe_id> Dump entry. Ignore if not exists. # l2c mpls nhlfe set proto ipv4 nh 10.0.0.2 "swap 20,push 50" <nhlfe_id> ############################################################## The ip tool should allow you specify route you want then specify the FECid for that route, i.e: ip route ... FECid <FECid> where FECid is the NHLFE keyid we want to use Example: ip route add 10.0.0.21/32 via 10.0.0.9 dev eth0 fecid 1 20040214-RCAS all occurrences of FECId should be changed to nhlfe_id. :) ############################################################################## JAMAL: l2c mpls ilm <cmd> dev <devname> index <val> label fec <FECid> RCAS : This should be <label> otherwise it looks like a keyword. * cmd is one of: <add | del | replace | get> RCAS: I think we don't need replace. Let the user del and add. Too many commands are cumbersome. RCAS: Let's work on the grammar. The user should only need to give the incoming label to remove, not the nhlfe_id that it points to. * devname is the input device to be used RCAS: right, but we need more flexibility. RCAS: one option would be to use wilcards, e.g. RCAS: l2c mpls ilm add "ethO:15" RCAS: l2c mpls ilm add "*:15" RCAS: but, I do think that the labelspace approach in james impl. RCAS: is better. Let the user set a labelspace as a netdevice RCAS: attribute and let the user define ILM entries as RCAS: labelspace+value. * Index is an additional identifier that could be used to store LSP info. RCAS : What is val? RCAS: I still don't understand this. Could you please give examples? * FECid is the FECid to be used for searching the NHLFE. RCAS: nhlfe_id is the NHLFE id to use. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SUGGESTION What about this ? 
l2c mpls ilm COMMAND <ls:label> COMMAND := [add | del | get | BINDCOMMAND ] BINDCOMMAND := bind <nhlfe_id> to ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ well, or something like that ###################################################################### 3.0 Allowed OPCODEs 20040214-RCAS: We need more advanced (DiffServ, etc.) opcodes. We can leverage James implementation for this. 3.1 Modifiying opcodes - REDIRECT: redirect a packet to a different LSP (useful for testing or redirecting to a control plane) 20040214-RCAS: this can be useful. Nice. - MIRROR: send a copy of a packet somewhere else for further processing (useful for LSP pings, traceroute, debug etc) 20040214-RCAS: Idem. 3.2 Label action opcodes 20040214-RCAS: Why do we need to introduce two concepts "Label action opcodes" and "Atomic operations"? aren't all "Atomic operations" "label actions" and viceversa??? The atomic operations are:" - POP - PUSH - REPLACE 20040214-RCAS: Whats wrong with the standard name "SWAP"??? Note: a stack of consisting atomic operations can be implemented; example: a pop followed by several pushes. 20040214-RCAS: Ein??? Be more specific. Well, I have some urgent boring things to do, more comments to follow, Thanks, Ramon // ------------------------------------------------------------------- // Ramon Casellas - GET/ENST/INFRES/RHD/A508 - cas...@in... |
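To make the "atomic operations" point concrete, here is a minimal userspace sketch of an operation set such as "swap 20,push 50" applied to a label stack. All names here (`mpls_op`, `label_stack`, `apply_ops`, `MAX_DEPTH`) are made up for illustration and do not come from either implementation; the only assumption taken from the thread is that the base set is pop, push and swap, with everything else built as a sequence of those.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 8            /* arbitrary limit for the sketch */

enum mpls_op_kind { OP_POP, OP_PUSH, OP_SWAP };

struct mpls_op {
    enum mpls_op_kind kind;
    uint32_t label;            /* ignored for OP_POP */
};

struct label_stack {
    uint32_t label[MAX_DEPTH]; /* label[depth - 1] is the top of stack */
    size_t depth;
};

/* Apply one atomic operation; 0 on success, -1 on under/overflow. */
static int apply_op(struct label_stack *s, const struct mpls_op *op)
{
    switch (op->kind) {
    case OP_POP:
        if (s->depth == 0)
            return -1;
        s->depth--;
        return 0;
    case OP_PUSH:
        if (s->depth == MAX_DEPTH)
            return -1;
        s->label[s->depth++] = op->label;
        return 0;
    case OP_SWAP:              /* replace the top label in place */
        if (s->depth == 0)
            return -1;
        s->label[s->depth - 1] = op->label;
        return 0;
    }
    return -1;
}

/*
 * Apply an operation set in order. Stops at the first failure and
 * does not roll back the operations already applied.
 */
static int apply_ops(struct label_stack *s, const struct mpls_op *ops, size_t n)
{
    assert(s != NULL && (ops != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        if (apply_op(s, &ops[i]) != 0)
            return -1;
    return 0;
}
```

Starting from a stack holding the single label 15, the set {swap 20, push 50} leaves 20 at the bottom and 50 on top. A "pop followed by several pushes" is just a longer array passed to `apply_ops`.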
From: Jamal H. S. <ha...@zn...> - 2004-02-14 00:32:53
|
On Fri, 2004-02-13 at 17:22, Ramon Casellas wrote: > Jamal, > > I am still in the middle of understanding your patch. One of the things > that worries me (most probably due to my lack of understanding) is that it > seems quite intrusive w.r.t other parts of the stack. IMVHO, I often > consider strong coupling not_a_so_good_thing, and I defend duplicating > some parts of code in the sake of clarity and modularity. So some > ideas/questions: There are certain things that you cant avoid. Example you will have ifdefs in the v6 and v4 for FTN support. The less ifdefs the better. I think once you start attaching IPSEC to MPLS, same thing will happen. There are things which are v4 and v6 specific that are totaly abstracted out but dependent on those protocols - example neighbor binding. But this is really clean right now. look at the mpls_prot_driver code. I think this code is as decoupled as you can go but i may have missed your point. > * I appreciated your effort with the design document. I am a paranoid guy > regarding documentation (that's why I wrote down the devel guide on James' > implementation). A design document stating the required changes of core > parts for MPLS support and the reasons would be much welcome, and it would > allow further discussion (you stated in a previous mail, that this time, > as a premiere in Linux, you wanted to do things right :)) . Do you plan to > write something about that? I know it is the most ungrateful part.. I am capable of writing good doc with proper motivation. I dont have it right now but you could do that ;-> If you want you can take over the spec doc. I will try to clarify things when i can. > * In this sense, to truly modularize the MPLS implementation, I think it > would be appropriate to make things in such a way that the user could be > able to select "Core MPLS support" and "Full MPLS Support" (or something > like that) when configuring the kernel. 
Core LSRs would only be able to > forward mpls labelled packets without knowledge of L3 protocols (think of > a BGP/MPLS VPN 'P' router that is used to forward L3 and L2 frames) and > only a minimal set modifications to IPv4/IPv6 would be compiled in (in > other words, the FIB Table need only be extended in the second case). Is > this level of granularity common practice in the Linux kernel? Are you refering to being able to compile out FTN support? I think this is doable; you just need to introduce a config probably one for each of v4 or 6. > * It's just a simple question, take no offense :) but do you consider the > patch you sent quite "feature freeze" and "written in stone" or are you > willing to open development and allow changes *iff* common consensus > justifies it? I think this is an important point for us. > Consensus is key between us at least. caveat: What i would like though is to avoid having to stress Dave when theres no clear win in some change to be made. I would like to make it easy for him to accept things - so lets discuss changes first like the dst changes then have some good reasons before we talk to him. cheers, jamal |
From: Ramon C. <cas...@in...> - 2004-02-13 23:28:23
|
On 13 Feb 2004, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 12:32, James R. Leu wrote:
> > The MPLS tunnel interface fits well into the 'cisco' model of TE LSPs, which
> > represents them as a ptp Tunnel interface with a peer address of the
> > end-point of the LSP. The 'juniper' model represents TE LSPs as just
> > another route in the MPLS 'routing' table (/32 route for the end-point of
> > the TE LSP). I personally prefer the 'cisco' model, it provides more
> > flexibility (anything that can work with a netdevice can use it).
> >
>
> Ok. So i may be getting a better idea. Essentially by being a netdevice
> it gets the advantage of being routable etc.
> Just because CISCO has it is good reason to add it.

Jamal,

Thanks for being open to ideas and thoughts. May I suggest that you set up
(when you find some time) a CVS repository so it is easier for us to sync
to the latest tree? Not right now of course.

Regards,
r. |
From: Jamal H. S. <ha...@zn...> - 2004-02-13 23:17:34
|
On Fri, 2004-02-13 at 12:32, James R. Leu wrote:
> The MPLS tunnel interface fits well into the 'cisco' model of TE LSPs, which
> represents them as a ptp Tunnel interface with a peer address of the
> end-point of the LSP. The 'juniper' model represents TE LSPs as just
> another route in the MPLS 'routing' table (/32 route for the end-point of
> the TE LSP). I personally prefer the 'cisco' model, it provides more
> flexibility (anything that can work with a netdevice can use it).
>

Ok. So i may be getting a better idea. Essentially by being a netdevice
it gets the advantage of being routable etc.
Just because CISCO has it is good reason to add it.
We should also support the Juniper approach. We are Linux after all ;->

One piece i said earlier was missing that may enable this is the
tc-action code[1]. With this i can, at the pre-IP level, do something
along the lines of:

tc filter add dev eth0 parent ffff: protocol ip prio 1 \
u32 match ip src 10.0.0.21/32 flowid 1:15 \
action set nhlfe_id 10 \
action mpls_tunnel \
action mirred egress redirect dev eth2

and then use the skb->nhlfe_id in the mpls_tunnel before redirecting the
packet out eth2. Of course i could let routing take care of redirecting
to dev eth2.

cheers,
jamal

[1] This code is going in; just lazy to scrub it at this point
http://www.cyberus.ca/~hadi/patches/action/README |
From: Jamal H. S. <ha...@zn...> - 2004-02-13 23:00:11
|
On Fri, 2004-02-13 at 12:12, James R. Leu wrote:

> > From user space this would look like:
> >
> > l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4
>
> What about adding a new func ptr to the protocol driver. Then we could
> do protocol dependent stuff like hashing the IPv4|6 header or ethernet
> header (ethernet over MPLS).

Ok, so you are looking at only IP packets at the edge of an MPLS
network. Describe a little packet walk. Are you planning to not use the
ECMP features?

> The task is trivial if the stack only has one label, for more than one label
> we would have to be creative. Hashing the label stack, or use the PW ID
> (suggestion in PWE3 WG which adds a word after the labelstack to indicate
> what protocol lies below.) The PW ID could be used to lookup the protocol
> driver to generate the hash.

Point me to some doc if you dont mind. Is this for some of the VPN
encapsulations?

> Or of course we could just add an option for which algo to use.

Note what i suggested is only for ILM level; And there you could add any
algorithms you want. With the protocol driver are you suggesting to do
something at the IPV4/6 FTN level only?

> Here are some snippets. I think XFRM may remove the need for these,
> but for now it works.
> Setup the dst stacking
> ----------------------
>
> net/mpls/mpls_output.c
>
> int
> mpls_set_nexthop (struct dst_entry *dst, u32 nh_data, struct spec_nh *spec)
> {
>     struct mpls_out_info *moi = NULL;

I take it mpls_out_info is an nhlfe entry?

>     MPLS_ENTER;
>     moi = mpls_get_moi(nh_data);
>     if (unlikely(!moi))
>         return -1;
>
>     dst->metrics[RTAX_MTU-1] = moi->moi_mtu;
>     dst->child = dst_clone(&moi->moi_dst);
>     MPLS_DEBUG("moi: %p mtu: %d dst: %p\n", moi, moi->moi_mtu,
>         &moi->moi_dst);
>     MPLS_EXIT;
>     return 0;
> }
>
> mpls_set_nexthop is called from ipv4:rt_set_nexthop and from
> ipv6:ip6_route_add (I have a 'special nexthop' system developed which
> would be replaced by XFRM). It is very similar to your RTA_MPLS_FEC,
> but has 2 pieces of data: a RTA_SPEC_PROTO and RTA_SPEC_DATA. It is
> intended for multiple protocols to be able to register special nexthop.
> Right now only MPLS registers :-) Again I have every intention of
> ripping it out in favor of XFRM.
>
> Using the dst stack
> -------------------
>
> net/ipv4/ip_output.c
>
> static inline int ip_finish_output2(struct sk_buff *skb)
> {
>     struct dst_entry *dst = skb->dst;
>     struct hh_cache *hh = dst->hh;
>     struct net_device *dev = dst->dev;
>     int hh_len = LL_RESERVED_SPACE(dev);
>
>     if (dst->child) {
>         skb->dst = dst_pop(skb->dst);
>         return skb->dst->output(skb);
>     }
>     ...
>
> Something very similar exists in net/ipv6/ip6_output.c ip6_output_finish()
>

At the outset this does look a bit cleaner but i would have to ping my
brain on Daves approach. Take a look at his code.
Q: Can you stack more than one of those dsts? If yes, then it may be
even safer to have the nhlfe_route in the dst instead, no? i.e how sure
can you be that child will be MPLS related; in the other case it is
guaranteed to be (it does say dst->xxmplsxx).
There are a few pieces of the current approach that i didnt like;
example the net_output_maybe_reroute() thing. Or having to mod dst.c to
add ifdefs for MPLS. There could be a marriage of the two approaches
maybe?

cheers,
jamal |
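On the question of whether more than one dst can be stacked, the following userspace sketch shows the general shape of the idea: each entry carries a child pointer and an output function, and the IP output path hands the packet to the child when one exists. It is a toy model under stated assumptions, not kernel code. `dst_sketch`, `pkt`, `xmit`, `mpls_push_output` and `ip_finish_output_sketch` are all hypothetical names, and whether the real dst code behaves this way for more than one level is exactly what the thread leaves open.

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet: counts pushed labels and records transmission. */
struct pkt {
    unsigned labels;
    int sent;
};

/* Toy stand-in for a dst entry: a child link plus an output hook. */
struct dst_sketch {
    struct dst_sketch *child;
    int (*output)(struct pkt *p, struct dst_sketch *d);
};

/* Bottom of the stack: hand the packet to the device. */
static int xmit(struct pkt *p, struct dst_sketch *d)
{
    (void)d;
    p->sent = 1;
    return 0;
}

/* MPLS-style output: push one label, then continue down the stack. */
static int mpls_push_output(struct pkt *p, struct dst_sketch *d)
{
    p->labels++;
    if (d->child)
        return d->child->output(p, d->child);
    return xmit(p, d);
}

/*
 * IP-side hook, modelled on the quoted ip_finish_output2() change:
 * if the route has a child, pop to it and call its output; otherwise
 * use the route's own output.
 */
static int ip_finish_output_sketch(struct pkt *p, struct dst_sketch *route)
{
    assert(p != NULL && route != NULL);
    if (route->child)
        return route->child->output(p, route->child);
    return route->output(p, route);
}
```

In this model, a route whose child is an MPLS entry, which in turn has another MPLS entry as its child, ends up with two labels pushed before transmission. Nothing in the structure itself says the child is MPLS related, which is the concern raised above about using a generic child pointer rather than an MPLS-specific field.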
From: Ramon C. <cas...@in...> - 2004-02-13 22:53:41
|
On 13 Feb 2004, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 12:10, Ramon Casellas wrote:
>
> What do you mean with "by request" - is it created by policy or packet
> arrival?

By policy, typically from userspace. This may clarify it a little:

http://perso.enst.fr/~casellas/mpls-linux/ch02s04.html
http://perso.enst.fr/~casellas/mpls-linux/ch02s07.html
http://perso.enst.fr/~casellas/mpls-linux/ch10.html

> I think i may be able to visualize this, if i am right - what is
> happening is a packet gets redirected to this device which then
> does some MPLS work on it before sending out some device with proper
> encapsulation? Is this typically an IP packet?

Yes and Yes. not bad :). And you can use it to stack.

> Nobody has pointed a URL to me yet of where the code is.

sorry about that
http://sourceforge.net/project/showfiles.php?group_id=15443

Best regards,
R. |
From: Jamal H. S. <ha...@zn...> - 2004-02-13 22:39:57
|
On Fri, 2004-02-13 at 12:10, Ramon Casellas wrote:
> >
> > What is the mpls_tunnel.c for? Is it a netdevice? What is it used for?
>
> Yes. It is a virtual netdevice that is allocated upon request and
> basically holds a MOI (the equivalent of a nhlfe_id). User sees it as a
> unidirectional netdevice (ifconfig, etc),

What do you mean with "by request" - is it created by policy or packet
arrival?
I think i may be able to visualize this, if i am right - what is
happening is a packet gets redirected to this device which then
does some MPLS work on it before sending out some device with proper
encapsulation? Is this typically an IP packet?

> Take a look at the file if you happen to find some spare time (Indeed, it
> can be improved and the sysfs integration was a little hairy) but I think
> it is very convenient and extensively used when RSVP-TE sets up LSPs.

Nobody has pointed a URL to me yet of where the code is.

cheers,
jamal |
From: Ramon C. <cas...@in...> - 2004-02-13 22:24:40
|
Jamal, I am still in the middle of understanding your patch. One of the things that worries me (most probably due to my lack of understanding) is that it seems quite intrusive w.r.t other parts of the stack. IMVHO, I often consider strong coupling not_a_so_good_thing, and I defend duplicating some parts of code in the sake of clarity and modularity. So some ideas/questions: * I appreciated your effort with the design document. I am a paranoid guy regarding documentation (that's why I wrote down the devel guide on James' implementation). A design document stating the required changes of core parts for MPLS support and the reasons would be much welcome, and it would allow further discussion (you stated in a previous mail, that this time, as a premiere in Linux, you wanted to do things right :)) . Do you plan to write something about that? I know it is the most ungrateful part.. * In this sense, to truly modularize the MPLS implementation, I think it would be appropriate to make things in such a way that the user could be able to select "Core MPLS support" and "Full MPLS Support" (or something like that) when configuring the kernel. Core LSRs would only be able to forward mpls labelled packets without knowledge of L3 protocols (think of a BGP/MPLS VPN 'P' router that is used to forward L3 and L2 frames) and only a minimal set modifications to IPv4/IPv6 would be compiled in (in other words, the FIB Table need only be extended in the second case). Is this level of granularity common practice in the Linux kernel? * It's just a simple question, take no offense :) but do you consider the patch you sent quite "feature freeze" and "written in stone" or are you willing to open development and allow changes *iff* common consensus justifies it? I think this is an important point for us. Thoughts? R. |
From: James R. L. <jl...@mi...> - 2004-02-13 17:34:18
|
The MPLS tunnel interface fits well into the 'cisco' model of TE LSPs, which
represents them as a ptp Tunnel interface with a peer address of the
end-point of the LSP. The 'juniper' model represents TE LSPs as just
another route in the MPLS 'routing' table (/32 route for the end-point of
the TE LSP). I personally prefer the 'cisco' model, it provides more
flexibility (anything that can work with a netdevice can use it).

On Fri, Feb 13, 2004 at 06:10:13PM +0100, Ramon Casellas wrote:
> On 13 Feb 2004, Jamal Hadi Salim wrote:
>
> > you are sure you dont want nh somewhere in there?
> > since this is a reference to the NHlfe;
> > heck why dont we just call it nhlfe_id ? ;->
> >
> nhlfe_id is fine for me.
>
> > Refer to my earlier email for the suggestions i made.
>
> Agreed.
>
> > > comment: I *do* think that mpls_tunnel.c from James impl can directly be
> > > used and it's very convenient. Just %s/moi/fwd_id/g
> >
> > What is the mpls_tunnel.c for? Is it a netdevice? What is it used for?
> >
> Yes. It is a virtual netdevice that is allocated upon request and
> basically holds a MOI (the equivalent of a nhlfe_id). User sees it as a
> unidirectional netdevice (ifconfig, etc),
>
> Take a look at the file if you happen to find some spare time (Indeed, it
> can be improved and the sysfs integration was a little hairy) but I think
> it is very convenient and extensively used when RSVP-TE sets up LSPs.
>
> regards,
> Ramon

--
James R. Leu
jl...@mi... |
From: James R. L. <jl...@mi...> - 2004-02-13 17:21:13
|
Thanks for the feedback. We'll leave you alone now :-)

On Fri, Feb 13, 2004 at 09:07:53AM -0800, David S. Miller wrote:
> On Fri, 13 Feb 2004 08:46:53 -0600
> "James R. Leu" <jl...@mi...> wrote:
>
> > I just wanted to get David's take on using XFRM for the Layer 3 to MPLS
> > mapping which would utilize dst stacking?
> >
> > Is XFRM capable of doing this, any pointers as to where to start?
>
> XFRM wants to work with protocol stacking at the protocol level (ie. things
> within ipv4, or ipv6).
>
> We could tweak it to do this, but I advise against this initially because
> this way we can stick the MPLS stack more simply into 2.4.x if we wanted
> to (and I certainly might want to do that).
>
> After we're done, and did a 2.4.x backport if desired, we can look into
> using XFRM. But I don't advise this now.

--
James R. Leu
jl...@mi... |
From: James R. L. <jl...@mi...> - 2004-02-13 17:14:39
|
On Fri, Feb 13, 2004 at 11:20:06AM -0500, Jamal Hadi Salim wrote:
> On Fri, 2004-02-13 at 09:39, James R. Leu wrote:
>
> > > Given the above info, suggest a new name. Maybe NHid?
> >
> > Much better than FECid ;-) (although it is just a name ...)
>
> True, but has to map to the semantics;
> Ok NHid for now until something with a better ring shows up.
>
> > > Ok, that is useful. I have not tested multipath but it should
> > > work with Linux routing ECMP at least.
> > > I wouldnt call it NHLFE indices rather these identifiers so far
> > > called fecid;
> > > Also i would think most of these lists would contain a single entry.
> > > BTW, the ILM is not multihop ready. We should be able to add easily.
> > > Also there is no control on how the multihop selection is done with
> > > the linux routing table - whatever Linux ECMP does goes.
> > > We should be able to fix the ILM with an algorithm selector.
> >
> > After looking at the code I would agree that whatever linux multipath does
> > at the ingress LER, this code will follow. The real question is how to
> > go about supporting multipath as an LSR? (one ILM needs to load balance over
> > multiple NHLFE). Or dare I suggest p-mp LSPs?
>
> Ive actually done some background compute on this in my head at least.
> Here are my thoughts on paper or electrons:
> ILM table entry (struct ltable in the code) should have a new structure,
> call it nh_choice, which has the following entries:
>
> function selector();
> struct nh_info nh_list;
>
> nh_list would look like:
> struct gnet_stats stats; /* stats */
> u32 lt_fecid; /* change that to ilm_nhid */
>
> Note the above two entries currently reside in struct ltable.
>
> A packet coming in will have the usual lookup; the entries
> nh_choice->selector() will be invoked. It will return the
> nhid.
> The idea behind the selector() is we can attach different algorithms
> via policy and make them take care of things like paths being down etc.
> I can think of two simple algorithms right away: random selection and
> RR. The idea is to open these algorithms to innovation.
>
> From user space this would look like:
>
> l2c mpls ilm add dev eth0 label 22 nhalg roundrobin nhid 2 nhid 3 nhid 4

What about adding a new func ptr to the protocol driver? Then we could do protocol dependent stuff like hashing the IPv4|6 header or ethernet header (ethernet over MPLS). The task is trivial if the stack only has one label; for more than one label we would have to be creative: hashing the label stack, or using the PW ID (a suggestion in the PWE3 WG which adds a word after the label stack to indicate what protocol lies below). The PW ID could be used to look up the protocol driver to generate the hash. Or of course we could just add an option for which algo to use.

> etc.
>
> Thoughts?
>
> > I know you mentioned it is "not an index" but to me it seems like it really
> > _is_ an index for the NHLFE. Can multiple NHids correspond to the same NHLFE?
> > If it is a 1 to 1 mapping for all intents and purposes it is an index :-)
>
> Ok;->
> how about NHkey ? maybe a prefix of mpls_ would also be good.
>
> > > dsts are still managed from the MPLS code. There is some generic stuff
> > > (create, destroy, gc etc) for which there is no point in recreating in
> > > the MPLS code
> > > The way it is right now works fine. What could probably have been a
> > > better approach is to stack dsts. It would require some surgery and i am
> > > not sure i have the patience for it. Maybe we can ask Dave on his
> > > thoughts on this.
> >
> > Currently we use dst stacking. The 'child' dst is actually a static member
> > of the 'out going label info' (NHLFE). So when the skb reaches the exit
> > of IPv4|6 a check for the child is done. The skb->dst is replaced with the
> > child dst and the child output function is called (which sends it into
> > MPLS land). The entrance to MPLS land uses the "container_of" macro to get the
> > NHLFE used to forward the packet. How the stacked dst is created is
> > similar to your scheme. I was wondering if XFRM is a better scheme to use
> > for all of this?
>
> sorry i meant XFRM.
> I am indifferent whether we change it to your scheme or leave it as is.
> I will have to look at your code to make better judgement. My thinking
> would be the end goal should be NOT to touch the IPV4/6 code with ifdefs
> unless necessary. If theres not a huge difference in terms of efficiency
> or code beautification i would rather stick to the current code.
> BTW if you point me to the latest code i will print it and read offline
> over the weekend if possible.

Here are some snippets. I think XFRM may remove the need for these, but for now it works.

Setup the dst stacking
----------------------
net/mpls/mpls_output.c

    int mpls_set_nexthop (struct dst_entry *dst, u32 nh_data, struct spec_nh *spec)
    {
        struct mpls_out_info *moi = NULL;

        MPLS_ENTER;
        moi = mpls_get_moi(nh_data);
        if (unlikely(!moi))
            return -1;

        dst->metrics[RTAX_MTU-1] = moi->moi_mtu;
        dst->child = dst_clone(&moi->moi_dst);
        MPLS_DEBUG("moi: %p mtu: %d dst: %p\n", moi, moi->moi_mtu, &moi->moi_dst);
        MPLS_EXIT;
        return 0;
    }

mpls_set_nexthop is called from ipv4:rt_set_nexthop and from ipv6:ip6_route_add (I have a 'special nexthop' system developed which would be replaced by XFRM). It is very similar to your RTA_MPLS_FEC, but has 2 pieces of data: a RTA_SPEC_PROTO and a RTA_SPEC_DATA. It is intended for multiple protocols to be able to register special nexthops. Right now only MPLS registers :-) Again I have every intention of ripping it out in favor of XFRM.

Using the dst stack
-------------------
net/ipv4/ip_output.c

    static inline int ip_finish_output2(struct sk_buff *skb)
    {
        struct dst_entry *dst = skb->dst;
        struct hh_cache *hh = dst->hh;
        struct net_device *dev = dst->dev;
        int hh_len = LL_RESERVED_SPACE(dev);

        if (dst->child) {
            skb->dst = dst_pop(skb->dst);
            return skb->dst->output(skb);
        }
        ...

Something very similar exists in net/ipv6/ip6_output.c ip6_output_finish()

> I may be a bit slow responding now since i am at work.
>
> cheers,
> jamal

-- James R. Leu jl...@mi... |
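A rough userspace sketch of the dst-stacking flow described in the message above may help when reading the kernel snippets. It assumes heavily simplified stand-ins for `struct dst_entry`, `struct sk_buff` and `struct mpls_out_info` (they are not the real kernel layouts), and the `label` field, the `mpls_output` name and the return values are hypothetical, chosen only so the flow can be observed. The real `dst_pop()` also deals with reference counting, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff;

/* Simplified stand-in for the kernel's dst_entry: only the stacking bits. */
struct dst_entry {
    struct dst_entry *child;             /* stacked dst, or NULL if none */
    int (*output)(struct sk_buff *skb);  /* output hook for this layer */
};

/* Simplified stand-in for sk_buff. */
struct sk_buff {
    struct dst_entry *dst;
    unsigned int label;  /* hypothetical: records which label was pushed */
};

/* Same idea as the kernel macro: recover the struct that embeds 'member'. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified "outgoing label info" (NHLFE) with the child dst embedded. */
struct mpls_out_info {
    unsigned int moi_label;
    struct dst_entry moi_dst;
};

/* Entrance to "MPLS land": find the NHLFE from the dst that is now on the skb. */
static int mpls_output(struct sk_buff *skb)
{
    struct mpls_out_info *moi;

    assert(skb->dst != NULL);
    moi = container_of(skb->dst, struct mpls_out_info, moi_dst);
    skb->label = moi->moi_label;
    return 0;  /* assumption: 0 means "handed to MPLS" */
}

/* Exit of the IP layer: if a child dst is stacked, switch to it and call it. */
static int ip_finish_output(struct sk_buff *skb)
{
    struct dst_entry *dst = skb->dst;

    if (dst->child) {
        skb->dst = dst->child;  /* the kernel uses dst_pop() here */
        return skb->dst->output(skb);
    }
    return 1;  /* assumption: 1 means "plain IP path, no MPLS" */
}
```

Because the child dst is a member of the NHLFE itself, no extra pointer from the dst back to the MPLS state is needed; that is the point of the `container_of` step.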
From: Ramon C. <cas...@in...> - 2004-02-13 17:12:13
|
On 13 Feb 2004, Jamal Hadi Salim wrote:

> you are sure you dont want nh somewhere in there?
> since this is a reference to the NHlfe;
> heck why dont we just call it nhlfe_id ? ;->

nhlfe_id is fine for me.

> Refer to my earlier email for the suggestions i made.

Agreed.

> > comment: I *do* think that mpls_tunnel.c from James impl can directly be
> > used and it's very convenient. Just %s/moi/fwd_id/g
>
> What is the mpls_tunnel.c for? Is it a netdevice? What is it used for?

Yes. It is a virtual netdevice that is allocated upon request and basically holds a MOI (the equivalent of a nhlfe_id). User sees it as a unidirectional netdevice (ifconfig, etc),

Take a look at the file if you happen to find some spare time (Indeed, it can be improved and the sysfs integration was a little hairy) but I think it is very convenient and extensively used when RSVP-TE sets up LSPs.

regards,
Ramon |
From: David S. M. <da...@re...> - 2004-02-13 17:10:00
|
On Fri, 13 Feb 2004 08:46:53 -0600
"James R. Leu" <jl...@mi...> wrote:

> I just wanted to get David's take on using XFRM for the Layer 3 to MPLS
> mapping which would utilize dst stacking?
>
> Is XFRM capable of doing this, any pointers as to where to start?

XFRM wants to work with protocol stacking at the protocol level (ie. things within ipv4, or ipv6).

We could tweak it to do this, but I advise against this initially because this way we can stick the MPLS stack more simply into 2.4.x if we wanted to (and I certainly might want to do that).

After we're done, and did a 2.4.x backport if desired, we can look into using XFRM. But I don't advise this now. |
From: Jamal H. S. <ha...@zn...> - 2004-02-13 16:50:24
|
On Fri, 2004-02-13 at 10:17, Ramon Casellas wrote:
> On Fri, 13 Feb 2004, Jamal Hadi Salim wrote:
>
> disclaimer:
>
> from now on, all mails will be sent to mpl...@li... (
> that is what I was going to do, but then i received your email about the
> mailing list failing)
>

Following on your statement - removed Dave. I like ccing original sender in case mailing list goes down ..

>
> > Essentially it is an identifier of a NHLFE entry.
> > So you are right naming it a FEC identifier may not be the best.
>
> So the relationship is:
>
> 1 FEC F to N available objects

^ is that F a typo?

> 1 Incoming Label/Labelspace to N available objects

Essentially yes if the F is a typo.

> > I dont wanna call the so-far-called fecid lspid but it is close.
>
> I see what you mean. Let us check what the RFC says:
> The "Incoming Label Map" (ILM) maps each incoming label to a set of
> NHLFEs. (...)
> [..]
> (RCAS: N.B. you don't need equal cost paths)...

We can do it, so lets just add it.

> > Given the above info, suggest a new name. Maybe NHid?
>
> I would say something like a "fwd_id" from "Forward Id", or "out_id" it
> should not prelude one or other Next Hop.

you are sure you dont want nh somewhere in there? since this is a reference to the NHlfe; heck why dont we just call it nhlfe_id ? ;->

> > Also i would think most of these lists would contain a single entry.
>
> Yes, unicast MPLS with no load sharing enabled. However, you may need them
> when doing multicast and/or load sharing.
>

You mentioning multicast has given me some interesting thoughts. Essentially multicast would be just another algorithm in the thought that i previously posted (response to James).

>
> > BTW, the ILM is not multihop ready. We should be able to add easily.
> Do you really need it to? Just hold a set of fwd_ids. The policy to select
> one should be configurable (discipline) & pluggable. A common impl. Is a
> hash table.

Almost like you read my mind. Refer to my earlier email for the suggestions i made.

> The RFC also defines the interaction with routing in this case. (although
> vaguely)

any routing/IP details in my opinion are NHLFE related. Example: a neighbor needs to have an IP address.

>
> > The ability to select this value by policy allows us to be able to
> > select the NHLFE entries from other subsystems; eg a u32 classifier
> > on ingress could select all IP addresses from 10.1.1.1/24 to have a
> > fecid of 10. The skb->fecid is then set to 10. When the packet gets to
>
> I see, but as long as it is not called fec_id, it's fine :) call it
> fwd_id.

check my earlier view above. Toss a coin and pick something and lets stick with it.

>
> >
> > dsts are still managed from the MPLS code. There is some generic stuff
> > (create, destroy, gc etc) for which there is no point in recreating in
> > the MPLS code
>
> I am not sure that you need to. This is what was done in James' impl.
> mpls_dst. The only thing you need is a means to allocate mpls_dsts and
> hang the reference into the skb's dst. The advantage is that you don't add
> another member to dst (I still don't like adding a mpls ptr to a generic
> dst, but I assume you are far more knowledgeable than I am), but of
> course, you still have to modify the skb dst. (e.g. release it and hold a
> new reference).
>

Ok i will need to look at the code.

>
> > The way it is right now works fine. What could probably have been a
> > better approach is to stack dsts. It would require some surgery and i am
> > not sure i have the patience for it.
>
> Well, I thought we agreed on doing it the right way :) I am not stating
> which one it is though.

Absolutely, but that also means not sticking unnecessary ifdefs in 20 files just so that you can support some funky xfrm approach.

> In mpls_unicast_forward
>
> lt = (struct ltable *)skb->dst;
> skb->dst = &lt->u.dst;
>
> would not it be possible here to allocate a mpls_dst with a new dst_ops
> with the right size?

Yes, this is the dirtiest scene in the usage of the skb->dst in that code. It is not too too bad as far as obscenity level is concerned, and if there are better ways to do this, lets move on to those approaches. The big challenge would be the other issues associated with it such as hh, neighbors etc.

> comment: I *do* think that mpls_tunnel.c from James impl can directly be
> used and it's very convenient. Just %s/moi/fwd_id/g

What is the mpls_tunnel.c for? Is it a netdevice? What is it used for?

cheers,
jamal |
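The pluggable next-hop selection discussed in this thread (an ILM entry holding a list of NHLFE ids plus a selector function, with round robin and hashing as candidate algorithms) could be sketched roughly as below. This is a userspace illustration under assumptions, not code from either implementation: the names `nh_choice`, `nh_info` and `selector` follow the proposal, while `NH_MAX`, `rr_next`, `packets`, `flow_hash`, `nh_select_roundrobin`, `nh_select_hash` and `ilm_pick_nhid` are hypothetical. How `flow_hash` would be computed (IP header, label stack, PW ID) is left open, as it is in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NH_MAX 8  /* hypothetical fixed bound, to keep the sketch simple */

/* One candidate next hop of an ILM entry. */
struct nh_info {
    uint32_t nhid;     /* identifier of the NHLFE */
    uint64_t packets;  /* stand-in for per-next-hop stats */
};

/* Per-ILM-entry selection state: the list plus a pluggable selector. */
struct nh_choice {
    uint32_t (*selector)(struct nh_choice *nc, uint32_t flow_hash);
    struct nh_info nh_list[NH_MAX];
    size_t nh_count;   /* number of valid entries in nh_list */
    size_t rr_next;    /* round-robin cursor; unused by other selectors */
};

/* Round robin: walk the list in order, ignoring the flow hash. */
static uint32_t nh_select_roundrobin(struct nh_choice *nc, uint32_t flow_hash)
{
    struct nh_info *nh;

    (void)flow_hash;
    assert(nc->nh_count > 0);
    nh = &nc->nh_list[nc->rr_next];
    nc->rr_next = (nc->rr_next + 1) % nc->nh_count;
    nh->packets++;
    return nh->nhid;
}

/* Hash based: packets with the same flow hash always get the same nhid. */
static uint32_t nh_select_hash(struct nh_choice *nc, uint32_t flow_hash)
{
    struct nh_info *nh;

    assert(nc->nh_count > 0);
    nh = &nc->nh_list[flow_hash % nc->nh_count];
    nh->packets++;
    return nh->nhid;
}

/* What the ILM lookup would do once the incoming label has matched. */
static uint32_t ilm_pick_nhid(struct nh_choice *nc, uint32_t flow_hash)
{
    return nc->selector(nc, flow_hash);
}
```

An entry configured like the `nhalg roundrobin nhid 2 nhid 3 nhid 4` example would cycle through 2, 3, 4, 2, and so on. Swapping the `selector` pointer changes the policy without touching the lookup path, and a selector that skips next hops known to be down would be one more function of the same shape.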