Re: [mpls-linux-general] Better to use nfmark vs tc_index?
From: Olivier D. <Oli...@rd...> - 2001-11-30 16:14:06
Hi Jim,

James R. Leu wrote:
> Comments within ...
>
> On Thu, Nov 29, 2001 at 05:32:17PM +0100, Olivier Dugeon wrote:
>
>> Hi Jim,
>>
>> James R. Leu wrote:
>>
>>> After looking at iptables a bit more I see that it can set an nfmark via
>>> the MARK rule. Should I use this as opposed to tc_index to influence the
>>> LSP and EXP? (note that DSCP will still be an option)
>>
>> Look at our patch. We have posted a full one and a small one. The small
>> one uses nfmark and stays very close to the stock kernel. The only reason
>> we developed a similar approach with mpls_index is the ability to use
>> nfmark-based routing and mpls_index iptables classification at the same
>> time:
>>
>> Our small patch, which uses nfmark as the MPLS key, always overrides the
>> nfmark route selection. With iptables you can mark a packet, i.e. set the
>> nfmark field in the skbuff, and then use that nfmark to refine routing.
>> In the ip_route_input() routine (net/ipv4/route.c), only the destination
>> address, source address, interface number (input or output) and TOS field
>> are used to compute the route hash table key. Look at the
>> CONFIG_IP_ROUTE_FWMARK flag (lines 1686 to 1688) and you can see that
>> nfmark can be used to extend this key calculation. We have mimicked this
>> for the mpls_index mark. So with the full patch you can use mpls_index
>> and/or nfmark to extend the route hash table key computation.
>
> So you're saying that by using nfmark for MPLS we'd be overloading nfmark
> and wouldn't be able to do specialized route lookups (with nfmark) and
> have nfmark choose the LSP. I guess I understand that. So we need
> something like MARK (MPLS) that stores the value on the skb (mpls_index).
> I might agree with that, but the details of what it stores in the skb are
> still unclear to me. (as in, I need to think about it more)

mpls_index in the original patch (up to v0.3) stored the RADIX_TREE index.
From v0.4 on we store the label instead. This is more user friendly: there
is no longer any need to retrieve the RADIX_TREE key from
/proc/net/mpls_xxx. From the label we recompute the RADIX_TREE key in the
rt_set_next_hop routine (net/ipv4/route.c), so we are able to retrieve the
moi (MPLS output info) from the RADIX_TREE and set up the ops_data field of
the route key. This is executed only once per flow. After rt_set_next_hop,
the ip_route_input_slow routine (net/ipv4/route.c) finishes computing the
route hash key. The moi is stored in that structure, and subsequent packets
are processed directly. The mpls_index is used like the nfmark to set up
distinct route hash keys and so distinguish differently labelled packets
coming from, or going to, the same IP address.

To convince Steven: the expensive MPLS work is done only once per flow.
Activate the debug output and look at the trace; you can see that the MPLS
part of rt_set_next_hop is called only once per flow. I ran some tests. A
ping without MPLS between two nodes takes around 80 microseconds. With
MPLS + iptables + TC, the first ping packet takes around 150-180
microseconds and the subsequent ones around 120-130 microseconds.

As you noted in your previous mail, I don't want to bypass the IPv4 stack.
I think that would be a bad thing, and we can't do it anyway because in
MPLS the box is first of all a router. It must process the packet as a
normal router does; only at the end do we decide to label the packet. Both
your FIB approach and my iptables approach respect this.
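To make the key extension concrete, here is a minimal sketch (not the
actual patch code) of how the 2.4 route cache key can carry the extra
marks. The fwmark field under CONFIG_IP_ROUTE_FWMARK is what the stock
kernel does; the mpls_index field and the CONFIG_IP_ROUTE_MPLS flag are
illustrative stand-ins for what the patch adds:

    /* Sketch of the rt_key used for route cache lookups in
     * net/ipv4/route.c (2.4 kernels); two flows that differ only in
     * fwmark or mpls_index get distinct cache entries, so each can be
     * bound to a different LSP. */
    struct rt_key {
            __u32   dst;            /* destination address */
            __u32   src;            /* source address */
            int     iif;            /* input interface */
            int     oif;            /* output interface */
            __u8    tos;            /* type of service */
    #ifdef CONFIG_IP_ROUTE_FWMARK
            __u32   fwmark;         /* nfmark copied from the skbuff */
    #endif
    #ifdef CONFIG_IP_ROUTE_MPLS     /* hypothetical flag name */
            __u32   mpls_index;     /* label set by the iptables rule */
    #endif
    };

    /* Cache hit test in the spirit of ip_route_input(): the extra
     * fields simply join the comparison (and the hash computation). */
    static inline int rt_key_match(const struct rt_key *a,
                                   const struct rt_key *b)
    {
            return  a->dst == b->dst && a->src == b->src &&
                    a->iif == b->iif && a->tos == b->tos
    #ifdef CONFIG_IP_ROUTE_FWMARK
                    && a->fwmark == b->fwmark
    #endif
    #ifdef CONFIG_IP_ROUTE_MPLS
                    && a->mpls_index == b->mpls_index
    #endif
                    ;
    }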
>>> I think I now understand what Olivier did, he created something similar
>>> to MARK but for MPLS. If we are going to continue to use that I would
>>> like to change it a little. Instead of storing the mpls_index, I think
>>> it should build a dst and store it with the rule. This dst will direct
>>> the skb to mpls_output() and will have the outgoing label info attached.
>>> When it gets to mpls_output() MPLS processing will occur like normal.
>>> The dst will be slapped on to any packet that matches the rule.
>>>
>>> Do you think that by using nfmark we can accomplish the same thing?
>>> We would have to rely upon another means of getting data to
>>> mpls_output(), like an MPLS tunnel interface or an entry in the FIB
>>> that has been marked for MPLS. Once it gets to mpls_output() the nfmark
>>> could be used to influence the LSP or EXP.
>>>
>>> Of course maybe it's just safer to have both options available :-)
>>>
>>> Now to the matter of tc_index. It seems that nfmark can be used by a
>>> scheduling classifier, but it looks like the classifier for tc_index is
>>> better. So it might be that nfmark (or an MPLS mark) is used to
>>> influence LSP and EXP decisions (note that DSCP will still be an
>>> option) and that tc_index is used to influence scheduling.
>>
>> Actually for the TC part, we use mpls_index. My latest patch (not
>> published yet) uses mpls_index when it is enabled in the kernel config,
>> and the label directly otherwise. We can copy this index into tc_index;
>> the TC mpls classifier has been written for this purpose. Our original
>> plan was to use the u32 classifier, but there are two problems:
>>
>> 1/ the label is not accessible to the u32 classifier, which starts at
>> the IP header, not the shim header.
>>
>> 2/ why classify the packet again (CPU power ...) if it has already been
>> classified by iptables or another process? The shim header (more
>> precisely the label and/or EXP fields of the shim header) can be used
>> as a filter mark.
>
> Would using the tc_index sched classifier solve #1? As for number 2 you
> need both. iptables simply marks it, no scheduling is actually done. The
> sched classifier is merely trying to sort the marked packets into the
> appropriate "queues".
>
> Jim

--
FTR&D/DAC/CPN Technopole Anticipa | mailto:Oli...@fr...
2, Avenue Pierre Marzin           | Phone: +(33) 2 96 05 28 80
F-22307 LANNION                   | Fax:   +(33) 2 96 05 18 52
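The reuse Olivier describes in point 2/ (and that Jim's closing question is
about) can be pictured with a minimal sketch in the spirit of the tcindex
classifier. It assumes the mpls_index or EXP bits were already copied into
skb->tc_index on the output path; the function name mpls_mark_classify and
the 1:x class numbering are illustrative, not the actual classifier from
the patch:

    /* Sketch of a 2.4-style classify routine: no header parsing at
     * all, just a lookup on the mark that iptables/MPLS already left
     * in skb->tc_index.  (u32 could not see the shim header anyway,
     * since it starts matching at the IP header.) */
    static int mpls_mark_classify(struct sk_buff *skb,
                                  struct tcf_proto *tp,
                                  struct tcf_result *res)
    {
            __u16 exp = skb->tc_index & 0x7;   /* e.g. the 3 EXP bits */

            res->class   = 0;
            res->classid = TC_H_MAKE(0x10000, exp + 1); /* class 1:1..1:8 */
            return 0;                          /* match */
    }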