Thread: [Lse-tech] [OLS] RCU overhead measurements

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Here is a more detailed description of the rcu overhead measurement
data presented at OLS 2002. We used kernel profile ticks to roughly 
compare the overheads of different implementations of RCU in order 
to see which features in those implementations help and which don't.

For our experiments, we used the rt_rcu-2.5.3-1.patch (lockfree route cache)
along with 3 different RCU infrastructure implementations. We used
a test suggested by Dave Miller that sends a large number of
packets to random destination addresses changing the address after
every 5 packets. The test was run on an 8-cpu PIII Xeon system
with 6GB RAM. It was run under two environments -

1. Increased neighbor table garbage collection threshold thereby
   reducing the number of updates
2. Default GC threshold that results in very large number of
   garbage collection and route cache updates due to random
   destination addresses.

The key implementation features of the 3 RCU infrastructure patches are -

1. rcu_ltimer - Per-CPU queue of RCUs, periodic (10ms) checking for
                CPU going through a quiescent state.
2. rcu_sched  - Per-CPU queue of RCUs, all checking done from scheduler 
                context, global atomic counter of RCUs.
3. rcu_poll   - Global queue of RCUs, single polling tasklet, force
                reschedule on CPUs for completion of RCU grace period.

The patches can be found at http://lse.sf.net/locking/ols2002/rcu/patches.
A description of these as well as some other experimental patches
can be found in http://lse.sf.net/locking/ols2002/rcu/patches/rcu_impl.html.

The raw kernprof outputs are at 
http://lse.sf.net/locking/ols2002/rcu/results/rttest/rttest_davem/.

The numbers represent the profile ticks in the corresponding routines.

1. rt_rcu with neighbor table garbage collection threshold increased to
   prevent frequent overflow (due to random dest addresses).

function        	base    rcu_ltimer      rcu_sched       rcu_poll
--------        	----    ----------      ---------       --------
ip_route_output_key     4486    2026		2162		2135
call_rcu                	11		68		71
rcu_process_callbacks   	4		   		128
rcu_invoke_callbacks            4		   		24
rcu_batch_done		           		1		   
schedule		423	453		459		500
rcu_prepare_polling	   	   		   		13
rcu_polling		   	   		   		134

2. rt_rcu with frequent neighbor table overflow (due to random dest addresses)

function        	base    rcu_ltimer      rcu_sched       rcu_poll
--------        	----    ----------      ---------       --------
ip_route_output_key     2358    1646		1619		1641
call_rcu                	262		1147		2251
rcu_process_callbacks   	49		   		2778
rcu_invoke_callbacks            57		   		488
rcu_check_quiescent_state	27
rcu_check_callbacks		24
rcu_reg_batch			3
rcu_batch_done		           		3		   
schedule		294	   		849		808
rcu_prepare_polling	   	   		   		21
rcu_polling		   	   		   		2756
rcu_completion							4
force_cpu_reschedule						553
__tasklet_hi_schedule						1557

Based on these measurements, we can draw the following conclusions -

1. RCU, implemented the right way, can be beneficial even when
updates are more than rare. Atleast one patch (rcu_ltimer) shows
a net improvement (gain in ip_route_output_key - rcu overhead) under
heavy update load.

2. When updates are rare most implementations can work well and show
real benefits.

3. Global queues are out, it is important to have per-cpu queues as
seen from overheads in rcu_poll code.

4. Global atomic RCU counter in rcu_sched probably hurts as seen
in call_rcu() overhead. So we should try to avoid such global
counters.

Comments/suggestions ?

Thanks
-- 
Dipankar Sarma  <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

Thread: [Lse-tech] [OLS] RCU overhead measurements

lse-tech