From: "Shailabh Nagar" <nagar@...>
> The higher CPU numbers look interesting - any broad pointers on
> major bottleneck for performance degradation beyond 8 cpus ?
> Is it cacheblock contention, scheduler problems or something else ?
I'm in the middle of analyzing that now.
Earlier testing, when reflex v1.1 was using CLONE_FILES, showed that
when the w=<high> values caused context-switching by write() to a pipe,
there is significant cacheblock contention in fget() doing the
read_lock() of the files_struct rwlock. Removing CLONE_FILES makes that
contention disappear, since each process then has its own files_struct.
My first look using profiling shows moderately large peaks in
schedule() and sys_sched_yield(), as you'd expect. I need to determine
whether there are any particular hot spots within those routines, and
also in nr_running(). I believe the nr_running() cycles are largely due
to cacheblock contention -- each peek at another cpu's rq_data.nr_running
count causes a cacheblock writeback by the owning CPU, which has dirtied
the cacheblock when it acquired/released rq_data.runqueue_lock.
Similarly, sys_sched_yield() peeks at every other cpu's
aligned_data.schedule_data.curr, which will often produce a cacheblock
writeback of a dirty block.
My sys_sched_yield() peak is probably also worsened by a flawed mips64
compiler, which doesn't obey the __cacheline_aligned directive in:
aligned_data_t aligned_data [NR_CPUS] __cacheline_aligned =
As it turns out, in my mips64 kernel the first aligned_data element
shares an L2 cacheblock with tasklist_lock. The other aligned_data
elements, however, are in their own L2 cacheblocks, so the penalty
probably isn't significant. I will hack aligned_data into L2
alignment and rerun some tests to measure the effect.