From: John H. <ha...@en...> - 2001-03-14 17:00:24
I ran some "modified AIM7" tests on a 1-to-32-CPU mips64 Origin2000 NUMA
system using three kernels:

  1) baseline 2.4.2
  2) 2.4.2 + "pagecache_lock patch"
  3) 2.4.2 + "pagecache_lock patch" + multiqueue scheduler

"Modified AIM7" means the AIM7 "workfile.shared" workload, minus the
three sync_* subtests. The intention is to produce an AIM7 workload that
is compute-bound on my test platform, even when the various filesystem
subtests exercise a single directory. All AIM7 test runs were performed
with 70 tasks (threads).

Lockmeter showed that the baseline 2.4.2 kernel is severely bottlenecked
above 4p by contention on the global pagecache_lock: at 32p, about half
the available CPU cycles were consumed by waits on this one spinlock.

Ananth (an...@en...) extracted a patch (I'll call it the "pagecache_lock
patch") from a larger Tux2 patch. The "pagecache_lock patch" replaces
the global lock with a finer-grained locking strategy (a sketch of the
general idea appears at the end of this message).

I attach an image of the scaling graph of the results for the three
kernels. The x-axis is the number of CPUs, and the y-axis is the
benchmark throughput relative to the 1-CPU performance. The graph
includes a reference line, labelled "perfect scaling", denoting the
ideal of Nx throughput on N CPUs (e.g., 16x throughput at 16p, 32x at
32p).

Observe that the baseline 2.4.2 kernel gets 3.4x at 4p but only 4.6x at
8p, and throughput improves only gradually from there to 6.1x at 32p.
As I said, scaling is greatly constrained by the global pagecache_lock.
I have not included results for a 2.4.2 + multiqueue kernel, as those
numbers are essentially the same as for the baseline 2.4.2 kernel.

The "pagecache_lock patch" greatly improves scaling, peaking at 17x-18x
at about 28p. Interestingly, adding the multiqueue scheduler drops peak
performance by about 5%.

Further lockmeter and profiling data indicate that the next bottlenecks
are the kernel_flag (waits consume 22% of CPU cycles, with
ext2_get_block's waits alone accounting for 12%) and the
pagemap_lru_lock (waits consume 10% of CPU cycles). The routine
nr_free_pages() is also a heavy consumer of CPU cycles on this
big-memory machine.

John Hawkes
ha...@en...
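
P.S. The patch itself isn't reproduced here, so as a rough illustration
only: one common way to break a global page-cache lock into
finer-grained locks is to hash each (mapping, index) pair onto one of a
fixed array of spinlocks, so that lookups against unrelated files
rarely contend for the same lock. The sketch below is my own minimal
example of that general idea, not the actual patch; all names in it are
hypothetical.

  #include <linux/fs.h>
  #include <linux/spinlock.h>

  #define NR_PAGECACHE_LOCKS 64   /* must be a power of two */

  static spinlock_t pagecache_locks[NR_PAGECACHE_LOCKS];

  /* Hash an (address_space, index) pair onto one bucket lock. */
  static inline spinlock_t *pagecache_lock_for(struct address_space *mapping,
                                               unsigned long index)
  {
          unsigned long h = (unsigned long)mapping / sizeof(*mapping) + index;
          return &pagecache_locks[h & (NR_PAGECACHE_LOCKS - 1)];
  }

  /* Run once at boot, before any page-cache access. */
  static void init_pagecache_locks(void)
  {
          int i;
          for (i = 0; i < NR_PAGECACHE_LOCKS; i++)
                  spin_lock_init(&pagecache_locks[i]);
  }

  /* Callers then replace spin_lock(&pagecache_lock) with, e.g.:
   *
   *     spin_lock(pagecache_lock_for(mapping, index));
   *     ...
   *     spin_unlock(pagecache_lock_for(mapping, index));
   */

With 64 buckets instead of one global lock, two CPUs only serialize when
their hashes collide, which is what relieves the contention lockmeter
measured at 32p.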