From: John H. <ha...@en...> - 2001-03-14 17:00:24
I ran some "modified AIM7" tests on a 1-to-32-CPU mips64 Origin2000 NUMA
system using three kernels:

  1) baseline 2.4.2
  2) 2.4.2 + "pagecache_lock patch"
  3) 2.4.2 + "pagecache_lock patch" + multiqueue scheduler

"Modified AIM7" means the AIM7 "workfile.shared" workload, minus the
three sync_* subtests. The intention is to produce an AIM7 workload that
is compute-bound on my test platform, even when the various filesystem
subtests exercise a single directory. All AIM7 test runs were performed
with 70 tasks (threads).

Lockmeter showed that the baseline 2.4.2 kernel is severely bottlenecked
above 4p by contention on the global pagecache_lock: at 32p, about half
the available CPU cycles were consumed by waits on this one spinlock.

Ananth (an...@en...) extracted a patch (I'll call it the "pagecache_lock
patch") from a larger Tux2 patch. The "pagecache_lock patch" replaces
the global lock with a finer-grained locking strategy (a sketch of the
general idea appears at the end of this message).

I attach an image of the scaling graph of the results for the three
kernels. The x-axis is the number of CPUs, and the y-axis is the
benchmark throughput relative to the 1-CPU performance. The graph
includes a reference line, labelled "perfect scaling", denoting the
ideal of Nx throughput on N CPUs (e.g., 16x throughput at 16p, 32x at
32p).

Observe that the baseline 2.4.2 kernel gets 3.4x at 4p but only 4.6x at
8p, and throughput improves only gradually from there to 6.1x at 32p.
As I said, scaling is greatly constrained by the global pagecache_lock.
I have not included results for a 2.4.2 + multiqueue kernel, as those
numbers are essentially the same as for the baseline 2.4.2 kernel.

The "pagecache_lock patch" greatly improves scaling, peaking at 17x-18x
at about 28p. Interestingly, adding the multiqueue scheduler drops peak
performance by about 5%.

Further lockmeter and profiling data indicate that the next bottlenecks
are the kernel_flag (waits consume 22% of CPU cycles, with
ext2_get_block's waits alone accounting for 12%) and the
pagemap_lru_lock (waits consume 10% of CPU cycles). The routine
nr_free_pages() is also a heavy consumer of CPU cycles on this
big-memory machine.

John Hawkes
ha...@en...
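
P.S. The patch itself isn't reproduced here, so as a rough illustration
only: one common way to break a global page-cache lock into
finer-grained locks is to hash each (mapping, index) pair onto one of a
fixed array of spinlocks, so that lookups against unrelated files
rarely contend for the same lock. The sketch below is my own minimal
example of that general idea, not the actual patch; all names in it are
hypothetical.

  #include <linux/fs.h>
  #include <linux/spinlock.h>

  #define NR_PAGECACHE_LOCKS 64   /* must be a power of two */

  static spinlock_t pagecache_locks[NR_PAGECACHE_LOCKS];

  /* Hash an (address_space, index) pair onto one bucket lock. */
  static inline spinlock_t *pagecache_lock_for(struct address_space *mapping,
                                               unsigned long index)
  {
          unsigned long h = (unsigned long)mapping / sizeof(*mapping) + index;
          return &pagecache_locks[h & (NR_PAGECACHE_LOCKS - 1)];
  }

  /* Run once at boot, before any page-cache access. */
  static void init_pagecache_locks(void)
  {
          int i;
          for (i = 0; i < NR_PAGECACHE_LOCKS; i++)
                  spin_lock_init(&pagecache_locks[i]);
  }

  /* Callers then replace spin_lock(&pagecache_lock) with, e.g.:
   *
   *     spin_lock(pagecache_lock_for(mapping, index));
   *     ...
   *     spin_unlock(pagecache_lock_for(mapping, index));
   */

With 64 buckets instead of one global lock, two CPUs only serialize when
their hashes collide, which is what relieves the contention lockmeter
measured at 32p.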