From: <JD...@de...> - 2001-06-15 10:50:19
We analyzed Dbench scalability behaviour on the ext2 file system with kernel 2.4.0. Dbench (ftp://samba.org/pub/tridge/dbench/) is an emulation of the Netbench benchmark. It produces only the filesystem load: it issues the same I/O calls that the smbd server in Samba would produce when confronted with a Netbench run, but makes no networking calls.

The system used was an 8-way 700 MHz Intel Xeon with 8x1 MB L2 cache and a 37 GB IBM Ultrastar 36LZX disk on an Adaptec AIC-7896 Ultra2 SCSI controller. Main memory was limited to 1680 MB via the "mem=" kernel boot parameter. The throughput values show that the test does not exceed the buffer cache (theoretical maximum disk-buffer transfer speed: 43 MB/s).

We tested dbench varying the number of clients from 1 to 30 and the number of CPUs from uniprocessor to 8 CPUs. Each dbench run was repeated 11 times and the first run was discarded as warmup.

For maximum throughput we found (U = uniprocessor kernel, 1-8 = SMP kernel):

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 U  |   101.16       1.00
 1  |    94.46       0.93
 2  |   144.93       1.43
 4  |   195.01       1.93
 8  |   197.17       1.95

Looking at an excerpt of the data for the SMP kernel with 8 CPUs:

#clients              1      2      4      6      8     10     20     30
Throughput [MB/s] 92.26 143.00 188.29 197.17 173.88 179.68 173.18 175.34

The throughput reaches its maximum with 6 clients and decreases with 8 clients. For the other CPU counts the throughput reaches its maximum when the number of clients equals the number of CPUs and does not decrease that sharply between two adjacent measurement points.
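For reference, the measurement loop described above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes "dbench <clients>" is on the PATH and prints a summary line of the form "Throughput <value> MB/sec", so the parsing may need adjusting.

#!/usr/bin/env python
# Sketch of the measurement loop: for each client count run dbench 11 times,
# drop the first run as warmup and report the mean of the remaining runs.
import re
import subprocess

CLIENTS = [1, 2, 4, 6, 8, 10, 20, 30]
RUNS = 11  # total runs per client count, the first one is warmup

def run_dbench(nclients):
    """Run dbench once and return the reported throughput in MB/s."""
    out = subprocess.run(["dbench", str(nclients)],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Throughput\s+([\d.]+)\s+MB/sec", out)
    return float(match.group(1))

for n in CLIENTS:
    samples = [run_dbench(n) for _ in range(RUNS)][1:]  # discard warmup run
    mean = sum(samples) / len(samples)
    print("%2d clients: %.2f MB/s (mean of %d runs)" % (n, mean, len(samples)))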
Running the kernel profiler in pc mode [ticks] (5 loops with 30 clients), we found:

                         |             CPU              | increase from 2 CPUs
entry name               |    1      2      4      8    | to 8 CPUs by factor
-------------------------+------------------------------+---------------------
USER                     |  8,507  8,605  8,834   9,002 |  1.1
__generic_copy_from_user |  5,572  5,804  9,111  14,787 |  2.6
file_read_actor          |  2,724  3,040  4,218   5,905 |  1.9
default_idle             |  1,651  1,245  2,572   8,281 |  6.7
stext_lock               |      0    574  5,212  27,993 | 48.8
misc                     |  7,501 11,140 16,930  30,028 |  4.0
-------------------------+------------------------------+---------------------
total number of ticks    | 25,955 30,408 46,877  95,996 |  3.2

The share of the stext_lock entry indicates that the kernel spends more and more time spinning for spinlocks. Calculating the percentage of these entries from the total number of ticks:

                         |               CPU                |
entry name               |      1       2       4       8  |
-------------------------+----------------------------------+
USER                     |  32.78%  28.30%  18.85%   9.38%  |
__generic_copy_from_user |  21.47%  19.09%  19.44%  15.40%  |
file_read_actor          |  10.50%  10.00%   9.00%   6.15%  |
default_idle             |   6.36%   4.09%   5.49%   8.63%  |
stext_lock               |   0.00%   1.89%  11.12%  29.16%  |
misc                     |  28.90%  36.64%  36.12%  31.28%  |
-------------------------+----------------------------------+
total number of ticks    | 100.00% 100.00% 100.00% 100.00%  |

On 8 CPUs the stext_lock entry consumes 29% of the total CPU power without contributing to the workload.

Running lockmeter (5 loops with 30 clients), we found:

1. CPU utilization [%] spent spinning (looping to get a spin lock)

                 |       #CPU        | increase from 2 CPUs
lock name        |   2     4      8  | to 8 CPUs by factor
-----------------+-------------------+---------------------
kmap_lock        | 0.64  2.90  18.60 | 29.1
pagecache_lock   | 0.35  1.90   8.90 | 25.4
lru_list_lock    | 0.28  4.80   6.80 | 24.3
dcache_lock      | 0.13  0.37   0.64 |  4.9
pagemap_lru_lock | 0.15  0.46   0.90 |  6.0
kernel_flag      | 0.73  2.20   6.40 |  8.8
misc             | 0.02  0.27   0.26 | 13.0
-----------------+-------------------+---------------------
total            | 2.30 12.90  42.50 | 18.5

2. average time [us] the lock is held

                 |      #CPU        | increase from 2 CPUs
lock name        |   2     4     8  | to 8 CPUs by factor
-----------------+------------------+---------------------
kmap_lock        | 0.60  1.00  2.90 | 4.8
pagecache_lock   | 0.90  1.50  3.10 | 3.4
lru_list_lock    | 0.80  1.80  2.80 | 3.5
dcache_lock      | 0.40  0.50  0.60 | 1.5
pagemap_lru_lock | 0.90  1.40  2.60 | 2.9
kernel_flag      | 1.90  3.20  6.00 | 3.2

The measurements show that as the number of CPUs and dbench clients increases, the additional CPU power is mostly invested in lock handling: with 8 CPUs, more than 3 of them (42.5% of total CPU time) are spinning for six locks. Curiously, the lock hold times themselves increase by up to a factor of 4.8, which further adds to the spin times.

For comparison we ran the same workload on Linux/390 (on an IBM 9672-XZ7 G6). For maximum throughput we found:

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 1  |    70.64       1.00
 2  |   134.62       1.91
 4  |   249.19       3.53
 8  |   422.10       5.98

Juergen Doelle
jd...@de...
IBM Lab Boeblingen - Linux Architecture & Performance