From: <JD...@de...> - 2001-06-15 10:50:19
We analyzed Dbench scalability behaviour on the ext2 file system with kernel 2.4.0. Dbench (ftp://samba.org/pub/tridge/dbench/) is an emulation of the Netbench benchmark. It produces only the filesystem load: it issues the same I/O calls that the smbd server in Samba would produce when confronted with a Netbench run, but makes no networking calls.

The system used was an 8-way 700 MHz Intel Xeon with 8x1 MB L2 cache and a 37 GB IBM Ultrastar 36LZX disk on an Adaptec AIC-7896 Ultra2 SCSI controller. Main memory was limited to 1680 MB via the "mem=" kernel boot parameter. The throughput values show that the test does not exceed the buffer cache (theoretical maximum disk-buffer transfer speed: 43 MB/s).

We tested dbench varying the number of clients from 1 to 30 and the number of CPUs from uniprocessor to 8 CPUs. Each dbench run was repeated 11 times and the first run was discarded as warmup.

For maximum throughput we found (U = uniprocessor kernel, 1-8 = SMP kernel):

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 U  |   101.16       1.00
 1  |    94.46       0.93
 2  |   144.93       1.43
 4  |   195.01       1.93
 8  |   197.17       1.95

Looking at an excerpt of the data for the SMP kernel with 8 CPUs:

#clients              1      2      4      6      8     10     20     30
Throughput [MB/s] 92.26 143.00 188.29 197.17 173.88 179.68 173.18 175.34

The throughput reaches its maximum with 6 clients and decreases with 8 clients. For the other CPU counts the throughput reaches its maximum when the number of clients equals the number of CPUs and does not decrease that sharply between two adjacent measurement points.
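For reference, the measurement loop described above can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes "dbench <clients>" is on the PATH and prints a summary line of the form "Throughput <value> MB/sec", so the parsing may need adjusting.

#!/usr/bin/env python
# Sketch of the measurement loop: for each client count run dbench 11 times,
# drop the first run as warmup and report the mean of the remaining runs.
import re
import subprocess

CLIENTS = [1, 2, 4, 6, 8, 10, 20, 30]
RUNS = 11  # total runs per client count, the first one is warmup

def run_dbench(nclients):
    """Run dbench once and return the reported throughput in MB/s."""
    out = subprocess.run(["dbench", str(nclients)],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Throughput\s+([\d.]+)\s+MB/sec", out)
    return float(match.group(1))

for n in CLIENTS:
    samples = [run_dbench(n) for _ in range(RUNS)][1:]  # discard warmup run
    mean = sum(samples) / len(samples)
    print("%2d clients: %.2f MB/s (mean of %d runs)" % (n, mean, len(samples)))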
Running the kernel profiler in pc mode [ticks] (5 loops with 30 clients), we found:

                         |             CPU              | increase from 2 CPUs
entry name               |    1      2      4      8    | to 8 CPUs by factor
-------------------------+------------------------------+---------------------
USER                     |  8,507  8,605  8,834   9,002 |  1.1
__generic_copy_from_user |  5,572  5,804  9,111  14,787 |  2.6
file_read_actor          |  2,724  3,040  4,218   5,905 |  1.9
default_idle             |  1,651  1,245  2,572   8,281 |  6.7
stext_lock               |      0    574  5,212  27,993 | 48.8
misc                     |  7,501 11,140 16,930  30,028 |  4.0
-------------------------+------------------------------+---------------------
total number of ticks    | 25,955 30,408 46,877  95,996 |  3.2

The share of the stext_lock entry indicates that the kernel spends more and more time spinning for spinlocks. Calculating the percentage of these entries from the total number of ticks:

                         |               CPU                |
entry name               |      1       2       4       8  |
-------------------------+----------------------------------+
USER                     |  32.78%  28.30%  18.85%   9.38%  |
__generic_copy_from_user |  21.47%  19.09%  19.44%  15.40%  |
file_read_actor          |  10.50%  10.00%   9.00%   6.15%  |
default_idle             |   6.36%   4.09%   5.49%   8.63%  |
stext_lock               |   0.00%   1.89%  11.12%  29.16%  |
misc                     |  28.90%  36.64%  36.12%  31.28%  |
-------------------------+----------------------------------+
total number of ticks    | 100.00% 100.00% 100.00% 100.00%  |

On 8 CPUs the stext_lock entry consumes 29% of the total CPU power without contributing to the workload.

Running lockmeter (5 loops with 30 clients), we found:

1. CPU utilization [%] spent spinning (looping to get a spin lock)

                 |       #CPU        | increase from 2 CPUs
lock name        |   2     4      8  | to 8 CPUs by factor
-----------------+-------------------+---------------------
kmap_lock        | 0.64  2.90  18.60 | 29.1
pagecache_lock   | 0.35  1.90   8.90 | 25.4
lru_list_lock    | 0.28  4.80   6.80 | 24.3
dcache_lock      | 0.13  0.37   0.64 |  4.9
pagemap_lru_lock | 0.15  0.46   0.90 |  6.0
kernel_flag      | 0.73  2.20   6.40 |  8.8
misc             | 0.02  0.27   0.26 | 13.0
-----------------+-------------------+---------------------
total            | 2.30 12.90  42.50 | 18.5

2. average time [us] the lock is held

                 |      #CPU        | increase from 2 CPUs
lock name        |   2     4     8  | to 8 CPUs by factor
-----------------+------------------+---------------------
kmap_lock        | 0.60  1.00  2.90 | 4.8
pagecache_lock   | 0.90  1.50  3.10 | 3.4
lru_list_lock    | 0.80  1.80  2.80 | 3.5
dcache_lock      | 0.40  0.50  0.60 | 1.5
pagemap_lru_lock | 0.90  1.40  2.60 | 2.9
kernel_flag      | 1.90  3.20  6.00 | 3.2

The measurements show that as the number of CPUs and dbench clients increases, the additional CPU power is mostly invested in lock handling: with 8 CPUs, more than 3 of them (42.5% of total CPU time) are spinning for six locks. Curiously, the lock hold times themselves increase by up to a factor of 4.8, which further adds to the spin times.

For comparison we ran the same workload on Linux/390 (on an IBM 9672-XZ7 G6). For maximum throughput we found:

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 1  |    70.64       1.00
 2  |   134.62       1.91
 4  |   249.19       3.53
 8  |   422.10       5.98

Juergen Doelle
jd...@de...
IBM Lab Boeblingen - Linux Architecture & Performance