[Valgrind-users] Need help in interpreting cachegrind's output

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi everybody,

I'm new in profiling and got stuck in analyzing cachegrind's output.
Any help and hints for further profiling strategies are appreciated!
It is not so important for me to get the problem at hand optimized, but
to get some insights into profiling.

Background:
I tried to implement a pool allocator for fixed size obects in C++.

Since I know, that the objects allocated last will most likely get
freed first, I attempted to implement some shrink-functionality: if
some threshold of free slots are available at the end, shrink the
pool.

However, I failed miserably: My allocator is even slower than
malloc/free.

In my first attempt, I implemented some bitset as an array of ints to
keep track of the occupied slots.

I have two functions: find_first_unset_bit() (needed for allocation of
blocks) and find_unset_tail() (needed for the shrink-feature).

A gprof-run told me, that these two are hotspots and that the second
one takes about 30% more time than the first one (for the most optimal
allocation/deallocation pattern).
I thought that it would be good to examine the reason for this as
a starting point.

A line-by-line gprof-analysis showed up to be pretty useless:
disassembling my objects via `objdump -ld ...' told me that part of
the instructions in my `-O3'-optimized binary are assigned to the
"wrong" source lines.

So I went on to try cachegrind out. Unfortunately, the results look
pretty much the same for both functions (see below) and I'm stuck now.

To give you some introduction on how the two functions are implemented
(if you like, I could also paste the original C++ code, but I think
this is quite pointless):
`find_first_unset_bit()' and `find_unset_tail()'
make use of some member-variables: `_i_word_first_unset' and
`_i_word_unset_tail' respecively. These are indices into the array of
integers and updated by the bit setting and clearing functions. For
the first, it is guaranteed that all words before this index are
completely set (0xffffffff), for the second, it is guaranteed, that
all words from that index on are zero.
Integers are searched from these indices either forward or backwards for values
!= 0xffffffff, != 0 resp.
Then a bsf/bsr insn is utilized to get the first nonzero bit/the last
nonzero bit.

Here is cachegrind's output for the two (with accumulated counts added
for your convenience):

--8<---------------cut here---------------start------------->8---
        Ir I1mr ILmr        Dr D1mr DLmr      Dw D1mw DLmw      Bc    Bcm Bi Bim
       .    .    .         .    .    .       .    .    .       .      .  .   .  _ZNK5xmltr6memory15input_page_impl14dynamic_bitset20find_first_unset_bitEv:
        .    .    .         .    .    .       .    .    .       .      .  .   .  .LFB866:
        .    .    .         .    .    .       .    .    .       .      .  .   .         .cfi_startproc
  640,000    1    1   640,000    0    0       0    0    0       0      0  0   0         movq    8(%rdi), %rcx
  640,000    0    0   640,000    0    0       0    0    0       0      0  0   0         movq    16(%rdi), %rax
  640,000    0    0   640,000    0    0       0    0    0       0      0  0   0         movq    32(%rdi), %rdx
  640,000    0    0         0    0    0       0    0    0       0      0  0   0         subq    %rcx, %rax
  640,000    0    0         0    0    0       0    0    0       0      0  0   0         sarq    $2, %rax
  640,000    0    0         0    0    0       0    0    0       0      0  0   0         cmpq    %rdx, %rax
  640,000    0    0         0    0    0       0    0    0 640,000      6  0   0         jne     .L24
      100    0    0         0    0    0       0    0    0       0      0  0   0         jmp     .L21
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 4,,10
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 3
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L27:
   19,900    0    0         0    0    0       0    0    0       0      0  0   0         addq    $1, %rdx
   19,900    0    0         0    0    0       0    0    0       0      0  0   0         cmpq    %rdx, %rax
   19,900    0    0         0    0    0       0    0    0  19,900      0  0   0         je      .L21
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L24:
  659,800    0    0   659,800    0    0       0    0    0       0      0  0   0         movl    (%rcx,%rdx,4), %esi
  659,800    0    0         0    0    0       0    0    0       0      0  0   0         xorl    $-1, %esi
  659,800    0    0         0    0    0       0    0    0 659,800 19,900  0   0         je      .L27
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         bsfl    %esi, %esi
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         movl    $-1, %eax
  639,900    0    0         0    0    0 639,900    0    0       0      0  0   0         movq    %rdx, 32(%rdi)
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         cmovne  %esi, %eax
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         salq    $5, %rdx
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         movl    %eax, %eax
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         addq    %rdx, %rax
  639,900    0    0   639,900    0    0       0    0    0       0      0  0   0         ret
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 4,,10
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 3
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L21:
      100    1    1         0    0    0     100    0    0       0      0  0   0         movq    %rdx, 32(%rdi)
      100    0    0         0    0    0       0    0    0       0      0  0   0         xorl    %eax, %eax
      100    0    0         0    0    0       0    0    0       0      0  0   0         salq    $5, %rdx
      100    0    0         0    0    0       0    0    0       0      0  0   0         addq    %rdx, %rax
      100    0    0       100    0    0       0    0    0       0      0  0   0         ret
        .    .    .         .    .    .       .    .    .       .      .  .   .         .cfi_endproc
-------------------------------------------------------------------------------
 11638900    2    2   3219800    0    0  640000    0    0 1319700  19906  0   0

        Ir I1mr ILmr        Dr D1mr DLmr      Dw D1mw DLmw      Bc    Bcm Bi Bim
       .    .    .         .    .    .       .    .    .       .      .  .   .  _ZNK5xmltr6memory15input_page_impl14dynamic_bitset15find_unset_tailEv:
        .    .    .         .    .    .       .    .    .       .      .  .   .  .LFB867:
        .    .    .         .    .    .       .    .    .       .      .  .   .         .cfi_startproc
  640,000    0    0   640,000    0    0       0    0    0       0      0  0   0         movq    40(%rdi), %rdx
  640,000    0    0         0    0    0       0    0    0       0      0  0   0         testq   %rdx, %rdx
  640,000    0    0         0    0    0       0    0    0 640,000      2  0   0         je      .L35
  640,000    0    0   640,000    0    0       0    0    0       0      0  0   0         movq    8(%rdi), %rsi
  640,000    1    1         0    0    0       0    0    0       0      0  0   0         subq    $1, %rdx
  640,000    0    0   640,000    0    0       0    0    0       0      0  0   0         movl    (%rsi,%rdx,4), %eax
  640,000    0    0         0    0    0       0    0    0       0      0  0   0         testl   %eax, %eax
  640,000    0    0         0    0    0       0    0    0 640,000 20,000  0   0         je      .L32
  620,000    0    0         0    0    0       0    0    0       0      0  0   0         jmp     .L41
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 4,,10
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 3
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L34:
   19,900    0    0         0    0    0       0    0    0       0      0  0   0         leaq    -1(%rdx), %rcx
   19,900    0    0    19,900    0    0       0    0    0       0      0  0   0         movl    (%rsi,%rcx,4), %eax
   19,900    0    0         0    0    0       0    0    0       0      0  0   0         testl   %eax, %eax
   19,900    0    0         0    0    0       0    0    0  19,900      2  0   0         jne     .L40
        .    .    .         .    .    .       .    .    .       .      .  .   .         movq    %rcx, %rdx
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L32:
   20,000    0    0         0    0    0       0    0    0       0      0  0   0         testq   %rdx, %rdx
   20,000    0    0         0    0    0       0    0    0  20,000    102  0   0         jne     .L34
      100    0    0         0    0    0     100    0    0       0      0  0   0         movq    $0, 40(%rdi)
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L35:
      100    0    0         0    0    0       0    0    0       0      0  0   0         xorl    %eax, %eax
      100    0    0       100    0    0       0    0    0       0      0  0   0         ret
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 4,,10
        .    .    .         .    .    .       .    .    .       .      .  .   .         .p2align 3
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L40:
   19,900    0    0         0    0    0  19,900    0    0       0      0  0   0         movq    %rdx, 40(%rdi)
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L31:
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         bsrl    %eax, %edx
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         movl    $32, %eax
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         xorl    $31, %edx
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         subl    %edx, %eax
  639,900    0    0         0    0    0       0    0    0 639,900      0  0   0         je      .L35
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         salq    $5, %rcx
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         movl    %eax, %eax
  639,900    0    0         0    0    0       0    0    0       0      0  0   0         addq    %rcx, %rax
  639,900    0    0   639,900    0    0       0    0    0       0      0  0   0         ret
        .    .    .         .    .    .       .    .    .       .      .  .   .  .L41:
  620,000    1    1         0    0    0       0    0    0       0      0  0   0         movq    %rdx, %rcx
  620,000    0    0         0    0    0       0    0    0       0      0  0   0         jmp     .L31
        .    .    .         .    .    .       .    .    .       .      .  .   .         .cfi_endproc
-------------------------------------------------------------------------------
 12878900    2    2   2579900    0    0   20000    0    0 1959800  20106  0   0
--8<---------------cut here---------------end--------------->8---

The isns (Ir) are comparable, so are the cache misses. The faster
`find_first_unset_bit()' has even far more memory writes (Dw) than the
slower `find_unset_tail()' and slightly more memory reads.

The only noticeable difference I can see are the branch counts:
`find_unset_tail()' has got about 48% more of them than
`find_first_unset_bit()'.
However, the instuction read cache misses and the branch mispredicts are
negliglble in both case and according to AMD's "Software Optimization
Guide for AMD Family 10h and 12h Processors" (which applies to my box),
the Jcc as well as the JMP insns have got a latency of 1 cycle.
So, this can't be the reason for the runtime difference or am I getting
something wrong?

Thanks alot for your time and efforts!
Best,

Nicolai