|
From: Nicolai S. <nic...@gm...> - 2013-11-23 21:47:02
|
Hi everybody,
I'm new in profiling and got stuck in analyzing cachegrind's output.
Any help and hints for further profiling strategies are appreciated!
It is not so important for me to get the problem at hand optimized, but
to get some insights into profiling.
Background:
I tried to implement a pool allocator for fixed size obects in C++.
Since I know, that the objects allocated last will most likely get
freed first, I attempted to implement some shrink-functionality: if
some threshold of free slots are available at the end, shrink the
pool.
However, I failed miserably: My allocator is even slower than
malloc/free.
In my first attempt, I implemented some bitset as an array of ints to
keep track of the occupied slots.
I have two functions: find_first_unset_bit() (needed for allocation of
blocks) and find_unset_tail() (needed for the shrink-feature).
A gprof-run told me, that these two are hotspots and that the second
one takes about 30% more time than the first one (for the most optimal
allocation/deallocation pattern).
I thought that it would be good to examine the reason for this as
a starting point.
A line-by-line gprof-analysis showed up to be pretty useless:
disassembling my objects via `objdump -ld ...' told me that part of
the instructions in my `-O3'-optimized binary are assigned to the
"wrong" source lines.
So I went on to try cachegrind out. Unfortunately, the results look
pretty much the same for both functions (see below) and I'm stuck now.
To give you some introduction on how the two functions are implemented
(if you like, I could also paste the original C++ code, but I think
this is quite pointless):
`find_first_unset_bit()' and `find_unset_tail()'
make use of some member-variables: `_i_word_first_unset' and
`_i_word_unset_tail' respecively. These are indices into the array of
integers and updated by the bit setting and clearing functions. For
the first, it is guaranteed that all words before this index are
completely set (0xffffffff), for the second, it is guaranteed, that
all words from that index on are zero.
Integers are searched from these indices either forward or backwards for values
!= 0xffffffff, != 0 resp.
Then a bsf/bsr insn is utilized to get the first nonzero bit/the last
nonzero bit.
Here is cachegrind's output for the two (with accumulated counts added
for your convenience):
--8<---------------cut here---------------start------------->8---
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
. . . . . . . . . . . . . _ZNK5xmltr6memory15input_page_impl14dynamic_bitset20find_first_unset_bitEv:
. . . . . . . . . . . . . .LFB866:
. . . . . . . . . . . . . .cfi_startproc
640,000 1 1 640,000 0 0 0 0 0 0 0 0 0 movq 8(%rdi), %rcx
640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 16(%rdi), %rax
640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 32(%rdi), %rdx
640,000 0 0 0 0 0 0 0 0 0 0 0 0 subq %rcx, %rax
640,000 0 0 0 0 0 0 0 0 0 0 0 0 sarq $2, %rax
640,000 0 0 0 0 0 0 0 0 0 0 0 0 cmpq %rdx, %rax
640,000 0 0 0 0 0 0 0 0 640,000 6 0 0 jne .L24
100 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L21
. . . . . . . . . . . . . .p2align 4,,10
. . . . . . . . . . . . . .p2align 3
. . . . . . . . . . . . . .L27:
19,900 0 0 0 0 0 0 0 0 0 0 0 0 addq $1, %rdx
19,900 0 0 0 0 0 0 0 0 0 0 0 0 cmpq %rdx, %rax
19,900 0 0 0 0 0 0 0 0 19,900 0 0 0 je .L21
. . . . . . . . . . . . . .L24:
659,800 0 0 659,800 0 0 0 0 0 0 0 0 0 movl (%rcx,%rdx,4), %esi
659,800 0 0 0 0 0 0 0 0 0 0 0 0 xorl $-1, %esi
659,800 0 0 0 0 0 0 0 0 659,800 19,900 0 0 je .L27
639,900 0 0 0 0 0 0 0 0 0 0 0 0 bsfl %esi, %esi
639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl $-1, %eax
639,900 0 0 0 0 0 639,900 0 0 0 0 0 0 movq %rdx, 32(%rdi)
639,900 0 0 0 0 0 0 0 0 0 0 0 0 cmovne %esi, %eax
639,900 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rdx
639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl %eax, %eax
639,900 0 0 0 0 0 0 0 0 0 0 0 0 addq %rdx, %rax
639,900 0 0 639,900 0 0 0 0 0 0 0 0 0 ret
. . . . . . . . . . . . . .p2align 4,,10
. . . . . . . . . . . . . .p2align 3
. . . . . . . . . . . . . .L21:
100 1 1 0 0 0 100 0 0 0 0 0 0 movq %rdx, 32(%rdi)
100 0 0 0 0 0 0 0 0 0 0 0 0 xorl %eax, %eax
100 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rdx
100 0 0 0 0 0 0 0 0 0 0 0 0 addq %rdx, %rax
100 0 0 100 0 0 0 0 0 0 0 0 0 ret
. . . . . . . . . . . . . .cfi_endproc
-------------------------------------------------------------------------------
11638900 2 2 3219800 0 0 640000 0 0 1319700 19906 0 0
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
. . . . . . . . . . . . . _ZNK5xmltr6memory15input_page_impl14dynamic_bitset15find_unset_tailEv:
. . . . . . . . . . . . . .LFB867:
. . . . . . . . . . . . . .cfi_startproc
640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 40(%rdi), %rdx
640,000 0 0 0 0 0 0 0 0 0 0 0 0 testq %rdx, %rdx
640,000 0 0 0 0 0 0 0 0 640,000 2 0 0 je .L35
640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 8(%rdi), %rsi
640,000 1 1 0 0 0 0 0 0 0 0 0 0 subq $1, %rdx
640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movl (%rsi,%rdx,4), %eax
640,000 0 0 0 0 0 0 0 0 0 0 0 0 testl %eax, %eax
640,000 0 0 0 0 0 0 0 0 640,000 20,000 0 0 je .L32
620,000 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L41
. . . . . . . . . . . . . .p2align 4,,10
. . . . . . . . . . . . . .p2align 3
. . . . . . . . . . . . . .L34:
19,900 0 0 0 0 0 0 0 0 0 0 0 0 leaq -1(%rdx), %rcx
19,900 0 0 19,900 0 0 0 0 0 0 0 0 0 movl (%rsi,%rcx,4), %eax
19,900 0 0 0 0 0 0 0 0 0 0 0 0 testl %eax, %eax
19,900 0 0 0 0 0 0 0 0 19,900 2 0 0 jne .L40
. . . . . . . . . . . . . movq %rcx, %rdx
. . . . . . . . . . . . . .L32:
20,000 0 0 0 0 0 0 0 0 0 0 0 0 testq %rdx, %rdx
20,000 0 0 0 0 0 0 0 0 20,000 102 0 0 jne .L34
100 0 0 0 0 0 100 0 0 0 0 0 0 movq $0, 40(%rdi)
. . . . . . . . . . . . . .L35:
100 0 0 0 0 0 0 0 0 0 0 0 0 xorl %eax, %eax
100 0 0 100 0 0 0 0 0 0 0 0 0 ret
. . . . . . . . . . . . . .p2align 4,,10
. . . . . . . . . . . . . .p2align 3
. . . . . . . . . . . . . .L40:
19,900 0 0 0 0 0 19,900 0 0 0 0 0 0 movq %rdx, 40(%rdi)
. . . . . . . . . . . . . .L31:
639,900 0 0 0 0 0 0 0 0 0 0 0 0 bsrl %eax, %edx
639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl $32, %eax
639,900 0 0 0 0 0 0 0 0 0 0 0 0 xorl $31, %edx
639,900 0 0 0 0 0 0 0 0 0 0 0 0 subl %edx, %eax
639,900 0 0 0 0 0 0 0 0 639,900 0 0 0 je .L35
639,900 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rcx
639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl %eax, %eax
639,900 0 0 0 0 0 0 0 0 0 0 0 0 addq %rcx, %rax
639,900 0 0 639,900 0 0 0 0 0 0 0 0 0 ret
. . . . . . . . . . . . . .L41:
620,000 1 1 0 0 0 0 0 0 0 0 0 0 movq %rdx, %rcx
620,000 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L31
. . . . . . . . . . . . . .cfi_endproc
-------------------------------------------------------------------------------
12878900 2 2 2579900 0 0 20000 0 0 1959800 20106 0 0
--8<---------------cut here---------------end--------------->8---
The isns (Ir) are comparable, so are the cache misses. The faster
`find_first_unset_bit()' has even far more memory writes (Dw) than the
slower `find_unset_tail()' and slightly more memory reads.
The only noticeable difference I can see are the branch counts:
`find_unset_tail()' has got about 48% more of them than
`find_first_unset_bit()'.
However, the instuction read cache misses and the branch mispredicts are
negliglble in both case and according to AMD's "Software Optimization
Guide for AMD Family 10h and 12h Processors" (which applies to my box),
the Jcc as well as the JMP insns have got a latency of 1 cycle.
So, this can't be the reason for the runtime difference or am I getting
something wrong?
Thanks alot for your time and efforts!
Best,
Nicolai
|