|
From: Christoph B. <bar...@or...> - 2006-10-22 11:12:29
|
Hi,

I often use valgrind's memcheck and expect a slowdown of about 100x. But my
current program shows the following behaviour:

Normal runtime: 3 minutes
Current runtime: 2408 minutes and still running

Is there an explanation for this huge slowdown? Thread usage? Huge memory
consumption?

Greetings
Christoph Bartoschek
|
From: Nicholas N. <nj...@cs...> - 2006-10-22 14:08:09
|
On Sun, 22 Oct 2006, Christoph Bartoschek wrote:

> I often use valgrind's memcheck and expect a slowdown of about 100x. But
> my current program shows the following behaviour:
>
> Normal runtime: 3 minutes
> Current runtime: 2408 minutes and still running
>
> Is there an explanation for this huge slowdown? Thread usage? Huge memory
> consumption?

Hard to say without more information. It shouldn't be that slow.

Nick
|
From: Christoph B. <bar...@or...> - 2006-10-22 20:25:51
|
On Sunday, 22 October 2006 at 16:07, Nicholas Nethercote wrote:

> Hard to say without more information. It shouldn't be that slow.

The operation has finished in the meantime and the slowdown is 1365x:
144765 seconds vs. 105 seconds.

My problem is that the operation has to be repeated 70 times until the error
I seek occurs.

You need more information? How can I provide it? What information do you
need?

Christoph Bartoschek
|
From: Nicholas N. <nj...@cs...> - 2006-10-23 02:10:20
|
On Sun, 22 Oct 2006, Christoph Bartoschek wrote:

>> Hard to say without more information. It shouldn't be that slow.
>
> The operation has finished in the meantime and the slowdown is 1365x:
> 144765 seconds vs. 105 seconds.
>
> My problem is that the operation has to be repeated 70 times until the
> error I seek occurs.
>
> You need more information? How can I provide it? What information do you
> need?

I don't know. That's what makes it tricky :)

Nick
|
From: Christoph B. <bar...@or...> - 2006-10-24 13:28:11
|
On Monday, 23 October 2006 at 04:10, Nicholas Nethercote wrote:

> On Sun, 22 Oct 2006, Christoph Bartoschek wrote:
>> The operation has finished in the meantime and the slowdown is 1365x:
>> 144765 seconds vs. 105 seconds.
>>
>> My problem is that the operation has to be repeated 70 times until the
>> error I seek occurs.
>>
>> You need more information? How can I provide it? What information do you
>> need?
>
> I don't know. That's what makes it tricky :)
>
> Nick

Are there no timers inside of valgrind to measure where most of the time is
spent? Would profiling with OProfile help?

Christoph
|
From: Christoph B. <bar...@or...> - 2006-10-24 13:38:51
|
On Monday, 23 October 2006 at 04:10, Nicholas Nethercote wrote:

> I don't know. That's what makes it tricky :)

Here is the top of the valgrind run in OProfile after some minutes. I did
not use specific compile flags. Maybe one has to add -g to get more
information.

CPU: AMD64 processors, speed 2605.94 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000

samples   %        symbol name
11007731  91.2732  vgPlain_record_ExeContext
239043     1.9821  vgPlain_use_CF_info
223989     1.8573  (no symbols)
150552     1.2483  (no symbols)
83379      0.6914  vgModuleLocal_search_one_cfitab
52679      0.4368  vgPlain_arena_free
32243      0.2674  vgMemCheck_helperc_LOADV64le
30434      0.2524  vgMemCheck_helperc_STOREV64le
28161      0.2335  vgMemCheck_helperc_MAKE_STACK_UNINIT
19305      0.1601  mc_new_mem_stack_8
16982      0.1408  set_address_range_perms
16447      0.1364  vgPlain_run_innerloop__dispatch_unprofiled
15854      0.1315  mc_die_mem_stack_8
9690       0.0803  vgPlain_unknown_SP_update
9271       0.0769  vgPlain_search_transtab
9142       0.0758  unlinkBlock
7635       0.0633  vgMemCheck_helperc_LOADV32le
7159       0.0594  vgMemCheck_helperc_STOREV32le
5380       0.0446  vgPlain_scheduler
4323       0.0358  die_and_free_mem
4306       0.0357  vgPlain_get_StackTrace2
3544       0.0294  fast_lookup_failed
3153       0.0261  vgPlain_arena_malloc
2801       0.0232  mc_expensive_sanity_check
2791       0.0231  amd64g_calculate_rflags_all_WRK
2654       0.0220  vgPlain_run_innerloop
2620       0.0217  mc_new_mem_stack_16
2424       0.0201  vgMemCheck_check_mem_is_noaccess
2157       0.0179  mkFreeBlock
2031       0.0168  vgPlain_set_shadow_regs_area
1860       0.0154  vgMemCheck_helperc_LOADV8
1576       0.0131  vgPlain_is_exiting
1522       0.0126  mc_new_mem_stack_32
1491       0.0124  vgPlain_HT_add_node
1414       0.0117  mc_die_mem_stack
1257       0.0104  vgPlain_HT_remove
1157       0.0096  ensure_mm_init
1105       0.0092  vgPlain_memcpy
|
From: Julian S. <js...@ac...> - 2006-10-24 13:43:06
|
> Are there no timers inside of valgrind to measure where most of the time
> is spent?

No. When we want to know that stuff we profile valgrind with cachegrind.

> Would profiling with OProfile help?

Probably yes.

Tell us a bit about your application.

- Does it use a lot of threads?
- Are you running on a multi-CPU machine?
- Does it do a lot of floating point?
- Are you running on a P4-variant?
- Does it do a lot of mallocs/frees?

J
|
From: Julian S. <js...@ac...> - 2006-10-24 13:56:18
|
> CPU: AMD64 processors, speed 2605.94 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> unit mask of 0x00 (No unit mask) count 100000
>
> samples   %        symbol name
> 11007731  91.2732  vgPlain_record_ExeContext

Gaaah! If that's true it is really bad. What is the inner loop of your
application doing?

J
|
From: Christoph B. <bar...@or...> - 2006-10-24 14:27:22
|
On Tuesday, 24 October 2006 at 15:55, Julian Seward wrote:
> > CPU: AMD64 processors, speed 2605.94 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> > unit mask of 0x00 (No unit mask) count 100000
> > samples % symbol name
> > 11007731 91.2732 vgPlain_record_ExeContext
>
> Gaaah! If that's true it is really bad. What is the inner loop
> of your application doing?
I would not say that it does something weird. Here is a basic outline of the
inner loop that gets a binary tree as input:
   while (tree.has_leaf()) {
      for each (buffer/inverter)
         check_merge_timing(leaf, leaf->parent)
   }
   for 2401 possibilities
      check_root_connect
check_merge_timing and check_root_connect are functions that spend 95% of
their time in extracting function approximations from a 2-dim lookup table.
Here is the exact count of malloc/free after 19hours in valgrind:
==00:19:53:01.364 9133== malloc/free: 98,138,916 allocs, 97,052,020 frees,
18,848,731,867 bytes allocated.
The run has been killed with Ctrl+C.
And finally an oprofile output during the initialization phase of the
application:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
warning: /boot/vmlinux-2.6.13-15.12-smp.gz is n
samples   %        symbol name
7836011   41.8869  (no symbols)
1049427    5.6096  vgPlain_arena_free
758054     4.0521  (no symbols)
722585     3.8625  set_address_range_perms
687323     3.6740  vgMemCheck_helperc_LOADV64le
665990     3.5600  vgMemCheck_helperc_STOREV64le
663955     3.5491  vgModuleLocal_search_one_cfitab
584375     3.1237  vgPlain_run_innerloop__dispatch_unprofiled
519607     2.7775  vgMemCheck_helperc_MAKE_STACK_UNINIT
476431     2.5467  doRegisterAllocation
297077     1.5880  mc_new_mem_stack_8
274416     1.4669  mc_die_mem_stack_8
241299     1.2898  vgMemCheck_helperc_STOREV8
238227     1.2734  vgPlain_unknown_SP_update
229308     1.2258  (no symbols)
206303     1.1028  unlinkBlock
197175     1.0540  vgPlain_use_CF_info
188233     1.0062  vgPlain_arena_malloc
160315     0.8570  mc_expensive_sanity_check
150404     0.8040  vgMemCheck_helperc_LOADV8
140847     0.7529  vgMemCheck_check_mem_is_noaccess
132725     0.7095  vgPlain_scheduler
113339     0.6058  vgPlain_search_transtab
97279      0.5200  (no symbols)
94171      0.5034  get_secmap_ptr
93965      0.5023  vgPlain_record_ExeContext
88662      0.4739  vgMemCheck_helperc_LOADV32le
81197      0.4340  amd64g_calculate_rflags_all_WRK
69063      0.3692  vgPlain_run_innerloop
64332      0.3439  mkInuseBlock
57323      0.3064  vgPlain_free
53945      0.2884  blockSane
53554      0.2863  arenaId_to_ArenaP
49555      0.2649  vgMemCheck_helperc_STOREV32le
The image name for the first symbol is
anon (tgid:15483 range:0x4029a5000-0x40f2f2000)
and for the second
anon (tgid:15483 range:0x4029a5000-0x41243c000)
One can see that during initialization the application seems to behave
normally.
Christoph Bartoschek
|
|
From: Christoph B. <bar...@or...> - 2006-10-24 13:57:50
|
Finally, the result of opannotate on the most time-consuming function:
:ExeContext* VG_(record_ExeContext) ( ThreadId tid )
7296101 23.2144 :{ /* vgPlain_record_ExeContext total: 28374282 90.2800 */
: Int i;
: Addr ips[VG_DEEPEST_BACKTRACE];
: Bool same;
: UWord hash;
: ExeContext* new_ec;
: ExeContext* list;
: UInt n_ips;
:
7 2.2e-05 : init_ExeContext_storage();
276 8.8e-04 : vg_assert(VG_(clo_backtrace_size) >= 1 &&
: VG_(clo_backtrace_size) <= VG_DEEPEST_BACKTRACE);
:
61 1.9e-04 : n_ips = VG_(get_StackTrace) ( tid, ips,
VG_(clo_backtrace_size) );
104 3.3e-04 : tl_assert(n_ips >= 1);
:
: /* Now figure out if we've seen this one before. First
hash it so
: as to determine the list number. */
:
: hash = 0;
333 0.0011 : for (i = 0; i < n_ips; i++) {
541 0.0017 : hash ^= ips[i];
225 7.2e-04 : hash = (hash << 29) | (hash >> 3);
: }
171 5.4e-04 : hash = hash % N_EC_LISTS;
:
: /* And (the expensive bit) look a matching entry in the
list. */
:
: ec_searchreqs++;
:
2 6.4e-06 : list = ec_list[hash];
:
: while (True) {
688333 2.1901 : if (list == NULL) break;
: ec_searchcmps++;
: same = True;
6002376 19.0981 : for (i = 0; i < n_ips; i++) {
11124624 35.3958 : if (list->ips[i] != ips[i]) {
: same = False;
: break;
: }
: }
: if (same) break;
3260252 10.3733 : list = list->next;
: }
:
15 4.8e-05 : if (list != NULL) {
: /* Yay! We found it. */
: return list;
: }
:
: /* Bummer. We have to allocate a new context record. */
: ec_totstored++;
:
: new_ec = VG_(arena_malloc)( VG_AR_EXECTXT,
: sizeof(struct _ExeContext)
: + n_ips * sizeof(Addr) );
:
: for (i = 0; i < n_ips; i++)
: new_ec->ips[i] = ips[i];
:
: new_ec->n_ips = n_ips;
: new_ec->next = ec_list[hash];
818 0.0026 : ec_list[hash] = new_ec;
:
: return new_ec;
43 1.4e-04 :}
|
|
From: Nicholas N. <nj...@cs...> - 2006-10-24 15:51:26
|
On Tue, 24 Oct 2006, Christoph Bartoschek wrote:
> Finally the result of opannotate on the most timeconsuming function:
>
> :ExeContext* VG_(record_ExeContext) ( ThreadId tid )
> 7296101 23.2144 :{ /* vgPlain_record_ExeContext total: 28374282 90.2800 */
This is the function that records stack traces, which are associated with
allocations and errors. Each unique stack trace is recorded only once: upon
creation, each stack trace (aka ExeContext) is compared to the existing ones
in the table, and added only if it differs from all of them. If it is the
same as an existing one, a pointer to the one already in the table is used
instead.

It looks like your program is causing a lot of very similar and/or identical
stack traces to be created, and the comparisons against the existing ones
are taking up a lot of time. Would running with --num-callers=8 (or even
--num-callers=1) be much faster? Increasing N_EC_LISTS might help. But it
seems like there's a flaw in the approach if it's working so badly.
Another possible cause of slowdown is that you're running on an 8-way
machine, and Valgrind serialises execution of threads. Although you said
the work is mostly done by one thread so maybe this isn't important.
> : Int i;
> : Addr ips[VG_DEEPEST_BACKTRACE];
> : Bool same;
> : UWord hash;
> : ExeContext* new_ec;
> : ExeContext* list;
> : UInt n_ips;
> :
> 7 2.2e-05 : init_ExeContext_storage();
> 276 8.8e-04 : vg_assert(VG_(clo_backtrace_size) >= 1 &&
> : VG_(clo_backtrace_size) <= VG_DEEPEST_BACKTRACE);
Do you know what these two columns mean?
Nick
> :
> 61 1.9e-04 : n_ips = VG_(get_StackTrace) (tid, ips
> VG_(clo_backtrace_size) );
> 104 3.3e-04 : tl_assert(n_ips >= 1);
> :
> : /* Now figure out if we've seen this one before. First
> hash it so
> : as to determine the list number. */
> :
> : hash = 0;
> 333 0.0011 : for (i = 0; i < n_ips; i++) {
> 541 0.0017 : hash ^= ips[i];
> 225 7.2e-04 : hash = (hash << 29) | (hash >> 3);
> : }
> 171 5.4e-04 : hash = hash % N_EC_LISTS;
> :
> : /* And (the expensive bit) look a matching entry in the
> list. */
> :
> : ec_searchreqs++;
> :
> 2 6.4e-06 : list = ec_list[hash];
> :
> : while (True) {
> 688333 2.1901 : if (list == NULL) break;
> : ec_searchcmps++;
> : same = True;
> 6002376 19.0981 : for (i = 0; i < n_ips; i++) {
> 11124624 35.3958 : if (list->ips[i] != ips[i]) {
> : same = False;
> : break;
> : }
> : }
> : if (same) break;
> 3260252 10.3733 : list = list->next;
> : }
> :
> 15 4.8e-05 : if (list != NULL) {
> : /* Yay! We found it. */
> : return list;
> : }
> :
> : /* Bummer. We have to allocate a new context record. */
> : ec_totstored++;
> :
> : new_ec = VG_(arena_malloc)( VG_AR_EXECTXT,
> : sizeof(struct _ExeContext)
> : + n_ips * sizeof(Addr) );
> :
> : for (i = 0; i < n_ips; i++)
> : new_ec->ips[i] = ips[i];
> :
> : new_ec->n_ips = n_ips;
> : new_ec->next = ec_list[hash];
> 818 0.0026 : ec_list[hash] = new_ec;
> :
> : return new_ec;
> 43 1.4e-04 :}
|
|
From: Christoph B. <bar...@or...> - 2006-10-24 16:05:03
|
On Tuesday, 24 October 2006 at 17:51, Nicholas Nethercote wrote:
>
> > : Int i;
> > : Addr ips[VG_DEEPEST_BACKTRACE];
> > : Bool same;
> > : UWord hash;
> > : ExeContext* new_ec;
> > : ExeContext* list;
> > : UInt n_ips;
> >
> > 7 2.2e-05 : init_ExeContext_storage();
> > 276 8.8e-04 : vg_assert(VG_(clo_backtrace_size) >= 1 &&
> >
> > : VG_(clo_backtrace_size) <=
> > : VG_DEEPEST_BACKTRACE);
>
> Do you know what these two columns mean?
>
OProfile is a sampling profiler: every 100,000 events of type
CPU_CLK_UNHALTED it checks where the execution currently is and increments
the counter for that position. Here we see those counters.

For example: during the whole profiling run OProfile stopped 7 times in the
line with

   init_ExeContext_storage();

It did not stop inside the called function, but in the code of the current
function preparing or evaluating the call to init_ExeContext_storage().

The second number is the percentage of all events that fell on this line;
here the number is negligible. Another example is:

   11124624 35.3958 : if (list->ips[i] != ips[i]) {

Here we see that 35.4% of all taken samples have been in this line of code.
If the profiling session is long enough (and I think mine was long enough),
one can conclude that 35.4% of the whole runtime during the profiling
session was spent in this line of code.
Christoph Bartoschek
|
|
From: Christoph B. <bar...@or...> - 2006-10-24 14:06:39
|
On Tuesday, 24 October 2006 at 15:42, Julian Seward wrote:

> Tell us a bit about your application.
>
> - Does it use a lot of threads?

There are two threads that are most of the time waiting in
pthread_cond_wait. Then there is a master thread that is also waiting in a
pthread_join, and finally there is one working thread that has all the CPU
time.

> - Are you running on a multi-CPU machine?

Yes, it is an AMD Opteron machine with 8 CPUs, each 2600 MHz with 1 MB L2
cache. The machine has 64 GB of memory.

> - Does it do a lot of floating point?

I would say 50% floating point, 50% integer.

> - Are you running on a P4-variant?

I guess Opteron is not a P4 variant because of 64 bits.

> - Does it do a lot of mallocs/frees?

There are a lot of mallocs/frees, but I would not say that the number is
very high. I could measure the exact number of allocs/frees if this is
important.

Christoph Bartoschek
|
From: Julian S. <js...@ac...> - 2006-10-24 14:10:37
|
Fine so far.

> > - Does it do a lot of mallocs/frees?
>
> There are a lot of mallocs/frees but I would not say that the number is
> very high. I could measure the exact number of allocs/frees if this is
> more important.

More info would be useful.

Roughly how many simultaneously live blocks do you have allocated?
eg, 100k blocks? 1 million blocks?

Roughly how many total allocations/frees does the program do?
(running memcheck with -v will tell you).

The first question is more important.

J
|
From: Christoph B. <bar...@or...> - 2006-10-24 14:37:17
|
On Tuesday, 24 October 2006 at 16:10, Julian Seward wrote:

> More info would be useful.
>
> Roughly how many simultaneously live blocks do you have allocated?
> eg, 100k blocks? 1 million blocks?
>
> Roughly how many total allocations/frees does the program do?
> (running memcheck with -v will tell you).
>
> The first question is more important.

It is not easy to answer the first question. The program is running inside
of a nutshell and I cannot say how many live blocks the shell has when my
application is started. The run I terminated with Ctrl+C and mentioned in an
earlier mail had 98,138,916 allocations and 97,052,020 frees. That indicates
a number of 1,086,896 live blocks.

My core application operates only on about 100-10,000 live blocks. The rest
belongs to the nutshell.

I have now started the program with -v. What is the interesting output?

Christoph
|
From: Julian S. <js...@ac...> - 2006-10-24 14:44:34
|
> It is not easy to answer the first question. The program is running
> inside of a nutshell and I cannot say how many live blocks the shell has
> when my application is started. The run I terminated with Ctrl+C and
> mentioned in an earlier mail had 98,138,916 allocations and 97,052,020
> frees. That indicates a number of 1,086,896 live blocks.
>
> My core application operates only on about 100-10,000 live blocks. The
> rest belongs to the nutshell.
>
> I have now started the program with -v. What is the interesting output?

It is in fact this; I mistakenly thought it appears only at -v. I guess the
new run with -v is pointless.

== malloc/free: 98,138,916 allocs, 97,052,020 frees,
   18,848,731,867 bytes allocated.

It all sounds like stupid questions but we're slowly accumulating useful
info.

Now I look at the annotated source you sent, and also the stuff about the
tree traversal. And I'm thinking: are you doing very deep recursion (I mean,
function-call recursion) to process the tree(s), and at the same time doing
some mallocs/frees during that recursion?

J
|
From: Julian S. <js...@ac...> - 2006-10-24 14:49:38
|
> Now I look at the annotated source you sent, and also the stuff about
> the tree traversal. And I'm thinking: are you doing very deep recursion
> (I mean, function-call recursion) to process the tree(s), and at the
> same time doing some mallocs/frees during that recursion?

Reason I ask is that maybe the hash function used in VG_(record_ExeContext)
(which takes snapshots of the stack whenever there is a malloc/free) is
having a bad case here and is causing the snapshots for a very large number
of blocks to all be on the same hash table chain. That would explain the
profiles you sent, although I am still not sure whether it is possible to
write a program that would cause this to happen.

Oh yes, one other question. Are you using --num-callers=N? If yes, what is N
and how does it relate to the recursion depth(s) in your tree traversal?

J
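[Editor's note: the --num-callers question matters because snapshots are
truncated to N frames, so call chains that differ only deeper than N yield
identical snapshots, which then compete on the same hash chain. A tiny
stand-alone illustration with invented values, not valgrind code:]

```c
#include <assert.h>
#include <string.h>

typedef unsigned long UWord;

/* Compare two captured call chains truncated to their top `n` frames,
   the way a --num-callers=n snapshot would see them. */
static int same_snapshot(const UWord* s1, const UWord* s2, int n)
{
   return memcmp(s1, s2, (size_t)n * sizeof(UWord)) == 0;
}
```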
|
From: Christoph B. <bar...@or...> - 2006-10-24 15:01:13
|
On Tuesday, 24 October 2006 at 16:44, you wrote:

> == malloc/free: 98,138,916 allocs, 97,052,020 frees,
>    18,848,731,867 bytes allocated.
>
> It all sounds like stupid questions but we're slowly accumulating
> useful info.
>
> Now I look at the annotated source you sent, and also the stuff about
> the tree traversal. And I'm thinking, are you doing very deep recursion
> (I mean, function-call recursion) to process the tree(s), and at the
> same time doing some mallocs/frees during that recursion?

The algorithm does not use function-call recursion. I forgot to mention that
the current leaf is removed from the tree after each step, and I have an
iterator that gives me the current leaf without recursion. However, there
might be some recursion in subroutines I cannot control.

All mallocs/frees are at the beginning and end of an instance of the
algorithm, not during the inner loop.

I used --num-callers=32 in the run I profiled. The number is not related to
the core algorithm but to the fact that one runs inside of the shell. A
typical backtrace looks like this in our application, and I like to see the
whole backtrace:

3-5 frames memory allocation
5-9 frames core application
8-15 frames TCL interpreter
~10 frames shell

I've just started a run with --num-callers=8; maybe it is faster. I also
have a suppression file for the shell with 16 entries. Maybe this is
important.

Greetings
Christoph
|
From: Julian S. <js...@ac...> - 2006-10-24 15:58:48
|
> It looks like your program is causing a lot of very similar and/or
> identical stack traces to be created, and the comparisons against the
> existing ones are taking up a lot of time.

That was my guess too. But later it occurred to me, Christoph said the
program first has a nutshell (container, I guess?) which allocates a lot of
memory - maybe 1M blocks - and then the main algorithm runs. Suppose the
nutshell fills the ExeContext table up a lot. Then when the algorithm runs
and does malloc/free, these will be expensive because each one requires
searching through long hash chains.

If that's true, one simple fix would be the usual hack of moving an
ExeContext in a chain one step forwards for each successful search. Various
other tables already do that.

I suspect this is easy to fix if we have a way to reproduce the problem.

J
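[Editor's note: the "move one step forwards" hack can be sketched like this.
A minimal stand-alone illustration with invented names (`Node`,
`chain_lookup`), not valgrind's actual code:]

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { int key; struct Node* next; } Node;

/* Search a chain for `key`.  On a hit that is not already at the head,
   swap the found node one position towards the front, so entries that are
   looked up repeatedly migrate to the start of the chain over time. */
static Node* chain_lookup(Node** head, int key)
{
   Node* prev2 = NULL;   /* node before `prev` */
   Node* prev  = NULL;   /* node before `cur`  */
   Node* cur   = *head;
   while (cur != NULL) {
      if (cur->key == key) {
         if (prev != NULL) {
            /* unlink `cur` and reinsert it one step earlier */
            prev->next = cur->next;
            cur->next  = prev;
            if (prev2 != NULL) prev2->next = cur; else *head = cur;
         }
         return cur;
      }
      prev2 = prev; prev = cur; cur = cur->next;
   }
   return NULL;
}
```

The payoff is that a burst of lookups for the same entry quickly stops
paying for the long cold prefix of the chain.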
|
From: Christoph B. <bar...@or...> - 2006-10-24 17:03:20
|
On Tuesday, 24 October 2006 at 17:58, Julian Seward wrote:
> > It looks like your program is causing a lot of very similar and/or
> > identical stack traces to be created, and the comparisons against the
> > existing ones are taking up a lot of time.
>
> That was my guess too. But later it occurred to me, Christoph
> said the program first has a nutshell (container, I guess?)
> which allocates a lot of memory - maybe 1 M blocks, and then the
> main algorithm runs. Suppose the nutshell fills the table
> execontext table up a lot. Then when the algorithm runs and
> does malloc/free, these will be expensive because each one
> requires searching through long hash chains.
>
> If that's true one simple fix would be the usual hack of moving
> an ExeContext in a chain one step forwards for each successful
> search. Various other tables already do that.
>
> I suspect this is easy to fix if we have a way to reproduce
> the problem.
I have written a small test program that seems to reproduce the problem.
First, for a given x, I allocate some memory at 10^x different positions and
do not free it. Then I allocate and deallocate in an endless loop at the
same position.

This program should simulate the situation you pointed out: the shell
allocates 1,000,000 blocks of memory and then the core algorithm runs,
allocating only small amounts of memory.

Here is the time that valgrind uses to allocate the memory:

x=4: 0.8 seconds
x=5: 92 seconds

I would expect that for x=5 allocation takes below 10 seconds.

The program prints the time after every 10,000 allocations. In the run phase
the program is about 10x slower for x=5 than for x=4: 10,000 allocations
take 0.03 seconds for x=5 and 0.002 seconds for x=4.

Using OProfile the program shows for x=5 or x=6 the profile I see for my
main application.
#include <iostream>
#include <sstream>
#include <iomanip>
#include <cstdlib>

extern "C" {
#include <sys/time.h>
}

void * alloc() {
    static int count = 0;
    ++count;
    if (count % 10000 == 0) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        std::cout << "Allocating: " << count << " Time: " << tv.tv_sec << "."
                  << std::setw(6) << tv.tv_usec << std::endl;
    }
    return new int[random() % 100];
}

void allocate(int count) {
    if (count == 0) {
        alloc();
        return;
    }
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
    allocate(count-1);
}

void run() {
    while (true) {
        alloc();
    }
}

int main(int argc, char ** argv) {
    if (argc == 1) return -1;
    std::stringstream ss(argv[1]);
    int depth;
    ss >> depth;
    if (not ss) return -1;
    allocate(depth);
    std::cout << "Allocation ready" << std::endl;
    run();
}
Greetings
Christoph
|
|
From: Bryan M. <bra...@go...> - 2006-10-24 17:02:37
|
Julian Seward wrote:

>> It looks like your program is causing a lot of very similar and/or
>> identical stack traces to be created, and the comparisons against the
>> existing ones are taking up a lot of time.
>
> That was my guess too. But later it occurred to me, Christoph said the
> program first has a nutshell (container, I guess?) which allocates a lot
> of memory - maybe 1M blocks - and then the main algorithm runs. Suppose
> the nutshell fills the ExeContext table up a lot. Then when the algorithm
> runs and does malloc/free, these will be expensive because each one
> requires searching through long hash chains.
>
> If that's true, one simple fix would be the usual hack of moving an
> ExeContext in a chain one step forwards for each successful search.
> Various other tables already do that.

You could also store the full hash value along with the list entry. This
adds another simple equality test (at the cost of some memory) before you
start to iterate up the stack trace doing the expensive full comparison, and
should also be simple to add in.

Bryan
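[Editor's note: Bryan's suggestion, sketched with invented names (the real
ExeContext layout differs) — keep the full pre-modulo hash in each entry and
compare it before the expensive element-by-element walk:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long UWord;

typedef struct Ctx {
   struct Ctx* next;
   UWord       full_hash;  /* hash stored before the modulo-by-buckets */
   int         n_ips;
   UWord       ips[8];
} Ctx;

/* Walk a chain; the stored full hash gives a cheap reject before the
   per-frame memcmp, so non-matching entries cost one comparison each. */
static Ctx* find(Ctx* list, UWord full_hash, const UWord* ips, int n_ips)
{
   for (; list != NULL; list = list->next) {
      if (list->full_hash != full_hash)
         continue;                        /* cheap reject */
      if (list->n_ips == n_ips
          && memcmp(list->ips, ips, (size_t)n_ips * sizeof(UWord)) == 0)
         return list;                     /* genuine match */
   }
   return NULL;
}
```

Note this only helps if the full hash distinguishes entries that share a
bucket, which is Bryan's follow-up question below.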
|
From: Julian S. <js...@ac...> - 2006-10-24 18:29:34
|
> > If that's true one simple fix would be the usual hack of moving
> > an ExeContext in a chain one step forwards for each successful
> > search. Various other tables already do that.
>
> You could also store the full hash value along with the list entry. This
> adds another simple equality test (at the cost of some memory) before
> you start to iterate up the stack trace doing the expensive full
> comparison and should also be simple to add in.
Er .. no. Each list contains elements with the same hash value -
the only point of the initial hash value is to decide which list
to use. I did spot that the hash computation isn't 64-bit clean,
though, which can't help.
   hash = 0;
   for (i = 0; i < n_ips; i++) {
      hash ^= ips[i];
      hash = (hash << 29) | (hash >> 3);
   }
   hash = hash % N_EC_LISTS;
I'll look at it more later this evening. Christoph's program doesn't
reproduce the problem on x86 so it must be something amd64 specific.
J
|
|
From: Bryan M. <bra...@go...> - 2006-10-24 18:51:08
|
Julian Seward wrote:
>>> If that's true one simple fix would be the usual hack of moving
>>> an ExeContext in a chain one step forwards for each successful
>>> search. Various other tables already do that.
>> You could also store the full hash value along with the list entry. This
>> adds another simple equality test (at the cost of some memory) before
>> you start to iterate up the stack trace doing the expensive full
>> comparison and should also be simple to add in.
>
> Er .. no. Each list contains elements with the same hash value -
> the only point of the initial hash value is to decide which list
> to use. I did spot that the hash computation isn't 64-bit clean,
> though, which can't help.
>
>    hash = 0;
>    for (i = 0; i < n_ips; i++) {
>       hash ^= ips[i];
>       hash = (hash << 29) | (hash >> 3);
>    }
>    hash = hash % N_EC_LISTS;
>
> I'll look at it more later this evening. Christoph's program doesn't
> reproduce the problem on x86 so it must be something amd64 specific.
>
I was thinking more along the lines of

   bucket = hash % N_EC_LISTS;

Use the bucket in place of the current hash value to get the right chain and
store the full hash for later comparison. Or is the full hash likely to
produce identical values as well?
|
|
From: Christoph B. <bar...@or...> - 2006-10-24 19:40:10
|
On Tuesday, 24 October 2006 at 20:28, Julian Seward wrote:
> > > If that's true one simple fix would be the usual hack of moving
> > > an ExeContext in a chain one step forwards for each successful
> > > search. Various other tables already do that.
> >
> > You could also store the full hash value along with the list entry. This
> > adds another simple equality test (at the cost of some memory) before
> > you start to iterate up the stack trace doing the expensive full
> > comparison and should also be simple to add in.
>
> Er .. no. Each list contains elements with the same hash value -
> the only point of the initial hash value is to decide which list
> to use. I did spot that the hash computation isn't 64-bit clean,
> though, which can't help.
>
>    hash = 0;
>    for (i = 0; i < n_ips; i++) {
>       hash ^= ips[i];
>       hash = (hash << 29) | (hash >> 3);
>    }
>    hash = hash % N_EC_LISTS;
You are right. For the first 30,009 allocations in my test case the buckets
are:
Bucket: 3770 Count: 1
Bucket: 4023 Count: 1
Bucket: 4131 Count: 1
Bucket: 4143 Count: 1
Bucket: 4497 Count: 1
Bucket: 6757 Count: 2400
Bucket: 13338 Count: 1
Bucket: 14909 Count: 1
Bucket: 19312 Count: 1
Bucket: 21097 Count: 25799
Bucket: 23305 Count: 1
Bucket: 23854 Count: 1800
Bucket: 25883 Count: 1
Changing the line to

   hash = (hash << 61) | (hash >> 3);

leads to a better distribution of the values and a better runtime for the
test case. However, I do not know whether rotation by 3 really leads to a
good hash function for 64 bits.
Additionally, why is no dynamic hashmap used? One that expands the number of
buckets each time the density gets too high.
Christoph
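[Editor's note: the growable table Christoph asks about is the standard
load-factor trick. A minimal generic sketch, not tied to valgrind's
internals or its allocator constraints:]

```c
#include <assert.h>
#include <stdlib.h>

/* A chained hash set of integer keys that doubles its bucket array when
   the load factor (entries / buckets) exceeds 1, rehashing every entry. */
typedef struct Entry { unsigned long key; struct Entry* next; } Entry;

typedef struct {
   Entry** buckets;
   size_t  n_buckets;
   size_t  n_entries;
} Table;

static void table_init(Table* t)
{
   t->n_buckets = 8;
   t->n_entries = 0;
   t->buckets   = calloc(t->n_buckets, sizeof(Entry*));
}

static void table_grow(Table* t)
{
   size_t  new_n = t->n_buckets * 2;
   Entry** nb    = calloc(new_n, sizeof(Entry*));
   size_t  i;
   for (i = 0; i < t->n_buckets; i++) {
      Entry* e = t->buckets[i];
      while (e != NULL) {
         Entry* nxt = e->next;
         size_t b   = e->key % new_n;   /* rehash into the wider table */
         e->next = nb[b];
         nb[b]   = e;
         e = nxt;
      }
   }
   free(t->buckets);
   t->buckets   = nb;
   t->n_buckets = new_n;
}

static void table_add(Table* t, unsigned long key)
{
   size_t b;
   Entry* e;
   if (t->n_entries + 1 > t->n_buckets)
      table_grow(t);
   b = key % t->n_buckets;
   e = malloc(sizeof(Entry));
   e->key  = key;
   e->next = t->buckets[b];
   t->buckets[b] = e;
   t->n_entries++;
}
```

Growing keeps chains short at the cost of an occasional full rehash pass.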
|
|
From: Julian S. <js...@ac...> - 2006-10-24 20:06:14
|
> hash = (hash << 61) | (hash >> 3);
>
> leads to a better distribution of the values and better runtime for the
> testcase. However I do not know whether rotation by 3 really leads to a
> good hash function for 64 bit.
Yeh, rotating by 3 is stupid and doesn't mix the bits well for short
stack traces. I just tried this, which mixes much better:
static inline UWord ROLW ( UWord w, Int n )
{
   Int bpw = 8 * sizeof(void*);
   w = (w << n) | (w >> (bpw-n));
   return w;
}

...

   hash = 0;
   for (i = 0; i < n_ips; i++) {
      hash ^= ips[i];
      hash = ROLW(hash, 19);
   }
and that improves performance on your test case by about a factor
of 10. Can you try it?
> Additionally, why is no dynamic hashmap used? One that expands the number
> of buckets each time the density gets to high.
Partly laziness, and partly maybe we need the hashmap to be available before
the dynamic memory manager is running.
J
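[Editor's note: the difference between the two mixing steps can be shown
directly. On 64-bit words `(hash << 29) | (hash >> 3)` is not a rotation —
the two shifted copies overlap in the middle bits and are merged with OR, so
distinct inputs can collide in a single step — whereas ROLW is a true,
invertible rotation. A stand-alone check, assuming a 64-bit platform:]

```c
#include <assert.h>

typedef unsigned long UWord;   /* 64 bits on amd64 */

/* The original mixing step: a genuine rotate-left-by-29 on 32-bit words
   (29 + 3 == 32), but lossy on 64-bit words. */
static UWord mix_old(UWord w)
{
   return (w << 29) | (w >> 3);
}

/* The replacement from the thread: a rotate-left parameterised on the
   word size, hence bijective. */
static inline UWord ROLW(UWord w, int n)
{
   int bpw = 8 * sizeof(void*);
   w = (w << n) | (w >> (bpw - n));
   return w;
}
```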
|