|
From: Siddharth N. <sn...@dr...> - 2014-06-13 01:22:23
|
Hi All,

I modified the Callgrind tool to measure the number of bytes communicated between functions. To do this, I used a shadow memory similar to what was described in the "How to shadow every byte" paper. As my secondary maps are quite large (sometimes up to 2.5MB), the tool's memory usage becomes sky-high, especially for memory-intensive applications. I am keeping track of all my VG_mallocs and periodically calculate the heap usage.

However, I noticed that for some memory-intensive benchmarks the VIRT reported by Linux is much larger than my estimate. For example, in one scenario I count 18GB of allocated memory, while the VIRT reported by "top" shows 32GB. The tool stops running on the next memory allocation request, with Valgrind saying that it cannot allocate any more memory. I checked my calculation of heap usage several times and I believe it's correct.

I tried running the tool under Memcheck itself to see if there are any leaks, but Memcheck does not report anything wrong. In fact, judging by the numbers I see, I'm not sure Memcheck is seeing many of the allocations requested by my tool.

Any theories on why this may be happening? Is there a way I can periodically check whether the number of new virtual pages requested during a malloc is greater than the actual malloc'ed memory?

Thanks,
Sid |
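For readers unfamiliar with the two-level scheme from the "How to shadow every byte" paper, here is a minimal illustrative sketch in Python (the actual tool is C code inside Valgrind; the names and the `SECONDARY_SPAN` size here are hypothetical): a primary map indexes the high bits of an address, and the secondary map covering that region is allocated lazily on first write. This lazy allocation of large secondary maps is exactly what drives memory usage up on memory-hungry workloads.

```python
# Illustrative two-level shadow memory (not Valgrind code; sizes and
# names are made up for the sketch).
SECONDARY_SPAN = 1 << 16           # bytes of address space per secondary map

class ShadowMemory:
    def __init__(self):
        self.primary = {}          # high address bits -> secondary map
        self.secondaries_allocated = 0

    def _secondary_for(self, addr):
        hi = addr // SECONDARY_SPAN
        sm = self.primary.get(hi)
        if sm is None:
            # Lazy allocation: in the real tool this is the step that
            # calls VG_(malloc) for a multi-megabyte secondary map.
            sm = bytearray(SECONDARY_SPAN)
            self.primary[hi] = sm
            self.secondaries_allocated += 1
        return sm

    def set_shadow(self, addr, value):
        self._secondary_for(addr)[addr % SECONDARY_SPAN] = value

    def get_shadow(self, addr):
        sm = self.primary.get(addr // SECONDARY_SPAN)
        return sm[addr % SECONDARY_SPAN] if sm is not None else 0
```

Touching two addresses far apart allocates two separate secondary maps; an application that touches memory all over its address space therefore forces many such allocations.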
|
From: Philippe W. <phi...@sk...> - 2014-06-13 17:30:37
|
On Thu, 2014-06-12 at 21:21 -0400, Siddharth Nilakantan wrote:
> Any theories on why this may be happening? Is there a way I can
> periodically check if the number of new virtual pages requested during
> a malloc is greater than the actual malloc'ed memory?

While the tool is running, you could launch from the shell at regular intervals:
  vgdb v.info memory aspacemgr

This will report the size of each Valgrind arena, and (in the Valgrind process output file) give the details of the blocks mmap-ed by the Valgrind address space manager.

You might also try running with --profile-heap=yes: this gives a lot more detail about the contents of the Valgrind arenas, but it adds some overhead per allocated block.

Philippe |
|
From: Siddharth N. <sn...@dr...> - 2014-06-13 23:43:15
|
Philippe,

Thanks very much for that. I ran what you suggested on my single-threaded benchmark and got a line like the following just before the tool died:

-------- Arena "tool": 31444697088/31444697088 max/curr mmap'd, 0/0 unsplit/split sb unmmap'd, 19661485344/19661485344 max/curr on_loan --------
<Here there is a breakdown of the various mallocs in the code>

This is a single-threaded benchmark, which creates just one "tool" arena. This indicates that the arena's size is close to 29.3GB, while the actual usable bytes (on_loan) are 18.3GB. My calculation of malloc'ed bytes also yields 18.3GB, so that part is correct. Why is the arena size much larger? I currently do not free any memory, so is the data segment growing in discrete and huge chunks on certain malloc calls?

On 13 June 2014 13:30, Philippe Waroquiers <phi...@sk...> wrote:
> [...] |
|
From: Philippe W. <phi...@sk...> - 2014-06-14 09:10:09
|
On Fri, 2014-06-13 at 19:42 -0400, Siddharth Nilakantan wrote:
> Why is the arena size much larger? I currently do not free any memory,
> so is the data segment growing in discrete and huge chunks on certain
> malloc calls?

Effectively, an arena is managed by mmap-ing big blocks of memory (e.g. 4MB) which are split into smaller blocks according to what is VG_(malloc)-ed.

Such a big difference between the mmap-ed and on_loan sizes can be caused by two things:
* allocating a lot of very small blocks (each malloc-ed block has an overhead, so very small blocks proportionally cause a lot of overhead);
* or an 'unfriendly' pattern of malloc sizes. E.g. if the arena big-block size is 4MB and you allocate a lot of 2MB+1 "small blocks", that will cause a huge overhead.

The details given by --profile-heap=yes should show what is happening.

Philippe |
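A toy packing model (not Valgrind's actual allocator; the 4MB superblock size is from the discussion, and the 16-byte header is an assumed illustrative overhead) makes the second cause concrete: with 4MB superblocks, any payload just over half a superblock leaves nearly half of every superblock unusable.

```python
import math

SUPERBLOCK = 4 * 1024 * 1024   # assumed superblock size (4 MB)
HEADER = 16                    # assumed per-block bookkeeping overhead (illustrative)

def mmapped_bytes(payload, nblocks):
    """Total bytes mmap'd for `nblocks` allocations of `payload` bytes each,
    under a simple model: blocks never span superblocks, and a request
    bigger than a superblock gets its own superblock of exactly that size."""
    block = payload + HEADER
    if block > SUPERBLOCK:
        return nblocks * block
    per_sb = SUPERBLOCK // block          # blocks that fit in one superblock
    return math.ceil(nblocks / per_sb) * SUPERBLOCK
```

For 2MB+1 payloads, only one block fits per 4MB superblock, so almost half of every superblock is wasted; 2.5MB payloads waste 1.5MB per superblock in the same way.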
|
From: Siddharth N. <sn...@dr...> - 2014-06-17 02:27:54
|
Thanks Philippe. After some investigation, I believe exactly what you
mentioned is happening. My secondary maps in the shadow memory are sized at
2.5MB in the particular example we were discussing. This results in 4MB
blocks being allocated by the VG_arena_malloc function. No matter what the
payload size, is it always going to allocate in 4MB chunks? Does
pszB_to_bszB always return 4MB? I assume not, because the numbers from
profile-heap don't add up otherwise. Below is an example just before the
tool died:
--26681-- core : 1048576/ 1048576 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 50688/ 33904 max/curr, 6132/ 49980720
totalloc-blocks/bytes, 6131 searches
--26681-- tool : 31516000256/31516000256 max/curr mmap'd, 0/0
unsplit/split sb unmmap'd, 19706049824/19706049824 max/curr,
20929/19706065264 totalloc-blocks/bytes, 28217271 searches
--26681-- dinfo : 15945728/ 9846784 max/curr mmap'd, 3/2 unsplit/split
sb unmmap'd, 15245712/ 9109584 max/curr, 23407/ 30832000
totalloc-blocks/bytes, 23419 searches
--26681-- (null) : 0/ 0 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 0/ 0 max/curr, 0/ 0
totalloc-blocks/bytes, 0 searches
--26681-- demangle: 65536/ 0 max/curr mmap'd, 0/2 unsplit/split
sb unmmap'd, 176/ 0 max/curr, 168/ 4928
totalloc-blocks/bytes, 166 searches
--26681-- exectxt : 0/ 0 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 0/ 0 max/curr, 0/ 0
totalloc-blocks/bytes, 0 searches
--26681-- errors : 0/ 0 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 0/ 0 max/curr, 0/ 0
totalloc-blocks/bytes, 0 searches
--26681-- ttaux : 65536/ 65536 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 6160/ 6032 max/curr, 217/ 11488
totalloc-blocks/bytes, 216 searches
-------- Arena "core": 1048576/1048576 max/curr mmap'd, 0/0 unsplit/split
sb unmmap'd, 50688/33904 max/curr on_loan --------
32 in 1: stacks.rs.1
33,872 in 11: gdbsrv
-------- Arena "tool": 31516000256/31516000256 max/curr mmap'd, 0/0
unsplit/split sb unmmap'd, 19706049824/19706049824 max/curr on_loan
--------
16 in 1: sched_lock
64 in 1: cl.threads.nes.1
80 in 1: cl.init_funcarray.sys.3
80 in 1: cl.clo.nf.1
96 in 1: cl.dump.init_dumps.1
96 in 2: hashtable.Hc.1
112 in 1: options.efn.1
128 in 1: cl.dump.init_dumps.2
144 in 5: cl.events.group.1
160 in 2: commandline.sua.3
160 in 1: cl.events.geMapping.1
208 in 1: cl.init_funcarray.sys.4
208 in 2: commandline.sua.2
208 in 1: cl.threads.nt.1
256 in 1: initimg-linux.sce.5
400 in 7: cl.fn.non.2
448 in 7: cl.events.eventset.1
592 in 2: cl.clo.nc.1
640 in 13: cl.fn.nfn.2
3,840 in 8: cl.fn.non.1
4,000 in 1: cl.context.ifs.1
5,728 in 223: cl.fn.nfnnd.2
9,568 in 13: cl.fn.nfn.1
10,368 in 162: cl.funcinfo.gc.1
12,320 in 2: hashtable.Hc.2
17,840 in 223: cl.fn.nfnnd.1
20,304 in 1: cl.context.ict.1
28,000 in 1: cl.callstack.ics.1
29,824 in 466: cl.jumps.nj.1
35,504 in 1: cl.jumps.ijh.1
40,288 in 1: cl.fn.gfe.1
58,704 in 3,667: cl.bbcc.nr.1
67,504 in 1: cl.bb.ibh.1
80,080 in 1: cl.drwinit_thread.gc.1
80,080 in 1: cl.init_funcarray.sys.2
83,504 in 1: cl.bbcc.ibh.1
240,000 in 1: cl.init_funcarray.sys.1
259,200 in 162: cl.funccontext.gc.1
467,376 in 2,053: cl.bb.nb.1
475,072 in 571: cl.context.nc.1
521,712 in 3,752: cl.bbcc.nb.1
1,581,056 in 3: cl.sim.cs_ic.1
1,600,064 in 2: cl.costs.gc.1
2,621,440 in 1: cl.init_funcarray.sm.1
5,435,072 in 2,016: cl.funcinst.gc.1
19,692,257,280 in 7,512: cl.copy_for_writing.sm.1 -----> Secondary Map
allocations
As we can see, for the tool arena the "totalloc-blocks" count is 20929, with
total bytes at about 18.3GB. 20929 * 4MB is a very large number, much greater
than even the 29GB reported as the size of the entire "tool" arena.
Looking at the details for the "tool" arena and counting the total number
of allocations, I see 20897, almost the same as the 20929 shown at the top.
The allocations are dominated by my Secondary Map. I'm guessing that fixing
the size of the Secondary Maps to something that 'aligns' well with 4MB
(2.5MB does pretty terribly) will reduce the mismatch between VM space and
counted allocated bytes. Do you concur?
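As a sanity check on the figures in the profile-heap dump (7,512 secondary maps of 2.5MB each, one per 4MB superblock since a 2.5MB block cannot share a 4MB superblock), a quick back-of-the-envelope calculation reproduces both the reported allocation total and most of the mmap'd vs on_loan gap:

```python
SECMAP = int(2.5 * 2**20)      # secondary-map payload: 2.5 MB
SUPERBLOCK = 4 * 2**20         # arena superblock: 4 MB (one sec-map per superblock)
NMAPS = 7512                   # sec-map count from the profile-heap dump above

secmap_bytes = NMAPS * SECMAP          # bytes the tool counts as allocated
secmap_mmapped = NMAPS * SUPERBLOCK    # bytes the arena actually mmap'd for them
wasted_gib = (secmap_mmapped - secmap_bytes) / 2**30

print(secmap_bytes)   # matches the 19,692,257,280 of cl.copy_for_writing.sm.1
print(wasted_gib)     # ~11 GiB, most of the 29.3G mmap'd vs 18.3G on_loan gap
```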
On 14 June 2014 05:10, Philippe Waroquiers <phi...@sk...> wrote:
> [...]
|
|
From: Philippe W. <phi...@sk...> - 2014-06-17 04:51:14
|
On Mon, 2014-06-16 at 22:27 -0400, Siddharth Nilakantan wrote:
> No matter what the payload size, is it always going to alloc at 4MB
> chunks?
It will always mmap superblocks of 4MB or more (more in case a malloc > 4MB is requested).

> Does the pszB_to_bszB always return 4MB?
IIRC, pszB_to_bszB is for calculating the usable size of a malloc-ed block; it is not used for an mmap-ed superblock.

> I assume not, because the numbers from profile-heap don't add up
> otherwise. Below is an example just before the tool died:
I think you are confusing malloc-ed blocks with mmap-ed superblocks.

> --26681-- tool : 31516000256/31516000256 max/curr mmap'd, 0/0
> unsplit/split sb unmmap'd, 19706049824/19706049824 max/curr,
> 20929/19706065264 totalloc-blocks/bytes, 28217271 searches
totalloc is the sum of all blocks that have ever been allocated; some of these blocks have since been freed (very few in your case).

> As we can see, for the tool the "totalloc-blocks" are 20929, with its
> bytes at about 18.3G.
totalloc-blocks are malloc-ed blocks, not mmap-ed superblocks.

> I'm guessing that fixing the size of the Secondary Maps to something
> that 'aligns' well with 4MB (2.5MB does pretty terribly) will reduce
> the mismatch between VMSpace and counted bytes allocated. Do you
> concur?
Yes, these 2.5MB blocks are very probably the problem. So either you choose a better size, or, rather than using VG_(malloc), you use mmap. See pub_tool_aspacemgr.h for how to get memory with mmap; for example you could call VG_(am_shadow_alloc).

Philippe |
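Under the same simple packing model as before (assumed 4MB superblocks with a small per-block header; not Valgrind's exact allocator), one can compare candidate secondary-map sizes. Note that exactly 2MB is still a poor choice in this model, because the header pushes the block past the half-superblock mark; a size slightly under 2MB packs two per superblock:

```python
SUPERBLOCK = 4 * 2**20
HEADER = 16                    # assumed per-block overhead, not Valgrind's exact figure

def utilization(payload):
    """Fraction of each superblock usable as payload, assuming a block
    never spans superblocks (simple model, not Valgrind's allocator)."""
    per_sb = SUPERBLOCK // (payload + HEADER)
    return per_sb * payload / SUPERBLOCK

print(utilization(int(2.5 * 2**20)))   # 0.625: 1.5 MB of every superblock wasted
print(utilization(2 * 2**20))          # 0.5:   2 MB + header no longer fits twice
print(utilization(2 * 2**20 - 64))     # ~1.0:  just under 2 MB packs two per superblock
```

Using VG_(am_shadow_alloc) sidesteps the issue entirely, since the mapping is then page-granular rather than carved out of arena superblocks.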
|
From: Siddharth N. <sn...@dr...> - 2014-06-17 05:57:26
|
On 17 June 2014 00:51, Philippe Waroquiers <phi...@sk...> wrote:
> I think you are confusing malloc-ed blocks with mmap-ed superblocks.

Ah yes, sorry, I got confused when reading the code. Every malloc does not necessarily lead to a new superblock; in my case, only the requests for a Secondary Map entry necessarily resulted in a new superblock. A quick calculation, 7512 sec map entries * (4MB - 2.5MB) = 11GB, is very close to the difference I see between VM space usage and allocated bytes (29GB - 18GB).

> So, either you choose a better size, or alternatively rather than using
> VG_(malloc), you use mmap. See pub_tool_aspacemgr.h about how to get
> memory with mmap, for example you could call VG_(am_shadow_alloc)

Thanks very much. Any special considerations when using Valgrind's mmap? I'll try a few things and report back if this solves the problem. |