|
From: Patrick A. <al...@co...> - 2011-06-24 11:59:41
|
I'm having a very difficult problem getting valgrind to execute a
program which allocates large amounts of memory (> 20 Gb).
When the program runs by itself, all the malloc calls are successful.
However when I run it with valgrind's memcheck or massif tools (v
3.6.0), a malloc call fails (which is trying to allocate around 6.4 Gb).
I am running the program with:
> valgrind --main-stacksize=100000000 --max-stackframe=100000000 ./eef_main
Here is the relevant output:
==808== Warning: set address range perms: large range [0x5b5c5040, 0x3564cd040) (undefined)
==808== Warning: set address range perms: large range [0x40d985040, 0x70888d040) (undefined)
pde_alloc: sparse matrix allocs failed: Success: nzval = 0x40d985040, nzcol = (nil), rowptr = 0x58e9a6e0
==808== Warning: set address range perms: large range [0x5b5c5030, 0x3564cd050) (noaccess)
==808== Warning: set address range perms: large range [0x40d985030, 0x70888d050) (noaccess)
The corresponding code snippet is:
    w->rowptr = malloc((PDE_MAT_SIZE2 + 1) * sizeof(int));
    w->nzval = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(double));
    w->nzcol = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(int));
    if (w->nzval == 0 || w->nzcol == 0 || w->rowptr == 0)
    {
        fprintf(stderr, "pde_alloc: sparse matrix allocs failed: %s: "
                "nzval = %p, nzcol = %p, rowptr = %p\n", strerror(errno),
                w->nzval, w->nzcol, w->rowptr);
        pde_free(w);
        return 0;
    }
In this code, PDE_MAT_SIZE1 = PDE_MAT_SIZE2 = 40000. Therefore the
'nzval' call is allocating 12.8Gb and the 'nzcol' is allocating 6.4 Gb.
(sizeof(double) = 8, sizeof(int) = 4)
The machine has 50Gb of ram and both of these calls are successful
without using valgrind.
Furthermore, it appears that the valgrind malloc function does not set
errno to the appropriate value, since the output says "Success" even
when the pointer is null.
Any assistance is appreciated!
Patrick
|
|
From: John R. <jr...@bi...> - 2011-06-24 14:24:29
|
> When the program runs by itself, all the malloc calls are successful.
> However when I run it with valgrind's memcheck or massif tools (v
> 3.6.0), a malloc call fails (which is trying to allocate around 6.4 Gb).
Which Linux distribution, which Linux kernel ("uname -a"), and which
C runtime library ("ls -l /lib*/libc.so*") are you running?
There are various policies (such as automatic huge pages in some kernels)
and various algorithms (such as malloc implementations in glibc)
which might matter.
> ==808== Warning: set address range perms: large range [0x5b5c5040,
> 0x3564cd040) (undefined)
> ==808== Warning: set address range perms: large range [0x40d985040,
> 0x70888d040) (undefined)
Such warnings from valgrind are expected: the sizes are rather large,
and sometimes such large sizes are clues to errors. Note that there
are two of them, corresponding to the two malloc() which did succeed.
The address intervals are the ranges returned by malloc(). Immediately
after a successful malloc(), the contents of the region are uninitialized
("undefined").
> pde_alloc: sparse matrix allocs failed: Success: nzval = 0x40d985040,
> nzcol = (nil), rowptr = 0x58e9a6e0
That line above is your error message. It would be MUCH better
if you listed the values in the same order as the calls to malloc().
> ==808== Warning: set address range perms: large range [0x5b5c5030,
> 0x3564cd050) (noaccess)
> ==808== Warning: set address range perms: large range [0x40d985030,
> 0x70888d050) (noaccess)
Those two large ranges must have happened _after_ the snippet below
(your error message precedes them.)
The difference between "undefined" and "noaccess" is one clue.
However, notice that each "noaccess" range overlaps the corresponding
"undefined" range by 16 bytes on both ends. Hmmm....
>
> The corresponding code snippet is:
>
> w->rowptr = malloc((PDE_MAT_SIZE2 + 1) * sizeof(int));
> w->nzval = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(double));
> w->nzcol = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(int));
> if (w->nzval == 0 || w->nzcol == 0 || w->rowptr == 0)
> {
> fprintf(stderr, "pde_alloc: sparse matrix allocs failed: %s:
> nzval = %p, nzcol = %p, rowptr = %p\n", strerror(errno), w->nzval,
> w->nzcol, w->rowptr);
> pde_free(w);
> return 0;
> }
>
> In this code, PDE_MAT_SIZE1 = PDE_MAT_SIZE2 = 40000. Therefore the
> 'nzval' call is allocating 12.8Gb and the 'nzcol' is allocating 6.4 Gb.
> (sizeof(double) = 8, sizeof(int) = 4)
>
> The machine has 50Gb of ram and both of these calls are successful
> without using valgrind.
How much paging space ("swapon -s") does the machine have?
> Furthermore, it appears that the valgrind malloc function does not set
> errno to the appropriate value, since the output says "Success" even
> when the pointer is null.
You didn't check immediately after each malloc(). This makes it harder
to figure out whether errno should be valid. If the last call to malloc()
[or anything else which _might_ set errno] did succeed, then the value of
errno might not correspond to the last failure of malloc().
--
|
|
From: Patrick A. <al...@co...> - 2011-06-24 16:02:48
|
On 06/24/2011 04:25 PM, John Reiser wrote:
>> When the program runs by itself, all the malloc calls are successful.
>> However when I run it with valgrind's memcheck or massif tools (v
>> 3.6.0), a malloc call fails (which is trying to allocate around 6.4 Gb).
> Which Linux distribution, which Linux kernel ("uname -a"), and which
> C runtime library ("ls -l /lib*/libc.so*") are you running?
> There are various policies (such as automatic huge pages in some kernels)
> and various algorithms (such as malloc implementations in glibc)
> which might matter.
>
Ubuntu Lucid:
Linux maya 2.6.32-31-generic #61-Ubuntu SMP Fri Apr 8 18:25:51 UTC 2011 x86_64 GNU/Linux
libc:
lrwxrwxrwx 1 root root 14 2011-02-24 18:32 /lib/libc.so.6 -> libc-2.11.1.so
swapon -s:
Filename    Type        Size      Used  Priority
/dev/sda3   partition   95999992  0     -1
I tried a similar test case like yours and it works ok. Unfortunately my
program is extremely large and complicated. In fact I get other errors
like "conditional jump depends on uninitialized value" but these errors
exist in another part of the code entirely.
I modified the code to check after each malloc call. The valgrind output
looks like this now:
==15101== Warning: set address range perms: large range [0x5b806040, 0x35670e040) (undefined)
==15101== Warning: set address range perms: large range [0x40d709040, 0x708611040) (undefined)
nzval = 0x40d709040
pde_alloc: sparse matrix nzcol alloc failed: Success
==15101== Warning: set address range perms: large range [0x5b806030, 0x35670e050) (noaccess)
==15101== Warning: set address range perms: large range [0x40d709030, 0x708611050) (noaccess)
with code snippet:
    w->nzval = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(double));
    if (!w->nzval)
    {
        fprintf(stderr, "pde_alloc: sparse matrix nzval alloc failed: %s\n",
                strerror(errno));
        pde_free(w);
        return 0;
    }
    fprintf(stderr, "nzval = %p\n", w->nzval);
    w->nzcol = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(int));
    if (!w->nzcol)
    {
        fprintf(stderr, "pde_alloc: sparse matrix nzcol alloc failed: %s\n",
                strerror(errno));
        pde_free(w);
        return 0;
    }
|
|
From: John R. <jr...@bi...> - 2011-06-24 15:25:06
|
It's always a good idea to try a small test case that has "the same" behavior.
-----
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* strerror() */
#include <errno.h>
#define PDE_MAT_SIZE1 40000
#define PDE_MAT_SIZE2 40000
int main()
{
    /* Print the pointer and errno right after each malloc(). */
    int *rowptr = malloc((PDE_MAT_SIZE2 + 1) * sizeof(int));
    printf("rowptr=%p errno=%d\n", rowptr, errno);
    double *nzval = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(double));
    printf("nzval=%p errno=%d\n", nzval, errno);
    int *nzcol = malloc(PDE_MAT_SIZE1 * PDE_MAT_SIZE2 * sizeof(int));
    printf("nzcol=%p errno=%d\n", nzcol, errno);
    if (nzval == 0 || nzcol == 0 || rowptr == 0) {
        fprintf(stderr, "pde_alloc: sparse matrix allocs failed: %s: "
                "nzval = %p, nzcol = %p, rowptr = %p\n", strerror(errno), nzval,
                nzcol, rowptr);
    }
    return 0;
}
-----
It works for me using each of valgrind-3.5.0, valgrind-3.6.0, and today's
valgrind-3.7.0-SVN running under Fedora 14 on x86_64 with 4GB RAM.
$ uname -a
Linux myhost 2.6.35.13-92.fc14.x86_64 #1 SMP Sat May 21 17:26:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
$ ls -l /lib*/libc.so*
lrwxrwxrwx. 1 root root 12 Feb 8 19:43 /lib64/libc.so.6 -> libc-2.13.so
$ swapon -s
Filename    Type        Size      Used  Priority
/dev/sda5   partition   8056560   0     -1
/dev/sdb2   partition   104416    0     -2
/dev/sda2   partition   70655996  0     -3
$ /usr/local/valgrind-3.6.0/bin/valgrind ./a.out
==10622== Memcheck, a memory error detector
==10622== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==10622== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==10622== Command: ./a.out
==10622==
<< snip two complaints about 'index' from 'expand_dynamic_string_token' >>
rowptr=0x4c27040 errno=0 ### This allocation was small (160KB).
==10622== Warning: set address range perms: large range [0x393d1040, 0x3342d9040) (undefined)
nzval=0x393d1040 errno=0
==10622== Warning: set address range perms: large range [0x405a4f040, 0x5831d3040) (undefined)
nzcol=0x405a4f040 errno=0
==10622==
==10622== HEAP SUMMARY:
==10622== in use at exit: 19,200,160,004 bytes in 3 blocks
==10622== total heap usage: 3 allocs, 0 frees, 19,200,160,004 bytes allocated
==10622==
==10622== LEAK SUMMARY:
==10622== definitely lost: 160,004 bytes in 1 blocks
==10622== indirectly lost: 0 bytes in 0 blocks
==10622== possibly lost: 19,200,000,000 bytes in 2 blocks
==10622== still reachable: 0 bytes in 0 blocks
==10622== suppressed: 0 bytes in 0 blocks
==10622== Rerun with --leak-check=full to see details of leaked memory
==10622==
==10622== For counts of detected and suppressed errors, rerun with: -v
==10622== Use --track-origins=yes to see where uninitialised values come from
==10622== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)
$
-----
--
|
|
From: John R. <jr...@bi...> - 2011-06-24 16:22:36
|
> I tried a similar test case like yours and it works ok. Unfortunately my
> program is extremely large and complicated. In fact I get other errors
> like "conditional jump depends on uninitialized value" but these errors
> exist in another part of the code entirely.
If the other errors include any illegal write to memory (outside the
bounds of an allocated block) then such an error could scramble the data
on which malloc depends.
Now it is time to try strace:
    strace -e trace=mmap,mmap2,munmap,brk,mremap valgrind \
        valgrind_args... ./my_app app_args...
which gives an independent report of all the system calls which perform
address-space manipulation, and includes the decoded system errno.
--
|
|
From: WAROQUIERS P. <phi...@eu...> - 2011-06-27 10:31:20
|
>> I tried a similar test case like yours and it works ok. Unfortunately my
>> program is extremely large and complicated. In fact I get other errors
>> like "conditional jump depends on uninitialized value" but these errors
>> exist in another part of the code entirely.
>
> If the other errors include any illegal write to memory (outside the
> bounds of an allocated block) then such an error could scramble the data
> on which malloc depends.
>
> Now it is time to try strace:
>     strace -e trace=mmap,mmap2,munmap,brk,mremap valgrind \
>         valgrind_args... ./my_app app_args...
> which gives an independent report of all the system calls which perform
> address-space manipulation, and includes the decoded system errno.
In case malloc calls are failing under Valgrind, it is also often useful
to run with --profile-heap=yes, to see where the memory is allocated,
and if there is some fragmentation.
If that shows fragmentation, then the patch in bug 250101 might help.
Philippe
|