|
From: janjust <tja...@un...> - 2014-03-31 14:23:30
|
Hi, I'm trying to run a cray-mpich MPI application under Valgrind; however, I'm getting an error message from mpich saying that it cannot unmap/remap huge pages. Has anyone encountered this problem before? Any workarounds? Thanks!

--
View this message in context: http://valgrind.10908.n7.nabble.com/mpich-unable-to-munmap-hugepages-tp49150.html
Sent from the Valgrind - Users mailing list archive at Nabble.com. |
|
From: Philippe W. <phi...@sk...> - 2014-03-31 18:24:32
|
On Mon, 2014-03-31 at 07:23 -0700, janjust wrote:
> Hi,

Hello,

> I'm trying to run a cray-mpich mpi application under valgrind; however,
> I'm getting an error message from mpich that it cannot unmap/remap huge
> pages.

What error msg do you get?

> Did anyone encounter this problem before? Any workarounds?

Valgrind significantly increases memory usage, and manages the address space itself (and differently from a native run).

Which version of Valgrind are you using, on which OS? 32 or 64 bits?

I suggest starting valgrind with various tracing options to see what happens, e.g.

  --trace-syscalls=yes -v -v -v -d -d -d

and observing the traces around the mpich error msg.

Philippe |
|
From: janjust <tja...@un...> - 2014-03-31 20:01:18
|
Thanks for replying! The entire program output is at the bottom.

The error is:

  Unable to mmap hugepage 4194304 bytes
  Unable to mmap hugepage 4194304 bytes

This is surely mpich-specific, as Open MPI works just fine. The application is very simple: just a hello-world example with MPI_Init, send/receive, gather, barrier.

My valgrind version is from trunk; however, this happens with the 3.9 release too, which is the latest stable, I'm guessing. The OS is Cray's Compute Node Linux, 64-bit. uname -a gives:

  Linux xxxx-ext1 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64 x86_64 x86_64 GNU/Linux

Btw, I've run pretty large scientific codes under valgrind without a problem; the major issues are typically non-handled instructions, which can sometimes be avoided with compiler flags. Memory was never an issue.

=============
janjust@login8:~/janjust_proj/tmp$ aprun -n 2 -N 1 ../valgrind-trunk-build/bin/valgrind --tool=none ./a.out
==16936== Nulgrind, the minimal Valgrind tool
==20916== Nulgrind, the minimal Valgrind tool
==16936== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote.
==20916== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote.
==16936== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==16936== Command: ./a.out
==20916== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==20916== Command: ./a.out
==16936==
==20916==
Unable to mmap hugepage 4194304 bytes
Unable to mmap hugepage 4194304 bytes
For file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.16937.kvs_4760754 err Invalid argument
For file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.20917.kvs_4760754 err Invalid argument
Rank 1 [Mon Mar 31 15:45:07 2014] [c0-0c0s1n0] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(449).............:
MPID_Init(234)....................: channel initialization failed
MPIDI_CH3_Init(83)................:
MPID_nem_init(325)................:
MPID_nem_gni_init(1695)...........:
MPID_nem_gni_dma_buffers_init(769): Out of memory
Rank 0 [Mon Mar 31 15:45:07 2014] [c0-0c0s1n3] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(449).............:
MPID_Init(234)....................: channel initialization failed
MPIDI_CH3_Init(83)................:
MPID_nem_init(325)................:
MPID_nem_gni_init(1695)...........:
MPID_nem_gni_dma_buffers_init(769): Out of memory
==16937==
==20917==
_pmiu_daemon(SIGCHLD): [NID 00093] [c0-0c0s1n3] [Mon Mar 31 15:45:07 2014] PE RANK 0 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00002] [c0-0c0s1n0] [Mon Mar 31 15:45:07 2014] PE RANK 1 exit signal Killed
==16936==
==20916==
[NID 00093] 2014-03-31 15:45:07 Apid 4760754: initiated application termination
Application 4760754 exit codes: 137
Application 4760754 resources: utime ~0s, stime ~0s, Rss ~28836, inblocks ~10526, outblocks ~54958
============== |
|
From: <fd0...@sk...> - 2014-03-31 21:28:16
|
> On 31 March 2014 at 22:01, janjust <tja...@un...> wrote:
>
> Thanks for replying!
> The entire program output is at the bottom.
>
> The error is:
> Unable to mmap hugepage 4194304 bytes
> Unable to mmap hugepage 4194304 bytes

Can you run with --trace-syscalls=yes -v -v -v -d -d -d to get more info about what is happening?

Philippe |
|
From: janjust <tja...@un...> - 2014-03-31 21:43:22
|
(hm, my direct reply seems to be getting rejected)

Yes. The output is rather large, so I attached 3 files that were the result of running it with 2 procs: 1 for stdout, and the other two are from --log-file=valgrind.%p

-Tommy

val.out <http://valgrind.10908.n7.nabble.com/file/n49155/val.out>
valgrind.26200 <http://valgrind.10908.n7.nabble.com/file/n49155/valgrind.26200>
valgrind.26269 <http://valgrind.10908.n7.nabble.com/file/n49155/valgrind.26269> |
|
From: Philippe W. <phi...@sk...> - 2014-04-01 18:42:00
|
On Mon, 2014-03-31 at 14:43 -0700, janjust wrote:
> (hm my direct reply seems to be getting rejected)
>
> Yes,
> The output is rather large so I attached 3 files that were the result of
> running it with 2 procs. 1 for stdout and the other two are from
> --log-file=valgrind.%p
> -Tommy
>
> val.out <http://valgrind.10908.n7.nabble.com/file/n49155/val.out>
> valgrind.26200 <http://valgrind.10908.n7.nabble.com/file/n49155/valgrind.26200>
> valgrind.26269 <http://valgrind.10908.n7.nabble.com/file/n49155/valgrind.26269>

Looking at the output, this seems to be the relevant trace:

SYSCALL[26201,1]( 2) sys_open ( 0x5ec9cc(/proc/mounts), 0 ) --> [async] ...
SYSCALL[26201,1]( 2) ... [async] --> Success(0x0:0x11)
SYSCALL[26201,1]( 5) sys_newfstat ( 17, 0xffebf76e0 )[sync] --> Success(0x0:0x0)
SYSCALL[26201,1]( 9) sys_mmap ( 0x0, 4096, 3, 34, -1, 0 ) --> [pre-success] Success(0x0:0x4e71000)
SYSCALL[26201,1]( 0) sys_read ( 17, 0x4e71000, 1024 ) --> [async] ...
SYSCALL[26201,1]( 0) ... [async] --> Success(0x0:0x400)
SYSCALL[26201,1](137) sys_statfs ( 0xffebfaa85(/var/lib/hugetlbfs/global/pagesize-2097152), 0xffebf9a80 )[sync] --> Success(0x0:0x0)
SYSCALL[26201,1]( 3) sys_close ( 17 )[sync] --> Success(0x0:0x0)
SYSCALL[26201,1]( 11) sys_munmap ( 0x4e71000, 4096 )[sync] --> Success(0x0:0x0)
SYSCALL[26201,1]( 2) sys_open ( 0xffebf8a80(/var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352), 66, 493 ) --> [async] ...
SYSCALL[26201,1]( 2) ... [async] --> Success(0x0:0x11)
SYSCALL[26201,1]( 87) sys_unlink ( 0xffebf8a80(/var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352) ) --> [async] ...
SYSCALL[26201,1]( 87) ... [async] --> Success(0x0:0x0)
SYSCALL[26201,1]( 9) sys_mmap ( 0x0, 4194304, 3, 1, 17, 0 ) --> [pre-fail] Failure(0x16)

I then tried to reproduce the problem with the small program below, doing exactly the same syscalls with the same parameters, except the fd argument to mmap, which must be the result of the open. It works on my system. You could try it (natively and under valgrind) and see if it fails or not.

If it does not fail, then you should replace 4m.txt with a path name similar to the one above (assuming the path /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352 is on a "strangely mounted" filesystem, cf. the open of /proc/mounts just above).

If the program (and the modified program) succeeds but the mpich run fails, then I guess you will be obliged to debug the valgrind code itself, to see what exactly makes the syscall fail with EINVAL: is it the valgrind checks, or is it the real syscall failing? Rather than debugging Valgrind, you might first try to find out whether the syscall itself fails, by running valgrind under strace, e.g.:

  strace -f valgrind mmap_huge

Philippe

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    char *m;

    /* 66 = O_RDWR|O_CREAT, 493 = 0755, matching the traced open */
    fd = open("4m.txt", 66, 493);
    printf("open result : %d\n", fd);
    unlink("4m.txt");
    /* 3 = PROT_READ|PROT_WRITE, 1 = MAP_SHARED, matching the traced mmap */
    m = (char *) mmap(0x0, 4194304, 3, 1, fd, 0);
    printf("mmap result %p\n", m);
    return 0;
} |
|
From: janjust <tja...@un...> - 2014-04-02 14:17:34
|
ah ok, so this is a virtual filesystem or something; I'm still unsure what is going on. The example you provided succeeds for me as well, but... this file:

> /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352

doesn't exist, or something else is going on…

$ ls /var/lib doesn't show hugetlbfs from my launch node, but it shows up if I run the same ls by launching the job with aprun. It evaluates to:

janjust@titan-batch7:~/janjust_proj/tmp$ aprun -n 1 -N 1 ls -lah /var/lib/hugetlbfs/global/
Couldn't parse executable
total 0
drwxr-xr-x 8 root root 160 Apr  1 13:45 .
drwxr-xr-x 3 root root  60 Apr  1 13:45 ..
drwxrwxrwt 2 root root   0 Apr  1 13:45 pagesize-131072
drwxrwxrwt 2 root root   0 Apr  1 13:45 pagesize-16777216
drwxrwxrwt 2 root root   0 Apr  2 09:52 pagesize-2097152
drwxrwxrwt 2 root root   0 Apr  1 13:45 pagesize-524288
drwxrwxrwt 2 root root   0 Apr  1 13:45 pagesize-67108864
drwxrwxrwt 2 root root   0 Apr  1 13:45 pagesize-8388608

but then "pagesize-2097152" is empty, which could be why the call is failing… |
|
From: Philippe W. <phi...@sk...> - 2014-04-02 19:33:59
|
On Wed, 2014-04-02 at 07:17 -0700, janjust wrote:
> ah ok so this is a virtual filesystem or something, I'm still unsure what is
> going on.
> the example you provided succeeds for me as well, but...this file:
>
> /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.0.26201.kvs_4761352
>
> Doesn't exist or something else is going on…

Yes, this file is created when needed, and then unlinked. See the trace produced by valgrind --trace-syscalls=yes ...

> $ls /var/lib doesn't show hugetlbfs from my launch node, but it shows up if
> I run the $ls /var/lib by launching the job with $aprun

Not knowing what aprun is and what it does, I have no idea why it helps to "see" the mounted hugetlbfs file system.

> [ls -lah listing of /var/lib/hugetlbfs/global/ snipped]
>
> but then "pagesize-2097152" is empty, which could be why the call is
> failing…

I do not think the failure is linked to the 0 size. The small test program works with a 0-size 4m.txt (just remove the unlink from it, and you will see that 4m.txt is created if needed, and then mapped). Relaunching works even if 4m.txt has size 0.

What might be the problem is bad/wrong support of huge pages by Valgrind. I know very little about huge pages, but it looks like the pagesize-xxxxx directory name indicates that files there map huge pages of that size: you ask for a 4M mapping, but on pagesize-2097152 (2M pages).

Maybe you could update the small program to do exactly the same open and the same mmap, but with the absolute path name of a file in the hugetlbfs mount? Then:
- run it natively (I am assuming this should work, including with hugetlbfs)
- run it under strace
- run it under valgrind
- run it under strace -f valgrind

We might see a difference in the way the underlying mmap calls are done. (You might have to do all of that under aprun; maybe you can do "aprun bash" or something like that.)

Philippe |
|
From: janjust <tja...@un...> - 2014-04-03 14:21:25
|
Philippe,
Thanks a lot for helping with this. I ran the code as you suggested. (aprun is the job launch command for our cluster machines.)

Attached is a file with all the output, in order: code, native, strace native, valgrind, strace -f valgrind.

If you look at the strace -f valgrind output, at the bottom you'll see an mmap fail with an EINVAL return code.

hugepage_test.txt <http://valgrind.10908.n7.nabble.com/file/n49175/hugepage_test.txt> |
|
From: Philippe W. <phi...@sk...> - 2014-04-03 17:15:29
|
On Thu, 2014-04-03 at 07:21 -0700, janjust wrote:
> Philippe,
> Thanks a lot for helping with this.
>
> I ran the code as you suggested.
> aprun is a job submission system for our cluster machines.
>
> Attached is a file with all the output in order:
> code, native, strace native, valgrind, strace -f valgrind
>
> If you look at the strace -f valgrind output, at the bottom you'll see a
> mmap fail with EINVAL return code.

Ok, I think I have a hypothesis (the below is pure guesswork):

I guess that the file on the hugetlbfs is special (as it is on this specially mounted "huge page" file system). This file provides huge pages, which must (probably) respect some constraints, such as: it must be mapped at a multiple of a huge page (1M? 4M? or whatever), and/or it must be in a specific part of the address space, and/or ...

Valgrind's address space manager does not understand the notion of huge pages. What valgrind does is: it maintains a list of unused "address space zones". To do an mmap, valgrind decides at which address the mmap will be done, and then asks the kernel for a fixed mapping at this address. If this fixed mapping address is chosen in a way which is incompatible with the constraints for a huge page, the kernel makes the mmap call fail.

In the strace extracts below, you see that under valgrind, the mmap call is using a first argument different from NULL, and has added a MAP_FIXED argument:

strace native:
  mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x2aaaaac00000

strace valgrind:
  mmap(0x4801000, 4194304, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 3, 0) = -1 EINVAL (Invalid argument)

The logic for all this is in syswrap-generic.c around line 2146 (SVN trunk). What we might maybe do is have yet another refinement: when the mmap fails, try again, but without any MAP_FIXED arg, and afterwards verify that the address decided by the kernel is ok.

In other words, in the syscall handling for the client, the idea is to introduce a kludge similar to what is done in aspacemgr-linux.c around line 2460 and following. This looks not too difficult to do (and I could even test this on the gcc110 compile farm system, which has a hugetlbfs file system :).

How to confirm the hypothesis? I suggest you run the small test program natively several times. If the map is always at the same address, modify the small program to pass that address as the first argument (and maybe add MAP_FIXED). If afterwards the small program succeeds natively but fails under valgrind, the hypothesis looks somewhat confirmed.

If the above hypothesis looks correct and/or you can confirm it, then I suggest you file a bug in bugzilla (and do not hesitate to try to prepare the patch described above :).

Philippe |
|
From: janjust <tja...@un...> - 2014-04-03 22:33:09
|
Philippe,
This worked! Thank you so much for your help.

So your hypothesis is correct. The huge pages (at least on my system) have an alignment issue if MAP_FIXED is used, and no alignment issue if it's not used.

syswrap-generic.c seems to always use MAP_FIXED; is that a valgrind requirement?

At any rate, I filed a bug report and attached a potential patch (which worked for me), but I'm not sure if I did this correctly. All I did was add another "refinement" fallback that does a 3rd mmap() ignoring MAP_FIXED.

This could be done better, though, maybe by giving the aspacemanager a hint to give me a hugepage_size-aligned address if it fails the second time.

Here is the bug report; the patch should be attached (again, the patch is brutally simple):

https://bugs.kde.org/show_bug.cgi?id=333051

Also, I have another issue, but it seems to work for now… I get a warning:

==21390== Warning: noted but unhandled ioctl 0x7801 with no size/direction hints

Any ideas what that is? |
|
From: Philippe W. <phi...@sk...> - 2014-04-03 22:43:02
|
On Thu, 2014-04-03 at 15:33 -0700, janjust wrote:
> Philippe,
> This worked! Thank you so much for your help.
>
> So your hypothesis is correct. The huge_pages (at least on my system) have
> an alignment issue if MAP_FIXED is used, and no alignment issue if it's not
> used.
>
> The syswrap-generic.c seems to always use MAP_FIXED, is that a valgrind
> requirement?

That is the way Valgrind manages memory today. I think it could work without it (cf. your patch), or at least use MAP_FIXED only as the first, preferred way to do an mmap.

> At any rate, I filed a bug report, and attached a potential patch (which
> worked for me), but I'm not sure if I did this correctly. All I did is added
> another "refinement" fallback to do a 3rd mmap() ignoring the MAP_FIXED.
>
> https://bugs.kde.org/show_bug.cgi?id=333051

Ok, thanks for the feedback, the bug, and the patch. mmap and the aspacemgr are a touchy area, so for sure this patch will have to be looked at carefully. But a bug with a patch (and even better, with a test case) is more likely to attract attention :).

> Also I have another issue but it seems to work, for now…
>
> I get a warning with:
> ==21390== Warning: noted but unhandled ioctl 0x7801 with no size/direction
> hints
>
> Any ideas what that is?

Valgrind must have some little code so that it "understands" what a syscall is doing (e.g. what memory it is reading and writing). There is a wide variety of ioctls, and Valgrind does not understand them all. Such a not-understood ioctl causes messages like this (and could then cause false positives or false negatives, e.g. in memcheck). I think you will find more info in README_MISSING_SYSCALL_OR_IOCTL.

Philippe |