From: Tomasz P. <bp...@o2...> - 2005-07-25 13:56:16
Hello,

Two days ago I changed booting of my 4 nodes (CM5) from CD to PXE. Everything works fine, but some messages appear on the node consoles about 2 seconds after 'Node setup completed successfully.':

---8<----
bpslave-1: connect(127.0.0.1:2223): Connection refused
bpslave-1: Slave setup failed
bpslave  : Slave 1 exited
---8<----

This now appears even when I boot from CD, but the nodes seem to work correctly. Is there any reason to care about it? Is there any solution to fix it? Maybe my PXE configs are wrong?

dhcpd.conf:
----
ddns-update-style ad-hoc;
DHCPD_INTERFACE = "eth0";

option space PXE;
option PXE.mtftp-ip code 1 = ip-address;
option PXE.mtftp-cport code 2 = unsigned integer 16;
option PXE.mtftp-sport code 3 = unsigned integer 16;
option PXE.mtftp-tmout code 4 = unsigned integer 8;
option PXE.mtftp-delay code 5 = unsigned integer 8;
option PXE.discovery-control code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr code 7 = ip-address;

subnet 192.168.0.0 netmask 255.255.255.0 {
  class "pxeclients" {
    match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
    option vendor-class-identifier "PXEClient";
    vendor-option-space PXE;
    option PXE.mtftp-ip 0.0.0.0;
    filename "pxelinux.0";
    next-server 192.168.0.10;
  }
  class "etherboot" {
    if substring (option vendor-class-identifier, 0, 9) = "Etherboot" {
      filename "vmlinuz";
    }
  }
  pool {
    range 192.168.0.1 192.168.0.9;
    max-lease-time 86400;
    default-lease-time 86400;
    deny unknown clients;
  }
  host w1 {
    hardware ethernet 00:11:D8:C5:9E:C2;
    fixed-address 192.168.0.1;
    server-name "grande";
    option routers 192.168.0.10;
    option domain-name-servers 192.168.0.10;
    option host-name "w1";
    # option root-path "/diskless/192.168.1.21";
    if substring (option vendor-class-identifier, 0, 9) = "Etherboot" {
      filename "vmlinuz";
    } else if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {
      filename "pxelinux.0";
    }
  }
  host w2 {
    hardware ethernet 00:11:D8:C5:A0:BD;
    fixed-address 192.168.0.2;
    server-name "grande";
    option routers 192.168.0.10;
    option domain-name-servers 192.168.0.10;
    option host-name "w1";
    # option root-path "/diskless/192.168.1.21";
    if substring (option vendor-class-identifier, 0, 9) = "Etherboot" {
      filename "vmlinuz";
    } else if substring (option vendor-class-identifier, 0, 9) = "PXEClient" {
      filename "pxelinux.0";
    }
  }
  # >>>> etc.
}
----

default:
----
DEFAULT Clustermatic 5
LABEL Clustermatic 5
KERNEL amd64_2.kernel
APPEND initrd=amd64_2.initrd
----

Any suggestion is appreciated.
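The refused connection can be probed independently of the boot path. A minimal check along these lines (a sketch using only standard BSD sockets; port 2223 is taken from the error text above, not from any BProc header) reports whether anything is listening on the bpslave setup port when run on the node, e.g. via bpsh:

/* Minimal sketch: probe the port from the error message above.
 * Uses only standard sockets; 2223 comes from the
 * "connect(127.0.0.1:2223): Connection refused" line, not from
 * any BProc header. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(2223);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        perror("connect 127.0.0.1:2223");   /* same failure as bpslave-1 */
    else
        printf("something is listening on 127.0.0.1:2223\n");
    close(fd);
    return 0;
}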
From: Rene S. <rs...@tu...> - 2005-07-22 18:34:28
Hi,

> I've never seen a problem like that. Is it reproducible?

Not sure. I think it might have to do with the queuing system we are using, Platform LSF. We enabled preemption on the queues, and it seemed that right before the crash some of the preempted/suspended jobs did not get their nodes reassigned to them properly. For example, user joe, whose job was suspended on node 0, did not get node 0 reassigned to him: the job or process for user joe was running on node 0, but the node still belonged to root and not to user joe. Then the cluster crashed, and now things are working fine. I am watching the logs and so far there is no sign of trouble. Preemption seems to be working correctly now, and nodes get assigned/reassigned properly.

Thanks
Rene
From: Erik H. <eah...@gm...> - 2005-07-22 17:22:36
I've never seen a problem like that. Is it reproducible?

- Erik

On 7/20/05, Rene Salmon <rs...@tu...> wrote:
> Hi List,
>
> Our cluster crashed a bit ago and rebooted and now things are back and
> running again.
>
> I am looking through the syslogs and trying to figure out what happened
> or what caused the crash and I found these messages:
>
> Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
> Jul 20 20:35:48 last message repeated 3639 times
> Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
> Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
> Jul 20 20:35:48 last message repeated 3639 times
> Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
> Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
> Jul 20 20:35:48 last message repeated 3639 times
> Jul 20 20:35:48 kernel: proc: ghosproc: ghostproc: ghost: signal: signr == 0
> Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
>
> Anyone have any clues?
>
> Thanks
> Rene
From: Rene S. <rs...@tu...> - 2005-07-21 03:49:29
Hi List,

Our cluster crashed a bit ago and rebooted, and now things are back and running again.

I am looking through the syslogs and trying to figure out what happened or what caused the crash, and I found these messages:

Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0
Jul 20 20:35:48 last message repeated 3639 times
Jul 20 20:35:48 kernel: proc: ghosproc: ghostproc: ghost: signal: signr == 0
Jul 20 20:35:48 kernel: bproc: ghost: signal: signr == 0

Anyone have any clues?

Thanks
Rene
From: Sean <se...@la...> - 2005-07-18 16:33:08
Hi,

Try "bpsh # cat /proc/PID/statm". The mem info isn't forwarded to the frontend.

Sean

Jeff Brown wrote:
> folks,
>
> I need to track memory allocation on slave (compute) nodes in a bproc
> environment. I'm currently pulling data from /proc/<pid>/statm - rss.
> The data looks suspect. All pages are allocated essentially at startup
> and the number never grows. This is not consistent with the behavior of
> the program I'm tracking.
>
> Is there a better way to fetch current page allocation in a bproc
> environment? We are running bproc version 3.2.6 (returned from
> bproc_version).
>
> thanks!
>
> Jeff Brown
> Los Alamos National Laboratory
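For reference, /proc/<pid>/statm is a single line of whitespace-separated page counts, with total program size and resident set size as the first two fields. A minimal reader along those lines (a sketch assuming only that standard layout) would be run on the compute node itself, e.g. under bpsh, since as noted above the node's memory info is not forwarded to the front end:

/* Sketch of reading RSS from /proc/<pid>/statm. Assumes the
 * standard layout: the first two fields are total program size
 * and resident set size, both in pages. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64];
    long size, rss;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
    f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    if (fscanf(f, "%ld %ld", &size, &rss) != 2) { fclose(f); return 1; }
    fclose(f);

    printf("rss: %ld pages (%ld kB)\n",
           rss, rss * sysconf(_SC_PAGESIZE) / 1024);
    return 0;
}

Running it as, say, "bpsh 0 ./statm_rss <pid>" then reports the node-local numbers rather than the front end's.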
From: Jeff B. <je...@la...> - 2005-07-18 16:16:53
folks,

I need to track memory allocation on slave (compute) nodes in a bproc environment. I'm currently pulling data from /proc/<pid>/statm - rss. The data looks suspect. All pages are allocated essentially at startup and the number never grows. This is not consistent with the behavior of the program I'm tracking.

Is there a better way to fetch current page allocation in a bproc environment? We are running bproc version 3.2.6 (returned from bproc_version).

thanks!

Jeff Brown
Los Alamos National Laboratory

Jeffrey S. Brown
X-8, MS F645
(505) 665-4655 (office)
(505) 231-4755 (cell)
je...@la...
From: Joshua A. <lu...@ln...> - 2005-07-06 19:48:49
On Wed, 2005-07-06 at 09:05 -0700, Erik Hendriks wrote:
> > Is there a way to use myrinet's mapper to boot the nodes over gm?
>
> That should be possible (it used to be) but it will take a bit of
> work. I did it by making the mapper into a beoboot plugin. Basically
> you rename 'main', make it into a library and link it into the phase 1
> + phase 2 boot images. Then there's a tiny bit of glue logic to fork
> and call the modified main after the drivers are loaded and then kill
> it when phase 1 wants to shut down. Last time I did this I didn't
> have to change any of the mapper source. Renaming, etc. can be done
> with objcopy. I just put together a new makefile for it.

Done. If anyone wants a copy, send me an email. Pulled the mapper from GM 2.0.21.

Josh
From: Erik H. <eah...@gm...> - 2005-07-06 16:06:03
On 7/5/05, Jeff Rasmussen <jra...@ln...> wrote:
> I'm experiencing issues running gm_route with the Myrinet Clos256
> switches. The mapper runs but does not return any hosts other than the
> mapper itself. I can run Myrinet's mapper and gm_board_info will display
> all my nodes. I have also taken an older 16 port switch and plugged
> nodes into it to verify that these boot failures are not related to a
> bad software config, and everything worked fine.
>
> Is there any updated version of gm_route that will work with the new
> switches?

No. I doubt there will be either, since the new switches are supposed to fix the shortcomings that made all the monkey business in gm_route necessary in the first place.

> Is there a way to use myrinet's mapper to boot the nodes over gm?

That should be possible (it used to be) but it will take a bit of work. I did it by making the mapper into a beoboot plugin. Basically you rename 'main', make it into a library and link it into the phase 1 + phase 2 boot images. Then there's a tiny bit of glue logic to fork and call the modified main after the drivers are loaded and then kill it when phase 1 wants to shut down. Last time I did this I didn't have to change any of the mapper source. Renaming, etc. can be done with objcopy. I just put together a new makefile for it.

- Erik
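The glue logic described here might look roughly like the following. This is a sketch only: mapper_main stands for the entry point renamed with objcopy, and plugin_start/plugin_stop are hypothetical stand-ins for whatever hooks the beoboot phase 1 image actually calls after the drivers are loaded and before it shuts down.

/* Sketch of the beoboot plugin glue described above. All names are
 * hypothetical: mapper_main is the mapper's 'main' after renaming
 * (e.g. objcopy --redefine-sym main=mapper_main), and the start/stop
 * hooks stand in for the real plugin interface. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern int mapper_main(int argc, char **argv);  /* renamed via objcopy */

static pid_t mapper_pid = -1;

int plugin_start(void)
{
    static char arg0[] = "mapper";
    static char *argv[] = { arg0, NULL };

    mapper_pid = fork();
    if (mapper_pid < 0)
        return -1;
    if (mapper_pid == 0)                 /* child: run the mapper */
        _exit(mapper_main(1, argv));
    return 0;                            /* parent: boot continues */
}

void plugin_stop(void)
{
    if (mapper_pid > 0) {                /* phase 1 is shutting down */
        kill(mapper_pid, SIGTERM);
        waitpid(mapper_pid, NULL, 0);
        mapper_pid = -1;
    }
}

With the renaming done by objcopy in the makefile, the mapper source itself stays untouched, which matches what Erik reports.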
From: Jeff R. <jra...@ln...> - 2005-07-05 21:13:56
I'm experiencing issues running gm_route with the Myrinet Clos256 switches. The mapper runs but does not return any hosts other than the mapper itself. I can run Myrinet's mapper, and gm_board_info will display all my nodes. I have also taken an older 16 port switch and plugged nodes into it to verify that these boot failures are not related to a bad software config, and everything worked fine.

Is there any updated version of gm_route that will work with the new switches? Or is there a way to use Myrinet's mapper to boot the nodes over gm?

Thanks,

--
Jeff Rasmussen
From: Erik H. <eah...@gm...> - 2005-06-27 22:53:51
On 6/23/05, Julian Seward <ju...@va...> wrote:
> > Although I agree that linking valgrind against bproc would be nasty, I
> > still like the idea of stopping valgrind at a convenient moment
> > though. [...]
>
> I've been playing with doing bproc_move at the "right" (least-worst)
> time from within Valgrind. With a bit of fd-restoring ugliness and
> preloading various .so's onto the slaves, I can get V to migrate off
> the master and stay alive. At least -- it works when V is running
> in no-instrumentation mode (--tool=none).
>
> When you try to migrate running a useful tool (memcheck) the
> migration call instantly fails. I am doing
>
>     ret = VG_(do_syscall3)(__NR_bproc, BPROC_SYS_MOVE, node, (UWord)&req);

Why not just try dumping to a file on the front end for starters (bproc_dump)? You can treat the dump like an executable and just run it on the front end. That'd probably be a good test. Also, you could use bpsh to run it remotely and avoid the pain of dealing with I/O forwarding too.

> (having copied relevant stuff from clients/bproc.c and {sys,kernel}/bproc.h)
>
> So my question. For historical reasons Memcheck allocates all required
> "shadow" address space at the start, about 1.5G, with a huge mmap
> of /dev/zero. It then mprotects bits of this incrementally to bring
> it into use as needed. Most of the mapping is never used.
>
> This only works because by default the kernel does VM overcommitting.
> On some systems (Red Hat 8) this scheme fails if there isn't 1.5G of
> swap to back it.
>
> So I was wondering how your cm46 slave kernels will behave. At the
> point vmadump gets its hands on the process image this huge mapping
> will have been established. My test slave has 128M of memory and
> obviously no swap. Should migration succeed under these circumstances?

I'm guessing no... because I doubt the kernel will be willing to overcommit that much memory. I'm not familiar with the overcommit policies so I can't say conclusively.

That said, vmadump (the migrator piece) is smart enough not to send zeroed pages. Since it *is* a file mapping, it will probably try to read all those pages on the front end to make sure that it is indeed all zeros. It's going to try and make a 1.5G anonymous mapping on the other end and patch in whatever pages aren't zero.

> The syscall fails with EINVAL. This I thought strange in that if
> the slave has insufficient memory surely you would return ENOMEM ?

Based on the fact that you're trying to allocate 1.5G, I'm gonna guess that came from this snippet in vmadump_common.c:

    /* Load the data from the dump file */
    down_write(&current->mm->mmap_sem);
    addr = do_mmap(0, head->start, head->end - head->start,
                   PROT_READ|PROT_WRITE|PROT_EXEC, mmap_flags, 0);
    up_write(&current->mm->mmap_sem);
    if (addr != head->start) {
        printk("do_mmap(0, %08lx, %08lx, ...) = 0x%08lx (failed)\n",
               head->start, head->end - head->start, addr);
        return -EINVAL;
    }

That really should just pass through whatever the mmap error is. I think I was trying to keep the possible set of errors returned by vmadump smaller than the intersection of all the syscalls it uses. E.g. connection reset by peer is a TCP-ism, and maybe vmadump should just say EIO.

> I spent some time reading kernel/move.c -- process2move() and
> send_process(), but couldn't deduce whether or not ENOMEM would
> return in the case where the slave had insufficient memory.
>
> Really this is 2 questions:
>
> * How does vmadump and/or 2.6.9-cm46 behave when migrating overcommitted
>   space? (the space is a map of /dev/zero with PROT_NONE).

It should be the same as vanilla 2.6.9 (I don't know exactly what that is). With the default set of arguments that bproc_move uses, vmadump will construct the following.

For the type of regions you're talking about:
- Only non-zero pages are sent:
  - For anonymous mappings, zeroness can often be determined via a page table walk.
  - For file mappings, it's going to page in the page and check it.
- Regions are recreated as anonymous mappings on the remote machine.
- Only the non-zero pages are paged in (and written to).

For things that get sent as file references (bplib -l):
- Only modified pages get sent.
- Regions are recreated as file mappings.
- Modified pages are patched in.

I think it might be a win to add /dev/zero to the library list on the front end.

> * If a migration should fail due to lack of memory, what does sys_bproc
>   return?

Looks like EINVAL, although it probably shouldn't.

> [[Note: I'm just trying to understand what's happening. Not saying
> there's any problem with BProc. We know that our big-bang allocation
> scheme is braindead and needs fixing.]]

Nod. In theory, except for the humongous overcommit on the slave node, it seems like it *should* work fine with the migrator.

- Erik
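Erik's dump-to-a-file test can be sketched in a few lines. Several things here are assumptions rather than documented fact: the bproc_dump(fd, flags) form and the <sys/bproc.h> header path follow the names mentioned in this thread and should be checked against the installed bproc.h, 0 stands in for the default dump flags, and the resume-from-the-dump-point behavior is inferred from the discussion above.

/* Sketch of dumping the current process image to a plain file,
 * per the suggestion above. bproc_dump(fd, flags) and the header
 * path are assumptions; check bproc.h on your system. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/bproc.h>   /* assumption: BProc user-space header */

int main(void)
{
    int fd = open("self.dump", O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0) { perror("open"); return 1; }

    /* Per the discussion: executing ./self.dump later should reload
     * the image and continue from (roughly) this point. */
    if (bproc_dump(fd, 0) < 0) {
        perror("bproc_dump");
        return 1;
    }
    close(fd);
    printf("resumed (original run or reloaded ./self.dump)\n");
    return 0;
}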
From: Julian S. <ju...@va...> - 2005-06-23 11:35:30
> Although I agree that linking valgrind against bproc would be nasty, I
> still like the idea of stopping valgrind at a convenient moment
> though. [...]

I've been playing with doing bproc_move at the "right" (least-worst) time from within Valgrind. With a bit of fd-restoring ugliness and preloading various .so's onto the slaves, I can get V to migrate off the master and stay alive. At least -- it works when V is running in no-instrumentation mode (--tool=none).

When you try to migrate running a useful tool (memcheck) the migration call instantly fails. I am doing

    ret = VG_(do_syscall3)(__NR_bproc, BPROC_SYS_MOVE, node, (UWord)&req);

(having copied relevant stuff from clients/bproc.c and {sys,kernel}/bproc.h)

So my question. For historical reasons Memcheck allocates all required "shadow" address space at the start, about 1.5G, with a huge mmap of /dev/zero. It then mprotects bits of this incrementally to bring it into use as needed. Most of the mapping is never used.

This only works because by default the kernel does VM overcommitting. On some systems (Red Hat 8) this scheme fails if there isn't 1.5G of swap to back it.

So I was wondering how your cm46 slave kernels will behave. At the point vmadump gets its hands on the process image, this huge mapping will have been established. My test slave has 128M of memory and obviously no swap. Should migration succeed under these circumstances?

The syscall fails with EINVAL. This I thought strange, in that if the slave has insufficient memory surely you would return ENOMEM?

I spent some time reading kernel/move.c -- process2move() and send_process() -- but couldn't deduce whether or not ENOMEM would be returned in the case where the slave had insufficient memory.

Really this is 2 questions:

* How does vmadump and/or 2.6.9-cm46 behave when migrating overcommitted space? (The space is a map of /dev/zero with PROT_NONE.)

* If a migration should fail due to lack of memory, what does sys_bproc return?

[[Note: I'm just trying to understand what's happening. Not saying there's any problem with BProc. We know that our big-bang allocation scheme is braindead and needs fixing.]]

Thanks,
J
From: Erik H. <eah...@gm...> - 2005-06-22 06:01:37
On 6/21/05, Julian Seward <ju...@va...> wrote:
> I wrote a simple test program which simply consists of a
> spin-wait loop, then a bproc_move from front end to a slave
> node, and a second spin-wait loop which prints a progress
> message every second or so.
>
> The process is migrated correctly. However, after running
> on the slave for a few (20 ish?) seconds, it dies, with
> "Killed" printed. The amount of progress it makes before
> this happens varies from attempt to attempt, although it
> does not vary by much.
>
> Another ten or twenty seconds after "Killed" appears, the
> slave invariably reboots itself.

When a slave holding a process dies, the process looks like it got a SIGKILL on the front end. There's no normal UNIX way to say "the machine that process was on isn't with us anymore", so that seemed like the next best thing to do.

> strace isn't helpful; it merely tells me the process is killed
> by SIGKILL, which is apparent anyway.
>
> Why does this happen? How can I avoid it? Given that the
> migration takes place OK, it feels like the master has asked
> the slave to reset itself as a result of some kind of timeout
> happening. When the slaves are idle they stay alive indefinitely
> with no such reboots.

I'm not sure why that's happening. Is there anything on the slave's console? The 20-second interval sounds like the normal ping timeout scenario between the master and a slave. Is it possible that the slave daemon is getting starved somehow? Can you bpsh other things to the node while your program is running?

> [I also don't understand why I can see printfs from the
> program after migration, given that "man 2 bproc_move" says
> "All open files are closed during migration."]

That's an inaccuracy, I suppose. If you don't specify any other I/O setup, it takes stdout + stderr and feeds them to whatever the process's original stdout was. It uses the socket that was used to pass the process data to do this. It's a crutch for little test programs like the one you wrote. The I/O forwarding done by bpsh, mpirun, etc. doesn't work this way.

- Erik
From: Erik H. <eah...@gm...> - 2005-06-22 05:54:31
On 6/20/05, Greg Watson <gw...@la...> wrote:
> Interestingly, I was just talking to one of the OpenMPI developers
> who have a very similar problem. On most systems, mpirun will launch
> a daemon on the remote nodes, then the daemon forks and execs the
> program to be run. It does this to maintain control of I/O forwarding
> amongst other things. Parallel debuggers also need to do a similar
> thing.

Yep. BProc is definitely not "most systems." :-)

> Unfortunately, bproc 4 does not support exec'ing an executable on a
> remote node (although older versions did), so they have to use some
> other mechanism like copying the executable to the node.
> Unfortunately, both the exec and direct copying (and NFS for that
> matter) bypass the tree spawn mechanism that makes bproc so efficient
> and scalable.

The execve hook became difficult to support because of some changes in the process movement code (specifically, atomic conversion of a process to/from a bproc message). It was relatively easy in the context of BProc 3 because it didn't handle the real-process vs. process-in-a-message distinction very well at all.

It was a pretty big misfeature anyway. The manner in which people generally wanted to use it was exactly the wrong thing to do with it - have mpirun/parallel debugger/etc. exec on the node, and so on. We used to watch master nodes crumble trying to migrate only a 100 or so processes off at once.

> The basic problem is the need to get two (or more)
> executables onto a node in such a way as they know about each other,
> but as far as I know this functionality is not available. I don't
> know how difficult it would be to modify vexecmove to handle multiple
> executables - perhaps Erik could answer that?

Multiple executables? I don't understand how that makes sense in the context of an 'execve' type system call. You cannot load multiple binaries into a single process in any kind of meaningful way. Otherwise you're back to populating the file system, which should definitely not be part of execmove.

BTW, I think the general movement should be in the other direction. vexecmove() is a hugely complicated system call. There are all kinds of problems with it because it effectively combines fork() and execve(). Both these calls have special semantics wrt ptrace. There's a bunch of kernel code in there to try and emulate these weird cases without really returning to user space between these calls. This is nasty and gross, and gdb doesn't work right without it.

As a possible fix, I observed that execve can be implemented as execdump into my own memory space, followed by a move, and then undump from my own memory space. I have a branch in the CVS which does this and removes vexecmove as a primitive (it's emulated in the user space bproc library). I think I axed 1000 lines of kernel code as a result. It's still half baked because dump to/from my own memory isn't implemented. There are a number of other advantages to handling execmove as a dump/move/undump sequence in user space as well.

The bottom line is I think the kernel goop needs to get simpler, not more complicated.

- Erik

P.S. The execve hook wouldn't work for something like valgrind anyway since it doesn't use execve to load the program to be debugged.
From: Erik H. <eah...@gm...> - 2005-06-22 05:52:52
On 6/20/05, Julian Seward <ju...@va...> wrote:
> > > The intention is that the carried-around tree is small, as indeed
> > > V's install tree is (5.7 M).
>
> That's -O -g; if we knock off the debug info it's about 2M.

FWIW, that really doesn't strike me as being too big to migrate along with a process. When debugging with valgrind, you should be expecting higher memory requirements, slower run times, etc. 2-5MB extra on the process seems totally reasonable to me. I wouldn't worry about migrating that at all.

> > What about 'program'?
> > Are you imagining this to be in valgrind's install tree?
> > Else this would still have to be separately migrated, or on nfs, no?
>
> Good point. Uh, this is more complex than I thought.
>
> * 'program' does indeed need to be migrated too
>
> * How will .../bin/valgrind know to start 'program' ?
>
> * Very often, users want to supply their own suppressions
>   files for Valgrind (--supp=filename on the command line)
>   and that needs to be shunted across too
>
> * How does all this work when starting stuff with mpirun
>   rather than bpsh ?
>
> * What if the valgrinded program on the nodes decides to
>   start a child process which it wants valgrinded?

There's also the issue of permissions and resource limitations. bpsh isn't a privileged program, so populating /usr/lib/* on a slave node will be a problem. Also, on diskless systems, places like /usr/lib are usually populated minimally to keep memory usage down. Once it's populated you basically have a cache. Something is going to have to take care of purging it, etc. It's probably less of a headache for the administrator to just drop the valgrind stuff out there. These days, on machines with multiple gigs of RAM, throwing another 5MB out there isn't going to be a big deal. If that turns out to be too much, maybe the scheduler can be made to put something like valgrind on the node if the user requests it. Bottom line is I think the /usr/lib stuff isn't that big a deal. The user binary is the real issue.

Populating a node with files usually turns into one of those slippery slopes, and things get out of hand quickly. It's come up many times. Shared libraries are a common topic of conversation - migrating those with an executable would be nice. It starts to look less reasonable to do it explicitly when there are 100s of them. Then there are the permission and resource issues I mentioned. Then people are going to want to do configuration files... and then input data, and they're going to want BProc to be their network file system - a problem I specifically did NOT set out to solve. Ok, I'm ranting. Bottom line is I think it's perfectly reasonable to use something like NFS with BProc. It just shouldn't ever be a hard requirement.

Although I agree that linking valgrind against bproc would be nasty, I still like the idea of stopping valgrind at a convenient moment though. What if BProc could restore a few open file descriptors? Don't worry about stdin/out/err since bpsh et al. are supposed to do something reasonable with those. I think mmap is a great way of implicitly telling the system what you need to run. I am probably biased though.

What about some other way of catching the valgrind process at the right moment? Just thinking out loud here but.... What about doing something along the lines of an LD_PRELOAD hack on valgrind? As long as it's a dynamically linked thing, maybe we could manually load in a stub (which isn't part of valgrind) that would take care of the details of saving/restoring any open files and doing the dump. I think there are probably lots of nasty details in there, but that might make it possible to get the stop without getting too many nasty BProc details in it. It might be a good way to isolate valgrind from BProc and vice versa.

Basically, it seems like some special logic is going to be required since the program and valgrind aren't being loaded in the usual fashion. I had a convenient little hack to stop an ELF binary after linking completed but before it got to things like calling constructors or main(), but that's not going to work here. Oh well, too bad.

- Erik
From: Julian S. <ju...@va...> - 2005-06-22 02:15:24
I wrote a simple test program which simply consists of a spin-wait loop, then a bproc_move from front end to a slave node, and a second spin-wait loop which prints a progress message every second or so.

The process is migrated correctly. However, after running on the slave for a few (20 ish?) seconds, it dies, with "Killed" printed. The amount of progress it makes before this happens varies from attempt to attempt, although it does not vary by much.

Another ten or twenty seconds after "Killed" appears, the slave invariably reboots itself.

strace isn't helpful; it merely tells me the process is killed by SIGKILL, which is apparent anyway.

Why does this happen? How can I avoid it? Given that the migration takes place OK, it feels like the master has asked the slave to reset itself as a result of some kind of timeout happening. When the slaves are idle they stay alive indefinitely with no such reboots.

[I also don't understand why I can see printfs from the program after migration, given that "man 2 bproc_move" says "All open files are closed during migration."]

--------

I'm using x86 BProc from Clustermatic 5. The master is vanilla SuSE 9.2. Both the master and slave(s) run on VMware WS5 running on WinXP on a P4 with 1G memory. All /etc/clustermatic/config settings are defaults, except "pingtimeout", which I reduced from 30 to 10. The program prints a message just before/after it does bproc_move. I can see that the migration really does happen by watching the CPU usage of the various VMware threads involved.

J
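A reconstruction of the test program described above might look like the following. This is a sketch: the loop lengths and messages are guesses, sleep() stands in for the spin-wait loops, and the <sys/bproc.h> header path is an assumption; bproc_move(node) is the call under discussion.

/* Sketch of the test described above: wait, migrate to a slave
 * node, then loop printing progress. Details are guesses. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/bproc.h>   /* assumption: BProc user-space header */

int main(int argc, char **argv)
{
    int node = (argc > 1) ? atoi(argv[1]) : 0;
    int i;

    for (i = 0; i < 5; i++)          /* first wait loop on the master */
        sleep(1);

    printf("moving to node %d...\n", node);
    if (bproc_move(node) < 0) {
        perror("bproc_move");
        return 1;
    }
    printf("now on node %d\n", node);

    for (i = 0; ; i++) {             /* progress loop on the slave */
        printf("alive %d\n", i);
        fflush(stdout);              /* forwarded over the move socket */
        sleep(1);
    }
    return 0;
}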
From: Julian S. <ju...@va...> - 2005-06-21 18:22:29
> - valgrind starts up and gets through loading the program to be debugged.
> - valgrind stops and dumps itself w/ vmadump (bproc_dump()).
> - bpsh/mpirun migrates THAT process image instead of some fresh executable.
> - half started process w/ valgrind + other executable wakes up and
>   runs on the slave node.

That might work. It sounds expensive though. Instead of doing bproc_dump, why not do bproc_move? At that point, V should have loaded the executable to be debugged, and read all the associated junk (debug info, suppressions files, extra .so's, whatever), and basically have the thing ready to go.

One problem is that the migration will trash the fd on which V writes error messages/etc, but that's not a big deal since V can just reopen the log after the move is done.

I'll give it a go.

J
From: Greg W. <gw...@la...> - 2005-06-20 16:01:16
On Jun 20, 2005, at 9:50 AM, Julian Seward wrote:
> >> OpenMPI developers who have a very similar problem.
> >> [...]
> >> The basic problem is the need to get two (or more)
> >> executables onto a node in such a way as they know about each other,
>
> Yes. Just to be clear, what we need to get onto the nodes is:
>
> * executable to be debugged
>
> * valgrind stage 1 executable, which is really a loader for
>
> * valgrind stage 2 executable
>
> * a bunch of .so's. These we dlopen, so am not sure if the standard
>   BProc library mechanism takes care of it or not.
>
> * a bunch of pre-supplied text files ("suppressions")
>   which tell V to hide errors from standard libraries. The identity of
>   these files is known at the time V is built/installed.
>
> * possibly extra such files as supplied by the application's
>   developers. Their identity is not known until V starts up.
>
> I wasn't aware of the no-exec-on-nodes restriction, although that
> isn't a big deal for us -- V uses its own user-space exec
> implementation.

To clarify, you can exec an executable that is local to the node. In bproc 3 and earlier you could give exec a full path name and it would automatically load it from the front end if it wasn't found on the node. This no longer works in bproc 4.

Greg
From: Julian S. <ju...@va...> - 2005-06-20 15:50:30
> OpenMPI developers who have a very similar problem.
> [...]
> The basic problem is the need to get two (or more)
> executables onto a node in such a way as they know about each other,

Yes. Just to be clear, what we need to get onto the nodes is:

* executable to be debugged

* valgrind stage 1 executable, which is really a loader for

* valgrind stage 2 executable

* a bunch of .so's. These we dlopen, so am not sure if the standard BProc library mechanism takes care of it or not.

* a bunch of pre-supplied text files ("suppressions") which tell V to hide errors from standard libraries. The identity of these files is known at the time V is built/installed.

* possibly extra such files as supplied by the application's developers. Their identity is not known until V starts up.

I wasn't aware of the no-exec-on-nodes restriction, although that isn't a big deal for us -- V uses its own user-space exec implementation.

J
From: Greg W. <gw...@la...> - 2005-06-20 14:20:57
Interestingly, I was just talking to one of the OpenMPI developers, who have a very similar problem. On most systems, mpirun will launch a daemon on the remote nodes, then the daemon forks and execs the program to be run. It does this to maintain control of I/O forwarding, amongst other things. Parallel debuggers also need to do a similar thing.

Unfortunately, bproc 4 does not support exec'ing an executable on a remote node (although older versions did), so they have to use some other mechanism like copying the executable to the node. Unfortunately, both the exec and direct copying (and NFS for that matter) bypass the tree spawn mechanism that makes bproc so efficient and scalable. The basic problem is the need to get two (or more) executables onto a node in such a way that they know about each other, but as far as I know this functionality is not available. I don't know how difficult it would be to modify vexecmove to handle multiple executables - perhaps Erik could answer that?

Greg

On Jun 20, 2005, at 5:58 AM, Julian Seward wrote:
> >>> The intention is that the carried-around tree is small, as indeed
> >>> V's install tree is (5.7 M).
>
> That's -O -g; if we knock off the debug info it's about 2M.
>
> >> What about 'program'?
> >> Are you imagining this to be in valgrind's install tree?
> >> Else this would still have to be separately migrated, or on nfs, no?
>
> Good point. Uh, this is more complex than I thought.
>
> * 'program' does indeed need to be migrated too
>
> * How will .../bin/valgrind know to start 'program' ?
>
> * Very often, users want to supply their own suppressions
>   files for Valgrind (--supp=filename on the command line)
>   and that needs to be shunted across too
>
> * How does all this work when starting stuff with mpirun
>   rather than bpsh ?
>
> * What if the valgrinded program on the nodes decides to
>   start a child process which it wants valgrinded?
>
> Hum.
>
> J
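The launch pattern Greg describes, where a daemon forks, captures the child's output in a pipe it forwards, and then execs the program, could be sketched as follows (hypothetical names; not OpenMPI's actual code):

/* Sketch of the daemon-side launch pattern described above: fork,
 * wire the child's stdout/stderr into a pipe for I/O forwarding,
 * then exec the program. Hypothetical, not OpenMPI's real code. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static int launch(char **argv)
{
    int out[2];
    pid_t pid;
    char buf[4096];
    ssize_t n;

    if (pipe(out) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                       /* child: becomes the program */
        dup2(out[1], STDOUT_FILENO);
        dup2(out[1], STDERR_FILENO);
        close(out[0]);
        close(out[1]);
        execvp(argv[0], argv);
        _exit(127);                       /* exec failed */
    }

    close(out[1]);                        /* daemon: forward the output */
    while ((n = read(out[0], buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);  /* would go back to mpirun */
    close(out[0]);
    waitpid(pid, NULL, 0);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s prog [args...]\n", argv[0]);
        return 1;
    }
    return launch(argv + 1) ? 1 : 0;
}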
From: Julian S. <ju...@va...> - 2005-06-20 11:58:54
> > The intention is that the carried-around tree is small, as indeed
> > V's install tree is (5.7 M).

That's -O -g; if we knock off the debug info it's about 2M.

> What about 'program'?
> Are you imagining this to be in valgrind's install tree?
> Else this would still have to be separately migrated, or on nfs, no?

Good point. Uh, this is more complex than I thought.

* 'program' does indeed need to be migrated too

* How will .../bin/valgrind know to start 'program' ?

* Very often, users want to supply their own suppressions files for Valgrind (--supp=filename on the command line) and that needs to be shunted across too

* How does all this work when starting stuff with mpirun rather than bpsh ?

* What if the valgrinded program on the nodes decides to start a child process which it wants valgrinded?

Hum.

J
From: Cerion Armour-B. <ce...@op...> - 2005-06-20 10:35:42
On Monday 20 June 2005 12:27, Julian Seward wrote:
> I've been pondering a more generalised solution .. tell me if this
> sounds crazy.
>
> It's a modified version of bpsh (or a replacement). Instead of doing
>
>     bpsh <node_specifiers> program args
>
> do
>
>     modified_bpsh <node_specifiers> path program args
>
> modified_bpsh reads the entire tree rooted at path into itself
> (mmap games, perhaps), migrates to the nodes, dumps the tree back
> into the node-local filesystem, and execs program w/args as usual.
>
> Running V on a slave node is then
>
>     modified_bpsh <node_specifiers> \
>         /where/V/is/installed/on/master \                # the path
>         /where/V/is/installed/on/master/bin/valgrind \   # stage1
>         program \
>         args
>
> This strikes me as having several advantages:
>
> * doesn't require slaves to use an NFS-mounted filesystem
> * is useful for any kind of tool requiring a readonly filesystem
> * doesn't require linking V against BProc
>
> The intention is that the carried-around tree is small, as indeed
> V's install tree is (5.7 M). If bandwidth is an issue (fair enough, if
> sending copies to N hundred nodes) then it might be possible to
> compress the tree as it is read using a real-time compression package
> and decompress on the slaves. I'm thinking of LZO
> (http://www.oberhumer.com/opensource/lzo) which is GPLd and very fast.
>
> Comments?

What about 'program'? Are you imagining this to be in valgrind's install tree? Else this would still have to be separately migrated, or on nfs, no?

C.
From: Julian S. <ju...@va...> - 2005-06-20 10:27:42
> [...]
> Interesting. If I read this right valgrind is acting as an ELF loader.

Correct. There's no way to avoid this [that we know of.]

> Does it do linking stuff too?

No. We merely load the executable and its direct dependencies, then start up the stated ELF interpreter (ld.so) on our virtual CPU. So we don't have to get into the dynamic linking swamp, fortunately.

> Does the target program effectively have its own dynamic linker or is
> it shared with valgrind? Does it share instances of the libraries? -
> it appears that stage2 is dynamically linked as well.

Our design goal is that V is completely independent of any other libraries. We haven't quite got there yet, but it nearly is. The primary motivation is that V has to maintain complete control over the process's address space and signal state, and that's essentially impossible if we defer to glibc to do low-level stuff like malloc, free, etc.

Also, imagine the potential chaos if V and the program shared glibc.so, and the simulated program was part way through doing malloc (on the simulated CPU) when V decided to call malloc on the real CPU. Even if this didn't turn out to be a problem, the difficulty in convincing ourselves that it's safe and always going to work is huge. So our policy is to make V (viz, stage2) as completely self-contained as we can. Since we're not quite there yet .. V does use glibc.so and ld.so, but has its own instances of them.

> It's true that it's only one executable but that could be something
> pretty weird. I'm kinda just thinking out loud here but what about
> the following:
>
> - valgrind starts up and gets through loading the program to be debugged.
> - valgrind stops and dumps itself w/ vmadump (bproc_dump()).
> - bpsh/mpirun migrates THAT process image instead of some fresh executable.
> - half started process w/ valgrind + other executable wakes up and
>   runs on the slave node.
>
> The nasty bit here is that valgrind would have to be linked w/ bproc.
> I did some weird stuff w/ editing freshly loaded elf binaries to add a
> preinit section that called bproc. That basically allowed the kernel
> to take over again after dynamic linking was done but before the
> program ran. I don't know if some similar hack could work here. I
> don't know - just a thought.
>
> This would be pretty easy to test, I think. If you added the
> bproc_dump call and just dumped to a plain file, you can execve that
> file directly to reload the dump. That would allow bpsh to do its
> thing. A real solution would probably look more like dumping into a pipe
> or something.
>
> That still leaves the problem of valgrind getting at files when it
> pleases. Would it be possible/reasonable for valgrind to pre-load
> everything it *might* need down the line? That could be optional.

That kind of thing might be a possibility, although I have to be honest and say I'd prefer not to have to put BProc specifics into V if I don't have to, especially as at this time we're working hard to make V less target-specific.

I've been pondering a more generalised solution .. tell me if this sounds crazy.

It's a modified version of bpsh (or a replacement). Instead of doing

    bpsh <node_specifiers> program args

do

    modified_bpsh <node_specifiers> path program args

modified_bpsh reads the entire tree rooted at path into itself (mmap games, perhaps), migrates to the nodes, dumps the tree back into the node-local filesystem, and execs program w/args as usual.

Running V on a slave node is then

    modified_bpsh <node_specifiers> \
        /where/V/is/installed/on/master \                # the path
        /where/V/is/installed/on/master/bin/valgrind \   # stage1
        program \
        args

This strikes me as having several advantages:

* doesn't require slaves to use an NFS-mounted filesystem
* is useful for any kind of tool requiring a readonly filesystem
* doesn't require linking V against BProc

The intention is that the carried-around tree is small, as indeed V's install tree is (5.7 M). If bandwidth is an issue (fair enough, if sending copies to N hundred nodes) then it might be possible to compress the tree as it is read using a real-time compression package and decompress on the slaves. I'm thinking of LZO (http://www.oberhumer.com/opensource/lzo) which is GPLd and very fast.

Comments?

J
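The pack-migrate-unpack core of the modified_bpsh idea could be sketched as below. This is deliberately simplified and hypothetical: it handles a single flat directory rather than a full tree, has fixed limits, no compression, and makes the same bproc_move/header assumptions as the earlier sketches.

/* Sketch of modified_bpsh's core: slurp a directory's files into
 * heap memory so they ride along inside the process image, migrate,
 * then write them back out on the node. Flat directory only. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/bproc.h>   /* assumption: BProc user-space header */

struct blob { char name[256]; size_t size; char *data; };

/* Read one regular file completely into heap memory. */
static char *slurp(const char *path, size_t *len)
{
    struct stat st;
    char *buf = NULL;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode)) {
        buf = malloc(st.st_size);
        if (buf && read(fd, buf, st.st_size) != (ssize_t)st.st_size) {
            free(buf);
            buf = NULL;
        }
        *len = (size_t)st.st_size;
    }
    close(fd);
    return buf;
}

int main(int argc, char **argv)
{
    struct blob blobs[64];
    char path[512];
    struct dirent *e;
    DIR *d;
    int n = 0, i, node;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <dir> <node>\n", argv[0]);
        return 1;
    }
    node = atoi(argv[2]);
    d = opendir(argv[1]);
    if (!d) { perror(argv[1]); return 1; }

    /* Pack: pull every regular file into the heap so vmadump sends
     * it as part of the process image. */
    while ((e = readdir(d)) != NULL && n < 64) {
        snprintf(path, sizeof(path), "%s/%s", argv[1], e->d_name);
        blobs[n].data = slurp(path, &blobs[n].size);
        if (blobs[n].data) {
            snprintf(blobs[n].name, sizeof(blobs[n].name), "%s", e->d_name);
            n++;
        }
    }
    closedir(d);

    if (bproc_move(node) < 0) { perror("bproc_move"); return 1; }

    /* Unpack: now on the node, recreate the directory and its files. */
    mkdir(argv[1], 0755);
    for (i = 0; i < n; i++) {
        int fd;
        snprintf(path, sizeof(path), "%s/%s", argv[1], blobs[i].name);
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
        if (fd >= 0) {
            write(fd, blobs[i].data, blobs[i].size);
            close(fd);
        }
    }
    /* ...here modified_bpsh would exec 'program args' as usual... */
    return 0;
}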
From: Erik H. <eah...@gm...> - 2005-06-17 23:58:57
On 6/17/05, Julian Seward <ju...@va...> wrote:

[snip]

> * User runs /usr/bin/valgrind prog args-for-prog
>
> * /usr/bin/valgrind is not the "real" valgrind executable.
>   That is /usr/lib/valgrind/stage2. /usr/bin/valgrind
>   loads stage2 high in the address space and hands control to it.
>
> * stage2 unmaps /usr/bin/valgrind. It is now alone in the
>   address space, and in particular there is a hole at the
>   standard load address (0x8040000, or wherever).
>
> * stage2 has its own implementation of exec() (sort of).
>   It uses this to load prog (+ dependent .so's) and start it.
>
> * From prog's point of view it is started just as it would be
>   normally.
>
> In reality it is running on a virtual CPU provided by stage2.
> stage2 intercepts and messes with all mmap() etc done by prog
> to ensure it doesn't screw up valgrind.

Interesting. If I read this right, valgrind is acting as an ELF loader. Does it do linking stuff too? Does the target program effectively have its own dynamic linker or is it shared with valgrind? Does it share instances of the libraries? - it appears that stage2 is dynamically linked as well.

[snip]

> This is pretty ugly. As I understand it, bpsh takes
> /opt/valgrind/bin/valgrind from the master, migrates it to the slave(s),
> starts it there, and it just happens to work because the valgrind
> install trees on the master and slaves are identical.

Yup. That's right.

> I don't have any better ideas. Fundamentally it seems difficult because
> bpsh is only prepared to migrate one executable and that has to be
> /opt/valgrind/bin/valgrind, so you have to have a different way to get
> the executable-to-be-debugged to the slaves.

It's true that it's only one executable, but that could be something pretty weird. I'm kinda just thinking out loud here, but what about the following:

- valgrind starts up and gets through loading the program to be debugged.
- valgrind stops and dumps itself w/ vmadump (bproc_dump()).
- bpsh/mpirun migrates THAT process image instead of some fresh executable.
- half-started process w/ valgrind + other executable wakes up and runs on the slave node.

The nasty bit here is that valgrind would have to be linked w/ bproc. I did some weird stuff w/ editing freshly loaded elf binaries to add a preinit section that called bproc. That basically allowed the kernel to take over again after dynamic linking was done but before the program ran. I don't know if some similar hack could work here. I don't know - just a thought.

This would be pretty easy to test, I think. If you added the bproc_dump call and just dumped to a plain file, you can execve that file directly to reload the dump. That would allow bpsh to do its thing. A real solution would probably look more like dumping into a pipe or something.

That still leaves the problem of valgrind getting at files when it pleases. Would it be possible/reasonable for valgrind to pre-load everything it *might* need down the line? That could be optional.

- Erik
From: Julian S. <ju...@va...> - 2005-06-17 22:38:40
> I've never tried to use it in this manner. I did take a quick peek at
> this once a while back. At the time it looked to me like starting a
> process with valgrind was (essentially) setting LD_PRELOAD to load the
> valgrind .so files and maybe a few other environment variables. Is
> this still more or less what valgrind is doing? /usr/bin/valgrind I
> found on my system here is just a binary.

The LD_PRELOAD mechanism went away some time back as it relies too heavily on glibc and libpthread specifics.

Startup is tricky because we need to load both Valgrind and the application to be debugged into the same address space (same process); but the application has no idea this is happening. How it works now is:

* User runs /usr/bin/valgrind prog args-for-prog

* /usr/bin/valgrind is not the "real" valgrind executable. That is /usr/lib/valgrind/stage2. /usr/bin/valgrind loads stage2 high in the address space and hands control to it.

* stage2 unmaps /usr/bin/valgrind. It is now alone in the address space, and in particular there is a hole at the standard load address (0x8040000, or wherever).

* stage2 has its own implementation of exec() (sort of). It uses this to load prog (+ dependent .so's) and start it.

* From prog's point of view it is started just as it would be normally. In reality it is running on a virtual CPU provided by stage2. stage2 intercepts and messes with all mmap() etc done by prog to ensure it doesn't screw up valgrind.

So ..

> 1. Valgrind has to have some reasonably nice mechanism to tell us what
> exactly needs to be set. I'm not sure exactly what that should look
> like but I figure there's lots of possibilities

No env vars are needed for startup, I think. There's some env var trickery if a valgrinded process wants to start a valgrinded child process, but we can ignore that for now.

> 3. The valgrind libraries need to be available on the slave nodes.
> This is just a system configuration issue. I did some work
> experimenting with migration after linking so this requirement could
> potentially go away.

Well, not just the .so's. stage2 assumes it can grab any of the stuff in PREFIX/lib/valgrind as/when it likes. There are various .so's forced into the address space at startup, but there are also a bunch of text files (*.supp) which are important.

So far I managed to get it to work as follows:

* on the master node, install into /opt/valgrind
* bpcp -r /opt/valgrind to the slave nodes
* bpcp 'prog' to somewhere on the slave nodes, say /opt/prog
* on the master do

    bpsh <nodeid> /opt/valgrind/bin/valgrind /opt/prog args-for-prog

This is pretty ugly. As I understand it, bpsh takes /opt/valgrind/bin/valgrind from the master, migrates it to the slave(s), starts it there, and it just happens to work because the valgrind install trees on the master and slaves are identical.

I don't have any better ideas. Fundamentally it seems difficult because bpsh is only prepared to migrate one executable and that has to be /opt/valgrind/bin/valgrind, so you have to have a different way to get the executable-to-be-debugged to the slaves. If the entire V install tree could be pre-installed on the slaves that would help. I guess one option is for all slaves to refer to a global NFS mount. But there would still be a problem of moving the executable.

Ideas?

J
From: Erik H. <eah...@gm...> - 2005-06-17 15:36:05
On 6/16/05, Julian Seward <ju...@va...> wrote:
> Valgrind is a GPL'd tool suite for doing memory debugging and profiling
> on x86-linux and amd64-linux. We are looking into the issue of making
> Valgrind work well on BProc and hence on MPI.
>
> I'd like to re-ask Ceri's question: has anyone used or tried to use
> Valgrind over BProc? If so, what did you have to do to make it work?

I've never tried to use it in this manner. I did take a quick peek at this once a while back. At the time it looked to me like starting a process with valgrind was (essentially) setting LD_PRELOAD to load the valgrind .so files and maybe a few other environment variables. Is this still more or less what valgrind is doing? /usr/bin/valgrind I found on my system here is just a binary.

If it is.... then it seems to me the best way to get it started would be to get the same variables set for the processes running out on the nodes. The following would be required to get that done, I think:

1. Valgrind has to have some reasonably nice mechanism to tell us what exactly needs to be set. I'm not sure exactly what that should look like, but I figure there are lots of possibilities.

2. mpirun, bpsh, et al. need a mechanism to have different environment variables set for the child processes. That's easy, but it's not there at this point.

3. The valgrind libraries need to be available on the slave nodes. This is just a system configuration issue. I did some work experimenting with migration after linking, so this requirement could potentially go away.

What do you think?

- Erik

> > From: Cerion Armour-Brown <cerion@op...>
> > Subject: valgrind & bproc
> > Date: 2005-05-30 01:40
> >
> > Hi,
> > I'm a developer working on Valgrind, and I'm trying to work out the best
> > way to use valgrind with bproc.
> >
> > Does anyone already do this? (Directly with bproc - not via mpi.)
> > If so, I'd really appreciate some details on how you've set this up.
> >
> > I understand that using valgrind with mpirun is fairly straightforward
> > (though I haven't set up mpi yet to try it out). From what I've read, it
> > seems valgrind must be accessible from the nodes (nfs, or whatever), but
> > the program to run is migrated from the master, yes?
> >
> > Using bpsh, I don't see how I can avoid needing both valgrind, and the
> > program to run, accessible from the nodes, since running
> >     $ bpsh -a valgrind foo
> > will migrate valgrind, then, on the nodes, valgrind will look for foo.
> > Valgrind is unlike gdb in that you cannot "attach" it once the program
> > has started. So the trick used for gdb (bpsh -a foo, find foo's pid,
> > attach gdb) won't work.
> >
> > Any pointers much appreciated,
> > Cerion