From: Erik H. <eah...@gm...> - 2005-06-22 05:52:52
On 6/20/05, Julian Seward <ju...@va...> wrote:
>
> > > The intention is that the carried-around tree is small, as is indeed
> > > V's install tree is (5.7 M).
>
> That's -O -g; if we knock off the debug info it's about 2M.

FWIW, that really doesn't strike me as being too big to migrate along
with a process. When debugging with valgrind, you should be expecting
higher memory requirements, slower run times, etc. 2-5MB extra on the
process seems totally reasonable to me. I wouldn't worry about
migrating that at all.

> > What about 'program'?
> > Are you imagining this to be in valgrind's install tree?
> > Else this would still have to be separately migrated, or on nfs, no?
>
> Good point. Uh, this is more complex than I thought.
>
> * 'program' does indeed need to be migrated too
>
> * How will .../bin/valgrind know to start 'program' ?
>
> * Very often, users want to supply their own suppressions
>   files for Valgrind (--supp=filename on the command line)
>   and that needs to be shunted across too
>
> * How does all this work when starting stuff with mpirun
>   rather than bpsh ?
>
> * What if the valgrinded program on the nodes decides to
>   start a child process which it wants valgrinded?

There's also the issue of permissions and resource limitations. bpsh
isn't a privileged program so populating /usr/lib/* on a slave node
will be a problem. Also, on diskless systems places like /usr/lib are
usually populated minimally to keep memory usage down. Once it's
populated you basically have a cache. Something is going to have to
take care of purging it, etc. It's probably less of a headache for the
administrator to drop the valgrind stuff out there.
These days on machines with multiple gigs of RAM, throwing another 5MB
out there isn't going to be a big deal. If that turns out to be too
much maybe the scheduler can be made to put something like valgrind on
the node if the user requests it. Bottom line is I think the /usr/lib
stuff isn't that big a deal. The user binary is the real issue.

Populating a node with files usually turns into one of those slippery
slopes and things get out of hand quickly. It's come up many times.
Shared libraries are a common topic of conversation - migrating those
with an executable would be nice. It starts to look less reasonable to
do it explicitly when there are 100s of them. Then there's the
permission and resource issues I mentioned. Then people are going to
want to do configuration files... and then input data and they're
going to want BProc to be their network file system - a problem I
specifically did NOT set out to solve.

Ok, I'm ranting. Bottom line is I think it's perfectly reasonable to
use something like NFS with BProc. It just shouldn't ever be a hard
requirement.

Although I agree that linking valgrind against bproc would be nasty, I
still like the idea of stopping valgrind at a convenient moment
though. What if BProc could restore a few open file descriptors?
Don't worry about stdin/out/err since bpsh et al. are supposed to do
something reasonable with those. I think mmap is a great way of
implicitly telling the system what you need to run. I am probably
biased though.

What about some other way of catching the valgrind process at the
right moment? Just thinking out loud here but.... What about doing
something along the lines of an LD_PRELOAD hack on valgrind? As long
as it's a dynamically linked thing maybe we could manually load in a
stub (which isn't part of valgrind) that would take care of the
details of saving/restoring any open files and doing the dump.
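
To make the "saving/restoring any open files" part of the stub idea a
bit more concrete, here is a minimal sketch. It is written in Python
for brevity, not as a real LD_PRELOAD stub, and it is not BProc or
valgrind code: snapshot_fd, restore_fd and demo are hypothetical names.
It assumes a regular file whose path is still reachable after the move
(local copy, NFS, whatever), and it only records the path, the open
flags and the current offset, then reopens the file on the same
descriptor number.

```python
import fcntl
import os
import tempfile


def snapshot_fd(fd, path):
    """Record what is needed to reopen a regular file later (hypothetical helper)."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)   # access mode + status flags
    offset = os.lseek(fd, 0, os.SEEK_CUR)    # current file position
    return {"fd": fd, "path": path, "flags": flags, "offset": offset}


def restore_fd(state):
    """Reopen the file and put it back on its original descriptor number."""
    new_fd = os.open(state["path"], state["flags"])
    os.lseek(new_fd, state["offset"], os.SEEK_SET)
    if new_fd != state["fd"]:
        # Move the reopened file onto the number the program expects.
        os.dup2(new_fd, state["fd"])
        os.close(new_fd)
    return state["fd"]


def demo():
    """Read part of a file, snapshot, close, restore, and read the rest."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
        path = tmp.name
    try:
        fd = os.open(path, os.O_RDONLY)
        os.read(fd, 6)                  # consume "hello "
        state = snapshot_fd(fd, path)
        os.close(fd)                    # stands in for the process being dumped
        restored = restore_fd(state)
        data = os.read(restored, 100)   # continues from the saved offset
        os.close(restored)
        return data
    finally:
        os.unlink(path)
```

Pipes, sockets and already-deleted files have no path to reopen, so
they would need separate handling that this sketch does not attempt.
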
I think there are probably lots of nasty details in there but that
might make it possible to get the stop w/o getting too many nasty
BProc details in it. It might be a good way to isolate valgrind from
BProc and vice versa.

Basically, it seems like some special logic is going to be required
since the program and valgrind aren't being loaded in the usual
fashion. I had a convenient little hack to stop an ELF binary after
linking completed but before it got to things like calling
constructors or main() but that's not going to work here. Oh well,
too bad.

- Erik