From: Erik H. <eah...@gm...> - 2005-06-27 22:53:51
On 6/23/05, Julian Seward <ju...@va...> wrote:
>
> > Although I agree that linking valgrind against bproc would be nasty, I
> > still like the idea of stopping valgrind at a convenient moment
> > though.
[...]
>
> I've been playing with doing bproc_move at the "right" (least-worst)
> time from within Valgrind.  With a bit of fd-restoring-ugliness and
> preloading various .so's onto the slaves, I can get V to migrate off
> the master and stay alive.  At least -- it works when V is running
> in no-instrumentation mode (--tool=none).
>
> When you try to migrate running a useful tool (memcheck) the
> migration call instantly fails.  I am doing
>
>   ret = VG_(do_syscall3)(__NR_bproc, BPROC_SYS_MOVE, node, (UWord)&req);

Why not just try dumping to a file on the front end for starters
(bproc_dump)?  You can treat the dump like an executable and just run
it on the front end.  That'd probably be a good test.  Also, you could
use bpsh to run it remotely and avoid the pain of dealing with I/O
forwarding.

> (having copied relevant stuff from clients/bproc.c and {sys,kernel}/bproc.h)
>
> So my question.  For historical reasons Memcheck allocates all required
> "shadow" address space at the start, about 1.5G, with a huge mmap
> of /dev/zero.  It then mprotects bits of this incrementally to bring
> it into use as needed.  Most of the mapping is never used.
>
> This only works because by default the kernel does VM overcommitting.
> On some systems (Red Hat 8) this scheme fails if there isn't 1.5G of
> swap to back it.
>
> So I was wondering how your cm46 slave kernels will behave.  At the
> point vmadump gets its hands on the process image this huge mapping
> will have been established.  My test slave has 128M of memory and
> obviously no swap.  Should migration succeed under these circumstances?

I'm guessing no... because I doubt the kernel will be willing to
overcommit that much memory.  I'm not familiar with the overcommit
policies, so I can't say conclusively.

That said, vmadump (the migrator piece) is smart enough not to send
zeroed pages.  Since it *is* a file mapping, it will probably read all
those pages on the front end to make sure they are indeed all zeros.
It's going to try to make a 1.5G anonymous mapping on the other end
and patch in whatever pages aren't zero.

> The syscall fails with EINVAL.  This I thought strange in that if
> the slave has insufficient memory surely you would return ENOMEM?

Given that you're trying to allocate 1.5G, I'm going to guess that came
from this snippet in vmadump_common.c:

    /* Load the data from the dump file */
    down_write(&current->mm->mmap_sem);
    addr = do_mmap(0, head->start, head->end - head->start,
                   PROT_READ|PROT_WRITE|PROT_EXEC, mmap_flags, 0);
    up_write(&current->mm->mmap_sem);
    if (addr != head->start) {
        printk("do_mmap(0, %08lx, %08lx, ...) = 0x%08lx (failed)\n",
               head->start, head->end - head->start, addr);
        return -EINVAL;
    }

That really should just pass through whatever the mmap error is.  I
think I was trying to keep the possible set of errors returned by
vmadump smaller than the full set of errors that all the syscalls it
uses can return.  E.g. "connection reset by peer" is a TCP-ism, and
maybe vmadump should just say EIO in that case.
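[Illustration: a minimal, untested sketch of the pass-through Erik
suggests, against the 2.6.9-era snippet above.  It assumes do_mmap()
reports failure as a negative errno cast to unsigned long, which
IS_ERR() from <linux/err.h> can recognize:]

    if (addr != head->start) {
        printk("do_mmap(0, %08lx, %08lx, ...) = 0x%08lx (failed)\n",
               head->start, head->end - head->start, addr);
        if (IS_ERR((void *) addr))
            return (int) addr;   /* pass through e.g. -ENOMEM */
        return -EINVAL;          /* mapped, but not at the requested address */
    }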
> I spent some time reading kernel/move.c -- process2move() and
> send_process(), but couldn't deduce whether or not ENOMEM would be
> returned in the case where the slave had insufficient memory.
>
> Really this is 2 questions:
>
> * How does vmadump and/or 2.6.9-cm46 behave when migrating
>   overcommitted space?  (The space is a map of /dev/zero with
>   PROT_NONE.)

It should be the same as vanilla 2.6.9 (I don't know exactly what that
behavior is).  vmadump, with the default set of arguments that
bproc_move uses, will construct the following:

For the type of regions you're talking about:
 - Only non-zero pages are sent:
   - For anonymous mappings, zeroness can often be determined via a
     page table walk.
   - For file mappings, it's going to page in each page and check it.
 - Regions are recreated as anonymous mappings on the remote machine.
 - Only the non-zero pages are paged in (and written to).

For things that get sent as file references (bplib -l):
 - Only modified pages get sent.
 - Regions are recreated as file mappings.
 - Modified pages are patched in.

I think it might be a win to add /dev/zero to the library list on the
front end.

> * If a migration should fail due to lack of memory, what does
>   sys_bproc return?

Looks like EINVAL, although it probably shouldn't.

> [[Note: I'm just trying to understand what's happening.  Not saying
>   there's any problem with BProc.  We know that our big-bang allocation
>   scheme is braindead and needs fixing.]]

Nod.  In theory, except for the humongous overcommit on the slave node,
it seems like it *should* work fine with the migrator.

- Erik
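[Illustration: a minimal userspace sketch of the "big-bang" reservation
scheme discussed above -- a large /dev/zero mapping created PROT_NONE up
front, then mprotect'ed into use piecewise.  The names, sizes, and
structure are made up for the example; this is not Valgrind's actual
code, just the general shape of the mapping that vmadump would see:]

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHADOW_SIZE (1500UL * 1024 * 1024)   /* ~1.5G reservation */
    #define CHUNK_SIZE  (64UL * 1024)            /* committed on demand */

    static char *shadow_base;

    /* Reserve the whole shadow range up front by mapping /dev/zero
     * PROT_NONE; none of it is usable yet. */
    static int shadow_init(void)
    {
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0)
            return -1;
        shadow_base = mmap(NULL, SHADOW_SIZE, PROT_NONE, MAP_PRIVATE, fd, 0);
        close(fd);
        return shadow_base == MAP_FAILED ? -1 : 0;
    }

    /* Bring one chunk into use the first time it is needed. */
    static int shadow_commit(unsigned long offset)
    {
        return mprotect(shadow_base + offset, CHUNK_SIZE,
                        PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        if (shadow_init() != 0 || shadow_commit(0) != 0) {
            perror("shadow setup");
            return 1;
        }
        shadow_base[0] = 1;   /* only now does a real page get touched */
        printf("reserved %lu MB, committed %lu KB\n",
               SHADOW_SIZE >> 20, CHUNK_SIZE >> 10);
        return 0;
    }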