From: Erik H. <eah...@gm...> - 2005-06-27 22:53:51
On 6/23/05, Julian Seward <ju...@va...> wrote:
>
> > Although I agree that linking valgrind against bproc would be nasty, I
> > still like the idea of stopping valgrind at a convenient moment
> > though.
[...]
>
> I've been playing with doing bproc_move at the "right" (least-worst)
> time from within Valgrind.  With a bit of fd-restoring-ugliness and
> preloading various .so's onto the slaves, I can get V to migrate off
> the master and stay alive.  At least -- it works when V is running
> in no-instrumentation mode (--tool=none).
>
> When you try to migrate running a useful tool (memcheck) the
> migration call instantly fails.  I am doing
>
>   ret = VG_(do_syscall3)(__NR_bproc, BPROC_SYS_MOVE, node, (UWord)&req);

Why not just try dumping to a file on the front end for starters
(bproc_dump)?  You can treat the dump like an executable and just run
it on the front end.  That'd probably be a good test.  Also, you could
use bpsh to run it remotely and avoid the pain of dealing with I/O
forwarding.

> (having copied relevant stuff from clients/bproc.c and {sys,kernel}/bproc.h)
>
> So my question.  For historical reasons Memcheck allocates all required
> "shadow" address space at the start, about 1.5G, with a huge mmap
> of /dev/zero.  It then mprotects bits of this incrementally to bring
> it into use as needed.  Most of the mapping is never used.
>
> This only works because by default the kernel does VM overcommitting.
> On some systems (Red Hat 8) this scheme fails if there isn't 1.5G of
> swap to back it.
>
> So I was wondering how your cm46 slave kernels will behave.  At the
> point vmadump gets its hands on the process image this huge mapping
> will have been established.  My test slave has 128M of memory and
> obviously no swap.  Should migration succeed under these circumstances?

I'm guessing no... because I doubt the kernel will be willing to
overcommit that much memory.  I'm not familiar with the overcommit
policies, so I can't say conclusively.

That said, vmadump (the migrator piece) is smart enough not to send
zeroed pages.  Since it *is* a file mapping, it will probably read all
those pages on the front end to make sure they are indeed all zeros.
It's going to try to make a 1.5G anonymous mapping on the other end
and patch in whatever pages aren't zero.

> The syscall fails with EINVAL.  This I thought strange in that if
> the slave has insufficient memory surely you would return ENOMEM?

Given that you're trying to allocate 1.5G, I'm going to guess that came
from this snippet in vmadump_common.c:

    /* Load the data from the dump file */
    down_write(&current->mm->mmap_sem);
    addr = do_mmap(0, head->start, head->end - head->start,
                   PROT_READ|PROT_WRITE|PROT_EXEC, mmap_flags, 0);
    up_write(&current->mm->mmap_sem);
    if (addr != head->start) {
        printk("do_mmap(0, %08lx, %08lx, ...) = 0x%08lx (failed)\n",
               head->start, head->end - head->start, addr);
        return -EINVAL;
    }

That really should just pass through whatever the mmap error is.  I
think I was trying to keep the possible set of errors returned by
vmadump smaller than the full set of errors that all the syscalls it
uses can return.  E.g. "connection reset by peer" is a TCP-ism, and
maybe vmadump should just say EIO in that case.
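[Illustration: a minimal, untested sketch of the pass-through Erik
suggests, against the 2.6.9-era snippet above.  It assumes do_mmap()
reports failure as a negative errno cast to unsigned long, which
IS_ERR() from <linux/err.h> can recognize:]

    if (addr != head->start) {
        printk("do_mmap(0, %08lx, %08lx, ...) = 0x%08lx (failed)\n",
               head->start, head->end - head->start, addr);
        if (IS_ERR((void *) addr))
            return (int) addr;   /* pass through e.g. -ENOMEM */
        return -EINVAL;          /* mapped, but not at the requested address */
    }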
> I spent some time reading kernel/move.c -- process2move() and
> send_process(), but couldn't deduce whether or not ENOMEM would be
> returned in the case where the slave had insufficient memory.
>
> Really this is 2 questions:
>
> * How does vmadump and/or 2.6.9-cm46 behave when migrating
>   overcommitted space?  (The space is a map of /dev/zero with
>   PROT_NONE.)

It should be the same as vanilla 2.6.9 (I don't know exactly what that
behavior is).  vmadump, with the default set of arguments that
bproc_move uses, will construct the following:

For the type of regions you're talking about:
 - Only non-zero pages are sent:
   - For anonymous mappings, zeroness can often be determined via a
     page table walk.
   - For file mappings, it's going to page in each page and check it.
 - Regions are recreated as anonymous mappings on the remote machine.
 - Only the non-zero pages are paged in (and written to).

For things that get sent as file references (bplib -l):
 - Only modified pages get sent.
 - Regions are recreated as file mappings.
 - Modified pages are patched in.

I think it might be a win to add /dev/zero to the library list on the
front end.

> * If a migration should fail due to lack of memory, what does
>   sys_bproc return?

Looks like EINVAL, although it probably shouldn't.

> [[Note: I'm just trying to understand what's happening.  Not saying
>   there's any problem with BProc.  We know that our big-bang allocation
>   scheme is braindead and needs fixing.]]

Nod.  In theory, except for the humongous overcommit on the slave node,
it seems like it *should* work fine with the migrator.

- Erik
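[Illustration: a minimal userspace sketch of the "big-bang" reservation
scheme discussed above -- a large /dev/zero mapping created PROT_NONE up
front, then mprotect'ed into use piecewise.  The names, sizes, and
structure are made up for the example; this is not Valgrind's actual
code, just the general shape of the mapping that vmadump would see:]

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHADOW_SIZE (1500UL * 1024 * 1024)   /* ~1.5G reservation */
    #define CHUNK_SIZE  (64UL * 1024)            /* committed on demand */

    static char *shadow_base;

    /* Reserve the whole shadow range up front by mapping /dev/zero
     * PROT_NONE; none of it is usable yet. */
    static int shadow_init(void)
    {
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0)
            return -1;
        shadow_base = mmap(NULL, SHADOW_SIZE, PROT_NONE, MAP_PRIVATE, fd, 0);
        close(fd);
        return shadow_base == MAP_FAILED ? -1 : 0;
    }

    /* Bring one chunk into use the first time it is needed. */
    static int shadow_commit(unsigned long offset)
    {
        return mprotect(shadow_base + offset, CHUNK_SIZE,
                        PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        if (shadow_init() != 0 || shadow_commit(0) != 0) {
            perror("shadow setup");
            return 1;
        }
        shadow_base[0] = 1;   /* only now does a real page get touched */
        printf("reserved %lu MB, committed %lu KB\n",
               SHADOW_SIZE >> 20, CHUNK_SIZE >> 10);
        return 0;
    }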