From: Jeremy F. <je...@go...> - 2005-02-24 00:07:56
Julian Seward wrote:
>* There's a fundamental circularity which has caused segfaults
> at least twice in the past. The segment list manager needs
> the malloc/free manager to be operating, but the malloc/free
> manager may cause segment list entries to be allocated.
>
> In effect we have two competing low level memory managers,
> a situation which is nonsensical and should be fixed. The
> segment-list-manager (which we should really call the
> Address Space Manager, ASpaceMgr) is fundamental and should be
> self-contained. The malloc/free manager should be built on
> top of ASpaceMgr. The point at which debug info reading is
> done should be moved upwards in the services hierarchy
> to enable this split to be made.
>
>* Abstraction boundaries in vg_mylibc have been muddied. Once
> upon a time, VG_(mmap) and VG_(mprotect) simply passed requests
> through to the kernel. Now they are part of the segment-mapping
> game and make enquiries against the segment list. That functionality
> needs to exist somewhere, but it's confusing that it happens
> at that low a level.
>
>* I found the code hard to understand (== maintain) and there is
> no comprehensive statement of what it is and is not trying to
> achieve.
>
>
Yeah. One conceptual difference between what you're describing and the
current code is that the Segment list is intended to document what
exists, but it isn't actually responsible for managing it. Segments can be backed
by many different kinds of VM object (mmap, shared memory, etc), and the
Segment code doesn't really care about what backs each virtual address
range, and it certainly doesn't do anything to cause those ranges to
appear/disappear. It expects to be told about what does exist and when
it changes, and it can be queried by code which makes changes to
discover what exists.

You're describing something a bit more active, which knows how to create
and destroy VM objects. That's a wider mandate, and more complex
because there are so many ways that can happen.

Certainly we need to fix the Segment manager's cyclic dependency.
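To make the passive model concrete, here's a minimal sketch of what I
mean: the list is only told about mappings and answers queries, it
never creates or destroys anything. Names like sl_add/sl_find are
purely illustrative, not anything in the tree.

```c
/* Sketch of a "passive" segment list: records what exists, manages nothing. */
#include <assert.h>
#include <stddef.h>

typedef enum { SK_ANON, SK_FILE, SK_SHM } seg_kind_t;  /* what backs the range */

typedef struct {
    unsigned long start, len;   /* virtual address range */
    seg_kind_t    kind;         /* backing object; the list doesn't manage it */
} Segment;

#define MAX_SEGS 16
static Segment segs[MAX_SEGS];
static int     n_segs = 0;

/* Called by whoever performed the mapping: the list is told, it never acts. */
static void sl_add(unsigned long start, unsigned long len, seg_kind_t kind)
{
    assert(n_segs < MAX_SEGS);
    segs[n_segs++] = (Segment){ start, len, kind };
}

/* Query interface for code that is about to change the address space. */
static const Segment *sl_find(unsigned long addr)
{
    for (int i = 0; i < n_segs; i++)
        if (addr >= segs[i].start && addr < segs[i].start + segs[i].len)
            return &segs[i];
    return NULL;  /* address not covered by any known segment */
}
```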
>I've also been considering how to rework address space management to
>support a 64-bit world. The following builds on my proposal of a
>couple of weeks ago, which suggested chopping the address space up
>into 64M chunks and doing permission checks at that granularity.
>
>* The 64M superblocks have 4 possible ownership states:
> Unallocated
> Valgrind's -- V's text, stack, static data, dynamic data
> Shadow
> Client
>
>
Is there any difference between Shadow and Valgrind? Do they behave
differently in any respect, or is it just that their contents mean
different things?
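Put another way: as far as I can see the policy could be as simple as
this sketch, where Shadow and Valgrind's superblocks are treated
identically (both off-limits to the client) and differ only in what
their contents mean. The names are hypothetical.

```c
/* The four proposed superblock ownership states.  In this sketch,
   Shadow and Valgrind behave identically w.r.t. protection. */
#include <assert.h>

typedef enum {
    SB_UNALLOCATED,
    SB_VALGRIND,   /* V's text, stack, static data, dynamic data */
    SB_SHADOW,     /* tool shadow memory */
    SB_CLIENT
} sb_owner_t;

/* Hypothetical policy check: may the client touch this superblock? */
static int client_may_access(sb_owner_t o)
{
    return o == SB_CLIENT;   /* Valgrind and Shadow are rejected alike */
}
```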
>* As mentioned above, the Address Space Manager is fundamental
> and self-contained. It is decoupled from the malloc/free manager.
> It no longer deals with debug info loading/unloading. It does
> nothing that requires dynamic memory allocation. The segment list
> is to be held in statically allocated storage to make that possible.
> That's not wonderful, but even a 1 MB static area should hold
> enough info to track several thousand segments.
>
>
Well, you could still allocate them dynamically, so long as you do it
early enough (ie, before you're so short of space that there's no room
to describe the new space you're allocating).
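For reference, the fully static version is trivial: a fixed pool of
records threaded onto a free list, so ASpaceMgr never calls malloc at
all. The hard limit you hit is pool exhaustion, not an allocation
failure mid-operation. Again, the names are made up for illustration.

```c
/* Sketch: segment records held in static storage, no dynamic allocation. */
#include <assert.h>
#include <stddef.h>

typedef struct SegRec {
    unsigned long start, len;
    struct SegRec *next;       /* free-list link while unused */
} SegRec;

#define POOL_SIZE 4096         /* a ~1 MB area holds thousands of these */
static SegRec  pool[POOL_SIZE];
static SegRec *free_list   = NULL;
static int     pool_inited = 0;

static SegRec *seg_alloc(void)
{
    if (!pool_inited) {        /* thread the whole pool onto the free list */
        for (int i = 0; i < POOL_SIZE - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[POOL_SIZE - 1].next = NULL;
        free_list = &pool[0];
        pool_inited = 1;
    }
    SegRec *r = free_list;
    assert(r != NULL);         /* pool exhausted: a hard limit, not OOM */
    free_list = r->next;
    return r;
}

static void seg_free(SegRec *r)
{
    r->next = free_list;       /* push back onto the free list */
    free_list = r;
}
```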
>* At least for Linux, stage2 is loaded into a 64M superblock
> just below 0x4000'0000 (1 G). ASpaceMgr allocates superblocks
> on demand, above 1G if it can for shadow memory and for Valgrind's
> own use, and below 1G for the client, if possible. In particular
> it tries hard not to put Valgrind or Shadow data in the area
> below 1G.
>
> Why 0x4000'0000 ?
>
> - On a 32-bit machine, this gives the client nearly 1G of
> contiguous space, should it want to do large mmaps. If
> clients want to mmap more than 1G at a time, that's tough
> -- use 64-bit Valgrind instead.
>
> - On a 32-bit machine, even if the top 3/4 of the address space
> is given over to the kernel, we don't have to deal with
> different load addresses -- it will work as-is. Under those
> conditions ASpaceMgr will have to make inroads into the top
> of the 1G area, but that's unavoidable.
>
>
I'm not sure I follow you here.
> - In general, on a 32-bit machine, because memory is allocated
> in 64M superblocks to either shadow, client or V-internal, we get
> rid of all problems associated with the current hard partitioning
> scheme between client and shadow memory. Big-bang allocation is
> done away with. We know we can still protect V from wild writes
> by the client at fairly minimal expense.
>
>
What problems are there, and how are they avoided by this scheme?
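As I understand the scheme, the claimed win is that ownership becomes a
tiny per-superblock table indexed by (addr >> 26), instead of a fixed
client/shadow address split, and the wild-write check is one shift and
one load. A sketch, assuming a 32-bit address space and hypothetical
names:

```c
/* Superblock ownership as a table indexed by (addr >> 26); 64M = 2^26. */
#include <assert.h>

typedef enum { SB_UNALLOCATED, SB_VALGRIND, SB_SHADOW, SB_CLIENT } sb_owner_t;

#define SB_SHIFT      26
#define N_SUPERBLOCKS (1UL << (32 - SB_SHIFT))   /* 64 entries cover 4G */

/* Zero-initialised, i.e. everything starts SB_UNALLOCATED. */
static sb_owner_t sb_owner[N_SUPERBLOCKS];

static sb_owner_t owner_of(unsigned long addr)
{
    return sb_owner[addr >> SB_SHIFT];
}

/* A wild client write is rejected cheaply: one shift, one load, one compare. */
static int client_write_ok(unsigned long addr)
{
    return owner_of(addr) == SB_CLIENT;
}
```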
> - On a 64-bit machine, all code is to be mapped in below 1 G, but
> apart from that ASpaceMgr can be fairly relaxed about fragmentation
> in the area above 1 G.
>
>
Er, why? We have terabytes of address space to play with. Why make
holes in it? There's no technical reason we need to put code down that low.

More generally, I don't see why these parameters need to be fixed
across targets/platforms. Is there any reason the superblock must be
64M, or that stage2 must load at 1G, on every platform? It seems
pretty easy to make these target-specific parameters and still have all
the generic code cope.

Also, choosing a fixed load address prevents Valgrind from running under
itself. I think we should preserve that ability.
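Concretely, all the generic code would need is a per-target parameter
block along these lines (entirely hypothetical names, and the amd64
values here are invented for illustration):

```c
/* Sketch: per-target address space parameters instead of baked-in constants. */
#include <assert.h>

typedef struct {
    unsigned long superblock_size;   /* need not be 64M everywhere */
    unsigned long stage2_addr;       /* need not be just below 1G */
} AspaceParams;

/* 64M superblocks, stage2 loaded 64M below 1G. */
static const AspaceParams x86_linux_params =
    { 64UL << 20, 0x40000000UL - (64UL << 20) };

/* Invented values: a 64-bit target could pick something else entirely. */
static const AspaceParams amd64_linux_params =
    { 256UL << 20, 0x38000000UL };
```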
>Startup then looks like this:
>
>* Load stage2 at 0x4000'0000 - 64M
>
>* Copy command-line/env data into this area somewhere
>
>* Switch stacks and start stage2
>
>* stage2: initialise ASpaceMgr, read initial segments from
> /proc/self/maps
>
>* Initialise logging, so we can print debugging info
> early on. Note this means the logging mechanism cannot
> do dynamic memory allocation.
>
>* Nuke all segments except this one -- this gets the address
> space in a known starting state
>
>* Initialise the malloc/free manager
>
>* Initialise scheduler
>
>* Get signals in a known state; initialise signals subsystem
>
>* Initialise any other subsystems (Vex?)
>
>* Make the ume mechanism load the client
>
>* Run the client
>
>
>The only part of any difficulty is to get stage2 to a specific
>address. Three possibilities:
>
>(1) Link it to load at that address at build-time.
>
>(2) Build it as a PIE.
>
>(3) Use a standalone ELF .o loader/linker to load all the .o's,
> link and start them.
>
>At first (3) sounds insane, but it has a couple of advantages:
>
>- we don't need to screw around padding the space with mmap in
> stage1 to ensure stage2 and all its bits & pieces end up in
> the designated 64 M superblock
>
>- it gives us 100% control over V's linking and makes it easy to
> ensure we don't inadvertently depend on anything from glibc.
> I like that.
>
>The main disadvantage is that gdb would not have a clue what it
>was looking at unless we found a way to convey debug info to it.
>
>
No, the main disadvantages are that this is fantastically complex, makes
us very dependent on the toolchain (particularly all the little
side-channels between gcc and binutils), and is architecture and OS
dependent (since not everyone uses ELF). By comparison, padding the
address space and using libc carefully are very simple and portable.
I'm still in favour of 1) using PIE where available (otherwise choosing
a static load address), 2) using a flexible address space configuration
which allows Valgrind to run under itself, and 3) not getting too
involved with the object file formats.
J