From: Julian S. <js...@ac...> - 2005-02-23 18:12:36
Recently I had to futz with low-level memory management to get
programs started on AMD64. As it turned out my problems were
self-inflicted, but as a side-effect I trogged around the
segment-tracking/address-space-management stuff in great detail.
Since we have chosen not to virtualise the client's address space,
we have to share it. And since the client doesn't know we're there,
we have to take complete control of the address space. That much
is implemented, and it's a good idea.

There are, however, aspects of the implementation that concern me:
* There's a fundamental circularity which has caused segfaults
at least twice in the past. The segment list manager needs
the malloc/free manager to be operating, but the malloc/free
manager may cause segment list entries to be allocated.
In effect we have two competing low level memory managers,
a situation which is nonsensical and should be fixed. The
segment-list-manager (which we should really call the
Address Space Manager, ASpaceMgr) is fundamental and should be
self-contained. The malloc/free manager should be built on
top of ASpaceMgr. The point at which debug info reading is
done should be moved upwards in the services hierarchy
to enable this split to be made.
* Abstraction boundaries in vg_mylibc have been muddied. Once
upon a time, VG_(mmap) and VG_(mprotect) simply passed requests
through to the kernel. Now they are part of the segment-mapping
game and make enquiries against the segment list. That functionality
needs to exist somewhere, but it's confusing that it happens
at that low a level.
* I found the code hard to understand (== maintain) and there is
no comprehensive statement of what it is and is not trying to
achieve.
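To make the first concern concrete, here is a sketch of the layering I
have in mind, written in Python purely as pseudocode -- the names
(ASpaceMgr, MallocMgr, Segment) are mine and the sizes are guesses, not
a patch. The key property is that ASpaceMgr owns a fixed-size table and
never calls a dynamic allocator, so the circularity cannot arise.

```python
# Sketch of the proposed layering (all identifiers invented).
# At ~48 bytes per record, even a 1 MB static area holds over 20,000
# entries, so a few thousand segments is comfortably within reach.
MAX_SEGMENTS = 5000

class Segment:
    def __init__(self, start, size, owner):
        self.start, self.size, self.owner = start, size, owner

class ASpaceMgr:
    """Fundamental, self-contained layer: tracks segments in storage
    that is fixed at startup, analogous to a static array in C."""
    def __init__(self):
        self._table = []   # stand-in for statically allocated storage
        # No debug-info reading, no malloc: nothing sits below this layer.

    def add_segment(self, start, size, owner):
        if len(self._table) >= MAX_SEGMENTS:
            raise MemoryError("static segment table full")
        seg = Segment(start, size, owner)
        self._table.append(seg)
        return seg

class MallocMgr:
    """Built strictly on top of ASpaceMgr: it asks ASpaceMgr for address
    space and carves it up, but ASpaceMgr never needs it in return."""
    def __init__(self, aspacemgr):
        self._aspace = aspacemgr

    def grow_heap(self, start, size):
        return self._aspace.add_segment(start, size, "valgrind")
```

The point of the shape, rather than the details: the dependency arrow
runs one way only, so there is a single low-level memory manager.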

I've also been considering how to rework address space management to
support a 64-bit world. The following builds on my proposal of a couple
of weeks ago: chop the address space into 64M chunks and do permissions
checks at that granularity.
* The 64M superblocks have 4 possible ownership states:
Unallocated
Valgrind's -- V's text, stack, static data, dynamic data
Shadow
Client
* As mentioned above, the Address Space Manager is fundamental
and self-contained. It is decoupled from the malloc/free manager.
It no longer deals with debug info loading/unloading. It does
nothing that requires dynamic memory allocation. The segment list
is to be held in statically allocated storage to make that possible.
That's not wonderful, but even a 1 MB static area should hold
enough info to track several thousand segments.
* At least for Linux, stage2 is loaded into a 64M superblock
just below 0x4000'0000 (1G). ASpaceMgr allocates superblocks
on demand: above 1G, where possible, for shadow memory and for
Valgrind's own use, and below 1G for the client. In particular
it tries hard not to put Valgrind or Shadow data in the area
below 1G.
Why 0x4000'0000?
- On a 32-bit machine, this gives the client nearly 1G of
contiguous space, should it want to do large mmaps. If
clients want to mmap more than 1G at a time, that's tough
-- use 64-bit Valgrind instead.
- On a 32-bit machine, even if the top 3/4 of the address space
is given over to the kernel, we don't have to deal with
different load addresses -- it will work as-is. Under those
conditions ASpaceMgr will have to make inroads into the top
of the 1G area, but that's unavoidable.
- In general, on a 32-bit machine, because memory is allocated
in 64M superblocks to either shadow, client or V-internal, we get
rid of all problems associated with the current hard partitioning
scheme between client and shadow memory. Big-bang allocation is
done away with. We know we can still protect V from wild writes
by the client at fairly minimal expense.
- On a 64-bit machine, all code is to be mapped in below 1 G, but
apart from that ASpaceMgr can be fairly relaxed about fragmentation
in the area above 1 G.
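As pseudocode (Python again, names invented), the superblock
bookkeeping and placement policy described above amount to very little
state: one ownership tag per 64M superblock, a shift for the
address-to-superblock mapping, and a search order that depends on who
is asking.

```python
# Illustrative sketch, not real Valgrind code: 64M superblocks with the
# four ownership states, plus the placement policy (client below 1G
# where possible, Valgrind/Shadow above it). All identifiers invented.
from enum import Enum

SB_SHIFT = 26                           # 64M == 2**26 bytes
SB_SIZE  = 1 << SB_SHIFT
N_SB     = (1 << 32) // SB_SIZE         # 64 superblocks in a 32-bit space
FIRST_ABOVE_1G = 0x40000000 // SB_SIZE  # superblock index 16

class Owner(Enum):
    UNALLOCATED = 0
    VALGRIND    = 1   # V's text, stack, static data, dynamic data
    SHADOW      = 2
    CLIENT      = 3

# One ownership tag per superblock is all the top-level state needed.
ownership = [Owner.UNALLOCATED] * N_SB

def sb_index(addr):
    """Map an address to its 64M superblock number."""
    return addr >> SB_SHIFT

def alloc_superblock(owner):
    """Claim a fresh superblock and return its base address, or None.
    Client allocations prefer the area below 1G; Valgrind and Shadow
    allocations try hard to stay out of it."""
    below = range(0, FIRST_ABOVE_1G)
    above = range(FIRST_ABOVE_1G, N_SB)
    order = (*below, *above) if owner is Owner.CLIENT else (*above, *below)
    for i in order:
        if ownership[i] is Owner.UNALLOCATED:
            ownership[i] = owner
            return i * SB_SIZE
    return None
```

Note that on a 32-bit machine the whole top-level map is 64 entries,
which is why the big-bang partitioning can go away.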

Startup then looks like this:
* Load stage2 at 0x4000'0000 - 64M
* Copy command-line/env data into this area somewhere
* Switch stacks and start stage2
* stage2: initialise ASpaceMgr, read initial segments from
/proc/self/maps
* Initialise logging, so we can print debugging info
early on. Note this means the logging mechanism cannot
do dynamic memory allocation.
* Nuke all segments except this one -- this gets the address
space in a known starting state
* Initialise the malloc/free manager
* Initialise scheduler
* Get signals in a known state; initialise signals subsystem
* Initialise any other subsystems (Vex?)
* Make the ume mechanism load the client
* Run the client
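Viewed as pseudocode (Python, step names invented for illustration),
the stage2 sequence is just a fixed ordering plus one piece of
load-address arithmetic:

```python
# The startup sequence above as an ordered sketch (names are mine).
# The one concrete number: stage2 loads one 64M superblock below 1G.
STAGE2_BASE = 0x40000000 - 64 * 1024 * 1024   # == 0x3C000000

def stage2_main():
    # Order matters: logging comes up right after ASpaceMgr so later
    # steps can print debugging info, and it must not itself allocate
    # dynamically, since the malloc/free manager is not yet up.
    return [
        "init_aspacemgr",         # read initial segments from /proc/self/maps
        "init_logging",           # no dynamic memory allocation allowed
        "nuke_foreign_segments",  # known starting state for the address space
        "init_malloc",            # layered on top of ASpaceMgr
        "init_scheduler",
        "init_signals",           # get signals into a known state
        "init_misc",              # any other subsystems (Vex?)
        "ume_load_client",
        "run_client",
    ]
```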

The only part of any difficulty is to get stage2 to a specific
address. Three possibilities:
(1) Link it to load at that address at build-time.
(2) Build it as a PIE.
(3) Use a standalone ELF .o loader/linker to load all the .o's,
link and start them.
At first (3) sounds insane, but it has a couple of advantages:
- we don't need to screw around padding the space with mmap in
stage1 to ensure stage2 and all its bits & pieces end up in
the designated 64M superblock
- it gives us 100% control over V's linking and makes it easy to
ensure we don't inadvertently depend on anything from glibc.
I like that.
The main disadvantage is that gdb would not have a clue what it
was looking at unless we found a way to convey debug info to it.

Congratulations to anybody who made it this far. Comments?
J