|
From: Nicholas N. <nj...@ca...> - 2004-08-31 22:50:43
|
Hi,
We have various problems with memory layouts at the moment, as per bug
#82301. In what follows I describe the current approach, some problems,
and suggest some solutions.
I think we need to address this, and quickly -- it's clearly one of the
biggest problems at the moment. I would estimate we're averaging about
one email a day from people having problems. Now that 2.2.0 is out, I
think it will become even more important.
Comments welcome. Apologies for yet another long email from me. Thanks.
N
Current memory layout:
Nulgrind Addrcheck
Memcheck
CLIENT_BASE +-------------------------+ 0
| client address space | (937MB) (833MB) (440MB)
: client heap vvvvvvvvvv:
client_mapbase - - 3a965000 3412a000 1b8e4000
| client stack ^^^^^^^^^^| (1877MB) (1668MB) (883MB)
client_end +-------------------------+ aff00000 9c600000 52c00000
| redzone (1MB) |
shadow_base +-------------------------+ b0000000 9c700000 52d00000
| shadow mem (may be 0) | (0) (312MB) (1489MB)
shadow_end +-------------------------+ b0000000 affc0000 afe80000
: gap (may be 0 sized) : (0) (0) (1MB)
valgrind_base +-------------------------+ b0000000
| stage2 + V's heap | (16MB)
| (barely used)|
(vg_mapbase) - - b1000000
| valgrind .so's/maps vvvv| (240MB)
- -
| valgrind stack ^^^^|
valgrind_end +-------------------------+ c0000000
: kernel : (1GB)
+-------------------------+
Notes:
- stage1 loads stage2 at 0xb0000000; stage1 is then overwritten with client
- shadow memory is allocated with a single "big-bang" mmap() at startup.
-----------------------------------------------------------------------------
Problems + solutions
-----------------------------------------------------------------------------
P1. It assumes 3G:1G user/kernel split.
- For 4G kernels, Valgrind gets the whole extra 1GB for its own use (I
think). This works, but is sub-optimal.
- For other layouts (eg. 2G:2G, or even 2.9G:1.1G) it just doesn't work.
Changing KICKSTART_BASE is a workaround, if you know to do that. (But
2G:2G still cannot run Memcheck; see below.)
S1. This can be solved easily, by using position-independent executables
(PIE). We can do a configure-time test for PIE, and if supported, make
stage2 a PIE. Then stage1 can decide where stage2 should go, by doing
some kind of run-time test (which would look at where the stack is, or
use shmat(), or something, to determine where the user/kernel division
lies).
This change is pretty uncontroversial, and Paul already has a patch for
it (which I don't think should be committed as-is, but is a good start).
For non-PIE-supporting systems, we could build 3 or 4 versions of
stage2, and choose the most appropriate one (I have a patch for this).
Or just a single fixed-location back-up stage2 might be enough.
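The run-time test could be as simple as the following sketch (the helper
name is hypothetical, and this is not Paul's patch; a real implementation
would cross-check with shmat() or mmap() probes as suggested above):

```c
#include <stdint.h>

/* The kernel places the initial stack just below the top of user space,
 * so rounding a stack address up to a coarse boundary gives an estimate
 * of where the user/kernel division lies.  On a 3G:1G split this should
 * land on 0xc0000000; on 2G:2G, on 0x80000000. */
static uintptr_t estimate_user_top(void)
{
    int probe;                              /* lives on the stack */
    uintptr_t sp   = (uintptr_t)&probe;
    uintptr_t gran = 256u * 1024 * 1024;    /* 256MB granularity */
    return (sp + gran) & ~(gran - 1);       /* round up */
}
```

stage1 could then decide where to load a PIE stage2 relative to that
estimate.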
-----------------------------------------------------------------------------
P2. For kernels with "overcommit" mmapping off -- which prevents a process
from allocating more address space than the available swap space -- you
need at least 1.5GB of swap for Memcheck to run, because swap must be at
least as large as any individual segment. (And I think users with ulimit -v
set suffer the same problem.)
S2. Avoiding this requires not using the big-bang shadow allocation
method, and that shadow memory instead be done incrementally. (More about
that below.)
-----------------------------------------------------------------------------
P3. Machines with small user spaces (eg. 2G:2G machines) cannot run
Memcheck, because the big shadow memory region covers 0x40000000, which
is where normal programs want to put their shadow maps.
S3. Fixing this requires that the boundary between client and Valgrind
is not fixed, which requires incremental shadow memory allocation.
-----------------------------------------------------------------------------
P4. Tools sometimes run out of address space when there is still address
space in other regions free.
S4. The rigidity of the client/shadow-mem/valgrind division must be
reduced to fix this.
-----------------------------------------------------------------------------
P5. Large executables (eg. 200MB+) cannot be loaded in memory by
Valgrind in order to read their debug info.
S5. Two possible fixes:
- make client/shadow-mem/valgrind divisions less rigid
- incremental debug info reading (but that's impossible for stabs)
-----------------------------------------------------------------------------
Discussion
-----------------------------------------------------------------------------
P1 can be solved independently of the others, and uncontroversially.
P2--P5 are all related. For 32-bit machines, big-bang shadow memory
allocation does not seem appropriate -- the size of the map required
causes P2. The resulting rigidity of the address space causes P3--P5.
The downside of switching to incremental shadow memory is that it makes
direct-offset shadow addressing impossible, at least on 32-bit.
Direct-offset seems much more plausible on 64-bit, where we have more
address space to play with. But the benefits of direct-offset are still not
clear (Jeremy's experiments didn't show a speed improvement), and we don't
have any 64-bit ports working yet.
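For concreteness, the two addressing schemes being compared can be
sketched like this (constants and chunk size are hypothetical, not
Valgrind's actual ones):

```c
#include <stdint.h>
#include <stddef.h>

#define SHADOW_BASE 0x50000000u   /* hypothetical fixed offset */

/* Direct-offset: pure arithmetic per access, but requires one contiguous
 * shadow region at a fixed distance from client memory -- the "big bang"
 * map that causes P2/P3.  Here: 1 shadow bit per client byte, so divide
 * the address by 8. */
static inline uint32_t shadow_direct(uint32_t addr)
{
    return SHADOW_BASE + (addr >> 3);
}

/* Incremental: a primary map from 64KB chunk number to an independently
 * mmap'd shadow chunk (NULL until first touched).  Costs an extra load
 * and branch per access, but the chunks can live anywhere, so the
 * layout stays flexible. */
#define CHUNK_BITS 16
static uint8_t *primary_map[1u << (32 - CHUNK_BITS)];

static inline uint8_t *shadow_incremental(uint32_t addr)
{
    uint8_t *chunk = primary_map[addr >> CHUNK_BITS];
    if (chunk == NULL)
        return NULL;              /* caller would allocate on demand */
    return chunk + (addr & ((1u << CHUNK_BITS) - 1));
}
```

The extra load/branch in the incremental path is the cost being debated
below.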
-----------------------------------------------------------------------------
Solution
-----------------------------------------------------------------------------
A couple of steps:
- Don't use big-bang allocation for shadow memory. Make shadow memory
maps allocated just like any other Valgrind/tool memory. Thus
Valgrind would have a single region for itself, instead of two
separate ones. This would solve P2, P4 and P5.
- Make the client/valgrind division movable. Client memory would grow
up, Valgrind memory would grow down. This would largely solve P3.
Only question here is: where does the stack go? If the stack size is
ulimited (eg. to 8MB), there's little problem. Otherwise, perhaps
below client_mapbase, so it grows down towards the upward-growing
heap. [nb: what happens if they collide? undefined?]
Another possibility:
- for --pointercheck=no, erase the client/Valgrind division totally.
stage2 would still be put up high, but then all other mmapping (client
and Valgrind) would just go in a single area. The client stack would
go just below stage2. Of course, the client could then clobber
Valgrind's memory. This probably makes more sense with incremental
shadow memory.
Advantages of this:
- allows a bigger client stack (eg. 1GB+)
- for 2G:2G systems, would run out of memory later than otherwise
(because with the division, the gap between the client executable
and 0x40000000 -- the client mapbase -- cannot be utilised by
Valgrind).
----
A more radical solution: truly virtualise the address space (rather
than just partitioning it) -- ie. Valgrind implements its own virtual
MMU and page table. The exact details of how it would work are not yet
clear. If even feasible, this is a long-term solution; something else
should be done in the meantime.
-----------------------------------------------------------------------------
My suggestion
-----------------------------------------------------------------------------
In this order, make the following changes.
1. Use PIE where possible, solving P1.
2. Switch big-bang shadow memory allocation to incremental, solving P2,
P4, P5.
3. Make the client/Valgrind division movable, largely solving P3.
4. Maybe make the --pointercheck=no change, if it seems useful.
|
|
From: Eric E. <eri...@fr...> - 2004-09-01 01:03:40
|
A few ideas as they come, if they can help:
1. We can take full control of all the memory space
by mapping everything available using MAP_NORESERVE.
If the client requires some space, we decide where we want
it to be mapped, unmap the reserved zone, and do the syscall.
The kernel won't have any other choice than mapping into the
zone we have freed, since it'll be the only zone available.
2. A run-time test is the only solution if we want to be sure
it works in all situations. Furthermore it is mandatory
for binary distributions. We have the means to do it,
with either the shmat trick or the previous reserving solution.
3. Not sure we can do that, but maybe PIE is not mandatory.
Can we have stage2 compiled only as PIC, mmap it where we
need it, and perform the relocations ourselves? I've tried PIE
a bit, and I feel it is not always working; testing for
platforms on which it works could become a nightmare... But
I'm not an ld expert...
> P2. For kernels with "overcommit" mmapping off -- which prevents a
> process from allocating more address space than the available swap space
What are these kernels doing exactly ? Can't we reserve the space with
MAP_NORESERVE, and when we really need to use it, remap normally ?
If not possible, we can probably reserve the ranges using shm mappings.
> P5. Large executables (eg. 200MB+) cannot be loaded in memory by
> Valgrind in order to read their debug info.
>
> S5. Two possible fixes:
> - make client/shadow-mem/valgrind divisions less rigid
> - incremental debug info reading (but that's impossible for stabs)
Indeed there are other solutions. We are likely to always find
someone who has an exe bigger than the space we have available...
Either:
- delegate debug info reading to another process which has more free
space; communicate through IPC or whatever
- do not read the debug info at all; have it done by another process
doing the post-processing (and filtering) of the Valgrind results afterwards
- recommend using dwarf2 instead of stabs, and make the reader incremental
- tell affected users to buy 64-bit boxes (hummmm.....)
> The downside of switching to incremental shadow memory is that it makes
> direct-offset shadow addressing impossible, at least on 32-bit.
I wouldn't be so categorical. We can certainly do a big-bang _reservation_
(not allocation), using either MAP_NORESERVE or shm mappings, and
incrementally remap parts as we need them.
> Only question here is: where does the stack go? If the stack size is
> ulimited (eg. to 8MB), there's little problem. Otherwise, perhaps
> below client_mapbase, so it grows down towards the upward-growing
> heap. [nb: what happens if they collide? undefined?]
Wherever they are, if anything collides, Valgrind should issue
a precise error message, and provide a command-line argument so
users can ask to reserve e.g. 256MB of stack. I think it is safe
to have a default, relatively limited, maximum stack size, so
stack overflows can be detected quite quickly.
Cheers
--
Eric
|
|
From: Nicholas N. <nj...@ca...> - 2004-09-01 21:21:33
|
On Wed, 1 Sep 2004, Eric Estievenart wrote:

>> The downside of switching to incremental shadow memory is that it makes
>> direct-offset shadow addressing impossible, at least on 32-bit.
>
> I wouldn't be so categorical. We can certainly do a big-bang _reservation_
> (not allocation), using either MAP_NORESERVE or shm mappings, and
> incrementally remap parts as we need them.

Sure, but then you still have taken a great big chunk out of the address
space for shadow memory (irrespective of whether you've allocated it or
not), and that still causes all the problems whereby the layout is too
rigid.

N
|
From: Jeremy F. <je...@go...> - 2004-09-01 08:36:08
|
On Tue, 2004-08-31 at 23:50 +0100, Nicholas Nethercote wrote:

> P2. For kernels with "overcommit" mmapping off -- which prevents a process
> from allocating more address space than the available swap space -- you
> need at least 1.5GB of swap for Memcheck to run, because swap must be at
> least as large as any individual segment. (And I think users with ulimit -v
> set suffer the same problem.)
>
> S2. Avoiding this requires not using the big-bang shadow allocation
> method, and that shadow memory instead be done incrementally. (More about
> that below.)

Does MAP_NORESERVE work? I'm not sure that incremental mmapping is enough
in all circumstances, because you're trying to work around heuristics the
kernel applies to allocations. MAP_NORESERVE simply tells the kernel that
it shouldn't apply any heuristics to this allocation.

If you have strict overcommit enabled, then there's no choice but to have
enough swap -- it assumes you're going to use every byte you map.

> P3. Machines with small user spaces (eg. 2G:2G machines) cannot run
> Memcheck, because the big shadow memory region covers 0x40000000, which
> is where normal programs want to put their shadow maps.

Shadow maps? Do you mean something else? Client mapbase? We can put that
somewhere else (put it high, and allocate mappings growing down, for
example).

> S3. Fixing this requires that the boundary between client and Valgrind
> is not fixed, which requires incremental shadow memory allocation.
>
> P4. Tools sometimes run out of address space when there is still address
> space in other regions free.
>
> S4. The rigidity of the client/shadow-mem/valgrind division must be
> reduced to fix this.
>
> P5. Large executables (eg. 200MB+) cannot be loaded in memory by
> Valgrind in order to read their debug info.
>
> S5. Two possible fixes:
> - make client/shadow-mem/valgrind divisions less rigid
> - incremental debug info reading (but that's impossible for stabs)

Well, even for stabs there's no need to mmap the whole thing in; if we
just read (or mmapped) chunks at a time and processed them serially, we
can extract everything needed.

> P1 can be solved independently of the others, and uncontroversially.
>
> P2--P5 are all related. For 32-bit machines, big-bang shadow memory
> allocation does not seem appropriate -- the size of the map required
> causes P2. The resulting rigidity of the address space causes P3--P5.
>
> The downside of switching to incremental shadow memory is that it makes
> direct-offset shadow addressing impossible, at least on 32-bit.
> Direct-offset seems much more plausible on 64-bit, where we have more
> address space to play with. But the benefits of direct-offset are still
> not clear (Jeremy's experiments didn't show a speed improvement), and we
> don't have any 64-bit ports working yet.

The trouble is that memcheck & co do have a fixed ratio of shadow memory
to real memory used. If the client uses its address space sparsely then
it causes sparse (wasteful) use of the shadow memory, but since we get to
place all the mmaps, we needn't make it have sparse memory use. The
exception is if the client explicitly places its mappings, but I don't
think that's common.

So I know that people are running into memory problems, but it isn't
clear to me that we can't solve them by using the address space more
densely.

Tools which don't have a fixed ratio (cachegrind) are another issue.
They're not, technically, using shadow memory (since there isn't the 1:1
relationship between client addresses and shadow addresses), but Valgrind
heap.

> A couple of steps:
>
> - Don't use big-bang allocation for shadow memory. Make shadow memory
> maps allocated just like any other Valgrind/tool memory. Thus
> Valgrind would have a single region for itself, instead of two
> separate ones. This would solve P2, P4 and P5.
>
> - Make the client/valgrind division movable. Client memory would grow
> up, Valgrind memory would grow down. This would largely solve P3.
>
> Only question here is: where does the stack go? If the stack size is
> ulimited (eg. to 8MB), there's little problem. Otherwise, perhaps
> below client_mapbase, so it grows down towards the upward-growing
> heap. [nb: what happens if they collide? undefined?]

We could put the stack below the executable. The x86 ABI allows this (and
Solaris x86 does this). It could break some programs which assume the
stack is high, but most won't care.

J
|
From: Tom H. <th...@cy...> - 2004-09-01 08:49:24
|
In message <1094008711.3129.27.camel@localhost>
Jeremy Fitzhardinge <je...@go...> wrote:
> On Tue, 2004-08-31 at 23:50 +0100, Nicholas Nethercote wrote:
>> P2. For kernels with "overcommit" mmapping off -- which prevents a process
>> from allocating more address space than the available swap space -- you
>> need at least 1.5GB of swap for Memcheck to run, because swap must be at
>> least as large as any individual segment. (And I think users with ulimit -v
>> set suffer the same problem.)
>>
>> S2. Avoiding this requires not using the big-bang shadow allocation
>> method, and that shadow memory instead be done incrementally. (More about
>> that below.)
>
> Does MAP_NORESERVE work? I'm not sure that incremental mmaping is
> enough in all circumstances, because you're trying to work around
> heuristics the kernel applies to allocations. MAP_NORESERVE simply
> tells the kernel that it shouldn't apply any heuristics to this
> allocation.
>
> If you have strict overcommit enabled, then there's no choice but to
> have enough swap - it assumes you're going to use every byte you map.
I haven't looked to see what the kernel actually does, but the manual
page claims (and it may well be wrong) that MAP_NORESERVE explicitly
prevents pre-allocation of swap space for the mapping:
MAP_NORESERVE
(Used together with MAP_PRIVATE.) Do not reserve swap space
pages for this mapping. When swap space is reserved, one has the
guarantee that it is possible to modify this private copy-on-
write region. When it is not reserved one might get SIGSEGV
upon a write when no memory is available.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Ashley P. <as...@qu...> - 2004-09-01 09:55:09
|
On Tue, 2004-08-31 at 23:50, Nicholas Nethercote wrote:

> The downside of switching to incremental shadow memory is that it makes
> direct-offset shadow addressing impossible, at least on 32-bit.
> Direct-offset seems much more plausible on 64-bit, where we have more
> address space to play with. But the benefits of direct-offset are still
> not clear (Jeremy's experiments didn't show a speed improvement), and we
> don't have any 64-bit ports working yet.

Somewhat of an aside, as these problems all need to be fixed for 32-bit
systems anyway, but 64-bit doesn't give you the address space freedom you
might expect: at least one arch is limited to three virtual address space
regions of one terabyte each.

Ashley
|
From: Eric E. <eri...@fr...> - 2004-09-01 18:00:28
|
Jeremy Fitzhardinge wrote:

> Does MAP_NORESERVE work? I'm not sure that incremental mmapping is
> enough in all circumstances, because you're trying to work around
> heuristics the kernel applies to allocations. MAP_NORESERVE simply
> tells the kernel that it shouldn't apply any heuristics to this
> allocation.

FYI, I've tried reserving all the memory space with MAP_NORESERVE, and no
swap (swapoff'd before testing), on a normal 3G:1G 2.6 kernel. With that,
I'm able to reserve 2GB in memory, in the range 0x40000000 to 0xC0000000.
Strangely, mmap( NULL, ... MAP_NORESERVE ) will never return a zone in
the 0-0x40000000 range, even if there are no other free ranges... To
reserve the free chunks of that area, many shmat() mappings work, but you
cannot be sure you have reserved everything unless you scan
/proc/self/maps or use a binary-search heuristic to try to grab
everything.

The documentation (linux/Documentation/vm/overcommit-accounting) says (I
summarise here):

  3 modes:
  0 - Heuristic handling; refuses obvious overcommits (default)
  1 - No overcommit handling
  2 - New strict overcommit. Commit space must be < swap + a percentage
      of RAM, set through the vm.overcommit sysctls
      (/proc/sys/vm/overcommit_(memory|ratio)).

> In mode 2 the MAP_NORESERVE flag is ignored.

Oops, we (may) have a problem... but:

> The overcommit is based on the following rules:
> For an anonymous or /dev/zero map
> SHARED - size of mapping
> PRIVATE READ-only - 0 cost (but of little use)

Maybe of little use in general, but it is exactly what we need :-)

So I performed the tests, with the 3 types of overcommit, and PROT_NONE
and MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE...

> If you have strict overcommit enabled, then there's no choice but to
> have enough swap - it assumes you're going to use every byte you map.

[trumpets] It worked!!!! I reserved the 3GB of usable addressing space
(2GB with the mmap, the remaining 1GB with a repeated shm mapping), all
with less than 500MB of memory available...

> Well, even for stabs there's no need to mmap the whole thing in; if we
> just read (or mmapped) chunks at a time and processed them serially, we
> can extract everything needed.

Indeed, there is always a solution. But we have to store indices into the
.stabstr section and try to read it sequentially (or map it chunk by
chunk, or whatever) because it is a random-access section. But when
Nicholas said "incremental loading", he meant loading debug info for
sections/compile units as we need them. That's another problem, not
completely unrelated to the "I can't map the whole file to read debug
info" one. Not sure it's worth doing for now. How many people still use
stabs (on the platforms we support...)?

> The trouble is that memcheck & co do have a fixed ratio of shadow memory
> to real memory used. If the client uses its address space sparsely then
> it causes sparse (wasteful) use of the shadow memory, but since we get
> to place all the mmaps, we needn't make it have sparse memory use. The
> exception is if the client explicitly places its mappings, but I don't
> think that's common.

The client seldom places its own mappings, but we must _reserve_ areas,
otherwise the kernel may decide to place a client mapping in an area
which will bother us later...

BTW, now that we are speaking of memory layout, I think it is very
important to keep in mind that it would be great to bootstrap V...

Cheers ;-)
--
Eric
|
From: Jeremy F. <je...@go...> - 2004-09-01 20:46:51
|
On Wed, 2004-09-01 at 20:00 +0200, Eric Estievenart wrote:

>> Well, even for stabs there's no need to mmap the whole thing in; if we
>> just read (or mmapped) chunks at a time and processed them serially, we
>> can extract everything needed.
>
> Indeed, there is always a solution. But we have to store indices into
> the .stabstr section and try to read it sequentially (or map it chunk
> by chunk, or whatever) because it is a random-access section.

That should be reasonably easy to manage (depends on how big the strings
are compared to everything else).

> But when Nicholas said "incremental loading", he meant loading debug
> info for sections/compile units as we need them. That's another
> problem, not completely unrelated to the "I can't map the whole file to
> read debug info" one.

Yes, I know, exactly.

> Not sure it's worth doing for now. How many people still use stabs (on
> the platforms we support...)?

Valgrind's dwarf support is pretty weak compared to the stabs support --
there's nothing there to extract type information.

> The client seldom places its own mappings, but we must _reserve_ areas,
> otherwise the kernel may decide to place a client mapping in an area
> which will bother us later...

Only occasionally. We explicitly place all the mappings the client makes
with mmap, for example. We only need to resort to padding things out
either before we gain full control (constraining ld.so to put things in
the right place), or for stupid syscalls which create mappings without
asking where (the async IO thing).

> BTW, now that we are speaking of memory layout, I think it is very
> important to keep in mind that it would be great to bootstrap V...

You mean self-virtualization? That's a goal, but it's tricky.

J
|
From: Nicholas N. <nj...@ca...> - 2004-09-01 21:44:19
|
On Tue, 31 Aug 2004, Jeremy Fitzhardinge wrote:

>> P2. For kernels with "overcommit" mmapping off -- which prevents a process
>> from allocating more address space than the available swap space -- you
>> need at least 1.5GB of swap for Memcheck to run, because swap must be at
>> least as large as any individual segment. (And I think users with ulimit -v
>> set suffer the same problem.)
>>
>> S2. Avoiding this requires not using the big-bang shadow allocation
>> method, and that shadow memory instead be done incrementally. (More about
>> that below.)
>
> Does MAP_NORESERVE work? I'm not sure that incremental mmapping is
> enough in all circumstances, because you're trying to work around
> heuristics the kernel applies to allocations. MAP_NORESERVE simply
> tells the kernel that it shouldn't apply any heuristics to this
> allocation.
>
> If you have strict overcommit enabled, then there's no choice but to
> have enough swap - it assumes you're going to use every byte you map.

I'm not sure if it was clear what I meant by "incremental shadow memory"
here. I meant don't do a big-bang mmap (eg. 1.5GB for Memcheck) at the
start; rather, only allocate shadow memory chunks as needed (like in
2.0.0). The good thing about this is that shadow memory chunks can be
interleaved with other Valgrind memory, giving much more flexibility, so
that address space is much less likely to be exhausted.

>> P3. Machines with small user spaces (eg. 2G:2G machines) cannot run
>> Memcheck, because the big shadow memory region covers 0x40000000, which
>> is where normal programs want to put their shadow maps.
>
> Shadow maps? Do you mean something else? Client mapbase?

Yes, sorry for the confusion.

>> S5. Two possible fixes:
>> - make client/shadow-mem/valgrind divisions less rigid
>> - incremental debug info reading (but that's impossible for stabs)
>
> Well, even for stabs there's no need to mmap the whole thing in; if we
> just read (or mmapped) chunks at a time and processed them serially, we
> can extract everything needed.

Hmm, yes, good idea. Anyone want to volunteer?

>> The downside of switching to incremental shadow memory is that it makes
>> direct-offset shadow addressing impossible, at least on 32-bit.
>
> The trouble is that memcheck & co do have a fixed ratio of shadow memory
> to real memory used. If the client uses its address space sparsely then
> it causes sparse (wasteful) use of the shadow memory,

Exactly, that's why I'm arguing against direct-offset shadow addressing.

> but since we get to place all the mmaps, we needn't make it have sparse
> memory use. The exception is if the client explicitly places its
> mappings, but I don't think that's common.
>
> So I know that people are running into memory problems, but it isn't
> clear to me that we can't solve them by using the address space more
> densely.

(I assume you mean the client portion of the address space?)

How do you propose to use the client address space more densely? I can't
see how this would work.

I'm not sure if we're all on the same wavelength with this stuff.

> Tools which don't have a fixed ratio (cachegrind) are another issue.
> They're not, technically, using shadow memory (since there isn't the 1:1
> relationship between client addresses and shadow addresses), but
> Valgrind heap.

I'm not sure why you say "technically"; Cachegrind (and Calltree,
Nulgrind, and Massif) don't use shadow memory at all. Much of the
discussion doesn't apply to them. However, they are still affected by
some of the rigidity problems, eg. Calltree suffers from Problem P4.

(And "Valgrind heap" is misleading because Valgrind no longer has a heap
as such; I took it out when I rejigged the memory layout stuff. It now
only allocates via maps.)

>> Only question here is: where does the stack go? If the stack size is
>> ulimited (eg. to 8MB), there's little problem. Otherwise, perhaps
>> below client_mapbase, so it grows down towards the upward-growing
>> heap. [nb: what happens if they collide? undefined?]
>
> We could put the stack below the executable. The x86 ABI allows this
> (and Solaris x86 does this). It could break some programs which assume
> the stack is high, but most won't care.

A problem with that is that on x86-64 executables are mapped very low,
0x400000 I think (4MB), which doesn't leave enough room for even an 8MB
stack. I like the idea of putting it just below client_mapbase better.

N
|
From: Eric E. <eri...@fr...> - 2004-09-02 00:07:53
|
Nicholas Nethercote wrote:

> I'm not sure if it was clear what I meant by "incremental shadow memory"
> here. I meant don't do a big-bang mmap (eg. 1.5GB for Memcheck) at the
> start; rather, only allocate shadow memory chunks as needed (like in
> 2.0.0). The good thing about this is that shadow memory chunks can be
> interleaved with other Valgrind memory, giving much more flexibility, so
> that address space is much less likely to be exhausted.

I understand better now. I tried to focus first on a solution where we
kept the current memory layout, but ensured that the big-bang mmap would
work on strict-overcommit systems. But this would not be a proper
definitive solution.

Undoubtedly, doing incremental shadow allocation will help a lot, but do
we know the performance penalty it implies? For a shadow A access, the
current code does something like shifting by 8 and adding a constant. For
the V access, it is a simple constant add. When using ISM (Incremental
Shadow Memory), we must locate the block (through the skiplist, I guess),
and add the (map-shadow) shift to the address. Small caching will help,
but I fear the penalty may not be negligible...

> Hmm, yes, good idea. Anyone want to volunteer? (incremental stabs)

Hmmm. My pipe is full. I'd rather start with incremental+lazy dwarf
reading ;-)

>> The trouble is that memcheck & co do have a fixed ratio of shadow memory
>> to real memory used. If the client uses its address space sparsely then
>> it causes sparse (wasteful) use of the shadow memory,
>
> Exactly, that's why I'm arguing against direct-offset shadow addressing.

Indeed, but has this ever happened? The issues we had, AFAIR, are that we
were not able to map a whole file to read debug info. If that is the only
(real) problem, and indirect-offset shadow addressing is slower, we had
better fix the debug info reading...

> I'm not sure if we're all on the same wavelength with this stuff.

Maybe the only unarguable point in this discussion ;-)

> I'm not sure why you say "technically"; Cachegrind (and Calltree,
> Nulgrind, and Massif) don't use shadow memory at all. Much of the
> discussion doesn't apply to them. However, they are still affected by
> some of the rigidity problems, eg. Calltree suffers from Problem P4.

Could you explain in which cases CT suffers from P4 (tools running out of
address space)? Is it only when reading debug info, or for other reasons?

> stupid syscalls which create mappings without asking where (the async
> IO thing).

Not sure the apache team finds them stupid ;-) I don't know exactly how
they work, but if we can't control where these mappings are set up by the
kernel (and at what size), we must either:
- never use a fixed mmap, because we may collide
- or scan every zone we intend to set up a fixed mmap on, using the
  shmat trick, beforehand, to be sure it won't collide with an aio mapping
- or reserve everything...

>> We only need to resort to padding things out either before we gain full
>> control (constraining ld.so to put things in the right place)
>
> I'm surprised that people with non-overcommitting kernels are not having
> a problem with this step, but are having a problem with the big-bang
> shadow memory allocation. I would have thought the padding done by
> stage1 would involve more than 1.5GB worth of maps.

Because we are not padding things out currently (AFAIK), and the big-bang
shadow alloc does not use PROT_NONE + MAP_NORESERVE?

> You mean self-virtualization? That's a goal, but it's tricky.

Indeed, but we must not close the door! The day it works will be a
beautiful day. And it will necessarily happen, so the sooner the better.
For now the only way to report errors would be to extract the
"suspicious" code, link it into a separate exe, and vgify it. It would be
helpful though, e.g. to debug the debug info readers and do some unit
testing.

Cheers
--
Eric
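The locate-block-plus-small-cache scheme Eric describes might be sketched
as follows (chunk size and names are hypothetical, and a linear scan
stands in for the skiplist):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t base; uint8_t *shadow; } Chunk;

/* Stand-in for the skiplist search: a linear scan over a tiny table of
 * 64KB shadow chunks. */
#define NCHUNKS 4
static Chunk chunks[NCHUNKS];
static unsigned slow_lookups;            /* counts slow-path entries */

static Chunk *lookup_chunk(uint32_t addr)
{
    slow_lookups++;
    for (int i = 0; i < NCHUNKS; i++)
        if (chunks[i].shadow && addr - chunks[i].base < 0x10000u)
            return &chunks[i];
    return NULL;
}

/* Fast path: check a one-entry cache before falling back to the full
 * lookup, then add the (map - shadow) shift to the address. */
static Chunk *cached;

static uint8_t *shadow_of(uint32_t addr)
{
    if (cached == NULL || addr - cached->base >= 0x10000u)
        cached = lookup_chunk(addr);
    return cached ? cached->shadow + (addr - cached->base) : NULL;
}
```

Accesses that stay within the cached chunk avoid the slow path entirely,
which is why the penalty depends so much on access locality.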
|
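Eric's cost comparison above can be made concrete with a small sketch.  This is illustrative only: the constant comes from the Memcheck column of the layout table, the function names are invented, and "shifting by 8" is read here as dividing by 8, since there is one A bit per client byte.  It is not actual Valgrind code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: SHADOW_BASE is taken from the Memcheck column of
   the layout table above, and the function names are invented -- this
   is not actual Valgrind code.  With a contiguous big-bang shadow map,
   direct-offset addressing is pure arithmetic. */

#define SHADOW_BASE 0x52d00000u

/* A bits: one bit per client byte, so the byte holding an address's
   A bit lives at base + addr/8 -- the "shift and add" Eric mentions. */
static inline uint32_t abit_addr(uint32_t addr)
{
    return SHADOW_BASE + (addr >> 3);
}

/* V bits: eight per client byte, ie. one shadow byte per client byte,
   so a plain constant add suffices. */
static inline uint32_t vbyte_addr(uint32_t addr)
{
    return SHADOW_BASE + addr;
}
```

The incremental scheme replaces this arithmetic with a table lookup, which is what the rest of the thread debates.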
From: Jeremy F. <je...@go...> - 2004-09-08 20:51:31
|
On Wed, 2004-09-01 at 22:44 +0100, Nicholas Nethercote wrote:
> >> The downside of switching to incremental shadow memory is that it makes
> >> direct-offset shadow addressing impossible, at least on 32-bit.
> >
> > The trouble is that memcheck&co do have a fixed ratio of shadow memory
> > to real memory used.  If the client uses its address space sparsely then
> > it causes sparse (wasteful) use of the shadow memory,
>
> Exactly, that's why I'm arguing against direct-offset shadow addressing.
>
> > but since we get
> > to place all the mmaps, we needn't make it have sparse memory use.  The
> > exception is if the client explicitly places its mappings, but I don't
> > think that's common.
> >
> > So I know that people are running into memory problems, but it isn't
> > clear to me that we can't solve them by using the address space more
> > densely.
>
> (I assume you mean the client portion of the address space?)
>
> How do you propose to use the client address space more densely?  I
> can't see how this would work.
>
> I'm not sure if we're all on the same wavelength with this stuff.

OK, here's how I'm looking at it.

The worst case is that the client uses two pages: one at 0x0, and one at
client_end.  This is an extremely sparse use of the address space, and
while we only need two pages of shadow, the shadow mapping occupies a
lot of address space.

The best case is that the client uses every byte of its mapped address
space.  In this case, the incremental shadow allocation will use the
same amount of memory as the shadow mapping scheme, since every byte
needs N bits of shadow.

Now, since clients almost never use MAP_FIXED, the address of every
memory mapping is under our control.  This means that if we place them
in memory in a dense fashion, we can approach the best-case memory usage
density.  It also means that our original estimate of the amount of
shadow memory needed (client size * N bits/byte) will be accurate, and
we're making best use of the overall address space.

So even if we allow the client address space to grow up as new things
are mapped, and the shadow allocations to grow down, they'll always end
up meeting at the same place anyway.

> > Tools which don't have a fixed ratio (cachegrind) are another issue.
> > They're not, technically, using shadow memory (since there isn't the 1:1
> > relationship between client addresses and shadow addresses), but
> > Valgrind heap.
>
> I'm not sure why you say "technically"; Cachegrind (and Calltree,
> Nulgrind, and Massif) don't use shadow memory at all.  Much of the
> discussion doesn't apply to them.  However, they are still affected by
> some of the rigidity problems eg. Calltree suffers from Problem P4.

Wasn't cachegrind changed to allocate its stuff out of the shadow memory
anyway?

> (And "valgrind heap" is misleading because Valgrind no longer has a heap
> as such; I took it out when I rejigged the memory layout stuff.  It now
> only allocates via maps.)

Well, I'd still call that a heap, since it's the thing under
VG_(malloc).  It doesn't really matter how the actual memory is
requested from the kernel.

> A problem with that is that on x86-64 executables are mapped very low,
> 0x400000 I think (4MB), which doesn't leave enough room for even an 8MB
> stack.

I like the idea of putting it just below client_mapbase better.  It
would be interesting to know where Solaris-x86-64 puts things.

	J
|
|
From: Nicholas N. <nj...@ca...> - 2004-09-01 21:48:18
|
On Wed, 1 Sep 2004, Jeremy Fitzhardinge wrote:

>> But when Nicholas said "incremental loading", he meant loading
>> debug infos for sections/compile units as we need them.
>> That's another problem, not completely unrelated to
>> the "I can't map the whole file to read dbg infos".
>
> Yes, I know, exactly.

Er, no, by "incremental loading" I meant loading the debug info for a
single section in pieces, rather than requiring one big mmap().  Loading
debug info as we need it I would call "lazy" or "on-demand" loading.
Sorry if that was unclear.

>> Not sure it's worth doing it for now.  How many people still use
>> stabs (on the platforms we support...)?
>
> Valgrind's dwarf support is pretty weak compared to the stabs support -
> there's nothing there to extract type information.

But Helgrind is the only tool that uses the type information, and not
many people use Helgrind.

> We only need to resort to padding things out either before we gain full
> control (constraining ld.so to put things in the right place)

I'm surprised that people with non-overcommitting kernels are not having
a problem with this step, but are having a problem with the big-bang
shadow memory allocation.  I would have thought the padding done by
stage1 would involve more than 1.5GB worth of maps.

>> BTW, now we are speaking of mem layout, I think it is very important
>> to keep in mind that it would be great to bootstrap V...
>
> You mean self-virtualization?  That's a goal, but it's tricky.

PIE should help, though.

N
|
|
From: Jeremy F. <je...@go...> - 2004-09-08 21:01:14
|
On Wed, 2004-09-01 at 22:48 +0100, Nicholas Nethercote wrote:
> >> Not sure it's worth doing it for now.  How many people still use
> >> stabs (on the platforms we support...)?
> >
> > Valgrind's dwarf support is pretty weak compared to the stabs support -
> > there's nothing there to extract type information.
>
> But Helgrind is the only tool that uses the type information, and not
> many people use Helgrind.

Well, any tool which wants to describe an address symbolically could use
that stuff.  It's just that the other tools haven't been modified to do
so.  Actually, I thought memcheck/addrcheck do use it, but since they're
mainly talking about values, it doesn't come up terribly often.

> > We only need to resort to padding things out either before we gain full
> > control (constraining ld.so to put things in the right place)
>
> I'm surprised that people with non-overcommitting kernels are not having
> a problem with this step, but are having a problem with the big-bang
> shadow memory allocation.  I would have thought the padding done by
> stage1 would involve more than 1.5GB worth of maps.

They're mapped out of a file (mainly so we can identify them again
later, but it has the nice effect of not triggering the overcommit
logic).

> >> BTW, now we are speaking of mem layout, I think it is very important
> >> to keep in mind that it would be great to bootstrap V...
> >
> > You mean self-virtualization?  That's a goal, but it's tricky.
>
> PIE should help, though.

That solves one problem, but Valgrind is its own worst enemy in this
respect (it does a pile of things we'd hope to never see in a client).

	J
|
|
From: Nicholas N. <nj...@ca...> - 2004-09-02 07:58:59
|
On Thu, 2 Sep 2004, Eric Estievenart wrote:

> Undoubtedly, doing an incremental shadow alloc will help a lot, but
> do we know the performance penalty it implies?
>
> For a shadow A access, the current code does something like shifting by
> 8 and adding a constant.  For the V access, it is a simple constant add.
> When using ISM (Inc Shadow Mem), we must locate the block (through the
> skiplist, I guess), and add the (map-shadow) shift to the address.  Small
> caching will help, but I fear the penalty may not be negligible...

No, that's wrong.

Shadow addressing is done with a two-level table.  The first 16 bits of
an address are used to index into the first level of the table, and the
next 16 bits index into the second level.  Each chunk in the table holds
the shadow values for a 64KB region of memory.  Chunks are created
lazily; if a load/store touches a 64KB region that has not been touched
before, Memcheck allocates and initialises the chunk and inserts it into
the table.  It allocates the chunk out of the big-bang map.  The C
function then gets/sets the shadow value.

So even though we now have the conditions for direct-offset shadow
addressing (ie. we have a big contiguous block of shadow memory) we do
*not* do direct-offset shadow addressing.  We do the multi-level lookup.
So if we change to incremental shadow memory allocation (ie. not big
bang, where shadow chunks aren't contiguous and in order) we can stick
with the current approach with no problems.  (Valgrind 2.0.0 was like
that; no big-bang map was used.)

Look at get_abit() and set_abit() in addrcheck/ac_main.c for examples.

>>> The trouble is that memcheck&co do have a fixed ratio of shadow memory
>>> to real memory used.  If the client uses its address space sparsely then
>>> it causes sparse (wasteful) use of the shadow memory,
>>
>> Exactly, that's why I'm arguing against direct-offset shadow addressing.
>
> Indeed, but has this ever happened?  The issues we had, AFAIR, are
> that we were not able to map a whole file to read debug infos.
> If it is the only (real) problem, and indirect offset shadow addressing
> is slower, we should better fix the dbg inf reading...

Debug info reading is not the only problem.  I listed 5 problems in my
first email; debug info reading is only relevant to P5.

> Could you explain in which cases CT suffers from P4 (tools out of
> address space)?
> Is it only when reading debug infos, or for other reasons?

Other reasons.  Calltree just does a lot of calls to VG_(malloc)() for
its own purposes -- it just tracks a lot of metadata.  For big programs,
this can easily exceed 256MB.  This is the difference between P4 and P5
in my first email.  Go back and read them again.

>> I'm surprised that people with non-overcommitting kernels are not
>> having a problem with this step, but are having a problem with the
>> big-bang shadow memory allocation.  I would have thought the padding
>> done by stage1 would involve more than 1.5GB worth of maps.
>
> Because we are not padding things out currently (AFAIK), and the
> big-bang shadow alloc does not use PROT_NONE + MAP_NORESERVE?

The big-bang shadow alloc uses PROT_NONE +
MAP_PRIVATE|MAP_ANON|MAP_FIXED.  For Memcheck the alloc-size is 1.5GB.

The stage1 padding uses PROT_NONE + MAP_FIXED|MAP_PRIVATE, with a real
file.  Typically, on my machine it does two mmaps, with a combined size
of 2.8GB.

I wonder if the use of a file causes the difference.

>> You mean self-virtualization?  That's a goal, but it's tricky.
>
> Indeed, but we must not close the door!  The day it works will be a
> beautiful day.  And it will necessarily happen, so the sooner the
> better.  For now the only way to report errors would be to extract the
> "suspicious" code, link it into a separate exe, and vgify it.  Would be
> helpful though, e.g. to debug the debuginfo readers and do some unit
> testing.

That's something I definitely want to do -- many modules (eg.
vg_malloc2.c, vg_skiplist.c) can be hooked into a unit test and then run
under Memcheck.  Some won't be able to be (eg. vg_scheduler.c) because
they're too tied up in the system.  This will require some work to
improve the modularisation a bit (ie. decreasing coupling and making
modules as self-contained as possible) but that's a good thing.  (I
tried doing it with vg_malloc2.c but it has enough dependencies on other
code that it was a pain.)

Unit testing is good, and currently we have none; we only have system
testing.  But self-hosting would definitely be a good thing too.

N
|
|
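The two-level table Nicholas describes can be sketched as follows.  This is a hypothetical illustration (the names, the chunk allocator, and the 0xFF "untouched" default are invented); the real code is in get_abit()/set_abit() in addrcheck/ac_main.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the two-level shadow table described above;
   names and details are invented, not Valgrind's actual code. */

#define CHUNK_BYTES (1u << 16)            /* each chunk shadows 64KB */
#define N_PRIMARY   (1u << 16)            /* top 16 bits index this */

typedef struct { uint8_t sbyte[CHUNK_BYTES]; } chunk_t;

static chunk_t *primary[N_PRIMARY];       /* first level, lazily filled */

static chunk_t *get_chunk(uint32_t addr)
{
    uint32_t i = addr >> 16;              /* first 16 bits: primary index */
    if (primary[i] == NULL) {             /* chunk created on first touch */
        primary[i] = malloc(sizeof(chunk_t));
        memset(primary[i], 0xFF, sizeof(chunk_t));  /* "untouched" marker */
    }
    return primary[i];
}

uint8_t shadow_get(uint32_t addr)
{
    return get_chunk(addr)->sbyte[addr & 0xFFFFu];  /* low 16 bits: offset */
}

void shadow_set(uint32_t addr, uint8_t v)
{
    get_chunk(addr)->sbyte[addr & 0xFFFFu] = v;
}
```

Nothing in this lookup path cares whether the chunks come out of one contiguous big-bang map or are mmap()ed individually, which is why switching to incremental allocation leaves the lookup unchanged.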
From: Eric E. <eri...@fr...> - 2004-09-02 19:31:36
|
Nicholas Nethercote wrote:
>
> Shadow addressing is done with a two-level table. [...]
>
> So even though we have the conditions now for direct-offset shadow
> addressing (ie. we have a big contiguous block of shadow memory) we do
> *not* do direct-offset shadow addressing.
Well, I assumed that the big-bang alloc was made for the purpose
of direct-offset shadow addressing, and so that it was actually used...
The question is then: what, at that time, made you decide
to do the big-bang mmap?  I feel we may fall into a classic
software engineering syndrome, where you do one thing, change it
to another because it does not work well, change back to the first
because it is no better, then back to the second because you are
not satisfied, etc.
I have had a look at the ML archives, but couldn't find any
explanation other than direct-offset shadow addressing, and the need
for a barrier between client and vg.  But the reasons are not clear.
Any other hints on that?
> Look at get_abit() and set_abit() in addrcheck/ac_main.c for examples.
Indeed, I had a look a few months ago, but did not remember the
particular details. It just means that we won't reduce performance
by... keeping the current implementation :-)
>>> Exactly, that's why I'm arguing against direct-offset shadow addressing.
Agreed, it's only trouble.  The 32-bit address space is too limited
to waste it.  And performance is acceptable with the current scheme.
> Debug info reading is not the only problem. I listed 5 problems in my
> first email, debug info reading is only relevant to P5.
> this can easily exceed 256MB. This is the difference between P4 and P5
> in my first email. Go back and read them again.
Don't worry, I had them well in mind and in front of my eyes ;-)
I just need a few more details on P4, because we already
had supposedly workable solutions for:
- P1: PIE + layout detection
- P2: PROT_NONE + MAP_NORESERVE and/or shm reservation
- P3: layout detection + (big bang or incremental shadow alloc)
Summarizing P4 and P5 gives the following constraints:
For P4, tools need to have a heap.  They don't care where the big
areas are allocated, nor how they get chunked into small
blocks by VG_(malloc).  Arenas improve perf, and that's good.
A flexible barrier (or no barrier) would help share the address
space between the tools and the client, because sometimes
the client requires a lot of mem, and sometimes the tools do.
P5: At any moment, we may need to have, say, 500MB of consecutive
address space available to map a big file. This is very temporary,
and while we have this mapping:
- The client does not execute
- We increase VG heap usage with the dbg infos we are parsing.
The client's needs being the following:
C1. text, data, bss fixed size segments
C2. a (down-growing) stack. May need to be very big in some cases
C3. an (up-growing) heap (brk) region. May need to be very big in some cases
C4. additional address space for file/anonymous mappings
(malloc will fall back to that if brk() fails)
C5. reserved fixed address space ranges (very rarely)
And the VG + tools' needs being:
V1. text, data, bss - don't care where they are
V2. a small stack - idem
V3. chunks for shadow memory - not contiguous, don't care where they are
V4. mapping ranges for VG_(malloc) - no need to be contiguous
V5. Temporary BIG area for file mappings (for now)
I found a solution for P1..P5 which seems credible.
Based on the assumption that we don't need to have
a barrier (even a flexible one -- BTW, what did you
mean by flexible?  A command-line arg, or adjusting during
the execution?), we can have C4, V3 and V4 interleaved,
and V5 taking the maximum space available after C3 and/or before
C2, since C2 and C3 won't grow while V5 (the file mapping) is needed.
C2 (stack) should have a reasonable default, but be configurable
through a command-line arg (e.g. --min-stack-size) for
specific purposes.
Idem for C3 (brk).  Note that having a limited heap region
shouldn't bother apps too much, if their malloc uses
mmap() when brk() fails.
Optionally we could add --max-XXX-size if people want
to test under memory constraint.
So the following layout could work:
0 ===========================
Shared area 1 (DOWN)
...
VV--------------VV
available to vg & client
cli_data ===========================
client heap (DOWN)
...
cli_data_end VV----------------VV
[Temp vg mappings]
^^----------------^^
....
Shared area 2 (UP)
cli_stk_max ============================
Unused / Unusable
cli_stk_cur ^^----------------^^
.....
client stack, (UP)
space_end ============================
Kernel space
0xFFFFFFFF ============================
=== for limits fixed at once
--- for limits moving during the run
Shared areas (SAs) can contain C1, C4, V1, V3, V4
blocks.
Of course this is just an example, we may decide
to setup things differently, e.g.
- I didn't put the VG stack anywhere; could take
8 Mb in SA1, or be put at SA1 end, etc.
- Client stack is constrained; it could be moved to the
end of SA1 (needs arbitration between allocations
in SA1 and SA2: less max client stack, or less
maximum file mapping and less max client data.)
- There could be less or more shared areas
- Zones can be ordered in another manner
The important points in this idea are:
- No more address space separation between
Valgrind and client
- At startup, depending on the view we have
of the address space (scanned), command-line
options, whatever, we decide a layout
and fix (compute) the hard limits.
- The big file mapping area is between the client heap
and a shared area which is used as a last resort.
- At runtime, we arbitrate the allocations in the shared
areas.  In the previous model, we would first try
to use everything in SA1 before using SA2, because SA2
impacts the client heap and the max mappable file size.
- If something goes wrong, we issue a clear message
telling the user to increase e.g. the min avail stack
size or min avail heap size on the command-line
I re-read carefully P1 through P5 and found nothing
problematic remaining.  The only question I have no answer
to is why there was a barrier between vg and the client.  I feel
it is not needed.
> [Modularization, Unit testing, ...]
> This will require some work to improve the modularisation a bit (ie.
> decreasing coupling and making modules as self-contained as possible)
> but that's a good thing.
Clearly a big task which could really improve the quality...
Volunteers ? ;-)
> But self-hosting would definitely be a good thing too.
With the shared areas, it shouldn't be too difficult
to have the 3 pseudo-processes (VG1 + VG2 + Client)
sharing the same address space. Probably with a bit of
cooperation between VG1 and VG2...
--
Eric
|
|
From: Jeremy F. <je...@go...> - 2004-09-08 20:41:54
|
On Thu, 2004-09-02 at 08:58 +0100, Nicholas Nethercote wrote:
> The big-bang shadow alloc uses PROT_NONE +
> MAP_PRIVATE|MAP_ANON|MAP_FIXED.  For Memcheck the alloc-size is 1.5GB.
>
> The stage1 padding uses PROT_NONE + MAP_FIXED|MAP_PRIVATE, with a real
> file.  Typically, on my machine it does two mmaps, with a combined size
> of 2.8GB.
>
> I wonder if the use of a file causes the difference.

Yes, it does.  The kernel is worried about whether there's somewhere to
put the dirty pages as memory fills up.  If there's no file backing,
then the limit is phys.mem+swap.  But if the mapping is backed by a
file, then that mapping doesn't need to use swap if it is paged out.

	J
|
|
From: Julian S. <js...@ac...> - 2004-09-02 10:24:06
|
One point to bear in mind is that the conversation that followed
appeared to be silently predicated on the assumption that Linux is the
only kernel we're interested in.  We have to think beyond that: I would
love to make V available for MacOSX, and I bet (eg) the OpenBSD folks
would love to get their hands on an x86-openbsd variant of Valgrind.
Indeed, once Nick's current commit set goes in, one interesting
experiment would be to try an openbsd port.  Or freebsd (already
exists) or netbsd.

> -----------------------------------------------------------------------------
> Problems + solutions
> -----------------------------------------------------------------------------
> P1. It assumes 3G:1G user/kernel split.
>
> - For 4G kernels, Valgrind gets the whole extra 1GB for its own use (I
>   think).  This works, but is sub-optimal.
>
> - For other layouts (eg. 2G:2G, or even 2.9G:1.1G) it just doesn't work.
>   Changing KICKSTART_BASE is a workaround, if you know that.  (But 2G:2G
>   still cannot run Memcheck, see below.)
>
> S1. This can be solved easily, by using position-independent executables
> (PIE).  We can do a configure-time test for PIE, and if supported, make
> stage2 a PIE.  Then stage1 can decide where stage2 should go, by doing
> some kind of run-time test (which would look at where the stack is, or
> use shmat(), or something, to determine where the user/kernel division
> lies).
>
> This change is pretty uncontroversial, and Paul already has a patch for
> it (which I don't think should be committed as-is, but is a good start).
>
> For non-PIE-supporting systems, we could build 3 or 4 versions of
> stage2, and choose the most appropriate one (I have a patch for this).
> Or just a single fixed-location back-up stage2 might be enough.

Ok, I agree with P1/S1.  This is uncontroversial.  Since we can't rely
on PIE being around, the 3 or 4 versions solution sounds good to me.

> -----------------------------------------------------------------------------
> P2. For kernels with "overcommit" mmapping off -- which prevents a
> process from allocating more address space than the available swap
> space -- you need at least 1.5GB of swap for Memcheck to run, because
> swap must be at least as large as any individual segment.  (And I think
> users with ulimit -v set suffer the same problem.)
>
> S2. Avoiding this requires not using the big-bang shadow allocation
> method, and that shadow memory instead be done incrementally.  (More
> about that below.)

Let's just forget about big-bang shadow allocation.  It causes a whole
bunch of problems, we're not using it at the moment, and we don't have a
clear picture of where the cycle-level costs of shadow memory come from
anyway.  For example, if shadow memory really kills us because it jacks
up the D1/L2 cache miss rates, then it's going to do so regardless of
the address translation scheme in use.

> A more radical solution: truly virtualise the address space (rather
> than just partitioning it) -- ie. Valgrind implements its own virtual
> MMU and page table.  The exact details of how it would work are not yet
> clear.  If even feasible, this is a long-term solution; something else
> should be done in the meantime.

I agree.  Currently I do not see how to do this with a small enough
performance overhead, so forget about this for the time being.

> -----------------------------------------------------------------------------
> My suggestion
> -----------------------------------------------------------------------------
> In this order, make the following changes.
>
> 1. Use PIE where possible, solving P1.

Agree.

> 2. Switch big-bang shadow memory allocation to incremental, solving P2,
>    P4, P5.

Agree.

> 3. Make the client/Valgrind division movable, largely solving P3.

Agree.

> 4. Maybe make the --pointer-check=no change, if it seems useful.

Well, I like the fact that currently the client can't trash V.

Another thing to consider is how to achieve this portably, on non-x86s.
If the client address space is contained entirely in 0 .. N-1, and N is
a power of two, ANDing is obviously a cheap solution.  If the machine
contains a scalar 'min' insn, then we can do this cheaply for any N.

	J
|
|
From: Nicholas N. <nj...@ca...> - 2004-09-02 10:50:38
|
On Thu, 2 Sep 2004, Julian Seward wrote:

> One point to bear in mind is that the conversation that followed
> appeared to be silently predicated on the assumption that Linux is
> the only kernel we're interested in.

Some of the stuff (using mmap()) is plain POSIX.  But the stuff about
overcommitting, etc, is Linux-specific.

In general, I'm not happy about relying on doing huge (eg. 1.5GB) mmaps.
Seems too fragile in general.

> We have to think beyond that: I would love to make V available for
> MacOSX, and I bet (eg) the OpenBSD folks would love to get their hands
> on an x86-openbsd variant of Valgrind.  Indeed, once Nick's current
> commit set goes in, one interesting experiment would be to try an
> openbsd port.  Or freebsd (already exists) or netbsd.

Definitely.  Once I've finished putting the structure in place, it would
be great if someone (Doug?) could create a patch for FreeBSD, like Paul
has been doing for PPC.  Hopefully the changes required won't be too big
-- just some factoring -- in which case I don't see why it couldn't go
into CVS if it's working reliably.  In contrast, PPC will require UCode
reworking, so that's not going to be able to go in quite yet.

> Let's just forget about big-bang shadow allocation.  It causes a whole
> bunch of problems, we're not using it at the moment,

No -- we are using it at the moment.  However, we are not using
direct-offset shadow addressing (for which big-bang shadow allocation is
a prerequisite).  Sorry if I didn't make this clear; it seems I should
have been more careful with my terminology in the RFC.

> and we don't have
> a clear picture of where the cycle-level costs of shadow memory come
> from anyway.  For example, if shadow memory really kills us because
> it jacks up the D1/L2 cache miss rates, then it's going to do so
> regardless of the address translation scheme in use.

Yes.

>> 4. Maybe make the --pointer-check=no change, if it seems useful.
>
> Well, I like the fact that currently the client can't trash V.

Yes, the default would be --pointer-check=yes, which maintains that
non-trashing property.  The mooted change would only take effect if you
specify --pointer-check=no; that would only be used for really difficult
cases where we really can't get it to work otherwise.

> Another thing to consider is how to achieve this portably, on non-x86s.
> If the client address space is contained entirely in 0 .. N-1, and N is
> a power of two, ANDing is obviously a cheap solution.

Actually, we just need one of the following to be true:

    the client must be within 0..2^N

or

    Valgrind must be outside 0..2^N

Ie. there could be unused address space between them, and the dividing
line just has to fall within that.  But this seems problematic -- the
only dividing line that really makes sense for 32-bit is either 1GB or
2GB, which doesn't allow the client to get very big.

> If the machine contains a scalar 'min' insn, then we can do this cheaply
> for any N.

Yes.  Is such an instruction common?

N
|
|
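The ANDing scheme under discussion can be sketched as follows.  The 1GB boundary and the names are invented examples for illustration, not Valgrind's actual values.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration of the --pointer-check ANDing idea discussed above
   (boundary and names invented).  If the client is confined to
   0..2^N-1, masking every client address with 2^N - 1 makes it
   impossible for translated client code to touch Valgrind's part of
   the address space -- the "can't trash V" property -- at the cost of
   fixing the dividing line at a power of two. */

#define CLIENT_BITS 30u                       /* eg. a 1GB client space */
#define CLIENT_MASK ((1u << CLIENT_BITS) - 1u)

static inline uint32_t clamp_client_addr(uint32_t addr)
{
    /* An address below the line passes through unchanged; one above it
       wraps back into client space instead of reaching Valgrind. */
    return addr & CLIENT_MASK;
}
```

Note this silently redirects bad accesses rather than detecting them, which is exactly the prevent-versus-report distinction Jeremy raises later in the thread.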
From: Jeremy F. <je...@go...> - 2004-09-08 21:11:51
|
On Thu, 2004-09-02 at 11:23 +0100, Julian Seward wrote:
> One point to bear in mind is that the conversation that followed
> appeared to be silently predicated on the assumption that Linux is
> the only kernel we're interested in.  We have to think beyond that:
> I would love to make V available for MacOSX, and I bet (eg) the
> OpenBSD folks would love to get their hands on an x86-openbsd
> variant of Valgrind.  Indeed, once Nick's current commit set goes
> in, one interesting experiment would be to try an openbsd port.
> Or freebsd (already exists) or netbsd.

I don't think anything here is particularly kernel-dependent.  The
prerequisite is that there's an mmap syscall which allows us to define
the placement of things in the address space.  That's not a big ask.

> Let's just forget about big-bang shadow allocation.  It causes a whole
> bunch of problems, we're not using it at the moment, and we don't have
> a clear picture of where the cycle-level costs of shadow memory come
> from anyway.  For example, if shadow memory really kills us because
> it jacks up the D1/L2 cache miss rates, then it's going to do so
> regardless of the address translation scheme in use.

Well, if we use a direct map, then we're using the CPU's TLB rather than
its D caches, so the tradeoff changes.  And we're taking advantage of a
CPU hotpath, which is presumably well optimised.

"Manual" table indirects have locality patterns which match the client's
memory accesses, so there isn't a huge increase in cache pressure.  (Ie,
there's a doubling because of the data+shadow, but the table accesses
only add 1 line per pagetable level.)  The manual scheme adds I-cache
pressure linear in the number of table levels, if the table indirection
is inlined.

> > 4. Maybe make the --pointer-check=no change, if it seems useful.
>
> Well, I like the fact that currently the client can't trash V.
>
> Another thing to consider is how to achieve this portably, on
> non-x86s.  If the client address space is contained entirely in
> 0 .. N-1, and N is a power of two, ANDing is obviously a cheap
> solution.  If the machine contains a scalar 'min' insn, then
> we can do this cheaply for any N.

There's the question of whether you simply want to prevent the client
hitting Valgrind (by ANDing off the address bits), or actually detect
the case and raise an error.

	J
|
|
From: Julian S. <js...@ac...> - 2004-09-02 10:35:27
|
(/me is getting a bit confused here)

Let me see if I have this right.  If we simply implement Nick's original
proposal, then the whole issue of reading debug info becomes moot,
because V can mmap in executables of any size to read their debug info.
At least, assuming some contiguous, suitably large piece of virtual
address space >= client_end can be found.

This is an improvement on the current situation, wherein we essentially
reserve 256M of space for more or less this purpose (+ misc other
V-internal storage), and have to fail on executables bigger than ~200M.

Do I understand right?

J

On Wednesday 01 September 2004 22:48, Nicholas Nethercote wrote:
> [...]
|
|
From: Nicholas N. <nj...@ca...> - 2004-09-02 10:57:34
|
On Thu, 2 Sep 2004, Julian Seward wrote:

> (/me is getting a bit confused here)

Understandable :)

> Let me see if I have this right.  If we simply implement Nick's
> original proposal, then the whole issue of reading debug info
> becomes moot because V can mmap in executables of any size to
> read their debug info.  At least, assuming some contiguous,
> suitably large piece of virtual address space >= client_end
> can be found.

Correct.

I can imagine pathological cases, under my proposal, where the client is
already very big and so V cannot find that space (ie. your "at least"
assumption fails).  In those cases having incremental debug info loading
(ie. not mapping in the whole file at once) would help.  However, these
cases are so unlikely that just reading in the whole file at once (which
is much simpler) should be fine.

> This is an improvement on the current situation wherein we
> essentially are reserving 256M of space for more or less this
> purpose (+ misc other V-internal storage), and have to fail on
> executables bigger than ~200M.

Correct.

> Do I understand right?

Yes.

----

It might be worth me doing a summary of this thread at some point, like
I did for the threads/signals/syscalls discussion, in order to clear up
any remaining confusion and clarify our terminology.

N
|
|
From: Julian S. <js...@ac...> - 2004-09-02 11:19:23
|
> I can imagine pathological cases, under my proposal, where the client is
> already very big and so V cannot find that space (ie. your "at least"
> assumption fails). In those cases having incremental debug info loading
> (ie. not mapping in the whole file at once) would help. However, these
> cases are so unlikely that just reading in the whole file at once (which
> is much simpler) should be fine.

Good. I can imagine that too. So we are of common understanding.
Assuming the plan goes through as proposed, I'm going to assume the
debug info reading issue is a non-problem.

> It might be worth me doing a summary of this thread at some point, like I
> did for the threads/signals/syscalls discussion, in order to clear up any
> remaining confusion and clarify our terminology.

Good plan.

J |
|
From: Julian S. <js...@ac...> - 2004-09-02 11:13:18
|
> > One point to bear in mind is that the conversation that followed
> > appeared to be silently predicated on the assumption that Linux is
> > the only kernel we're interested in.
>
> Some of the stuff (using mmap()) is plain POSIX. But the stuff about
> overcommitting, etc, is Linux-specific.

Right. So an important thing to achieve here is to restrict our needs
to plain POSIX, and have a solution independent of (eg) overcommitting
hints.

> In general, I'm not happy about relying on doing huge (eg. 1.5GB) mmaps.
> Seems too fragile in general.

I entirely agree.

> > Let's just forget about big-bang shadow allocation. It causes a whole
> > bunch of problems, we're not using it at the moment,
>
> No -- we are using it at the moment. However, we are not using
> direct-offset shadow addressing (for which big-bang shadow allocation
> is a prerequisite). Sorry if I didn't make this clear, seems like I
> should have been more careful with my terminology in the RFC.

Uh, ok, this is my sloppy thinking. I meant, we are not using direct-
offset shadow addressing and therefore there is no reason for big-bang
shadow allocation (is there?)

> Actually, we just need one of the following to be true:
>
>    the client must be within 0..2^N
>
> or
>
>    Valgrind must be outside 0..2^N
>
> Ie. there could be unused address space between them, and the dividing
> line just has to fall within that.
>
> But this seems problematic -- the only dividing line that really makes
> sense for 32-bit is either 1GB or 2GB, which doesn't allow the client
> to get very big.

Yes, any solution which involves a single AND (or NAND) seems too
inflexible.

> > If the machine contains a scalar 'min' insn, then we can do this
> > cheaply for any N.
>
> Yes. Is such an instruction common?

I don't think so, at least not in scalar pipelines (SIMD ones maybe).
PaulM, what's the deal on ppc?

On x86 (if we don't use seg regs) we can do

   cmp %limit, %addr ; cmovge %limit, %addr

which isn't great, but it isn't a disaster either, and it sounds
relatively portable.

J |
|
From: Nicholas N. <nj...@ca...> - 2004-09-02 11:17:01
|
On Thu, 2 Sep 2004, Julian Seward wrote:

> >> Let's just forget about big-bang shadow allocation. It causes a whole
> >> bunch of problems, we're not using it at the moment,
> >
> > No -- we are using it at the moment. However, we are not using
> > direct-offset shadow addressing (for which big-bang shadow allocation
> > is a prerequisite). Sorry if I didn't make this clear, seems like I
> > should have been more careful with my terminology in the RFC.
>
> Uh, ok, this is my sloppy thinking. I meant, we are not using direct-
> offset shadow addressing and therefore there is no reason for big-bang
> shadow allocation (is there?)

Correct.

N |
|
From: Tom H. <th...@cy...> - 2004-09-02 11:40:44
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
> On x86 (if we don't use seg regs) we can do
> 'cmp %limit, %addr ; cmovge %limit, %addr'
>
> which isn't great, but it isn't a disaster either, and it sounds relatively
> portable.
Bear in mind that CMOV is only available on the Pentium Pro and
later processors. It isn't available on 486s, original Pentiums,
AMD K6s and so on. I'm not sure about the original MMX Pentiums
as I can't find one here to try it on at the moment.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|