From: Roland M. <ro...@re...> - 2004-12-04 02:09:27
> I'm not so worried about page table management. Valgrind already needs
> to support user-mode programs using mmap, that's just a high-level
> interface to the paging hardware.

Uh, sort of. I am less worried about implementation difficulties than
about being clear, in my head and in our discussion, about what we are
actually talking about. :-) (It probably doesn't help that even in what
I am myself clear on, I have at least two opposing plans of attack I'm
describing.)

There are two areas of complication here. First, you can switch page
tables (a virtualized %cr3 change). This changes the universe of what
each flat address means. To a first approximation, this requires that
valgrind have a big hook on which to hang its tables of memory state and
translation cache lhs values, and swap different whole worlds onto that
hook in response to the appropriate hypercall.

But slightly deeper contemplation reveals this approximation is way, way
off. The page tables are like user programs' use of mmap in that writing
page table entries constitutes a page-granularity allocation. But in
user programs you are used to thinking of this as an extra allocation
mechanism you don't do a lot with, and you focus more on the malloc
allocations, where you know object boundaries from intercepting those
known calls. In a xen guest kernel, the page table allocations are the
basic thing underlying all the object allocators. Tracking just the use
of the pages and the initializedness of their bytes is necessary and
useful, but not useful enough. I want to teach valgrind to understand
calls to my xen guest kernel's object allocators so it knows the object
boundaries of the memory used in the guest kernel, just as it knows the
boundaries of malloc allocations in a user program. This is what I was
getting at when I said page tables are intertwined with the allocation
tracking plan. I'd like to explore how you think about this.
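To make the two layers concrete, here is a minimal sketch (all names
hypothetical, with a toy 4 MiB guest) of how one might treat the PTE
write as the page-granularity allocation event underneath an intercepted
object-allocator call -- a sketch of the idea, not valgrind's actual
machinery:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     1024          /* toy guest "physical" memory: 4 MiB */

/* Coarse, page-granularity layer: does a mapping exist at all? */
enum page_state { PG_UNMAPPED = 0, PG_MAPPED };
static enum page_state page_state[NPAGES];

/* Hypothetical hook for the hypercall installing a PTE for frame
 * `pfn`: in a guest kernel this *is* the underlying allocation event. */
static void track_pte_write(uint32_t pfn)
{
    assert(pfn < NPAGES);
    page_state[pfn] = PG_MAPPED;
}

/* Fine, object-granularity layer: an intercepted allocator call, the
 * analogue of intercepting malloc in a user program.  Here we only
 * check that the object lands on pages the page tables have mapped. */
static int object_alloc_ok(uint32_t addr, uint32_t size)
{
    uint32_t first = addr >> PAGE_SHIFT;
    uint32_t last  = (addr + size - 1) >> PAGE_SHIFT;
    for (uint32_t p = first; p <= last; p++)
        if (p >= NPAGES || page_state[p] != PG_MAPPED)
            return 0;
    return 1;
}
```

The point of the second layer is that, once the allocator calls are
understood, object boundaries and per-byte definedness can be tracked on
top of the page layer exactly as memcheck does on top of mmap.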
To get into the deeper contemplation I mentioned, I again diverge into
two disparate implementation styles. I think both have merit for
different purposes, and ultimately I would like to see a flexible hybrid
approach. I have yet to settle on which I think is easier to get going
first.

First, there is the "pure virtual" approach. That is, valgrind's task is
to virtualize the entire xen/x86 virtual machine fully, just as vanilla
valgrind virtualizes the entire linux-user/x86 virtual machine provided
in each individual user process address space by a linux kernel. This is
like unto moving towards the whole-machine emulation ability of
something like vmware or qemu, but still a whole lot simpler than that,
because the xen/x86 virtual machine is substantially simpler than the
x86 privileged hardware.

What this means is that valgrind directly simulates in software all the
behavior of privilege modes and page tables that xen/x86 guests can see.
valgrind runs the kernel, valgrind runs the user. For the memory
handling, valgrind emulates the work of the MMU by translating virtual
addresses according to the page tables, and indexes its shadow memory,
translation caches, and whatnot on the physical addresses that come out
of the MMU module. The translation hopefully can optimize this,
analogous to how it's possible to optimize the memory checks done for a
translated block containing multiple accesses to a single object; i.e.,
figure out which address ranges a given block can access and do the MMU
translation once at the start of the translated block. I probably said
before this is straightforward. I didn't say it wouldn't be dog slow.
This is the completist approach, and it has some benefits.
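For concreteness, the MMU module's job in the pure-virtual approach can
be sketched as below, assuming the classic two-level (non-PAE) x86 page
table format; `frames` is a toy stand-in for guest physical memory, and
every name here is hypothetical:

```c
#include <stdint.h>

#define PRESENT 0x1u
#define NFRAMES 16

/* Toy guest "physical" memory: NFRAMES 4 KiB frames of 32-bit words. */
static uint32_t frames[NFRAMES][1024];

static uint32_t *phys_frame(uint32_t pfn) { return frames[pfn]; }

/* Classic two-level x86 walk: %cr3 names the page-directory frame.
 * Returns 1 and fills *paddr on success, 0 where hardware would fault. */
static int mmu_translate(uint32_t cr3, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t *pd = phys_frame(cr3 >> 12);
    uint32_t pde = pd[(vaddr >> 22) & 0x3ff];
    if (!(pde & PRESENT))
        return 0;
    uint32_t *pt = phys_frame(pde >> 12);
    uint32_t pte = pt[(vaddr >> 12) & 0x3ff];
    if (!(pte & PRESENT))
        return 0;
    *paddr = (pte & ~0xfffu) | (vaddr & 0xfffu);
    return 1;
}
```

Shadow memory and translation caches would then be indexed on *paddr*,
so a virtualized %cr3 switch changes which bytes a flat address names
without duplicating any per-byte state.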
You get a system where one user process can copy around some
uninitialized data, write it into a pipe, have it read and written by
three other user processes through three other pipes, and come out
someplace where valgrind says "this uninitialized garbage you see before
you came from over here". That is pretty cool. Of more everyday use will
be the ability you mentioned earlier to notice kernel code examining
uninitialized data copied in from user memory. There are many worthwhile
kinds of analysis that become possible when you are tracking the entire
use of the machine with no boundary lines (such as process address
space, or user/kernel mode) on your knowledge of the details.

The other approach is in fact what I had in mind at the genesis of the
discussion. That is, take advantage of some knowledge of the guest
kernels we're interested in running under valgrind, and only try to
instrument kernel code, not user code.

First, we decide that user-mode execution is a black box--we don't
translate code in user mode, we just run it natively (more in another
message about how this is done). When a page table entry is written with
user-mode write permission enabled, and we've then gone to user mode, we
just assume something broad about what happened to all the bytes in
those pages. (Well, actually we can optimize this to know whether the
page was touched at all or not.)

Second, we observe that the guest kernels of interest actually use their
page tables such that the kernel-mode-only pages are all the same in all
the address spaces they ever switch among. These two assumptions allow
us to go back to the innocent world we know, where there is just one
single address space used by a single program: the guest kernel, in the
kernel-only subset of the address space set up by its page tables.
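The "assume something broad" step above can be sketched like this, with
hypothetical names: the PTE-install hook records which pages user mode
could write, and reentering kernel mode resets their definedness state.
(A per-page dirty bit would implement the optimization of skipping
untouched pages.)

```c
#define NPAGES 256

/* Recorded when the guest installs mappings: can user mode write it? */
static int user_writable[NPAGES];

/* Toy shadow state: 1 = this page's bytes are all considered defined. */
static int page_defined[NPAGES];

static void on_pte_install(unsigned pfn, int user_bit, int write_bit)
{
    user_writable[pfn] = user_bit && write_bit;
}

/* On returning to kernel mode after native user execution: the black
 * box may have scribbled on anything it could write, so forget what we
 * knew about those pages. */
static void on_kernel_reentry(void)
{
    for (unsigned p = 0; p < NPAGES; p++)
        if (user_writable[p])
            page_defined[p] = 0;
}
```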
(Furthermore, kernels draw a fixed line between the two kinds of pages
in the page tables; so in fact, the valgrind implementation of the guest
hypercalls to install page table entries need only enforce that
user-accessible pages are on one side of a threshold address and that
kernel-only pages are on the other side, and then it need only keep
track of this threshold for identifying user addresses on use, rather
than consult the page tables.)

Now, the kernel does actually access user memory, and when it does so,
the results depend on the address space switches that we just decided to
pretend weren't happening. So the translated code needs to identify
reads from user addresses as getting random bits from the black box.
Basically, any load from user memory is equivalent to getting those
bytes from a `read' system call, in userland valgrind's perspective.

Usefully, in practice each block that loads from user memory always
loads from user memory, and each block that stores to user memory always
stores to user memory. So you can notice at the first execution of a
block that it will refer to user memory, and translate that block to
quickly check against the threshold address and then do the
memory-tracking machinery appropriate for the black-box loads/stores.
(In practice the guest kernels never call these blocks with an address
not on the user side of the address space, so that quick check is just
for wild bugs.)

Thanks,
Roland
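The threshold test those translated blocks would perform can be sketched
as follows; 0xC0000000 is the conventional i386 kernel/user split, used
purely as an illustrative value, and the names are made up:

```c
#include <stdint.h>

/* Assumed fixed kernel/user split, enforced at PTE-install time. */
#define KERNEL_THRESHOLD 0xC0000000u

enum mem_class {
    MEM_USER_BLACKBOX,    /* treat like bytes from a `read' syscall */
    MEM_KERNEL_TRACKED    /* normal definedness tracking applies */
};

/* The quick check compiled into each block known to touch user memory;
 * in practice it should only ever trip on wild bugs. */
static enum mem_class classify_access(uint32_t addr)
{
    return addr < KERNEL_THRESHOLD ? MEM_USER_BLACKBOX
                                   : MEM_KERNEL_TRACKED;
}
```

A user-classified load would then mark its destination bytes undefined
(exactly as memcheck treats data arriving from read), while a
kernel-classified access goes through the ordinary shadow-memory path.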