From: Andi K. <ak...@su...> - 2004-02-13 12:38:27
On Thu, 12 Feb 2004 23:48:41 -0800 Paul Jackson <pj...@sg...> wrote:

> All the above has led me to conclude that we need to move
> the cpu and memory placement policy out of the tasks and
> vmas, into separate structs, shared by links from the
> task struct.

My latest version has already switched to separately allocated
mempolicies. I did that to allow independent policy for shared memory
segments (you can set policy for a shared memory segment even when
nobody has it attached, and everybody sharing the shm shares the
policy too).

A mempolicy has a reference count, so it can be shared easily. I also
made it stateless for the VMA case, so it doesn't need a lock anymore
for interleaving. It's still tied to task structs and the vmas, but
there is no fundamental reason it couldn't hang off other data
structures.

> Two other, more subtle, concerns reinforce this conclusion:
>
> 4) Attaching memory policy to vma's has a subtle imposition
> on their use. It means you can only have memory policy
> where you already have vma's. For example, a parent
> process could not pre-establish some memory policy for
> a particular range of memory that it knew some child

That is what process policy is for. I don't want to extend that to
apply more fine-grained, because I don't believe you can make a nice
user interface out of this.

> 5) Finally, cpu and memory placement should work together as
> a unit. The essential step to managing performance on a
> NUMA system is placing tasks on the CPUs near their
> memory.

I think it's better to keep them separate. Handling them together
automatically doesn't seem to work very well anyway, and for user
configuration I don't think it's a problem to offer both.

> With this architectural change, I intend that:
>
> One would no longer have to fault pages in to apply
> policy. On some large jobs, this is critical.

That only ever applied to process policy; mbind vma policy is of
course more persistent.

> Process policy persist over swapping.
>
> Interleaving be per cpuset, not per VMA

In my current code it is stateless for VMA policy (defined only by
the offset into the address space; rough sketch below). For process
policy it is still per process.

I'm not sure it would really be a good idea to share "dynamic"
interleaving state over more than one process. Interleaving only
makes sense as seen from the virtual address space of a single
process; otherwise you don't get any bandwidth advantage from it
while accessing your memory. But the different processes in your
cpuset have completely different VMs, and I have my doubts that
"gang interleaving" unrelated VMs would result in an even and useful
interleaving pattern for the final CPU accesses.

> What I am about to attempt next will be to adapt the system
> calls and choice of policies in your numa patch, which are
> better than I could come up with, to kernel data structures
> where these policies are kept along side the CPU placement
> policies (that Simon and Sylvain of Bull have been working
> on), in a struct hung off the task struct. For example, if

I would suggest you wait a bit until I release the next version
(still debugging a few issues in it). It has lots of changes, and it
probably doesn't make sense to base much work on the older version.
The new reference counted policies should make that much easier
anyway.
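To make the stateless interleaving above concrete: the node choice is
a pure function of the page's offset in the mapping, roughly like the
sketch below. The names are illustrative only, not necessarily what
the patch calls them, and it assumes an interleave policy that keeps
its allowed nodes in a bitmap (v.nodes here).

/* Sketch only: pick the interleave node purely from the page's
 * offset in the backing object.  There is no counter to update,
 * hence no lock, and the result is stable for a given page.
 * Assumes pol->policy is interleave and at least one node is set. */
static int interleave_node(struct mempolicy *pol,
                           struct vm_area_struct *vma,
                           unsigned long addr)
{
        unsigned long off;
        unsigned nnodes, target, c;
        int nid = -1;

        /* page offset relative to the start of the backing object */
        off = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);

        /* walk to the (off % nnodes)'th set bit in the node bitmap */
        nnodes = bitmap_weight(pol->v.nodes, MAX_NUMNODES);
        target = off % nnodes;
        for (c = 0; c <= target; c++)
                nid = find_next_bit(pol->v.nodes, MAX_NUMNODES, nid + 1);
        return nid;
}

Because the offset is taken relative to the backing object, all
processes mapping the same shm segment also agree on the node for
each page, without sharing any interleaving state.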
> One challenge will be to keep the kernel scheduler and memory
> allocator and related code efficient - perhaps by caching the
> user requested policy in a pre-digested form that enables
> quick calculation.

That should be pretty easy already, because struct mempolicy is this
cache. All you want to change is just where it is stored.

-Andi
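P.S.: In case it helps to see what I mean by struct mempolicy being
the cache, this is roughly the shape of the pre-digested object. It
is only an illustration; the exact field names and layout may differ
from the patch.

/* Illustrative sketch of a reference counted, pre-digested policy:
 * everything the allocator needs at allocation time is already
 * computed here, so the fast path just dereferences the pointer. */
struct mempolicy {
        atomic_t refcnt;        /* shared by tasks, vmas, shm segments */
        short policy;           /* default/preferred/bind/interleave */
        union {
                struct zonelist *zonelist;              /* bind: prebuilt zone list */
                short preferred_node;                   /* preferred */
                DECLARE_BITMAP(nodes, MAX_NUMNODES);    /* interleave */
                /* undefined for default */
        } v;
};

Hanging it off a cpuset or some other container is then mostly a
question of where the pointer lives and who holds the reference.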