From: Robin H. <ho...@sg...> - 2008-05-16 11:23:03

On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > On Thu, 15 May 2008, Nick Piggin wrote:
> >
> > > Oh, I get that confused because of the mixed up naming conventions
> > > there: unmap_page_range should actually be called zap_page_range. But
> > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> >
> > How is that synchronized with code that walks the same pagetable? These
> > walks may not hold mmap_sem either. I would expect that one could only
> > remove a portion of the pagetable where we have some sort of guarantee
> > that no accesses occur. So the removal of the vma prior ensures that?
>
> I don't really understand the question. If you remove the pte and
> invalidate the TLBs on the remote image's process (importing the page),
> then it can of course try to refault the page in because its vma is
> still there. But you catch that refault in your driver, which can
> prevent the page from being faulted back in.

I think Christoph's question has more to do with faults that are in
flight. A recently requested fault could have just released the last
lock that was holding up the invalidate callout. It would then begin
messaging back the response PFN, which could still be in flight. The
invalidate callout would then fire and do the interrupt shoot-down while
that response was still active (essentially beating the in-flight
response). The invalidate would clear up nothing, and then the response
would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin

From: Robin H. <ho...@sg...> - 2008-05-16 11:50:03

On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote:
> On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > > On Thu, 15 May 2008, Nick Piggin wrote:
> > >
> > > > Oh, I get that confused because of the mixed up naming conventions
> > > > there: unmap_page_range should actually be called zap_page_range. But
> > > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> > >
> > > How is that synchronized with code that walks the same pagetable? These
> > > walks may not hold mmap_sem either. I would expect that one could only
> > > remove a portion of the pagetable where we have some sort of guarantee
> > > that no accesses occur. So the removal of the vma prior ensures that?
> >
> > I don't really understand the question. If you remove the pte and
> > invalidate the TLBs on the remote image's process (importing the page),
> > then it can of course try to refault the page in because its vma is
> > still there. But you catch that refault in your driver, which can
> > prevent the page from being faulted back in.
>
> I think Christoph's question has more to do with faults that are in
> flight. A recently requested fault could have just released the last
> lock that was holding up the invalidate callout. It would then begin
> messaging back the response PFN, which could still be in flight. The
> invalidate callout would then fire and do the interrupt shoot-down while
> that response was still active (essentially beating the in-flight
> response). The invalidate would clear up nothing, and then the response
> would insert the PFN after it is no longer the correct PFN.

I just looked over XPMEM. I think we could make this work. We already
have a list of active faults which is protected by a simple spinlock.
I would need to nest this lock within another lock protecting our PFN
table (currently it is a mutex), and then the invalidate interrupt
handler would need to mark the fault as invalid (a flag which is also
already there).

I think my sticking points with the interrupt method remain fault
containment and timeout. The inability of the ia64 processor to provide
predictive failures for the read/write of memory on other partitions
prevents us from being able to contain the failure. I don't think we can
get the information we would need to do the invalidate without
introducing fault containment issues, which has been a continuous area
of concern for our customers.

Thanks,
Robin

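A minimal sketch of the locking described above, with illustrative names
(xpmem_seg, xpmem_fault and the helpers are assumptions, not the real
XPMEM code): the invalidate path marks in-flight faults stale under the
fault-list spinlock, which nests inside the PFN-table lock, so a late
PFN response can be detected and discarded instead of inserting a stale
PFN.

/* Illustrative sketch only; names and layout are assumptions. */
struct xpmem_fault {
	struct list_head list;
	unsigned long vaddr;
	int invalid;			/* set by the invalidate path */
};

struct xpmem_seg {
	struct mutex pfn_table_lock;	/* protects the PFN table */
	spinlock_t fault_lock;		/* protects active_faults, nests inside */
	struct list_head active_faults;
};

/* Invalidate callout: mark any in-flight fault stale before tear-down. */
static void xpmem_seg_invalidate(struct xpmem_seg *seg,
				 unsigned long start, unsigned long end)
{
	struct xpmem_fault *f;
	unsigned long flags;

	spin_lock_irqsave(&seg->fault_lock, flags);
	list_for_each_entry(f, &seg->active_faults, list)
		if (f->vaddr >= start && f->vaddr < end)
			f->invalid = 1;	/* beats an in-flight PFN response */
	spin_unlock_irqrestore(&seg->fault_lock, flags);
	/* ...then clear the affected PFN table entries and flush remote TLBs... */
}

/* PFN response arriving later: discard it if the fault was invalidated. */
static int xpmem_seg_fault_done(struct xpmem_seg *seg,
				struct xpmem_fault *f, unsigned long pfn)
{
	unsigned long flags;
	int stale;

	mutex_lock(&seg->pfn_table_lock);
	spin_lock_irqsave(&seg->fault_lock, flags);
	stale = f->invalid;
	if (!stale) {
		/* safe to insert pfn into the PFN table here */
	}
	list_del(&f->list);
	spin_unlock_irqrestore(&seg->fault_lock, flags);
	mutex_unlock(&seg->pfn_table_lock);

	return stale ? -EAGAIN : 0;
}
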
From: Andrea A. <an...@qu...> - 2008-05-07 22:22:03

On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
> > had to do any blocking I/O during vmtruncate before, now we have to.
>
> I really suspect we don't really have to, and that it would be better to
> just fix the code that does that.

I'll let you discuss that with Christoph and Robin. The moment I heard
about the schedule-inside-->invalidate_page() requirement I reacted the
same way you did. But I don't see any other real solution for XPMEM
other than spin-looping, halting the scheduler for ages, while the ack
is received from the network device.

But mm_lock is required even without XPMEM. And srcu is also required
without XPMEM to allow ->release to schedule (however, downgrading srcu
to rcu will result in a very small patch; srcu and rcu are about the
same with a kernel supporting preempt=y like 2.6.26).

> I literally think that mm_lock() is an unbelievable piece of utter and
> horrible CRAP.
>
> There's simply no excuse for code like that.

I think it's a great smp scalability optimization over the global lock
you're proposing below.

> No, the simple solution is to just make up a whole new upper-level lock,
> and get that lock *first*. You can then take all the multiple locks at a
> lower level in any order you damn well please.

Unfortunately the lock you're talking about would be:

	static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected, making the VM even
slower than it already becomes when XPMEM support is selected at compile
time).

> And yes, it's one more lock, and yes, it serializes stuff, but:
>
>  - that code had better not be critical anyway, because if it was, then
>    the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_

mmu_notifier_register can take ages. No problem.

>  - parallelism is overrated: it doesn't matter one effing _whit_ if
>    something is a hundred times more parallel, if it's also a hundred
>    times *SLOWER*.

mmu_notifier_register is fine to be a hundred times slower (preempt-rt
will turn all locks into spinlocks, so no problem).

> And here's an admission that I lied: it wasn't *all* clearly crap. I did
> like one part, namely list_del_init_rcu(), but that one should have been
> in a separate patch. I'll happily apply that one.

Sure, I'll split it from the rest if the mmu-notifier-core isn't merged.

My objective has been:

1) add zero overhead to the VM before anybody starts a VM with kvm, and
   still zero overhead for all other tasks except the task where the VM
   runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check,
   which is optimized away too when CONFIG_KVM=n. And even for that check
   my invalidate_page reduces the number of branches to the absolute
   minimum possible.

2) avoid any new cacheline collision in the fast paths to allow numa
   systems not to nearly-crash (mm->mmu_notifier_mm will be shared and
   never written, except during the first mmu_notifier_register)

3) avoid any risk of introducing regressions in 2.6.26 (the patch must be
   obviously safe). Even if mm_lock were a bad idea like you say, it's
   orders of magnitude safer, even if entirely broken, than messing with
   the VM core locking in 2.6.26.

mm_lock (or whatever name you like to give it; I admit mm_lock may not
sound worrisome enough to stop people from getting the idea of calling it
in a fast path) is going to be the real deal for the long term to allow
mmu_notifier_register to serialize against invalidate_range_start/end.
If I fail in 2.6.26 I'll offer maintainership to Christoph as promised,
and you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't
technologies running on cellphones where your global wide spinlock is
optimized away at compile time, and he also has to deal with XPMEM, where
such a spinlock would need to become a rwsem as the anon_vma->sem has to
be taken after it). But let's assume you're entirely right here that
mm_lock is going to be dropped and there's a better way: it's still a
fine solution for 2.6.26. And if you prefer, I can move the whole
mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any
pollution in the core VM; mm_lock will then be invisible to everything
except callers of mmu_notifier_register(), and it will be trivial to
remove later if you really want to add a global spinlock, as there's no
way to be more granular than a _global_ numa-wide spinlock taken before
any i_mmap_lock/anon_vma->lock, without my mm_lock.

From: Andrew M. <ak...@li...> - 2008-05-07 22:31:52

On Thu, 8 May 2008 00:22:05 +0200 Andrea Arcangeli <an...@qu...> wrote:

> > No, the simple solution is to just make up a whole new upper-level lock,
> > and get that lock *first*. You can then take all the multiple locks at a
> > lower level in any order you damn well please.
>
> Unfortunately the lock you're talking about would be:
>
> 	static spinlock_t global_lock = ...
>
> There's no way to make it more granular.
>
> So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
> we'd need to take that extremely wide spinlock first (and even worse,
> later it would become a rwsem when XPMEM is selected, making the VM even
> slower than it already becomes when XPMEM support is selected at compile
> time).

Nope.  We only need to take the global lock before taking *two or more*
of the per-vma locks.

I really wish I'd thought of that.

From: Linus T. <tor...@li...> - 2008-05-08 01:57:22

On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> So because the bitflag can't prevent taking the same lock twice on two
> different vmas in the same mm, we still can't remove the sorting

Andrea.

Take five minutes. Take a deep breath. And *think* about actually reading
what I wrote.

The bitflag *can* prevent taking the same lock twice. It just needs to be
in the right place.

Let me quote it for you:

> So the flag wouldn't be one of the VM_xyzzy flags, and would require
> adding a new field to "struct anon_vma()"

IOW, just make it be in that anon_vma (and the address_space). No sorting
required.

> I think it's more interesting to put a cap on the number of vmas to
> min(1024,max_map_count). The sort time on an 8k array runs in constant
> time.

Shut up already. It's not constant time just because you can cap the
overhead. We're not in a university, and we care about performance, not
your made-up big-O notation.

		Linus

From: Andrea A. <an...@qu...> - 2008-05-08 02:24:20

On Wed, May 07, 2008 at 06:57:05PM -0700, Linus Torvalds wrote:
> Take five minutes. Take a deep breath. And *think* about actually reading
> what I wrote.
>
> The bitflag *can* prevent taking the same lock twice. It just needs to be
> in the right place.

It's not that I didn't read it, but to do it I'd have to grow every
anon_vma by 8 bytes. I thought it was implicit that the conclusion of
your email was that it couldn't possibly make sense to grow the size of
each anon_vma by 33% when nobody has loaded the kvm or gru or xpmem
kernel modules.

It surely isn't my preferred solution, while capping the number of vmas
to 1024 means sort() will make around 10240 steps; Matt can tell the
exact number. The big cost shouldn't be in sort. Even 512 vmas will be
more than enough for us in fact.

Note that I've a cond_resched in the sort compare function and I can
re-add the signal_pending check. I had the signal_pending check in the
original version that didn't use sort() but was doing an inner loop; I
thought signal_pending wasn't necessary after speeding it up with
sort(). But I can add it again, so then we'll only fail to abort inside
sort() and we'll be able to break the loop while taking all the
spinlocks. With such a small array that can't be an issue, and the
result will surely run faster than stop_machine, with zero ram and cpu
overhead for the VM (besides, stop_machine can't work or we'd need to
disable preemption between invalidate_range_start/end, even after
removing the xpmem schedule-inside-mmu-notifier requirement).

From: Linus T. <tor...@li...> - 2008-05-08 02:32:21

Andrea, I'm not interested. I've stated my standpoint: the code being
discussed is crap. We're not doing that. Not in the core VM.

I gave solutions that I think aren't crap, but I already also stated
that I have no problems not merging it _ever_ if no solution can be
found. The whole issue simply isn't even worth the pain, imnsho.

		Linus

From: Linus T. <tor...@li...> - 2008-05-08 04:15:05

On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> But removing sort isn't worth it if it takes away ram from the VM even
> when global_mm_lock will never be called.

Andrea, you really are a piece of work. Your arguments have been bogus
crap that didn't even understand what was going on from the beginning,
and now you continue to do that.

What exactly "takes away ram" from the VM? The notion of adding a flag to
"struct anon_vma"? The one that already has a 4 byte padding thing on
x86-64 just after the spinlock? And that on 32-bit x86 (with less than
256 CPU's) would have two bytes of padding if we didn't just make the
spinlock type unconditionally 32 bits rather than the 16 bits we actually
_use_?

IOW, you didn't even look at it, did you?

But whatever. I clearly don't want a patch from you anyway, so ..

		Linus

From: Andrea A. <an...@qu...> - 2008-05-08 05:20:22

On Wed, May 07, 2008 at 09:14:45PM -0700, Linus Torvalds wrote:
> IOW, you didn't even look at it, did you?

Actually I looked both at the struct and at the slab alignment just in
case it was changed recently. Now after reading your mail I also
compiled it just in case.

2.6.26-rc1:

# name  <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma       260    576      24    144    1 : tunables  120   60    8 : slabdata      4      4      0
                               ^^    ^^^

2.6.26-rc1 + below patch:

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int flag:1;
 };
 
 #ifdef CONFIG_MMU

# name  <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma       250    560      32    112    1 : tunables  120   60    8 : slabdata      5      5      0
                               ^^    ^^^

Not a big deal, sure, to grow it 33% (it's so small anyway), but I don't
see the point in growing it. sort() can be made interruptible by
signals, and until it is we can cap it to 512 vmas, making the worst
case take dozens of usecs. I fail to see what you have against sort().

Again: if a vma bitflag + global lock could have avoided sort and run
O(N) instead of the current O(N*log(N)) I would have done that
immediately; in fact I was in the process of doing it when you posted
the followup.

Nothing personal here, just staying technical. Hope you too.

From: Pekka E. <pe...@cs...> - 2008-05-08 05:27:44

On Thu, May 8, 2008 at 8:20 AM, Andrea Arcangeli <an...@qu...> wrote:
> Actually I looked both at the struct and at the slab alignment just in
> case it was changed recently. Now after reading your mail I also
> compiled it just in case.
>
> @@ -27,6 +27,7 @@ struct anon_vma {
>  struct anon_vma {
>  	spinlock_t lock;	/* Serialize access to vma list */
>  	struct list_head head;	/* List of private "related" vmas */
> +	int flag:1;
>  };

You might want to read carefully what Linus wrote:

> The one that already has a 4 byte padding thing on x86-64 just after the
> spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
> bytes of padding if we didn't just make the spinlock type unconditionally
> 32 bits rather than the 16 bits we actually _use_?

So you need to add the flag _after_ ->lock and _before_ ->head....

			Pekka

From: Linus T. <tor...@li...> - 2008-05-08 15:04:15

On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> Actually I looked both at the struct and at the slab alignment just in
> case it was changed recently. Now after reading your mail I also
> compiled it just in case.

Put the flag after the spinlock, not after the "list_head".

Also, we'd need to make it

	unsigned short flag:1;

_and_ change spinlock_types.h to make the spinlock size actually match
the required size (right now we make it an "unsigned int slock" even when
we actually only use 16 bits).

See the #if (NR_CPUS < 256) code in <asm-x86/spinlock.h>.

		Linus

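A sketch of the layout Linus is describing (the field name is an
illustrative assumption, not a real patch): the flag slots into the
padding that already follows the spinlock on x86-64, so the structure
does not grow.

/* Illustrative sketch of the suggested layout only. */
struct anon_vma {
	spinlock_t lock;		 /* Serialize access to vma list */
	unsigned short global_mm_lock:1; /* fits in the padding after ->lock
					  * on x86-64, and on 32-bit x86 once
					  * the spinlock shrinks to the 16
					  * bits it actually uses */
	struct list_head head;		 /* List of private "related" vmas */
};
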
From: Linus T. <tor...@li...> - 2008-05-08 16:12:06

On Thu, 8 May 2008, Linus Torvalds wrote:
>
> Also, we'd need to make it
>
> 	unsigned short flag:1;
>
> _and_ change spinlock_types.h to make the spinlock size actually match
> the required size (right now we make it an "unsigned int slock" even when
> we actually only use 16 bits).

Btw, this is an issue only on 32-bit x86, because on 64-bit we already
have the padding due to the alignment of the 64-bit pointers in the
list_head (so there's already empty space there).

On 32-bit, the alignment of list_head is obviously just 32 bits, so right
now the structure is "perfectly packed" and doesn't have any empty space.
But that's just because the spinlock is unnecessarily big.

(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
structure really will grow. That's a very odd configuration, though, and
not one I feel we really need to care about).

		Linus

From: Andrea A. <an...@qu...> - 2008-05-08 22:01:13

On Thu, May 08, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> Btw, this is an issue only on 32-bit x86, because on 64-bit we already
> have the padding due to the alignment of the 64-bit pointers in the
> list_head (so there's already empty space there).
>
> On 32-bit, the alignment of list_head is obviously just 32 bits, so right
> now the structure is "perfectly packed" and doesn't have any empty space.
> But that's just because the spinlock is unnecessarily big.
>
> (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
> structure really will grow. That's a very odd configuration, though, and
> not one I feel we really need to care about).

I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only, without playing
   any non obvious trick:

	struct anon_vma {
		spinlock_t lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

	struct address_space {
		spinlock_t private_lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

2) add a:

	#define AS_GLOBAL_MM_LOCK (__GFP_BITS_SHIFT + 2) /* global_mm_locked */

   and use address_space->flags with bitops.

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev, because the list can't be browsed while the
lock is taken, so taking the lock, then setting the bit, and clearing
the bit before unlocking is safe. The LSB will always read 0 even if
it's under list_add modification when the global spinlock isn't taken.
And after taking the anon_vma lock we can switch the LSB from 0 to 1
without races, and the 1 will be protected by the global spinlock.

The above solution is zero cost for 32bit too, so I prefer it.

So I now agree with you this is a great idea on how to remove sort() and
vmalloc, and especially vfree, without increasing the VM footprint.

I'll send an update with this for review very shortly, and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea

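A minimal sketch of the LSB trick described above (illustrative only, not
the patch that was eventually posted): because the vma list cannot be
walked while anon_vma->lock is held, the low bit of head.next can double
as an "already locked by this mm-wide pass" marker.

/* Illustrative sketch of the LSB-marking idea. */
static inline int anon_vma_marked(struct anon_vma *anon_vma)
{
	return (unsigned long)anon_vma->head.next & 1;
}

static inline void anon_vma_mark(struct anon_vma *anon_vma)
{
	/* only done while anon_vma->lock and the global lock are held */
	anon_vma->head.next = (struct list_head *)
		((unsigned long)anon_vma->head.next | 1);
}

static inline void anon_vma_unmark(struct anon_vma *anon_vma)
{
	/* cleared before anon_vma->lock is released */
	anon_vma->head.next = (struct list_head *)
		((unsigned long)anon_vma->head.next & ~1UL);
}

/*
 * Usage sketch during the mm-wide locking pass, global lock held:
 *
 *	if (!anon_vma_marked(vma->anon_vma)) {
 *		spin_lock(&vma->anon_vma->lock);
 *		anon_vma_mark(vma->anon_vma);
 *	}
 */
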
From: Andrea A. <an...@qu...> - 2008-05-07 22:44:05

On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
> Nope.  We only need to take the global lock before taking *two or more*
> of the per-vma locks.
>
> I really wish I'd thought of that.

I don't see how you can avoid taking the system-wide-global lock before
every single anon_vma->lock/i_mmap_lock out there without mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock, and we'd still have to take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!

From: Linus T. <tor...@li...> - 2008-05-07 22:44:44

On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> Unfortunately the lock you're talking about would be:
>
> 	static spinlock_t global_lock = ...
>
> There's no way to make it more granular.

Right. So what?

It's still about a million times faster than what the code does now.

Your comment about "great smp scalability optimization" just shows that
you're a moron. It is no such thing. The fact is, it's a horrible
pessimization, since even SMP will be *SLOWER*. It will just be "less
slower" when you have a million CPU's and they all try to do this at the
same time (which probably never ever happens).

In other words, "scalability" is totally meaningless. The only thing that
matters is *performance*. If the "scalable" version performs WORSE, then
it is simply worse. Not better. End of story.

> mmu_notifier_register can take ages. No problem.

So what you're saying is that performance doesn't matter?

So why do you do the ugly crazy hundred-line implementation, when a
simple two-liner would do equally well?

Your arguments are crap.

Anyway, discussion over. This code doesn't get merged. It doesn't get
merged before 2.6.26, and it doesn't get merged _after_ either.

Rewrite the code, or not. I don't care. I'll very happily not merge crap
for the rest of my life.

			Linus

From: Andrew M. <ak...@li...> - 2008-05-07 23:04:13

On Thu, 8 May 2008 00:44:06 +0200 Andrea Arcangeli <an...@qu...> wrote:

> On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
> > Nope.  We only need to take the global lock before taking *two or more*
> > of the per-vma locks.
> >
> > I really wish I'd thought of that.
>
> I don't see how you can avoid taking the system-wide-global lock before
> every single anon_vma->lock/i_mmap_lock out there without mm_lock.
>
> Please note, we can't allow a thread to be in the middle of
> zap_page_range while mmu_notifier_register runs.
>
> vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
> than one lock, and we'd still have to take the global-system-wide lock
> _before_ this single i_mmap_lock and no other lock at all.
>
> Please elaborate, thanks!

umm...

	CPU0:				CPU1:

	spin_lock(a->lock);		spin_lock(b->lock);
	spin_lock(b->lock);		spin_lock(a->lock);

bad.

	CPU0:				CPU1:

	spin_lock(global_lock)		spin_lock(global_lock);
	spin_lock(a->lock);		spin_lock(b->lock);
	spin_lock(b->lock);		spin_lock(a->lock);

is OK.

	CPU0:				CPU1:

	spin_lock(global_lock)
	spin_lock(a->lock);		spin_lock(b->lock);
	spin_lock(b->lock);		spin_unlock(b->lock);
					spin_lock(a->lock);
					spin_unlock(a->lock);

also OK.

As long as all code paths which can take two-or-more locks are all
covered by the global lock there is no deadlock scenario.  If a thread
takes just a single instance of one of these locks without taking the
global_lock then there is also no deadlock.

Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
added mm_lock() thing and we also take both those locks at the same time
in regular code, we're probably screwed.

From: Andrea A. <an...@qu...> - 2008-05-07 23:04:13

On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote:
> On Thu, 8 May 2008, Andrea Arcangeli wrote:
> >
> > Unfortunately the lock you're talking about would be:
> >
> > 	static spinlock_t global_lock = ...
> >
> > There's no way to make it more granular.
>
> Right. So what?
>
> It's still about a million times faster than what the code does now.

mmu_notifier_register only runs when windows or linux or macosx boots.
Who could ever care about the msec spent in mm_lock compared to the time
it takes linux to boot?

What you're proposing is to slow down AIM and certain benchmarks 20% or
more for all users, just to save at most 1msec when starting a VM.

> Rewrite the code, or not. I don't care. I'll very happily not merge crap
> for the rest of my life.

If you want the global lock I'll do it, no problem; I just think it's an
obviously inferior solution for 99% of users out there (including kvm
users, who will also have to take that lock while kvm userland runs).

In my view the most we should do in this area is to reduce further the
max number of locks to take if max_map_count already isn't enough.

From: Andrea A. <an...@qu...> - 2008-05-07 23:04:16

To remove mm_lock without adding a horrible system-wide lock before
every i_mmap_lock etc., we'd have to remove invalidate_range_begin/end.
Then we can return to an older approach of doing only invalidate_page
and serializing it with the PT lock against get_user_pages. That works
fine for KVM, but GRU will have to flush the tlb once every time we drop
the PT lock, that is, once per 512 ptes on x86-64 etc., instead of a
single time for the whole range regardless of how large the range is.

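A rough sketch of the invalidate_page-only scheme being described,
assuming an ->invalidate_page(mn, mm, address) style hook as in the
patch series under discussion; example_smmu and the two helpers are
hypothetical stand-ins for a driver's secondary-MMU state, not real
code.

/* Illustrative sketch only; not the real mmu-notifier or driver code. */
struct example_smmu {			/* hypothetical driver state */
	struct mmu_notifier mn;
	spinlock_t lock;
	/* ... secondary page tables / TLB handle ... */
};

/* hypothetical helpers standing in for real secondary-MMU operations */
static void example_drop_secondary_pte(struct example_smmu *s,
					unsigned long address) { }
static void example_flush_secondary_tlb(struct example_smmu *s,
					unsigned long address) { }

static void example_invalidate_page(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long address)
{
	struct example_smmu *smmu = container_of(mn, struct example_smmu, mn);

	/* called per page, at a point serialized by the PT lock against
	 * the driver's own get_user_pages/follow_page fault path */
	spin_lock(&smmu->lock);
	example_drop_secondary_pte(smmu, address);
	example_flush_secondary_tlb(smmu, address);	/* without range hooks
							 * the flush happens per
							 * page or per PT-lock
							 * drop, never once per
							 * whole range */
	spin_unlock(&smmu->lock);
}
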
From: Linus T. <tor...@li...> - 2008-05-07 23:13:41

On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> mmu_notifier_register only runs when windows or linux or macosx boots.
> Who could ever care about the msec spent in mm_lock compared to the time
> it takes linux to boot?

Andrea, you're *this* close to going on my list of people who it is not
worth reading email from, and where it's better for everybody involved if
I just teach my spam-filter about it.

That code was CRAP.

That code was crap whether it's used once, or whether it's used a million
times. Stop making excuses for it just because it's not performance-
critical.

So give it up already. I told you what the non-crap solution was. It's
simpler, faster, and is about two lines of code compared to the crappy
version (which was what - 200 lines of crap with a big comment on top of
it just to explain the idiocy?).

So until you can understand the better solution, don't even bother
emailing me, ok? Because the next email I get from you that shows the
intelligence level of a gnat, I'll just give up and put you in a
spam-filter.

Because my IQ goes down just from reading your mails. I can't afford to
continue.

		Linus

From: Linus T. <tor...@li...> - 2008-05-07 23:19:24

On Wed, 7 May 2008, Andrew Morton wrote:
>
> Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
> added mm_lock() thing and we also take both those locks at the same time
> in regular code, we're probably screwed.

No, just use the normal static ordering for that case: one type of lock
goes before the other kind. If those locks nest in regular code, you have
to do that *anyway*.

The code that can take many locks, will have to get the global lock *and*
order the types, but that's still trivial. It's something like

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->anon_vma)
			spin_lock(&vma->anon_vma->lock);
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
	}
	spin_unlock(&global_lock);

and now everybody follows the rule that "anon_vma->lock" precedes
"i_mmap_lock". So there can be no ABBA deadlock between the normal users
and the many-locks version, and there can be no ABBA deadlock between
many-locks-takers because they use the global_lock to serialize.

This really isn't rocket science, guys.

(I really hope and believe that they don't nest anyway, and that you can
just use a single for-loop for the many-lock case)

		Linus

From: Benjamin H. <be...@ke...> - 2008-05-07 23:29:49

On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
>
> Please note, we can't allow a thread to be in the middle of
> zap_page_range while mmu_notifier_register runs.

You said yourself that mmu_notifier_register can be as slow as you
want ... what about using stop_machine for it ? I'm not even joking
here :-)

> vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
> than one lock, and we'd still have to take the global-system-wide lock
> _before_ this single i_mmap_lock and no other lock at all.

Ben.

From: Christoph L. <cla...@sg...> - 2008-05-07 23:39:41

On Wed, 7 May 2008, Linus Torvalds wrote:

> The code that can take many locks, will have to get the global lock *and*
> order the types, but that's still trivial. It's something like
>
> 	spin_lock(&global_lock);
> 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> 		if (vma->anon_vma)
> 			spin_lock(&vma->anon_vma->lock);
> 	}
> 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> 		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
> 			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> 	}
> 	spin_unlock(&global_lock);

Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.

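One way around the double-take, along the lines of the flag idea
discussed elsewhere in the thread, is to skip objects this pass has
already locked. The mm_lock_taken field below is a hypothetical marker
(a bit in the padding or in the list-pointer LSB, as suggested earlier);
this is a sketch, not the code that was eventually merged.

/* Illustrative sketch: take each distinct lock only once per pass. */
static void mm_lock_all_locks(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct anon_vma *anon_vma = vma->anon_vma;

		if (anon_vma && !anon_vma->mm_lock_taken) {
			spin_lock(&anon_vma->lock);
			anon_vma->mm_lock_taken = 1;	/* hypothetical flag */
		}
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct address_space *mapping =
			vma->vm_file ? vma->vm_file->f_mapping : NULL;

		if (!vma->anon_vma && mapping && !mapping->mm_lock_taken) {
			spin_lock(&mapping->i_mmap_lock);
			mapping->mm_lock_taken = 1;	/* hypothetical flag */
		}
	}
	spin_unlock(&global_lock);
	/* the matching unlock pass clears the flags before dropping each lock */
}
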
From: Andrea A. <an...@qu...> - 2008-05-07 23:39:53

Hi Andrew,

On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
> 	CPU0:				CPU1:
>
> 	spin_lock(global_lock)
> 	spin_lock(a->lock);		spin_lock(b->lock);
					==================
					mmu_notifier_register()
> 	spin_lock(b->lock);		spin_unlock(b->lock);
> 					spin_lock(a->lock);
> 					spin_unlock(a->lock);
>
> also OK.

But the problem is that we have to stop the critical section in the
place I marked with "========" while mmu_notifier_register runs;
otherwise the driver calling mmu_notifier_register won't know if it's
safe to start establishing secondary sptes/tlbs. If the driver
establishes sptes/tlbs with get_user_pages/follow_page, the page could
be freed immediately afterwards when zap_page_range starts. So if CPU1
doesn't take the global_lock before proceeding in zap_page_range (inside
the vmtruncate i_mmap_lock, represented as b->lock above) we're in
trouble.

What we can do is to replace mm_lock with a spin_lock(&global_lock),
but only if all places that take i_mmap_lock take the global lock first,
and that hurts the scalability of the fast paths that are performance
critical, like vmtruncate and anon_vma->lock. Perhaps they're not so
performance critical, but surely much more performance critical than
mmu_notifier_register ;).

The idea of polluting various scalable paths in the VM, like the
truncate() syscall, with a global spinlock frightens me. I'd rather
return to invalidate_page() inside the PT lock, removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock that the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slow down the VM when
MMU_NOTIFIER=y, even if it's a bit slower for GRU; for KVM performance
is about the same with or without invalidate_range_start/end.

I didn't think anybody could care about how long mmu_notifier_register
takes until it returns, compared to all the heavyweight operations it
takes to start a VM (not only in the kernel but in the guest too). In
fact, if it's security that we worry about here, we can put a cap on the
_time_ that mmu_notifier_register can take before it fails, and we fail
to start a VM if it takes more than 5sec; that's still fine, as the
failure could happen for other reasons too, like vmalloc shortage, and
we already handle that just fine. This 5sec delay can't possibly happen
in practice anyway in the only interesting scenario, just like the
vmalloc shortage. This is obviously a superior solution to polluting the
VM with a useless global spinlock that will destroy truncate/AIM on
numa.

Anyway Christoph, I uploaded my last version here:

   http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to take over from it. I kind of feel my time
may now be better spent emulating the mmu-notifier-core with kprobes.

From: Pekka E. <pe...@cs...> - 2008-05-08 05:30:16

On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg <pe...@cs...> wrote:
> You might want to read carefully what Linus wrote:
>
> > The one that already has a 4 byte padding thing on x86-64 just after the
> > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
> > bytes of padding if we didn't just make the spinlock type unconditionally
> > 32 bits rather than the 16 bits we actually _use_?
>
> So you need to add the flag _after_ ->lock and _before_ ->head....

Oh, I should have taken my morning coffee first; before ->lock works
obviously as well.

			Pekka

From: Andrea A. <an...@qu...> - 2008-05-08 05:49:32

On Thu, May 08, 2008 at 08:30:20AM +0300, Pekka Enberg wrote:
> On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg <pe...@cs...> wrote:
> > You might want to read carefully what Linus wrote:
> >
> > > The one that already has a 4 byte padding thing on x86-64 just after the
> > > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two
> > > bytes of padding if we didn't just make the spinlock type unconditionally
> > > 32 bits rather than the 16 bits we actually _use_?
> >
> > So you need to add the flag _after_ ->lock and _before_ ->head....
>
> Oh, I should have taken my morning coffee first; before ->lock works
> obviously as well.

Sorry, Linus is right: I didn't realize "after the spinlock" was
literally after the spinlock. I didn't see the 4 byte padding when I
read the code and put the flag:1 in. If put between ->lock and ->head
it doesn't take more memory on x86-64, as described.

So the next step would be to find another place like that in the
address_space. Perhaps after the private_lock, using the same trick, or
perhaps the slab alignment won't actually alter the number of slabs per
page regardless. I leave that to Christoph; he's surely better than me
at doing this. I give it up entirely, and I consider my attempt to merge
a total failure and I strongly regret it.

On a side note, the anon_vma will change to this when XPMEM support is
compiled in:

 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };

Not sure if it'll grow in size or not after that, but let's say it's not
a big deal.
