From: Robin H. <ho...@sg...> - 2008-05-08 00:38:34
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
> On Wed, 7 May 2008, Andrea Arcangeli wrote:
> >
> > I think the spinlock->rwsem conversion is ok under config option, as
> > you can see I complained myself to various of those patches and I'll
> > take care they're in a mergeable state the moment I submit them. What
> > XPMEM requires are different semantics for the methods, and we never
> > had to do any blocking I/O during vmtruncate before, now we have to.
>
> I really suspect we don't really have to, and that it would be better to
> just fix the code that does that.

That fix is going to be fairly difficult. I will argue impossible.

First, a little background. SGI allows one large numa-link connected
machine to be broken into separate single-system images which we call
partitions. XPMEM allows, at its most extreme, one process on one
partition to grant access to a portion of its virtual address range to
processes on another partition. Those processes can then fault pages and
directly share the memory.

In order to invalidate the remote page table entries, we need to message
(using XPC) to the remote side. The remote side needs to acquire the
importing process's mmap_sem and call zap_page_range(). Between the
messaging and the acquiring of a sleeping lock, I would argue this will
require sleeping locks in the path prior to the mmu_notifier
invalidate_*() callouts.

On a side note, we currently have XPMEM working on x86_64 SSI, and ia64
cross-partition. We are in the process of getting XPMEM working on
x86_64 cross-partition in support of UV.

Thanks,
Robin Holt
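The importing-partition half of the recall Robin describes boils down to: take the importer's mmap_sem (a sleeping lock), tear down the PTEs with zap_page_range(), then ack the message so the exporter can proceed. Below is a minimal sketch of that shape, assuming a hypothetical xpc_send_recall_ack() helper in place of SGI's real XPC messaging; only mmap_sem and zap_page_range() are the real kernel objects named above.

        /*
         * Sketch only: xpmem_recall_pages() and xpc_send_recall_ack() are
         * hypothetical names standing in for SGI's XPMEM/XPC code; the
         * locking shape is the point, not the API.
         */
        #include <linux/mm.h>
        #include <linux/rwsem.h>

        /* hypothetical messaging helper, declared so the sketch compiles */
        extern void xpc_send_recall_ack(short partition,
                                        unsigned long start, unsigned long len);

        static void xpmem_recall_pages(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long start, unsigned long len,
                                       short exporting_partition)
        {
                down_read(&mm->mmap_sem);               /* sleeping lock */
                zap_page_range(vma, start, len, NULL);  /* drop importer PTEs */
                up_read(&mm->mmap_sem);

                /* Ack over the numa-link so the exporting side can unblock
                 * its mmu_notifier invalidate_* callout. */
                xpc_send_recall_ack(exporting_partition, start, len);
        }

Because the exporting side cannot finish its invalidate until this ack arrives, any lock held across the invalidate_* callouts on the exporting side must itself be sleepable, which is the spinlock->rwsem conversion under discussion.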
From: Linus T. <tor...@li...> - 2008-05-08 00:04:40
On Wed, 7 May 2008, Christoph Lameter wrote:
>
> Multiple vmas may share the same mapping or refer to the same anonymous
> vma. The above code will deadlock since we may take some locks multiple
> times.

Ok, so that actually _is_ a problem. It would be easy enough to also add
just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
suggestion of just using stop-machine is actually the right one just
because it's _so_ simple.

(That said, we're not running out of vm flags yet, and if we were, we
could just add another word. We're already wasting that space right now
on 64-bit by calling it "unsigned long").

Linus
From: Andrea A. <an...@qu...> - 2008-05-07 23:45:17
On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote:
>
> On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
> >
> > Please note, we can't allow a thread to be in the middle of
> > zap_page_range while mmu_notifier_register runs.
>
> You said yourself that mmu_notifier_register can be as slow as you
> want ... what about you use stop_machine for it ? I'm not even joking
> here :-)

We can put a cap on the time plus a cap on the number of vmas. It's not
important if it fails: in the only useful case we know of, it won't be
slow at all. The failure can happen because the time cap or the vma cap
triggers, or because there's a vmalloc shortage. We handle the failure in
userland of course. There are zillions of allocations needed anyway, and
any one of them can fail, so this isn't a new failure path; it's the same
failure path that always existed before mmu_notifiers existed.

I can't possibly see how adding a new system-wide lock that forces all
truncates to be serialized against each other, practically eliminating
the need for the i_mmap_lock, could be superior to an approach that adds
no overhead to the VM at all and only requires kvm to pay an additional
cost when it starts up.

Furthermore, the only reason I had to implement mm_lock was to fix the
invalidate_range_start/end model. If we go with only invalidate_page and
invalidate_pages called inside the PT lock, and we use the PT lock to
serialize, we don't need mm_lock anymore and no new lock from the VM
either. I tried to push for that, but everyone else wanted
invalidate_range_start/end. I did the only possible thing: make
invalidate_range_start safe, to make everyone happy without slowing down
the VM.
From: Linus T. <tor...@li...> - 2008-05-07 23:40:02
On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> At least for mmu-notifier-core given I obviously am the original
> author of that code, I hope the From: of the email was enough even if
> an additional From: andrea was missing in the body.

Ok, this whole series of patches has just been such a disaster that I'm
(a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b)
that I'm still looking at it at all, but I am.

And quite frankly, the more I look, and the more answers from you I get,
the less I like it. And I didn't like it that much to start with, as you
may have noticed.

You say that "At least for mmu-notifier-core given I obviously am the
original author of that code", but that is not at all obvious either. One
of the reasons I stated that authorship seems to have been thrown away is
very much exactly in that first mmu-notifier-core patch:

        + * linux/mm/mmu_notifier.c
        + *
        + * Copyright (C) 2008 Qumranet, Inc.
        + * Copyright (C) 2008 SGI
        + *             Christoph Lameter <cla...@sg...>

so I would very strongly dispute that it's "obvious" that you are the
original author of the code there. So there was a reason why I said that
I thought authorship had been lost somewhere along the way.

Linus
From: Andrea A. <an...@qu...> - 2008-05-07 23:39:53
Hi Andrew,

On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
> CPU0:                         CPU1:
>
> spin_lock(global_lock)
> spin_lock(a->lock);           spin_lock(b->lock);
================== mmu_notifier_register()
> spin_lock(b->lock);           spin_unlock(b->lock);
>                               spin_lock(a->lock);
>                               spin_unlock(a->lock);
>
> also OK.

But the problem is that we have to stop the critical section at the place
I marked with "========" while mmu_notifier_register runs. Otherwise the
driver calling mmu_notifier_register won't know whether it's safe to start
establishing secondary sptes/tlbs. If the driver establishes sptes/tlbs
with get_user_pages/follow_page, the page could be freed immediately
afterwards when zap_page_range starts. So if CPU1 doesn't take the
global_lock before proceeding into zap_page_range (inside vmtruncate's
i_mmap_lock, represented as b->lock above), we're in trouble.

What we can do is replace mm_lock with a spin_lock(&global_lock), but only
if all places that take i_mmap_lock take the global lock first, and that
hurts the scalability of fast paths that are performance critical, like
vmtruncate and anon_vma->lock. Perhaps they're not so performance
critical, but surely much more performance critical than
mmu_notifier_register ;).

The idea of polluting various scalable paths like the truncate() syscall
in the VM with a global spinlock frightens me. I'd rather return to
invalidate_page() inside the PT lock, removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock, which the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slow down the VM when MMU_NOTIFIER=y,
even if it's a bit slower for GRU; for KVM performance is about the same
with or without invalidate_range_start/end.

I didn't think anybody could care how long mmu_notifier_register takes
until it returns, compared to all the heavyweight operations that happen
when starting a VM (not only in the kernel but in the guest too). In fact
if it's security we worry about here, we can put a cap on the _time_
mmu_notifier_register can take before it fails, and we fail to start a VM
if it takes more than 5sec. That's still fine, as the failure could happen
for other reasons too, like a vmalloc shortage, and we already handle it
just fine. This 5sec delay can't possibly happen in practice anyway in the
only interesting scenario, just like the vmalloc shortage. This is
obviously a superior solution to polluting the VM with a useless global
spinlock that will destroy truncate/AIM on numa.

Anyway Christoph, I uploaded my last version here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to take over from it; I kind of feel my time now
may be better spent emulating the mmu-notifier-core with kprobes.
From: Christoph L. <cla...@sg...> - 2008-05-07 23:39:41
On Wed, 7 May 2008, Linus Torvalds wrote:
> The code that can take many locks, will have to get the global lock *and*
> order the types, but that's still trivial. It's something like
>
>        spin_lock(&global_lock);
>        for (vma = mm->mmap; vma; vma = vma->vm_next) {
>                if (vma->anon_vma)
>                        spin_lock(&vma->anon_vma->lock);
>        }
>        for (vma = mm->mmap; vma; vma = vma->vm_next) {
>                if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
>                        spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
>        }
>        spin_unlock(&global_lock);

Multiple vmas may share the same mapping or refer to the same anonymous
vma. The above code will deadlock since we may take some locks multiple
times.
From: Benjamin H. <be...@ke...> - 2008-05-07 23:29:49
On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote:
>
> Please note, we can't allow a thread to be in the middle of
> zap_page_range while mmu_notifier_register runs.

You said yourself that mmu_notifier_register can be as slow as you
want ... what about you use stop_machine for it ? I'm not even joking
here :-)

> vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
> than one lock and we've to still take the global-system-wide lock
> _before_ this single i_mmap_lock and no other lock at all.

Ben.
From: Linus T. <tor...@li...> - 2008-05-07 23:19:24
On Wed, 7 May 2008, Andrew Morton wrote:
>
> Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
> added mm_lock() thing and we also take both those locks at the same time in
> regular code, we're probably screwed.

No, just use the normal static ordering for that case: one type of lock
goes before the other kind. If those locks nest in regular code, you have
to do that *anyway*.

The code that can take many locks, will have to get the global lock *and*
order the types, but that's still trivial. It's something like

        spin_lock(&global_lock);
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (vma->anon_vma)
                        spin_lock(&vma->anon_vma->lock);
        }
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
                        spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
        }
        spin_unlock(&global_lock);

and now everybody follows the rule that "anon_vma->lock" precedes
"i_mmap_lock". So there can be no ABBA deadlock between the normal users
and the many-locks version, and there can be no ABBA deadlock between
many-locks-takers because they use the global_lock to serialize.

This really isn't rocket science, guys.

(I really hope and believe that they don't nest anyway, and that you can
just use a single for-loop for the many-lock case)

Linus
From: Linus T. <tor...@li...> - 2008-05-07 23:13:41
On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> mmu_notifier_register only runs when windows or linux or macosx
> boots. Who could ever care of the msec spent in mm_lock compared to
> the time it takes to linux to boot?

Andrea, you're *this* close to going to my list of people who it is not
worth reading email from, and where it's better for everybody involved if
I just teach my spam-filter about it.

That code was CRAP. That code was crap whether it's used once, or whether
it's used a million times. Stop making excuses for it just because it's
not performance-critical.

So give it up already. I told you what the non-crap solution was. It's
simpler, faster, and is about two lines of code compared to the crappy
version (which was what - 200 lines of crap with a big comment on top of
it just to explain the idiocy?).

So until you can understand the better solution, don't even bother
emailing me, ok? Because the next email I get from you that shows the
intelligence level of a gnat, I'll just give up and put you in a
spam-filter. Because my IQ goes down just from reading your mails. I
can't afford to continue.

Linus
From: Andrea A. <an...@qu...> - 2008-05-07 23:04:16
To remove mm_lock without adding a horrible system-wide lock before every
i_mmap_lock etc., we have to remove invalidate_range_begin/end. Then we
can return to the older approach of doing only invalidate_page and
serializing it with the PT lock against get_user_pages. That works fine
for KVM, but GRU will have to flush the tlb every time we drop the PT
lock; on x86-64 that means once per 512 ptes (one page-table page)
instead of a single time for the whole range, regardless of how large the
range is.
From: Linus T. <tor...@li...> - 2008-05-07 23:04:14
On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> Ok so I see the problem Linus is referring to now (I received the hint
> by PM too), I thought the order of the signed-off-by was relevant, it
> clearly isn't or we're wasting space ;)

The order of the sign-offs is somewhat relevant, but no, sign-offs don't
mean authorship. See the rules for sign-off: you can sign off on another
person's patches, even if they didn't sign off on them themselves. That's
clause (b) in particular.

So yes, quite often you'd _expect_ the first sign-off to match the author,
but that's a correlation, not a causal relationship.

Linus
From: Linus T. <tor...@li...> - 2008-05-07 23:04:14
On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> I rechecked and I guarantee that the patches where Christoph isn't
> listed are developed by myself and he didn't write a single line on
> them.

How long have you been doing kernel development? How about you read
SubmittingPatches a few times before you show just how clueless you are?

Hint: look for the string that says "From:". Also look at the section
that talks about "summary phrase".

You got it all wrong, and you don't even seem to realize that you got it
wrong, even when I told you.

Linus
From: Andrea A. <an...@qu...> - 2008-05-07 23:04:13
On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 8 May 2008, Andrea Arcangeli wrote:
> >
> > Unfortunately the lock you're talking about would be:
> >
> >        static spinlock_t global_lock = ...
> >
> > There's no way to make it more granular.
>
> Right. So what?
>
> It's still about a million times faster than what the code does now.

mmu_notifier_register only runs when windows or linux or macosx
boots. Who could ever care of the msec spent in mm_lock compared to
the time it takes to linux to boot? What you're proposing is to slow
down AIM and certain benchmarks by 20% or more for all users, just to
save at most 1msec when starting a VM.

> Rewrite the code, or not. I don't care. I'll very happily not merge crap
> for the rest of my life.

If you want the global lock I'll do it, no problem; I just think it's an
obviously inferior solution for 99% of users out there (including kvm
users, who will also have to take that lock while kvm userland runs). In
my view the most we should do in this area is to further reduce the max
number of locks to take, if max_map_count already isn't enough.
From: Andrew M. <ak...@li...> - 2008-05-07 23:04:13
On Thu, 8 May 2008 00:44:06 +0200 Andrea Arcangeli <an...@qu...> wrote:

> On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
> > Nope. We only need to take the global lock before taking *two or more* of
> > the per-vma locks.
> >
> > I really wish I'd thought of that.
>
> I don't see how you can avoid taking the system-wide-global lock
> before every single anon_vma->lock/i_mmap_lock out there without
> mm_lock.
>
> Please note, we can't allow a thread to be in the middle of
> zap_page_range while mmu_notifier_register runs.
>
> vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
> than one lock and we've to still take the global-system-wide lock
> _before_ this single i_mmap_lock and no other lock at all.
>
> Please elaborate, thanks!

umm...

CPU0:                           CPU1:

spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

bad.

CPU0:                           CPU1:

spin_lock(global_lock)          spin_lock(global_lock);
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_lock(a->lock);

Is OK.

CPU0:                           CPU1:

spin_lock(global_lock)
spin_lock(a->lock);             spin_lock(b->lock);
spin_lock(b->lock);             spin_unlock(b->lock);
                                spin_lock(a->lock);
                                spin_unlock(a->lock);

also OK.

As long as all code paths which can take two-or-more locks are all
covered by the global lock there is no deadlock scenario. If a thread
takes just a single instance of one of these locks without taking the
global_lock then there is also no deadlock.

Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly
added mm_lock() thing and we also take both those locks at the same time
in regular code, we're probably screwed.
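The rule Andrew sketches above -- hold the global lock whenever you intend to take two or more of the fine-grained locks, take a single fine-grained lock bare -- is easy to state in code. A minimal userspace sketch (pthread mutexes standing in for the kernel spinlocks; all names here are made up for illustration, compile with -pthread):

        #include <pthread.h>

        static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

        /* One fine-grained lock (the vmtruncate/i_mmap_lock case):
         * no global lock needed; with only one lock held and nothing
         * else waited for, this thread can't be part of a cycle. */
        static void lock_single(pthread_mutex_t *a)
        {
                pthread_mutex_lock(a);
        }

        /* Two or more fine-grained locks (the mmu_notifier_register case):
         * serialize the whole acquisition phase behind global_lock, so two
         * multi-lock takers can never interleave into an ABBA deadlock. */
        static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
        {
                pthread_mutex_lock(&global_lock);
                pthread_mutex_lock(a);
                pthread_mutex_lock(b);
                pthread_mutex_unlock(&global_lock);
        }

        int main(void)
        {
                pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
                pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

                lock_pair(&a, &b);
                pthread_mutex_unlock(&b);
                pthread_mutex_unlock(&a);

                lock_single(&a);
                pthread_mutex_unlock(&a);
                return 0;
        }

Andrea's counter-argument later in the thread is that excluding zap_page_range while mmu_notifier_register runs would force even the single-lock vmtruncate path to take global_lock first, which is the slowdown he objects to.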
From: Linus T. <tor...@li...> - 2008-05-07 22:44:44
On Thu, 8 May 2008, Andrea Arcangeli wrote:
>
> Unfortunately the lock you're talking about would be:
>
>        static spinlock_t global_lock = ...
>
> There's no way to make it more granular.

Right. So what?

It's still about a million times faster than what the code does now.

Your comment about "great smp scalability optimization" just shows that
you're a moron. It is no such thing. The fact is, it's a horrible
pessimization, since even SMP will be *SLOWER*. It will just be "less
slower" when you have a million CPU's and they all try to do this at the
same time (which probably never ever happens).

In other words, "scalability" is totally meaningless. The only thing that
matters is *performance*. If the "scalable" version performs WORSE, then
it is simply worse. Not better. End of story.

> mmu_notifier_register can take ages. No problem.

So what you're saying is that performance doesn't matter?

So why do you do the ugly crazy hundred-line implementation, when a
simple two-liner would do equally well?

Your arguments are crap.

Anyway, discussion over. This code doesn't get merged. It doesn't get
merged before 2.6.26, and it doesn't get merged _after_ either.

Rewrite the code, or not. I don't care. I'll very happily not merge crap
for the rest of my life.

Linus
From: Andrea A. <an...@qu...> - 2008-05-07 22:44:05
On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote:
> Nope. We only need to take the global lock before taking *two or more* of
> the per-vma locks.
>
> I really wish I'd thought of that.

I don't see how you can avoid taking the system-wide-global lock before
every single anon_vma->lock/i_mmap_lock out there without mm_lock.

Please note, we can't allow a thread to be in the middle of
zap_page_range while mmu_notifier_register runs.

vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more
than one lock and we've to still take the global-system-wide lock
_before_ this single i_mmap_lock and no other lock at all.

Please elaborate, thanks!
From: Jack S. <st...@sg...> - 2008-05-07 22:42:29
> And I don't see a problem in making the conversion from
> spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on
> anything but ia64.

That is currently true, but we are also working on XPMEM for x86_64. The
new XPMEM code should be posted within a few weeks.

--- jack
From: Andrea A. <an...@qu...> - 2008-05-07 22:39:11
On Wed, May 07, 2008 at 03:31:08PM -0700, Roland Dreier wrote:
> I think the point you're missing is that any patches written by
> Christoph need a line like
>
>        From: Christoph Lameter <cla...@sg...>
>
> at the top of the body so that Christoph becomes the author when it is
> committed into git. The Signed-off-by: line needs to be preserved too
> of course, but it is not sufficient by itself.

Ok so I see the problem Linus is referring to now (I received the hint
by PM too), I thought the order of the signed-off-by was relevant, it
clearly isn't or we're wasting space ;)
From: Andrea A. <an...@qu...> - 2008-05-07 22:37:35
On Thu, May 08, 2008 at 12:27:58AM +0200, Andrea Arcangeli wrote:
> I rechecked and I guarantee that the patches where Christoph isn't
> listed are developed by myself and he didn't write a single line on
> them. In any case I expect Christoph to review (he's CCed) and to
> point me to any attribution error. The only mistake I did once in that
> area was to give too _few_ attribution to myself and he asked me to
> add myself in the signed-off so I added myself by Christoph own
> request, but be sure I didn't remove him!

By PM (guess he's scared to post to this thread ;), Chris is telling me
that what you perhaps mean is that I should add a "From: Christoph" in
the body of the email if the first signed-off-by is from Christoph, to
indicate the first signoff was by him and the patch in turn was started
by him. I thought the order of the signoffs was enough, but if that From
was mandatory and missing, then if there's any error it obviously wasn't
intentional, especially given I only left a signed-off-by: christoph on
his patches until he asked me to add my signoff too. Correcting it is
trivial, given I carefully ordered the signoffs so that the author is at
the top of the signoff list.

At least for mmu-notifier-core given I obviously am the original
author of that code, I hope the From: of the email was enough even if
an additional From: andrea was missing in the body.

Also you can be sure that Christoph and especially Robin (XPMEM) will be
more than happy if all patches with Christoph at the top of the
signed-off-by are merged in 2.6.26 despite there not being a From:
christoph at the top of the body ;). So I don't see a big deal here...
From: Andrew M. <ak...@li...> - 2008-05-07 22:31:52
On Thu, 8 May 2008 00:22:05 +0200 Andrea Arcangeli <an...@qu...> wrote:

> > No, the simple solution is to just make up a whole new upper-level lock,
> > and get that lock *first*. You can then take all the multiple locks at a
> > lower level in any order you damn well please.
>
> Unfortunately the lock you're talking about would be:
>
>        static spinlock_t global_lock = ...
>
> There's no way to make it more granular.
>
> So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
> we'd need to take that extremely wide spinlock first (and even worse,
> later it would become a rwsem when XPMEM is selected making the VM
> even slower than it already becomes when XPMEM support is selected at
> compile time).

Nope. We only need to take the global lock before taking *two or more* of
the per-vma locks.

I really wish I'd thought of that.
From: Roland D. <rd...@ci...> - 2008-05-07 22:31:36
> I rechecked and I guarantee that the patches where Christoph isn't
> listed are developed by myself and he didn't write a single line on
> them. In any case I expect Christoph to review (he's CCed) and to
> point me to any attribution error. The only mistake I did once in that
> area was to give too _few_ attribution to myself and he asked me to
> add myself in the signed-off so I added myself by Christoph own
> request, but be sure I didn't remove him!

I think the point you're missing is that any patches written by
Christoph need a line like

        From: Christoph Lameter <cla...@sg...>

at the top of the body so that Christoph becomes the author when it is
committed into git. The Signed-off-by: line needs to be preserved too
of course, but it is not sufficient by itself.

 - R.
From: Anthony L. <an...@co...> - 2008-05-07 22:30:08
Ryan Harper wrote:
> I've been playing around with smp guests on a couple amd systems and
> I've also seen some of the smp/locking issues which lead me to dig into
> some of the tsc code. The svm_cpu_load, the tsc_offset calc will
> generate a massively large tsc_offset if we're switching cpus and
> tsc_this is ahead of the host_tsc value (delta would normally be
> negative, but since it's unsigned, we get a huge positive number).
>
> svm_vcpu_load()
> ...
>        rdtscll(tsc_this);
>        delta = vcpu->arch.host_tsc - tsc_this;
>        svm->vmcb->control.tsc_offset += delta;

This math will work out fine since the very large number will generate an
overflow and the result will be identical to if we were using s64s. We're
using u64s because that's how the tsc_offset is defined by hardware.

> This is handled a little differently on Intel (in vmx.c) where there is
> a check:
>
>        if (tsc_this < vcpu->arch.host_tsc)
>                /* do delta and new offset calc */

So what your patch really does is change the behavior of the tsc_offset
to increase the guest's TSC by a potentially large amount, depending on
how out of sync the TSC is on CPU migration. The question is why this
would make things work out better for you. Do you have Gerd's most recent
kvm-clock patch applied?

Regards,

Anthony Liguori

> This check makes sense to me in that we only need to ensure that we
> don't go backwards which means that unless the new cpu is behind the
> current vcpu's host_tsc, we can skip a new delta calc. If the check
> doesn't make sense then we'll need to do the math with s64s.
>
> Attached patch fixed the case where an idle guest was live-locked.
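Anthony's point that the unsigned math "works out fine" is just mod-2^64 arithmetic: if delta wraps around to a huge u64, adding it to tsc_offset wraps back to exactly the value signed math would give. A standalone demonstration (plain C, made-up TSC values):

        #include <stdint.h>
        #include <inttypes.h>
        #include <stdio.h>

        int main(void)
        {
                /* Made-up example: new CPU's TSC is ahead of the saved host_tsc. */
                uint64_t host_tsc   = 1000000;
                uint64_t tsc_this   = 5000000;
                uint64_t tsc_offset = 42;

                /* Unsigned math, as in svm_vcpu_load(): delta wraps to a huge value... */
                uint64_t delta_u = host_tsc - tsc_this;
                uint64_t off_u   = tsc_offset + delta_u;

                /* ...signed math: delta is simply negative. */
                int64_t  delta_s = (int64_t)host_tsc - (int64_t)tsc_this;
                uint64_t off_s   = tsc_offset + (uint64_t)delta_s;

                /* Both additions are performed mod 2^64, so the offsets match. */
                printf("unsigned: %" PRIu64 "  signed: %" PRIu64 "  equal: %d\n",
                       off_u, off_s, off_u == off_s);
                return 0;
        }

So the question raised in this email is not an overflow bug in the u64 arithmetic, but whether the offset should be adjusted at all when the new CPU's TSC is ahead of the saved host_tsc.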
From: Andrea A. <an...@qu...> - 2008-05-07 22:27:58
On Wed, May 07, 2008 at 03:11:10PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 7 May 2008, Andrea Arcangeli wrote:
>
> > > As far as I can tell, authorship has been destroyed by at least two of the
> > > patches (ie Christoph seems to be the author, but Andrea seems to have
> > > dropped that fact).
> >
> > I can't follow this, please be more specific.
>
> The patches were sent to lkml without *any* indication that you weren't
> actually the author.
>
> So if Andrew had merged them, they would have been merged as yours.

I rechecked and I guarantee that the patches where Christoph isn't
listed are developed by myself and he didn't write a single line on
them. In any case I expect Christoph to review (he's CCed) and to
point me to any attribution error. The only mistake I did once in that
area was to give too _few_ attribution to myself and he asked me to
add myself in the signed-off so I added myself by Christoph own
request, but be sure I didn't remove him!
From: Andrea A. <an...@qu...> - 2008-05-07 22:22:03
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
> > had to do any blocking I/O during vmtruncate before, now we have to.
>
> I really suspect we don't really have to, and that it would be better to
> just fix the code that does that.

I'll let you discuss that with Christoph and Robin. The moment I heard
about the schedule-inside->invalidate_page() requirement I reacted the
same way you did. But I don't see any other real solution for XPMEM other
than spin-looping for ages, halting the scheduler, while the ack is
received from the network device. But mm_lock is required even without
XPMEM. And srcu is also required without XPMEM, to allow ->release to
schedule (however, downgrading srcu to rcu would result in a very small
patch; srcu and rcu are about the same with a kernel supporting preempt=y
like 2.6.26).

> I literally think that mm_lock() is an unbelievable piece of utter and
> horrible CRAP.
>
> There's simply no excuse for code like that.

I think it's a great smp scalability optimization over the global lock
you're proposing below.

> No, the simple solution is to just make up a whole new upper-level lock,
> and get that lock *first*. You can then take all the multiple locks at a
> lower level in any order you damn well please.

Unfortunately the lock you're talking about would be:

        static spinlock_t global_lock = ...

There's no way to make it more granular.

So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock
we'd need to take that extremely wide spinlock first (and even worse,
later it would become a rwsem when XPMEM is selected making the VM
even slower than it already becomes when XPMEM support is selected at
compile time).

> And yes, it's one more lock, and yes, it serializes stuff, but:
>
>  - that code had better not be critical anyway, because if it was, then
>    the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_

mmu_notifier_register can take ages. No problem.

>  - parallelism is overrated: it doesn't matter one effing _whit_ if
>    something is a hundred times more parallel, if it's also a hundred
>    times *SLOWER*.

mmu_notifier_register is fine being a hundred times slower (preempt-rt
will turn all locks into spinlocks, so no problem).

> And here's an admission that I lied: it wasn't *all* clearly crap. I did
> like one part, namely list_del_init_rcu(), but that one should have been
> in a separate patch. I'll happily apply that one.

Sure, I'll split it from the rest if mmu-notifier-core isn't merged.

My objectives have been:

1) add zero overhead to the VM before anybody starts a VM with kvm, and
   still zero overhead for all other tasks except the task where the VM
   runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check,
   which is optimized away too when CONFIG_KVM=n. And even for that check
   my invalidate_page reduces the number of branches to the absolute
   minimum possible.

2) avoid any new cacheline collision in the fast paths, so numa systems
   don't nearly crash (mm->mmu_notifier_mm will be shared and never
   written, except during the first mmu_notifier_register)

3) avoid any risk of introducing regressions in 2.6.26 (the patch must be
   obviously safe). Even if mm_lock were a bad idea like you say, it's
   orders of magnitude safer, even if entirely broken, than messing with
   the VM core locking in 2.6.26.

mm_lock (or whatever name you like to give it; I admit mm_lock may not
sound worrisome enough to stop people from calling it in a fast path) is
going to be the real deal for the long term, to allow
mmu_notifier_register to serialize against invalidate_range_start/end. If
I fail in 2.6.26 I'll offer maintainership to Christoph as promised, and
you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't
technologies running on cellphones where your global spinlock is
optimized away at compile time, and he also has to deal with XPMEM where
such a spinlock would need to become a rwsem, as the anon_vma->sem has to
be taken after it). But let's assume you're entirely right here that
mm_lock is going to be dropped and there's a better way: it's still a
fine solution for 2.6.26. And if you prefer, I can move the whole
mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any
pollution in the core VM; mm_lock will then be invisible to everything
except callers of mmu_notifier_register(), and it will be trivial to
remove later if you really want to add a global spinlock, as there's no
way to be more granular than a _global_ numa-wide spinlock taken before
any i_mmap_lock/anon_vma->lock without my mm_lock.
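For readers without the patch in front of them: the mm_lock() being argued over is essentially "collect every anon_vma lock and i_mmap_lock pointer in the mm into a vmalloc'ed array, sort by address, and take each one exactly once" -- the sort is what gives every registration the same global acquisition order without a system-wide lock. A rough userspace analogue of the sort-and-skip-duplicates step (pthread mutexes instead of spinlocks; an illustration of the idea, not the kernel code itself):

        #include <pthread.h>
        #include <stdint.h>
        #include <stdlib.h>

        static int cmp_ptr(const void *a, const void *b)
        {
                uintptr_t x = (uintptr_t)*(pthread_mutex_t *const *)a;
                uintptr_t y = (uintptr_t)*(pthread_mutex_t *const *)b;
                return (x > y) - (x < y);
        }

        /*
         * Lock every mutex in locks[] exactly once, in a globally consistent
         * (address) order, even if the array mentions the same lock several
         * times -- the duplicate case Christoph pointed out earlier in the
         * thread.  The kernel patch gathers the pointers with vmalloc();
         * here a caller-supplied array is enough.
         */
        static void lock_all_sorted(pthread_mutex_t **locks, size_t n)
        {
                size_t i;

                qsort(locks, n, sizeof(*locks), cmp_ptr);
                for (i = 0; i < n; i++) {
                        if (i && locks[i] == locks[i - 1])
                                continue;       /* already held: skip duplicate */
                        pthread_mutex_lock(locks[i]);
                }
        }

        int main(void)
        {
                pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
                pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;
                pthread_mutex_t *locks[] = { &m2, &m1, &m2 };   /* note the duplicate */

                lock_all_sorted(locks, 3);
                pthread_mutex_unlock(&m1);
                pthread_mutex_unlock(&m2);
                return 0;
        }

The duplicate skip is what answers the objection that multiple vmas may share the same mapping or anon_vma; the address sort is what makes two concurrent registrations deadlock-free against each other without any global lock.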
From: Linus T. <tor...@li...> - 2008-05-07 22:12:04
On Wed, 7 May 2008, Andrea Arcangeli wrote:
> > As far as I can tell, authorship has been destroyed by at least two of the
> > patches (ie Christoph seems to be the author, but Andrea seems to have
> > dropped that fact).
>
> I can't follow this, please be more specific.

The patches were sent to lkml without *any* indication that you weren't
actually the author.

So if Andrew had merged them, they would have been merged as yours.

> > That "locking" code is also too ugly to live, at least without some
> > serious arguments for why it has to be done that way. Sorting the locks?
> > In a vmalloc'ed area? And calling this something innocuous like
> > "mm_lock()"? Hell no.
>
> That's only invoked in mmu_notifier_register, mm_lock is explicitly
> documented as heavyweight function.

Is that an excuse for UTTER AND TOTAL CRAP?

Linus