From: Andrea A. <an...@qu...> - 2008-04-22 07:20:43

This is a followup on the locking of the mmu-notifier methods against the secondary-mmu page fault. Each driver can implement it differently, but this shows an example of what I planned for KVM; others may follow closely if they find it useful. I post this as pseudocode to hide 99% of KVM's internal complexity and to focus only on the locking. The KVM locking scheme should be something along these lines:

    invalidate_range_start {
        spin_lock(&kvm->mmu_lock);
        kvm->invalidate_range_count++;
        rmap-invalidate of sptes in range
        spin_unlock(&kvm->mmu_lock)
    }

    invalidate_range_end {
        spin_lock(&kvm->mmu_lock);
        kvm->invalidate_range_count--;
        spin_unlock(&kvm->mmu_lock)
    }

    invalidate_page {
        spin_lock(&kvm->mmu_lock);
        write_seqlock()
        rmap-invalidate of sptes of page
        write_sequnlock()
        spin_unlock(&kvm->mmu_lock)
    }

    kvm_page_fault {
        seq = read_seqlock()
        get_user_pages() (aka gfn_to_pfn() in kvm terms)
        spin_lock(&kvm->mmu_lock)
        if (seq_trylock(seq) || kvm->invalidate_range_count)
            goto out; /* retry the page fault */
        map sptes and build rmap
    out:
        spin_unlock(&kvm->mmu_lock)
    }

This will allow us to remove the page pinning from KVM. I'd appreciate it if you, Robin and Christoph, could have a second look and pinpoint any potential issue in my plan.

invalidate_page, as you can notice, allows us to decrease the fixed-cost overhead in all VM code that works on a single page, where freeing the page _after_ calling invalidate_page has zero runtime/tlb cost. We need invalidate_range_begin/end because when we work on multiple pages we can reduce cpu utilization and avoid many tlb flushes by holding off the kvm page fault while we work on the range.

invalidate_page also allows us to decrease the window where the kvm page fault could possibly need to be retried (the ptep_clear_flush <-> invalidate_page window is shorter than an invalidate_range_begin(PAGE_SIZE) <-> invalidate_range_end(PAGE_SIZE) window). So even if only as a micro-optimization, it is worth it to decrease the impact on the common VM code.

The cost of having both a seqlock and a range_count is irrelevant in kvm terms, as they'll be in the same cacheline and checked at the same time by the page fault, and it won't require any additional blocking (or write) lock.

Note that the kvm page fault can't happen unless the cpu switches to guest mode, and it can't switch to guest mode if we're in the begin/end critical section, so in theory I could loop inside the page fault too without risking deadlocking; but retrying it by restarting guest mode sounds nicer in sigkill/scheduling terms.

Soon I'll release a new mmu notifier patchset with patch 1 being the mmu-notifier-core, self-included and ready to go in -mm and mainline in time for 2.6.26. Then I'll be glad to help merge any further patches in the patchset to allow the methods to sleep, so XPMEM can run on mainline 2.6.27 the same way GRU/KVM/Quadrics will run fine on 2.6.26, in a fully backwards-compatible way with 2.6.26 (of course it doesn't really need to be backwards compatible because this is a kernel-internal API only, ask Greg etc... ;). But that will likely require a new config option to avoid hurting AIM fork performance, because the anon_vma critical sections are so short in the fast path.
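
To make the pseudocode above concrete, the following is a minimal, compilable userspace model of the same scheme, folding in the correction agreed later in this thread (the sequence count is bumped in invalidate_range_end as well, and a plain counter replaces the seqlock). A pthread mutex stands in for kvm->mmu_lock, and all the names here (struct kvm_model and friends) are illustrative stand-ins rather than the real KVM code.

#include <pthread.h>
#include <stdbool.h>

struct kvm_model {
	pthread_mutex_t mmu_lock;	/* stands in for kvm->mmu_lock */
	unsigned long invalidate_seq;	/* bumped under mmu_lock */
	int invalidate_range_count;	/* > 0 while a range invalidate runs */
};

static void invalidate_range_start(struct kvm_model *kvm)
{
	pthread_mutex_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count++;
	/* rmap-invalidate of the sptes in the range goes here */
	pthread_mutex_unlock(&kvm->mmu_lock);
}

static void invalidate_range_end(struct kvm_model *kvm)
{
	pthread_mutex_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count--;
	kvm->invalidate_seq++;		/* tell concurrent faults to retry */
	pthread_mutex_unlock(&kvm->mmu_lock);
}

static void invalidate_page(struct kvm_model *kvm)
{
	pthread_mutex_lock(&kvm->mmu_lock);
	/* rmap-invalidate of the sptes of the single page goes here */
	kvm->invalidate_seq++;
	pthread_mutex_unlock(&kvm->mmu_lock);
}

/* Returns true if the spte was mapped, false if the fault must be retried
 * (in KVM terms: by simply re-entering guest mode). */
static bool kvm_page_fault(struct kvm_model *kvm)
{
	unsigned long seq;
	bool mapped = false;

	pthread_mutex_lock(&kvm->mmu_lock);
	seq = kvm->invalidate_seq;
	pthread_mutex_unlock(&kvm->mmu_lock);

	/* get_user_pages()/gfn_to_pfn() would sleep here, outside the lock */

	pthread_mutex_lock(&kvm->mmu_lock);
	if (seq == kvm->invalidate_seq && !kvm->invalidate_range_count) {
		/* map sptes and build rmap here */
		mapped = true;
	}
	pthread_mutex_unlock(&kvm->mmu_lock);
	return mapped;
}

int main(void)
{
	struct kvm_model kvm = { .mmu_lock = PTHREAD_MUTEX_INITIALIZER };
	bool mapped;

	invalidate_range_start(&kvm);
	mapped = kvm_page_fault(&kvm);	/* races with the range invalidate */
	invalidate_range_end(&kvm);
	return mapped ? 1 : 0;		/* expect a retry, i.e. mapped == false */
}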
From: Andrea A. <an...@qu...> - 2008-04-22 12:02:31

On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> invalidate_range_start {
>     spin_lock(&kvm->mmu_lock);
>
>     kvm->invalidate_range_count++;
>     rmap-invalidate of sptes in range
>
>     write_seqlock; write_sequnlock;
>
>     spin_unlock(&kvm->mmu_lock)
> }
>
> invalidate_range_end {
>     spin_lock(&kvm->mmu_lock);
>
>     kvm->invalidate_range_count--;
>     write_seqlock; write_sequnlock;
>
>     spin_unlock(&kvm->mmu_lock)
> }

Robin correctly pointed out by PM that there should be a seqlock in range_begin/end too, as corrected above. I guess it's better to use an explicit sequence counter, so we avoid a useless spinlock inside the write_seqlock (mmu_lock is already enough in all places) and so we can increase it with a single op with += 2 in range_begin/end. The above is a lower-performance version of the final locking, but simpler for reading purposes.
From: Robin H. <ho...@sg...> - 2008-04-22 13:01:41

On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > invalidate_range_start {
> >     spin_lock(&kvm->mmu_lock);
> >
> >     kvm->invalidate_range_count++;
> >     rmap-invalidate of sptes in range
> >
> > write_seqlock; write_sequnlock;

I don't think you need it here, since invalidate_range_count is already elevated, which will accomplish the same effect.

Thanks,
Robin
From: Andrea A. <an...@qu...> - 2008-04-22 13:22:51

On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > invalidate_range_start {
> > >     spin_lock(&kvm->mmu_lock);
> > >
> > >     kvm->invalidate_range_count++;
> > >     rmap-invalidate of sptes in range
> > >
> > > write_seqlock; write_sequnlock;
>
> I don't think you need it here, since invalidate_range_count is already
> elevated, which will accomplish the same effect.

Agreed, the seqlock only in range_end should be enough. BTW, the fact that the seqlock is needed regardless of whether invalidate_page exists really makes invalidate_page a no-brainer, not just from the core VM point of view but from the driver point of view too. The kvm_page_fault logic would be the same even if I removed invalidate_page from the mmu notifier patch, but it'd run slower both when armed and disarmed.
From: Robin H. <ho...@sg...> - 2008-04-22 13:36:04

On Tue, Apr 22, 2008 at 03:21:43PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> > On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > > invalidate_range_start {
> > > >     spin_lock(&kvm->mmu_lock);
> > > >
> > > >     kvm->invalidate_range_count++;
> > > >     rmap-invalidate of sptes in range
> > > >
> > > > write_seqlock; write_sequnlock;
> >
> > I don't think you need it here, since invalidate_range_count is already
> > elevated, which will accomplish the same effect.
>
> Agreed, the seqlock only in range_end should be enough. BTW, the fact

I am a little confused about the value of the seqlock versus a simple atomic, but I assumed there is a reason and left it at that.

> that the seqlock is needed regardless of whether invalidate_page exists
> really makes invalidate_page a no-brainer, not just from the core VM
> point of view but from the driver point of view too. The kvm_page_fault
> logic would be the same even if I removed invalidate_page from the mmu
> notifier patch, but it'd run slower both when armed and disarmed.

I don't know what you mean by "it'd" run slower and what you mean by "armed and disarmed". For the sake of this discussion, I will assume "it'd" means the kernel in general and not KVM. With the two call sites for range_begin/range_end, I would agree we have more call sites, but the second is extremely likely to be cache-hot.

By disarmed, I will assume you mean no notifiers registered for a particular mm. In that case, the cache will make the second call effectively free. So, for the disarmed case, I see no measurable difference.

For the case where there is a notifier registered, I certainly can see a difference. I am not certain how to quantify the difference, as it depends on the callee. In the case of xpmem, our callout is always very expensive for the _start case. Our _end case is very light, but it is essentially the exact same steps we would perform for the _page callout.

When I was discussing this difference with Jack, he reminded me that the GRU, due to its hardware, does not have any race issues with the invalidate_page callout simply doing the tlb shootdown and not modifying any of its internal structures. He then put a caveat on the discussion that _either_ method was acceptable as far as he was concerned. The real issue is getting a patch in that satisfies all needs, not whether there is a separate invalidate_page callout.

Thanks,
Robin
From: Andrea A. <an...@qu...> - 2008-04-22 13:48:43

On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> I am a little confused about the value of the seqlock versus a simple
> atomic, but I assumed there is a reason and left it at that.

There's no value for anything but get_user_pages (get_user_pages takes its own lock internally though). I preferred to explain it as a seqlock because it was simpler to read, but I totally agree that in the final implementation it shouldn't be a seqlock. My code was meant to be pseudocode only. It doesn't even need to be atomic ;).

> I don't know what you mean by "it'd" run slower and what you mean by
> "armed and disarmed".

1) when armed, the time window where the kvm page fault would be blocked would be a bit larger without invalidate_page, for no good reason

2) if you were to remove invalidate_page, then when disarmed the VM code would need two branches instead of one in various places

I don't want to waste cycles if not wasting them improves performance both when armed and disarmed.

> For the sake of this discussion, I will assume "it'd" means the kernel in
> general and not KVM. With the two call sites for range_begin/range_end,

I actually meant both.

> By disarmed, I will assume you mean no notifiers registered for a
> particular mm. In that case, the cache will make the second call
> effectively free. So, for the disarmed case, I see no measurable
> difference.

For rmap it is surely effectively free; for do_wp_page it costs one branch for no good reason.

> For the case where there is a notifier registered, I certainly can see
> a difference. I am not certain how to quantify the difference, as it

Agreed.

> When I was discussing this difference with Jack, he reminded me that
> the GRU, due to its hardware, does not have any race issues with the
> invalidate_page callout simply doing the tlb shootdown and not modifying
> any of its internal structures. He then put a caveat on the discussion
> that _either_ method was acceptable as far as he was concerned. The real
> issue is getting a patch in that satisfies all needs, not whether
> there is a separate invalidate_page callout.

Sure, we have that patch now; I'll send it out in a minute. I was just trying to explain why it makes sense to have an invalidate_page too (which remains the only difference by now). Removing it would be a regression on all sides, even if a minor one.
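
Since part of the disagreement is about how much the hooks cost, here is a rough, self-contained C model of the point Andrea is making; all the names (mm_model, notifier_model, the hook functions) are illustrative stand-ins, not the exact API of either patchset. Each core-VM hook is a single test of an empty notifier list when nothing is registered, so a single-page path expressed as a range_start/range_end pair pays that (small) test twice when disarmed and, when armed, holds off the secondary-MMU fault for longer than one invalidate_page issued right after ptep_clear_flush would.

#include <stddef.h>

struct mm_model;

struct notifier_model {
	struct notifier_model *next;
	void (*invalidate_page)(struct mm_model *mm, unsigned long addr);
};

/* Stand-in for mm_struct with the notifier list hanging off it. */
struct mm_model {
	struct notifier_model *notifiers;	/* NULL == "disarmed" */
};

/* One hook == one branch when disarmed; the branch is cache-hot and
 * almost always not taken. */
static void mmu_notifier_invalidate_page(struct mm_model *mm,
					 unsigned long addr)
{
	struct notifier_model *n;

	if (!mm->notifiers)
		return;
	for (n = mm->notifiers; n; n = n->next)
		n->invalidate_page(mm, addr);
}

/* A do_wp_page-style caller.  With a separate invalidate_page there is a
 * single hook after the pte flush; with only the range calls, the same
 * code needs a _start hook before and an _end hook after, i.e. two
 * disarmed branches and a longer armed critical section. */
static void single_page_path(struct mm_model *mm, unsigned long addr)
{
	/* ptep_clear_flush(vma, addr, ptep) would go here */
	mmu_notifier_invalidate_page(mm, addr);
}

int main(void)
{
	struct mm_model mm = { .notifiers = NULL };	/* disarmed */

	single_page_path(&mm, 0x1000);
	return 0;
}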
From: Robin H. <ho...@sg...> - 2008-04-22 15:26:03

Andrew,

Could we get direction/guidance from you regarding the invalidate_page() callout of Andrea's patchset versus the invalidate_range_start/invalidate_range_end callout pairs of Christoph's patchset? This is only in the context of the __xip_unmap, do_wp_page, page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> > I am a little confused about the value of the seqlock versus a simple
> > atomic, but I assumed there is a reason and left it at that.
>
> There's no value for anything but get_user_pages (get_user_pages takes
> its own lock internally though). I preferred to explain it as a
> seqlock because it was simpler to read, but I totally agree that in the
> final implementation it shouldn't be a seqlock. My code was meant to
> be pseudocode only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does need to be atomic.

> > I don't know what you mean by "it'd" run slower and what you mean by
> > "armed and disarmed".
>
> 1) when armed, the time window where the kvm page fault would be
> blocked would be a bit larger without invalidate_page, for no good
> reason

But that is a distinction without a difference. In the _start/_end case, kvm's fault handler will not have any _DIRECT_ blocking, but get_user_pages() had certainly better block waiting for some other lock to prevent the process's pages being refaulted. I am no VM expert, but that seems critical to having a consistent virtual address space.

Effectively, you have a delay on the kvm fault handler beginning when either invalidate_page() or invalidate_range_start() is entered, and lasting until the _CALLER_ of the invalidate* method has unlocked. That time will remain essentially identical in either case. I would argue you would be hard pressed to even measure the difference.

> 2) if you were to remove invalidate_page, then when disarmed the VM code
> would need two branches instead of one in various places

Those branches are conditional upon there being list entries. That check should be extremely cheap. The vast majority of cases will have no registered notifiers. The second check, for the _end callout, will be from cpu cache.

> I don't want to waste cycles if not wasting them improves performance
> both when armed and disarmed.

In summary, I think we have narrowed the case of no registered notifiers down to being infinitesimal, and the case of registered notifiers to being a distinction without a difference.

> > When I was discussing this difference with Jack, he reminded me that
> > the GRU, due to its hardware, does not have any race issues with the
> > invalidate_page callout simply doing the tlb shootdown and not modifying
> > any of its internal structures. He then put a caveat on the discussion
> > that _either_ method was acceptable as far as he was concerned. The real
> > issue is getting a patch in that satisfies all needs, not whether
> > there is a separate invalidate_page callout.
>
> Sure, we have that patch now; I'll send it out in a minute. I was just
> trying to explain why it makes sense to have an invalidate_page too
> (which remains the only difference by now). Removing it would be a
> regression on all sides, even if a minor one.

I think the GRU is the only compelling case I have heard for having a separate invalidate_page. In the case of the GRU, the hardware enforces a lifetime of the invalidate which covers all in-progress faults, including ones where the hardware is informed after the flush of a PTE. In all cases, once the GRU invalidate instruction is issued, all active requests are invalidated. Future faults will be blocked in get_user_pages(). Without that special feature of the hardware, I don't think any code simplification exists. I, of course, reserve the right to be wrong.

I believe the argument against a separate invalidate_page() callout was Christoph's interpretation of Andrew's comments. I am not certain Andrew was aware of this special aspect of the GRU hardware, or whether it had been factored into the discussion at that point in time.

Thanks,
Robin
From: Christoph L. <cla...@sg...> - 2008-04-14 23:09:29

On Tue, 8 Apr 2008, Andrea Arcangeli wrote:
> The difference with #v11 is a different implementation of mm_lock that
> guarantees handling signals in O(N). It's also more lowlatency friendly.

Ok. So the rest of the issues remain unaddressed? I am glad that we finally settled on the locking. But now I will have to clean this up, address the remaining issues, sequence the patches right, provide docs, handle the merging issue, etc. etc.? I have seen no detailed review of my patches that you include here. Are we going down the same road we had to go down with the OOM patches, where David Rientjes and I had to deal with the issues you raised?
From: Robin H. <ho...@sg...> - 2008-04-16 16:33:44

I don't think this lock mechanism is completely working. I have gotten a few failures trying to dereference 0x100100, which appears to be LIST_POISON1.

Thanks,
Robin
From: Christoph L. <cla...@sg...> - 2008-04-16 18:35:33

On Wed, 16 Apr 2008, Robin Holt wrote:
> I don't think this lock mechanism is completely working. I have
> gotten a few failures trying to dereference 0x100100, which appears to
> be LIST_POISON1.

How does xpmem unregistering of notifiers work?
From: Robin H. <ho...@sg...> - 2008-04-16 19:02:17

On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> On Wed, 16 Apr 2008, Robin Holt wrote:
>
> > I don't think this lock mechanism is completely working. I have
> > gotten a few failures trying to dereference 0x100100, which appears to
> > be LIST_POISON1.
>
> How does xpmem unregistering of notifiers work?

For the tests I have been running, we are waiting for the release callout as part of exit.

Thanks,
Robin
From: Christoph L. <cla...@sg...> - 2008-04-16 19:15:15

On Wed, 16 Apr 2008, Robin Holt wrote:
> On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> > On Wed, 16 Apr 2008, Robin Holt wrote:
> >
> > > I don't think this lock mechanism is completely working. I have
> > > gotten a few failures trying to dereference 0x100100, which appears to
> > > be LIST_POISON1.
> >
> > How does xpmem unregistering of notifiers work?
>
> For the tests I have been running, we are waiting for the release
> callout as part of exit.

Some more details on the failure may be useful. AFAICT list_del[_rcu] is the culprit here, and that is only used on release or unregister.
From: Robin H. <ho...@sg...> - 2008-04-17 11:14:03

On Wed, Apr 16, 2008 at 12:15:08PM -0700, Christoph Lameter wrote:
> On Wed, 16 Apr 2008, Robin Holt wrote:
> > On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> > > On Wed, 16 Apr 2008, Robin Holt wrote:
> > >
> > > > I don't think this lock mechanism is completely working. I have
> > > > gotten a few failures trying to dereference 0x100100, which appears to
> > > > be LIST_POISON1.
> > >
> > > How does xpmem unregistering of notifiers work?
> >
> > For the tests I have been running, we are waiting for the release
> > callout as part of exit.
>
> Some more details on the failure may be useful. AFAICT list_del[_rcu] is
> the culprit here, and that is only used on release or unregister.

I think I have this understood now. It happens quite quickly (within 10 minutes) on a 128-rank job running a small data set in a loop. In these failing jobs, all the ranks are nearly symmetric. There is a certain part of each rank's address space that has access granted. All the ranks have included all the other ranks, including themselves, in exactly the same layout at exactly the same virtual address.

Rank 3 has hit _release and is beginning to clean up, but has not yet deleted the notifier from its list. Rank 9 calls the xpmem_invalidate_page() callout. That page was attached by rank 3, so we call zap_page_range on rank 3, which then calls back into xpmem's invalidate_range_start callout. The rank 3 _release callout begins and deletes its notifier from the list. Rank 9's call into rank 3's zap_page_range notifier then returns and dereferences LIST_POISON1.

I often confuse myself while trying to explain these, so please kick me where the holes in the flow appear. The console output from the simple debugging stuff I put in is a bit overwhelming. I am trying to figure out now which locks we hold as part of the zap callout that should have prevented the _release callout.

Thanks,
Robin
From: Andrea A. <an...@qu...> - 2008-04-17 15:52:26

On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> On Wed, 16 Apr 2008, Robin Holt wrote:
>
> > I don't think this lock mechanism is completely working. I have
> > gotten a few failures trying to dereference 0x100100, which appears to
> > be LIST_POISON1.
>
> How does xpmem unregistering of notifiers work?

Especially, are you using mmu_notifier_unregister?
From: Robin H. <ho...@sg...> - 2008-04-17 16:36:40

On Thu, Apr 17, 2008 at 05:51:57PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 16, 2008 at 11:35:38AM -0700, Christoph Lameter wrote:
> > On Wed, 16 Apr 2008, Robin Holt wrote:
> >
> > > I don't think this lock mechanism is completely working. I have
> > > gotten a few failures trying to dereference 0x100100, which appears to
> > > be LIST_POISON1.
> >
> > How does xpmem unregistering of notifiers work?
>
> Especially, are you using mmu_notifier_unregister?

In this case, we are not making the call to unregister; we are waiting for the _release callout, which has already removed it from the list.

In the event that the user has removed all the grants, we use unregister. That typically does not occur. We merely wait for exit processing to clean up the structures.

Thanks,
Robin
From: Andrea A. <an...@qu...> - 2008-04-17 17:14:45

On Thu, Apr 17, 2008 at 11:36:42AM -0500, Robin Holt wrote:
> In this case, we are not making the call to unregister; we are waiting
> for the _release callout, which has already removed it from the list.
>
> In the event that the user has removed all the grants, we use unregister.
> That typically does not occur. We merely wait for exit processing to
> clean up the structures.

Then it's very strange. LIST_POISON1 is set in n->next. If it were a second hlist_del triggering the bug, in theory LIST_POISON2 should trigger first, so perhaps it's really a notifier running despite mm_lock being taken? Could you post a full stack trace so I can see what's running into LIST_POISON1? If it's really a notifier running outside of some mm_lock, that will be _immediately_ visible from the stack trace that triggered the LIST_POISON1!

Also note, EMM isn't using the clean hlist_del; it's implementing the list by hand (with zero runtime gain), so all the debugging may not exist in EMM. So if this is really an mm_lock race and it only triggers with mmu notifiers and not with EMM, that doesn't necessarily mean EMM is bug free. If you have a full stack trace, it would greatly help to verify what is mangling the list when the oops triggers.

Thanks!
Andrea
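
For readers following along, here is a small stand-alone C model of why the faulting address is exactly 0x100100: the kernel's hlist_del() poisons the unlinked node, so a racing walker that still follows that node's ->next pointer faults at LIST_POISON1. The constants and the helper below mirror include/linux/poison.h and list.h in spirit, but are re-declared so the example compiles on its own; it illustrates the mechanism only, not the xpmem or notifier code itself.

#include <stdio.h>

#define LIST_POISON1 ((void *) 0x00100100)
#define LIST_POISON2 ((void *) 0x00200200)

struct hnode {
	struct hnode *next;
	struct hnode **pprev;
};

static void hlist_del_model(struct hnode *n)
{
	/* unlink ... */
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	/* ... then poison, so any use-after-delete faults loudly */
	n->next = (struct hnode *) LIST_POISON1;
	n->pprev = (struct hnode **) LIST_POISON2;
}

int main(void)
{
	struct hnode a = { NULL, NULL };
	struct hnode *head_first = &a;

	a.pprev = &head_first;
	hlist_del_model(&a);

	/* A concurrent list walker still holding &a and following a.next
	 * would now dereference 0x100100, which is the xpmem-style oops:
	 * one path hlist_del()'d the node while another was walking it. */
	printf("a.next after delete: %p\n", (void *) a.next);
	return 0;
}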
From: Robin H. <ho...@sg...> - 2008-04-17 17:25:55

On Thu, Apr 17, 2008 at 07:14:43PM +0200, Andrea Arcangeli wrote:
> On Thu, Apr 17, 2008 at 11:36:42AM -0500, Robin Holt wrote:
> > In this case, we are not making the call to unregister; we are waiting
> > for the _release callout, which has already removed it from the list.
> >
> > In the event that the user has removed all the grants, we use unregister.
> > That typically does not occur. We merely wait for exit processing to
> > clean up the structures.
>
> Then it's very strange. LIST_POISON1 is set in n->next. If it were a
> second hlist_del triggering the bug, in theory LIST_POISON2 should
> trigger first, so perhaps it's really a notifier running despite
> mm_lock being taken? Could you post a full stack trace so I can see
> what's running into LIST_POISON1? If it's really a notifier running
> outside of some mm_lock, that will be _immediately_ visible from the
> stack trace that triggered the LIST_POISON1!
>
> Also note, EMM isn't using the clean hlist_del; it's implementing the
> list by hand (with zero runtime gain), so all the debugging may not
> exist in EMM. So if this is really an mm_lock race and it only triggers
> with mmu notifiers and not with EMM, that doesn't necessarily mean EMM
> is bug free. If you have a full stack trace, it would greatly help to
> verify what is mangling the list when the oops triggers.

The stack trace is below. I did not do this level of testing on emm, so I can not compare the two in this area. This is for a different but equivalent failure. I just reproduced the LIST_POISON1 failure without trying to reproduce the exact same failure as I had documented earlier (lost that stack trace, sorry).

Thanks,
Robin

<1>Unable to handle kernel paging request at virtual address 0000000000100100
<4>mpi006.f.x[23403]: Oops 11012296146944 [1]
<4>Modules linked in: nfs lockd sunrpc binfmt_misc thermal processor fan button loop md_mod dm_mod xpmem xp mspec sg
<4>
<4>Pid: 23403, CPU 114, comm: mpi006.f.x
<4>psr : 0000121008526010 ifs : 800000000000038b ip : [<a00000010015d6a1>] Not tainted (2.6.25-rc8)
<4>ip is at __mmu_notifier_invalidate_range_start+0x81/0x120
<4>unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
<4>rnat: a000000100149a00 bsps: a000000000010740 pr : 66555666a9599aa9
<4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
<4>csd : 0000000000000000 ssd : 0000000000000000
<4>b0 : a00000010015d670 b6 : a0000002101ddb40 b7 : a00000010000eb50
<4>f6 : 1003e2222222222222222 f7 : 000000000000000000000
<4>f8 : 000000000000000000000 f9 : 000000000000000000000
<4>f10 : 000000000000000000000 f11 : 000000000000000000000
<4>r1 : a000000100ef1190 r2 : e0000e6080cc1940 r3 : a0000002101edd10
<4>r8 : e0000e6080cc1970 r9 : 0000000000000000 r10 : e0000e6080cc19c8
<4>r11 : 20000003a6480000 r12 : e0000c60d31efb90 r13 : e0000c60d31e0000
<4>r14 : 000000000000004d r15 : e0000e6080cc1914 r16 : e0000e6080cc1970
<4>r17 : 20000003a6480000 r18 : 20000007bf900000 r19 : 0000000000040000
<4>r20 : e0000c60d31e0000 r21 : 0000000000000010 r22 : e0000e6080cc19a8
<4>r23 : e0000c60c55f1120 r24 : e0000c60d31efda0 r25 : e0000c60d31efd98
<4>r26 : e0000e60812166d0 r27 : e0000c60d31efdc0 r28 : e0000c60d31efdb8
<4>r29 : e0000c60d31e0b60 r30 : 0000000000000000 r31 : 0000000000000081
<4>
<4>Call Trace:
<4> [<a000000100014a20>] show_stack+0x40/0xa0
<4>                                sp=e0000c60d31ef760 bsp=e0000c60d31e11f0
<4> [<a000000100015330>] show_regs+0x850/0x8a0
<4>                                sp=e0000c60d31ef930 bsp=e0000c60d31e1198
<4> [<a000000100035ed0>] die+0x1b0/0x2e0
<4>                                sp=e0000c60d31ef930 bsp=e0000c60d31e1150
<4> [<a000000100060e90>] ia64_do_page_fault+0x8d0/0xa40
<4>                                sp=e0000c60d31ef930 bsp=e0000c60d31e1100
<4> [<a00000010000ab00>] ia64_leave_kernel+0x0/0x270
<4>                                sp=e0000c60d31ef9c0 bsp=e0000c60d31e1100
<4> [<a00000010015d6a0>] __mmu_notifier_invalidate_range_start+0x80/0x120
<4>                                sp=e0000c60d31efb90 bsp=e0000c60d31e10a8
<4> [<a00000010011b1d0>] unmap_vmas+0x70/0x14c0
<4>                                sp=e0000c60d31efb90 bsp=e0000c60d31e0fa8
<4> [<a00000010011c660>] zap_page_range+0x40/0x60
<4>                                sp=e0000c60d31efda0 bsp=e0000c60d31e0f70
<4> [<a0000002101d62d0>] xpmem_clear_PTEs+0x350/0x560 [xpmem]
<4>                                sp=e0000c60d31efdb0 bsp=e0000c60d31e0ef0
<4> [<a0000002101d1e30>] xpmem_remove_seg+0x3f0/0x700 [xpmem]
<4>                                sp=e0000c60d31efde0 bsp=e0000c60d31e0ea8
<4> [<a0000002101d2500>] xpmem_remove_segs_of_tg+0x80/0x140 [xpmem]
<4>                                sp=e0000c60d31efe10 bsp=e0000c60d31e0e78
<4> [<a0000002101dda40>] xpmem_mmu_notifier_release+0x40/0x80 [xpmem]
<4>                                sp=e0000c60d31efe10 bsp=e0000c60d31e0e58
<4> [<a00000010015d7f0>] __mmu_notifier_release+0xb0/0x100
<4>                                sp=e0000c60d31efe10 bsp=e0000c60d31e0e38
<4> [<a000000100124430>] exit_mmap+0x50/0x180
<4>                                sp=e0000c60d31efe10 bsp=e0000c60d31e0e10
<4> [<a00000010008fb30>] mmput+0x70/0x180
<4>                                sp=e0000c60d31efe20 bsp=e0000c60d31e0dd8
<4> [<a000000100098df0>] exit_mm+0x1f0/0x220
<4>                                sp=e0000c60d31efe20 bsp=e0000c60d31e0da0
<4> [<a00000010009ca60>] do_exit+0x4e0/0xf40
<4>                                sp=e0000c60d31efe20 bsp=e0000c60d31e0d58
<4> [<a00000010009d640>] do_group_exit+0x180/0x1c0
<4>                                sp=e0000c60d31efe30 bsp=e0000c60d31e0d20
<4> [<a00000010009d6a0>] sys_exit_group+0x20/0x40
<4>                                sp=e0000c60d31efe30 bsp=e0000c60d31e0cc8
<4> [<a00000010000a960>] ia64_ret_from_syscall+0x0/0x20
<4>                                sp=e0000c60d31efe30 bsp=e0000c60d31e0cc8
<4> [<a000000000010720>] __kernel_syscall_via_break+0x0/0x20
<4>                                sp=e0000c60d31f0000 bsp=e0000c60d31e0cc8
From: Christoph L. <cla...@sg...> - 2008-04-17 19:10:59

On Thu, 17 Apr 2008, Andrea Arcangeli wrote:
> Also note, EMM isn't using the clean hlist_del; it's implementing the
> list by hand (with zero runtime gain), so all the debugging may not
> exist in EMM. So if this is really an mm_lock race and it only triggers
> with mmu notifiers and not with EMM, that doesn't necessarily mean EMM
> is bug free. If you have a full stack trace, it would greatly help to
> verify what is mangling the list when the oops triggers.

EMM was/is using a singly linked list, which allows atomic updates. It looked cleaner to me, since a doubly linked list must update two pointers.

I have not seen docs on the locking, so I am not sure why you use rcu operations here. Isn't the requirement to have either the rmap locks or mmap_sem held enough to guarantee the consistency of the doubly linked list?
From: Andrea A. <an...@qu...> - 2008-04-17 22:17:16

On Thu, Apr 17, 2008 at 12:10:52PM -0700, Christoph Lameter wrote:
> EMM was/is using a singly linked list, which allows atomic updates. It
> looked cleaner to me, since a doubly linked list must update two pointers.

Cleaner would be if it provided an abstraction in list.h. What matters is the memory taken by the list head for this usage.

> I have not seen docs on the locking, so I am not sure why you use rcu
> operations here. Isn't the requirement to have either the rmap locks or
> mmap_sem held enough to guarantee the consistency of the doubly linked list?

Yes, exactly; I'm not using rcu anymore.
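
To close the loop on this exchange, here is a compilable sketch of the two registration styles being contrasted; the names are illustrative only and this is not the actual EMM or mmu-notifier code. A singly linked list can be pushed onto with a single compare-and-swap, while an hlist-style doubly linked insert touches two pointers and instead relies on the lock its callers already hold (mmap_sem or the rmap locks, per the exchange above), which is why no RCU is needed there.

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* EMM-style: singly linked, registration by lock-free push. */
struct emm_notifier {
	struct emm_notifier *next;
};

static void emm_register(_Atomic(struct emm_notifier *) *head,
			 struct emm_notifier *e)
{
	struct emm_notifier *old = atomic_load(head);

	do {
		e->next = old;		/* one CAS publishes the node */
	} while (!atomic_compare_exchange_weak(head, &old, e));
}

/* mmu-notifier-style: hlist-like doubly linked insert, two pointer
 * stores, done under the lock the callers already hold (modelled by a
 * plain mutex here). */
struct hnotifier {
	struct hnotifier *next;
	struct hnotifier **pprev;
};

struct mm_model {
	pthread_mutex_t lock;		/* stands in for mmap_sem/rmap locks */
	struct hnotifier *first;
};

static void mmu_notifier_register_model(struct mm_model *mm, struct hnotifier *n)
{
	pthread_mutex_lock(&mm->lock);
	n->next = mm->first;
	n->pprev = &mm->first;
	if (mm->first)
		mm->first->pprev = &n->next;
	mm->first = n;
	pthread_mutex_unlock(&mm->lock);
}

int main(void)
{
	static _Atomic(struct emm_notifier *) emm_head;
	struct emm_notifier e = { NULL };
	struct mm_model mm = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct hnotifier n = { NULL, NULL };

	emm_register(&emm_head, &e);
	mmu_notifier_register_model(&mm, &n);
	return 0;
}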