From: <JD...@de...> - 2001-06-15 10:50:19
|
We analyzed dbench scalability behaviour on the ext2 file system with kernel 2.4.0. Dbench (ftp://samba.org/pub/tridge/dbench/) is an emulation of the Netbench benchmark. It produces only the filesystem load: it issues the same IO calls that the smbd server in Samba would produce when confronted with a netbench run, but it makes no networking calls.

The system used was an 8-way 700 MHz Intel Xeon, 8x1MB L2 cache, with a 37 GB IBM Ultrastar 36LZX disk on an Ultra2 SCSI controller (Adaptec AIC-7896). Main memory was limited to 1680 MB via the "mem=" kernel boot parameter. The throughput values show that the test does not exceed the buffer cache (theoretical maximum disk-buffer transfer speed: 43 MB/s).

We tested dbench by varying the number of clients from 1 to 30 and the number of CPUs from uniprocessor to 8 CPUs. Each dbench run was repeated 11 times and the first run was discarded as warmup.

We found for maximum throughput:

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 U  |   101.16       1.00
 1  |    94.46       0.93
 2  |   144.93       1.43
 4  |   195.01       1.93
 8  |   197.17       1.95

Looking at an excerpt of the data for the SMP kernel with 8 CPUs:

#clients             1      2      4      6      8     10     20     30
Throughput [MB/s] 92.26 143.00 188.29 197.17 173.88 179.68 173.18 175.34

The throughput reaches its maximum with 6 clients and decreases with 8 clients. For the other CPU configurations, throughput reaches its maximum when the number of clients equals the number of CPUs, and it does not drop as sharply between two adjacent measurement points.

Running the kernel profiler in pc mode [ticks] (5 loops with 30 clients), we found:

                         |             CPU             |increase from 2 CPUs
entry name               |     1      2      4      8  |to 8 CPUs by factor
-------------------------+-----------------------------+--------------------
USER                     | 8,507  8,605  8,834   9,002 |  1.1
__generic_copy_from_user | 5,572  5,804  9,111  14,787 |  2.6
file_read_actor          | 2,724  3,040  4,218   5,905 |  1.9
default_idle             | 1,651  1,245  2,572   8,281 |  6.7
stext_lock               |     0    574  5,212  27,993 | 48.8
misc                     | 7,501 11,140 16,930  30,028 |  4.0
-------------------------+-----------------------------+--------------------
total number of ticks    |25,955 30,408 46,877  95,996 |  3.2

The growing share of the stext_lock entry indicates that the kernel spends more and more time spinning for spinlocks. Calculating the percentage of these entries relative to the total number of ticks:

                         |               CPU               |
entry name               |    1       2       4       8    |
-------------------------+---------------------------------+
USER                     | 32.78%  28.30%  18.85%   9.38%  |
__generic_copy_from_user | 21.47%  19.09%  19.44%  15.40%  |
file_read_actor          | 10.50%  10.00%   9.00%   6.15%  |
default_idle             |  6.36%   4.09%   5.49%   8.63%  |
stext_lock               |  0.00%   1.89%  11.12%  29.16%  |
misc                     | 28.90%  36.64%  36.12%  31.28%  |
-------------------------+---------------------------------+
total number of ticks    |100.00% 100.00% 100.00% 100.00%  |

On 8 CPUs the stext_lock entry consumes 29% of the total CPU power without contributing to the workload.

Running lockmeter (5 loops with 30 clients), we found:

1. CPU utilization [%] spent spinning (looping to get a spin lock)

                 |       #CPU        |increase from 2 CPUs
lock name        |   2     4      8  |to 8 CPUs by factor
-----------------+-------------------+--------------------
kmap_lock        | 0.64  2.90  18.60 | 29.1
pagecache_lock   | 0.35  1.90   8.90 | 25.4
lru_list_lock    | 0.28  4.80   6.80 | 24.3
dcache_lock      | 0.13  0.37   0.64 |  4.9
pagemap_lru_lock | 0.15  0.46   0.90 |  6.0
kernel_flag      | 0.73  2.20   6.40 |  8.8
misc             | 0.02  0.27   0.26 | 13.0
-----------------+-------------------+--------------------
total            | 2.30 12.90  42.50 | 18.5

2. Average time [us] the lock is held

                 |       #CPU       |increase from 2 CPUs
lock name        |   2     4     8  |to 8 CPUs by factor
-----------------+------------------+--------------------
kmap_lock        | 0.60  1.00  2.90 |  4.8
pagecache_lock   | 0.90  1.50  3.10 |  3.4
lru_list_lock    | 0.80  1.80  2.80 |  3.5
dcache_lock      | 0.40  0.50  0.60 |  1.5
pagemap_lru_lock | 0.90  1.40  2.60 |  2.9
kernel_flag      | 1.90  3.20  6.00 |  3.2

The measurements show that when increasing the number of CPUs and clients for dbench, the additional CPU power is mostly invested in lock handling: more than 3 of the eight CPUs are spinning on these six locks. Curiously, the lock hold times also increase, by up to a factor of 4.8, which contributes to the growing spin times.

For comparison we tried the same workload on Linux/390 (on an IBM 9672-XZ7 G6). For maximum throughput, we found:

CPU | throughput  scalability
    |   [MB/s]
----+------------------------
 1  |    70.64       1.00
 2  |   134.62       1.91
 4  |   249.19       3.53
 8  |   422.10       5.98

Juergen Doelle
jd...@de...
IBM Lab Boeblingen - Linux Architecture & Performance
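For reference (an inference from the tables, not stated explicitly in the post): the scalability column appears to be throughput normalized to the baseline configuration, the uniprocessor kernel for the x86 runs and the 1-CPU case for Linux/390. For example:

    scalability(n) = throughput(n) / throughput(baseline)

    x86, 4 CPUs      (baseline = UP kernel): 195.01 / 101.16 = 1.93
    Linux/390, 8 CPUs (baseline = 1 CPU):    422.10 /  70.64 = 5.98
|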
From: Mark H. <ma...@ve...> - 2001-06-15 11:19:37
|
On Fri, 15 Jun 2001 JD...@de... wrote:
> Running lockmeter (5 loops with 30 clients), we found:
>
> 1. CPU utilization [%] spent spinning (looping to get a spin lock)
>
>                  |       #CPU        |increase from 2 CPUs
> lock name        |   2     4      8  |to 8 CPUs by factor
> -----------------+-------------------+--------------------
> kmap_lock        | 0.64  2.90  18.60 | 29.1

Sometime back, I posted a patch to reduce kmap_lock contention to l-k; the kernel (for x86 at least) was being far too protective of the PTEs when doing a shootdown.

I've included the patch I sent below. If you have time to re-run the 8-way benchmark, I'd be interested in the results.

Mark

diff -urN -X dontdiff linux-2.4.3/mm/highmem.c markhe-2.4.3/mm/highmem.c
--- linux-2.4.3/mm/highmem.c	Tue Nov 28 20:31:02 2000
+++ markhe-2.4.3/mm/highmem.c	Sat Mar 31 15:03:43 2001
@@ -46,7 +46,7 @@
 	for (i = 0; i < LAST_PKMAP; i++) {
 		struct page *page;
-		pte_t pte;
+
 		/*
 		 * zero means we don't have anything to do,
 		 * >1 means that it is still in use. Only
@@ -56,10 +56,21 @@
 		if (pkmap_count[i] != 1)
 			continue;
 		pkmap_count[i] = 0;
-		pte = ptep_get_and_clear(pkmap_page_table+i);
-		if (pte_none(pte))
+
+		/* sanity check */
+		if (pte_none(pkmap_page_table[i]))
 			BUG();
-		page = pte_page(pte);
+
+		/*
+		 * Don't need an atomic fetch-and-clear op here;
+		 * no-one has the page mapped, and cannot get at
+		 * its virtual address (and hence PTE) without first
+		 * getting the kmap_lock (which is held here).
+		 * So no dangers, even with speculative execution.
+		 */
+		page = pte_page(pkmap_page_table[i]);
+		pte_clear(&pkmap_page_table[i]);
+		page->virtual = NULL;
 	}
 	flush_tlb_all();
@@ -139,6 +150,7 @@
 {
 	unsigned long vaddr;
 	unsigned long nr;
+	int need_wakeup;

 	spin_lock(&kmap_lock);
 	vaddr = (unsigned long) page->virtual;
@@ -150,13 +162,31 @@
 	 * A count must never go down to zero
 	 * without a TLB flush!
 	 */
+	need_wakeup = 0;
 	switch (--pkmap_count[nr]) {
 	case 0:
 		BUG();
 	case 1:
-		wake_up(&pkmap_map_wait);
+		/*
+		 * Avoid an unnecessary wake_up() function call.
+		 * The common case is pkmap_count[] == 1, but
+		 * no waiters.
+		 * The tasks queued in the wait-queue are guarded
+		 * by both the lock in the wait-queue-head and by
+		 * the kmap_lock.  As the kmap_lock is held here,
+		 * no need for the wait-queue-head's lock.  Simply
+		 * test if the queue is empty.
+		 */
+		need_wakeup = waitqueue_active(&pkmap_map_wait);
 	}
 	spin_unlock(&kmap_lock);
+
+	/*
+	 * Can do wake-up, if needed, race-free outside of
+	 * the spinlock.
+	 */
+	if (need_wakeup)
+		wake_up(&pkmap_map_wait);
 }

 /*
|
From: John H. <ha...@en...> - 2001-06-15 17:00:29
|
From: "Mark Hemment" <ma...@ve...> > Sometime back, I posted a patch to reduce kmap_lock contention to > l-k; the kernel (for x86 at least) was being far to protective of the PTEs > when doing a shootdown. And Ingo Molnar's Redhat site (http://people.redhat.com/mingo) contains a patch to make the pagecache_lock more fine-grained. Unfortunately, the last time I looked, that patch was only for something like the 2.4.2 or 2.4.3 kernel, so you'll have to monkey with it a bit for 2.4.0. This patch ought to effectively eliminate that wasted contention on the pagecache_lock. John Hawkes ha...@en... |
From: Bill H. <ha...@au...> - 2001-06-16 19:40:27
|
Mark Hemment wrote:
>
> On Fri, 15 Jun 2001 JD...@de... wrote:
> > Running lockmeter (5 loops with 30 clients), we found:
> >
> > 1. CPU utilization [%] spent spinning (looping to get a spin lock)
> >
> >                  |       #CPU        |increase from 2 CPUs
> > lock name        |   2     4      8  |to 8 CPUs by factor
> > -----------------+-------------------+--------------------
> > kmap_lock        | 0.64  2.90  18.60 | 29.1
>
> Sometime back, I posted a patch to reduce kmap_lock contention to
> l-k; the kernel (for x86 at least) was being far too protective of the PTEs
> when doing a shootdown.
> I've included the patch I sent below. If you have time to re-run the
> 8-way benchmark, I'd be interested in the results.

Mark,

Here is the raw lockmeter output from above:

kmap_lock

SPINLOCKS        HOLD            WAIT
 UTIL   CON   MEAN(  MAX )   MEAN(   MAX )(% CPU)     TOTAL  NOWAIT  SPIN

*Total*
       18.9%  2.6us(  56ms)   33us(  56ms)(42.5%)  90324784   81.1% 18.9%

kmap_lock
 26.6% 37.0%  2.9us( 759us)   43us(  10ms)(18.6%)  15368850   63.0% 37.0%
  kmap_high+0x10
  7.8% 35.8%  1.7us( 759us)   45us(  10ms)( 9.4%)   7684425   64.2% 35.8%
  kunmap_high+0xc
 18.9% 38.2%  4.1us( 445us)   41us(1506us)( 9.2%)   7684425   61.8% 38.2%

So (26.6% utilization) + (18.6% CPU * 8 cpus spinning) = 1.75 CPUs spinning on or holding this lock. Needless to say, this is substantial on an 8-way. Looking at kmap_high(), kunmap_high(), map_new_virtual(), and flush_all_zero_pkmaps(), there is room for improvement - the code searches the 1024 slots for a free virtual address .... cache effects on SMP ....

Is kmap_high/kunmap_high necessary for a 4GB kernel? I wonder if this will go away in 2.5?

Also, the max wait times of 10ms and 1506us are interesting.

----

So I was curious how a 1GB 2.4.4 kernel would perform. A 1GB kernel would remove the kmap_high/kunmap_high calls. Since the working set for "dbench 8" should fit in < 1GB, one might expect an improvement in throughput if the 1.75 CPUs could be turned into useful work.

I did a sniff test of a 1GB 2.4.4 kernel vs. a 4GB 2.4.4 kernel running "dbench 8" on a similar 8-way. I also ran lockmeter on the 1GB 2.4.4 kernel. The file system was ext2.

Throughput did not change much between the 1GB and 4GB kernel - maybe a little better. The hot locks shifted:

SPINLOCKS        HOLD            WAIT
 UTIL   CON   MEAN(  MAX )   MEAN(   MAX )(% CPU)     TOTAL  NOWAIT  SPIN

*Total*
       26.2%  5.0us(  42ms)   52us(  27ms)(51.7%)   3393440   73.8% 26.2%

dcache_lock
  3.9% 23.2%  1.0us( 355us)   29us( 864us)( 3.4%)    452657   76.8% 23.2%
kernel_flag
 33.3% 27.7%  9.6us(  42ms)   65us(  27ms)( 7.8%)    387534   72.3% 27.7%
lru_list_lock
 35.8% 28.3%  6.8us( 574us)  107us(3169us)(19.8%)    584612   71.7% 28.3%
pagecache_lock
 29.3% 52.1%  4.6us( 333us)   46us(1194us)(19.1%)    706808   47.9% 52.1%
pagemap_lru_lock
 12.9% 23.3%  3.1us( 250us)  9.4us( 322us)( 1.1%)    461914   76.7% 23.3%

Bill Hartner
ha...@au...
[out of the office till June 25 2001]
|
From: Andi K. <ak...@su...> - 2001-06-16 19:54:27
|
On Sat, Jun 16, 2001 at 02:41:48PM -0500, Bill Hartner wrote:
> So (26.6% utilization) + (18.6% CPU * 8 cpus spinning) = 1.75 CPUs
> spinning on or holding this lock. Needless to say, this is substantial
> on an 8-way. Looking at kmap_high(), kunmap_high(), map_new_virtual(),
> and flush_all_zero_pkmaps(), there is room for improvement - the code
> searches the 1024 slots for a free virtual address .... cache effects on
> SMP ....

It should probably use a freelist stored in the free kmap entries; then it would be O(1).

> Is kmap_high/kunmap_high necessary for a 4GB kernel? I wonder if this
> will go away in 2.5?

It is currently necessary for driver compatibility but will hopefully go away in 2.5 with a better driver interface and fixed drivers.

-Andi
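A minimal sketch of the freelist idea above, assuming the 2.4 pkmap arrays. It uses a hypothetical side array for the linkage rather than storing the links inside the free entries themselves, and all names (pkmap_free_head, pkmap_alloc_slot, ...) are illustrative, not from any posted patch:

/*
 * Hypothetical O(1) pkmap slot allocation.  kmap_lock is assumed to be
 * held by all callers, as in the 2.4 highmem code.
 */
#define LAST_PKMAP	1024
#define PKMAP_NIL	(-1)

static int pkmap_free_head = PKMAP_NIL;	/* first free slot, or PKMAP_NIL */
static int pkmap_free_next[LAST_PKMAP];	/* next free slot in the chain */

static void pkmap_freelist_init(void)
{
	int i;

	for (i = LAST_PKMAP - 1; i >= 0; i--) {
		pkmap_free_next[i] = pkmap_free_head;
		pkmap_free_head = i;
	}
}

/* O(1) allocation: pop the head, instead of scanning all LAST_PKMAP slots. */
static int pkmap_alloc_slot(void)
{
	int i = pkmap_free_head;

	if (i != PKMAP_NIL)
		pkmap_free_head = pkmap_free_next[i];
	return i;	/* PKMAP_NIL means no slot is free */
}

/* O(1) release: push a slot back, e.g. from flush_all_zero_pkmaps(). */
static void pkmap_free_slot(int i)
{
	pkmap_free_next[i] = pkmap_free_head;
	pkmap_free_head = i;
}
|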
From: Mark H. <ma...@ve...> - 2001-06-18 10:16:51
|
On Sat, 16 Jun 2001, Bill Hartner wrote:
> Here is the raw lockmeter output from above:
>
> kmap_lock
>
> SPINLOCKS        HOLD            WAIT
>  UTIL   CON   MEAN(  MAX )   MEAN(   MAX )(% CPU)     TOTAL  NOWAIT  SPIN
>
> *Total*
>        18.9%  2.6us(  56ms)   33us(  56ms)(42.5%)  90324784   81.1% 18.9%
>
> kmap_lock
>  26.6% 37.0%  2.9us( 759us)   43us(  10ms)(18.6%)  15368850   63.0% 37.0%
>   kmap_high+0x10
>   7.8% 35.8%  1.7us( 759us)   45us(  10ms)( 9.4%)   7684425   64.2% 35.8%
>   kunmap_high+0xc
>  18.9% 38.2%  4.1us( 445us)   41us(1506us)( 9.2%)   7684425   61.8% 38.2%
>
> So (26.6% utilization) + (18.6% CPU * 8 cpus spinning) = 1.75 CPUs
> spinning on or holding this lock. Needless to say, this is substantial
> on an 8-way. Looking at kmap_high(), kunmap_high(), map_new_virtual(),
> and flush_all_zero_pkmaps(), there is room for improvement - the code
> searches the 1024 slots for a free virtual address .... cache effects on
> SMP ....

Ugh, those numbers look awful.

The "last_pkmap_nr" should, in the common case, mean that all 1024 (or 512 on PAE) slots aren't searched, but a free-list would help map_new_virtual(). I'll code a prototype to do this.

Also, the "kmap_lock" could well end up on the same L1 cacheline as "last_pkmap_nr", which will give very bad effects (false sharing).

> Is kmap_high/kunmap_high necessary for a 4GB kernel? I wonder if this
> will go away in 2.5?

By default, the kernel has a 1GB address-space, so kmaps are necessary for the kernel to access the contents of highmem pages.

You can compile the kernel with a larger address-space, which eats into the user-space's 3GB. To reverse the split, edit include/asm/page.h to change __PAGE_OFFSET from 0xC0000000 to 0x40000000. Also edit arch/i386/vmlinux.lds, changing 0xC0000000 to 0x40000000. This will give a kernel with a 3GB address-space.

NOTE: For the page-cache, Linux first gives out highmem followed by normal memory. So, even with the large address-space, on a box with 4GB, approx. the first GB of page-cache memory will come from highmem.

Mark
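A sketch of the split change described above, against 2.4 include/asm-i386/page.h. The surrounding lines are quoted from memory and may differ slightly from the real tree; the same constant must also be changed in arch/i386/vmlinux.lds:

/*
 * include/asm-i386/page.h -- sketch of a 3GB-kernel / 1GB-user split.
 * Default is __PAGE_OFFSET = 0xC0000000 (3GB user / 1GB kernel);
 * moving it down to 0x40000000 gives the kernel 3GB of address space.
 */
#if 0
#define __PAGE_OFFSET		(0xC0000000)	/* default split */
#else
#define __PAGE_OFFSET		(0x40000000)	/* 1GB user / 3GB kernel */
#endif

#define PAGE_OFFSET		((unsigned long)__PAGE_OFFSET)
#define __pa(x)			((unsigned long)(x) - PAGE_OFFSET)
#define __va(x)			((void *)((unsigned long)(x) + PAGE_OFFSET))
|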
From: <ma...@ve...> - 2001-07-03 11:20:26
Attachments:
high.patch
|
On Sat, 16 Jun 2001, Bill Hartner wrote:
> So (26.6% utilization) + (18.6% CPU * 8 cpus spinning) = 1.75 CPUs
> spinning on or holding this lock. Needless to say, this is substantial
> on an 8-way. Looking at kmap_high(), kunmap_high(), map_new_virtual(),
> and flush_all_zero_pkmaps(), there is room for improvement - the code
> searches the 1024 slots for a free virtual address .... cache effects on
> SMP ....

Hi Bill,

I've been doing some more work on the kmap_lock.

The first part isn't very interesting - it modifies highmem.c to use a freelist linkage. Under "normal" usage patterns, I wouldn't expect this to give much of an improvement.

The second part is more interesting. :)

Currently, I don't have access to an 8-way box (only a 4), but I'm guessing the kmap_lock contention is caused by the flush_tlb_all() which is done with the kmap_lock held.

Now, flush_tlb_all() sends an IPI to all engines (processors) and _waits_ for them all to perform the shootdown. If any of the engines have interrupts blocked, then flush_tlb_all() busy-waits until the interrupt (IPI) is delivered and processed. This can be a significant number of CPU cycles - especially as locks which require interrupts to be blocked spin with ints disabled (imagine such a lock which has contention - it can be quite some time until the last contender breaks out of the critical region and enables ints).

Analysing the usage of flush_tlb_all() shows that it does not need to busy-wait - it can simply send the "shootdown" IPI and continue; it doesn't even need to busy-wait for an ack from the other engines. i.e. flush_tlb_all() can become asynchronous.

For example, think about the flush_tlb_all() for highmem. New mappings cannot be created with interrupts disabled (else the original flush_tlb_all() could deadlock the system), nor from within an interrupt handler. The same holds for "dropping" a mapping and for gaining a reference to an existing mapping. In fact, an engine's TLB doesn't need to be flushed until its next call to kmap_high() or until a context-switch occurs on the engine. As the highmem TLBs are marked "global", we'd need to add an extra test in schedule(), which I'd rather stay away from. So, instead, we can wait for the flush on an engine to occur when it enables interrupts.

This works for the highmem case, and for other uses of flush_tlb_all(). At least, I believe it does - can anyone find an existing case where it doesn't?

Ingo, you know the relevant code better than anyone else. Is the idea sound?

Does this sound fragile? Yes, it is, but if it improves scalability and is well documented, then it is worth doing.

I've attached a patch against 2.4.5. The original code was pulled from a highly modified tree, but I don't think I've made any mistakes...

Mark
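A rough sketch of the asynchronous-flush idea described above. The real code is in the attached high.patch (not reproduced here), so everything below is illustrative: tlb_flush_pending, send_shootdown_ipi() and check_pending_tlb_flush() are hypothetical names.

#include <linux/smp.h>
#include <asm/bitops.h>
#include <asm/pgtable.h>

/*
 * Instead of waiting for every engine to acknowledge the shootdown IPI,
 * the sender just marks each CPU as needing a flush; a CPU performs the
 * flush the next time it re-enables interrupts (or enters kmap_high()).
 */
static volatile unsigned long tlb_flush_pending;	/* one bit per CPU */

/* Sender side: fire-and-forget shootdown request. */
static void flush_tlb_all_async(void)
{
	tlb_flush_pending = cpu_online_map & ~(1UL << smp_processor_id());
	__flush_tlb_all();		/* flush the local TLB immediately */
	send_shootdown_ipi();		/* hypothetical: send IPI, don't wait */
}

/* Receiver side: called from the IPI handler, or when ints are re-enabled. */
static void check_pending_tlb_flush(void)
{
	int cpu = smp_processor_id();

	if (tlb_flush_pending & (1UL << cpu)) {
		clear_bit(cpu, &tlb_flush_pending);
		__flush_tlb_all();
	}
}
|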
From: Ingo M. <mi...@el...> - 2001-07-03 11:29:21
|
Has anyone experimented with increasing the size of the kmap area? Right now it's 2 MB on PAE and 4 MB on non-PAE x86 kernels. The kmap design doesn't really work well if there is contention for kmap slots, so one way to go would be to double/quadruple the size. (I don't remember whether the initial kmap-pagetable setup code is ready to use a bigger size - I think it's almost ready: the pagetables definitely need to be allocated contiguously, for the array indexing to work.)

Ingo
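For reference, a sketch of what doubling the kmap window could look like, based on the 2.4 include/asm-i386/highmem.h constants as I recall them (check against the real tree). One page table covers 4MB with 4-byte PTEs (non-PAE) or 2MB with 8-byte PTEs (PAE), which is why LAST_PKMAP is 1024/512 today; going larger needs the extra page tables allocated contiguously, as noted above:

#ifdef CONFIG_X86_PAE
#define LAST_PKMAP	1024		/* was 512:  kmap window 2MB -> 4MB */
#else
#define LAST_PKMAP	2048		/* was 1024: kmap window 4MB -> 8MB */
#endif

/*
 * PKMAP_BASE (0xfe000000 in the stock tree) would have to move down by
 * the same amount so the enlarged window still fits below the fixmaps,
 * and the boot-time setup must map the extra, contiguous page tables so
 * that pkmap_page_table[] indexing keeps working.
 */
|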
From: <ma...@ve...> - 2001-07-03 11:33:44
|
On Tue, 3 Jul 2001, Ingo Molnar wrote:
> Has anyone experimented with increasing the size of the kmap area? Right
> now it's 2 MB on PAE and 4 MB on non-PAE x86 kernels. The kmap design
> doesn't really work well if there is contention for kmap slots, so one way
> to go would be to double/quadruple the size. (I don't remember whether the
> initial kmap-pagetable setup code is ready to use a bigger size - I think
> it's almost ready: the pagetables definitely need to be allocated
> contiguously, for the array indexing to work.)

A colleague has a patch for this, but it is part of a much larger work. It might be worthwhile pulling it out as a standalone patch.

Mark
|
From: John H. <ha...@en...> - 2001-07-03 17:34:41
|
From: <ma...@ve...>
> Now, flush_tlb_all() sends an IPI to all engines (processors) and
> _waits_ for them all to perform the shootdown.
> If any of the engines have interrupts blocked, then flush_tlb_all() busy-
> waits until the interrupt (IPI) is delivered and processed. This can be a
> significant number of CPU cycles - especially as locks which require
> interrupts to be blocked spin with ints disabled (imagine such a lock which
> has contention - it can be quite some time until the last contender breaks
> out of the critical region and enables ints).

I've been running a different workload (a variant of AIM7 subtests) on a 32-CPU ia64 system that happens to need to use IPIs for any and all tlb shootdowns, and this is definitely a bottleneck at 32p. I've seen 50% of total available CPU cycles consumed waiting on the spinlock used in the ia64 smp_call_function() (waiting for all the CPUs to see the IPI).

In my workload the culprit is flush_tlb_range(), not flush_tlb_all(), and half the calls stem from handle_mm_fault() (-> handle_pte_fault -> do_wp_page -> break_cow -> establish_pte -> flush_tlb_range). The other half of the load stems from do_munmap() (mostly from sys_munmap).

I have more data that I won't bore you with. I just wanted to echo Mark's concern about tlb shootdowns that are implemented by IPIs. I, too, suspect that some or all of the long spinlock hold-time delays are due to a target CPU having interrupts disabled, though I haven't yet figured out a spiffy way to prove it (or even better, a spiffy way to expose who is holding off interrupts -- though I've heard rumors that the Qlogic FC driver might be a culprit). And I wanted to observe that flush_tlb_all() isn't the only tlb shootdown problem.

John Hawkes
ha...@en...
|
From: <ma...@ve...> - 2001-07-03 18:25:53
|
On Tue, 3 Jul 2001, John Hawkes wrote:
> I have more data that I won't bore you with. I just wanted to echo
> Mark's concern about tlb shootdowns that are implemented by IPIs. I,
> too, suspect that some or all of the long spinlock hold-time delays are
> due to a target CPU having interrupts disabled, though I haven't yet
> figured out a spiffy way to prove it (or even better, a spiffy way to
> expose who is holding off interrupts -- though I've heard rumors that
> the Qlogic FC driver might be a culprit). And I wanted to observe that
> flush_tlb_all() isn't the only tlb shootdown problem.

I've been wondering whether it is worthwhile implementing interrupt priority levels. For h/w which doesn't support this (IA32), cli()/sti() (and friends) become "soft" interrupt blockers. That is, they play with a per-engine value which is tested when an interrupt is received, to see if it should be serviced now or deferred (or even passed to a different engine to service).

Before Linux, all UNIX OSes I worked on had priority levels. While the underlying code does get slightly ugly, they do allow 'high-priority' interrupts (such as IPIs) to get through (and be serviced) with low latency.

I'm certainly not suggesting going to the 7(?) levels which SVR3.2 had, but simply three levels:

  o Base   (no interrupts deferred)
  o Medium (non high-priority interrupts deferred)
  o High   (all interrupts disabled)

There are problems of interrupt storms at 'medium', but that problem already exists anyway.

Mark
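A minimal sketch of the "soft" interrupt blocking described above. Nothing here is from a real patch: soft_cli()/soft_sti(), handle_deferred_irq() and the per-engine arrays are all hypothetical names, and real code would also need memory barriers and per-CPU cacheline alignment.

#include <linux/smp.h>
#include <linux/threads.h>
#include <asm/bitops.h>

extern void handle_deferred_irq(int irq);	/* hypothetical replay helper */

enum ipl { IPL_BASE = 0, IPL_MEDIUM = 1, IPL_HIGH = 2 };

static int cpu_ipl[NR_CPUS];			/* current level per engine */
static unsigned long deferred_irqs[NR_CPUS];	/* bitmask of held-back IRQs */

static inline void soft_cli(void)	/* replaces cli(): defer normal ints */
{
	cpu_ipl[smp_processor_id()] = IPL_MEDIUM;
}

static inline void soft_sti(void)	/* replaces sti(): replay deferrals */
{
	int cpu = smp_processor_id();

	cpu_ipl[cpu] = IPL_BASE;
	while (deferred_irqs[cpu]) {
		int irq = ffz(~deferred_irqs[cpu]);	/* lowest pending IRQ */

		deferred_irqs[cpu] &= ~(1UL << irq);
		handle_deferred_irq(irq);
	}
}

/* Called from the interrupt entry path, before dispatching the handler. */
static inline int irq_should_run_now(int irq, int is_high_priority)
{
	int cpu = smp_processor_id();

	if (is_high_priority || cpu_ipl[cpu] == IPL_BASE)
		return 1;			/* service now (e.g. TLB IPIs) */
	deferred_irqs[cpu] |= 1UL << irq;	/* hold back until soft_sti() */
	return 0;
}
|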
From: John H. <ha...@en...> - 2001-07-03 18:40:14
|
> I've been wondering whether it is worthwhile implementing interrupt
> priority levels. For h/w which doesn't support this (IA32), cli()/sti()
> (and friends) become "soft" interrupt blockers.

The 32p ia64 system I use has interrupt priority levels, yet I'm still seeing occasional IPI holdoffs of upwards of 10msec, sometimes more, even though IPI interrupts are treated as the highest priority interrupts (or the second highest, below NMI).

As I said, I don't have any proof that the smp_call_function() stalls are due to target CPUs with disabled interrupts. I'd like to determine this, once and for all, before I get too focused on interrupt handling strategies.

John Hawkes
ha...@en...
From: Ingo M. <mi...@el...> - 2001-07-04 13:17:29
|
John,

a perhaps better solution on x86 is to make the TLB IPI handler an NMI interrupt. This solution is ideal: the only overhead would be the IPI latency, no waiting on the target CPU.

To implement this, put a call to smp_invalidate_interrupt() into do_nmi(), disable the NMI watchdog, and add APIC_DM_NMI or-ed to the first argument of send_IPI_mask() in the TLB flush code. This should work because smp_invalidate_interrupt() is/should be simple enough to be called from an NMI context.

Could you try this solution and verify how much time is spent waiting for the TLB shootdown to finish?

Ingo
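A rough sketch of that experiment against 2.4 arch/i386, purely illustrative and untested: the NMI watchdog must be disabled first, the existing do_nmi() body (parity/IO-check handling) is omitted here, and exactly which send_IPI_mask() argument carries the APIC_DM_NMI delivery mode depends on the tree, so the ICR plumbing should be checked before trying it.

#include <linux/linkage.h>
#include <asm/ptrace.h>
#include <asm/apicdef.h>
#include <asm/hw_irq.h>

asmlinkage void smp_invalidate_interrupt(void);

/* arch/i386/kernel/traps.c: route the NMI to the TLB flush handler.
 * smp_invalidate_interrupt() only flushes the local TLB from flush_mm/
 * flush_va, so it is (or should be) safe to call from NMI context. */
asmlinkage void do_nmi(struct pt_regs *regs, long error_code)
{
	smp_invalidate_interrupt();
}

/* arch/i386/kernel/smp.c, TLB flush path: deliver the shootdown with
 * NMI delivery mode instead of the fixed INVALIDATE_TLB_VECTOR IPI. */
static void send_shootdown_as_nmi(unsigned long cpumask)
{
	send_IPI_mask(cpumask, APIC_DM_NMI | INVALIDATE_TLB_VECTOR);
}
|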
From: <ma...@ve...> - 2001-07-04 14:18:35
|
On Wed, 4 Jul 2001, Ingo Molnar wrote:
> a perhaps better solution on x86 is to make the TLB IPI handler an NMI
> interrupt. This solution is ideal: the only overhead would be the IPI
> latency, no waiting on the target CPU.

Ingo,

Is that race free?

John was reporting scalability problems with flush_tlb_range(), which uses (on IA32 - John was on IA64) "flush_mm" and the tlbstate_lock. Doesn't feel safe if using the NMI...

Mark
|
From: Andi K. <ak...@su...> - 2001-07-04 14:22:19
|
On Wed, Jul 04, 2001 at 03:18:55PM +0100, ma...@ve... wrote:
> On Wed, 4 Jul 2001, Ingo Molnar wrote:
> > a perhaps better solution on x86 is to make the TLB IPI handler an NMI
> > interrupt. This solution is ideal: the only overhead would be the IPI
> > latency, no waiting on the target CPU.
>
> Ingo,
>
> Is that race free?
>
> John was reporting scalability problems with flush_tlb_range(), which
> uses (on IA32 - John was on IA64) "flush_mm" and the tlbstate_lock. Doesn't
> feel safe if using the NMI...

It should be. x86 has some special hacks in hardware to not re-enter NMI handlers.

I think Andrea Arcangeli also played earlier with NMI TLB flushes, if you need some experience with it.

-Andi
|
From: <ma...@ve...> - 2001-07-04 14:28:36
|
On Wed, 4 Jul 2001, Andi Kleen wrote:
> On Wed, Jul 04, 2001 at 03:18:55PM +0100, ma...@ve... wrote:
> > Ingo,
> >
> > Is that race free?
> >
> > John was reporting scalability problems with flush_tlb_range(), which
> > uses (on IA32 - John was on IA64) "flush_mm" and the tlbstate_lock. Doesn't
> > feel safe if using the NMI...
>
> It should be. x86 has some special hacks in hardware to not re-enter
> NMI handlers.
>
> I think Andrea Arcangeli also played earlier with NMI TLB flushes, if you
> need some experience with it.

Thanks, Andi. Just thought about it some more, and realised all is OK.

Mark
|
From: John H. <ha...@en...> - 2001-07-05 21:23:37
|
From: "Ingo Molnar" <mi...@el...> To: "John Hawkes" <ha...@en...> > John, > > a perhaps better solution on x86 is to make the TLB IPI handler an NMI > interrupt. Interesting experiment. But my problem is with a 32p ia64 system, not an i386 system. Even my 32p mips64 ccNUMA system doesn't bottleneck on IPI-implemented tlb shootdowns. John Hawkes |