From: Hanna L. <ha...@us...> - 2002-02-27 18:18:50
|
Congratulations to everyone working to reduce contention on the Big Kernel Lock. According to this benchmark your work has paid off! John Hawkes of SGI asked me to beta-test his 2.5.5 version of lockmeter last night (that's what I get for asking when it would be out). The results were interesting enough to post.

All three runs were on an 8-way SMP system running dbench with up to 16 clients, 10 times each. The results are at http://lse.sf.net/locking . Throughput numbers are not included yet; I need to rerun dbench without lockmeter to get accurate throughput results.

Read down the Con(tention) column. *TOTAL* is for the whole system; kernel_flag is for every function holding the BKL, combined.

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

2.4.17:
        13.7%  2.2us(  43ms)   31us(  43ms)(20.0%) 232292367 86.3% 13.7% 0.00% *TOTAL*
 33.9%  40.3%   11us(  43ms)   51us(  43ms)( 8.2%)  19725127 59.7% 40.3%    0% kernel_flag

2.5.3:
        11.1%  1.0us(  21ms)  8.2us(  18ms)( 3.8%) 738953957 88.9% 11.1% 0.00% *TOTAL*
 10.4%  22.6%  8.3us(  21ms)   23us(  18ms)(0.81%)  27982565 77.4% 22.6%    0% kernel_flag

2.5.5:
         8.6%  1.6us( 100ms)   30us(  86ms)( 9.4%) 783373441 91.4%  8.6% 0.00% *TOTAL*
  1.2%  0.33%  2.5us(  50ms) 1167us(  43ms)(0.23%)  12793605 99.7% 0.33%    0% kernel_flag

Hanna Linder
IBM Linux Technology Center
ha...@us...
|
From: Martin J. B. <Mar...@us...> - 2002-02-27 18:35:03
|
> 2.5.5:
>
>  8.6%  1.6us( 100ms)   30us(  86ms)( 9.4%) 783373441 91.4%  8.6% 0.00% *TOTAL*

Whilst it's great to see BKL contention going down, this:

 0.16% 0.25%  0.7us( 100ms)  252us(  86ms)(0.02%)  6077746 99.8% 0.25%    0% inode_lock
 0.03% 0.11%  0.6us(  55us)  2.1us( 9.9us)(0.00%)  1322338 99.9% 0.11%    0% __mark_inode_dirty+0x48
 0.00%    0%  0.7us( 5.9us)    0us                     391  100%    0%    0% get_new_inode+0x28
 0.00% 0.22%  2.5us(  50us)  495us(  28ms)(0.00%)    50397 99.8% 0.22%    0% iget4+0x3c
 0.03% 0.28%  0.6us(  26us)   30us(  58ms)(0.00%)  1322080 99.7% 0.28%    0% insert_inode_hash+0x44
 0.04% 0.29%  0.5us(  39us)  332us(  86ms)(0.01%)  2059365 99.7% 0.29%    0% iput+0x68
 0.03% 0.30%  0.7us(  57us)  422us(  77ms)(0.01%)  1323036 99.7% 0.30%    0% new_inode+0x1c
 0.03%  8.3%   63ms( 100ms)  3.8us( 3.8us)(0.00%)       12 91.7%  8.3%    0% prune_icache+0x1c
 0.00%    0%  1.0us( 5.2us)    0us                      34  100%    0%    0% sync_unlocked_inodes+0x10
 0.00%    0%  1.0us( 2.4us)    0us                      93  100%    0%    0% sync_unlocked_inodes+0x110

looks a little distressing - the hold times on inode_lock by prune_icache look bad in terms of latency (contention is still low, but people are still waiting on it for a very long time). Is this a transient thing, or do people think this is going to be a problem?

Martin.
|
From: Andrew M. <ak...@zi...> - 2002-02-27 21:16:59
|
"Martin J. Bligh" wrote:
>
> ...
> looks a little distressing - the hold times on inode_lock by prune_icache
> look bad in terms of latency (contention is still low, but people are still
> waiting on it for a very long time). Is this a transient thing, or do people
> think this is going to be a problem?

inode_lock hold times are a problem for other reasons. Leaving this unfixed makes the preemptible kernel rather pointless. An ideal fix would be to release inodes based on VM pressure against their backing page. But I don't think anyone's started looking at inode_lock yet.

The big one is lru_list_lock, of course. I'll be releasing code in the next couple of days which should take that off the map. Testing would be appreciated.

I have a concern about the lockmeter results. Lockmeter appears to be measuring lock frequency, hold times, and contention. But is it measuring the cost of the cacheline transfers? For example, I expect that with delayed allocation and radix-tree pagecache, one of the major remaining bottlenecks will be ownership of the superblock semaphore's cacheline. Is this measurable? (Actually, we may then be at the point where copy_from_user costs dominate.)

-
|
From: Andrew M. <ak...@zi...> - 2002-02-27 20:16:52
|
"Martin J. Bligh" wrote:
>
> > inode_lock hold times are a problem for other reasons. Leaving this
> > unfixed makes the preemptible kernel rather pointless. An ideal
> > fix would be to release inodes based on VM pressure against their backing
> > page. But I don't think anyone's started looking at inode_lock yet.
> >
> > The big one is lru_list_lock, of course. I'll be releasing code in
> > the next couple of days which should take that off the map. Testing
> > would be appreciated.
>
> Seeing as people seem to be interested ... there are some big holders
> of BKL around too - do_exit shows up badly (50ms in the data Hanna
> posted, and I've seen that a lot before).

That'll be where exit() takes down the task's address space: zap_page_range(). That's a nasty one.

> I've seen sync_old_buffers
> hold the BKL for 64ms on an 8way Specweb99 run (22Gb of RAM?)
> (though this was on an older 2.4 kernel, and might be fixed by now).

That will still be there - presumably it's where we walk the per-superblock dirty inode list. Hmm. For lru_list_lock we can do an end-around by not using buffers at all.

The other big one is truncate_inode_pages(). With ratcache it's not a contention problem, but it is a latency problem. I expect that we can drastically reduce the lock hold time there by simply snipping the wholly-truncated pages out of the tree, and thus privatising them so they can be disposed of outside any locking.

-
|
From: Alexander V. <vi...@ma...> - 2002-02-27 21:48:17
|
On Wed, 27 Feb 2002, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> > ...
> > looks a little distressing - the hold times on inode_lock by prune_icache
> > look bad in terms of latency (contention is still low, but people are still
> > waiting on it for a very long time). Is this a transient thing, or do people
> > think this is going to be a problem?
>
> inode_lock hold times are a problem for other reasons.

ed mm/vmscan.c <<EOF
/shrink_icache_memory/s/priority/1/
w
q
EOF

and repeat the tests. Unreferenced inodes == useless inodes. Aging is already taken care of in dcache, and anything that had fallen through is fair game.
|
From: Hanna L. <ha...@us...> - 2002-02-27 23:12:35
|
--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <vi...@ma...> wrote:
>
>> > looks a little distressing - the hold times on inode_lock by prune_icache
>> > look bad in terms of latency (contention is still low, but people are still
>> > waiting on it for a very long time). Is this a transient thing, or do people
>> > think this is going to be a problem?
>>
>> inode_lock hold times are a problem for other reasons.
>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

I applied your patch and reran the tests. Looks like you solved the problem:

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
         7.1%  0.7us(  19ms)  7.7us(  17ms)( 2.6%) 779799309 92.9%  7.1% 0.00% *TOTAL*
 0.16%  0.29%  0.6us(  91us)  2.2us(  46us)(0.00%)   5495642 99.7% 0.29%    0% inode_lock
 0.90%  0.47%  1.4us(  19ms)  280us(  17ms)(0.10%)  12681192 99.5% 0.47%    0% kernel_flag

The results are again stored at http://lse.sf.net/locking .

Hanna
|
From: Hanna L. <ha...@us...> - 2002-02-27 23:30:59
|
--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <vi...@ma...> wrote:
>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

FYI: The patch does this:

*** vmscan.c.orig	Wed Feb 27 14:09:49 2002
--- vmscan.c	Wed Feb 27 14:10:16 2002
***************
*** 578,584 ****
  		return 0;

  	shrink_dcache_memory(priority, gfp_mask);
! 	shrink_icache_memory(priority, gfp_mask);
  #ifdef CONFIG_QUOTA
  	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
  #endif
--- 578,584 ----
  		return 0;

  	shrink_dcache_memory(priority, gfp_mask);
! 	shrink_icache_memory(1, gfp_mask);
  #ifdef CONFIG_QUOTA
  	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
  #endif
|
From: Hanna L. <ha...@us...> - 2002-02-28 00:45:09
|
--On Wednesday, February 27, 2002 11:45:15 -0800 Andrew Morton <ak...@zi...> wrote:
> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

I'll be glad to run this again with your patch. Also, John Hawkes has an even bigger system and keeps hitting lru_list_lock too.

> I have a concern about the lockmeter results. Lockmeter appears
> to be measuring lock frequency and hold times and contention. But
> is it measuring the cost of the cacheline transfers?

This has come up a few times on lse-tech. Lockmeter doesn't measure cacheline hits/misses/bouncing. However, someone said kernprof could be used to access the performance registers on the Pentium chip to get this info. I don't know anyone who has tried that, though.

I am working on a patch to decrease cacheline bouncing, and it would be great to see some specific results. Is anyone working on a tool that could measure cache hits/misses/bouncing?

Hanna
|
From: Ravikiran G T. <ki...@in...> - 2002-02-28 08:36:55
|
On Wed, Feb 27, 2002 at 11:57:57AM -0800, Hanna Linder wrote:
> > I have a concern about the lockmeter results. Lockmeter appears
> > to be measuring lock frequency and hold times and contention. But
> > is it measuring the cost of the cacheline transfers?
>
> This has come up a few times on lse-tech. Lockmeter doesnt
> measure cacheline hits/misses/bouncing. However, someone said
> kernprof could be used to access performance registers on the Pentium
> chip to get this info. I don't know anyone who has tried that though.

I have tried using kernprof. What kernprof can do is list out the routine-wise performance event counts for a given workload. I use it with benchmark runs, with a script to start and stop profiling and run the tests in between, something like:

kernprof -r
kernprof -b -c all -d pmc -a 0x0020024
urtesttobeexecuted
kernprof -e

"0x0020024" above is L2_LINES_IN for the P6 family.

I've used this to measure cache effects with SMP-friendly counters, but for lockmetering this would not be as good as having cache event measurements built into lockmeter.

-Kiran
|
From: John H. <ha...@sg...> - 2002-02-28 16:54:38
|
From: "Ravikiran G Thirumalai" <ki...@in...>
...
> I've used this to measure cache effects with smp friendly counters,
> But for lockmetering, this would not be as good as having cache event
> measurements built into lockmeter .......

Forgive my obtuseness, but what "Aha!" higher level of understanding do you believe you would get if Lockmeter showed you quantitative cache event data for spinlocks? It seems to me that benchmark performance is the ultimate metric, and that cache event data would be underlying evidence of why you did (or did not) see an improvement. For example: you move a spinlock to its own isolated cacheline, and Lockmeter shows you lower Utilization and Contention numbers (because the hold time is lower) and, hopefully, fewer CPU cycles wasted in waits. Having a quantitative cache-miss count is icing on the cake, isn't it? What does the precise cache-miss count really tell you, above and beyond what all the other numbers tell you?

Also, I'll remind you that the current Lockmeter doesn't count the time consumed by the initial spin_trylock() as part of the hold time or part of the wait time, and it seems to me that's where the first (and perhaps only) cache miss occurs (and triggers any cache-coherency cacheblock writeback for another CPU). Once one or more CPUs are engaged in a wait loop, cacheblock events are not the problem. Contention is the problem, and the "thundering herd" chaos at spin_unlock() is the problem.

In my opinion, the more interesting "Aha!" cache event data comes from read-write locks, not from global spinlocks. For example, the read_lock(&files->file_lock) in fget() can be a Silent Killer -- at least silent to Lockmeter. At least one of the CPU scheduler benchmarks ("chat" and "reflex" come to mind) uses clone() and produces children that share a files_struct. Lockmeter shows you a massive number of uncontended read_lock() events for that rwlock; you might not even notice the Lockstat output line. What Lockmeter doesn't show you is the cacheblock ping-ponging of the file_lock, since each seemingly innocuous read_lock() dirties the reader count. It's not until you run kernprof that you see the huge waste of CPU cycles. In the case of my 30p NUMA system, the cacheblock ping-pongs were the big bottleneck for those benchmarks, not the global runqueue_lock.

John Hawkes
From: Martin J. B. <Mar...@us...> - 2002-02-28 01:32:43
|
> inode_lock hold times are a problem for other reasons. Leaving this
> unfixed makes the preemptible kernel rather pointless. An ideal
> fix would be to release inodes based on VM pressure against their backing
> page. But I don't think anyone's started looking at inode_lock yet.
>
> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

Seeing as people seem to be interested ... there are some big holders of the BKL around too - do_exit shows up badly (50ms in the data Hanna posted, and I've seen that a lot before). I've seen sync_old_buffers hold the BKL for 64ms on an 8-way Specweb99 run (22Gb of RAM?) (though this was on an older 2.4 kernel, and might be fixed by now).

M.
|