From: Hanna L. <ha...@us...> - 2002-02-27 18:18:50
|
Congratulations to everyone working to reduce contention on the Big Kernel Lock. According to this benchmark your work has paid off! John Hawkes of SGI asked me to beta-test his 2.5.5 version of lockmeter last night (that's what I get for asking when it would be out). The results were interesting enough to post.

All three runs were on an 8-way SMP system running dbench with up to 16 clients, 10 times each. The results are at http://lse.sf.net/locking . Throughput numbers are not included yet; I need to rerun dbench without lockmeter to get accurate throughput results.

Read down the Con(tention) column. *TOTAL* is for the whole system; kernel_flag is for every function holding the BKL, combined.

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

2.4.17:
        13.7%  2.2us(  43ms)   31us(  43ms)(20.0%) 232292367 86.3% 13.7% 0.00% *TOTAL*
 33.9%  40.3%   11us(  43ms)   51us(  43ms)( 8.2%)  19725127 59.7% 40.3%    0% kernel_flag

2.5.3:
        11.1%  1.0us(  21ms)  8.2us(  18ms)( 3.8%) 738953957 88.9% 11.1% 0.00% *TOTAL*
 10.4%  22.6%  8.3us(  21ms)   23us(  18ms)(0.81%)  27982565 77.4% 22.6%    0% kernel_flag

2.5.5:
         8.6%  1.6us( 100ms)   30us(  86ms)( 9.4%) 783373441 91.4%  8.6% 0.00% *TOTAL*
  1.2%  0.33%  2.5us(  50ms) 1167us(  43ms)(0.23%)  12793605 99.7% 0.33%    0% kernel_flag

Hanna Linder
IBM Linux Technology Center
ha...@us...
|
From: Martin J. B. <Mar...@us...> - 2002-02-27 18:35:03
|
> 2.5.5:
>
>  8.6%  1.6us( 100ms)   30us(  86ms)( 9.4%) 783373441 91.4%  8.6% 0.00% *TOTAL*

Whilst it's great to see BKL contention going down, this:

 0.16% 0.25%  0.7us( 100ms)  252us(  86ms)(0.02%)  6077746 99.8% 0.25%    0% inode_lock
 0.03% 0.11%  0.6us(  55us)  2.1us( 9.9us)(0.00%)  1322338 99.9% 0.11%    0% __mark_inode_dirty+0x48
 0.00%    0%  0.7us( 5.9us)    0us                     391  100%    0%    0% get_new_inode+0x28
 0.00% 0.22%  2.5us(  50us)  495us(  28ms)(0.00%)    50397 99.8% 0.22%    0% iget4+0x3c
 0.03% 0.28%  0.6us(  26us)   30us(  58ms)(0.00%)  1322080 99.7% 0.28%    0% insert_inode_hash+0x44
 0.04% 0.29%  0.5us(  39us)  332us(  86ms)(0.01%)  2059365 99.7% 0.29%    0% iput+0x68
 0.03% 0.30%  0.7us(  57us)  422us(  77ms)(0.01%)  1323036 99.7% 0.30%    0% new_inode+0x1c
 0.03%  8.3%   63ms( 100ms)  3.8us( 3.8us)(0.00%)       12 91.7%  8.3%    0% prune_icache+0x1c
 0.00%    0%  1.0us( 5.2us)    0us                      34  100%    0%    0% sync_unlocked_inodes+0x10
 0.00%    0%  1.0us( 2.4us)    0us                      93  100%    0%    0% sync_unlocked_inodes+0x110

looks a little distressing - the hold times on inode_lock by prune_icache look bad in terms of latency (contention is still low, but people are still waiting on it for a very long time). Is this a transient thing, or do people think this is going to be a problem?

Martin.
|
From: Andrew M. <ak...@zi...> - 2002-02-27 21:16:59
|
"Martin J. Bligh" wrote:
>
> ...
> looks a little distressing - the hold times on inode_lock by prune_icache
> look bad in terms of latency (contention is still low, but people are still
> waiting on it for a very long time). Is this a transient thing, or do people
> think this is going to be a problem?

inode_lock hold times are a problem for other reasons. Leaving this unfixed makes the preemptible kernel rather pointless. An ideal fix would be to release inodes based on VM pressure against their backing page. But I don't think anyone's started looking at inode_lock yet.

The big one is lru_list_lock, of course. I'll be releasing code in the next couple of days which should take that off the map. Testing would be appreciated.

I have a concern about the lockmeter results. Lockmeter appears to be measuring lock frequency, hold times, and contention. But is it measuring the cost of the cacheline transfers? For example, I expect that with delayed allocation and radix-tree pagecache, one of the major remaining bottlenecks will be ownership of the superblock semaphore's cacheline. Is this measurable? (Actually, we may then be at the point where copy_from_user costs dominate.)

-
|
From: Andrew M. <ak...@zi...> - 2002-02-27 20:16:52
|
"Martin J. Bligh" wrote:
>
> > inode_lock hold times are a problem for other reasons. Leaving this
> > unfixed makes the preemptible kernel rather pointless. An ideal
> > fix would be to release inodes based on VM pressure against their backing
> > page. But I don't think anyone's started looking at inode_lock yet.
> >
> > The big one is lru_list_lock, of course. I'll be releasing code in
> > the next couple of days which should take that off the map. Testing
> > would be appreciated.
>
> Seeing as people seem to be interested ... there are some big holders
> of BKL around too - do_exit shows up badly (50ms in the data Hanna
> posted, and I've seen that a lot before).

That'll be where exit() takes down the task's address space: zap_page_range(). That's a nasty one.

> I've seen sync_old_buffers
> hold the BKL for 64ms on an 8way Specweb99 run (22Gb of RAM?)
> (though this was on an older 2.4 kernel, and might be fixed by now).

That will still be there - presumably it's where we walk the per-superblock dirty inode list. Hmm. For lru_list_lock we can do an end-around by not using buffers at all.

The other big one is truncate_inode_pages(). With ratcache it's not a contention problem, but it is a latency problem. I expect that we can drastically reduce the lock hold time there by simply snipping the wholly-truncated pages out of the tree, and thus privatising them so they can be disposed of outside any locking.

-
|
From: Alexander V. <vi...@ma...> - 2002-02-27 21:48:17
|
On Wed, 27 Feb 2002, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> > ...
> > looks a little distressing - the hold times on inode_lock by prune_icache
> > look bad in terms of latency (contention is still low, but people are still
> > waiting on it for a very long time). Is this a transient thing, or do people
> > think this is going to be a problem?
>
> inode_lock hold times are a problem for other reasons.

ed mm/vmscan.c <<EOF
/shrink_icache_memory/s/priority/1/
w
q
EOF

and repeat the tests. Unreferenced inodes == useless inodes. Aging is already taken care of in dcache, and anything that had fallen through is fair game.
|
From: Hanna L. <ha...@us...> - 2002-02-27 23:12:35
|
--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <vi...@ma...> wrote:
>
>> > looks a little distressing - the hold times on inode_lock by prune_icache
>> > look bad in terms of latency (contention is still low, but people are still
>> > waiting on it for a very long time). Is this a transient thing, or do people
>> > think this is going to be a problem?
>>
>> inode_lock hold times are a problem for other reasons.
>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

I applied your patch and reran the tests. Looks like you solved the problem:

SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME
         7.1%  0.7us(  19ms)  7.7us(  17ms)( 2.6%) 779799309 92.9%  7.1% 0.00% *TOTAL*
 0.16%  0.29%  0.6us(  91us)  2.2us(  46us)(0.00%)   5495642 99.7% 0.29%    0% inode_lock
 0.90%  0.47%  1.4us(  19ms)  280us(  17ms)(0.10%)  12681192 99.5% 0.47%    0% kernel_flag

The results are again stored at http://lse.sf.net/locking .

Hanna
|
From: Hanna L. <ha...@us...> - 2002-02-27 23:30:59
|
--On Wednesday, February 27, 2002 16:48:07 -0500 Alexander Viro <vi...@ma...> wrote:
>
> ed mm/vmscan.c <<EOF
> /shrink_icache_memory/s/priority/1/
> w
> q
> EOF
>
> and repeat the tests. Unreferenced inodes == useless inodes. Aging is
> already taken care of in dcache and anything that had fallen through
> is fair game.
>

FYI: The patch does this:

*** vmscan.c.orig	Wed Feb 27 14:09:49 2002
--- vmscan.c	Wed Feb 27 14:10:16 2002
***************
*** 578,584 ****
  		return 0;

  	shrink_dcache_memory(priority, gfp_mask);
! 	shrink_icache_memory(priority, gfp_mask);
  #ifdef CONFIG_QUOTA
  	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
  #endif
--- 578,584 ----
  		return 0;

  	shrink_dcache_memory(priority, gfp_mask);
! 	shrink_icache_memory(1, gfp_mask);
  #ifdef CONFIG_QUOTA
  	shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
  #endif
|
From: Hanna L. <ha...@us...> - 2002-02-28 00:45:09
|
--On Wednesday, February 27, 2002 11:45:15 -0800 Andrew Morton <ak...@zi...> wrote:
> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

I'll be glad to run this again with your patch. Also, John Hawkes has an even bigger system and keeps hitting lru_list_lock too.

> I have a concern about the lockmeter results. Lockmeter appears
> to be measuring lock frequency and hold times and contention. But
> is it measuring the cost of the cacheline transfers?

This has come up a few times on lse-tech. Lockmeter doesn't measure cacheline hits/misses/bouncing. However, someone said kernprof could be used to access the performance registers on the Pentium chip to get this info. I don't know anyone who has tried that, though.

I am working on a patch to decrease cacheline bouncing, and it would be great to see some specific results. Is anyone working on a tool that could measure cache hits/misses/bouncing?

Hanna
|
From: Ravikiran G T. <ki...@in...> - 2002-02-28 08:36:55
|
On Wed, Feb 27, 2002 at 11:57:57AM -0800, Hanna Linder wrote:
> > I have a concern about the lockmeter results. Lockmeter appears
> > to be measuring lock frequency and hold times and contention. But
> > is it measuring the cost of the cacheline transfers?
>
> This has come up a few times on lse-tech. Lockmeter doesnt
> measure cacheline hits/misses/bouncing. However, someone said
> kernprof could be used to access performance registers on the Pentium
> chip to get this info. I don't know anyone who has tried that though.

I have tried using kernprof. What kernprof can do is list out the routine-wise performance event counts for a given workload. I use it with benchmark runs, with a script to start and stop profiling and run the tests in between, something like:

kernprof -r
kernprof -b -c all -d pmc -a 0x0020024
urtesttobeexecuted
kernprof -e

"0x0020024" above is L2_LINES_IN for the P6 family.

I've used this to measure cache effects with SMP-friendly counters, but for lockmetering this would not be as good as having cache event measurements built into lockmeter.

-Kiran
|
From: John H. <ha...@sg...> - 2002-02-28 16:54:38
|
From: "Ravikiran G Thirumalai" <ki...@in...>
...
> I've used this to measure cache effects with smp friendly counters,
> But for lockmetering, this would not be as good as having cache event
> measurements built into lockmeter .......

Forgive my obtuseness, but what "Aha!" higher level of understanding do you believe you would get if Lockmeter showed you quantitative cache event data for spinlocks? It seems to me that benchmark performance is the ultimate metric, and that cache event data would be underlying evidence of why you did (or did not) see an improvement. For example: you move a spinlock to its own isolated cacheline, and Lockmeter shows you lower Utilization and Contention numbers (because the hold time is lower) and, hopefully, fewer CPU cycles wasted in waits. Having a quantitative cache-miss count is icing on the cake, isn't it? What does the precise cache-miss count really tell you, above and beyond what all the other numbers tell you?

Also, I'll remind you that the current Lockmeter doesn't count the time consumed by the initial spin_trylock() as part of the hold time or part of the wait time, and it seems to me that's where the first (and perhaps only) cache miss occurs (and triggers any cache-coherency cacheblock writeback for another CPU). Once one or more CPUs are engaged in a wait loop, cacheblock events are not the problem. Contention is the problem, and the "thundering herd" chaos at spin_unlock() is the problem.

In my opinion, the more interesting "Aha!" cache event data comes from read-write locks, not from global spinlocks. For example, the read_lock(&files->file_lock) in fget() can be a Silent Killer -- at least silent to Lockmeter. At least one of the CPU scheduler benchmarks ("chat" and "reflex" come to mind) uses clone() and produces children that share a files_struct. Lockmeter shows you a massive number of uncontended read_lock() events for that rwlock; you might not even notice the Lockstat output line. What Lockmeter doesn't show you is the cacheblock ping-ponging of the file_lock, since each seemingly innocuous read_lock() dirties the reader count. It's not until you run kernprof that you see the huge waste of CPU cycles. In the case of my 30p NUMA system, the cacheblock ping-pongs were the big bottleneck for those benchmarks, not the global runqueue_lock.

John Hawkes
From: Martin J. B. <Mar...@us...> - 2002-02-28 01:32:43
|
> inode_lock hold times are a problem for other reasons. Leaving this
> unfixed makes the preemptible kernel rather pointless. An ideal
> fix would be to release inodes based on VM pressure against their backing
> page. But I don't think anyone's started looking at inode_lock yet.
>
> The big one is lru_list_lock, of course. I'll be releasing code in
> the next couple of days which should take that off the map. Testing
> would be appreciated.

Seeing as people seem to be interested ... there are some big holders of the BKL around too - do_exit shows up badly (50ms in the data Hanna posted, and I've seen that a lot before). I've seen sync_old_buffers hold the BKL for 64ms on an 8-way Specweb99 run (22Gb of RAM?) (though this was on an older 2.4 kernel, and might be fixed by now).

M.
|