From: "Ravikiran G Thirumalai" <kiran@...>
> I've used this to measure cache effects with smp friendly counters,
> But for lockmetering, this would not be as good as having cache event
> measurements built into lockmeter .......

Forgive my obtuseness, but what "Aha!" higher level of understanding do
you believe you would get if Lockmeter showed you quantitative cache
event data for spinlocks? It seems to me that benchmark performance is
the ultimate metric, and that cache event data would be underlying
evidence of why you did (or did not) see an improvement. For example,
you move a spinlock to its own isolated cacheline, and Lockmeter shows
you lower Utilization and Contention numbers (because the hold-time is
lower) and, hopefully, fewer wasted CPU cycles doing waits. And having
a quantitative cachemiss count is icing on the cake, isn't it? What
does the precise cachemiss count really tell you, above and beyond what
all the other numbers tell you?

Also, I'll remind you that the current Lockmeter doesn't count the time
consumed by the initial spin_trylock() as part of the hold-time or part
of the wait-time, and it seems to me that's where the first (and perhaps
only) cachemiss occurs (and triggers any cache-concurrency cacheblock
writeback for another CPU). Once one or more CPUs are engaged in a wait
loop, then it seems to me that cacheblock events are not the problem.
Contention is the problem. And the "thundering herd" chaos at
spin_unlock() is the problem.

In my opinion, the more interesting "Aha!" cache event data comes from
read-write locks, not from global spinlocks. For example, the
read_lock(&files->file_lock) in fget() can be a Silent Killer -- at
least silent to Lockmeter. At least one of those CPU scheduler
benchmarks ("chat" and "reflex" come to mind) uses clone() and produces
children that share a files_struct. Lockmeter shows you a massive
number of uncontended read_lock() events for that rwlock. You might not
even notice the Lockstat output line. What Lockmeter doesn't show you
is the cacheblock ping-ponging of the file_lock, since each seemingly
innocuous read_lock() dirties the reader count. It's not until you run
Kernprof that you see the huge waste of CPU cycles. In the case of my
30p NUMA system, the cacheblock ping-pongs were the big bottleneck for
those benchmarks, not the global runqueue_lock.