From: John H. <ha...@en...> - 2001-03-07 18:03:13
From: "Christoph Hellwig" <hc...@ns...> > On Wed, Mar 07, 2001 at 12:46:51PM -0500, Hubertus Franke wrote: > > > > Not necessarily. > > > > This is a question of false sharing. > > We need to put data together that is rarely shared with other CPUs and > > don't intermangle it with > > data that is shared. > > Correct. But IMHO it's better to archive this with per-cpu arrays that > contain all kinds of this per-cpu stuff instead of having small arrays > with excessiv padding. Yes. And no. Yes, we need to avoid excessive padding, especially when padding is done using longer L2, L3, or L4 cacheline lengths. And we need to be careful about what gets packed with what. Sometimes the tradeoffs aren't obvious. For example, "struct rq_data" is a per-cpu structure, but it isn't a *private* per-cpu structure. Each cpu will, on occasion, look at every other cpu's rq_data (e.g., nt_running). When I analyzed the mips64 32p NUMA behavior, I discovered that it didn't make much difference if rq_data was padded or not. In fact, with some workloads I saw a performance *improvement* when rq_data was *not* padded, especially when it was not padded to 128 L2 cacheline bytes. Why? I concluded that when a cpu wanted to look at every other cpu's rq_data.nt_running, it suffered fewer L2 cache misses when it pulled in several rq_data elements per miss. That seemed to outweight the downside of increased cacheblock ping-pongs when the array elements were sharing an L2 cacheblock. (I still think the right way to go is to pad rq_data, just out of general principles.) John Hawkes ha...@en... |