From: John H. <ha...@en...> - 2001-03-07 18:03:13
From: "Christoph Hellwig" <hc...@ns...> > On Wed, Mar 07, 2001 at 12:46:51PM -0500, Hubertus Franke wrote: > > > > Not necessarily. > > > > This is a question of false sharing. > > We need to put data together that is rarely shared with other CPUs and > > don't intermangle it with > > data that is shared. > > Correct. But IMHO it's better to archive this with per-cpu arrays that > contain all kinds of this per-cpu stuff instead of having small arrays > with excessiv padding. Yes. And no. Yes, we need to avoid excessive padding, especially when padding is done using longer L2, L3, or L4 cacheline lengths. And we need to be careful about what gets packed with what. Sometimes the tradeoffs aren't obvious. For example, "struct rq_data" is a per-cpu structure, but it isn't a *private* per-cpu structure. Each cpu will, on occasion, look at every other cpu's rq_data (e.g., nt_running). When I analyzed the mips64 32p NUMA behavior, I discovered that it didn't make much difference if rq_data was padded or not. In fact, with some workloads I saw a performance *improvement* when rq_data was *not* padded, especially when it was not padded to 128 L2 cacheline bytes. Why? I concluded that when a cpu wanted to look at every other cpu's rq_data.nt_running, it suffered fewer L2 cache misses when it pulled in several rq_data elements per miss. That seemed to outweight the downside of increased cacheblock ping-pongs when the array elements were sharing an L2 cacheblock. (I still think the right way to go is to pad rq_data, just out of general principles.) John Hawkes ha...@en... |