|
From: Ulrich D. <dr...@re...> - 2006-11-07 07:12:54
Attachments:
d-valgrind-cg_sim
|
In case a data set spans two cache lines and the second cache line is at index zero (i.e., the first is at the highest index), the tag used for the second cache line is wrong. It is one higher than the tag for the first; otherwise no wrap-around would happen. Patch is attached.

I've also added some optimizations. Since you already decided to keep L.sets - 1 around in a separate variable, you might as well use it in all places. Once you do this there is no reason to keep L.sets around, so I went on and removed it.

Which brings on the next step: now the cache_t2 structure consists of 8 words and the char array. If you rearrange the struct to move the tags pointer before the desc_line element, all commonly used elements are in the first 32 or 64 bytes (for platforms with 32- or 64-byte cache lines, respectively). If cache_t2 is then aligned to this value, only one cache line is needed for each of L2, I1, and D1.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
|
From: Josef W. <Jos...@gm...> - 2006-11-07 17:33:40
|
On Tuesday 07 November 2006 08:12, Ulrich Drepper wrote:
> In case a data set spans two cache lines and the second cache line is
> at index zero (i.e., the first at the highest index), the tag used for
> the second cache line is wrong. It is one higher than the tag for the
> first, otherwise no wrap-around would happen. Patch is attached.

Indeed. And thanks; this is buggy in the callgrind version of the simulator, too. Wow. How did you find this?

> Which brings on the next step: now the cache_t2 structure consists of 8
> words and the char array. If you rearrange the struct to move the tags
> pointer before the desc_line element all commonly used elements are in
> the first 32 or 64 bytes (for 32 or 64 byte platforms respectively). If
> now cache_t2 is aligned for this value there is only one cache line
> needed for L2, I1, D1.

cache_t2 is certainly present in L1 for the whole run, so this is not about reducing latency in a hot path but about freeing a cache line in L1 to make room for other uses. The patch is fine; I just wonder whether this really makes any difference in practice. With a 32 kB L1 cache you have 512 lines, so this gives you 0.2% more space.

Josef
|
From: Nicholas N. <nj...@cs...> - 2006-11-21 04:38:12
|
On Mon, 6 Nov 2006, Ulrich Drepper wrote:
> In case a data set spreads two cache lines and the second cache line is at
> index zero (i.e., the first at the highest index), the tag used for the
> second cache line is wrong. It is one higher than the tag for the first,
> otherwise no wrap-around would happen. Patch is attached.
>
>   block2: \
>      set = &(L.tags[set2 << L.assoc_bits]); \
> +    if (set2 == 0) \
> +      ++tag;

I think a clearer thing to do in the two-set case is to compute two tag values, one for 'a' and one for 'a+size-1', just as is done for the sets. The first tag would be used when checking the first set, the second tag for the second set. I think the end effect is the same.

> I've also added some optimizations. Since you already decided to keep L.sets
> - 1 in a separate variable around you might as well use it in all places.
> Once you do this there is no reason to keep L.sets around. So I went on
> removing it.
>
> Which brings on the next step: now the cache_t2 structure consists of 8 words
> and the char array. If you rearrange the struct to move the tags pointer
> before the desc_line element all commonly used elements are in the first 32
> or 64 bytes (for 32 or 64 byte platforms respectively). If now cache_t2 is
> aligned for this value there is only one cache line needed for L2, I1, D1.

Have you measured the effect of these changes?

Nick
|
From: Ulrich D. <dr...@re...> - 2006-11-21 08:48:17
|
Nicholas Nethercote wrote:
> I think a clearer thing to do in the two-set case is to compute two tag
> values, one for 'a' and one for 'a+size-1', just as is done for the
> sets. The first tag would be used when checking the first set, the
> second tag for the second set. I think the end effect is the same.

Yes, but it's slower. And this is in a hot path. My patch is as fast as you can get it.

>> Which brings on the next step: now the cache_t2 structure consists of
>> 8 words and the char array. If you rearrange the struct to move the
>> tags pointer before the desc_line element all commonly used elements
>> are in the first 32 or 64 bytes (for 32 or 64 byte platforms
>> respectively). If now cache_t2 is aligned for this value there is
>> only one cache line needed for L2, I1, D1.
>
> Have you measured the effect of these changes?

I don't have any numbers anymore, but it was a bit faster. It should be obvious that this is the case: if you only need one cache line instead of two, the L1d in particular is used much more efficiently.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
|
From: Nicholas N. <nj...@cs...> - 2006-11-21 11:28:20
|
On Tue, 21 Nov 2006, Ulrich Drepper wrote:
> Nicholas Nethercote wrote:
>> I think a clearer thing to do in the two-set case is to compute two tag
>> values, one for 'a' and one for 'a+size-1', just as is done for the sets.
>> The first tag would be used when checking the first set, the second tag for
>> the second set. I think the end effect is the same.
>
> Yes, but it's slower. And this in a hot path. My patch is as fast as you
> can get it.

This is not such a hot path; the set1 != set2 case is much less common than the set1 == set2 case. With my suggested approach you can compute the second tag only when it is needed, and there's no extra test.

>>> Which brings on the next step: now the cache_t2 structure consists of 8
>>> words and the char array. If you rearrange the struct to move the tags
>>> pointer before the desc_line element all commonly used elements are in the
>>> first 32 or 64 bytes (for 32 or 64 byte platforms respectively). If now
>>> cache_t2 is aligned for this value there is only one cache line needed for
>>> L2, I1, D1.
>>
>> Have you measured the effect of these changes?
>
> I don't have any numbers anymore. But it was a bit faster. It should be
> obvious that this is the case. If you only need one cache line instead of
> two especially L1d is used much more efficient.

Cache optimisations are sufficiently subtle that I would consider very few of them obvious.

Nick
|
From: Ulrich D. <dr...@re...> - 2006-11-22 08:56:18
|
Nicholas Nethercote wrote:
> This is not such a hot path, the set1 != set2 case is much less common
> than the set1 == set2 case.

True, but it is still far too frequent to be ignored. Even the wrong case (i.e., when the tag needs to be corrected) appeared in measurable amounts: I ran the broken and fixed cachegrind binaries and the introduced errors were measurable. And that's only a fraction of the cases handled in this if branch.

> Cache optimisations are sufficiently subtle that I would consider very
> few of them obvious.

Well, it _is_ obvious that if one version uses two cache lines and the other just one, the second is faster.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
|
From: Nicholas N. <nj...@cs...> - 2006-11-22 11:48:07
|
On Wed, 22 Nov 2006, Ulrich Drepper wrote:
> Nicholas Nethercote wrote:
>> This is not such a hot path, the set1 != set2 case is much less common
>> than the set1 == set2 case.
>
> True, but still far too frequent to be ignored. Even the wrong case (i.e.,
> when the tag needs to be corrected) appeared in measurable amounts. I ran
> the broken and fixed cachegrind binaries and the introduced errors were
> measurable. And that's only a fraction of cases handled in this if branch.

I found the set1 != set2 case to account for 2--3% of all cases, and the tag miscomputation to occur in about 0.05% of all cases. It's certainly worth fixing, but the performance effect of the extra tag computation on this cold path is negligible.

>> Cache optimisations are sufficiently subtle that I would consider very few
>> of them obvious.
>
> Well, it _is_ obvious that if in one version two cache lines are used and in
> the other just one, the second is faster.

What's less obvious is how much faster. I added the extra tag computation in the set1 != set2 case and saw no difference. I then tried the struct rearrangement you suggested and again saw no difference. I used valgrind/perf/bz2.c as the benchmark for these tests.

I've committed a fix for the simulation error. Thanks for pointing this out!

Nick