|
From: David <dg...@iu...> - 2006-09-18 01:13:28
|
Any clues on how much repeatability we should expect from callgrind results? We intend to run callgrind to detect efficiency regressions in the execution of (deterministic) algorithms. But when we executed the same binary several times, the results we got for the tested methods were different. I expected callgrind/valgrind to simulate the cache and the system calls so that the instrumentation would always give the same results for successive executions of the same compilation of the same program.

For some simple test cases (math ops + just a few cache misses) the results seem pretty repeatable. But when dealing with real test cases the results vary (random speedups within about +-2%). Some test cases' variation may be due to system calls for file access and memory handling. But some other test cases have none. So my only suspicion is that the initial cache state is not deterministic.

Note: This is in the context of developing Efficiency Guardian, an efficiency testing framework.
See https://sourceforge.net/projects/efficiencyguard

--
David García Garzón
(Work) dgarcia at iua dot upf anotherdot es
(Home) vokimon at telefonica adot net
http://www.iua.upf.edu/~dgarcia
|
From: Julian S. <js...@ac...> - 2006-09-18 09:42:35
|
> vary (getting random speedups among +-2%). Some test cases variation may be
> due to system calls for file access and memory handling. But some others
> test cases don't. So my only suspicious is that the cache initial state is
> not deterministic.

Please send a demo program so we can reproduce this problem.

J
|
From: Josef W. <Jos...@gm...> - 2006-09-18 15:23:28
|
Hi,

On Monday 18 September 2006 03:06, David García Garzón wrote:
> (getting random speedups among +-2%).

2% indeed seems quite a lot. But as you have the profiles, you can compare where this difference is coming from. It would be good to know the reason for these small changes.

Is this 2% for cache events or for instruction fetches, too? Does cachegrind show similar differences? I ask because the cache simulator is heavily based on cachegrind (although you cannot directly compare the two).

> Some test cases variation may be due to
> system calls

Yes, I would suspect system calls, too.

> So my only suspicious is that the cache initial state is not
> deterministic.

For sure, this would be a bug. The simulation should always start with an empty cache.

Josef
|
From: David G. G. <dg...@iu...> - 2006-09-18 16:03:27
|
On Monday, 18 September 2006 at 11:42, Julian Seward wrote:
> > vary (getting random speedups among +-2%). Some test cases variation may
> > be due to system calls for file access and memory handling. But some
> > others test cases don't. So my only suspicious is that the cache initial
> > state is not deterministic.
>
> Please send a demo program so we can reproduce this problem.

Do your words mean that unrepeatable results are an anomaly? I was just asking whether we were right to expect repeatable callgrind results or not. If not, we should accept that and offer some kind of threshold solution; but if they are repeatable, at least at some level, we'll look closer at our code.

The program that gives us variations is too big to be sent (1200 CLAM [1] unit tests). We have a small program we used for testing the framework that is not giving us such variations; maybe because it is too simple. We are about to build a mid-point case, which I will send to the list as soon as it is available.

[1] http://clam.iua.upf.edu

Thanks for the reply.

--
David García Garzón <dav...@re...>
Phone: 034 93 542 21 99
Music Technology Group, Institut Universitari de l'Audiovisual
Universitat Pompeu Fabra
|
From: Julian S. <js...@ac...> - 2006-09-18 16:59:40
|
On Monday 18 September 2006 16:37, David Garcia Garzon wrote:
> On Monday, 18 September 2006 at 11:42, Julian Seward wrote:
> > > vary (getting random speedups among +-2%). Some test cases variation
> > > may be due to system calls for file access and memory handling. But
> > > some others test cases don't. So my only suspicious is that the cache
> > > initial state is not deterministic.
> >
> > Please send a demo program so we can reproduce this problem.
>
> Do your words mean that unrepeatable results are an anomaly?

Er, sorry, a badly thought out reply on my part. What I believe is: if the program you are running is genuinely deterministic, then results should be repeatable. I think Josef W (callgrind's author) also believes that.

However, it's amazing how many ways there are for a program not to follow the exact same path every time. Any variation in syscall args/results, signal delivery, and thread scheduling will cause divergence.

> I just was asking whether we were right to expect repeatable callgrind
> results or not.

If your program (including the dynamic linker crud that runs underneath it) is completely repeatable, and if callgrind is not doing any time-based profiling [which is something Josef can verify], and your program has no threads and no signals, then yes. Given the above, I suspect the answer for any non-toy program is 'no'.

J
|
From: Josef W. <Jos...@gm...> - 2006-09-18 17:17:55
|
On Monday 18 September 2006 18:59, Julian Seward wrote:
> > > Please send a demo program so we can reproduce this problem.
> >
> > Do your words mean that unrepeatable results are an anomaly?
>
> Er, sorry, badly thought out reply on my behalf. What I believe is, if the
> program you are running is genuinely deterministic, then results should be
> repeatable. I think Josef W (callgrind's author) also believes that.

Yes.

> However, it's amazing how many ways there are for a program not to follow
> the exact same path every time. Any variation in syscall args/results,
> signal delivery and thread scheduling will cause divergence.

Actually, I forgot about signals. BTW, when can VG's scheduler cause divergence? I was under the impression that the scheduler decisions are predictable.

> > I just was asking whether we were right to expect repeatable callgrind
> > results or not.
>
> If your program (including the dynamic linker crud that runs underneath
> it) is completely repeatable, and if callgrind is not doing any time based
> profiling [which is something Josef can verify], and your program has no
> threads and no signals, then yes.

Unless you are using "--collect-systime=yes" (which probably nobody does?), nothing in callgrind is time-based (same as with cachegrind).

Josef
|
From: David <dg...@iu...> - 2006-09-20 09:04:46
|
On Monday 18 September 2006 17:23, Josef Weidendorfer wrote:
> Hi,
>
> On Monday 18 September 2006 03:06, David García Garzón wrote:
> > (getting random speedups among +-2%).
>
> 2% indeed seems quite much.
>
> But as you get the profiles, you can compare where this difference
> is coming from. It would be good to known about the reason for these
> small changes.
>
> Is this 2% for cache events or for instructions fetches, too?

Yes, the differences come both in cache misses and in instruction fetches, so it seems that it really executes more code. My suspicions are pointing to memory allocation. Let me explore this more deeply.

> Does cachegrind show similar differences? I ask because the cache simulator
> is heavy based on cachegrind (although you can not directly compare them,
> though).

Yes, I tested that in the first place because I was not confident in our callgrind output parser, but kcachegrind told us just the same numbers we were getting.

> > Some test cases variation may be due to
> > system calls
>
> Yes, I would suspect system calls, too.

Indeed, I suspect that even code outside the testing umbrella can modify the cache and system state, so that tests executed after it would get different results. This would be a fatal bullet for our project!

> > So my only suspicious is that the cache initial state is not
> > deterministic.
>
> For sure, this would be a bug. The simulation always should start
> with an empty cache.

As I explained above, I think a non-deterministic initial cache state is not even needed to get unrepeatable results. As all tests run in the same process, just by having one test (or even code outside any test) that executes differently, the remaining tests will change their results. :-( It would be nice if I could reset the cache state in some way.

In summary, my main worries now are:
- Are glib allocations unrepeatable? If so, could we give them a fixed cost?
- Test interdependency -> Any way of resetting the simulated machine?

--
David García Garzón
(Work) dgarcia at iua dot upf anotherdot es
(Home) vokimon at telefonica adot net
http://www.iua.upf.edu/~dgarcia
|
From: Josef W. <Jos...@gm...> - 2006-09-20 10:01:12
|
On Wednesday 20 September 2006 11:03, David García Garzón wrote:
> > Does cachegrind show similar differences? I ask because the cache simulator
> > is heavy based on cachegrind (although you can not directly compare them,
> > though).
>
> Yes, i tested that on first place because i was not confident on our callgrind
> output parser, but kcachegrind told us just the same numbers we were getting.

Sorry, I meant: do the results of "valgrind --tool=cachegrind ..." show similar differences to callgrind's?

> > > Some test cases variation may be due to
> > > system calls
> >
> > Yes, I would suspect system calls, too.
>
> Indeed, i suspect that even code outside the testing umbrella

I do not understand. What code are you talking about here? And what is your "testing umbrella"? Are we talking about repeatability of whole profile runs (i.e. from process start to termination), or do you want to have repeatability inside parts of such a run?

> can modify cache
> and system status, so that tests executed after that would get differences.

So you are talking about multiple tests you do in a row inside one run? Then, of course, you have dependencies among the tests because of e.g. the state of the cache simulator. However, you have the same dependency in a real run on real hardware.

> As i explain above, i think that there an non-deterministic initial cache
> status is not needed to get unrepeatable results. As all tests run on the
> same process,

OK, so that is your problem. Why can't you do one run per test? (I thought from the beginning of this discussion that this was the case; and Julian most probably did, too.)

> just by having one test (or even code outside any test) that
> executes differently, remaining tests will change their results. :-( It would
> be nice if i could reset the cache status in some way.

The cache state is cleared every time the instrumentation mode is changed. When code was run with the simulator switched off for some time and later switched on again, it simply does not make sense to use the arbitrary, old cache state. So, you could put the following between tests:

#include <valgrind/callgrind.h>
...
int main()
{
    ...
    <TEST1>
    CALLGRIND_STOP_INSTRUMENTATION;
    CALLGRIND_START_INSTRUMENTATION;
    <TEST2>
    ...
}

However, a clean cache state will lead to a lot of cold misses afterwards. OTOH, if this accounts for 2% of your results, the test is probably much too small/short. I would say that for a performance test, such startup costs should hide in the noise.

> In summary, my main worries now are:
> - Are glib allocations unrepeatable?
> If so, could we give them a fixed cost?

What do you mean here? Whether calls to malloc() always give the same address range? What is the cost of a malloc() for you?
>
> - Tests interdependency -> Any way of resetting the simulated machine?

See above.

Josef
|
|
From: Nicholas N. <nj...@cs...> - 2006-09-20 15:11:44
|
On Wed, 20 Sep 2006, David García Garzón wrote:
>> Is this 2% for cache events or for instructions fetches, too?
>
> Yes, differences comes both on cache failures and instructions fetch, so, it
> seems that it really executes more code. My suspicions are getting into
> memory allocation. Let me explore this deeply.

My experience with Cachegrind is that minor variations are extremely common. If the slightest thing changes (e.g. the length of the program command) you'll get different results. Plus non-determinism can creep in in many ways, as Julian said. But Cachegrind and Callgrind certainly start with an empty simulated cache each time they're invoked.

Nick