From: Joerg B. <jo...@we...> - 2003-07-28 09:11:43

Dear List,

I am aware that valgrind (with the calltree skin) counts the processor
instructions used to run some code. Is there also a way to count the
CPU cycles?

valgrind is a cool tool for doing optimizations on the source (you
could even hold challenges: "who finds the fastest variant of this
piece of code?"). Now somebody tells me that counting the instructions
might be wrong, because not all instructions take the same number of
cycles. I assume that this is right, but irrelevant from a practical
point of view.

Joerg

From: Nicholas N. <nj...@ca...> - 2003-07-28 09:22:30

On Mon, 28 Jul 2003, Joerg Beyer wrote:

> I am aware that valgrind (with the calltree skin) counts the processor
> instructions used to run some code. Is there also a way to count the
> CPU cycles?

No.

> valgrind is a cool tool for doing optimizations on the source (you
> could even hold challenges: "who finds the fastest variant of this
> piece of code?"). Now somebody tells me that counting the instructions
> might be wrong, because not all instructions take the same number of
> cycles. I assume that this is right, but irrelevant from a practical
> point of view.

On the contrary, it's extremely relevant from a practical point of view.
It's almost impossible to know how long any one instruction will take.
For example, on my machine, a memory read takes 1 cycle if it hits the
L1 cache, 10 cycles in the worst case if it only hits the L2 cache, and
206 cycles in the worst case if it misses both caches. But out-of-order
execution and all the other fancy modern CPU mumbo-jumbo means that
these delays are rarely as bad as the worst case. Then throw in branch
mispredictions and other pipeline stalls... it's a mess.

N
From: Joerg B. <jo...@we...> - 2003-07-28 09:35:11

Nicholas Nethercote wrote:

>> I am aware that valgrind (with the calltree skin) counts the
>> processor instructions used to run some code. Is there also a way to
>> count the CPU cycles?
>
> No.

Would it be good to have such a cycle counter?

> On the contrary, it's extremely relevant from a practical point of
> view. It's almost impossible to know how long any one instruction will
> take. For example, on my machine, a memory read takes 1 cycle if it
> hits the L1 cache, 10 cycles in the worst case if it only hits the L2
> cache, and 206 cycles in the worst case if it misses both caches. But
> out-of-order execution and all the other fancy modern CPU mumbo-jumbo
> means that these delays are rarely as bad as the worst case. Then
> throw in branch mispredictions and other pipeline stalls... it's a
> mess.

OK, so an implementation that does cycle counting would need a table
listing, for every instruction, how many cycles it takes in the
different cache-hit/miss situations. Is this information available
(for all processors)?

Joerg

From: Nicholas N. <nj...@ca...> - 2003-07-28 09:50:41

On Mon, 28 Jul 2003, Joerg Beyer wrote:

> Would it be good to have such a cycle counter?

It would be fantastic, but I don't think it's going to happen.

> > But out-of-order execution and all the other fancy modern CPU
> > mumbo-jumbo means that these delays are rarely as bad as the worst
> > case.
>
> OK, so an implementation that does cycle counting would need a table
> listing, for every instruction, how many cycles it takes in the
> different cache-hit/miss situations.

No, because the _worst-case_ latencies often don't show up if the
processor can find other work to do while waiting for the memory read;
that's why it's so difficult. It heavily depends on how much
instruction-level parallelism (ILP) the CPU is able to extract, which
depends on what the code is doing. If your code has very poor ILP, then
you might be able to make a reasonable estimate of the time taken due
to cache misses and branch mispredictions, but only once you've run the
program and found out the cache miss and branch mispredict rates, which
you can't work out ahead of time.

See www.cl.cam.ac.uk/~njn25/pubs/cache-large-lazy2002.ps.gz for more
(possibly too much) information about this, including examples of
predictions for code that does have very low levels of ILP.

> Is this information available (for all processors)?

I think it depends not only on the processor, but also on the speed of
your RAM, etc., so there's no single answer. For your machine, you can
find out the _worst-case_ cache latencies with Calibrator
(homepages.cwi.nl/~manegold/Calibrator/).

N
From: Joerg B. <jo...@we...> - 2003-07-28 09:58:39

Nicholas Nethercote wrote:

>> Would it be good to have such a cycle counter?
>
> It would be fantastic, but I don't think it's going to happen.
>
> No, because the _worst-case_ latencies often don't show up if the
> processor can find other work to do while waiting for the memory read;
> that's why it's so difficult. It heavily depends on how much
> instruction-level parallelism (ILP) the CPU is able to extract,

Is that already implemented in valgrind? If so, then _only_ the timing
tables and a counter are missing (OK, that is still a lot), right?

> which depends on what the code is doing. If your code has very poor
> ILP, then you might be able to make a reasonable estimate of the time
> taken due to cache misses and branch mispredictions, but only once
> you've run the program and found out the cache miss and branch
> mispredict rates, which you can't work out ahead of time.
>
> See www.cl.cam.ac.uk/~njn25/pubs/cache-large-lazy2002.ps.gz for more
> (possibly too much) information about this, including examples of
> predictions for code that does have very low levels of ILP.
>
>> Is this information available (for all processors)?
>
> I think it depends not only on the processor, but also on the speed of
> your RAM, etc., so there's no single answer. For your machine, you can
> find out the _worst-case_ cache latencies with Calibrator
> (homepages.cwi.nl/~manegold/Calibrator/).

Well, for me, the instruction counter is at least a very good start for
verifying that an optimization in my C++ program really is an
improvement.

Joerg

From: Nicholas N. <nj...@ca...> - 2003-07-28 10:27:58

On Mon, 28 Jul 2003, Joerg Beyer wrote:

> Well, for me, the instruction counter is at least a very good start
> for verifying that an optimization in my C++ program really is an
> improvement.

For gross cycle counting, try Rabbit
(www.scl.ameslab.gov/Projects/Rabbit/menu.html) or Brink/Abyss
(www.eg.bucknell.edu/~bsprunt/emon/brink_abyss/brink_abyss.shtm). I've
used Rabbit; it's very good.

I don't think you're ever going to be able to get reliable cycle counts
out of a Valgrind-based tool (unless that tool reads the performance
counters :)

N
From: Joerg B. <jo...@we...> - 2003-07-28 11:17:38

Nicholas Nethercote wrote:

>> Would it be good to have such a cycle counter?
>
> It would be fantastic, but I don't think it's going to happen.

Would it be reasonable to make some restrictions, e.g. ignore all the
branch-prediction stuff and take only the D- and I-cache into account?
If we could shrink the expectations down to something reachable, then I
would take some time to figure out how such a counter would have to be
implemented. As far as I can see, the first- and second-level cache
logic is already in place (and works fine).

Comments?

Joerg