From: Josef W. <Jos...@gm...> - 2013-04-08 18:39:17
On 08.04.2013 17:58, Niall Douglas wrote:

>> If this is mainly about memory, aren't the events collected by
>> cachegrind/callgrind already enough, and you can calculate what you want
>> in a post-processing step? (KCachegrind does such a thing to come up with
>> a cycle estimation. It would be useful to add that feature also to the
>> *_annotate scripts)
>
> That's exactly the intent. Cachegrind/Callgrind output would simply
> include the host's cache and memory latencies prepended as comments; if
> in XML form, it appears as an additional XML stanza. That way it doesn't
> break any tooling which relies on the output not changing.

Cachegrind/Callgrind do not support XML output for profile data (at least
for now). But it should be quite easy to define a sensible XML format for
the data (not that I am saying we want that - I do not see any benefit).

Both the *_annotate scripts and KCachegrind support arbitrarily named event
types. And as you can see in section 3.1.7.2 of
http://valgrind.org/docs/manual/cl-format.html, the callgrind format allows
you to specify formulas for derived event types. I think this is everything
you need: add a line such as

  event: CycleEst = Ir + 100 DLmr + 100 DLmw

where 100 is your cycle penalty for LL cache misses. Only {K,Q}Cachegrind
currently support such lines, but it should not be too complex to add that
to the *_annotate scripts.

>> What kind of instruction types do you have in mind? How do you get them,
>> from VEX IR or guest machine opcodes?
>
> Thing is, on mobile device hardware you don't really care about arithmetic
> op costs because they're fairly trivial relative to main memory costs.

OK, so there is not much missing.

> [...]

Actually, main memory accesses are slow on Intel too, similar to the
numbers you quote for the mobile chips. However, miss penalties are quite
different for random and streaming access, where hardware prefetchers kick
in.

>> Why does this microbenchmark measurement have to be part of the tool?
>
> I myself wasted a week of my time before the lightbulb switched on because
> the numbers didn't feel right (mea culpa, I should have read the docs more
> closely). I have no issue with having cachegrind/callgrind refuse to run
> without a known good cache config BTW, but that does seem a bit overkill.

Not so sure about that. You said yourself that you wasted much time.
Micro-benchmarks are often very sensitive to what else is going on in the
system. If the system is loaded, results may be way off. It seems better
to run that benchmark at a time controlled by the user.

Josef
From: Niall D. <ndo...@bl...> - 2013-04-08 16:36:19
> On 04/08/2013 05:58 PM, Niall Douglas wrote:
>> Working around the lack of libc math functions and 64 bit integers (I
>> don't know why valgrind disables 64 bit ints, it's a real pain)
>
> 64 bit ints are supported; the "house" types are Long and ULong for the
> signed/unsigned versions respectively. It would be completely impossible
> to support any 64 bit targets without 64 bit int support.

It turns out perhaps I misspoke: last thing Friday evening I got a floating
point exception from (num is a double):

  int dgt = ((long long) num) % 10;

I had assumed that the lack of libc meant a lack of internal double to
64 bit int routines due to valgrind using -fno-builtin. But perhaps I was
just overflowing, so I assumed wrong. Blame Friday afternoon-itis.

Niall
From: Julian S. <js...@ac...> - 2013-04-08 16:13:56
On 04/08/2013 05:58 PM, Niall Douglas wrote:
> Working around the lack of libc math functions and 64 bit integers (I
> don't know why valgrind disables 64 bit ints, it's a real pain)

64 bit ints are supported; the "house" types are Long and ULong for the
signed/unsigned versions respectively. It would be completely impossible to
support any 64 bit targets without 64 bit int support.

J
From: Niall D. <ndo...@bl...> - 2013-04-08 15:58:51
>> I'm developing an enhancement to cachegrind/callgrind's output - an
>> estimated likely execution time log - which can multiply the
>> instruction type counts by their average execution time on the target
>> CPU in order to generate somewhat more realistic profiling results.
>> This would be highly useful to us for ARM targets especially, as these
>> have unusually slow main memory relative to other architectures.
>
> If this is mainly about memory, aren't the events collected by
> cachegrind/callgrind already enough, and you can calculate what you want
> in a post-processing step? (KCachegrind does such a thing to come up with
> a cycle estimation. It would be useful to add that feature also to the
> *_annotate scripts)

That's exactly the intent. Cachegrind/Callgrind output would simply include
the host's cache and memory latencies prepended as comments; if in XML
form, it appears as an additional XML stanza. That way it doesn't break any
tooling which relies on the output not changing.

If, and only if, KCachegrind can cope with unexpected event types being
added, I *may* have it _optionally_ generate an additional event type for
convenience. I'd suppose you'd know better than I here, Josef.

Later this week (I hope!) BlackBerry will publish a new open source library
which detects, using completely generic code, the host's cache
configuration and memory latencies. This library will then be used in a
forthcoming patch to cachegrind.

Working around the lack of libc math functions and 64 bit integers (I
don't know why valgrind disables 64 bit ints, it's a real pain) when
implementing double printf formatting output has been frustrating, but I'm
nearly there.

> Taking the instruction types into account may be useful (e.g. add vs.
> div). I suppose you would add another event type for that to
> callgrind/cachegrind, something like "core cycle estimation"?
>
> What kind of instruction types do you have in mind? How do you get them,
> from VEX IR or guest machine opcodes?
Thing is, on mobile device hardware you don't really care about arithmetic
op costs because they're fairly trivial relative to main memory costs. Let
me quickly explain: a 3.5GHz Intel Ivy Bridge generation computer can fetch
a 64 byte cache line from main memory in about 1000 psec, while an
arithmetic op costs 250 psec, so that's a 4:1 ratio (and lower clocked
Intel CPUs can do 2.5:1). A 1.5GHz Qualcomm Krait does an arithmetic op in
anywhere between 650 psec and 1650 psec, but a 16/32 byte cache line costs
*18000* psec to fetch from main memory. That's an average ratio of about
19:1. If you think that's bad, some of the old Cortex-A9s have average
ratios in the 117:1 region; that's how slow their main memory can be.

However, loading from a Krait's LL cache costs only 5000 psec, roughly
3.5x faster than main memory. If we can figure out how to get the right
data into a Krait's LL cache before it gets used, that makes a huge
difference for us.

Hence, from our perspective, all we really care about right now at least
is LL cache misses. They're the big performance limiter for us, and that
is why this has gained our attention. And cachegrind/callgrind already
exports exactly that info, so it's just a case of doing lots of
multiplication.

>> At the start of cachegrind/callgrind, a cpucacheconfig.xml is loaded
>> in with the CPU's configuration.
>
> So the idea is to have instruction type latencies, cache parameters and
> miss latencies in this file? Cachegrind/callgrind also has simple branch
> prediction, and it would be useful to also have a microbenchmark for
> that ;-)

Sure, cpucacheconfig.xml can be anything you like. If there isn't one, one
is auto-generated for you based on the local host, but after that they're
intentionally totally standalone. As I mentioned, the plan is to embed them
as a comment at the top of cachegrind/callgrind output, and as a stanza in
XML output.
I think we should also be able to supply a script which does useful stuff
with the output, most specifically to generate some XML with various
estimated execution times, possibly with some XHTML output showing
estimated slow code (from the perspective of LL cache misses). We have a
ton of internal performance tools which take XML as input, hence the
priority on XML.

>> The problem, at present, is that VG_(read_millisecond_timer) is the
>> only timing routine I can see. The generic cache configuration
>> detection routine is far more accurate if it is given microsecond or
>> better timing. So would it be okay if I add VG_(read_nanosecond_timer)
>> returning a ULong?
>
> Why does this microbenchmark measurement have to be part of the tool?
> If the config is not available, the tool could error out with a message
> explaining what to do to generate the config (e.g. run another binary).

Firstly, right now there is no automatic cache configuration detection on
non-x86/x64 architectures, nor is there likely to be, given such
instructions are privileged. This causes the naïve user to run
cachegrind/callgrind on, let's say, ARM and not realize that it's using a
default config for a Cortex-A8, so the output is going to be heavily
skewed as a result. I myself wasted a week of my time before the lightbulb
switched on because the numbers didn't feel right (mea culpa, I should
have read the docs more closely). I have no issue with having
cachegrind/callgrind refuse to run without a known good cache config BTW,
but that does seem a bit overkill.

Secondly, it's pretty straightforward to add this to the tool. The changes
are minimal, and you should find the patch review easy. The only change
outside cachegrind is quite literally VG_(read_nanosecond_timer), as I
reimplemented vsnprintf() separately for double formatting support.

Thirdly, I've only been given the authorization to contribute what I've
outlined above.
It is, however, my great hope that others might take up the baton and push
the idea of using valgrind as a unique kind of performance profiler much
further. In particular, ideally we'd compile everything here to LLVM and
figure out slow code from that, given arbitrary models of CPU
configurations, but neither LLVM nor the BB10 source code ecosystem is
there yet. Indeed, I remember reading somewhere that valgrind intends to
move to LLVM one day too (it makes sense; C++17 kind of implies an
LLVM-type compiler), so maybe my contribution here might help focus some
attention from the majors on this issue. After all, this problem doesn't
just affect us; it affects everyone building on mobile device hardware,
and I'm sure we'd all much prefer to spend our time not debugging weird
performance corner cases.

Niall