From: Nicholas N. <nj...@ca...> - 2003-05-07 21:02:50
On Wed, 7 May 2003, Josef Weidendorfer wrote:

> > Is a round-robin mapping of threads to processors accurate, i.e.
> > representative of what would really happen?
>
> I think the regular case is e.g. to run a multithreaded application with
> 4 threads on a 4-processor machine, and round-robin mapping is accurate
> in an "n threads / n processors" scenario.

How common is that scenario?

> The typical use case here is to check whether there is e.g. cache
> thrashing (independent data regularly accessed by the two processors is
> located in the same cache line, leading to a lot of cache
> invalidations/misses) or a general performance slowdown because shared
> data is accessed often.

So with cache thrashing, Cachegrind/Calltree with this feature wouldn't necessarily report a figure representative of any real-life configuration, but would give a general indication of how well different threads interact, yes? That sounds like it could be useful.

N
From: Josef W. <Jos...@gm...> - 2003-05-07 15:36:26
On Wednesday 07 May 2003 09:58, Nicholas Nethercote wrote:

> On Mon, 5 May 2003, Josef Weidendorfer wrote:
> > what's needed in cachegrind to support multiple processor caches and
> > coherency protocols among them? I have a wish item here, and perhaps
> > it's quite easy to implement.
> >
> > Motivation:
> > Multithreaded (PThread) programs are handled quite well by cachegrind,
> > but the results can be misleading because only one cache hierarchy is
> > simulated: if the real program will run on a 2-processor machine and
> > we have 2 threads, there should be 2 caches (one per processor)
> > simulated. The default configuration could be to simulate as many
> > caches as there are processors in your machine, and use a simple
> > static round-robin mapping from threads to the simulated caches.
> >
> > Items I think have to be done:
> > 1. Reserve some bits of the tag value of each cache entry for state
> >    bits of the coherence protocol (this should always be fine because
> >    there is no direct-mapped cache with a cache-line size of 1 byte).
> > 2. Allocate multiple "static cache_t2 I1, D1, L2" structures.
> > 3. Switch the cache_t2 structures on a thread switch.
> > 4. Change cachesim_##L##_doref to handle a cache coherence protocol
> >    (e.g. invalidating cache entries of remote caches on writes).
> >
> > Do you think this is doable/useful at all, or am I overlooking
> > something?
>
> My knowledge of multi-processors is very poor, but it seems plausible.
> Is a round-robin mapping of threads to processors accurate, i.e.
> representative of what would really happen?

Implementing dynamic scheduling would be far too complex. I think the regular case is e.g. to run a multithreaded application with 4 threads on a 4-processor machine, and round-robin mapping is accurate in an "n threads / n processors" scenario. The typical use case here is to check whether there is e.g. cache thrashing (independent data regularly accessed by the two processors is located in the same cache line, leading to a lot of cache invalidations/misses) or a general performance slowdown because shared data is accessed often.

> Also, are you thinking of doing this in Cachegrind or in Calltree?
> Either way, I guess implementing it is the only way to really see if
> it's doable. If the results for a 1-processor machine are the same as
> the current results (I imagine they would be) and it's not too complex,
> I wouldn't object to it going into Cachegrind.

First I'd like to implement it in calltree. The added functionality of calltree has almost nothing to do with the addition of multiple caches (besides allowing multiple BBCCs for one BB, see below).

> One thing: how will annotation work? The simplest way would be to just
> add up all the accesses and misses for each line of code, ignoring
> which processor the accesses and misses were on. But maybe this is
> useful information? I don't know.

Adding up all misses is in itself already useful. You could run with 1 cache first (this is e.g. the case when using 2 threads on a P4 with hyperthreading), then run the same with 2 caches and check the differences in cache misses.

OTOH, the calltree skin already allows separate profile data per thread ID. You get separate dumps and can compare them; this doesn't need any changes in cg_annotate. KCachegrind allows loading multiple dumps for this (still on my TODO list: allow the user to specify that the costs of one dump be subtracted, which would give a difference view of cost events for comparing runs/threads).

If you are interested in this part of calltree (separate dumping for threads) for integration into cachegrind, I can prepare a patch. The changes are:

1) Additional instrumentation: a call at the start of each BB, setting up a base pointer to a BBCC struct for all cost centers in this BB.
2) The log_* callbacks for each memory access are changed to give an offset from the BBCC start; with the base pointer for the BBCC, you get the actual pointer to the CC for this instruction.
3) On demand, clones of BBCCs are created for different threads in the setup function mentioned in (1).
4) In the BBCC hash lookup, I added the thread ID to the hash key.

Step (1) now involves a hash-table lookup to set the base pointer. This is sped up by caching the last BBCC used for a BB. BTW, I do the whole logic for tracing function calls in calltree in this setup function.

Josef
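The quoted proposal's items — per-processor caches, a static round-robin thread-to-cache mapping, and write-invalidate coherence — can be sketched as a toy simulation. This is a direct-mapped, single-level sketch for brevity; all names (`cache_t2_sketch`, `cpu_of_thread`, `cache_ref`) are illustrative, not Cachegrind's actual structures, and real code would also reserve tag bits for MESI-style state as item 1 suggests:

```c
#define N_CPUS    2
#define N_SETS    256
#define LINE_BITS 6                 /* 64-byte cache lines */

typedef struct {
    unsigned long tag[N_SETS];      /* 0 = invalid; real code would also
                                       reserve tag bits for coherence state */
} cache_t2_sketch;

static cache_t2_sketch d1[N_CPUS];  /* one D1 cache per simulated CPU */

static int cpu_of_thread(int tid)   /* static round-robin mapping */
{
    return tid % N_CPUS;
}

/* Simulate one data reference; returns 1 on miss, 0 on hit.
 * A write invalidates copies of the line in all remote caches. */
static int cache_ref(int tid, unsigned long addr, int is_write)
{
    unsigned long line = (addr >> LINE_BITS) + 1;  /* +1 keeps 0 "invalid" */
    int cpu = cpu_of_thread(tid);
    int set = (int)(line % N_SETS);
    int miss = (d1[cpu].tag[set] != line);

    d1[cpu].tag[set] = line;
    if (is_write)
        for (int c = 0; c < N_CPUS; c++)
            if (c != cpu && d1[c].tag[set] == line)
                d1[c].tag[set] = 0;                /* coherence invalidation */
    return miss;
}
```

A read by thread 0, a write to the same line by thread 1, then another read by thread 0 produces a coherence miss that a single shared cache would never report — exactly the effect the proposal wants to expose.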
From: Nicholas N. <nj...@ca...> - 2003-05-07 07:58:05
On Mon, 5 May 2003, Josef Weidendorfer wrote:

> what's needed in cachegrind to support multiple processor caches and
> coherency protocols among them? I have a wish item here, and perhaps
> it's quite easy to implement.
>
> Motivation:
> Multithreaded (PThread) programs are handled quite well by cachegrind,
> but the results can be misleading because only one cache hierarchy is
> simulated: if the real program will run on a 2-processor machine and we
> have 2 threads, there should be 2 caches (one per processor) simulated.
> The default configuration could be to simulate as many caches as there
> are processors in your machine, and use a simple static round-robin
> mapping from threads to the simulated caches.
>
> Items I think have to be done:
> 1. Reserve some bits of the tag value of each cache entry for state bits
>    of the coherence protocol (this should always be fine because there
>    is no direct-mapped cache with a cache-line size of 1 byte).
> 2. Allocate multiple "static cache_t2 I1, D1, L2" structures.
> 3. Switch the cache_t2 structures on a thread switch.
> 4. Change cachesim_##L##_doref to handle a cache coherence protocol
>    (e.g. invalidating cache entries of remote caches on writes).
>
> Do you think this is doable/useful at all, or am I overlooking something?

My knowledge of multi-processors is very poor, but it seems plausible. Is a round-robin mapping of threads to processors accurate, i.e. representative of what would really happen?

Also, are you thinking of doing this in Cachegrind or in Calltree? Either way, I guess implementing it is the only way to really see if it's doable. If the results for a 1-processor machine are the same as the current results (I imagine they would be) and it's not too complex, I wouldn't object to it going into Cachegrind.

One thing: how will annotation work? The simplest way would be to just add up all the accesses and misses for each line of code, ignoring which processor the accesses and misses were on. But maybe this is useful information? I don't know.

N
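The "just add them up" annotation scheme Nicholas describes could look like this in outline: each simulated processor keeps its own per-source-line miss count, and annotation reports the per-line sum. All names here are illustrative; cg_annotate's real data model differs:

```c
enum { SK_N_CPUS = 2, SK_N_LINES = 8 };

/* Per-CPU, per-source-line miss counters (hypothetical layout). */
static long sk_misses[SK_N_CPUS][SK_N_LINES];

static void sk_record_miss(int cpu, int src_line)
{
    sk_misses[cpu][src_line]++;
}

/* Per-line total, ignoring which simulated processor missed —
 * the "simplest way" from the email above. */
static long sk_annotate_line(int src_line)
{
    long total = 0;
    for (int cpu = 0; cpu < SK_N_CPUS; cpu++)
        total += sk_misses[cpu][src_line];
    return total;
}
```

Keeping the per-CPU breakdown around (rather than summing at record time) would leave the door open to the richer per-processor annotation Nicholas wonders about, at the cost of more memory per cost center.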