From: Jeremy F. <je...@go...> - 2002-11-18 18:35:01
|
On Mon, 2002-11-18 at 09:53, Nicholas Nethercote wrote:
> The tutorial is a bit unclear, but I think the figures in the first part
> don't account for any optimisations within the traces. So it is likely
> underestimating the advantage of the trace cache.
I'll reread it more closely.
> Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
> actually sped up most programs by up to 20%. There's no clear reason why
> the x86 version mostly slows them down; maybe it's just an
> engineering/maturity thing.
Well, there was that section on dynamic optimisation, but I don't think
it applied to any of the earlier discussion about the basic Dynamo
mechanism. The optimisations are really simple though; it should be
possible to do something similar.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 17:59:01
|
I've put up my current version of 44-symbolic-addr. It would be
interesting for people to try it out and see if it does anything useful
for you. I've only added support to helgrind for now, so you'd want to
be trying that out.
There's some interesting problems with using it in memcheck. Memcheck
doesn't really deal with memory locations so much as values, so it can't
use VG_(describe_addr)() for most of its reports. For those times when
it does talk about memory addresses, such as free() errors, it does so
from within the context of vg_clientfuncs.c, which means that
VG_(describe_addr)() will talk about locals in that file rather than
from within the user's code. Still, it's not like it's hard to work out
what V is talking about in those cases.
Oh, there are the reports when you touch freed memory itself; those
should give useful descriptions. I'll do a patch.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 17:54:13
|
Oh, another thing:

On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > It's at www.cag.lcs.mit.edu/dynamorio/. The slides from the ASPLOS
> > tutorial are very interesting. The whole thing was only released in June.
>
> Looks interesting. It seems their biggest performance win was using BB
> chaining. They also implemented a trace cache, though it seems to be a
> fair bit of complexity for hardly any incremental improvement.

The tutorial is a bit unclear, but I think the figures in the first part
don't account for any optimisations within the traces. So it is likely
underestimating the advantage of the trace cache.

Interestingly, the original version of Dynamo, implemented on HPPA-RISC,
actually sped up most programs by up to 20%. There's no clear reason why
the x86 version mostly slows them down; maybe it's just an
engineering/maturity thing.

N
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 17:51:21
|
On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > It clearly has some advantages over Valgrind (eg. performance, runs under
> > Windows). Now I'm trying to work out what advantages it has over it...
>
> V is under the GPL, as opposed to:
>
>     permission to copy and use (but not distribute to others) this software
>     for the sole purpose of internal evaluation is hereby granted without
>     fee.
>
> Oh, and they don't seem to have released any source...

Sure. It also doesn't work with pthreaded Linux programs. And I can't
get their example skins (which they confusingly call "clients") to work.
I was thinking more from my biased need-a-PhD-in-two-years point of
view :)

N
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 17:44:49
|
On Mon, 2002-11-18 at 08:03, Nicholas Nethercote wrote:
> Its performance is better than Valgrind's; without instrumentation
> programs under Linux run about 1.0--2.0 times slower, with Valgrind it's
> more like 5 times slower.
>
> It's at www.cag.lcs.mit.edu/dynamorio/. The slides from the ASPLOS
> tutorial are very interesting. The whole thing was only released in June.

Looks interesting. It seems their biggest performance win was using BB
chaining. They also implemented a trace cache, though it seems to be a
fair bit of complexity for hardly any incremental improvement.

> It clearly has some advantages over Valgrind (eg. performance, runs under
> Windows). Now I'm trying to work out what advantages it has over it...

V is under the GPL, as opposed to:

    permission to copy and use (but not distribute to others) this software
    for the sole purpose of internal evaluation is hereby granted without
    fee.

Oh, and they don't seem to have released any source...

J
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 16:59:40
|
On Mon, 2002-11-18 at 00:45, Nicholas Nethercote wrote:
> On 17 Nov 2002, Jeremy Fitzhardinge wrote:
>
> > The P4 has the interesting architectural feature of a trace cache rather
> > than a traditional icache.
>
> I'd be wary of trying to target a single architecture. For example, the
> Athlon has a traditional I cache. And Valgrind should (hopefully!) be
> around longer than the P4...
The P4 is a good example of a modern architecture. Anything that's good
for P4 will be good for Athlon (and Hammer) as well.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 16:03:52
|
Hi,

I was just reading about DynamoRIO, a "runtime optimisation and
introspection" tool I recently discovered that is awfully similar to
Valgrind. Basically, it does the same thing; their API looks rather
familiar:

    dynamorio_init();
    dynamorio_exit();
    dynamorio_basic_block(void *drcontext, app_pc tag, InstrList *bb);
    dynamorio_fragment_deleted(...);
    etc.

One large difference is that it works at the x86 instruction level,
which means it's probably a better tool for low-level stuff such as
runtime optimisation. There are also lots of functions for poking at
instructions, eg. to determine if an instruction's operands touch
memory, so it might feasibly be used for complicated skins like
MemCheck and Cachegrind.

It works under Windows NT/2000/XP and Linux. There are various
limitations, and it looks a little less robust than Valgrind under
Linux, but it ran Konqueror fine when I tried it. Its performance is
better than Valgrind's; without instrumentation programs under Linux
run about 1.0--2.0 times slower, with Valgrind it's more like 5 times
slower.

It's at www.cag.lcs.mit.edu/dynamorio/. The slides from the ASPLOS
tutorial are very interesting. The whole thing was only released in
June.

It clearly has some advantages over Valgrind (eg. performance, runs
under Windows). Now I'm trying to work out what advantages it has over
it... UCode is arguably one, but also arguably not. Hmm.

N
|
|
From: Carlo W. <ca...@al...> - 2002-11-18 15:08:16
|
Hiya fellow hackers,
I am the author of libcwd - which has existed a lot longer than valgrind! heheh.
I recently started using valgrind and found a few overlapping
functionalities (sadly) that should probably be turned into
a new library and then used by both from one source:
- Reading and analysing ELF symbol tables
- Lookup of source file line number locations and function names
- Demangling of new ABI mangled function names (gcc-3.x)
Also, although not for use by the users of libcwd, I wrote
code to test potential deadlocks in POSIX threaded applications
(in fact, to test libcwd itself for potential deadlocks).
The latter might be a nice (new?) skin, or be added to helgrind.
---------------------------------------------------------------------------
Let me see what I did...
Aha, basically I call a function 'test_for_deadlock' with some
parameters, every time a lock is set. The parameters specify
among others a unique id of the lock. 'test_for_deadlock'
then does the following (just scan quickly over it):
void test_for_deadlock(int instance, struct TSD_st& __libcwd_tsd, void const* from)
{
if (!WST_multi_threaded)
return; // Give libcwd the time to get initialized.
if (instance == keypair_map_instance)
return;
set_alloc_checking_off(__libcwd_tsd);
// Initialization.
if (!keypair_map)
keypair_map = new keypair_map_t;
// We don't use a lock here because we can't. I hope that in fact this is
// not a problem because threadlist is a list<> and new elements will be added
// to the end. None of the new items will be of interest to us. It is possible
// that an item is *removed* from the list (see thread_ct::tsd_destroyed above)
// but that is very unlikely in the cases where this test is actually being
// used (we don't test with more than 1024 threads).
for (threadlist_t::iterator iter = threadlist->begin(); iter != threadlist->end(); ++iter)
{
mutex_ct const* mutex = &((*iter).thread_mutex);
if (mutex->M_locked_by == __libcwd_tsd.tid && mutex->M_instance_locked >= 1)
test_lock_pair(reinterpret_cast<int>(mutex), mutex->M_locked_from, instance, from);
}
for (int inst = 0; inst < instance_locked_size; ++inst)
{
// Check for read locks that are already set. Because it was never stored whether or not
// it was a high priority lock, this information is lost. This is not a problem though
// because we treat high priority and normal read locks the same when they are set first.
if (inst < instance_rdlocked_size && __libcwd_tsd.rdlocked_by1[inst] == __libcwd_tsd.tid && __libcwd_tsd.instance_rdlocked[inst] >= 1)
test_lock_pair(inst + read_lock_offset, __libcwd_tsd.rdlocked_from1[inst], instance, from);
if (inst < instance_rdlocked_size && __libcwd_tsd.rdlocked_by2[inst] == __libcwd_tsd.tid && __libcwd_tsd.instance_rdlocked[inst] >= 2)
test_lock_pair(inst + read_lock_offset, __libcwd_tsd.rdlocked_from2[inst], instance, from);
// Check for write locks and normal mutexes.
if (locked_by[inst] == __libcwd_tsd.tid && instance_locked[inst] >= 1 && inst != keypair_map_instance)
test_lock_pair(inst, locked_from[inst], instance, from);
}
set_alloc_checking_on(__libcwd_tsd);
}
What this does is the following: It calls 'test_lock_pair' for every
combination of two lock ids (including the location at which they were locked)
where the first one is any lock that is locked at this moment, and the
second one is the new lock that is being set right now.
test_lock_pair does the following (just scan quickly over it):
// These are the different groups that are allowed together.
uint16_t const group1 = 0x002; // Instance 1 locked first.
uint16_t const group2 = 0x004; // Instance 2 locked first.
uint16_t const group3 = 0x020; // Instance 1 read only and high_priority when second.
uint16_t const group4 = 0x040; // Instance 2 read only and high_priority when second.
uint16_t const group5 = 0x200; // (Instance 1 locked first and (either one read only and high_priority when second))
// or (both read only and high_priority when second).
uint16_t const group6 = 0x400; // (Instance 2 locked first and (either one read only and high_priority when second))
// or (both read only and high_priority when second).
struct keypair_key_st {
int instance1;
int instance2;
};
struct keypair_info_st {
uint16_t state;
int limited;
void const* from_first[5];
void const* from_second[5];
};
struct keypair_compare_st {
bool operator()(keypair_key_st const& a, keypair_key_st const& b) const;
};
// Definition of the map<> that holds the key instance pairs that are ever locked simultaneously by the same thread.
typedef std::pair<keypair_key_st const, keypair_info_st> keypair_map_value_t;
typedef std::map<keypair_key_st, keypair_info_st, keypair_compare_st> keypair_map_t;
static keypair_map_t* keypair_map;
// Bring some arbitrary ordering into the map with key pairs.
bool keypair_compare_st::operator()(keypair_key_st const& a, keypair_key_st const& b) const
{
return (a.instance1 == b.instance1) ? (a.instance2 < b.instance2) : (a.instance1 < b.instance1);
}
extern "C" int raise(int);
void test_lock_pair(int instance_first, void const* from_first, int instance_second, void const* from_second)
{
if (instance_first == instance_second)
return; // Must have been a recursive lock.
keypair_key_st keypair_key;
keypair_info_st keypair_info;
// Do some decoding and get rid of the 'read_lock_offset' and 'high_priority_read_lock_offset' flags.
// During the decoding, we assume that the first instance is the smallest (instance 1).
keypair_info.state = group1;
bool first_is_readonly = false;
bool second_is_high_priority = false;
if (instance_first < 0x10000)
{
if (instance_first >= read_lock_offset)
{
first_is_readonly = true;
keypair_info.state |= (group3 | group5);
}
instance_first %= read_lock_offset;
}
if (instance_second < 0x10000)
{
if (instance_second >= high_priority_read_lock_offset)
{
second_is_high_priority = true;
keypair_info.state |= (group4 | group5);
if (first_is_readonly)
keypair_info.state |= group6;
}
instance_second %= read_lock_offset;
}
// Put the smallest instance in instance1.
if (instance_first < instance_second)
{
keypair_key.instance1 = instance_first;
keypair_key.instance2 = instance_second;
}
else
{
keypair_key.instance1 = instance_second;
keypair_key.instance2 = instance_first;
// Correct the state by swapping groups 1 <-> 2, 3 <-> 4, and 5 <-> 6.
keypair_info.state = ((keypair_info.state << 1) | (keypair_info.state >> 1)) & (group1|group2|group3|group4|group5|group6);
}
// Store the locations where the locks were set.
keypair_info.limited = 0;
keypair_info.from_first[0] = from_first;
keypair_info.from_second[0] = from_second;
mutex_tct<keypair_map_instance>::lock();
std::pair<keypair_map_t::iterator, bool> result = keypair_map->insert(keypair_map_value_t(keypair_key, keypair_info));
if (!result.second)
{
keypair_info_st& stored_info((*result.first).second);
uint16_t prev_state = stored_info.state;
stored_info.state &= keypair_info.state; // Limit possible groups.
if (prev_state != stored_info.state)
{
stored_info.limited++;
stored_info.from_first[stored_info.limited] = from_first;
stored_info.from_second[stored_info.limited] = from_second;
DEBUGDEBUG_CERR("\nKEYPAIR: first: " << instance_first << "; second: " << instance_second << "; groups: " << (void*)(unsigned long)stored_info.state << '\n');
}
if (stored_info.state == 0) // No group left?
{
FATALDEBUGDEBUG_CERR("\nKEYPAIR: There is a potential deadlock between lock " << keypair_key.instance1 << " and " << keypair_key.instance2 << '.');
FATALDEBUGDEBUG_CERR("\nKEYPAIR: Previously, these locks were locked in the following locations:");
for (int cnt = 0; cnt <= stored_info.limited; ++cnt)
FATALDEBUGDEBUG_CERR("\nKEYPAIR: First at " << stored_info.from_first[cnt] << " and then at " << stored_info.from_second[cnt]);
mutex_tct<keypair_map_instance>::unlock();
core_dump();
}
}
else
{
DEBUGDEBUG_CERR("\nKEYPAIR: first: " << instance_first << "; second: " << instance_second << "; groups: " << (void*)(unsigned long)keypair_info.state << '\n');
#if 0
// X1,W/R2 + Y2,Z3 imply X1,Z3.
// First assume the new pair is a possible X1,W/R2:
if (!second_is_high_priority)
{
for (keypair_map_t::iterator iter = keypair_map->begin(); iter != keypair_map->end(); ++iter)
{
}
}
#endif
}
mutex_tct<keypair_map_instance>::unlock();
}
If this gives you a headache, then just look at the possible
fatal debug output that this can cause:
KEYPAIR: There is a potential deadlock between lock " << keypair_key.instance1 << " and " << keypair_key.instance2 << '.'
KEYPAIR: Previously, these locks were locked in the following locations:"
KEYPAIR: First at " << stored_info.from_first[cnt] << " and then at " << stored_info.from_second[cnt]
Basically, this detects that if one thread first locks 'A' and then
locks 'B', another thread is never allowed, at any other time, to first
lock 'B' and then lock 'A'. Then multiply the complexity of that by
1000 by introducing read locks, write locks and high priority locks.
Here, a 'read lock' blocks other write locks, but not other read locks.
A 'write lock' blocks both other read and write locks.
And a high priority (read) lock is libcwd specific and has to do with
the queueing order in which pending locks are waiting. Normally a pending
write lock gets precedence over new read-lock attempts, but not when the
new read-lock attempt is high priority. I needed this to fix all possible
deadlocks.
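Stripped of the read/write/high-priority groups, the core of this check can be sketched in a few lines of C++ (hypothetical names, not libcwd's actual code): remember every ordered pair of lock ids that was ever held simultaneously, and flag a pair that is later seen in the reverse order.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Every ordered pair (first, second) of lock ids some thread has held
// at the same time, with 'first' acquired before 'second'.
static std::set< std::pair<int,int> > seen_orders;

// Call when taking lock 'second' while already holding lock 'first'.
// Returns false if this ordering conflicts with an earlier observation,
// i.e. a potential A->B vs B->A deadlock.
bool note_lock_pair(int first, int second)
{
    if (first == second)
        return true;  // recursive lock, ignore (as test_lock_pair does)
    seen_orders.insert(std::make_pair(first, second));
    // If the reverse order was ever observed, two threads could each be
    // holding one lock while waiting for the other.
    return seen_orders.count(std::make_pair(second, first)) == 0;
}
```

The real code additionally tracks *where* each lock was taken and which lock-type groups are still compatible, but the pair-set above is the essential invariant.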
If you add this, then realize that it is still VERY hard to fix
a deadlock-related bug - even though with the above code you can easily
and likely detect them all. What is needed is a full backtrace of where
and when the locks involved in a potential deadlock were set,
and per-lock bookkeeping of what each lock is waiting for.
More complexity is needed to detect potential deadlocks.
For example, if first instance 1 is locked (read or write) and
then a write lock is set (any) OR instance 2 is read locked,
and at any other time (different thread, different time)
first instance 2 is locked (read or write) and then an
instance 3 is locked (read or write), then a potential
deadlock needs to be detected when later (possibly a different thread,
a different time) first instance 3 is locked and then instance 1
is locked. Hence the remark that X1,W/R2 + Y2,Z3 implies an X1,Z3 pair.
There are more "formulas" like that to be considered.
--
Carlo Wood <ca...@al...>
|
|
From: Julian S. <js...@ac...> - 2002-11-18 13:49:16
|
Just fwiw, yesterday afternoon I ran KDE-3.1-rc3 on addrcheck. This is
the entire process tree starting from startkde. It pretty much worked,
but was a bit slow and one process bombed valgrind. All output was
logged to a different machine running the listener. So following some
robustification, I think this might be workable (P4 1.7 GHz, 512 MB).
Even from my small test it picked up some reads/writes of freed memory
in artsd (something to do with KDE sound I think).

J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 08:45:20
|
On 17 Nov 2002, Jeremy Fitzhardinge wrote:
> The P4 has the interesting architectural feature of a trace cache rather
> than a traditional icache.

I'd be wary of trying to target a single architecture. For example, the
Athlon has a traditional I cache. And Valgrind should (hopefully!) be
around longer than the P4... Just my $0.02.

N
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:53:26
|
On Sun, 2002-11-17 at 10:17, Julian Seward wrote:
> My understanding of the ld.so scheme (which may be wrong!) is:
>
> Initially a call to (eg) printf is sent to a stub function (which perhaps
> lives in the PLT, the procedure linkage table?). This
> invokes the dynamic linker, to find the address of the real printf.
> The linker overwrites this stub with a jump to the real address, and
> jumps onwards there.
>
> Second and subsequent calls to printf still jump to the stub, but that
> simply makes a second jump to the real code. So once the initial lookup
> is done, there is a 1-insn overhead for all subsequent calls, which is
> not bad.
>
> This scheme has the advantage that neither the caller nor callee has
> to be patched, since that would negate the advantages of text segments
> shared between multiple processes. Only each processes' PLT(s?) need be
> modified, but these are relatively small.
>
> Does this clarify? Do you know if this is even correct?
Yes, that's correct, but it wasn't my understanding of Josef's proposal
(or more to the point, Josef's proposal is very close, but not quite the
same). Of course, I may have misunderstood.
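The lazy-binding scheme Julian describes can be modelled in C++ with a self-patching function pointer standing in for a PLT slot (a hypothetical sketch only - the real ld.so does this at the machine-code level, not through C function pointers):

```cpp
#include <cassert>

// Hypothetical model of PLT lazy binding: the "slot" is a function
// pointer that initially points at a resolver stub.
static int real_function(int x) { return x * 2; }  // the "real printf"
static int resolver(int x);                        // one-time lookup stub

static int (*plt_slot)(int) = resolver;            // stub installed first
static int resolve_count = 0;

static int resolver(int x)
{
    ++resolve_count;             // the expensive symbol lookup, done once
    plt_slot = real_function;    // patch the slot with the real address
    return plt_slot(x);          // and jump onwards to the real code
}

// Callers always go through the slot; after the first call this is the
// 1-insn-overhead indirect jump Julian mentions.
int call_through_plt(int x) { return plt_slot(x); }
```

The key property the sketch shows: the resolver runs exactly once, and neither the caller nor the callee is modified, only the slot.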
I think he was proposing a per-BB table which encodes the address of
that BB's generated code. Jumps to that BB would therefore be something
like jmp *(bb_targets[idx]) (where idx is computed at compile time, and
therefore doesn't allow the table to be rearranged without invalidating
the translation cache). That table would be incrementally generated
while compiling basic blocks which contain jumps to that address.
Indirect and computed jumps would still have to go through the current
mechanism.
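As a rough model of the per-BB table idea (hypothetical names, with plain function pointers standing in for generated code), unfilled entries point back at the dispatcher, which translates the block and fills in the slot so later jumps go straight through:

```cpp
#include <cassert>

typedef int (*bb_code)(int);

static int bb0_translation(int x) { return x + 1; } // stands in for generated code
static int dispatch(int x);                          // slow path

// The per-BB target table: every slot starts out pointing at the
// dispatcher, the analogue of an untranslated target.
static bb_code bb_targets[4] = { dispatch, dispatch, dispatch, dispatch };
static int translations = 0;

static int dispatch(int x)
{
    // "Translate" BB 0 on first use, then continue through the table,
    // just as a generated jmp *(bb_targets[idx]) would.
    ++translations;
    bb_targets[0] = bb0_translation;
    return bb_targets[0](x);
}

// A generated jump to BB 0 is modelled as an indirect call through slot 0.
int jump_to_bb0(int x) { return bb_targets[0](x); }
```

This also illustrates Jeremy's objection: every inter-BB transfer is an indirect jump through a data table, rather than a patchable direct jump in the instruction stream.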
The distinction is that the PLT is a table in each caller's object file
which is either an array of instructions which is overwritten by the
dynamic linker (Sparc, Mips), or just an indirect jump through a table
of pointers to functions (i386, though that's a pretty bad idea these
days, for the reasons I talk about in the other mail I just sent). For
Josef's proposal, that table would be built incrementally in much the
same way as the dynamically generated code is built (in fact, it would
be like a wavefront just in advance of all possible predicted jumps in
the code).
However, I think we'd get the most benefit from chaining by effectively
compiling this table into the generated code, by having an instruction
which gets overwritten with either a chained jump or a return back into
the dispatch loop. I think this is actually simpler to implement as
well.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:42:18
|
On Sun, 2002-11-17 at 06:15, Josef Weidendorfer wrote:
> > generated code, the entrypoint of "patch_me"? This is OK, but it would
> > be nice not to have an indirect jump in the generated code, since this
> > makes branch prediction much harder. A nice absolute unconditional jump
>
> Is branch prediction with indirect jumps always a problem for the CPU
> hardware? Perhaps register relative, loading the register a few cycles before
> the jump? Then the pipeline shouldn't stall.
> But that needs a register. OK, go with patching the generated code.
That still doesn't really help. The P4 has a 20-stage pipeline, it gets
really expensive if there are any bubbles in it.
The worst kind of bubble is a cache miss, so you want to keep cache
pressure as low as possible. Building the target address into the
instruction stream rather than requiring a data cache line for every
branch is a win there.
The lesser bubble is when you get a branch mis-prediction, or a stall
because the jump's target can't otherwise be determined. The branch
target buffer (BTB) will help there, because in the absence of other
information it will predict that the branch will go to the last place it
went; unfortunately if every branch is indirect, it will put a lot of
pressure on the BTB and cause a high miss rate.
Even if the CPU could make use of the fact that you preloaded the target
into a register, you'd need to do it 20 instructions before the actual
jump, which would be very hard (in that most basic blocks aren't 20
instructions long, and it would cause quite a lot of register pressure
keeping all possible jump targets in the next 20 instructions in
registers).
The P4 has the interesting architectural feature of a trace cache rather
than a traditional icache. The trace cache ignores branches altogether,
and just encodes the uops making up the instruction stream. In other
words, it effectively builds branch prediction into the icache. I
suspect whenever there's a mis-predict or some complex jump (ie,
indirect) it will break the current trace, with some performance
impact. If we can give the trace cache unconditional jumps, it will get
nice long traces to work with.
As you suggest, we could emulate this by running together two basic
blocks which have an unconditional jump between them (we do this
implicitly anyway, to some extent, if we get a jump into the middle of
an existing BB). I don't think this would be worthwhile though.
The other advantage of allowing the P4 to do good branch prediction is
to allow it to speculate deeply into the future; this will generate data
cache prefetches, and therefore lower the observed data cache miss
rate. If it speculates into the wrong future, it will end up polluting
the data cache with useless stuff.
Pre-P4 CPUs are less sensitive to all this, I suspect, but would still
benefit from having nicely predictable jumps wherever possible (they
will still prefetch and speculate over the jump).
BTW, the current dispatch loop has about worst-possible behaviour with
regard to branch prediction. It is a single call instruction which is
indirect and always goes to a different address, making the BTB almost
completely ineffective; I would expect a near 0% prediction rate there.
At least the call-return sequence is matched so the return stack cache
will get a good hit rate.
Another useful thing to do (which may already be there, I haven't
looked) is to make sure that generated code always starts at a 16 or 32
byte boundary (depending on the CPU architecture), to make sure it
starts on a cache line. That way it saves the CPU from wasting memory
bandwidth fetching unused memory from before the start of the generated
code.
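The alignment step is just rounding the code-buffer cursor up to the next boundary before emitting a translation; a minimal helper (hypothetical name) would be:

```cpp
#include <cassert>
#include <cstdint>

// Round a code-buffer cursor up to the next 'alignment'-byte boundary
// (alignment must be a power of two: 16 or 32 depending on the CPU),
// so each translation starts on a fresh cache line / fetch line.
uintptr_t align_up(uintptr_t p, uintptr_t alignment)
{
    return (p + alignment - 1) & ~(alignment - 1);
}
```

The cost is a few NOP bytes of padding per translation, against not fetching unused bytes from before the block's entry point.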
> Another idea: If two BBs are to be executed unconditionally after each other,
> couldn't you try to put them in the translation cache direct behind each other
> with no jump at all? Perhaps padding with some NOPs if a jump will be needed
> later on...
As I said, this isn't all that important if you use unconditional jumps.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:08:49
|
On Sun, 2002-11-17 at 04:37, Julian Seward wrote:
> > On Sat, 16 Nov 2002, Julian Seward wrote:
> > > - how to cleanly deal with jumps to unknown addresses, which always
> > > require a lookup
> >
> > Dynamo has some kind of table for these.
>
> That seems unavoidable. We'd probably just recycle the current
> arrangement, or perhaps a simplified version of it, in this case. Bear in
> mind that x86 RET instructions are going to pull an original address off
> the stack and jump to it, so there will still be a significant minority
> of such lookups needed.
It's likely that rets will be the majority of jumps with unknown
destinations (though very OO C++ code's use of virtual method
invocations will get close). I think the current mechanism is fine.
> > > Finally -- and this is the last part of the trick -- whenever we want to
> > > move or discard any translations, we first unchain *all* of them.
> >
> > I think Dynamo actually flushes the entire code cache in this situation.
> > But it doesn't have LRU eviction, and their code cache seems to be a lot
> > smaller. So maybe that's not relevant.
>
> Yes, there's a tradeoff between complexity of the cache management and
> expense of translation. Since we've got a hugely expensive translator
> by JIT standards (10s of 000s of cycles / byte of generated code) we
> have to have a more complex arrangement designed to minimise the miss
> rate.
My profiling doesn't show more than about 5% of time being spent in
codegen when there's a non-trivial skin running. Unfortunately oprofile
doesn't account for time spent in generated code, so its hard to tell
what the time for code generation vs. time running generated code ratio
is.
> > Also, w.r.t. indirect jumps, I think Dynamo encodes a lot of them as
> > direct jumps and then checks before making the jump that the direct jump
> > is to the right place. This somehow turns out as a win...
>
> Yes, known trick. I think VMware probably does this too (it's a light
> weight JIT, in effect). (How VMware works is really described in an
> earlier paper from Stanford; I can dig it out if anyone wants).
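The trick being referred to - guarding a direct jump with a cheap check against one predicted target - can be modelled like this (hypothetical names; a real implementation emits the compare and direct jump as machine code):

```cpp
#include <cassert>

typedef int (*block_fn)(int);

static int hot_block(int x)  { return x + 10; }  // the single predicted target
static int cold_block(int x) { return x - 10; }  // some other target

static int slow_lookups = 0;

// Generic dispatch: stands in for the hash-table lookup of a translation.
static int dispatch_lookup(block_fn target, int x)
{
    ++slow_lookups;
    return target(x);
}

// An indirect jump rewritten as a guarded direct jump: compare the
// runtime target against an immediate; on a hit, take a statically
// predictable direct jump, and only fall back to dispatch on a miss.
int jump_indirect(block_fn target, int x)
{
    if (target == hot_block)
        return hot_block(x);
    return dispatch_lookup(target, x);
}
```

The win comes from indirect-branch targets being highly repetitive in practice, so the compare almost always succeeds and the branch predictor sees an easy conditional branch instead of an indirect one.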
Yes please.
J
|