|
From: Niall D. <ndo...@bl...> - 2013-04-05 15:40:58
Attachments:
smime.p7s
|
Dear valgrind developers,

I'm developing an enhancement to cachegrind/callgrind's output - an estimated likely execution time log - which multiplies the instruction type counts by their average execution times on the target CPU in order to generate somewhat more realistic profiling results. This would be especially useful to us for ARM targets, as these have unusually slow main memory relative to other architectures.

At the start of cachegrind/callgrind, a cpucacheconfig.xml is loaded with the CPU's configuration. If none is present, a completely generic, non-CPU-specific timing routine is run which figures out cache sizes, configurations and timings and stores them into cpucacheconfig.xml. This is already a great boon for non-x86/x64 architecture support, as right now the defaults are hardcoded rather than being anything close to reality.

The problem, at present, is that VG_(read_millisecond_timer) is the only timing routine I can see. The generic cache configuration detection routine is far more accurate if it is given microsecond-or-better timing. So would it be okay if I add VG_(read_nanosecond_timer) returning a ULong?

VG_(read_millisecond_timer) is currently implemented by calling the clock_gettime syscall directly. It is therefore trivial to convert it to return nanoseconds instead.

Thanks,
Niall
|
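The arithmetic being proposed is simple enough to sketch. The event names mirror cachegrind's counters (Ir, DLmr, DLmw), but the per-event cycle costs below are illustrative placeholders, not values from any real cpucacheconfig.xml:

```c
#include <assert.h>

/* Hypothetical per-event latencies in CPU cycles. Real values would come
   from cpucacheconfig.xml or the cache detection microbenchmark. */
#define CYCLES_IR    1ULL    /* an instruction hitting the L1 cache */
#define CYCLES_DLMR  100ULL  /* a data read missing the last-level cache */
#define CYCLES_DLMW  100ULL  /* a data write missing the last-level cache */

/* Estimate execution time by multiplying each event count by its
   assumed average latency, as the proposal describes. */
static unsigned long long estimate_cycles(unsigned long long ir,
                                          unsigned long long dlmr,
                                          unsigned long long dlmw)
{
   return ir * CYCLES_IR + dlmr * CYCLES_DLMR + dlmw * CYCLES_DLMW;
}
```

A run counting 1000 instructions, 5 read misses and 2 write misses would thus be estimated at 1700 cycles under these placeholder latencies.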
|
From: Niall D. <ndo...@bl...> - 2013-04-10 20:15:58
Attachments:
smime.p7s
|
> > That's exactly the intent. Cachegrind/Callgrind output would simply
> > include the host's cache and memory latencies prepended as comments;
> > if in XML form, it appears as an additional XML stanza. That way it
> > doesn't break any tooling which relies on output to not change.
>
> Cachegrind/Callgrind do not support XML output for profile data (at least
> for now). But it should be quite easy to define a sensible XML format for
> the data (not saying that we want that - I do not see any benefit).

I've since realized that, so I took a different approach to modifying the output (see other post for samples).

> Both the *_annotate scripts and KCachegrind support arbitrarily named
> event types. And as you can see at section 3.1.7.2 on
>
> http://valgrind.org/docs/manual/cl-format.html
>
> the callgrind format allows to specify formulas for derived event types.
> I think this is everything you need: Add a line with e.g.
>
> "event: CycleEst = Ir + 100 DLmr + 100 DLmw"
>
> where 100 is your cycle penalty for LL cache misses.
> Only {K,Q}Cachegrind currently support such lines, but it should not be
> too complex to add that to the *_annotate scripts.

That's *very* useful. You just saved me a raft of code. Thank you.

> Actually, also for Intel, main memory accesses are slow, similar to the
> numbers you quote for the mobile chips.
>
> However, miss penalties are quite different for random and stream access,
> where hardware prefetchers kick in.

GenericCPUConfigDetect mutates the cache line fetch order by bit-reversing the last byte of the cache line index, trying to "randomly" straddle 4Kb pages in order to persuade Intel's prefetchers to stop being so damn clever. So the results I quoted earlier are supposedly a worst-case latency. But I completely agree: under the bonnet, PC memory is no different to mobile memory, albeit usually clocked a bit faster. Intel's prefetchers are really impressive.
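GenericCPUConfigDetect itself is not posted in this thread, so the bit-reversal trick described above can only be reconstructed as a hedged sketch; the idea is that reversing the bits of the low index byte turns a linear walk into a page-straddling access pattern that stride-based prefetchers cannot easily predict:

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the bits of one byte, e.g. 0x01 -> 0x80, 0xC0 -> 0x03. */
static unsigned char bit_reverse8(unsigned char b)
{
   b = (unsigned char)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
   b = (unsigned char)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
   b = (unsigned char)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
   return b;
}

/* Permute a cache line index by bit-reversing its low byte. Consecutive
   logical indices then map to cache lines scattered across the address
   range, defeating sequential hardware prefetch. */
static size_t scramble_index(size_t i)
{
   return (i & ~(size_t)0xFF) | bit_reverse8((unsigned char)(i & 0xFF));
}
```

Names like scramble_index are hypothetical; they illustrate the technique, not Niall's actual code.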
I just bought a Cortex A15 Chromebook, and I look forward to finding out how its much improved prefetchers perform.

> >> Why does this microbenchmark measurement have to be part of the tool?
> >
> > right (mea culpa, I should have read the docs more closely). I have no
> > issue with having cachegrind/callgrind refuse to run without a known
> > good cache config BTW, but that does seem a bit overkill.
>
> Not so sure about that. You said yourself that you wasted much time.

Right now I have it printing a warning saying one should really examine cpucacheconfig.xml for accuracy. But maybe you're right and it should print its configured cache parameters on every program run. That way I wouldn't have lost a week.

> Micro-benchmarks often are very sensitive to what else is going on in the
> system. If the system is loaded, results may be way off.
> It seems better to run that benchmark at a time controlled by the user.

Agreed. As I mentioned in my other post, it loads from a supplied file where possible. I would expect it to be rare that people don't supply their own file, except on Intel where CPUID is useful in user mode code.

Niall
|
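For reference, Josef's derived-event line from earlier in this message would sit in a callgrind output header roughly like this minimal sketch (the 100-cycle LL-miss penalty is his example figure, not a measured value):

```
# callgrind format header (sketch)
version: 1
creator: callgrind
positions: line
events: Ir DLmr DLmw
event: CycleEst = Ir + 100 DLmr + 100 DLmw
```

{K,Q}Cachegrind would then display CycleEst as an additional derived event type, computed from the collected Ir/DLmr/DLmw counts.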
|
From: Rick G. <rcg...@ve...> - 2013-04-11 14:52:05
|
Niall,
An alternative approach that comes to my mind is to collect the hot
paths, save them, and then post-process with the chip specific
instruction timings.
Would that work?
Regards,
Richard
|
|
From: Niall D. <ndo...@bl...> - 2013-04-11 14:45:14
Attachments:
smime.p7s
|
> I've just finished the port of my overall patch to Linux valgrind, so
> please find attached a patch implementing VG_(read_nanosecond_timer).
> It looks like your bug tracker is for bugs and not feature patches, so
> please do mention if you want this patch to go elsewhere.
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: 0002-Separate-patch-for-coregrind-adding-nanosecond-timer.patch
> Type: application/octet-stream
> Size: 3256 bytes
> Desc: not available
Well that's very unhelpful.
Here it is copy and pasted so:
>From 10f7e83c3d91baff94fae3f2ec4b13c7a9812bea Mon Sep 17 00:00:00 2001
From: Niall Douglas <ndo...@ri...>
Date: Wed, 10 Apr 2013 14:47:40 -0400
Subject: [PATCH 2/2] Separate patch for coregrind adding nanosecond timer
resolution.
---
coregrind/m_libcproc.c | 17 +++++++++++------
include/pub_tool_libcproc.h | 7 ++++++-
2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/coregrind/m_libcproc.c b/coregrind/m_libcproc.c
index e801b25..16ebb19 100644
--- a/coregrind/m_libcproc.c
+++ b/coregrind/m_libcproc.c
@@ -608,9 +608,9 @@ Int VG_(fork) ( void )
Timing stuff
------------------------------------------------------------------ */
-UInt VG_(read_millisecond_timer) ( void )
+ULong VG_(read_nanosecond_timer) ( void )
{
- /* 'now' and 'base' are in microseconds */
+ /* 'now' and 'base' are in nanoseconds */
static ULong base = 0;
ULong now;
@@ -620,12 +620,12 @@ UInt VG_(read_millisecond_timer) ( void )
res = VG_(do_syscall2)(__NR_clock_gettime, VKI_CLOCK_MONOTONIC,
(UWord)&ts_now);
if (sr_isError(res) == 0) {
- now = ts_now.tv_sec * 1000000ULL + ts_now.tv_nsec / 1000;
+ now = ts_now.tv_sec * 1000000000ULL + ts_now.tv_nsec;
} else {
struct vki_timeval tv_now;
res = VG_(do_syscall2)(__NR_gettimeofday, (UWord)&tv_now,
(UWord)NULL);
vg_assert(! sr_isError(res));
- now = tv_now.tv_sec * 1000000ULL + tv_now.tv_usec;
+ now = tv_now.tv_sec * 1000000000ULL + tv_now.tv_usec * 1000;
}
}
@@ -638,7 +638,7 @@ UInt VG_(read_millisecond_timer) ( void )
struct vki_timeval tv_now = { 0, 0 };
res = VG_(do_syscall2)(__NR_gettimeofday, (UWord)&tv_now,
(UWord)NULL);
vg_assert(! sr_isError(res));
- now = sr_Res(res) * 1000000ULL + sr_ResHI(res);
+ now = sr_Res(res) * 1000000000ULL + sr_ResHI(res) * 1000;
}
# else
@@ -649,9 +649,14 @@ UInt VG_(read_millisecond_timer) ( void )
if (base == 0)
base = now;
- return (now - base) / 1000;
+ return (now - base);
}
+UInt VG_(read_millisecond_timer) ( void )
+{
+ ULong now = VG_(read_nanosecond_timer)();
+ return (UInt)(now / 1000000ULL);
+}
/* ---------------------------------------------------------------------
atfork()
diff --git a/include/pub_tool_libcproc.h b/include/pub_tool_libcproc.h
index 2ff3f83..a93e609 100644
--- a/include/pub_tool_libcproc.h
+++ b/include/pub_tool_libcproc.h
@@ -81,9 +81,14 @@ extern Int VG_(getegid) ( void );
Timing
------------------------------------------------------------------ */
-// Returns the number of milliseconds passed since the progam started
+// Returns the number of nanoseconds passed since the progam started
// (roughly; it gets initialised partway through Valgrind's initialisation
// steps).
+extern ULong VG_(read_nanosecond_timer) ( void );
+
+// Returns the number of milliseconds passed since the progam started
+// (roughly; it gets initialised partway through Valgrind's initialisation
+// steps). This is VG_(read_nanosecond_timer) divided by one million.
extern UInt VG_(read_millisecond_timer) ( void );
/* ---------------------------------------------------------------------
--
1.7.11.msysgit.1
Niall
|
|
From: Julian S. <js...@ac...> - 2013-04-05 15:46:05
|
> The problem, at present, is that VG_(read_millisecond_timer) is the only
> timing routine I can see. The generic cache configuration detection routine
> is far more accurate if it is given microsecond or better accurate timing.
> So would it be okay if I add VG_(read_nanosecond_timer) returning a ULong?

How do you plan to implement VG_(read_nanosecond_timer)?

J
|
|
From: Niall D. <ndo...@bl...> - 2013-04-05 15:48:34
Attachments:
smime.p7s
|
> > The problem, at present, is that VG_(read_millisecond_timer) is the
> > only timing routine I can see. The generic cache configuration
> > detection routine is far more accurate if it is given microsecond or
> > better accurate timing.
> > So would it be okay if I add VG_(read_nanosecond_timer) returning a
> > ULong?
>
> How do you plan to implement VG_(read_nanosecond_timer) ?
Like this:
/* ---------------------------------------------------------------------
Timing stuff
------------------------------------------------------------------ */
ULong VG_(read_nanosecond_timer) ( void )
{
/* 'now' and 'base' are in nanoseconds */
static ULong base = 0;
ULong now;
# if defined(VGO_linux)
{ SysRes res;
struct vki_timespec ts_now;
res = VG_(do_syscall2)(__NR_clock_gettime, VKI_CLOCK_MONOTONIC,
(UWord)&ts_now);
if (sr_isError(res) == 0) {
now = ts_now.tv_sec * 1000000000ULL + ts_now.tv_nsec;
} else {
struct vki_timeval tv_now;
res = VG_(do_syscall2)(__NR_gettimeofday, (UWord)&tv_now,
(UWord)NULL);
vg_assert(! sr_isError(res));
now = tv_now.tv_sec * 1000000000ULL + tv_now.tv_usec * 1000;
}
}
# elif defined(VGO_darwin)
// Weird: it seems that gettimeofday() doesn't fill in the timeval, but
// rather returns the tv_sec as the low 32 bits of the result and the
// tv_usec as the high 32 bits of the result. (But the timeval cannot be
// NULL!) See bug 200990.
{ SysRes res;
struct vki_timeval tv_now = { 0, 0 };
res = VG_(do_syscall2)(__NR_gettimeofday, (UWord)&tv_now, (UWord)NULL);
vg_assert(! sr_isError(res));
now = sr_Res(res) * 1000000000ULL + sr_ResHI(res) * 1000;
}
# else
# error "Unknown OS"
# endif
/* COMMON CODE */
if (base == 0)
base = now;
return (now - base);
}
UInt VG_(read_millisecond_timer) ( void )
{
ULong now = VG_(read_nanosecond_timer)();
return (UInt)(now / 1000000ULL);
}
|
|
From: Niall D. <ndo...@bl...> - 2013-04-10 20:02:47
|
I've just finished the port of my overall patch to Linux valgrind, so please find attached a patch implementing VG_(read_nanosecond_timer). It looks like your bug tracker is for bugs and not feature patches, so please do mention if you want this patch to go elsewhere.

I'll detail the overall patch to cachegrind in a separate post.

Niall

> -----Original Message-----
> From: Julian Seward [mailto:js...@ac...]
> Sent: 05 April 2013 11:45
> To: Niall Douglas
> Cc: val...@li...
> Subject: Re: [Valgrind-developers] Any objection if I add
> VG_(read_nanosecond_timer) as well as VG_(read_millisecond_timer)?
>
> > The problem, at present, is that VG_(read_millisecond_timer) is the
> > only timing routine I can see. The generic cache configuration
> > detection routine is far more accurate if it is given microsecond or
> > better accurate timing.
> > So would it be okay if I add VG_(read_nanosecond_timer) returning a ULong?
>
> How do you plan to implement VG_(read_nanosecond_timer) ?
>
> J
|
|
From: Josef W. <Jos...@gm...> - 2013-04-05 18:55:55
|
Am 05.04.2013 17:25, schrieb Niall Douglas:
> Dear valgrind developers,
>
> I'm developing an enhancement to cachegrind/callgrind's output - an
> estimated likely execution time log - which can multiply the instruction
> type counts by their average execution time on the target CPU in order to
> generate somewhat more realistic profiling results. This would be highly
> useful to us for ARM targets especially as these have an unusually slow
> main memory relative to other architectures.

If this is mainly about memory, aren't the events collected by cachegrind/callgrind already enough, so that you can calculate what you want in a post-processing step? (KCachegrind does such a thing to come up with a cycle estimation. It would be useful to add that feature also to the *_annotate scripts.)

Taking the instruction types into account may be useful (e.g. add vs. div). I suppose you would add another event type for that to callgrind/cachegrind, something like "core cycle estimation"? What kind of instruction types do you have in mind? How do you get them - from VEX IR or guest machine opcodes?

> At the start of cachegrind/callgrind, a cpucacheconfig.xml is loaded in
> with the CPU's configuration.

So the idea is to have instruction type latencies, cache parameters and miss latencies in this file? Cachegrind/callgrind also has simple branch prediction, and it would be useful to also have a microbenchmark for that ;-)

> If none is present, a completely generic and
> non-CPU specific timing routine is run which figures out cache sizes,
> configurations and timings and stores them into cpucacheconfig.xml. This
> is obviously a great boon for non-x86/x64 architecture support already,
> as right now defaults are hardcoded rather than something closer to
> reality.
>
> The problem, at present, is that VG_(read_millisecond_timer) is the only
> timing routine I can see. The generic cache configuration detection
> routine is far more accurate if it is given microsecond or better
> accurate timing. So would it be okay if I add VG_(read_nanosecond_timer)
> returning a ULong?

Why does this microbenchmark measurement have to be part of the tool? If the config is not available, the tool could error out with a message explaining what to do to generate the config (e.g. run another binary).

Josef

> VG_(read_millisecond_timer) is currently implemented by calling the
> clock_gettime syscall directly. It is therefore trivial to convert this
> to return nanoseconds instead.
>
> Thanks,
> Niall
|
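Since GenericCPUConfigDetect itself is not posted in this thread, here is a hedged sketch of the standard pointer-chase technique such a cache detection microbenchmark typically rests on; all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Build a circular pointer chain over n slots (n a power of two) using
   an odd stride. Any odd stride is coprime to a power of two, so the
   chain visits every slot, in a scrambled, prefetch-hostile order. */
static void build_chain(size_t *next, size_t n, size_t odd_stride)
{
   size_t i, pos = 0;
   for (i = 0; i < n; i++) {
      size_t nxt = (pos + odd_stride) & (n - 1);
      next[pos] = nxt;
      pos = nxt;
   }
}

/* Chase the chain once. Each load depends on the previous one, so
   wrapping this loop in a high-resolution timer and dividing the elapsed
   time by n estimates the average load-to-use latency for a working set
   of n * sizeof(size_t) bytes; sweeping n reveals the cache sizes. */
static size_t chase(const size_t *next, size_t n)
{
   size_t pos = 0, visited = 0;
   (void)n;
   do {
      pos = next[pos];
      visited++;
   } while (pos != 0);
   return visited;   /* equals n when the chain covers every slot */
}
```

Because every load is data-dependent on the last, the CPU cannot overlap the misses, which is what makes the per-access timing meaningful; it is also why such measurements are so sensitive to system load, as Josef notes.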