From: Paul F. <pj...@wa...> - 2020-05-26 14:07:46
> That doesn't sound right. I use DHAT extensively and expect a slowdown of
> perhaps 50:1, maybe less. What you're describing is a slowdown factor of
> at least several thousand.
>
> Bear in mind though that (1) V sequentialises thread execution, which will
> make a big difference if the program is heavily multithreaded, and (2)
> I suspect dhat's scheme of looking up all memory accesses in an AVL tree
> (of malloc'd blocks) doesn't scale all that well if you have tens of
> millions of blocks.
>
> Can you run it on a smaller workload?

Hi

I'll try on something smaller and also get some info on the number of
blocks of memory allocated.

A+
Paul
From: Paul F. <pj...@wa...> - 2020-05-26 16:05:50
> Message du 26/05/20 13:19
> De : "John Reiser"
>
> the ratio is about 1:50. So right away, that's a hardware slowdown of 4X.

Maybe more. The machine has 12Mbyte of cache according to cpuinfo.

> Valgrind runs every tool single-threaded. So if your app averages 5 active
> threads, then that is a slowdown of 5X.

I was running the application in single thread mode.

> Valgrind's JIT (Just-In-Time) instruction emulator has a slowdown. Assume
> 10X (or measure nulgrind.)

Yes, this is what I see with nulgrind, about an 11x slowdown. However this
will also account for a large part of the cache overhead.

> Finally we get to "useful work": the slowdown of the tool DHAT. Assume 3X.
> So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 hours.

What I'm seeing is a DHAT-only slowdown that is much more than that.

A+
Paul
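[Editor's note: the compounding of independent slowdown factors described
above can be sketched as a quick calculation. The factors here are the
illustrative estimates from the message, not measurements.]

```python
from math import prod

# Illustrative slowdown factors from the discussion (estimates, not
# measurements): cache pressure, serialized threads, JIT emulation,
# and DHAT's own per-access bookkeeping.
factors = {
    "cache": 4,    # working set no longer fits in cache
    "threads": 5,  # Valgrind serializes ~5 active threads
    "jit": 10,     # JIT instruction emulation (roughly what nulgrind shows)
    "dhat": 3,     # DHAT's own analysis work
}

total = prod(factors.values())
print(total)  # 600
print(f"10 minutes -> {10 * total / 60:.0f} hours")  # 100 hours
```

Because the factors multiply rather than add, a modest increase in any one
of them (e.g. worse cache behaviour on a huge block count) inflates the
total disproportionately.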
From: John R. <jr...@bi...> - 2020-05-26 16:29:11
> > So (4 * 5 * 10 * 3) is a slowdown of 600X, which turns 10 minutes into 100 hours.
>
> What I'm seeing is a DHAT-only slowdown that is much more than that.
Running 'perf' is likely to give data and strong hints about what is going on.
The overhead of 'perf' is only a few percent.
perf record valgrind --tool=dhat <<valgrind_args>> ./my_app <<my_app_args>>
perf report > perf_output.txt
The "perf record ..." will stop shortly after the valgrind sub-process terminates.
You don't have to wait for DHAT to finish; just 'kill' it after a while.
From: Paul F. <pj...@wa...> - 2020-05-27 20:33:30
|
[snip - perf]
Well, no real surprises. This is with a testcase that runs standalone in about 5 seconds and under DHAT in about 200 seconds (so a reasonable slowdown of 40x).
# Overhead  Command          Shared Object       Symbol
# ........  ...............  ..................  ........................
#
29.11% dhat-amd64-linu dhat-amd64-linux [.] interval_tree_Cmp
21.13% dhat-amd64-linu perf-26905.map [.] 0x00000010057a25f8
13.32% dhat-amd64-linu dhat-amd64-linux [.] vgPlain_lookupFM
9.56% dhat-amd64-linu dhat-amd64-linux [.] dh_handle_read
8.83% dhat-amd64-linu dhat-amd64-linux [.] vgPlain_nextIterFM
4.66% dhat-amd64-linu dhat-amd64-linux [.] check_for_peak
1.85% dhat-amd64-linu dhat-amd64-linux [.] vgPlain_disp_cp_xindir
1.32% dhat-amd64-linu [kernel.kallsyms] [k] 0xffffffff8103ec0a
1.00% dhat-amd64-linu dhat-amd64-linux [.] dh_handle_write
A+
Paul
From: John R. <jr...@bi...> - 2020-05-27 21:26:56
On 5/27/20 Paul FLOYD wrote:
> Well, no real surprises. This is with a testcase that runs standalone in
> about 5 seconds and under DHAT in about 200 seconds (so a reasonable
> slowdown of 40x).
[snip - perf output]

To me this suggests two things:

1) investigate the coding of the 4 or 5 highest-use subroutines
   (interval_tree_Cmp, vgPlain_lookupFM, dh_handle_read, vgPlain_nextIterFM)

2) see whether DHAT might recognize and use higher-level abstractions than
   MemoryRead and MemoryWrite of individual addresses. Similar to memcheck
   intercepting and analyzing strlen (etc.) as a complete concept instead of
   as its individual Reads and Writes, perhaps DHAT could intercept (and/or
   recognize) vector linear search, vector addition, vector partial sum,
   other BLAS routines, etc., and then analyze the algorithm as a whole.
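[Editor's note: point (1) can be made concrete with a back-of-the-envelope
model. DHAT consults its structure of live malloc'd blocks on every memory
access, so total work grows roughly as accesses x log(blocks). The sketch
below is a hypothetical model (a hand-rolled binary search over a sorted
array, not DHAT's actual AVL-tree code) that counts comparisons per access.]

```python
# Hypothetical model of a per-access block lookup (illustrative only):
# each memory access searches the set of live blocks, costing O(log n)
# comparisons, which matches interval_tree_Cmp dominating the profile.

comparisons = 0

def lookup(starts, addr):
    """Find the last block whose start is <= addr, counting comparisons."""
    global comparisons
    lo, hi = 0, len(starts)
    while lo < hi:  # manual binary search so we can count comparisons
        mid = (lo + hi) // 2
        comparisons += 1
        if starts[mid] <= addr:
            lo = mid + 1
        else:
            hi = mid
    return lo - 1  # index of the candidate block, or -1 if none

# One million live blocks of 64 bytes each at synthetic addresses.
starts = [i * 64 for i in range(1_000_000)]

# Simulate 100k accesses; a real workload performs billions, so even
# ~20 comparisons per access compounds into enormous overhead.
n_accesses = 100_000
for addr in range(0, n_accesses * 64, 64):
    lookup(starts, addr)

print(comparisons / n_accesses)  # ~20 comparisons per access (log2 of 1e6)
```

This is why point (2) matters: recognizing a whole vector operation once is
far cheaper than paying a logarithmic lookup on every individual access.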