|
From: Belmon, S. <sb...@we...> - 2009-09-01 17:22:35
|
When Massif culls "old" snapshots, it drops about half of them, and then resets its period to the MIN() of the inter-snapshots times.
For long runs, that has somewhat unexpected consequences. You can easily end up with 75 snapshots that are about a minute apart, and then have the last 25 about 20 seconds apart for example (or worse). Graphs which don't look at the "time=X" can then be a bit misleading -- and that includes ms_print's I believe, Martijn's massifplot, and I'd expect most homegrown tools.
You could say RTFM of course, but it might be more desirable, at least when the inter-snapshot time is "long", to only decimate (cull 10% of snapshots). I try to do something like that in the following, YMMV. A couple of things had to move, the diff is "morally" a lot smaller than it looks.
It might be better to have an explicit option to set the decimation fraction (which dictates the frequency). It might also be better to take an average of the time between existing snapshots, rather than the MIN() -- which makes it very likely these new snapshots will be among the first ones to be discarded at the next culling.
A mighty fine tool!
diff --git a/massif/ms_main.c b/massif/ms_main.c
index a6062fc..530b461 100644
--- a/massif/ms_main.c
+++ b/massif/ms_main.c
@@ -220,7 +220,10 @@ Number of snapshots: 50
VG_(message)(Vg_DebugMsg, "Massif: " format, ##args); \
}
-
+// Time is measured either in i or ms or bytes, depending on the --time-unit
+// option. It's a Long because it can exceed 32-bits reasonably easily, and
+// because we need to allow negative values to represent unset times.
+typedef Long Time;
//------------------------------------------------------------//
//--- Statistics ---//
@@ -283,6 +286,13 @@ static ULong total_allocs_deallocs_szB = 0;
//
static Bool have_started_executing_code = False;
+// 'current_min_time_interval' is the minimum time interval between snapshots.
+// If we try to take a snapshot and less than this much time has passed,
+// we don't take it. It gets larger as the program runs longer. It's
+// initialised to zero so that we begin by taking snapshots as quickly as
+// possible.
+static Time current_min_time_interval = 0;
+
//------------------------------------------------------------//
//--- Alloc fns ---//
//------------------------------------------------------------//
@@ -966,11 +976,6 @@ static void update_XCon(XPt* xpt, SSizeT space_delta)
// limit again, we again cull and then take them even more slowly, and so
// on.
-// Time is measured either in i or ms or bytes, depending on the --time-unit
-// option. It's a Long because it can exceed 32-bits reasonably easily, and
-// because we need to allow negative values to represent unset times.
-typedef Long Time;
-
#define UNUSED_SNAPSHOT_TIME -333 // A conspicuous negative number.
typedef
@@ -1091,12 +1096,37 @@ static void VERB_snapshot(Int verbosity, Char* prefix, Int i)
);
}
+// Given the current conditions, returns the fraction of snapshots to
+// cull (i.e. returns 19 to mean "cull 1/10th")
+//
+// Initially we discard N/2 snapshots, but when the
+// inter-snapshot-time grows large enough, we can afford to cull
+// fewer. The rationale being that, when snapshots are taken once a
+// minute, we can afford to cull every tenth snapshot to get a more
+// uniform distribution.
+//
+// A non-uniform distribution can give misleading graphs (from
+// ms_print or tools like massifplot), unless they do the right thing
+// and use an X-Y plot type (being ASCII art, ms_print would be hard-
+// pressed to do this anyway).
+
+static Int fraction_of_snapshots_to_discard(void)
+{
+ if ((clo_time_unit == TimeMS && current_min_time_interval > 10000) ||
+ (clo_time_unit == TimeI && current_min_time_interval > 10000000000LL))
+ {
+ return 10;
+ }
+
+ return 2; // also TimeB case, used in short-lived programs anyway.
+}
+
// Cull half the snapshots; we choose those that represent the smallest
// time-spans, because that gives us the most even distribution of snapshots
// over time. (It's possible to lose interesting spikes, however.)
//
// Algorithm for N snapshots: We find the snapshot representing the smallest
-// timeframe, and remove it. We repeat this until (N/2) snapshots are gone.
+// timeframe, and remove it. We repeat this until e.g. (N/2) snapshots are gone.
// We have to do this one snapshot at a time, rather than finding the (N/2)
// smallest snapshots in one hit, because when a snapshot is removed, its
// neighbours immediately cover greater timespans. So it's O(N^2), but N is
@@ -1109,6 +1139,7 @@ static UInt cull_snapshots(void)
Int i, jp, j, jn, min_timespan_i;
Int n_deleted = 0;
Time min_timespan;
+ Int fraction_to_discard = fraction_of_snapshots_to_discard();
n_cullings++;
@@ -1122,7 +1153,7 @@ static UInt cull_snapshots(void)
// First we remove enough snapshots by clearing them in-place. Once
// that's done, we can slide the remaining ones down.
- for (i = 0; i < clo_max_snapshots/2; i++) {
+ for (i = 0; i < (1+clo_max_snapshots/fraction_to_discard); i++) {
// Find the snapshot representing the smallest timespan. The timespan
// for snapshot n = d(N-1,N)+d(N,N+1), where d(A,B) is the time between
// snapshot A and B. We don't consider the first and last snapshots for
@@ -1300,12 +1331,6 @@ take_snapshot(Snapshot* snapshot, SnapshotKind kind, Time time,
static void
maybe_take_snapshot(SnapshotKind kind, Char* what)
{
- // 'min_time_interval' is the minimum time interval between snapshots.
- // If we try to take a snapshot and less than this much time has passed,
- // we don't take it. It gets larger as the program runs longer. It's
- // initialised to zero so that we begin by taking snapshots as quickly as
- // possible.
- static Time min_time_interval = 0;
// Zero allows startup snapshot.
static Time earliest_possible_time_of_next_snapshot = 0;
static Int n_snapshots_since_last_detailed = 0;
@@ -1390,11 +1415,11 @@ maybe_take_snapshot(SnapshotKind kind, Char* what)
// Cull the entries, if our snapshot table is full.
next_snapshot_i++;
if (clo_max_snapshots == next_snapshot_i) {
- min_time_interval = cull_snapshots();
+ current_min_time_interval = cull_snapshots();
}
// Work out the earliest time when the next snapshot can happen.
- earliest_possible_time_of_next_snapshot = time + min_time_interval;
+ earliest_possible_time_of_next_snapshot = time + current_min_time_interval;
}
--
Stephane Belmon
Protected by Websense Hosted Email Security -- www.websense.com
|
|
From: Nicholas N. <n.n...@gm...> - 2009-09-01 22:21:19
|
On Wed, Sep 2, 2009 at 3:20 AM, Belmon, Stephane<sb...@we...> wrote: > When Massif culls "old" snapshots, it drops about half of them, and then resets its period to the MIN() of the inter-snapshots times. > > For long runs, that has somewhat unexpected consequences. You can easily end up with 75 snapshots that are about a minute apart, and then have the last 25 about 20 seconds apart for example (or worse). Graphs which don't look at the "time=X" can then be a bit misleading -- and that includes ms_print's I believe, Martijn's massifplot, and I'd expect most homegrown tools. Hmm, interesting, I didn't realise that could happen. Would you mind filing a bug for this? (http://www.valgrind.org/support/bug_reports.html) Nick |
|
From: Belmon, S. <sb...@we...> - 2009-09-01 23:20:35
|
Added, bug # 205937. -- Stephane Belmon -----Original Message----- From: Nicholas Nethercote [mailto:n.n...@gm...] Sent: Tuesday, September 01, 2009 3:21 PM To: Belmon, Stephane Cc: val...@li... Subject: Re: [Valgrind-users] Uniformity of inter-snapshot times in Massif On Wed, Sep 2, 2009 at 3:20 AM, Belmon, Stephane<sb...@we...> wrote: > When Massif culls "old" snapshots, it drops about half of them, and then resets its period to the MIN() of the inter-snapshots times. > > For long runs, that has somewhat unexpected consequences. You can easily end up with 75 snapshots that are about a minute apart, and then have the last 25 about 20 seconds apart for example (or worse). Graphs which don't look at the "time=X" can then be a bit misleading -- and that includes ms_print's I believe, Martijn's massifplot, and I'd expect most homegrown tools. Hmm, interesting, I didn't realise that could happen. Would you mind filing a bug for this? (http://www.valgrind.org/support/bug_reports.html) Nick Protected by Websense Hosted Email Security -- www.websense.com |