From: Shailabh N. <na...@wa...> - 2005-12-07 22:08:24
|
The following patches add accounting for the delays seen by a task in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

There are quite a few differences from the earlier posting of these patches
(http://www.uwsg.indiana.edu/hypermail/linux/kernel/0511.1/2275.html):

- block I/O is (hopefully) being accounted properly now instead of just
  counting the time spent in io_schedule() as done earlier.
- instead of accounting for time spent in all page faults, only swapping in
  of pages is being counted, since that's the only part one can really
  control (capacity misses vs. compulsory misses)
- a /proc interface is being used instead of a connector-based interface.
  Andrew Morton suggested a generic connector-based interface useful for
  future usage of connectors for stats. This revised connector-based
  interface will be posted separately since it's useful for efficient
  delivery of any per-task statistics, not just the ones being introduced
  by these patches.
- the timestamping code has been made generic (following the suggestions to
  Matt Helsley's patches to add timestamps to process events connectors)

More comments in individual patches.

Series

nstimestamp-diff.patch
delayacct-init.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-procfs.patch
|
From: Shailabh N. <na...@wa...> - 2006-01-03 23:43:06
|
Forwarding again as this patch didn't make it to lse-tech, ckrm-tech and elsa-devel. Please include Andrew Morton and lkml on replies.

-- Shailabh

-------- Original Message --------
Subject: [Patch 0/6] Per-task delay accounting
Date: Tue, 03 Jan 2006 23:16:40 +0000
From: Shailabh Nagar <na...@wa...>
To: Andrew Morton <ak...@os...>, linux-kernel <lin...@vg...>
CC: elsa-devel <els...@li...>, LSE <lse...@li...>, ckrm-tech <ckr...@li...>

Andrew,

Could you please consider these patches for inclusion in -mm? The comments from earlier postings of these patches have been addressed, including the one you made about making the connector interface generic (more about that in the connector patch).

Thanks,
Shailabh

The following patches add accounting for the delays seen by tasks in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

The major change since the previous posting of these patches (http://www.ussg.iu.edu/hypermail/linux/kernel/0512.0/2152.html) is the resurrection of the connector interface (in addition to /proc) and, as part of the same patch, the ability to get stats per-tgid in addition to per-pid.

More comments in individual patches.

Series

nstimestamp-diff.patch
delayacct-init.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-connector.patch
|
From: Shailabh N. <na...@wa...> - 2006-02-27 07:56:46
|
The following patches add accounting for the delays seen by tasks in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

The major changes since the previous posting of these patches are

- use of the new generic netlink interface (NETLINK_GENERIC family) with
  provision for reuse by other (non-delay accounting) kernel components
- sysctl option for turning delay accounting collection on/off dynamically
- similar sysctl option for schedstats. Delay accounting leverages
  schedstats code for cpu delays.
- dynamic allocation of delay accounting structures

More comments in individual patches. Please give feedback.

--Shailabh

Series

nstimestamp-diff.patch
schedstats-sysctl.patch
delayacct-setup.patch
delayacct-sysctl.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-genetlink.patch
|
From: Shailabh N. <na...@wa...> - 2006-02-27 08:02:56
|
nstimestamp_diff.patch

Add kernel utility function for measuring the interval (diff)
between two timespec values, adjusting for overflow.

Signed-off-by: Shailabh Nagar <na...@wa...>

 include/linux/time.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

Index: linux-2.6.16-rc4/include/linux/time.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/time.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/time.h	2006-02-27 01:52:49.000000000 -0500
@@ -147,6 +147,20 @@ extern struct timespec ns_to_timespec(co
  */
 extern struct timeval ns_to_timeval(const nsec_t nsec);
 
+/*
+ * timespec_diff_ns - Return difference of two timestamps in nanoseconds
+ * In the rare case of @end being earlier than @start, return zero
+ */
+static inline nsec_t timespec_diff_ns(struct timespec *start, struct timespec *end)
+{
+	nsec_t ret;
+
+	ret = (nsec_t)(end->tv_sec - start->tv_sec)*NSEC_PER_SEC;
+	ret += (nsec_t)(end->tv_nsec - start->tv_nsec);
+	if (ret < 0)
+		return 0;
+	return ret;
+}
 #endif /* __KERNEL__ */
 
 #define NFDBITS			__NFDBITS
|
From: Shailabh N. <na...@wa...> - 2006-02-27 08:12:37
|
schedstats-sysctl.patch

Add sysctl option for controlling schedstats collection
dynamically. Delay accounting leverages schedstats for
cpu delay statistics.

Signed-off-by: Chandra Seetharaman <sek...@us...>
Signed-off-by: Shailabh Nagar <na...@wa...>

 Documentation/kernel-parameters.txt |    2 
 include/linux/sched.h               |    4 +
 include/linux/sysctl.h              |    1 
 kernel/sched.c                      |   74 +++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                     |   10 ++++
 lib/Kconfig.debug                   |    6 +-
 6 files changed, 90 insertions(+), 7 deletions(-)

Index: linux-2.6.16-rc4/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/sched.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/sched.h	2006-02-27 01:52:52.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/cpumask.h>
 #include <linux/errno.h>
 #include <linux/nodemask.h>
+#include <linux/sysctl.h>
 
 #include <asm/system.h>
 #include <asm/semaphore.h>
@@ -525,6 +526,9 @@ struct backing_dev_info;
 struct reclaim_state;
 
 #ifdef CONFIG_SCHEDSTATS
+extern int schedstats_sysctl;
+extern int schedstats_sysctl_handler(ctl_table *, int, struct file *,
+			void __user *, size_t *, loff_t *);
 struct sched_info {
 	/* cumulative counters */
 	unsigned long	cpu_time,	/* time spent on the cpu */

Index: linux-2.6.16-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/sysctl.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/sysctl.h	2006-02-27 01:52:52.000000000 -0500
@@ -146,6 +146,7 @@ enum
 	KERN_RANDOMIZE=68, /* int: randomize virtual address space */
 	KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
 	KERN_SPIN_RETRY=70,	/* int: number of spinlock retries */
+	KERN_SCHEDSTATS=71,	/* int: Schedstats on/off */
 };

Index: linux-2.6.16-rc4/kernel/sched.c
===================================================================
--- linux-2.6.16-rc4.orig/kernel/sched.c	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/kernel/sched.c	2006-02-27 01:52:52.000000000 -0500
@@ -382,11 +382,56 @@ static inline void task_rq_unlock(runque
 }
 
 #ifdef CONFIG_SCHEDSTATS
+
+int schedstats_sysctl = 0;	/* schedstats turned off by default */
+static DEFINE_PER_CPU(int, schedstats) = 0;
+
+static void schedstats_set(int val)
+{
+	int i;
+	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;
+
+	spin_lock(&schedstats_lock);
+	schedstats_sysctl = val;
+	for (i = 0; i < NR_CPUS; i++)
+		per_cpu(schedstats, i) = val;
+	spin_unlock(&schedstats_lock);
+}
+
+static int __init schedstats_setup_enable(char *str)
+{
+	schedstats_sysctl = 1;
+	schedstats_set(schedstats_sysctl);
+	return 1;
+}
+
+__setup("schedstats", schedstats_setup_enable);
+
+int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret, prev = schedstats_sysctl;
+	struct task_struct *g, *t;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+	if ((ret != 0) || (prev == schedstats_sysctl))
+		return ret;
+	if (schedstats_sysctl) {
+		read_lock(&tasklist_lock);
+		do_each_thread(g, t) {
+			memset(&t->sched_info, 0, sizeof(t->sched_info));
+		} while_each_thread(g, t);
+		read_unlock(&tasklist_lock);
+	}
+	schedstats_set(schedstats_sysctl);
+	return ret;
+}
+
 /*
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 12
+#define SCHEDSTAT_VERSION 13
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -394,6 +439,10 @@ static int show_schedstat(struct seq_fil
 
 	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
 	seq_printf(seq, "timestamp %lu\n", jiffies);
+	if (!schedstats_sysctl) {
+		seq_printf(seq, "State Off\n");
+		return 0;
+	}
 	for_each_online_cpu(cpu) {
 		runqueue_t *rq = cpu_rq(cpu);
 #ifdef CONFIG_SMP
@@ -472,8 +521,17 @@ struct file_operations proc_schedstat_op
 	.release = single_release,
 };
 
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
+#define schedstats_on	(per_cpu(schedstats, smp_processor_id()) != 0)
+#define schedstat_inc(rq, field)			\
+do {							\
+	if (unlikely(schedstats_on))			\
+		(rq)->field++;				\
+} while (0)
+#define schedstat_add(rq, field, amt)			\
+do {							\
+	if (unlikely(schedstats_on))			\
+		(rq)->field += (amt);			\
+} while (0)
 #else /* !CONFIG_SCHEDSTATS */
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
@@ -556,7 +614,7 @@ static void sched_info_arrive(task_t *t)
  */
 static inline void sched_info_queued(task_t *t)
 {
-	if (!t->sched_info.last_queued)
+	if (unlikely(schedstats_on && !t->sched_info.last_queued))
 		t->sched_info.last_queued = jiffies;
 }
 
@@ -580,7 +638,7 @@ static inline void sched_info_depart(tas
  * their time slice.  (This may also be called when switching to or from
  * the idle task.)  We are only called when prev != next.
  */
-static inline void sched_info_switch(task_t *prev, task_t *next)
+static inline void __sched_info_switch(task_t *prev, task_t *next)
 {
 	struct runqueue *rq = task_rq(prev);
 
@@ -595,6 +653,12 @@ static inline void sched_info_switch(tas
 	if (next != rq->idle)
 		sched_info_arrive(next);
 }
+
+static inline void sched_info_switch(task_t *prev, task_t *next)
+{
+	if (unlikely(schedstats_on))
+		__sched_info_switch(prev, next);
+}
 #else
 #define sched_info_queued(t)		do { } while (0)
 #define sched_info_switch(t, next)	do { } while (0)

Index: linux-2.6.16-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc4.orig/kernel/sysctl.c	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/kernel/sysctl.c	2006-02-27 01:52:52.000000000 -0500
@@ -656,6 +656,16 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#if defined(CONFIG_SCHEDSTATS)
+	{
+		.ctl_name	= KERN_SCHEDSTATS,
+		.procname	= "schedstats",
+		.data		= &schedstats_sysctl,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &schedstats_sysctl_handler,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.16-rc4/lib/Kconfig.debug
===================================================================
--- linux-2.6.16-rc4.orig/lib/Kconfig.debug	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/lib/Kconfig.debug	2006-02-27 01:52:52.000000000 -0500
@@ -67,15 +67,17 @@ config DETECT_SOFTLOCKUP
 
 config SCHEDSTATS
 	bool "Collect scheduler statistics"
-	depends on DEBUG_KERNEL && PROC_FS
+	depends on PROC_FS
 	help
 	  If you say Y here, additional code will be inserted into the
 	  scheduler and related routines to collect statistics about
 	  scheduler behavior and provide them in /proc/schedstat.  These
-	  stats may be useful for both tuning and debugging the scheduler
+	  stats may be useful for both tuning and debugging the scheduler.
 	  If you aren't debugging the scheduler or trying to tune a specific
 	  application, you can say N to avoid the very slight overhead
 	  this adds.
+	  Schedstats collection, and most of its overhead, can also be
+	  controlled dynamically through the schedstats sysctl.
 
 config DEBUG_SLAB
 	bool "Debug memory allocations"

Index: linux-2.6.16-rc4/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.16-rc4.orig/Documentation/kernel-parameters.txt	2006-02-27 01:19:52.000000000 -0500
+++ linux-2.6.16-rc4/Documentation/kernel-parameters.txt	2006-02-27 01:52:52.000000000 -0500
@@ -1333,6 +1333,8 @@ running once the system is up.
 	sc1200wdt=	[HW,WDT] SC1200 WDT (watchdog) driver
 			Format: <io>[,<timeout>[,<isapnp>]]
 
+	schedstats	[KNL] Collect CPU scheduler statistics
+
 	scsi_debug_*=	[SCSI] See drivers/scsi/scsi_debug.c.
|
From: Ingo M. <mi...@el...> - 2006-02-27 08:53:29
|
the principle looks OK to me, just a few minor nits:

> #ifdef CONFIG_SCHEDSTATS
> +
> +int schedstats_sysctl = 0;	/* schedstats turned off by default */

no need to initialize to 0.

> +static DEFINE_PER_CPU(int, schedstats) = 0;

ditto.

> +
> +static void schedstats_set(int val)
> +{
> +	int i;
> +	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;

move the spinlock out of the function and use DEFINE_SPINLOCK. (But ... see below for a suggestion to get rid of this lock altogether.)

> +	spin_lock(&schedstats_lock);
> +	schedstats_sysctl = val;
> +	for (i = 0; i < NR_CPUS; i++)
> +		per_cpu(schedstats, i) = val;
> +	spin_unlock(&schedstats_lock);
> +}
> +
> +static int __init schedstats_setup_enable(char *str)
> +{
> +	schedstats_sysctl = 1;
> +	schedstats_set(schedstats_sysctl);
> +	return 1;
> +}
> +
> +__setup("schedstats", schedstats_setup_enable);
> +
> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret, prev = schedstats_sysctl;
> +	struct task_struct *g, *t;
> +
> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +	if ((ret != 0) || (prev == schedstats_sysctl))
> +		return ret;
> +	if (schedstats_sysctl) {
> +		read_lock(&tasklist_lock);
> +		do_each_thread(g, t) {
> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
> +		} while_each_thread(g, t);
> +		read_unlock(&tasklist_lock);
> +	}
> +	schedstats_set(schedstats_sysctl);

why not just introduce a schedstats_lock mutex, and acquire it for both the 'if (schedstats_sysctl)' line and the schedstats_set() line. That will make the locking meaningful: two parallel sysctl ops will be atomic to each other. [right now they won't be, and they can clear schedstat data in parallel -> not a big problem, but it makes schedstats_lock rather meaningless]

> -#define SCHEDSTAT_VERSION 12
> +#define SCHEDSTAT_VERSION 13
>
>  static int show_schedstat(struct seq_file *seq, void *v)
>  {
> @@ -394,6 +439,10 @@ static int show_schedstat(struct seq_fil
>
>  	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
>  	seq_printf(seq, "timestamp %lu\n", jiffies);
> +	if (!schedstats_sysctl) {
> +		seq_printf(seq, "State Off\n");
> +		return 0;
> +	}

and show_schedstat() should then also take the schedstats_lock mutex.

	Ingo
|
From: Balbir S. <ba...@in...> - 2006-02-27 10:46:45
|
<snip>
> why not just introduce a schedstats_lock mutex, and acquire it for both
> the 'if (schedstats_sysctl)' line and the schedstats_set() line. That
> will make the locking meaningful: two parallel sysctl ops will be atomic
> to each other. [right now they wont be and they can clear schedstat data
> in parallel -> not a big problem but it makes schedstats_lock rather
> meaningless]

Ingo,

Can sysctls run in parallel? sys_sysctl() protects the call to do_sysctl() with the BKL (lock_kernel/unlock_kernel).

Am I missing something?

Balbir
|
From: Arjan v. de V. <ar...@in...> - 2006-02-27 12:19:12
|
On Mon, 2006-02-27 at 16:16 +0530, Balbir Singh wrote:
> <snip>
> > why not just introduce a schedstats_lock mutex, and acquire it for both
> > the 'if (schedstats_sysctl)' line and the schedstats_set() line. That
> > will make the locking meaningful: two parallel sysctl ops will be atomic
> > to each other. [right now they wont be and they can clear schedstat data
> > in parallel -> not a big problem but it makes schedstats_lock rather
> > meaningless]
>
> Ingo,
>
> Can sysctl's run in parallel? sys_sysctl() is protects the call
> to do_sysctl() with BKL (lock_kernel/unlock_kernel).
>
> Am I missing something?

your sysctl functions sleep. the BKL is useless in the light of sleeping code...
|
From: Balbir S. <ba...@in...> - 2006-02-27 12:29:59
|
> your sysctl functions sleep. the BKL is useless in the light of sleeping
> code...

But wouldn't all sysctls potentially sleep (on account of copying data from the user)?

Thanks for clarifying,
Balbir
|
From: Arjan v. de V. <ar...@in...> - 2006-02-27 13:07:31
|
On Mon, 2006-02-27 at 17:59 +0530, Balbir Singh wrote:
> > your sysctl functions sleep. the BKL is useless in the light of sleeping
> > code...
>
> But wouldn't all sysctls potentially sleep (on account of copying data from
> the user).

.. I'm not the one saying the BKL was useful... you were ;)
|
From: Balbir S. <ba...@in...> - 2006-02-27 16:16:47
|
> > But wouldn't all sysctls potentially sleep (on account of copying data from
> > the user).
>
> .. I'm not the one saying the BKL was useful... you were ;)

My tiny mind must have been confused by the presence of the code, which I presumed would be useful. I guess that is not always the case :-)

Balbir
|
From: Nick P. <nic...@ya...> - 2006-02-27 09:17:52
|
Shailabh Nagar wrote:

> schedstats-sysctl.patch
>
> Add sysctl option for controlling schedstats collection
> dynamically. Delay accounting leverages schedstats for
> cpu delay statistics.

I'd sort of rather not tie this in with schedstats if possible. Schedstats adds a reasonable amount of cache footprint and branches in hot paths. Most of the schedstats stuff is something that hardly anyone will use.

Sure, you can share common code though...

> Index: linux-2.6.16-rc4/kernel/sched.c
> ===================================================================
> --- linux-2.6.16-rc4.orig/kernel/sched.c	2006-02-27 01:20:04.000000000 -0500
> +++ linux-2.6.16-rc4/kernel/sched.c	2006-02-27 01:52:52.000000000 -0500
> @@ -382,11 +382,56 @@ static inline void task_rq_unlock(runque
>  }
>
>  #ifdef CONFIG_SCHEDSTATS
> +
> +int schedstats_sysctl = 0;	/* schedstats turned off by default */

Should be read mostly.

> +static DEFINE_PER_CPU(int, schedstats) = 0;
> +

When the above is in the read mostly section, you won't need this at all. You don't intend to switch the sysctl with great frequency, do you?

> +static void schedstats_set(int val)
> +{
> +	int i;
> +	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;
> +
> +	spin_lock(&schedstats_lock);
> +	schedstats_sysctl = val;
> +	for (i = 0; i < NR_CPUS; i++)
> +		per_cpu(schedstats, i) = val;
> +	spin_unlock(&schedstats_lock);
> +}
> +
> +static int __init schedstats_setup_enable(char *str)
> +{
> +	schedstats_sysctl = 1;
> +	schedstats_set(schedstats_sysctl);
> +	return 1;
> +}
> +
> +__setup("schedstats", schedstats_setup_enable);
> +
> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret, prev = schedstats_sysctl;
> +	struct task_struct *g, *t;
> +
> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +	if ((ret != 0) || (prev == schedstats_sysctl))
> +		return ret;
> +	if (schedstats_sysctl) {
> +		read_lock(&tasklist_lock);
> +		do_each_thread(g, t) {
> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
> +		} while_each_thread(g, t);
> +		read_unlock(&tasklist_lock);
> +	}
> +	schedstats_set(schedstats_sysctl);

You don't clear the rq's schedstats stuff here.

And clearing this at all is not really needed for the schedstats interface. You have a timestamp and a set of accumulated values, so it is easy to work out deltas. So do you need this?

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com
|
From: Shailabh N. <na...@wa...> - 2006-02-27 09:41:48
|
Nick Piggin wrote:

<snip>

>> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
>> +		void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	int ret, prev = schedstats_sysctl;
>> +	struct task_struct *g, *t;
>> +
>> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
>> +	if ((ret != 0) || (prev == schedstats_sysctl))
>> +		return ret;
>> +	if (schedstats_sysctl) {
>> +		read_lock(&tasklist_lock);
>> +		do_each_thread(g, t) {
>> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
>> +		} while_each_thread(g, t);
>> +		read_unlock(&tasklist_lock);
>> +	}
>> +	schedstats_set(schedstats_sysctl);
>
> You don't clear the rq's schedstats stuff here.

Good point.

> And clearing this at all is not really needed for the schedstats
> interface. You have a timestamp and a set of accumulated values, so it
> is easy to work out deltas. So do you need this?

Not clearing the stats will mean userspace has to distinguish between the tasks that are hanging around from before the last turn-off and the ones created afterwards. Any delta taken across an interval where schedstats was turned off will give the impression that a task was sleeping during the interval (and hence show it had a lesser average wait time than it might actually have experienced).

--Shailabh
|
From: Nick P. <nic...@ya...> - 2006-02-27 12:28:30
|
Shailabh Nagar wrote:

> Nick Piggin wrote:
>
>> And clearing this at all is not really needed for the schedstats
>> interface. You have a timestamp and a set of accumulated values, so it
>> is easy to work out deltas. So do you need this?
>
> Not clearing the stats will mean userspace has to distinguish between
> the tasks that are hanging around from before the last turn off, and the
> ones created afterwards. Any delta taken across an interval where
> schedstats was turned off will give the impression a task was sleeping
> during the interval (and hence show it had a lesser average wait time
> than it might actually have experienced).

Presumably a delta taken across an interval where schedstats was turned off would be rather inaccurate, no matter what.

--
SUSE Labs, Novell Inc.
|
From: Chandra S. <sek...@us...> - 2006-02-27 19:09:54
|
On Mon, 2006-02-27 at 20:17 +1100, Nick Piggin wrote:
> > #ifdef CONFIG_SCHEDSTATS
> > +
> > +int schedstats_sysctl = 0;	/* schedstats turned off by default */
>
> Should be read mostly.
>
> > +static DEFINE_PER_CPU(int, schedstats) = 0;
> > +
>
> When the above is in the read mostly section, you won't need this at all.
>
> You don't intend to switch the sysctl with great frequency, do you?

No, it is not expected to switch often. We originally coded it as __read_mostly, but thought the variable bouncing between CPUs would be costly. Is it cheaper with __read_mostly? Or does it not matter?

--
----------------------------------------------------------------------
Chandra Seetharaman  | Be careful what you choose....
- sek...@us...       | .......you may get it.
----------------------------------------------------------------------
|
From: Balbir S. <ba...@in...> - 2006-03-07 17:27:16
|
On Mon, Feb 27, 2006 at 08:17:47PM +1100, Nick Piggin wrote:
> Shailabh Nagar wrote:
> >schedstats-sysctl.patch
> >
> >Add sysctl option for controlling schedstats collection
> >dynamically. Delay accounting leverages schedstats for
> >cpu delay statistics.
> >
>
> I'd sort of rather not tie this in with schedstats if possible.
> Schedstats adds a reasonable amount of cache footprint and
> branches in hot paths. Most of schedstats stuff is something that
> hardly anyone will use.
>
> Sure you can share common code though...

This patch refines scheduler statistics collection and display to three levels: tasks, runqueues and sched domains. CONFIG_SCHEDSTATS now requires a boot time option, in the form of "schedstats" or "schedstats=<level>", to display scheduler statistics. All levels can be enabled together, for complete statistics, by passing "all" as a boot time option.

schedstat_inc has been split into schedstat_rq_inc and schedstat_sd_inc, each of which checks whether rq or sd statistics gathering is enabled. schedstat_add has been changed to schedstat_sd_add; it checks whether sd statistics gathering is on prior to gathering the statistics. Similar changes have been made for task schedstats gathering.

The output of /proc/schedstat, /proc/<pid>/schedstat and /proc/<pid>/task/*/schedstat has been modified to print a gentle message suggesting that statistics gathering is off. Also, a header "statistics for cpuXXX" has been added (for each cpu) to the /proc/schedstat output.

This patch is motivated by the comments above about sharing code with CONFIG_SCHEDSTATS without incurring the complete overhead of the entire CONFIG_SCHEDSTATS code.

Testing
=======

a) booted with schedstats (schedstats=all)
------------------------------------------
cat /proc/schedstat
version 13
timestamp 4294922919

statistics for cpu0

cpu0 10 10 132 142 240 97775 24181 48251 45935 5664 2376 73594
domain0 00000003 24464 24242 162 224 62 4 0 24242 149 148 0 1 1 0 0 148 24290 24117 64 173 109 0 0 24117 0 0 0 0 0 0 0 0 0 4392 835 0

statistics for cpu1

cpu1 3 3 180 14504 387 50735 6520 15430 11035 4195 2376 44215
domain0 00000003 25870 25203 107 695 588 3 0 25203 198 198 0 0 0 0 0 198 6608 6428 91 184 93 0 0 6428 0 0 0 0 0 0 0 0 0 2316 307 0

cat /proc/1/schedstat
506 34 1102

b) booted with schedstats=tasks
-------------------------------
cat /proc/schedstat
version 13
timestamp 4294937241
runqueue and scheddomain stats are not enabled

cat /proc/1/schedstat
505 58 1097

c) booted with schedstats=rq
----------------------------
cat /proc/schedstat
version 13
timestamp 4294913832

statistics for cpu0

cpu0 14 14 56 102 260 96332 18867 47278 45064 3556397949 2216 77465
scheddomain stats are not enabled

statistics for cpu1

cpu1 3 3 12134 12138 333 42878 4224 12874 8779 1714457722 2071 38654
scheddomain stats are not enabled

cat /proc/1/schedstat
tasks schedstats is not enabled

d) booted with schedstats=sd
----------------------------
cat /proc/schedstat
version 13
timestamp 4294936220

statistics for cpu0

runqueue stats are not enabled
domain0 00000003 38048 37802 140 248 108 0 0 37802 151 149 0 3 3 0 0 149 27574 27417 59 158 99 0 0 27417 0 0 0 0 0 0 0 0 0 4168 827 0

statistics for cpu1

runqueue stats are not enabled
domain0 00000003 39094 38441 119 682 563 3 0 38441 199 196 0 5 5 0 0 196 9167 8970 107 203 96 0 0 8970 0 0 0 0 0 0 0 0 0 2159 330 0

cat /proc/1/schedstat
tasks schedstats is not enabled

Alternatives considered
=======================

The other alternative considered was that instead of changing the format of /proc/schedstat and /proc/<pid>*/schedstat, we could print zeros for all levels for which statistics are not collected. But zeros could be treated as valid values, so this solution was not implemented.

Limitations
===========

The effectiveness of this patch is limited to run-time statistics collection. The run-time overhead is proportional to the level of statistics enabled. The space consumed is the same as before.

This patch was created against 2.6.16-rc5

Signed-off-by: Balbir Singh <ba...@in...>
---

 Documentation/kernel-parameters.txt |   11 ++
 fs/proc/base.c                      |    4 
 include/linux/sched.h               |   23 ++++
 kernel/sched.c                      |  195 +++++++++++++++++++++++++-----------
 4 files changed, 175 insertions(+), 58 deletions(-)

diff -puN kernel/sched.c~schedstats_refinement kernel/sched.c
--- linux-2.6.16-rc5/kernel/sched.c~schedstats_refinement	2006-03-07 16:14:59.000000000 +0530
+++ linux-2.6.16-rc5-balbir/kernel/sched.c	2006-03-07 20:43:46.000000000 +0530
@@ -386,7 +386,28 @@ static inline void task_rq_unlock(runque
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 12
+#define SCHEDSTAT_VERSION 13
+
+int schedstats_on __read_mostly;
+
+/*
+ * Parse the schedstats options passed at boot time
+ */
+static int __init schedstats_setup_enable(char *str)
+{
+	if (!str || !strcmp(str, "") || !strcmp(str, "=all"))
+		schedstats_on = SCHEDSTATS_ALL;
+	else if (!strcmp(str, "=tasks"))
+		schedstats_on = SCHEDSTATS_TASKS;
+	else if (!strcmp(str, "=sd"))
+		schedstats_on = SCHEDSTATS_SD;
+	else if (!strcmp(str, "=rq"))
+		schedstats_on = SCHEDSTATS_RQ;
+
+	return 1;
+}
+
+__setup("schedstats", schedstats_setup_enable);

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -394,26 +415,44 @@ static int show_schedstat(struct seq_fil
 
 	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
 	seq_printf(seq, "timestamp %lu\n", jiffies);
+
+	if (!schedstats_rq_on() && !schedstats_sd_on()) {
+		seq_printf(seq, "runqueue and scheddomain stats are not "
+				"enabled\n");
+		return 0;
+	}
+
 	for_each_online_cpu(cpu) {
 		runqueue_t *rq = cpu_rq(cpu);
 #ifdef CONFIG_SMP
 		struct sched_domain *sd;
 		int dcnt = 0;
 #endif
+		seq_printf(seq, "\nstatistics for cpu%d\n\n", cpu);
 
-		/* runqueue-specific stats */
-		seq_printf(seq,
-		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
-		    cpu, rq->yld_both_empty,
-		    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
-		    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
-		    rq->ttwu_cnt, rq->ttwu_local,
-		    rq->rq_sched_info.cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
+		if (schedstats_rq_on()) {
+			/* runqueue-specific stats */
+			seq_printf(seq,
+			    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu "
+			    "%lu",
+			    cpu, rq->yld_both_empty,
+			    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
+			    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
+			    rq->ttwu_cnt, rq->ttwu_local,
+			    rq->rq_sched_info.cpu_time,
+			    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
 
-		seq_printf(seq, "\n");
+			seq_printf(seq, "\n");
+		} else
+			seq_printf(seq, "runqueue stats are not enabled\n");
 
 #ifdef CONFIG_SMP
+
+		if (!schedstats_sd_on()) {
+			seq_printf(seq, "scheddomain stats are not enabled\n");
+			continue;
+		}
+
 		/* domain-specific stats */
 		preempt_disable();
 		for_each_domain(cpu, sd) {
@@ -472,11 +511,30 @@ struct file_operations proc_schedstat_op
 	.release = single_release,
 };
 
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
+# define schedstat_sd_inc(sd, field)		\
+do {						\
+	if (unlikely(schedstats_sd_on()))	\
+		(sd)->field++;			\
+} while (0)
+
+# define schedstat_sd_add(sd, field, amt)	\
+do {						\
+	if (unlikely(schedstats_sd_on()))	\
+		(sd)->field += (amt);		\
+} while (0)
+
+# define schedstat_rq_inc(rq, field)		\
+do {						\
+	if (unlikely(schedstats_rq_on()))	\
+		(rq)->field++;			\
+} while (0)
+
 #else /* !CONFIG_SCHEDSTATS */
-# define schedstat_inc(rq, field)	do { } while (0)
-# define schedstat_add(rq, field, amt)	do { } while (0)
+
+# define schedstat_sd_inc(rq, field)		do { } while (0)
+# define schedstat_sd_add(rq, field, amt)	do { } while (0)
+# define schedstat_rq_inc(rq, field)		do { } while (0)
+
 #endif
 
 /*
@@ -515,6 +573,15 @@ static inline void sched_info_dequeued(t
 	t->sched_info.last_queued = 0;
 }
 
+static void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff)
+{
+	if (!schedstats_rq_on() || !rq)
+		return;
+
+	rq->rq_sched_info.run_delay += diff;
+	rq->rq_sched_info.pcnt++;
+}
+
 /*
  * Called when a task finally hits the cpu. We can now calculate how
  * long it was waiting to run. We also note when it began so that we
@@ -523,20 +590,23 @@ static inline void sched_info_dequeued(t
 static void sched_info_arrive(task_t *t)
 {
 	unsigned long now = jiffies, diff = 0;
-	struct runqueue *rq = task_rq(t);
 
+	if (!schedstats_tasks_on() && !schedstats_rq_on())
+		return;
+
+	/*
+	 * diff is required in case schedstats is on for tasks or rq
+	 */
 	if (t->sched_info.last_queued)
 		diff = now - t->sched_info.last_queued;
 	sched_info_dequeued(t);
-	t->sched_info.run_delay += diff;
-	t->sched_info.last_arrival = now;
-	t->sched_info.pcnt++;
-
-	if (!rq)
-		return;
 
-	rq->rq_sched_info.run_delay += diff;
-	rq->rq_sched_info.pcnt++;
+	if (schedstats_tasks_on()) {
+		t->sched_info.run_delay += diff;
+		t->sched_info.last_arrival = now;
+		t->sched_info.pcnt++;
+	}
+	rq_sched_info_arrive(task_rq(t), diff);
 }
 
 /*
@@ -556,23 +626,32 @@ static void sched_info_arrive(task_t *t)
  */
 static inline void sched_info_queued(task_t *t)
 {
+	if (!schedstats_tasks_on() && !schedstats_rq_on())
+		return;
+
 	if (!t->sched_info.last_queued)
 		t->sched_info.last_queued = jiffies;
 }
 
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff)
+{
+	if (!schedstats_rq_on() || !rq)
+		return;
+
+	rq->rq_sched_info.cpu_time += diff;
+}
+
 /*
  * Called when a process ceases being the active-running process, either
  * voluntarily or involuntarily.  Now we can calculate how long we ran.
 */
 static inline void sched_info_depart(task_t *t)
 {
-	struct runqueue *rq = task_rq(t);
 	unsigned long diff = jiffies - t->sched_info.last_arrival;
 
-	t->sched_info.cpu_time += diff;
-
-	if (rq)
-		rq->rq_sched_info.cpu_time += diff;
+	if (schedstats_tasks_on())
+		t->sched_info.cpu_time += diff;
+	rq_sched_info_depart(task_rq(t), diff);
 }
 
 /*
@@ -1190,15 +1269,15 @@ static int try_to_wake_up(task_t *p, uns
 
 	new_cpu = cpu;
 
-	schedstat_inc(rq, ttwu_cnt);
+	schedstat_rq_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
-		schedstat_inc(rq, ttwu_local);
+		schedstat_rq_inc(rq, ttwu_local);
 		goto out_set_cpu;
 	}
 
 	for_each_domain(this_cpu, sd) {
 		if (cpu_isset(cpu, sd->span)) {
-			schedstat_inc(sd, ttwu_wake_remote);
+			schedstat_sd_inc(sd, ttwu_wake_remote);
 			this_sd = sd;
 			break;
 		}
@@ -1239,7 +1318,7 @@ static int try_to_wake_up(task_t *p, uns
 			 * p is cache cold in this domain, and
 			 * there is no bad imbalance.
 			 */
-			schedstat_inc(this_sd, ttwu_move_affine);
+			schedstat_sd_inc(this_sd, ttwu_move_affine);
 			goto out_set_cpu;
 		}
 	}
@@ -1250,7 +1329,7 @@ static int try_to_wake_up(task_t *p, uns
 	 */
 	if (this_sd->flags & SD_WAKE_BALANCE) {
 		if (imbalance*this_load <= 100*load) {
-			schedstat_inc(this_sd, ttwu_move_balance);
+			schedstat_sd_inc(this_sd, ttwu_move_balance);
 			goto out_set_cpu;
 		}
 	}
@@ -1894,7 +1973,7 @@ skip_queue:
 
 #ifdef CONFIG_SCHEDSTATS
 	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
-		schedstat_inc(sd, lb_hot_gained[idle]);
+		schedstat_sd_inc(sd, lb_hot_gained[idle]);
 #endif
 
 	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
@@ -1913,7 +1992,7 @@ out:
 	 * so we can safely collect pull_task() stats here rather than
 	 * inside pull_task().
*/ - schedstat_add(sd, lb_gained[idle], pulled); + schedstat_sd_add(sd, lb_gained[idle], pulled); if (all_pinned) *all_pinned = pinned; @@ -2109,23 +2188,23 @@ static int load_balance(int this_cpu, ru if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER) sd_idle = 1; - schedstat_inc(sd, lb_cnt[idle]); + schedstat_sd_inc(sd, lb_cnt[idle]); group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle); if (!group) { - schedstat_inc(sd, lb_nobusyg[idle]); + schedstat_sd_inc(sd, lb_nobusyg[idle]); goto out_balanced; } busiest = find_busiest_queue(group, idle); if (!busiest) { - schedstat_inc(sd, lb_nobusyq[idle]); + schedstat_sd_inc(sd, lb_nobusyq[idle]); goto out_balanced; } BUG_ON(busiest == this_rq); - schedstat_add(sd, lb_imbalance[idle], imbalance); + schedstat_sd_add(sd, lb_imbalance[idle], imbalance); nr_moved = 0; if (busiest->nr_running > 1) { @@ -2146,7 +2225,7 @@ static int load_balance(int this_cpu, ru } if (!nr_moved) { - schedstat_inc(sd, lb_failed[idle]); + schedstat_sd_inc(sd, lb_failed[idle]); sd->nr_balance_failed++; if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) { @@ -2199,7 +2278,7 @@ static int load_balance(int this_cpu, ru return nr_moved; out_balanced: - schedstat_inc(sd, lb_balanced[idle]); + schedstat_sd_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0; @@ -2233,22 +2312,22 @@ static int load_balance_newidle(int this if (sd->flags & SD_SHARE_CPUPOWER) sd_idle = 1; - schedstat_inc(sd, lb_cnt[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_cnt[NEWLY_IDLE]); group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle); if (!group) { - schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_nobusyg[NEWLY_IDLE]); goto out_balanced; } busiest = find_busiest_queue(group, NEWLY_IDLE); if (!busiest) { - schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_nobusyq[NEWLY_IDLE]); goto out_balanced; } BUG_ON(busiest == this_rq); - schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance); + 
schedstat_sd_add(sd, lb_imbalance[NEWLY_IDLE], imbalance); nr_moved = 0; if (busiest->nr_running > 1) { @@ -2260,7 +2339,7 @@ static int load_balance_newidle(int this } if (!nr_moved) { - schedstat_inc(sd, lb_failed[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_failed[NEWLY_IDLE]); if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER) return -1; } else @@ -2269,7 +2348,7 @@ static int load_balance_newidle(int this return nr_moved; out_balanced: - schedstat_inc(sd, lb_balanced[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_balanced[NEWLY_IDLE]); if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER) return -1; sd->nr_balance_failed = 0; @@ -2333,12 +2412,12 @@ static void active_load_balance(runqueue if (unlikely(sd == NULL)) goto out; - schedstat_inc(sd, alb_cnt); + schedstat_sd_inc(sd, alb_cnt); if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL)) - schedstat_inc(sd, alb_pushed); + schedstat_sd_inc(sd, alb_pushed); else - schedstat_inc(sd, alb_failed); + schedstat_sd_inc(sd, alb_failed); out: spin_unlock(&target_rq->lock); } @@ -2906,7 +2985,7 @@ need_resched_nonpreemptible: dump_stack(); } - schedstat_inc(rq, sched_cnt); + schedstat_rq_inc(rq, sched_cnt); now = sched_clock(); if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { run_time = now - prev->timestamp; @@ -2974,7 +3053,7 @@ go_idle: /* * Switch the active and expired arrays. 
*/ - schedstat_inc(rq, sched_switch); + schedstat_rq_inc(rq, sched_switch); rq->active = rq->expired; rq->expired = array; array = rq->active; @@ -3007,7 +3086,7 @@ go_idle: next->activated = 0; switch_tasks: if (next == rq->idle) - schedstat_inc(rq, sched_goidle); + schedstat_rq_inc(rq, sched_goidle); prefetch(next); prefetch_stack(next); clear_tsk_need_resched(prev); @@ -3979,7 +4058,7 @@ asmlinkage long sys_sched_yield(void) prio_array_t *array = current->array; prio_array_t *target = rq->expired; - schedstat_inc(rq, yld_cnt); + schedstat_rq_inc(rq, yld_cnt); /* * We implement yielding by moving the task into the expired * queue. @@ -3991,11 +4070,11 @@ asmlinkage long sys_sched_yield(void) target = rq->active; if (array->nr_active == 1) { - schedstat_inc(rq, yld_act_empty); + schedstat_rq_inc(rq, yld_act_empty); if (!rq->expired->nr_active) - schedstat_inc(rq, yld_both_empty); + schedstat_rq_inc(rq, yld_both_empty); } else if (!rq->expired->nr_active) - schedstat_inc(rq, yld_exp_empty); + schedstat_rq_inc(rq, yld_exp_empty); if (array != target) { dequeue_task(current, array); diff -puN include/linux/sched.h~schedstats_refinement include/linux/sched.h --- linux-2.6.16-rc5/include/linux/sched.h~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/include/linux/sched.h 2006-03-07 20:46:10.000000000 +0530 @@ -525,6 +525,29 @@ struct backing_dev_info; struct reclaim_state; #ifdef CONFIG_SCHEDSTATS + +#define SCHEDSTATS_TASKS 0x1 +#define SCHEDSTATS_RQ 0x2 +#define SCHEDSTATS_SD 0x4 +#define SCHEDSTATS_ALL (SCHEDSTATS_TASKS | SCHEDSTATS_RQ | SCHEDSTATS_SD) + +extern int schedstats_on; + +static inline int schedstats_tasks_on(void) +{ + return schedstats_on & SCHEDSTATS_TASKS; +} + +static inline int schedstats_rq_on(void) +{ + return schedstats_on & SCHEDSTATS_RQ; +} + +static inline int schedstats_sd_on(void) +{ + return schedstats_on & SCHEDSTATS_SD; +} + struct sched_info { /* cumulative counters */ unsigned long cpu_time, /* time 
spent on the cpu */ diff -puN Documentation/kernel-parameters.txt~schedstats_refinement Documentation/kernel-parameters.txt --- linux-2.6.16-rc5/Documentation/kernel-parameters.txt~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/Documentation/kernel-parameters.txt 2006-03-07 20:48:44.000000000 +0530 @@ -1333,6 +1333,17 @@ running once the system is up. sc1200wdt= [HW,WDT] SC1200 WDT (watchdog) driver Format: <io>[,<timeout>[,<isapnp>]] + schedstats [KNL] + Enable all schedstats if CONFIG_SCHEDSTATS is defined + same as the schedstats=all + + schedstats= [KNL] + Format: {"all", "tasks", "sd", "rq"} + all -- turns on the complete schedstats + rq -- turns on schedstats only for runqueue + sd -- turns on schedstats only for scheddomains + tasks -- turns on schedstats only for tasks + scsi_debug_*= [SCSI] See drivers/scsi/scsi_debug.c. diff -puN fs/proc/base.c~schedstats_refinement fs/proc/base.c --- linux-2.6.16-rc5/fs/proc/base.c~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/fs/proc/base.c 2006-03-07 16:17:39.000000000 +0530 @@ -72,6 +72,7 @@ #include <linux/cpuset.h> #include <linux/audit.h> #include <linux/poll.h> +#include <linux/sched.h> #include "internal.h" /* @@ -504,6 +505,9 @@ static int proc_pid_wchan(struct task_st */ static int proc_pid_schedstat(struct task_struct *task, char *buffer) { + if (!schedstats_tasks_on()) + return sprintf(buffer, "tasks schedstats is not enabled\n"); + return sprintf(buffer, "%lu %lu %lu\n", task->sched_info.cpu_time, task->sched_info.run_delay, _ |
From: Bryan O'S. <bo...@se...> - 2006-02-27 17:05:30
|
On Mon, 2006-02-27 at 03:12 -0500, Shailabh Nagar wrote: > Add sysctl option for controlling schedstats collection > dynamically. Delay accounting leverages schedstats for > cpu delay statistics. Is there some reason you're using the sysctl interface, and not say sysfs instead? |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:22:47
|
> +/* > + * timespec_diff_ns - Return difference of two timestamps in nanoseconds > + * In the rare case of @end being earlier than @start, return zero > + */ > +static inline nsec_t timespec_diff_ns(struct timespec *start, struct timespec *end) > +{ > + nsec_t ret; > + > + ret = (nsec_t)(end->tv_sec - start->tv_sec)*NSEC_PER_SEC; > + ret += (nsec_t)(end->tv_nsec - start->tv_nsec); > + if (ret < 0) > + return 0; > + return ret; > +} > #endif /* __KERNEL__ */ > wouldn't it be more useful to have this return a timespec as well, and then it'd be generically useful (and it also probably should then be uninlined ;) |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:15:58
|
delayacct-setup.patch Initialization code related to collection of per-task "delay" statistics which measure how long the task had to wait for cpu, sync block io, swapping etc. The collection of statistics and the interface are in other patches. This patch sets up the data structures and allows the statistics collection to be disabled through a kernel boot parameter. Signed-off-by: Shailabh Nagar <na...@wa...> Documentation/kernel-parameters.txt | 2 + include/linux/delayacct.h | 55 ++++++++++++++++++++++++++++++ include/linux/sched.h | 15 ++++++++ init/Kconfig | 13 +++++++ init/main.c | 2 + kernel/Makefile | 1 kernel/delayacct.c | 65 ++++++++++++++++++++++++++++++++++++ kernel/exit.c | 3 + kernel/fork.c | 2 + 9 files changed, 158 insertions(+) Index: linux-2.6.16-rc4/Documentation/kernel-parameters.txt =================================================================== --- linux-2.6.16-rc4.orig/Documentation/kernel-parameters.txt 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/Documentation/kernel-parameters.txt 2006-02-27 01:52:54.000000000 -0500 @@ -410,6 +410,8 @@ running once the system is up. Format: <area>[,<node>] See also Documentation/networking/decnet.txt. + delayacct [KNL] Enable per-task delay accounting + devfs= [DEVFS] See Documentation/filesystems/devfs/boot-options. 
Index: linux-2.6.16-rc4/kernel/Makefile =================================================================== --- linux-2.6.16-rc4.orig/kernel/Makefile 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/Makefile 2006-02-27 01:52:54.000000000 -0500 @@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: linux-2.6.16-rc4/include/linux/delayacct.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc4/include/linux/delayacct.h 2006-02-27 01:52:54.000000000 -0500 @@ -0,0 +1,55 @@ +/* delayacct.h - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
+ */ + +#ifndef _LINUX_TASKDELAYS_H +#define _LINUX_TASKDELAYS_H + +#include <linux/sched.h> + +#ifdef CONFIG_TASK_DELAY_ACCT +extern int delayacct_on; /* Delay accounting turned on/off */ +extern kmem_cache_t *delayacct_cache; +extern int delayacct_init(void); +extern void __delayacct_tsk_init(struct task_struct *); +extern void __delayacct_tsk_exit(struct task_struct *); + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ + /* reinitialize in case parent's non-null pointer was dup'ed*/ + tsk->delays = NULL; + if (unlikely(delayacct_on)) + __delayacct_tsk_init(tsk); +} + +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{ + if (unlikely(tsk->delays)) + __delayacct_tsk_exit(tsk); +} + +static inline void delayacct_timestamp_start(void) +{ + if (unlikely(current->delays && delayacct_on)) + do_posix_clock_monotonic_gettime(¤t->delays->start); +} +#else +static inline void delayacct_tsk_init(struct task_struct *tsk) +{} +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{} +static inline void delayacct_timestamp_start(void) +{} +static inline int delayacct_init(void) +{} +#endif /* CONFIG_TASK_DELAY_ACCT */ +#endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.16-rc4/include/linux/sched.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/sched.h 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/sched.h 2006-02-27 01:52:54.000000000 -0500 @@ -543,6 +543,18 @@ struct sched_info { extern struct file_operations proc_schedstat_operations; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +struct task_delay_info { + spinlock_t lock; + + /* timestamp recording variables (to reduce stack usage) */ + struct timespec start, end; + + /* Add stats in pairs: u64 delay, u32 count, aligned properly */ +}; +#endif + + enum idle_type { SCHED_IDLE, @@ -874,6 +886,9 @@ struct task_struct { #endif atomic_t fs_excl; /* holding fs exclusive resources */ struct rcu_head rcu; +#ifdef 
CONFIG_TASK_DELAY_ACCT + struct task_delay_info *delays; +#endif }; static inline pid_t process_group(struct task_struct *tsk) Index: linux-2.6.16-rc4/init/Kconfig =================================================================== --- linux-2.6.16-rc4.orig/init/Kconfig 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/init/Kconfig 2006-02-27 01:52:54.000000000 -0500 @@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASK_DELAY_ACCT + bool "Enable per-task delay accounting (EXPERIMENTAL)" + help + Collect information on time spent by a task waiting for system + resources like cpu, synchronous block I/O completion and swapping + in pages. Such statistics can help in setting a task's priorities + relative to other tasks for cpu, io, rss limits etc. + + Unlike BSD process accounting, this information is available + continuously during the lifetime of a task. + + Say N if unsure. + config SYSCTL bool "Sysctl support" ---help--- Index: linux-2.6.16-rc4/init/main.c =================================================================== --- linux-2.6.16-rc4.orig/init/main.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/init/main.c 2006-02-27 01:52:54.000000000 -0500 @@ -47,6 +47,7 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> +#include <linux/delayacct.h> #include <asm/io.h> #include <asm/bugs.h> @@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void proc_root_init(); #endif cpuset_init(); + delayacct_init(); check_bugs(); Index: linux-2.6.16-rc4/kernel/delayacct.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc4/kernel/delayacct.c 2006-02-27 01:52:54.000000000 -0500 @@ -0,0 +1,65 @@ +/* delayacct.c - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 
2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/time.h> +#include <linux/sysctl.h> +#include <linux/delayacct.h> + +int delayacct_on = 0; /* Delay accounting turned on/off */ +kmem_cache_t *delayacct_cache; + +static int __init delayacct_setup_enable(char *str) +{ + delayacct_on = 1; + return 1; +} +__setup("delayacct", delayacct_setup_enable); + +int delayacct_init(void) +{ + delayacct_cache = kmem_cache_create("delayacct_cache", + sizeof(struct task_delay_info), + 0, + SLAB_PANIC, + NULL, NULL); + if (!delayacct_cache) + return -ENOMEM; + delayacct_tsk_init(&init_task); + return 0; +} + +void __delayacct_tsk_init(struct task_struct *tsk) +{ + tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (tsk->delays) { + memset(tsk->delays, 0, sizeof(*tsk->delays)); + spin_lock_init(&tsk->delays->lock); + } +} + +void __delayacct_tsk_exit(struct task_struct *tsk) +{ + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; +} + +static inline nsec_t delayacct_measure(void) +{ + if ((current->delays->start.tv_sec == 0) && + (current->delays->start.tv_nsec == 0)) + return -EINVAL; + do_posix_clock_monotonic_gettime(¤t->delays->end); + return timespec_diff_ns(¤t->delays->start, ¤t->delays->end); +} Index: linux-2.6.16-rc4/kernel/fork.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/fork.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/fork.c 2006-02-27 01:52:54.000000000 -0500 @@ -44,6 +44,7 @@ #include <linux/rmap.h> #include <linux/acct.h> #include 
<linux/cn_proc.h> +#include <linux/delayacct.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -970,6 +971,7 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; + delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; Index: linux-2.6.16-rc4/kernel/exit.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/exit.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/exit.c 2006-02-27 01:52:54.000000000 -0500 @@ -31,6 +31,7 @@ #include <linux/signal.h> #include <linux/cn_proc.h> #include <linux/mutex.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -839,6 +840,8 @@ fastcall NORET_TYPE void do_exit(long co preempt_count()); acct_update_integrals(tsk); + delayacct_tsk_exit(tsk); + if (tsk->mm) { update_hiwater_rss(tsk->mm); update_hiwater_vm(tsk->mm); |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:18:50
|
delayacct-sysctl.patch Adds a sysctl to turn delay accounting on/off dynamically. (defaults to off). When turning off, struct task_delay_info associated with each task need to be cleared. When turning on, tasks without struct task_delay_info need to be allocated one. Signed-off-by: Shailabh Nagar <na...@wa...> Signed-off-by: Balbir Singh <ba...@in...> Signed-off-by: Srivatsa Vaddagiri <va...@in...> include/linux/delayacct.h | 12 +++- include/linux/sysctl.h | 1 kernel/delayacct.c | 128 ++++++++++++++++++++++++++++++++++++++++++++-- kernel/fork.c | 3 - kernel/sysctl.c | 11 +++ 5 files changed, 147 insertions(+), 8 deletions(-) Index: linux-2.6.16-rc4/include/linux/delayacct.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/delayacct.h 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/delayacct.h 2006-02-27 01:52:56.000000000 -0500 @@ -15,18 +15,24 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/sysctl.h> #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ extern kmem_cache_t *delayacct_cache; +int delayacct_sysctl_handler(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); extern int delayacct_init(void); extern void __delayacct_tsk_init(struct task_struct *); extern void __delayacct_tsk_exit(struct task_struct *); -static inline void delayacct_tsk_init(struct task_struct *tsk) +static inline void delayacct_tsk_early_init(struct task_struct *tsk) { - /* reinitialize in case parent's non-null pointer was dup'ed*/ tsk->delays = NULL; +} + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ if (unlikely(delayacct_on)) __delayacct_tsk_init(tsk); } @@ -43,6 +49,8 @@ static inline void delayacct_timestamp_s do_posix_clock_monotonic_gettime(¤t->delays->start); } #else +static inline void delayacct_tsk_early_init(struct task_struct *tsk) +{} static inline void 
delayacct_tsk_init(struct task_struct *tsk) {} static inline void delayacct_tsk_exit(struct task_struct *tsk) Index: linux-2.6.16-rc4/include/linux/sysctl.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/sysctl.h 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/sysctl.h 2006-02-27 01:52:56.000000000 -0500 @@ -147,6 +147,7 @@ enum KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */ KERN_SPIN_RETRY=70, /* int: number of spinlock retries */ KERN_SCHEDSTATS=71, /* int: Schedstats on/off */ + KERN_DELAYACCT=74, /* int: Per-task delay accounting on/off */ }; Index: linux-2.6.16-rc4/kernel/delayacct.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/delayacct.c 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/kernel/delayacct.c 2006-02-27 01:52:56.000000000 -0500 @@ -1,6 +1,7 @@ /* delayacct.c - per-task delay accounting * * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 
2006 * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License @@ -42,17 +43,30 @@ int delayacct_init(void) void __delayacct_tsk_init(struct task_struct *tsk) { - tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); - if (tsk->delays) { + struct task_delay_info *delays = NULL; + + delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (!delays) + return; + + task_lock(tsk); + if (!tsk->delays) { + tsk->delays = delays; memset(tsk->delays, 0, sizeof(*tsk->delays)); spin_lock_init(&tsk->delays->lock); - } + } else + kmem_cache_free(delayacct_cache, delays); + task_unlock(tsk); } void __delayacct_tsk_exit(struct task_struct *tsk) { - kmem_cache_free(delayacct_cache, tsk->delays); - tsk->delays = NULL; + task_lock(tsk); + if (tsk->delays) { + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; + } + task_unlock(tsk); } static inline nsec_t delayacct_measure(void) @@ -63,3 +77,107 @@ static inline nsec_t delayacct_measure(v do_posix_clock_monotonic_gettime(¤t->delays->end); return timespec_diff_ns(¤t->delays->start, ¤t->delays->end); } + +/* Allocate task_delay_info for all tasks without one */ +static int alloc_delays(void) +{ + int cnt=0, i, j; + struct task_struct *g, *t; + struct task_delay_info **delayp; + int err = 0; + + read_lock(&tasklist_lock); + do_each_thread(g, t) + if (!t->delays && !(t->flags & (PF_EXITING | PF_DEAD))) + cnt++; + while_each_thread(g, t); + read_unlock(&tasklist_lock); + + if (!cnt) + return 0; +retry_allocs: + + delayp = kmalloc(cnt *sizeof(struct task_delay_info *), GFP_KERNEL); + if (!delayp) + return -ENOMEM; + for (i = 0; i < cnt; i++) { + delayp[i] = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (!delayp[i]) { + err = -ENOMEM; + goto out; + } + memset(delayp[i], 0, sizeof(*delayp[i])); + spin_lock_init(&delayp[i]->lock); + } + + i--; + j = 0; + read_lock(&tasklist_lock); + do_each_thread(g, t) { + 
task_lock(t); + if (t->delays) { + task_unlock(t); + continue; + } + /* Did some additional unaccounted tasks get created */ + if (i < 0) { + j++; + task_unlock(t); + continue; + } + if (!(t->flags & (PF_EXITING | PF_DEAD))) { + t->delays = delayp[i--]; + } + task_unlock(t); + } while_each_thread(g, t); + read_unlock(&tasklist_lock); + + /* + * Retry allocations for all tasks created in between the two + * tasklist_locks + */ + if (j > 0) { + kfree(delayp); + cnt = j; + goto retry_allocs; + } +out: + while (i >= 0) + kmem_cache_free(delayacct_cache, delayp[i--]); + kfree(delayp); + return err; +} + +/* Reset task_delay_info structs for all tasks */ +static void reset_delays(void) +{ + struct task_struct *g, *t; + + read_lock(&tasklist_lock); + do_each_thread(g, t) { + if (!t->delays) + continue; + memset(t->delays, 0, sizeof(struct task_delay_info)); + spin_lock_init(&t->delays->lock); + } while_each_thread(g, t); + read_unlock(&tasklist_lock); +} + +int delayacct_sysctl_handler(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int ret, prev; + + prev = delayacct_on; + ret = proc_dointvec(table, write, filp, buffer, lenp, ppos); + if (ret || (prev == delayacct_on)) + return ret; + + if (delayacct_on) + ret = alloc_delays(); + else + reset_delays(); + if (ret) + delayacct_on = prev; + return ret; +} Index: linux-2.6.16-rc4/kernel/fork.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/fork.c 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/kernel/fork.c 2006-02-27 01:52:56.000000000 -0500 @@ -971,7 +971,6 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; - delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; @@ -1013,6 +1012,7 @@ static task_t *copy_process(unsigned lon p->io_wait = NULL; p->audit_context = NULL; cpuset_fork(p); + 
delayacct_tsk_early_init(p); #ifdef CONFIG_NUMA p->mempolicy = mpol_copy(p->mempolicy); if (IS_ERR(p->mempolicy)) { @@ -1191,6 +1191,7 @@ static task_t *copy_process(unsigned lon total_forks++; spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); + delayacct_tsk_init(p); proc_fork_connector(p); return p; Index: linux-2.6.16-rc4/kernel/sysctl.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/sysctl.c 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/kernel/sysctl.c 2006-02-27 01:52:56.000000000 -0500 @@ -44,6 +44,7 @@ #include <linux/limits.h> #include <linux/dcache.h> #include <linux/syscalls.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/processor.h> @@ -666,6 +667,16 @@ static ctl_table kern_table[] = { .proc_handler = &schedstats_sysctl_handler, }, #endif +#if defined(CONFIG_TASK_DELAY_ACCT) + { + .ctl_name = KERN_DELAYACCT, + .procname = "delayacct", + .data = &delayacct_on, + .maxlen = sizeof (int), + .mode = 0644, + .proc_handler = &delayacct_sysctl_handler, + }, +#endif { .ctl_name = 0 } }; |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:26:31
|
> +/* Allocate task_delay_info for all tasks without one */ > +static int alloc_delays(void) I'm sorry but this function seems to be highly horrible |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:38:55
|
Arjan van de Ven wrote: >>+/* Allocate task_delay_info for all tasks without one */ >>+static int alloc_delays(void) >> >> > >I'm sorry but this function seems to be highly horrible > > Could you be more specific? Is it the way it's coded or the design (preallocate, then assign) itself? The function needs to allocate task_delay_info structs for all tasks that might have been forked since the last time delay accounting was turned off. Either we have to count how many such tasks there are, or preallocate nr_tasks (as an upper bound) and then use as many as needed. Thanks for reviewing so quickly. -- Shailabh |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:42:31
|
On Mon, 2006-02-27 at 03:38 -0500, Shailabh Nagar wrote: > Arjan van de Ven wrote: > > >>+/* Allocate task_delay_info for all tasks without one */ > >>+static int alloc_delays(void) > >> > >> > > > >I'm sorry but this function seems to be highly horrible > > > > > Could you be more specific ? Is it the way its coded or the design > (preallocate, then assign) > itself ? > > The function needs to allocate task_delay_info structs for all tasks > that might > have been forked since the last time delay accounting was turned off. > Either we have to count how many such tasks there are, or preallocate > nr_tasks (as an upper bound) and then use as many as needed. it generally feels really fragile, especially with the task enumeration going to RCU soon. (eg you'd lose the ability to lock out new task creation) On first sight it looks a lot better to allocate these things on demand, but I'm not sure how the sleeping-allocation would interact with the places it'd need to be called... |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:59:51
|
Arjan van de Ven wrote: >On Mon, 2006-02-27 at 03:38 -0500, Shailabh Nagar wrote: > > >>Arjan van de Ven wrote: >> >> >> >>>>+/* Allocate task_delay_info for all tasks without one */ >>>>+static int alloc_delays(void) >>>> >>>> >>>> >>>> >>>I'm sorry but this function seems to be highly horrible >>> >>> >>> >>> >>Could you be more specific ? Is it the way its coded or the design >>(preallocate, then assign) >>itself ? >> >>The function needs to allocate task_delay_info structs for all tasks >>that might >>have been forked since the last time delay accounting was turned off. >>Either we have to count how many such tasks there are, or preallocate >>nr_tasks (as an upper bound) and then use as many as needed. >> >> > >it generally feels really fragile, especially with the task enumeration >going to RCU soon. (eg you'd lose the ability to lock out new task >creation) > > >On first sight it looks a lot better to allocate these things on demand, >but I'm not sure how the sleeping-allocation would interact with the >places it'd need to be called... > > Yes, that's the reason why we didn't do the on-demand allocation... the next time a task is checked could be in any of the places where the timestamping is done. Doing the allocation there (and incurring the extra cost of the check even when sysctl hasn't been used) didn't seem worthwhile, esp. when we have a point (sysctl handler) where we can catch most of the allocs needed. But if task enumeration is going to get more difficult, we'll need to keep the on-demand allocation (on next use) as a backup for tasks that weren't caught during the sysctl change. > > > |
From: Balbir S. <ba...@in...> - 2006-02-27 11:19:04
|
> But if task enumeration is going to get more difficult, we'll need to > keep the on-demand allocation (on > next use) as a backup for tasks that weren't caught during the sysctl > change. > One possible issue with on-demand allocation is that under heavy load allocation causes IO to happen and when we try to timestamp that IO, we do not have a delays structure to do so. Balbir |