From: Shailabh N. <na...@wa...> - 2005-12-07 22:08:24
|
The following patches add accounting for the delays seen by a task in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

There are quite a few differences from the earlier posting of these patches
(http://www.uwsg.indiana.edu/hypermail/linux/kernel/0511.1/2275.html):

- block I/O is (hopefully) being accounted properly now instead of just
  counting the time spent in io_schedule() as done earlier.
- instead of accounting for time spent in all page faults, only swapping in
  of pages is being counted, since that's the only part one can really
  control (capacity misses vs. compulsory misses)
- a /proc interface is being used instead of a connector-based interface.
  Andrew Morton suggested a generic connector-based interface useful for
  future usage of connectors for stats. This revised connector-based
  interface will be posted separately since it's useful for efficient
  delivery of any per-task statistics, not just the ones being introduced
  by these patches.
- the timestamping code has been made generic (following the suggestions to
  Matt Helsley's patches to add timestamps to process events connectors)

More comments in individual patches.

Series

nstimestamp-diff.patch
delayacct-init.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-procfs.patch
|
From: Shailabh N. <na...@wa...> - 2006-01-03 23:43:06
|
Forwarding again as this patch didn't make it to lse-tech, ckrm-tech and elsa-devel. Please include Andrew Morton and lkml on replies.

-- Shailabh

-------- Original Message --------
Subject: [Patch 0/6] Per-task delay accounting
Date: Tue, 03 Jan 2006 23:16:40 +0000
From: Shailabh Nagar <na...@wa...>
To: Andrew Morton <ak...@os...>, linux-kernel <lin...@vg...>
CC: elsa-devel <els...@li...>, LSE <lse...@li...>, ckrm-tech <ckr...@li...>

Andrew,

Could you please consider these patches for inclusion in -mm? The comments from earlier postings of these patches have been addressed, including the one you made about making the connector interface generic (more about that in the connector patch).

Thanks,
Shailabh

The following patches add accounting for the delays seen by tasks in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

The major change since the previous posting of these patches (http://www.ussg.iu.edu/hypermail/linux/kernel/0512.0/2152.html) is the resurrection of the connector interface (in addition to /proc) and, as part of the same patch, the ability to get stats per-tgid in addition to per-pid.

More comments in individual patches.

Series

nstimestamp-diff.patch
delayacct-init.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-connector.patch
|
From: Shailabh N. <na...@wa...> - 2006-02-27 07:56:46
|
The following patches add accounting for the delays seen by tasks in

a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages (i.e. capacity misses)

Such delays provide feedback for a task's cpu priority, io priority and rss limit values. Long delays, especially relative to other tasks, can be a trigger for changing a task's cpu/io priorities and modifying its rss usage (either directly through sys_getprlimit(), which was proposed earlier on lkml, or by throttling cpu consumption, the process calling sys_setrlimit(), etc.)

The major changes since the previous posting of these patches are

- use of the new generic netlink interface (NETLINK_GENERIC family) with
  provision for reuse by other (non-delay accounting) kernel components
- sysctl option for turning delay accounting collection on/off dynamically
- similar sysctl option for schedstats. Delay accounting leverages
  schedstats code for cpu delays.
- dynamic allocation of delay accounting structures

More comments in individual patches. Please give feedback.

--Shailabh

Series

nstimestamp-diff.patch
schedstats-sysctl.patch
delayacct-setup.patch
delayacct-sysctl.patch
delayacct-blkio.patch
delayacct-swapin.patch
delayacct-genetlink.patch
|
From: Shailabh N. <na...@wa...> - 2006-02-27 08:02:56
|
nstimestamp_diff.patch

Add kernel utility function for measuring the interval (diff)
between two timespec values, adjusting for overflow.

Signed-off-by: Shailabh Nagar <na...@wa...>

 include/linux/time.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

Index: linux-2.6.16-rc4/include/linux/time.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/time.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/time.h	2006-02-27 01:52:49.000000000 -0500
@@ -147,6 +147,20 @@ extern struct timespec ns_to_timespec(co
  */
 extern struct timeval ns_to_timeval(const nsec_t nsec);
 
+/*
+ * timespec_diff_ns - Return difference of two timestamps in nanoseconds
+ * In the rare case of @end being earlier than @start, return zero
+ */
+static inline nsec_t timespec_diff_ns(struct timespec *start, struct timespec *end)
+{
+	nsec_t ret;
+
+	ret = (nsec_t)(end->tv_sec - start->tv_sec)*NSEC_PER_SEC;
+	ret += (nsec_t)(end->tv_nsec - start->tv_nsec);
+	if (ret < 0)
+		return 0;
+	return ret;
+}
 #endif /* __KERNEL__ */
 
 #define NFDBITS			__NFDBITS
|
From: Shailabh N. <na...@wa...> - 2006-02-27 08:12:37
|
schedstats-sysctl.patch

Add sysctl option for controlling schedstats collection
dynamically. Delay accounting leverages schedstats for
cpu delay statistics.

Signed-off-by: Chandra Seetharaman <sek...@us...>
Signed-off-by: Shailabh Nagar <na...@wa...>

 Documentation/kernel-parameters.txt |    2 
 include/linux/sched.h               |    4 +
 include/linux/sysctl.h              |    1 
 kernel/sched.c                      |   74 +++++++++++++++++++++++++++++++++---
 kernel/sysctl.c                     |   10 ++++
 lib/Kconfig.debug                   |    6 +-
 6 files changed, 90 insertions(+), 7 deletions(-)

Index: linux-2.6.16-rc4/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/sched.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/sched.h	2006-02-27 01:52:52.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/cpumask.h>
 #include <linux/errno.h>
 #include <linux/nodemask.h>
+#include <linux/sysctl.h>
 
 #include <asm/system.h>
 #include <asm/semaphore.h>
@@ -525,6 +526,9 @@ struct backing_dev_info;
 struct reclaim_state;
 
 #ifdef CONFIG_SCHEDSTATS
+extern int schedstats_sysctl;
+extern int schedstats_sysctl_handler(ctl_table *, int, struct file *,
+			void __user *, size_t *, loff_t *);
 struct sched_info {
 	/* cumulative counters */
 	unsigned long	cpu_time,	/* time spent on the cpu */

Index: linux-2.6.16-rc4/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/sysctl.h	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/sysctl.h	2006-02-27 01:52:52.000000000 -0500
@@ -146,6 +146,7 @@ enum
 	KERN_RANDOMIZE=68, /* int: randomize virtual address space */
 	KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */
 	KERN_SPIN_RETRY=70,	/* int: number of spinlock retries */
+	KERN_SCHEDSTATS=71,	/* int: Schedstats on/off */
 };

Index: linux-2.6.16-rc4/kernel/sched.c
===================================================================
--- linux-2.6.16-rc4.orig/kernel/sched.c	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/kernel/sched.c	2006-02-27 01:52:52.000000000 -0500
@@ -382,11 +382,56 @@ static inline void task_rq_unlock(runque
 }
 
 #ifdef CONFIG_SCHEDSTATS
+
+int schedstats_sysctl = 0;	/* schedstats turned off by default */
+static DEFINE_PER_CPU(int, schedstats) = 0;
+
+static void schedstats_set(int val)
+{
+	int i;
+	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;
+
+	spin_lock(&schedstats_lock);
+	schedstats_sysctl = val;
+	for (i = 0; i < NR_CPUS; i++)
+		per_cpu(schedstats, i) = val;
+	spin_unlock(&schedstats_lock);
+}
+
+static int __init schedstats_setup_enable(char *str)
+{
+	schedstats_sysctl = 1;
+	schedstats_set(schedstats_sysctl);
+	return 1;
+}
+
+__setup("schedstats", schedstats_setup_enable);
+
+int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret, prev = schedstats_sysctl;
+	struct task_struct *g, *t;
+
+	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
+	if ((ret != 0) || (prev == schedstats_sysctl))
+		return ret;
+	if (schedstats_sysctl) {
+		read_lock(&tasklist_lock);
+		do_each_thread(g, t) {
+			memset(&t->sched_info, 0, sizeof(t->sched_info));
+		} while_each_thread(g, t);
+		read_unlock(&tasklist_lock);
+	}
+	schedstats_set(schedstats_sysctl);
+	return ret;
+}
+
 /*
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 12
+#define SCHEDSTAT_VERSION 13
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -394,6 +439,10 @@ static int show_schedstat(struct seq_fil
 
 	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
 	seq_printf(seq, "timestamp %lu\n", jiffies);
+	if (!schedstats_sysctl) {
+		seq_printf(seq, "State Off\n");
+		return 0;
+	}
 	for_each_online_cpu(cpu) {
 		runqueue_t *rq = cpu_rq(cpu);
 #ifdef CONFIG_SMP
@@ -472,8 +521,17 @@ struct file_operations proc_schedstat_op
 	.release = single_release,
 };
 
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
+#define schedstats_on	(per_cpu(schedstats, smp_processor_id()) != 0)
+#define schedstat_inc(rq, field)			\
+do {							\
+	if (unlikely(schedstats_on))			\
+		(rq)->field++;				\
+} while (0)
+#define schedstat_add(rq, field, amt)			\
+do {							\
+	if (unlikely(schedstats_on))			\
+		(rq)->field += (amt);			\
+} while (0)
 #else /* !CONFIG_SCHEDSTATS */
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
@@ -556,7 +614,7 @@ static void sched_info_arrive(task_t *t)
  */
 static inline void sched_info_queued(task_t *t)
 {
-	if (!t->sched_info.last_queued)
+	if (unlikely(schedstats_on && !t->sched_info.last_queued))
 		t->sched_info.last_queued = jiffies;
 }
 
@@ -580,7 +638,7 @@ static inline void sched_info_depart(tas
  * their time slice.  (This may also be called when switching to or from
  * the idle task.)  We are only called when prev != next.
  */
-static inline void sched_info_switch(task_t *prev, task_t *next)
+static inline void __sched_info_switch(task_t *prev, task_t *next)
 {
 	struct runqueue *rq = task_rq(prev);
 
@@ -595,6 +653,12 @@ static inline void sched_info_switch(tas
 	if (next != rq->idle)
 		sched_info_arrive(next);
 }
+
+static inline void sched_info_switch(task_t *prev, task_t *next)
+{
+	if (unlikely(schedstats_on))
+		__sched_info_switch(prev, next);
+}
 #else
 #define sched_info_queued(t)		do { } while (0)
 #define sched_info_switch(t, next)	do { } while (0)

Index: linux-2.6.16-rc4/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc4.orig/kernel/sysctl.c	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/kernel/sysctl.c	2006-02-27 01:52:52.000000000 -0500
@@ -656,6 +656,16 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#if defined(CONFIG_SCHEDSTATS)
+	{
+		.ctl_name	= KERN_SCHEDSTATS,
+		.procname	= "schedstats",
+		.data		= &schedstats_sysctl,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &schedstats_sysctl_handler,
+	},
+#endif
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.16-rc4/lib/Kconfig.debug
===================================================================
--- linux-2.6.16-rc4.orig/lib/Kconfig.debug	2006-02-27 01:20:04.000000000 -0500
+++ linux-2.6.16-rc4/lib/Kconfig.debug	2006-02-27 01:52:52.000000000 -0500
@@ -67,15 +67,17 @@ config DETECT_SOFTLOCKUP
 
 config SCHEDSTATS
 	bool "Collect scheduler statistics"
-	depends on DEBUG_KERNEL && PROC_FS
+	depends on PROC_FS
 	help
 	  If you say Y here, additional code will be inserted into the
 	  scheduler and related routines to collect statistics about
 	  scheduler behavior and provide them in /proc/schedstat.  These
-	  stats may be useful for both tuning and debugging the scheduler
+	  stats may be useful for both tuning and debugging the scheduler.
 	  If you aren't debugging the scheduler or trying to tune a specific
 	  application, you can say N to avoid the very slight overhead
 	  this adds.
+	  Schedstats collection, and most of its overhead, can also be
+	  controlled dynamically through the schedstats sysctl.
 
 config DEBUG_SLAB
 	bool "Debug memory allocations"

Index: linux-2.6.16-rc4/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.16-rc4.orig/Documentation/kernel-parameters.txt	2006-02-27 01:19:52.000000000 -0500
+++ linux-2.6.16-rc4/Documentation/kernel-parameters.txt	2006-02-27 01:52:52.000000000 -0500
@@ -1333,6 +1333,8 @@ running once the system is up.
 	sc1200wdt=	[HW,WDT] SC1200 WDT (watchdog) driver
 			Format: <io>[,<timeout>[,<isapnp>]]
 
+	schedstats	[KNL] Collect CPU scheduler statistics
+
 	scsi_debug_*=	[SCSI] See drivers/scsi/scsi_debug.c.
|
From: Ingo M. <mi...@el...> - 2006-02-27 08:53:29
|
the principle looks OK to me, just a few minor nits:

> #ifdef CONFIG_SCHEDSTATS
> +
> +int schedstats_sysctl = 0;	/* schedstats turned off by default */

no need to initialize to 0.

> +static DEFINE_PER_CPU(int, schedstats) = 0;

ditto.

> +
> +static void schedstats_set(int val)
> +{
> +	int i;
> +	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;

move the spinlock out of the function and use DEFINE_SPINLOCK. (But ... see below for a suggestion to get rid of this lock altogether.)

> +	spin_lock(&schedstats_lock);
> +	schedstats_sysctl = val;
> +	for (i = 0; i < NR_CPUS; i++)
> +		per_cpu(schedstats, i) = val;
> +	spin_unlock(&schedstats_lock);
> +}
> +
> +static int __init schedstats_setup_enable(char *str)
> +{
> +	schedstats_sysctl = 1;
> +	schedstats_set(schedstats_sysctl);
> +	return 1;
> +}
> +
> +__setup("schedstats", schedstats_setup_enable);
> +
> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret, prev = schedstats_sysctl;
> +	struct task_struct *g, *t;
> +
> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +	if ((ret != 0) || (prev == schedstats_sysctl))
> +		return ret;
> +	if (schedstats_sysctl) {
> +		read_lock(&tasklist_lock);
> +		do_each_thread(g, t) {
> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
> +		} while_each_thread(g, t);
> +		read_unlock(&tasklist_lock);
> +	}
> +	schedstats_set(schedstats_sysctl);

why not just introduce a schedstats_lock mutex, and acquire it for both the 'if (schedstats_sysctl)' line and the schedstats_set() line. That will make the locking meaningful: two parallel sysctl ops will be atomic to each other. [right now they won't be, and they can clear schedstat data in parallel -> not a big problem, but it makes schedstats_lock rather meaningless]

> -#define SCHEDSTAT_VERSION 12
> +#define SCHEDSTAT_VERSION 13
>
>  static int show_schedstat(struct seq_file *seq, void *v)
>  {
> @@ -394,6 +439,10 @@ static int show_schedstat(struct seq_fil
>
>  	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
>  	seq_printf(seq, "timestamp %lu\n", jiffies);
> +	if (!schedstats_sysctl) {
> +		seq_printf(seq, "State Off\n");
> +		return 0;
> +	}

and show_schedstat() should then also take the schedstats_lock mutex.

	Ingo
|
From: Balbir S. <ba...@in...> - 2006-02-27 10:46:45
|
<snip>
> why not just introduce a schedstats_lock mutex, and acquire it for both
> the 'if (schedstats_sysctl)' line and the schedstats_set() line. That
> will make the locking meaningful: two parallel sysctl ops will be atomic
> to each other. [right now they wont be and they can clear schedstat data
> in parallel -> not a big problem but it makes schedstats_lock rather
> meaningless]

Ingo,

Can sysctls run in parallel? sys_sysctl() protects the call to do_sysctl() with the BKL (lock_kernel/unlock_kernel).

Am I missing something?

Balbir
|
From: Arjan v. de V. <ar...@in...> - 2006-02-27 12:19:12
|
On Mon, 2006-02-27 at 16:16 +0530, Balbir Singh wrote:
> <snip>
> > why not just introduce a schedstats_lock mutex, and acquire it for both
> > the 'if (schedstats_sysctl)' line and the schedstats_set() line. That
> > will make the locking meaningful: two parallel sysctl ops will be atomic
> > to each other. [right now they wont be and they can clear schedstat data
> > in parallel -> not a big problem but it makes schedstats_lock rather
> > meaningless]
>
> Ingo,
>
> Can sysctl's run in parallel? sys_sysctl() is protects the call
> to do_sysctl() with BKL (lock_kernel/unlock_kernel).
>
> Am I missing something?

your sysctl functions sleep. the BKL is useless in the light of sleeping code...
|
From: Balbir S. <ba...@in...> - 2006-02-27 12:29:59
|
> your sysctl functions sleep. the BKL is useless in the light of sleeping
> code...

But wouldn't all sysctls potentially sleep (on account of copying data from the user)?

Thanks for clarifying,
Balbir
|
From: Arjan v. de V. <ar...@in...> - 2006-02-27 13:07:31
|
On Mon, 2006-02-27 at 17:59 +0530, Balbir Singh wrote:
> > your sysctl functions sleep. the BKL is useless in the light of sleeping
> > code...
>
> But wouldn't all sysctls potentially sleep (on account of copying data from
> the user).

.. I'm not the one saying the BKL was useful... you were ;)
|
From: Balbir S. <ba...@in...> - 2006-02-27 16:16:47
|
> > But wouldn't all sysctls potentially sleep (on account of copying data from
> > the user).
>
> .. I'm not the one saying the BKL was useful... you were ;)

My tiny mind must have been confused by the presence of the code, which I presumed would be useful. I guess that is not always the case :-)

Balbir
|
From: Nick P. <nic...@ya...> - 2006-02-27 09:17:52
|
Shailabh Nagar wrote:

> schedstats-sysctl.patch
>
> Add sysctl option for controlling schedstats collection
> dynamically. Delay accounting leverages schedstats for
> cpu delay statistics.

I'd sort of rather not tie this in with schedstats if possible. Schedstats adds a reasonable amount of cache footprint and branches in hot paths. Most of the schedstats stuff is something that hardly anyone will use.

Sure, you can share common code though...

> Index: linux-2.6.16-rc4/kernel/sched.c
> ===================================================================
> --- linux-2.6.16-rc4.orig/kernel/sched.c	2006-02-27 01:20:04.000000000 -0500
> +++ linux-2.6.16-rc4/kernel/sched.c	2006-02-27 01:52:52.000000000 -0500
> @@ -382,11 +382,56 @@ static inline void task_rq_unlock(runque
>  }
>
>  #ifdef CONFIG_SCHEDSTATS
> +
> +int schedstats_sysctl = 0;	/* schedstats turned off by default */

Should be read mostly.

> +static DEFINE_PER_CPU(int, schedstats) = 0;
> +

When the above is in the read mostly section, you won't need this at all. You don't intend to switch the sysctl with great frequency, do you?

> +static void schedstats_set(int val)
> +{
> +	int i;
> +	static spinlock_t schedstats_lock = SPIN_LOCK_UNLOCKED;
> +
> +	spin_lock(&schedstats_lock);
> +	schedstats_sysctl = val;
> +	for (i = 0; i < NR_CPUS; i++)
> +		per_cpu(schedstats, i) = val;
> +	spin_unlock(&schedstats_lock);
> +}
> +
> +static int __init schedstats_setup_enable(char *str)
> +{
> +	schedstats_sysctl = 1;
> +	schedstats_set(schedstats_sysctl);
> +	return 1;
> +}
> +
> +__setup("schedstats", schedstats_setup_enable);
> +
> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int ret, prev = schedstats_sysctl;
> +	struct task_struct *g, *t;
> +
> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +	if ((ret != 0) || (prev == schedstats_sysctl))
> +		return ret;
> +	if (schedstats_sysctl) {
> +		read_lock(&tasklist_lock);
> +		do_each_thread(g, t) {
> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
> +		} while_each_thread(g, t);
> +		read_unlock(&tasklist_lock);
> +	}
> +	schedstats_set(schedstats_sysctl);

You don't clear the rq's schedstats stuff here.

And clearing this at all is not really needed for the schedstats interface. You have a timestamp and a set of accumulated values, so it is easy to work out deltas. So do you need this?

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com
|
From: Shailabh N. <na...@wa...> - 2006-02-27 09:41:48
|
Nick Piggin wrote:

<snip>

>> +int schedstats_sysctl_handler(ctl_table *table, int write, struct file *filp,
>> +		void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	int ret, prev = schedstats_sysctl;
>> +	struct task_struct *g, *t;
>> +
>> +	ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);
>> +	if ((ret != 0) || (prev == schedstats_sysctl))
>> +		return ret;
>> +	if (schedstats_sysctl) {
>> +		read_lock(&tasklist_lock);
>> +		do_each_thread(g, t) {
>> +			memset(&t->sched_info, 0, sizeof(t->sched_info));
>> +		} while_each_thread(g, t);
>> +		read_unlock(&tasklist_lock);
>> +	}
>> +	schedstats_set(schedstats_sysctl);
>
> You don't clear the rq's schedstats stuff here.

Good point.

> And clearing this at all is not really needed for the schedstats
> interface. You have a timestamp and a set of accumulated values, so it
> is easy to work out deltas. So do you need this?

Not clearing the stats will mean userspace has to distinguish between the tasks that are hanging around from before the last turn-off and the ones created afterwards. Any delta taken across an interval where schedstats was turned off will give the impression that a task was sleeping during the interval (and hence show it had a lesser average wait time than it might actually have experienced).

--Shailabh
|
From: Nick P. <nic...@ya...> - 2006-02-27 12:28:30
|
Shailabh Nagar wrote:

> Nick Piggin wrote:
>
>> And clearing this at all is not really needed for the schedstats
>> interface. You have a timestamp and a set of accumulated values, so it
>> is easy to work out deltas. So do you need this?
>
> Not clearing the stats will mean userspace has to distinguish between
> the tasks that are hanging around from before the last turn off, and the
> ones created afterwards. Any delta taken across an interval where
> schedstats was turned off will give the impression a task was sleeping
> during the interval (and hence show it had a lesser average wait time
> than it might actually have experienced).

Presumably a delta taken across an interval where schedstats was turned off would be rather inaccurate, no matter what.

--
SUSE Labs, Novell Inc.
|
From: Chandra S. <sek...@us...> - 2006-02-27 19:09:54
|
On Mon, 2006-02-27 at 20:17 +1100, Nick Piggin wrote:
> > #ifdef CONFIG_SCHEDSTATS
> > +
> > +int schedstats_sysctl = 0;	/* schedstats turned off by default */
>
> Should be read mostly.
>
> > +static DEFINE_PER_CPU(int, schedstats) = 0;
> > +
>
> When the above is in the read mostly section, you won't need this at all.
>
> You don't intend to switch the sysctl with great frequency, do you?

No, it is not expected to switch often. We originally coded it as __read_mostly, but thought the variable bouncing between CPUs would be costly. Is it cheaper with __read_mostly? Or does it not matter?

--
----------------------------------------------------------------------
Chandra Seetharaman  | Be careful what you choose....
- sek...@us...       | .......you may get it.
----------------------------------------------------------------------
|
From: Balbir S. <ba...@in...> - 2006-03-07 17:27:16
|
On Mon, Feb 27, 2006 at 08:17:47PM +1100, Nick Piggin wrote:
> Shailabh Nagar wrote:
> >schedstats-sysctl.patch
> >
> >Add sysctl option for controlling schedstats collection
> >dynamically. Delay accounting leverages schedstats for
> >cpu delay statistics.
> >
>
> I'd sort of rather not tie this in with schedstats if possible.
> Schedstats adds a reasonable amount of cache footprint and
> branches in hot paths. Most of schedstats stuff is something that
> hardly anyone will use.
>
> Sure you can share common code though...

This patch refines scheduler statistics collection and display to three levels: tasks, runqueues and sched domains. CONFIG_SCHEDSTATS now requires a boot time option, in the form of "schedstats" or "schedstats=<level>", to display scheduler statistics. All levels can be enabled together, for complete statistics, by passing "all" as a boot time option.

schedstat_inc has been split into schedstat_rq_inc and schedstat_sd_inc, each of which checks whether rq or sd statistics gathering is enabled. schedstat_add has been changed to schedstat_sd_add; it checks whether sd statistics gathering is on prior to gathering the statistics. Similar changes have been made for task schedstats gathering.

The output of /proc/schedstat, /proc/<pid>/schedstat and /proc/<pid>/task/*/schedstat has been modified to print a gentle message suggesting that statistics gathering is off. Also, a header "statistics for cpuXXX" has been added (for each cpu) to the /proc/schedstat output.

This patch is motivated by the comments above about sharing code with CONFIG_SCHEDSTATS without incurring the complete overhead of the entire CONFIG_SCHEDSTATS code.

Testing
=======

a) booted with schedstats (schedstats=all)
------------------------------------------
cat /proc/schedstat
version 13
timestamp 4294922919

statistics for cpu0

cpu0 10 10 132 142 240 97775 24181 48251 45935 5664 2376 73594
domain0 00000003 24464 24242 162 224 62 4 0 24242 149 148 0 1 1 0 0 148 24290 24117 64 173 109 0 0 24117 0 0 0 0 0 0 0 0 0 4392 835 0

statistics for cpu1

cpu1 3 3 180 14504 387 50735 6520 15430 11035 4195 2376 44215
domain0 00000003 25870 25203 107 695 588 3 0 25203 198 198 0 0 0 0 0 198 6608 6428 91 184 93 0 0 6428 0 0 0 0 0 0 0 0 0 2316 307 0

cat /proc/1/schedstat
506 34 1102

b) booted with schedstats=tasks
-------------------------------
cat /proc/schedstat
version 13
timestamp 4294937241
runqueue and scheddomain stats are not enabled

cat /proc/1/schedstat
505 58 1097

c) booted with schedstats=rq
----------------------------
cat /proc/schedstat
version 13
timestamp 4294913832

statistics for cpu0

cpu0 14 14 56 102 260 96332 18867 47278 45064 3556397949 2216 77465
scheddomain stats are not enabled

statistics for cpu1

cpu1 3 3 12134 12138 333 42878 4224 12874 8779 1714457722 2071 38654
scheddomain stats are not enabled

cat /proc/1/schedstat
tasks schedstats is not enabled

d) booted with schedstats=sd
----------------------------
cat /proc/schedstat
version 13
timestamp 4294936220

statistics for cpu0

runqueue stats are not enabled
domain0 00000003 38048 37802 140 248 108 0 0 37802 151 149 0 3 3 0 0 149 27574 27417 59 158 99 0 0 27417 0 0 0 0 0 0 0 0 0 4168 827 0

statistics for cpu1

runqueue stats are not enabled
domain0 00000003 39094 38441 119 682 563 3 0 38441 199 196 0 5 5 0 0 196 9167 8970 107 203 96 0 0 8970 0 0 0 0 0 0 0 0 0 2159 330 0

cat /proc/1/schedstat
tasks schedstats is not enabled

Alternatives considered
=======================

The other alternative considered was that instead of changing the format of /proc/schedstat and /proc/<pid>*/schedstat, we could print zeros for all levels for which statistics are not collected. But zeros could be treated as valid values, so this solution was not implemented.

Limitations
===========

The effectiveness of this patch is limited to run-time statistics collection. The run-time overhead is proportional to the level of statistics enabled. The space consumed is the same as before.

This patch was created against 2.6.16-rc5

Signed-off-by: Balbir Singh <ba...@in...>
---

 Documentation/kernel-parameters.txt |   11 ++
 fs/proc/base.c                      |    4 
 include/linux/sched.h               |   23 ++++
 kernel/sched.c                      |  195 +++++++++++++++++++++++++-----------
 4 files changed, 175 insertions(+), 58 deletions(-)

diff -puN kernel/sched.c~schedstats_refinement kernel/sched.c
--- linux-2.6.16-rc5/kernel/sched.c~schedstats_refinement	2006-03-07 16:14:59.000000000 +0530
+++ linux-2.6.16-rc5-balbir/kernel/sched.c	2006-03-07 20:43:46.000000000 +0530
@@ -386,7 +386,28 @@ static inline void task_rq_unlock(runque
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 12
+#define SCHEDSTAT_VERSION 13
+
+int schedstats_on __read_mostly;
+
+/*
+ * Parse the schedstats options passed at boot time
+ */
+static int __init schedstats_setup_enable(char *str)
+{
+	if (!str || !strcmp(str, "") || !strcmp(str, "=all"))
+		schedstats_on = SCHEDSTATS_ALL;
+	else if (!strcmp(str, "=tasks"))
+		schedstats_on = SCHEDSTATS_TASKS;
+	else if (!strcmp(str, "=sd"))
+		schedstats_on = SCHEDSTATS_SD;
+	else if (!strcmp(str, "=rq"))
+		schedstats_on = SCHEDSTATS_RQ;
+
+	return 1;
+}
+
+__setup("schedstats", schedstats_setup_enable);

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -394,26 +415,44 @@ static int show_schedstat(struct seq_fil
 
 	seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
 	seq_printf(seq, "timestamp %lu\n", jiffies);
+
+	if (!schedstats_rq_on() && !schedstats_sd_on()) {
+		seq_printf(seq, "runqueue and scheddomain stats are not "
+				"enabled\n");
+		return 0;
+	}
+
 	for_each_online_cpu(cpu) {
 		runqueue_t *rq = cpu_rq(cpu);
 #ifdef CONFIG_SMP
 		struct sched_domain *sd;
 		int dcnt = 0;
 #endif
+		seq_printf(seq, "\nstatistics for cpu%d\n\n", cpu);
 
-		/* runqueue-specific stats */
-		seq_printf(seq,
-		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
-		    cpu, rq->yld_both_empty,
-		    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
-		    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
-		    rq->ttwu_cnt, rq->ttwu_local,
-		    rq->rq_sched_info.cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
+		if (schedstats_rq_on()) {
+			/* runqueue-specific stats */
+			seq_printf(seq,
+			    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu "
+			    "%lu",
+			    cpu, rq->yld_both_empty,
+			    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
+			    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
+			    rq->ttwu_cnt, rq->ttwu_local,
+			    rq->rq_sched_info.cpu_time,
+			    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
 
-		seq_printf(seq, "\n");
+			seq_printf(seq, "\n");
+		} else
+			seq_printf(seq, "runqueue stats are not enabled\n");
 
 #ifdef CONFIG_SMP
+
+		if (!schedstats_sd_on()) {
+			seq_printf(seq, "scheddomain stats are not enabled\n");
+			continue;
+		}
+
 		/* domain-specific stats */
 		preempt_disable();
 		for_each_domain(cpu, sd) {
@@ -472,11 +511,30 @@ struct file_operations proc_schedstat_op
 	.release = single_release,
 };
 
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
+# define schedstat_sd_inc(sd, field)		\
+do {						\
+	if (unlikely(schedstats_sd_on()))	\
+		(sd)->field++;			\
+} while (0)
+
+# define schedstat_sd_add(sd, field, amt)	\
+do {						\
+	if (unlikely(schedstats_sd_on()))	\
+		(sd)->field += (amt);		\
+} while (0)
+
+# define schedstat_rq_inc(rq, field)		\
+do {						\
+	if (unlikely(schedstats_rq_on()))	\
+		(rq)->field++;			\
+} while (0)
+
 #else /* !CONFIG_SCHEDSTATS */
-# define schedstat_inc(rq, field)	do { } while (0)
-# define schedstat_add(rq, field, amt)	do { } while (0)
+
+# define schedstat_sd_inc(rq, field)		do { } while (0)
+# define schedstat_sd_add(rq, field, amt)	do { } while (0)
+# define schedstat_rq_inc(rq, field)		do { } while (0)
+
 #endif
 
 /*
@@ -515,6 +573,15 @@ static inline void sched_info_dequeued(t
 	t->sched_info.last_queued = 0;
 }
 
+static void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff)
+{
+	if (!schedstats_rq_on() || !rq)
+		return;
+
+	rq->rq_sched_info.run_delay += diff;
+	rq->rq_sched_info.pcnt++;
+}
+
 /*
  * Called when a task finally hits the cpu. We can now calculate how
  * long it was waiting to run. We also note when it began so that we
@@ -523,20 +590,23 @@ static inline void sched_info_dequeued(t
 static void sched_info_arrive(task_t *t)
 {
 	unsigned long now = jiffies, diff = 0;
-	struct runqueue *rq = task_rq(t);
 
+	if (!schedstats_tasks_on() && !schedstats_rq_on())
+		return;
+
+	/*
+	 * diff is required in case schedstats is on for tasks or rq
+	 */
 	if (t->sched_info.last_queued)
 		diff = now - t->sched_info.last_queued;
 	sched_info_dequeued(t);
-	t->sched_info.run_delay += diff;
-	t->sched_info.last_arrival = now;
-	t->sched_info.pcnt++;
-
-	if (!rq)
-		return;
 
-	rq->rq_sched_info.run_delay += diff;
-	rq->rq_sched_info.pcnt++;
+	if (schedstats_tasks_on()) {
+		t->sched_info.run_delay += diff;
+		t->sched_info.last_arrival = now;
+		t->sched_info.pcnt++;
+	}
+	rq_sched_info_arrive(task_rq(t), diff);
 }
 
 /*
@@ -556,23 +626,32 @@ static void sched_info_arrive(task_t *t)
  */
 static inline void sched_info_queued(task_t *t)
 {
+	if (!schedstats_tasks_on() && !schedstats_rq_on())
+		return;
+
 	if (!t->sched_info.last_queued)
 		t->sched_info.last_queued = jiffies;
 }
 
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff)
+{
+	if (!schedstats_rq_on() || !rq)
+		return;
+
+	rq->rq_sched_info.cpu_time += diff;
+}
+
 /*
  * Called when a process ceases being the active-running process, either
  * voluntarily or involuntarily.  Now we can calculate how long we ran.
 */
 static inline void sched_info_depart(task_t *t)
 {
-	struct runqueue *rq = task_rq(t);
 	unsigned long diff = jiffies - t->sched_info.last_arrival;
 
-	t->sched_info.cpu_time += diff;
-
-	if (rq)
-		rq->rq_sched_info.cpu_time += diff;
+	if (schedstats_tasks_on())
+		t->sched_info.cpu_time += diff;
+	rq_sched_info_depart(task_rq(t), diff);
 }
 
 /*
@@ -1190,15 +1269,15 @@ static int try_to_wake_up(task_t *p, uns
 
 	new_cpu = cpu;
 
-	schedstat_inc(rq, ttwu_cnt);
+	schedstat_rq_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
-		schedstat_inc(rq, ttwu_local);
+		schedstat_rq_inc(rq, ttwu_local);
 		goto out_set_cpu;
 	}
 
 	for_each_domain(this_cpu, sd) {
 		if (cpu_isset(cpu, sd->span)) {
-			schedstat_inc(sd, ttwu_wake_remote);
+			schedstat_sd_inc(sd, ttwu_wake_remote);
 			this_sd = sd;
 			break;
 		}
@@ -1239,7 +1318,7 @@ static int try_to_wake_up(task_t *p, uns
 			 * p is cache cold in this domain, and
 			 * there is no bad imbalance.
 			 */
-			schedstat_inc(this_sd, ttwu_move_affine);
+			schedstat_sd_inc(this_sd, ttwu_move_affine);
 			goto out_set_cpu;
 		}
 	}
@@ -1250,7 +1329,7 @@ static int try_to_wake_up(task_t *p, uns
 	 */
 	if (this_sd->flags & SD_WAKE_BALANCE) {
 		if (imbalance*this_load <= 100*load) {
-			schedstat_inc(this_sd, ttwu_move_balance);
+			schedstat_sd_inc(this_sd, ttwu_move_balance);
 			goto out_set_cpu;
 		}
 	}
@@ -1894,7 +1973,7 @@ skip_queue:
 
 #ifdef CONFIG_SCHEDSTATS
 	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
-		schedstat_inc(sd, lb_hot_gained[idle]);
+		schedstat_sd_inc(sd, lb_hot_gained[idle]);
 #endif
 
 	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
@@ -1913,7 +1992,7 @@ out:
 	 * so we can safely collect pull_task() stats here rather than
 	 * inside pull_task().
*/ - schedstat_add(sd, lb_gained[idle], pulled); + schedstat_sd_add(sd, lb_gained[idle], pulled); if (all_pinned) *all_pinned = pinned; @@ -2109,23 +2188,23 @@ static int load_balance(int this_cpu, ru if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER) sd_idle = 1; - schedstat_inc(sd, lb_cnt[idle]); + schedstat_sd_inc(sd, lb_cnt[idle]); group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle); if (!group) { - schedstat_inc(sd, lb_nobusyg[idle]); + schedstat_sd_inc(sd, lb_nobusyg[idle]); goto out_balanced; } busiest = find_busiest_queue(group, idle); if (!busiest) { - schedstat_inc(sd, lb_nobusyq[idle]); + schedstat_sd_inc(sd, lb_nobusyq[idle]); goto out_balanced; } BUG_ON(busiest == this_rq); - schedstat_add(sd, lb_imbalance[idle], imbalance); + schedstat_sd_add(sd, lb_imbalance[idle], imbalance); nr_moved = 0; if (busiest->nr_running > 1) { @@ -2146,7 +2225,7 @@ static int load_balance(int this_cpu, ru } if (!nr_moved) { - schedstat_inc(sd, lb_failed[idle]); + schedstat_sd_inc(sd, lb_failed[idle]); sd->nr_balance_failed++; if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) { @@ -2199,7 +2278,7 @@ static int load_balance(int this_cpu, ru return nr_moved; out_balanced: - schedstat_inc(sd, lb_balanced[idle]); + schedstat_sd_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0; @@ -2233,22 +2312,22 @@ static int load_balance_newidle(int this if (sd->flags & SD_SHARE_CPUPOWER) sd_idle = 1; - schedstat_inc(sd, lb_cnt[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_cnt[NEWLY_IDLE]); group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle); if (!group) { - schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_nobusyg[NEWLY_IDLE]); goto out_balanced; } busiest = find_busiest_queue(group, NEWLY_IDLE); if (!busiest) { - schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_nobusyq[NEWLY_IDLE]); goto out_balanced; } BUG_ON(busiest == this_rq); - schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance); + 
schedstat_sd_add(sd, lb_imbalance[NEWLY_IDLE], imbalance); nr_moved = 0; if (busiest->nr_running > 1) { @@ -2260,7 +2339,7 @@ static int load_balance_newidle(int this } if (!nr_moved) { - schedstat_inc(sd, lb_failed[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_failed[NEWLY_IDLE]); if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER) return -1; } else @@ -2269,7 +2348,7 @@ static int load_balance_newidle(int this return nr_moved; out_balanced: - schedstat_inc(sd, lb_balanced[NEWLY_IDLE]); + schedstat_sd_inc(sd, lb_balanced[NEWLY_IDLE]); if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER) return -1; sd->nr_balance_failed = 0; @@ -2333,12 +2412,12 @@ static void active_load_balance(runqueue if (unlikely(sd == NULL)) goto out; - schedstat_inc(sd, alb_cnt); + schedstat_sd_inc(sd, alb_cnt); if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE, NULL)) - schedstat_inc(sd, alb_pushed); + schedstat_sd_inc(sd, alb_pushed); else - schedstat_inc(sd, alb_failed); + schedstat_sd_inc(sd, alb_failed); out: spin_unlock(&target_rq->lock); } @@ -2906,7 +2985,7 @@ need_resched_nonpreemptible: dump_stack(); } - schedstat_inc(rq, sched_cnt); + schedstat_rq_inc(rq, sched_cnt); now = sched_clock(); if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { run_time = now - prev->timestamp; @@ -2974,7 +3053,7 @@ go_idle: /* * Switch the active and expired arrays. 
*/ - schedstat_inc(rq, sched_switch); + schedstat_rq_inc(rq, sched_switch); rq->active = rq->expired; rq->expired = array; array = rq->active; @@ -3007,7 +3086,7 @@ go_idle: next->activated = 0; switch_tasks: if (next == rq->idle) - schedstat_inc(rq, sched_goidle); + schedstat_rq_inc(rq, sched_goidle); prefetch(next); prefetch_stack(next); clear_tsk_need_resched(prev); @@ -3979,7 +4058,7 @@ asmlinkage long sys_sched_yield(void) prio_array_t *array = current->array; prio_array_t *target = rq->expired; - schedstat_inc(rq, yld_cnt); + schedstat_rq_inc(rq, yld_cnt); /* * We implement yielding by moving the task into the expired * queue. @@ -3991,11 +4070,11 @@ asmlinkage long sys_sched_yield(void) target = rq->active; if (array->nr_active == 1) { - schedstat_inc(rq, yld_act_empty); + schedstat_rq_inc(rq, yld_act_empty); if (!rq->expired->nr_active) - schedstat_inc(rq, yld_both_empty); + schedstat_rq_inc(rq, yld_both_empty); } else if (!rq->expired->nr_active) - schedstat_inc(rq, yld_exp_empty); + schedstat_rq_inc(rq, yld_exp_empty); if (array != target) { dequeue_task(current, array); diff -puN include/linux/sched.h~schedstats_refinement include/linux/sched.h --- linux-2.6.16-rc5/include/linux/sched.h~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/include/linux/sched.h 2006-03-07 20:46:10.000000000 +0530 @@ -525,6 +525,29 @@ struct backing_dev_info; struct reclaim_state; #ifdef CONFIG_SCHEDSTATS + +#define SCHEDSTATS_TASKS 0x1 +#define SCHEDSTATS_RQ 0x2 +#define SCHEDSTATS_SD 0x4 +#define SCHEDSTATS_ALL (SCHEDSTATS_TASKS | SCHEDSTATS_RQ | SCHEDSTATS_SD) + +extern int schedstats_on; + +static inline int schedstats_tasks_on(void) +{ + return schedstats_on & SCHEDSTATS_TASKS; +} + +static inline int schedstats_rq_on(void) +{ + return schedstats_on & SCHEDSTATS_RQ; +} + +static inline int schedstats_sd_on(void) +{ + return schedstats_on & SCHEDSTATS_SD; +} + struct sched_info { /* cumulative counters */ unsigned long cpu_time, /* time 
spent on the cpu */ diff -puN Documentation/kernel-parameters.txt~schedstats_refinement Documentation/kernel-parameters.txt --- linux-2.6.16-rc5/Documentation/kernel-parameters.txt~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/Documentation/kernel-parameters.txt 2006-03-07 20:48:44.000000000 +0530 @@ -1333,6 +1333,17 @@ running once the system is up. sc1200wdt= [HW,WDT] SC1200 WDT (watchdog) driver Format: <io>[,<timeout>[,<isapnp>]] + schedstats [KNL] + Enable all schedstats if CONFIG_SCHEDSTATS is defined + same as the schedstats=all + + schedstats= [KNL] + Format: {"all", "tasks", "sd", "rq"} + all -- turns on the complete schedstats + rq -- turns on schedstats only for runqueue + sd -- turns on schedstats only for scheddomains + tasks -- turns on schedstats only for tasks + scsi_debug_*= [SCSI] See drivers/scsi/scsi_debug.c. diff -puN fs/proc/base.c~schedstats_refinement fs/proc/base.c --- linux-2.6.16-rc5/fs/proc/base.c~schedstats_refinement 2006-03-07 16:14:59.000000000 +0530 +++ linux-2.6.16-rc5-balbir/fs/proc/base.c 2006-03-07 16:17:39.000000000 +0530 @@ -72,6 +72,7 @@ #include <linux/cpuset.h> #include <linux/audit.h> #include <linux/poll.h> +#include <linux/sched.h> #include "internal.h" /* @@ -504,6 +505,9 @@ static int proc_pid_wchan(struct task_st */ static int proc_pid_schedstat(struct task_struct *task, char *buffer) { + if (!schedstats_tasks_on()) + return sprintf(buffer, "tasks schedstats is not enabled\n"); + return sprintf(buffer, "%lu %lu %lu\n", task->sched_info.cpu_time, task->sched_info.run_delay, _ |
From: Bryan O'S. <bo...@se...> - 2006-02-27 17:05:30
|
On Mon, 2006-02-27 at 03:12 -0500, Shailabh Nagar wrote: > Add sysctl option for controlling schedstats collection > dynamically. Delay accounting leverages schedstats for > cpu delay statistics. Is there some reason you're using the sysctl interface, and not say sysfs instead? |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:22:47
|
> +/* > + * timespec_diff_ns - Return difference of two timestamps in nanoseconds > + * In the rare case of @end being earlier than @start, return zero > + */ > +static inline nsec_t timespec_diff_ns(struct timespec *start, struct timespec *end) > +{ > + nsec_t ret; > + > + ret = (nsec_t)(end->tv_sec - start->tv_sec)*NSEC_PER_SEC; > + ret += (nsec_t)(end->tv_nsec - start->tv_nsec); > + if (ret < 0) > + return 0; > + return ret; > +} > #endif /* __KERNEL__ */ > wouldn't it be more useful to have this return a timespec as well, and then it'd be generically useful (and it also probably should then be uninlined ;) |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:15:58
|
delayacct-setup.patch Initialization code related to collection of per-task "delay" statistics which measure how long the task had to wait for cpu, sync block io, swapping etc. The collection of statistics and the interface are in other patches. This patch sets up the data structures and allows the statistics collection to be disabled through a kernel boot parameter. Signed-off-by: Shailabh Nagar <na...@wa...> Documentation/kernel-parameters.txt | 2 + include/linux/delayacct.h | 55 ++++++++++++++++++++++++++++++ include/linux/sched.h | 15 ++++++++ init/Kconfig | 13 +++++++ init/main.c | 2 + kernel/Makefile | 1 kernel/delayacct.c | 65 ++++++++++++++++++++++++++++++++++++ kernel/exit.c | 3 + kernel/fork.c | 2 + 9 files changed, 158 insertions(+) Index: linux-2.6.16-rc4/Documentation/kernel-parameters.txt =================================================================== --- linux-2.6.16-rc4.orig/Documentation/kernel-parameters.txt 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/Documentation/kernel-parameters.txt 2006-02-27 01:52:54.000000000 -0500 @@ -410,6 +410,8 @@ running once the system is up. Format: <area>[,<node>] See also Documentation/networking/decnet.txt. + delayacct [KNL] Enable per-task delay accounting + devfs= [DEVFS] See Documentation/filesystems/devfs/boot-options. 
Index: linux-2.6.16-rc4/kernel/Makefile =================================================================== --- linux-2.6.16-rc4.orig/kernel/Makefile 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/Makefile 2006-02-27 01:52:54.000000000 -0500 @@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: linux-2.6.16-rc4/include/linux/delayacct.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc4/include/linux/delayacct.h 2006-02-27 01:52:54.000000000 -0500 @@ -0,0 +1,55 @@ +/* delayacct.h - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
+ */ + +#ifndef _LINUX_TASKDELAYS_H +#define _LINUX_TASKDELAYS_H + +#include <linux/sched.h> + +#ifdef CONFIG_TASK_DELAY_ACCT +extern int delayacct_on; /* Delay accounting turned on/off */ +extern kmem_cache_t *delayacct_cache; +extern int delayacct_init(void); +extern void __delayacct_tsk_init(struct task_struct *); +extern void __delayacct_tsk_exit(struct task_struct *); + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ + /* reinitialize in case parent's non-null pointer was dup'ed*/ + tsk->delays = NULL; + if (unlikely(delayacct_on)) + __delayacct_tsk_init(tsk); +} + +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{ + if (unlikely(tsk->delays)) + __delayacct_tsk_exit(tsk); +} + +static inline void delayacct_timestamp_start(void) +{ + if (unlikely(current->delays && delayacct_on)) + do_posix_clock_monotonic_gettime(¤t->delays->start); +} +#else +static inline void delayacct_tsk_init(struct task_struct *tsk) +{} +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{} +static inline void delayacct_timestamp_start(void) +{} +static inline int delayacct_init(void) +{} +#endif /* CONFIG_TASK_DELAY_ACCT */ +#endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.16-rc4/include/linux/sched.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/sched.h 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/sched.h 2006-02-27 01:52:54.000000000 -0500 @@ -543,6 +543,18 @@ struct sched_info { extern struct file_operations proc_schedstat_operations; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +struct task_delay_info { + spinlock_t lock; + + /* timestamp recording variables (to reduce stack usage) */ + struct timespec start, end; + + /* Add stats in pairs: u64 delay, u32 count, aligned properly */ +}; +#endif + + enum idle_type { SCHED_IDLE, @@ -874,6 +886,9 @@ struct task_struct { #endif atomic_t fs_excl; /* holding fs exclusive resources */ struct rcu_head rcu; +#ifdef 
CONFIG_TASK_DELAY_ACCT + struct task_delay_info *delays; +#endif }; static inline pid_t process_group(struct task_struct *tsk) Index: linux-2.6.16-rc4/init/Kconfig =================================================================== --- linux-2.6.16-rc4.orig/init/Kconfig 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/init/Kconfig 2006-02-27 01:52:54.000000000 -0500 @@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASK_DELAY_ACCT + bool "Enable per-task delay accounting (EXPERIMENTAL)" + help + Collect information on time spent by a task waiting for system + resources like cpu, synchronous block I/O completion and swapping + in pages. Such statistics can help in setting a task's priorities + relative to other tasks for cpu, io, rss limits etc. + + Unlike BSD process accounting, this information is available + continuously during the lifetime of a task. + + Say N if unsure. + config SYSCTL bool "Sysctl support" ---help--- Index: linux-2.6.16-rc4/init/main.c =================================================================== --- linux-2.6.16-rc4.orig/init/main.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/init/main.c 2006-02-27 01:52:54.000000000 -0500 @@ -47,6 +47,7 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> +#include <linux/delayacct.h> #include <asm/io.h> #include <asm/bugs.h> @@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void proc_root_init(); #endif cpuset_init(); + delayacct_init(); check_bugs(); Index: linux-2.6.16-rc4/kernel/delayacct.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc4/kernel/delayacct.c 2006-02-27 01:52:54.000000000 -0500 @@ -0,0 +1,65 @@ +/* delayacct.c - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 
2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/time.h> +#include <linux/sysctl.h> +#include <linux/delayacct.h> + +int delayacct_on = 0; /* Delay accounting turned on/off */ +kmem_cache_t *delayacct_cache; + +static int __init delayacct_setup_enable(char *str) +{ + delayacct_on = 1; + return 1; +} +__setup("delayacct", delayacct_setup_enable); + +int delayacct_init(void) +{ + delayacct_cache = kmem_cache_create("delayacct_cache", + sizeof(struct task_delay_info), + 0, + SLAB_PANIC, + NULL, NULL); + if (!delayacct_cache) + return -ENOMEM; + delayacct_tsk_init(&init_task); + return 0; +} + +void __delayacct_tsk_init(struct task_struct *tsk) +{ + tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (tsk->delays) { + memset(tsk->delays, 0, sizeof(*tsk->delays)); + spin_lock_init(&tsk->delays->lock); + } +} + +void __delayacct_tsk_exit(struct task_struct *tsk) +{ + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; +} + +static inline nsec_t delayacct_measure(void) +{ + if ((current->delays->start.tv_sec == 0) && + (current->delays->start.tv_nsec == 0)) + return -EINVAL; + do_posix_clock_monotonic_gettime(¤t->delays->end); + return timespec_diff_ns(¤t->delays->start, ¤t->delays->end); +} Index: linux-2.6.16-rc4/kernel/fork.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/fork.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/fork.c 2006-02-27 01:52:54.000000000 -0500 @@ -44,6 +44,7 @@ #include <linux/rmap.h> #include <linux/acct.h> #include 
<linux/cn_proc.h> +#include <linux/delayacct.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -970,6 +971,7 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; + delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; Index: linux-2.6.16-rc4/kernel/exit.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/exit.c 2006-02-27 01:20:04.000000000 -0500 +++ linux-2.6.16-rc4/kernel/exit.c 2006-02-27 01:52:54.000000000 -0500 @@ -31,6 +31,7 @@ #include <linux/signal.h> #include <linux/cn_proc.h> #include <linux/mutex.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -839,6 +840,8 @@ fastcall NORET_TYPE void do_exit(long co preempt_count()); acct_update_integrals(tsk); + delayacct_tsk_exit(tsk); + if (tsk->mm) { update_hiwater_rss(tsk->mm); update_hiwater_vm(tsk->mm); |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:18:50
|
delayacct-sysctl.patch Adds a sysctl to turn delay accounting on/off dynamically. (defaults to off). When turning off, struct task_delay_info associated with each task need to be cleared. When turning on, tasks without struct task_delay_info need to be allocated one. Signed-off-by: Shailabh Nagar <na...@wa...> Signed-off-by: Balbir Singh <ba...@in...> Signed-off-by: Srivatsa Vaddagiri <va...@in...> include/linux/delayacct.h | 12 +++- include/linux/sysctl.h | 1 kernel/delayacct.c | 128 ++++++++++++++++++++++++++++++++++++++++++++-- kernel/fork.c | 3 - kernel/sysctl.c | 11 +++ 5 files changed, 147 insertions(+), 8 deletions(-) Index: linux-2.6.16-rc4/include/linux/delayacct.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/delayacct.h 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/delayacct.h 2006-02-27 01:52:56.000000000 -0500 @@ -15,18 +15,24 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/sysctl.h> #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ extern kmem_cache_t *delayacct_cache; +int delayacct_sysctl_handler(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); extern int delayacct_init(void); extern void __delayacct_tsk_init(struct task_struct *); extern void __delayacct_tsk_exit(struct task_struct *); -static inline void delayacct_tsk_init(struct task_struct *tsk) +static inline void delayacct_tsk_early_init(struct task_struct *tsk) { - /* reinitialize in case parent's non-null pointer was dup'ed*/ tsk->delays = NULL; +} + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ if (unlikely(delayacct_on)) __delayacct_tsk_init(tsk); } @@ -43,6 +49,8 @@ static inline void delayacct_timestamp_s do_posix_clock_monotonic_gettime(¤t->delays->start); } #else +static inline void delayacct_tsk_early_init(struct task_struct *tsk) +{} static inline void 
delayacct_tsk_init(struct task_struct *tsk) {} static inline void delayacct_tsk_exit(struct task_struct *tsk) Index: linux-2.6.16-rc4/include/linux/sysctl.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/sysctl.h 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/include/linux/sysctl.h 2006-02-27 01:52:56.000000000 -0500 @@ -147,6 +147,7 @@ enum KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */ KERN_SPIN_RETRY=70, /* int: number of spinlock retries */ KERN_SCHEDSTATS=71, /* int: Schedstats on/off */ + KERN_DELAYACCT=74, /* int: Per-task delay accounting on/off */ }; Index: linux-2.6.16-rc4/kernel/delayacct.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/delayacct.c 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/kernel/delayacct.c 2006-02-27 01:52:56.000000000 -0500 @@ -1,6 +1,7 @@ /* delayacct.c - per-task delay accounting * * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 
2006 * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License @@ -42,17 +43,30 @@ int delayacct_init(void) void __delayacct_tsk_init(struct task_struct *tsk) { - tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); - if (tsk->delays) { + struct task_delay_info *delays = NULL; + + delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (!delays) + return; + + task_lock(tsk); + if (!tsk->delays) { + tsk->delays = delays; memset(tsk->delays, 0, sizeof(*tsk->delays)); spin_lock_init(&tsk->delays->lock); - } + } else + kmem_cache_free(delayacct_cache, delays); + task_unlock(tsk); } void __delayacct_tsk_exit(struct task_struct *tsk) { - kmem_cache_free(delayacct_cache, tsk->delays); - tsk->delays = NULL; + task_lock(tsk); + if (tsk->delays) { + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; + } + task_unlock(tsk); } static inline nsec_t delayacct_measure(void) @@ -63,3 +77,107 @@ static inline nsec_t delayacct_measure(v do_posix_clock_monotonic_gettime(¤t->delays->end); return timespec_diff_ns(¤t->delays->start, ¤t->delays->end); } + +/* Allocate task_delay_info for all tasks without one */ +static int alloc_delays(void) +{ + int cnt=0, i, j; + struct task_struct *g, *t; + struct task_delay_info **delayp; + int err = 0; + + read_lock(&tasklist_lock); + do_each_thread(g, t) + if (!t->delays && !(t->flags & (PF_EXITING | PF_DEAD))) + cnt++; + while_each_thread(g, t); + read_unlock(&tasklist_lock); + + if (!cnt) + return 0; +retry_allocs: + + delayp = kmalloc(cnt *sizeof(struct task_delay_info *), GFP_KERNEL); + if (!delayp) + return -ENOMEM; + for (i = 0; i < cnt; i++) { + delayp[i] = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (!delayp[i]) { + err = -ENOMEM; + goto out; + } + memset(delayp[i], 0, sizeof(*delayp[i])); + spin_lock_init(&delayp[i]->lock); + } + + i--; + j = 0; + read_lock(&tasklist_lock); + do_each_thread(g, t) { + 
task_lock(t); + if (t->delays) { + task_unlock(t); + continue; + } + /* Did some additional unaccounted tasks get created */ + if (i < 0) { + j++; + task_unlock(t); + continue; + } + if (!(t->flags & (PF_EXITING | PF_DEAD))) { + t->delays = delayp[i--]; + } + task_unlock(t); + } while_each_thread(g, t); + read_unlock(&tasklist_lock); + + /* + * Retry allocations for all tasks created in between the two + * tasklist_locks + */ + if (j > 0) { + kfree(delayp); + cnt = j; + goto retry_allocs; + } +out: + while (i >= 0) + kmem_cache_free(delayacct_cache, delayp[i--]); + kfree(delayp); + return err; +} + +/* Reset task_delay_info structs for all tasks */ +static void reset_delays(void) +{ + struct task_struct *g, *t; + + read_lock(&tasklist_lock); + do_each_thread(g, t) { + if (!t->delays) + continue; + memset(t->delays, 0, sizeof(struct task_delay_info)); + spin_lock_init(&t->delays->lock); + } while_each_thread(g, t); + read_unlock(&tasklist_lock); +} + +int delayacct_sysctl_handler(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int ret, prev; + + prev = delayacct_on; + ret = proc_dointvec(table, write, filp, buffer, lenp, ppos); + if (ret || (prev == delayacct_on)) + return ret; + + if (delayacct_on) + ret = alloc_delays(); + else + reset_delays(); + if (ret) + delayacct_on = prev; + return ret; +} Index: linux-2.6.16-rc4/kernel/fork.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/fork.c 2006-02-27 01:52:54.000000000 -0500 +++ linux-2.6.16-rc4/kernel/fork.c 2006-02-27 01:52:56.000000000 -0500 @@ -971,7 +971,6 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; - delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; @@ -1013,6 +1012,7 @@ static task_t *copy_process(unsigned lon p->io_wait = NULL; p->audit_context = NULL; cpuset_fork(p); + 
delayacct_tsk_early_init(p); #ifdef CONFIG_NUMA p->mempolicy = mpol_copy(p->mempolicy); if (IS_ERR(p->mempolicy)) { @@ -1191,6 +1191,7 @@ static task_t *copy_process(unsigned lon total_forks++; spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); + delayacct_tsk_init(p); proc_fork_connector(p); return p; Index: linux-2.6.16-rc4/kernel/sysctl.c =================================================================== --- linux-2.6.16-rc4.orig/kernel/sysctl.c 2006-02-27 01:52:52.000000000 -0500 +++ linux-2.6.16-rc4/kernel/sysctl.c 2006-02-27 01:52:56.000000000 -0500 @@ -44,6 +44,7 @@ #include <linux/limits.h> #include <linux/dcache.h> #include <linux/syscalls.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/processor.h> @@ -666,6 +667,16 @@ static ctl_table kern_table[] = { .proc_handler = &schedstats_sysctl_handler, }, #endif +#if defined(CONFIG_TASK_DELAY_ACCT) + { + .ctl_name = KERN_DELAYACCT, + .procname = "delayacct", + .data = &delayacct_on, + .maxlen = sizeof (int), + .mode = 0644, + .proc_handler = &delayacct_sysctl_handler, + }, +#endif { .ctl_name = 0 } }; |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:26:31
|
> +/* Allocate task_delay_info for all tasks without one */ > +static int alloc_delays(void) I'm sorry but this function seems to be highly horrible |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:38:55
|
Arjan van de Ven wrote: >>+/* Allocate task_delay_info for all tasks without one */ >>+static int alloc_delays(void) >> >> > >I'm sorry but this function seems to be highly horrible > > Could you be more specific? Is it the way it's coded or the design (preallocate, then assign) itself? The function needs to allocate task_delay_info structs for all tasks that might have been forked since the last time delay accounting was turned off. Either we have to count how many such tasks there are, or preallocate nr_tasks (as an upper bound) and then use as many as needed. Thanks for reviewing so quickly. -- Shailabh |
From: Arjan v. de V. <ar...@in...> - 2006-02-27 08:42:31
|
On Mon, 2006-02-27 at 03:38 -0500, Shailabh Nagar wrote: > Arjan van de Ven wrote: > > >>+/* Allocate task_delay_info for all tasks without one */ > >>+static int alloc_delays(void) > >> > >> > > > >I'm sorry but this function seems to be highly horrible > > > > > Could you be more specific ? Is it the way its coded or the design > (preallocate, then assign) > itself ? > > The function needs to allocate task_delay_info structs for all tasks > that might > have been forked since the last time delay accounting was turned off. > Either we have to count how many such tasks there are, or preallocate > nr_tasks (as an upper bound) and then use as many as needed. it generally feels really fragile, especially with the task enumeration going to RCU soon. (eg you'd lose the ability to lock out new task creation) On first sight it looks a lot better to allocate these things on demand, but I'm not sure how the sleeping-allocation would interact with the places it'd need to be called... |
From: Shailabh N. <na...@wa...> - 2006-02-27 08:59:51
|
Arjan van de Ven wrote: >On Mon, 2006-02-27 at 03:38 -0500, Shailabh Nagar wrote: > > >>Arjan van de Ven wrote: >> >> >> >>>>+/* Allocate task_delay_info for all tasks without one */ >>>>+static int alloc_delays(void) >>>> >>>> >>>> >>>> >>>I'm sorry but this function seems to be highly horrible >>> >>> >>> >>> >>Could you be more specific ? Is it the way its coded or the design >>(preallocate, then assign) >>itself ? >> >>The function needs to allocate task_delay_info structs for all tasks >>that might >>have been forked since the last time delay accounting was turned off. >>Either we have to count how many such tasks there are, or preallocate >>nr_tasks (as an upper bound) and then use as many as needed. >> >> > >it generally feels really fragile, especially with the task enumeration >going to RCU soon. (eg you'd lose the ability to lock out new task >creation) > > >On first sight it looks a lot better to allocate these things on demand, >but I'm not sure how the sleeping-allocation would interact with the >places it'd need to be called... > > Yes, that's the reason why we didn't do the on-demand allocation... the next time a task is checked could be in any of the places where the timestamping is done. Doing the allocation there (and incurring the extra cost of the check even when sysctl hasn't been used) didn't seem worthwhile, esp. when we have a point (sysctl handler) where we can catch most of the allocs needed. But if task enumeration is going to get more difficult, we'll need to keep the on-demand allocation (on next use) as a backup for tasks that weren't caught during the sysctl change. > > > |
From: Balbir S. <ba...@in...> - 2006-02-27 11:19:04
|
> But if task enumeration is going to get more difficult, we'll need to > keep the on-demand allocation (on > next use) as a backup for tasks that weren't caught during the sysctl > change. > One possible issue with on-demand allocation is that under heavy load allocation causes IO to happen and when we try to timestamp that IO, we do not have a delays structure to do so. Balbir |