From: Shailabh N. <na...@wa...> - 2006-04-22 02:16:28
|
Here are the delay accounting patches again. I'm not using the earlier email thread because the code has been refactored a bit.

The previous posting of these patches
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/1776.html
elicited several review comments from Andrew Morton, all of which have been addressed.

The other main thread of the comments was whether other accounting stakeholders would be ok with this interface. Towards this end, I'd posted an overview of what the other packages do (which didn't seem to make the archives) and some of the stakeholders responded. I'll repost the analysis as a reply to this post.

Meanwhile, here's the list of the stakeholders identified by Andrew and a summary of the status of their comments.

1. CSA accounting/PAGG/JOB: Jay Lan <jl...@en...>
   Raised several points
   http://www.uwsg.indiana.edu/hypermail/linux/kernel/0604.1/0397.html
   all of which have been addressed in this set of patches.

2. per-process IO statistics: Levent Serinol <lse...@gm...>
   No response. I'd ascertained that its needs are a subset of CSA.

3. per-cpu time statistics: Erich Focht <ef...@es...>
   No response. I'd ascertained that its needs can be met by the taskstats interface whenever these statistics are submitted for inclusion.

4. Microstate accounting: Peter Chubb <pe...@ge...>
   Mentioned overlap of patches with delay accounting
   http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/2286.html
   and also that a /proc interface was preferable due to convenience. My position is that the netlink interface is a superset of /proc due to the former's ability to supply exit-time data.

5. ELSA: Guillaume Thouvenin <gui...@bu...>
   Confirmed that ELSA is not a direct user of a new kernel statistics interface since it is a consumer of CSA or BSD accounting's statistics.

6. pnotify: Jes Sorensen <je...@sg...> (taken over pnotify from Erik Jacobson)
   Informed over private email that a pnotify replacement is being worked on. I'd ascertained that pnotify (or its replacement) will not be concerned with exporting data to userspace or collecting any stats. That's left to the kernel module that uses pnotify to get notifications. CSA is one expected user of pnotify. Hence CSA's concerns are the only ones relevant to pnotify as well.

7. Scalable statistics counters with /proc reporting: Ravikiran G Thirumalai, Dipankar Sarma <dip...@in...>
   Confirmed these counters aren't relevant to this discussion.

--Shailabh

Series

delayacct-setup.patch
delayacct-blkio-swapin.patch
delayacct-schedstats.patch
genetlink-utils.patch
taskstats-setup.patch
delayacct-taskstats.patch
delayacct-doc.patch
delayacct-procfs.patch
|
From: Balbir S. <ba...@in...> - 2006-05-02 06:14:16
|
From: Shailabh Nagar <na...@wa...>
Cc: Jes Sorensen <je...@sg...>, Peter Chubb <pe...@ge...>, Erich Focht <ef...@es...>, Levent Serinol <lse...@gm...>, Jay Lan <jl...@en...>

Here are the delay accounting patches again. The patches are against 2.6.17-rc3.

Andrew, could you please consider them for inclusion in -mm?

The previous posting of these patches is at
http://www.ussg.iu.edu/hypermail/linux/kernel/0604.2/1831.html

Here's the list of the stakeholders identified by Andrew and a summary of the status of their comments.

1. CSA accounting/PAGG/JOB: Jay Lan <jl...@en...>
   Raised several points
   http://www.ussg.iu.edu/hypermail/linux/kernel/0604.3/1036.html
   all of which have been addressed in this set of patches.

2. per-process IO statistics: Levent Serinol <lse...@gm...>
   No response. We have ascertained that its needs are a subset of CSA.

3. per-cpu time statistics: Erich Focht <ef...@es...>
   No response. We have ascertained that its needs can be met by the taskstats interface whenever these statistics are submitted for inclusion.

4. Microstate accounting: Peter Chubb <pe...@ge...>
   Mentioned overlap of patches with delay accounting
   http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/2286.html
   and also that a /proc interface was preferable due to convenience. Our position is that the netlink interface is a superset of /proc due to the former's ability to supply exit-time data.

5. ELSA: Guillaume Thouvenin <gui...@bu...>
   Confirmed that ELSA is not a direct user of a new kernel statistics interface since it is a consumer of CSA or BSD accounting's statistics.

6. pnotify: Jes Sorensen <je...@sg...> (taken over pnotify from Erik Jacobson)
   Informed over private email that a pnotify replacement is being worked on. We have ascertained that pnotify (or its replacement) will not be concerned with exporting data to userspace or collecting any stats. That's left to the kernel module that uses pnotify to get notifications. CSA is one expected user of pnotify. Hence CSA's concerns are the only ones relevant to pnotify as well.

7. Scalable statistics counters with /proc reporting: Ravikiran G Thirumalai, Dipankar Sarma <dip...@in...>
   Confirmed these counters aren't relevant to this discussion.

Balbir

Series

delayacct-setup.patch
delayacct-blkio-swapin.patch
delayacct-schedstats.patch
genetlink-utils.patch
taskstats-setup.patch
delayacct-taskstats.patch
delayacct-doc.patch
delayacct-procfs.patch

-- <--- Balbir
|
From: Shailabh N. <na...@wa...> - 2006-04-22 02:23:36
|
Changelog Fixes comments by akpm - unnecessary initialization of delayacct_on - use kmem_cache_zalloc - redundant check in __delayacct_tsk_exit delayacct-setup.patch Initialization code related to collection of per-task "delay" statistics which measure how long it had to wait for cpu, sync block io, swapping etc. The collection of statistics and the interface are in other patches. This patch sets up the data structures and allows the statistics collection to be disabled through a kernel boot parameter. Signed-off-by: Shailabh Nagar <na...@wa...> Documentation/kernel-parameters.txt | 2 include/linux/delayacct.h | 69 ++++++++++++++++++++++++++++ include/linux/sched.h | 21 ++++++++ include/linux/time.h | 10 ++++ init/Kconfig | 13 +++++ init/main.c | 2 kernel/Makefile | 1 kernel/delayacct.c | 87 ++++++++++++++++++++++++++++++++++++ kernel/exit.c | 3 + kernel/fork.c | 2 10 files changed, 210 insertions(+) Index: linux-2.6.17-rc1/Documentation/kernel-parameters.txt =================================================================== --- linux-2.6.17-rc1.orig/Documentation/kernel-parameters.txt 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/Documentation/kernel-parameters.txt 2006-04-14 14:59:21.000000000 -0400 @@ -430,6 +430,8 @@ running once the system is up. Format: <area>[,<node>] See also Documentation/networking/decnet.txt. + delayacct [KNL] Enable per-task delay accounting + devfs= [DEVFS] See Documentation/filesystems/devfs/boot-options. 
Index: linux-2.6.17-rc1/kernel/Makefile =================================================================== --- linux-2.6.17-rc1.orig/kernel/Makefile 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/kernel/Makefile 2006-04-21 19:39:28.000000000 -0400 @@ -38,6 +38,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RELAY) += relay.o +obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: linux-2.6.17-rc1/include/linux/delayacct.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400 @@ -0,0 +1,69 @@ +/* delayacct.h - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See + * the GNU General Public License for more details. 
+ * + */ + +#ifndef _LINUX_TASKDELAYS_H +#define _LINUX_TASKDELAYS_H + +#include <linux/sched.h> + +#ifdef CONFIG_TASK_DELAY_ACCT + +extern int delayacct_on; /* Delay accounting turned on/off */ +extern kmem_cache_t *delayacct_cache; +extern void delayacct_init(void); +extern void __delayacct_tsk_init(struct task_struct *); +extern void __delayacct_tsk_exit(struct task_struct *); + +static inline void delayacct_set_flag(int flag) +{ + if (current->delays) + current->delays->flags |= flag; +} + +static inline void delayacct_clear_flag(int flag) +{ + if (current->delays) + current->delays->flags &= ~flag; +} + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ + /* reinitialize in case parent's non-null pointer was dup'ed*/ + tsk->delays = NULL; + if (unlikely(delayacct_on)) + __delayacct_tsk_init(tsk); +} + +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{ + if (tsk->delays) + __delayacct_tsk_exit(tsk); +} + +#else +static inline void delayacct_set_flag(int flag) +{} +static inline void delayacct_clear_flag(int flag) +{} +static inline void delayacct_init(void) +{} +static inline void delayacct_tsk_init(struct task_struct *tsk) +{} +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{} +#endif /* CONFIG_TASK_DELAY_ACCT */ + +#endif Index: linux-2.6.17-rc1/include/linux/sched.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 19:39:29.000000000 -0400 @@ -536,6 +536,24 @@ struct sched_info { extern struct file_operations proc_schedstat_operations; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +struct task_delay_info { + spinlock_t lock; + unsigned int flags; /* Private per-task flags */ + + /* For each stat XXX, add following, aligned appropriately + * + * struct timespec XXX_start, XXX_end; + * u64 XXX_delay; + * u32 XXX_count; + * + * Atomicity of updates to XXX_delay, 
XXX_count protected by + * single lock above (split into XXX_lock if contention is an issue). + */ +}; +#endif + + enum idle_type { SCHED_IDLE, @@ -882,6 +900,9 @@ struct task_struct { atomic_t fs_excl; /* holding fs exclusive resources */ struct rcu_head rcu; +#ifdef CONFIG_TASK_DELAY_ACCT + struct task_delay_info *delays; +#endif }; static inline pid_t process_group(struct task_struct *tsk) Index: linux-2.6.17-rc1/init/Kconfig =================================================================== --- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 19:39:28.000000000 -0400 @@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASK_DELAY_ACCT + bool "Enable per-task delay accounting (EXPERIMENTAL)" + help + Collect information on time spent by a task waiting for system + resources like cpu, synchronous block I/O completion and swapping + in pages. Such statistics can help in setting a task's priorities + relative to other tasks for cpu, io, rss limits etc. + + Unlike BSD process accounting, this information is available + continuously during the lifetime of a task. + + Say N if unsure. 
+ config SYSCTL bool "Sysctl support" ---help--- Index: linux-2.6.17-rc1/init/main.c =================================================================== --- linux-2.6.17-rc1.orig/init/main.c 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/init/main.c 2006-04-21 19:39:28.000000000 -0400 @@ -47,6 +47,7 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> +#include <linux/delayacct.h> #include <asm/io.h> #include <asm/bugs.h> @@ -541,6 +542,7 @@ asmlinkage void __init start_kernel(void proc_root_init(); #endif cpuset_init(); + delayacct_init(); check_bugs(); Index: linux-2.6.17-rc1/kernel/delayacct.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 19:39:29.000000000 -0400 @@ -0,0 +1,87 @@ +/* delayacct.c - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See + * the GNU General Public License for more details. 
+ */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/time.h> +#include <linux/sysctl.h> +#include <linux/delayacct.h> + +int delayacct_on __read_mostly; /* Delay accounting turned on/off */ +kmem_cache_t *delayacct_cache; + +static int __init delayacct_setup_enable(char *str) +{ + delayacct_on = 1; + return 1; +} +__setup("delayacct", delayacct_setup_enable); + +void delayacct_init(void) +{ + delayacct_cache = kmem_cache_create("delayacct_cache", + sizeof(struct task_delay_info), + 0, + SLAB_PANIC, + NULL, NULL); + delayacct_tsk_init(&init_task); +} + +void __delayacct_tsk_init(struct task_struct *tsk) +{ + tsk->delays = kmem_cache_zalloc(delayacct_cache, SLAB_KERNEL); + if (tsk->delays) + spin_lock_init(&tsk->delays->lock); +} + +void __delayacct_tsk_exit(struct task_struct *tsk) +{ + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; +} + +/* + * Start accounting for a delay statistic using + * its starting timestamp (@start) + */ + +static inline void delayacct_start(struct timespec *start) +{ + do_posix_clock_monotonic_gettime(start); +} + +/* + * Finish delay accounting for a statistic using + * its timestamps (@start, @end), accumulator (@total) and @count + */ + +static inline void delayacct_end(struct timespec *start, struct timespec *end, + u64 *total, u32 *count) +{ + struct timespec ts; + s64 ns; + + do_posix_clock_monotonic_gettime(end); + timespec_sub(&ts, start, end); + ns = timespec_to_ns(&ts); + if (ns < 0) + return; + + spin_lock(&current->delays->lock); + *total += ns; + (*count)++; + spin_unlock(&current->delays->lock); +} + Index: linux-2.6.17-rc1/kernel/fork.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/fork.c 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/kernel/fork.c 2006-04-14 14:59:21.000000000 -0400 @@ -44,6 +44,7 @@ #include <linux/rmap.h> #include <linux/acct.h> #include <linux/cn_proc.h> +#include <linux/delayacct.h> #include <asm/pgtable.h> 
#include <asm/pgalloc.h> @@ -989,6 +990,7 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; + delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; Index: linux-2.6.17-rc1/kernel/exit.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/exit.c 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/kernel/exit.c 2006-04-21 19:39:28.000000000 -0400 @@ -34,6 +34,7 @@ #include <linux/mutex.h> #include <linux/futex.h> #include <linux/compat.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -893,6 +894,7 @@ fastcall NORET_TYPE void do_exit(long co preempt_count()); acct_update_integrals(tsk); + if (tsk->mm) { update_hiwater_rss(tsk->mm); update_hiwater_vm(tsk->mm); @@ -909,6 +911,7 @@ fastcall NORET_TYPE void do_exit(long co if (unlikely(tsk->compat_robust_list)) compat_exit_robust_list(tsk); #endif + delayacct_tsk_exit(tsk); exit_mm(tsk); exit_sem(tsk); Index: linux-2.6.17-rc1/include/linux/time.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/time.h 2006-04-13 10:55:54.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/time.h 2006-04-14 14:59:21.000000000 -0400 @@ -68,6 +68,16 @@ extern unsigned long mktime(const unsign extern void set_normalized_timespec(struct timespec *ts, time_t sec, long nsec); /* + * sub = end - start, in normalized form + */ +static inline void timespec_sub(struct timespec *start, struct timespec *end, + struct timespec *sub) +{ + set_normalized_timespec(sub, end->tv_sec - start->tv_sec, + end->tv_nsec - start->tv_nsec); +} + +/* * Returns true if the timespec is norm, false if denorm: */ #define timespec_valid(ts) \ |
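The start/end bookkeeping in delayacct.c can be tried out in userspace. What follows is only a minimal sketch, assuming clock_gettime(CLOCK_MONOTONIC) as a stand-in for do_posix_clock_monotonic_gettime(); struct delay_stat, ts_sub(), ts_to_ns(), delay_start(), delay_record() and delay_end() are hypothetical names and not part of the patch, and the spinlock is left out. ts_sub() takes its arguments in the same (start, end, sub) order as the timespec_sub() helper added to time.h.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical userspace stand-in for one XXX_start/XXX_end/XXX_delay/XXX_count group */
struct delay_stat {
	struct timespec start, end;
	uint64_t total_ns;	/* accumulated delay in nanoseconds */
	uint32_t count;		/* number of waits accounted */
};

/* sub = end - start, normalized so that 0 <= tv_nsec < NSEC_PER_SEC */
static void ts_sub(const struct timespec *start, const struct timespec *end,
		   struct timespec *sub)
{
	assert(start && end && sub);
	sub->tv_sec = end->tv_sec - start->tv_sec;
	sub->tv_nsec = end->tv_nsec - start->tv_nsec;
	if (sub->tv_nsec < 0) {
		sub->tv_sec--;
		sub->tv_nsec += NSEC_PER_SEC;
	}
}

static int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/* Stamp the beginning of a wait */
static void delay_start(struct delay_stat *s)
{
	clock_gettime(CLOCK_MONOTONIC, &s->start);
}

/* Fold the interval [start, end] into the totals; negative intervals are dropped */
static void delay_record(struct delay_stat *s)
{
	struct timespec d;
	int64_t ns;

	ts_sub(&s->start, &s->end, &d);
	ns = ts_to_ns(&d);
	if (ns < 0)
		return;
	s->total_ns += (uint64_t)ns;
	s->count++;
}

/* Stamp the end of a wait and account for it */
static void delay_end(struct delay_stat *s)
{
	clock_gettime(CLOCK_MONOTONIC, &s->end);
	delay_record(s);
}
```

A caller would bracket the blocking operation with delay_start() and delay_end(), much as the later patches in the series bracket schedule() in io_schedule().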
From: Randy.Dunlap <rd...@xe...> - 2006-04-24 02:00:12
|
On Fri, 21 Apr 2006 22:23:25 -0400 Shailabh Nagar wrote: > Index: linux-2.6.17-rc1/include/linux/delayacct.h > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400 > @@ -0,0 +1,69 @@ > +/* delayacct.h - per-task delay accounting > + */ > + > +#ifndef _LINUX_TASKDELAYS_H > +#define _LINUX_TASKDELAYS_H Probably _LINUX_DELAYACCT_H. Or if I add linux/taskdelays.h, what #include guard should I use? --- ~Randy |
From: Shailabh N. <na...@wa...> - 2006-04-24 17:26:33
|
Randy.Dunlap wrote: >On Fri, 21 Apr 2006 22:23:25 -0400 Shailabh Nagar wrote: > > > >>Index: linux-2.6.17-rc1/include/linux/delayacct.h >>=================================================================== >>--- /dev/null 1970-01-01 00:00:00.000000000 +0000 >>+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400 >>@@ -0,0 +1,69 @@ >>+/* delayacct.h - per-task delay accounting >>+ */ >>+ >>+#ifndef _LINUX_TASKDELAYS_H >>+#define _LINUX_TASKDELAYS_H >> >> > >Probably _LINUX_DELAYACCT_H. > > Yup. Hangover from old name...will fix. >Or if I add linux/taskdelays.h, what #include guard should I use? > >--- >~Randy > > |
From: Shailabh N. <na...@wa...> - 2006-04-22 02:29:48
|
Changelog Fixes comments by akpm - avoid creating new per-process flag PF_SWAPIN delayacct-blkio-swapin.patch Collect per-task block I/O delay statistics. Unlike earlier iterations of the delay accounting patches, now delays are only collected for the actual I/O waits rather than try and cover the delays seen in I/O submission paths. Account separately for block I/O delays incurred as a result of swapin page faults whose frequency can be affected by the task/process' rss limit. Hence swapin delays can act as feedback for rss limit changes independent of I/O priority changes. Signed-off-by: Shailabh Nagar <na...@wa...> include/linux/delayacct.h | 25 +++++++++++++++++++++++++ include/linux/sched.h | 6 ++++++ kernel/delayacct.c | 19 +++++++++++++++++++ kernel/sched.c | 5 +++++ mm/memory.c | 4 ++++ 5 files changed, 59 insertions(+) Index: linux-2.6.17-rc1/include/linux/delayacct.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/delayacct.h 2006-04-21 22:27:18.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 22:27:19.000000000 -0400 @@ -19,6 +19,13 @@ #include <linux/sched.h> +/* + * Per-task flags relevant to delay accounting + * maintained privately to avoid exhausting similar flags in sched.h:PF_* + * Used to set current->delays->flags + */ +#define DELAYACCT_PF_SWAPIN 0x00000001 /* I am doing a swapin */ + #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ @@ -26,6 +33,8 @@ extern kmem_cache_t *delayacct_cache; extern void delayacct_init(void); extern void __delayacct_tsk_init(struct task_struct *); extern void __delayacct_tsk_exit(struct task_struct *); +extern void __delayacct_blkio_start(void); +extern void __delayacct_blkio_end(void); static inline void delayacct_set_flag(int flag) { @@ -53,6 +62,18 @@ static inline void delayacct_tsk_exit(st __delayacct_tsk_exit(tsk); } +static inline void delayacct_blkio_start(void) +{ + if 
(current->delays) + __delayacct_blkio_start(); +} + +static inline void delayacct_blkio_end(void) +{ + if (current->delays) + __delayacct_blkio_end(); +} + #else static inline void delayacct_set_flag(int flag) {} @@ -64,6 +85,10 @@ static inline void delayacct_tsk_init(st {} static inline void delayacct_tsk_exit(struct task_struct *tsk) {} +static inline void delayacct_blkio_start(void) +{} +static inline void delayacct_blkio_end(void) +{} #endif /* CONFIG_TASK_DELAY_ACCT */ #endif Index: linux-2.6.17-rc1/kernel/delayacct.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/delayacct.c 2006-04-21 22:27:18.000000000 -0400 +++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 22:27:19.000000000 -0400 @@ -85,3 +85,22 @@ static inline void delayacct_end(struct spin_unlock(&current->delays->lock); } +void __delayacct_blkio_start(void) +{ + delayacct_start(&current->delays->blkio_start); +} + +void __delayacct_blkio_end(void) +{ + if (current->delays->flags & DELAYACCT_PF_SWAPIN) + /* Swapin block I/O */ + delayacct_end(&current->delays->blkio_start, + &current->delays->blkio_end, + &current->delays->swapin_delay, + &current->delays->swapin_count); + else /* Other block I/O */ + delayacct_end(&current->delays->blkio_start, + &current->delays->blkio_end, + &current->delays->blkio_delay, + &current->delays->blkio_count); +} Index: linux-2.6.17-rc1/include/linux/sched.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-21 22:27:18.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 22:27:19.000000000 -0400 @@ -550,6 +550,12 @@ struct task_delay_info { * Atomicity of updates to XXX_delay, XXX_count protected by * single lock above (split into XXX_lock if contention is an issue). 
*/ + + struct timespec blkio_start, blkio_end; /* Shared by blkio, swapin */ + u64 blkio_delay; /* wait for sync block io completion */ + u64 swapin_delay; /* wait for swapin block io completion */ + u32 blkio_count; + u32 swapin_count; }; #endif Index: linux-2.6.17-rc1/kernel/sched.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/sched.c 2006-04-21 22:27:18.000000000 -0400 +++ linux-2.6.17-rc1/kernel/sched.c 2006-04-21 22:27:19.000000000 -0400 @@ -50,6 +50,7 @@ #include <linux/times.h> #include <linux/acct.h> #include <linux/kprobes.h> +#include <linux/delayacct.h> #include <asm/tlb.h> #include <asm/unistd.h> @@ -4144,9 +4145,11 @@ void __sched io_schedule(void) { struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id()); + delayacct_blkio_start(); atomic_inc(&rq->nr_iowait); schedule(); atomic_dec(&rq->nr_iowait); + delayacct_blkio_end(); } EXPORT_SYMBOL(io_schedule); @@ -4156,9 +4159,11 @@ long __sched io_schedule_timeout(long ti struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id()); long ret; + delayacct_blkio_start(); atomic_inc(&rq->nr_iowait); ret = schedule_timeout(timeout); atomic_dec(&rq->nr_iowait); + delayacct_blkio_end(); return ret; } Index: linux-2.6.17-rc1/mm/memory.c =================================================================== --- linux-2.6.17-rc1.orig/mm/memory.c 2006-04-21 22:27:18.000000000 -0400 +++ linux-2.6.17-rc1/mm/memory.c 2006-04-21 22:27:19.000000000 -0400 @@ -48,6 +48,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/init.h> +#include <linux/delayacct.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -1880,6 +1881,7 @@ static int do_swap_page(struct mm_struct entry = pte_to_swp_entry(orig_pte); again: + delayacct_set_flag(DELAYACCT_PF_SWAPIN); page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); @@ -1892,6 +1894,7 @@ again: page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if 
(likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); goto unlock; } @@ -1903,6 +1906,7 @@ again: mark_page_accessed(page); lock_page(page); + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); if (!PageSwapCache(page)) { /* Page migration has occured */ unlock_page(page); |
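The way the private per-task flag steers a completed wait into either the swapin or the plain block I/O bucket can be sketched in isolation. This is an illustrative userspace reduction only: DELAYACCT_PF_SWAPIN and the field names come from the patch, while struct task_delays and account_blkio_wait() are hypothetical, and the elapsed time is passed in directly instead of being measured.

```c
#include <assert.h>
#include <stdint.h>

#define DELAYACCT_PF_SWAPIN 0x00000001	/* flag name and value taken from the patch */

/* Hypothetical subset of task_delay_info, without the lock or timestamps */
struct task_delays {
	unsigned int flags;
	uint64_t blkio_delay, swapin_delay;
	uint32_t blkio_count, swapin_count;
};

/* Charge one finished block I/O wait of @ns nanoseconds to the right bucket */
static void account_blkio_wait(struct task_delays *d, uint64_t ns)
{
	assert(d);
	if (d->flags & DELAYACCT_PF_SWAPIN) {
		/* wait happened while servicing a swapin page fault */
		d->swapin_delay += ns;
		d->swapin_count++;
	} else {
		/* any other synchronous block I/O wait */
		d->blkio_delay += ns;
		d->blkio_count++;
	}
}
```

Because both buckets share one pair of timestamps, the flag has to be set before the wait begins and cleared after it ends, which is what the do_swap_page() hunks arrange.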
From: Shailabh N. <na...@wa...> - 2006-04-22 02:33:54
|
Changelog Fixes comments by akpm - comments about locking used in rq_sched_info_arrive/depart No fix needed/possible - redundant extern declaration of delayacct_on in sched.h suggested location (delayacct.h) cannot be used as it includes sched.h extern declaration moved to where it's needed - move unlikely declaration inside sched_info_on Function only returns constants. Cannot be done. - removal of #if defined in sched_fork (Dave Hansen) Refactoring suggested does not work if only SCHEDSTATS is configured delayacct-schedstats.patch Make the task-related schedstats functions callable by delay accounting even if schedstats collection isn't turned on. This removes the dependency of delay accounting on schedstats. Signed-off-by: Chandra Seetharaman <sek...@us...> Signed-off-by: Shailabh Nagar <na...@wa...> include/linux/sched.h | 21 +++++++++++++++--- kernel/sched.c | 56 ++++++++++++++++++++++++++++++++++---------------- 2 files changed, 56 insertions(+), 21 deletions(-) Index: linux-2.6.17-rc1/include/linux/sched.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-21 20:29:13.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 20:29:15.000000000 -0400 @@ -521,7 +521,7 @@ typedef struct prio_array prio_array_t; struct backing_dev_info; struct reclaim_state; -#ifdef CONFIG_SCHEDSTATS +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info { /* cumulative counters */ unsigned long cpu_time, /* time spent on the cpu */ @@ -532,9 +532,11 @@ struct sched_info { unsigned long last_arrival, /* when we last ran on a cpu */ last_queued; /* when we were last queued to run */ }; +#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */ +#ifdef CONFIG_SCHEDSTATS extern struct file_operations proc_schedstat_operations; -#endif +#endif /* CONFIG_SCHEDSTATS */ #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info { @@ -557,8 +559,19 @@ struct 
task_delay_info { u32 blkio_count; u32 swapin_count; }; -#endif +#endif /* CONFIG_TASK_DELAY_ACCT */ +static inline int sched_info_on(void) +{ +#ifdef CONFIG_SCHEDSTATS + return 1; +#elif defined(CONFIG_TASK_DELAY_ACCT) + extern int delayacct_on; + return delayacct_on; +#else + return 0; +#endif +} enum idle_type { @@ -744,7 +757,7 @@ struct task_struct { cpumask_t cpus_allowed; unsigned int time_slice, first_time_slice; -#ifdef CONFIG_SCHEDSTATS +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif Index: linux-2.6.17-rc1/kernel/sched.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/sched.c 2006-04-21 20:29:13.000000000 -0400 +++ linux-2.6.17-rc1/kernel/sched.c 2006-04-21 20:29:15.000000000 -0400 @@ -469,9 +469,32 @@ struct file_operations proc_schedstat_op .release = single_release, }; +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void rq_sched_info_arrive(struct runqueue *rq, + unsigned long diff) +{ + if (rq) { + rq->rq_sched_info.run_delay += diff; + rq->rq_sched_info.pcnt++; + } +} + +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void rq_sched_info_depart(struct runqueue *rq, + unsigned long diff) +{ + if (rq) + rq->rq_sched_info.cpu_time += diff; +} # define schedstat_inc(rq, field) do { (rq)->field++; } while (0) # define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) #else /* !CONFIG_SCHEDSTATS */ +static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {} +static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {} # define schedstat_inc(rq, field) do { } while (0) # define schedstat_add(rq, field, amt) do { } while (0) #endif @@ -491,7 +514,7 @@ static inline runqueue_t *this_rq_lock(v return rq; } -#ifdef CONFIG_SCHEDSTATS +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) /* * Called when 
a process is dequeued from the active array and given * the cpu. We should note that with the exception of interactive @@ -520,7 +543,6 @@ static inline void sched_info_dequeued(t static void sched_info_arrive(task_t *t) { unsigned long now = jiffies, diff = 0; - struct runqueue *rq = task_rq(t); if (t->sched_info.last_queued) diff = now - t->sched_info.last_queued; @@ -529,11 +551,7 @@ static void sched_info_arrive(task_t *t) t->sched_info.last_arrival = now; t->sched_info.pcnt++; - if (!rq) - return; - - rq->rq_sched_info.run_delay += diff; - rq->rq_sched_info.pcnt++; + rq_sched_info_arrive(task_rq(t), diff); } /* @@ -553,8 +571,9 @@ static void sched_info_arrive(task_t *t) */ static inline void sched_info_queued(task_t *t) { - if (!t->sched_info.last_queued) - t->sched_info.last_queued = jiffies; + if (unlikely(sched_info_on())) + if (!t->sched_info.last_queued) + t->sched_info.last_queued = jiffies; } /* @@ -563,13 +582,10 @@ static inline void sched_info_queued(tas */ static inline void sched_info_depart(task_t *t) { - struct runqueue *rq = task_rq(t); unsigned long diff = jiffies - t->sched_info.last_arrival; t->sched_info.cpu_time += diff; - - if (rq) - rq->rq_sched_info.cpu_time += diff; + rq_sched_info_depart(task_rq(t), diff); } /* @@ -577,7 +593,7 @@ static inline void sched_info_depart(tas * their time slice. (This may also be called when switching to or from * the idle task.) We are only called when prev != next. 
*/ -static inline void sched_info_switch(task_t *prev, task_t *next) +static inline void __sched_info_switch(task_t *prev, task_t *next) { struct runqueue *rq = task_rq(prev); @@ -592,10 +608,15 @@ static inline void sched_info_switch(tas if (next != rq->idle) sched_info_arrive(next); } +static inline void sched_info_switch(task_t *prev, task_t *next) +{ + if (unlikely(sched_info_on())) + __sched_info_switch(prev, next); +} #else #define sched_info_queued(t) do { } while (0) #define sched_info_switch(t, next) do { } while (0) -#endif /* CONFIG_SCHEDSTATS */ +#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ /* * Adding/removing a task to/from a priority array: @@ -1351,8 +1372,9 @@ void fastcall sched_fork(task_t *p, int p->state = TASK_RUNNING; INIT_LIST_HEAD(&p->run_list); p->array = NULL; -#ifdef CONFIG_SCHEDSTATS - memset(&p->sched_info, 0, sizeof(p->sched_info)); +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) + if (unlikely(sched_info_on())) + memset(&p->sched_info, 0, sizeof(p->sched_info)); #endif #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) p->oncpu = 0; |
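The sched_info_on() gate mixes a compile-time choice with a run-time switch: a constant when schedstats is built in, the value of delayacct_on when only delay accounting is built in. A small userspace sketch of the same shape, assuming DEMO_SCHEDSTATS and DEMO_TASK_DELAY_ACCT as hypothetical stand-ins for the CONFIG_* options and struct demo_task as a stand-in for the task's sched_info:

```c
#include <assert.h>

/* Hypothetical build switch; define DEMO_SCHEDSTATS instead to force the gate on */
#define DEMO_TASK_DELAY_ACCT 1

static int delayacct_on;	/* run-time switch, set from the boot parameter in the patch */

/* Constant 1 or 0 in two of the three configurations, run-time value in the third */
static inline int sched_info_on(void)
{
#if defined(DEMO_SCHEDSTATS)
	return 1;
#elif defined(DEMO_TASK_DELAY_ACCT)
	return delayacct_on;
#else
	return 0;
#endif
}

/* Hypothetical reduction of the per-task sched_info */
struct demo_task {
	unsigned long last_queued;
};

/* Record the first time a task is queued, but only when collection is on */
static void sched_info_queued(struct demo_task *t, unsigned long now)
{
	assert(t);
	if (sched_info_on())
		if (!t->last_queued)
			t->last_queued = now;
}
```

When the gate compiles down to a constant, the compiler can drop the branch entirely, so the wrapper costs nothing in the schedstats-only and fully disabled builds.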
From: Shailabh N. <na...@wa...> - 2006-04-22 02:35:53
|
genetlink-utils.patch Two utilities for simplifying usage of NETLINK_GENERIC interface. Signed-off-by: Balbir Singh <ba...@in...> Signed-off-by: Shailabh Nagar <na...@wa...> include/net/genetlink.h | 20 ++++++++++++++++++++ 1 files changed, 20 insertions(+) Index: linux-2.6.17-rc1/include/net/genetlink.h =================================================================== --- linux-2.6.17-rc1.orig/include/net/genetlink.h 2006-04-21 19:39:29.000000000 -0400 +++ linux-2.6.17-rc1/include/net/genetlink.h 2006-04-21 20:29:19.000000000 -0400 @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct return nlmsg_unicast(genl_sock, skb, pid); } +/** + * genlmsg_data - head of message payload + * @gnlh: genetlink message header + */ +static inline void *genlmsg_data(const struct genlmsghdr *gnlh) +{ + return ((unsigned char *) gnlh + GENL_HDRLEN); +} + +/** + * genlmsg_len - length of message payload + * @gnlh: genetlink message header + */ +static inline int genlmsg_len(const struct genlmsghdr *gnlh) +{ + struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh - + NLMSG_HDRLEN); + return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN); +} + #endif /* __NET_GENERIC_NETLINK_H */ |
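The two helpers are plain header arithmetic: the payload starts one aligned genetlink header past gnlh, and its length is the total netlink message length minus both headers. A self-contained sketch of that arithmetic, assuming hypothetical demo_* structures that mirror the usual 16-byte netlink and 4-byte generic netlink header layouts (the real definitions live in the kernel headers and are not reproduced here):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the netlink message header */
struct demo_nlmsghdr {
	uint32_t nlmsg_len;	/* length of message including this header */
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/* Hypothetical mirror of the generic netlink header */
struct demo_genlmsghdr {
	uint8_t cmd;
	uint8_t version;
	uint16_t reserved;
};

/* Headers are padded to 4-byte boundaries */
#define DEMO_ALIGN(len)		(((len) + 3U) & ~3U)
#define DEMO_NLMSG_HDRLEN	DEMO_ALIGN(sizeof(struct demo_nlmsghdr))
#define DEMO_GENL_HDRLEN	DEMO_ALIGN(sizeof(struct demo_genlmsghdr))

/* Head of the payload: one aligned genetlink header past gnlh */
static void *demo_genlmsg_data(const struct demo_genlmsghdr *gnlh)
{
	assert(gnlh);
	return (unsigned char *)gnlh + DEMO_GENL_HDRLEN;
}

/* Payload length: step back to the netlink header, then subtract both headers */
static int demo_genlmsg_len(const struct demo_genlmsghdr *gnlh)
{
	const struct demo_nlmsghdr *nlh = (const struct demo_nlmsghdr *)
		((const unsigned char *)gnlh - DEMO_NLMSG_HDRLEN);

	return (int)(nlh->nlmsg_len - DEMO_GENL_HDRLEN - DEMO_NLMSG_HDRLEN);
}
```

Note that the length helper only gives a meaningful answer when gnlh really sits directly behind a netlink header in the same buffer, which is the situation for a received generic netlink message.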
From: Shailabh N. <na...@wa...> - 2006-04-22 02:37:57
Changelog

Fixes comments by jl...@en...
- separate out taskstats interface from delay accounting completely
  including separate documentation
- permit different accounting subsystems to fill in parts of common
  structure separately before common taskstats code sends it out on
  genetlink
- send common structure to userspace after update_hiwater_rss and
  before exit_mm in do_exit

Fixes comments by akpm
- comment to indicate locking used for taskstats struct
- whitespace issues
- unnecessary use of constant taskstats_version
- uninline fill_pid(), fill_tgid()
- unnecessary cast to pid_t in taskstats_send_stats()
- too early evaluation of thread_group_empty() in taskstats_exit_pid
- returning -EFAULT on genl_register_family failure in taskstats_init
- comment for late_initcall of taskstats_init

No fix needed
- moving kmem_cache_free of tsk->delays outside the exit mutex
    (mutex shifted and tsk->delays freeing being done elsewhere now)
- __delayacct_add_tsk returning -EINVAL if delay accounting isn't enabled
    user should know that no values can be returned
    returning zero would be misleading
- combining fill_pid(), fill_tgid() into a common function
    combined code convoluted and less readable

taskstats-setup.patch

Create a "taskstats" interface based on generic netlink
(NETLINK_GENERIC family), for getting statistics of tasks and thread
groups during their lifetime and when they exit. The interface is
intended for use by multiple accounting packages though it is being
created in the context of delay accounting.

This patch creates the interface without populating the fields of the
data that is sent to the user in response to a command or upon the exit
of a task. Each accounting package interested in using taskstats has to
provide an additional patch to add its stats to the common structure.
Signed-off-by: Shailabh Nagar <na...@us...> Signed-off-by: Balbir Singh <ba...@in...> Documentation/accounting/taskstats.txt | 146 +++++++++++++++ include/linux/taskstats.h | 85 ++++++++ include/linux/taskstats_kern.h | 55 +++++ init/Kconfig | 15 + init/main.c | 2 kernel/Makefile | 1 kernel/exit.c | 7 kernel/taskstats.c | 321 +++++++++++++++++++++++++++++++++ 8 files changed, 629 insertions(+), 3 deletions(-) Index: linux-2.6.17-rc1/include/linux/taskstats.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/include/linux/taskstats.h 2006-04-21 20:31:11.000000000 -0400 @@ -0,0 +1,85 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * + * To add new fields + * a) bump up TASKSTATS_VERSION + * b) add comment indicating new version number at end of struct + * c) add new fields after version comment; maintain 64-bit alignment + */ + +#define TASKSTATS_VERSION 1 + +struct taskstats { + + /* Version 1 */ + + int filler_avoids_empty_struct_warnings; +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. 
New commands should only be inserted at the enum's end + * prior to __TASKSTATS_CMD_MAX + */ + +enum { + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ + TASKSTATS_CMD_GET, /* user->kernel request/get-response */ + TASKSTATS_CMD_NEW, /* kernel->user event */ + __TASKSTATS_CMD_MAX, +}; + +#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1) + +enum { + TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */ + TASKSTATS_TYPE_PID, /* Process id */ + TASKSTATS_TYPE_TGID, /* Thread group id */ + TASKSTATS_TYPE_STATS, /* taskstats structure */ + TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */ + TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */ + __TASKSTATS_TYPE_MAX, +}; + +#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1) + +enum { + TASKSTATS_CMD_ATTR_UNSPEC = 0, + TASKSTATS_CMD_ATTR_PID, + TASKSTATS_CMD_ATTR_TGID, + __TASKSTATS_CMD_ATTR_MAX, +}; + +#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1) + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#endif /* _LINUX_TASKSTATS_H */ Index: linux-2.6.17-rc1/init/Kconfig =================================================================== --- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-21 19:39:28.000000000 -0400 +++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 20:29:22.000000000 -0400 @@ -150,6 +150,18 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + default n + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. Like BSD accounting, they are sent to user + space on task exit. + + Say N if unsure. 
+ config TASK_DELAY_ACCT bool "Enable per-task delay accounting (EXPERIMENTAL)" help @@ -158,9 +170,6 @@ config TASK_DELAY_ACCT in pages. Such statistics can help in setting a task's priorities relative to other tasks for cpu, io, rss limits etc. - Unlike BSD process accounting, this information is available - continuously during the lifetime of a task. - Say N if unsure. config SYSCTL Index: linux-2.6.17-rc1/kernel/Makefile =================================================================== --- linux-2.6.17-rc1.orig/kernel/Makefile 2006-04-21 19:39:28.000000000 -0400 +++ linux-2.6.17-rc1/kernel/Makefile 2006-04-21 20:29:22.000000000 -0400 @@ -39,6 +39,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o +obj-$(CONFIG_TASKSTATS) += taskstats.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: linux-2.6.17-rc1/kernel/taskstats.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/kernel/taskstats.c 2006-04-21 20:29:22.000000000 -0400 @@ -0,0 +1,321 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#include <linux/kernel.h> +#include <linux/taskstats_kern.h> +#include <net/genetlink.h> +#include <asm/atomic.h> + +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered = 0; +kmem_cache_t *taskstats_cache; +static DEFINE_MUTEX(taskstats_exit_mutex); + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .maxattr = TASKSTATS_CMD_ATTR_MAX, +}; + +static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = { + [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 }, + [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 }, +}; + + +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, + void **replyp, size_t size) +{ + struct sk_buff *skb; + void *reply; + + /* + * If new attributes are added, please revisit this allocation + */ + skb = nlmsg_new(size); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + family.id, 0, 0, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, 0, + cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + void *reply; + int rc; + + reply = genlmsg_data(genlhdr); + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event == TASKSTATS_MSG_MULTICAST) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + return genlmsg_unicast(skb, pid); +} + +static int fill_pid(pid_t pid, struct task_struct *pidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = 
find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + /* + * Each accounting subsystem adds calls to its functions to + * fill in relevant parts of struct taskstats as follows + * + * rc = per-task-foo(stats, tsk); + * if (rc) + * goto err; + */ + +err: + put_task_struct(tsk); + return rc; + +} + +static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + } + tsk = first; + do { + /* + * Each accounting subsystem adds calls to its functions to + * fill in relevant parts of struct taskstats as follows + * + * rc = per-task-foo(stats, tsk); + * if (rc) + * break; + */ + + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + /* + * Accounting subsystems can also add calls here if they don't + * wish to aggregate statistics for per-tgid stats + */ + + return rc; +} + +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) +{ + int rc = 0; + struct sk_buff *rep_skb; + struct taskstats stats; + void *reply; + size_t size; + struct nlattr *na; + + /* + * Size includes space for nested attributes + */ + size = nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + return rc; + + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); + rc = fill_pid(pid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); + NLA_PUT_TYPE(rep_skb, struct taskstats,
TASKSTATS_TYPE_STATS, + stats); + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); + rc = fill_tgid(tgid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, + stats); + } else { + rc = -EINVAL; + goto err; + } + + nla_nest_end(rep_skb, na); + + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); + +nla_put_failure: + return genlmsg_cancel(rep_skb, reply); +err: + nlmsg_free(rep_skb); + return rc; +} + +/* Send pid data out on exit */ +void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats, + struct taskstats *tgidstats) +{ + int rc; + struct sk_buff *rep_skb; + void *reply; + size_t size; + int is_thread_group; + struct nlattr *na; + + if (!family_registered) + return; + + mutex_lock(&taskstats_exit_mutex); + + is_thread_group = !thread_group_empty(tsk); + rc = 0; + + /* + * Size includes space for nested attributes + */ + size = nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + if (is_thread_group) + size = 2 * size; // PID + STATS + TGID + STATS + + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + goto ret; + + if (!tidstats) + goto err_skb; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, *tidstats); + nla_nest_end(rep_skb, na); + + if (!is_thread_group || !tgidstats) { + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + goto ret; + } + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, *tgidstats); + nla_nest_end(rep_skb, na); + + send_reply(rep_skb, 0, 
TASKSTATS_MSG_MULTICAST); + goto ret; + +nla_put_failure: + genlmsg_cancel(rep_skb, reply); + goto ret; +err_skb: + nlmsg_free(rep_skb); +ret: + mutex_unlock(&taskstats_exit_mutex); + return; +} + +static struct genl_ops taskstats_ops = { + .cmd = TASKSTATS_CMD_GET, + .doit = taskstats_send_stats, + .policy = taskstats_cmd_get_policy, +}; + +/* Needed early in initialization */ +void __init taskstats_init_early(void) +{ + taskstats_cache = kmem_cache_create("taskstats_cache", + sizeof(struct taskstats), + 0, SLAB_PANIC, NULL, NULL); +} + +static int __init taskstats_init(void) +{ + int rc; + + rc = genl_register_family(&family); + if (rc) + return rc; + family_registered = 1; + + if ((rc = genl_register_ops(&family, &taskstats_ops)) < 0) + goto err; + + return 0; +err: + genl_unregister_family(&family); + family_registered = 0; + return rc; +} + +/* + * late initcall ensures initialization of statistics collection + * mechanisms precedes initialization of the taskstats interface + */ +late_initcall(taskstats_init); Index: linux-2.6.17-rc1/kernel/exit.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/exit.c 2006-04-21 19:39:28.000000000 -0400 +++ linux-2.6.17-rc1/kernel/exit.c 2006-04-21 20:29:22.000000000 -0400 @@ -35,6 +35,7 @@ #include <linux/futex.h> #include <linux/compat.h> #include <linux/delayacct.h> +#include <linux/taskstats_kern.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -847,6 +848,7 @@ static void exit_notify(struct task_stru fastcall NORET_TYPE void do_exit(long code) { struct task_struct *tsk = current; + struct taskstats *tidstats, *tgidstats; int group_dead; profile_task_exit(tsk); @@ -893,6 +895,8 @@ fastcall NORET_TYPE void do_exit(long co current->comm, current->pid, preempt_count()); + taskstats_exit_alloc(&tidstats, &tgidstats); + acct_update_integrals(tsk); if (tsk->mm) { @@ -911,7 +915,10 @@ fastcall NORET_TYPE void do_exit(long co if (unlikely(tsk->compat_robust_list)) 
compat_exit_robust_list(tsk); #endif + taskstats_exit_send(tsk, tidstats, tgidstats); + taskstats_exit_free(tidstats, tgidstats); delayacct_tsk_exit(tsk); + exit_mm(tsk); exit_sem(tsk); Index: linux-2.6.17-rc1/include/linux/taskstats_kern.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/include/linux/taskstats_kern.h 2006-04-21 20:29:22.000000000 -0400 @@ -0,0 +1,55 @@ +/* taskstats_kern.h - kernel header for per-task statistics interface + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + */ + +#ifndef _LINUX_TASKSTATS_KERN_H +#define _LINUX_TASKSTATS_KERN_H + +#include <linux/taskstats.h> +#include <linux/sched.h> + +enum { + TASKSTATS_MSG_UNICAST, /* send data only to requester */ + TASKSTATS_MSG_MULTICAST, /* send data to a group */ +}; + +#ifdef CONFIG_TASKSTATS +extern kmem_cache_t *taskstats_cache; + +static inline void taskstats_exit_alloc(struct taskstats **ptidstats, + struct taskstats **ptgidstats) +{ + *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); + *ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL); +} + +static inline void taskstats_exit_free(struct taskstats *tidstats, + struct taskstats *tgidstats) +{ + if (tidstats) + kmem_cache_free(taskstats_cache, tidstats); + if (tgidstats) + kmem_cache_free(taskstats_cache, tgidstats); +} + +extern void taskstats_exit_send(struct task_struct *, struct taskstats *, + struct taskstats *); +extern void taskstats_init_early(void); + +#else +static inline void taskstats_exit_alloc(struct taskstats **ptidstats, + struct taskstats **ptgidstats) +{} +static inline void taskstats_exit_free(struct taskstats *ptidstats, + struct taskstats *ptgidstats) +{} +static inline void taskstats_exit_send(struct task_struct *tsk) +{} +static inline void taskstats_init_early(void) +{} +#endif /* CONFIG_TASKSTATS */ + +#endif + Index: 
linux-2.6.17-rc1/Documentation/accounting/taskstats.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/Documentation/accounting/taskstats.txt 2006-04-21 20:29:22.000000000 -0400 @@ -0,0 +1,146 @@ +Per-task statistics interface +----------------------------- + + +Taskstats is a netlink-based interface for sending per-task and +per-process statistics from the kernel to userspace. + +Taskstats was designed for the following benefits: + +- efficiently provide statistics during lifetime of a task and on its exit +- unified interface for multiple accounting subsystems +- extensibility for use by future accounting patches + +Terminology +----------- + +"pid", "tid" and "task" are used interchangeably and refer to the standard +Linux task defined by struct task_struct. per-pid stats are the same as +per-task stats. + +"tgid", "process" and "thread group" are used interchangeably and refer to the +tasks that share an mm_struct i.e. the traditional Unix process. Despite the +use of tgid, there is no special treatment for the task that is thread group +leader - a process is deemed alive as long as it has any task belonging to it. + +Usage +----- + +To get statistics during a task's lifetime, userspace opens a unicast netlink +socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid. +The response contains statistics for a task (if pid is specified) or the sum of +statistics for all tasks of the process (if tgid is specified). + +To obtain statistics for tasks which are exiting, userspace opens a multicast +netlink socket. Each time a task exits, two records are sent by the kernel to +each listener on the multicast socket. The first is the per-pid task's statistics +and the second is the sum for all tasks of the process to which the task +belongs (the task does not need to be the thread group leader).
The need for +per-tgid stats to be sent for each exiting task is explained in the Advanced +Usage section below. + + +Interface +--------- + +The user-kernel interface is encapsulated in include/linux/taskstats.h + +To avoid this documentation becoming obsolete as the interface evolves, only +an outline of the current version is given. taskstats.h always overrides the +description here. + +struct taskstats is the common accounting structure for both per-pid and +per-tgid data. It is versioned and can be extended by each accounting subsystem +that is added to the kernel. The fields and their semantics are defined in the +taskstats_struct.h file. + +The data exchanged between user and kernel space is a netlink message belonging +to the NETLINK_GENERIC family and using the netlink attributes interface. +The messages are in the format + + +----------+- - -+-------------+-------------------+ + | nlmsghdr | Pad | genlmsghdr | taskstats payload | + +----------+- - -+-------------+-------------------+ + + +The taskstats payload is one of the following three kinds: + +1. Commands: Sent from user to kernel. The payload is one attribute, of type +TASKSTATS_CMD_ATTR_PID/TGID, containing a u32 pid or tgid in the attribute +payload. The pid/tgid denotes the task/process for which userspace wants +statistics. + +2. Response for a command: sent from the kernel in response to a userspace +command. The payload is a series of three attributes of type: + +a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates +a pid/tgid will be followed by some stats. + +b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats +are being returned. + +c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The +same structure is used for both per-pid and per-tgid stats. + +3. New message sent by kernel whenever a task exits.
The payload consists of a + series of attributes of the following type: + +a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats +b) TASKSTATS_TYPE_PID: contains exiting task's pid +c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats +d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats +e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs +f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process + + +per-tgid stats +-------------- + +Taskstats provides per-process stats, in addition to per-task stats, since +resource management is often done at a process granularity and aggregating task +stats in userspace alone is inefficient and potentially inaccurate (due to lack +of atomicity). + +However, maintaining per-process, in addition to per-task stats, within the +kernel has space and time overheads. Hence the taskstats implementation +dynamically sums up the per-task stats for each task belonging to a process +whenever per-process stats are needed. + +Not maintaining per-tgid stats creates a problem when userspace is interested +in getting these stats when the process dies i.e. the last thread of +a process exits. It isn't possible to simply return some aggregated per-process +statistic from the kernel. + +The approach taken by taskstats is to return the per-tgid stats *each* time +a task exits, in addition to the per-pid stats for that task. Userspace can +maintain task<->process mappings and use them to maintain the per-process stats +in userspace, updating the aggregate appropriately as the tasks of a process +exit. + +Extending taskstats +------------------- + +There are two ways to extend the taskstats interface to export more +per-task/process stats as patches to collect them get added to the kernel +in future: + +1. Adding more fields to the end of the existing struct taskstats. Backward + compatibility is ensured by the version number within the + structure. 
Userspace will use only the fields of the struct that correspond + to the version it's using. + +2. Defining separate statistic structs and using the netlink attributes + interface to return them. Since userspace processes each netlink attribute + independently, it can always ignore attributes whose type it does not + understand (because it is using an older version of the interface). + + +Choosing between 1. and 2. is a matter of trading off flexibility and +overhead. If only a few fields need to be added, then 1. is the preferable +path since the kernel and userspace don't need to incur the overhead of +processing new netlink attributes. But if the new fields expand the existing +struct too much, requiring disparate userspace accounting utilities to +unnecessarily receive large structures whose fields are of no interest, then +extending the attributes structure would be worthwhile. + +---- \ No newline at end of file Index: linux-2.6.17-rc1/init/main.c =================================================================== --- linux-2.6.17-rc1.orig/init/main.c 2006-04-21 19:39:28.000000000 -0400 +++ linux-2.6.17-rc1/init/main.c 2006-04-21 20:29:22.000000000 -0400 @@ -47,6 +47,7 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> +#include <linux/taskstats_kern.h> #include <linux/delayacct.h> #include <asm/io.h> @@ -542,6 +543,7 @@ asmlinkage void __init start_kernel(void proc_root_init(); #endif cpuset_init(); + taskstats_init_early(); delayacct_init(); check_bugs();
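The reply layout described in taskstats.txt (an AGGR_PID attribute wrapping a PID attribute and a STATS attribute) can be illustrated with a small userspace parser. This is only a sketch under stated assumptions: it does not use libnl or the kernel headers, the `demo_` helpers are hypothetical, the 4-byte attribute header (u16 length, u16 type) and 4-byte alignment are assumed to match struct nlattr, and the type values are copied from the enum in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute types, with values taken from the enum in taskstats.h above. */
enum { DEMO_TYPE_PID = 1, DEMO_TYPE_STATS = 3, DEMO_TYPE_AGGR_PID = 4 };

#define DEMO_NLA_HDRLEN   ((size_t)4)	/* u16 nla_len + u16 nla_type */
#define DEMO_NLA_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)

/* Append one attribute at offset off (buffer pre-zeroed); returns new offset. */
size_t demo_put_attr(unsigned char *buf, size_t off, uint16_t type,
		     const void *data, uint16_t len)
{
	uint16_t nla_len = (uint16_t)(DEMO_NLA_HDRLEN + len);

	memcpy(buf + off, &nla_len, 2);
	memcpy(buf + off + 2, &type, 2);
	if (len)
		memcpy(buf + off + DEMO_NLA_HDRLEN, data, len);
	return off + DEMO_NLA_ALIGN(nla_len);
}

/*
 * Walk the attributes in [buf, buf + len) and return the payload of the
 * first one of the given type, or NULL if absent or malformed.
 */
const unsigned char *demo_find_attr(const unsigned char *buf, size_t len,
				    uint16_t type, size_t *plen)
{
	size_t off = 0;

	while (off + DEMO_NLA_HDRLEN <= len) {
		uint16_t nla_len, nla_type;

		memcpy(&nla_len, buf + off, 2);
		memcpy(&nla_type, buf + off + 2, 2);
		if (nla_len < DEMO_NLA_HDRLEN || off + nla_len > len)
			return NULL;	/* malformed attribute */
		if (nla_type == type) {
			*plen = nla_len - DEMO_NLA_HDRLEN;
			return buf + off + DEMO_NLA_HDRLEN;
		}
		off += DEMO_NLA_ALIGN(nla_len);
	}
	return NULL;
}

/* Pull the pid out of an AGGR_PID container; 0 on success, -1 otherwise. */
int demo_parse_aggr_pid(const unsigned char *buf, size_t len, uint32_t *pid)
{
	size_t nested_len, pid_len;
	const unsigned char *nested, *p;

	nested = demo_find_attr(buf, len, DEMO_TYPE_AGGR_PID, &nested_len);
	if (!nested)
		return -1;
	p = demo_find_attr(nested, nested_len, DEMO_TYPE_PID, &pid_len);
	if (!p || pid_len != sizeof(*pid))
		return -1;
	memcpy(pid, p, sizeof(*pid));
	return 0;
}

/* Build a sample reply for pid and parse it back; returns 0 on failure. */
uint32_t demo_roundtrip_pid(uint32_t pid)
{
	unsigned char inner[32] = { 0 }, msg[64] = { 0 };
	int32_t stats = 0;	/* stand-in for the one-int struct taskstats */
	uint32_t out = 0;
	size_t inner_len, msg_len;

	inner_len = demo_put_attr(inner, 0, DEMO_TYPE_PID, &pid, sizeof(pid));
	inner_len = demo_put_attr(inner, inner_len, DEMO_TYPE_STATS,
				  &stats, sizeof(stats));
	msg_len = demo_put_attr(msg, 0, DEMO_TYPE_AGGR_PID, inner,
				(uint16_t)inner_len);
	if (demo_parse_aggr_pid(msg, msg_len, &out) != 0)
		return 0;
	return out;
}
```

A parser written this way skips attribute types it does not recognise, which is the property the "Extending taskstats" section relies on for its second extension method.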
From: Jay L. <jl...@en...> - 2006-04-27 01:12:57
Hi Shailabh,

Thanks for your effort in taskstats interface! Really appreciated!
I think this interface can offer a good foundation for other packages
to build on.

Here are a few more comments:

1) You mentioned the "version number within the (taskstats)
   structure" in taskstats.txt and a few other places, but I do not see
   that field defined in struct taskstats in taskstats.h?

2) In taskstats.txt "Extending taskstats" section, you mentioned two
   ways to extend the interface. The second method looks like a method
   to encourage other package developers to create their own interface
   (ie, not taskstats) based on generic netlink to avoid reading large
   number of fields not interested to other particular applications?
   I will be fine with this as long as it is understood and agreed.

   Alternatively, you may have considered the pros and cons of #ifdef
   fields specific to only one accounting package in the struct taskstats.
   If you do, care to share your thoughts? Specific payload information
   can be carried in the version field. I am sure the version number of
   struct taskstats does not need 64 bits. With the version number and
   payload info, application can surely interpret the taskstats data
   correctly.

3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
   Usage section below...", but that section does not exist.

4) In do_exit() routine, you do:
   + taskstats_exit_alloc(&tidstats, &tgidstats);

   The tidstats and tgidstats are checked in taskstats_exit_send() in
   taskstats.c for allocation failure, but a lot has been processed before
   the check. The allocation failure happens when system is stressed in
   memory. I think we want to do the check earlier?

Regards,
 - jay
From: Shailabh N. <na...@wa...> - 2006-04-27 04:01:26
Jay Lan wrote:

>Hi Shailabh,
>
>Thanks for your effort in taskstats interface! Really appreciated!
>I think this interface can offer a good foundation for other packages
>to build on.
>
>Here are a few more comments:
>
>1) You mentioned the "version number within the (taskstats)
>   structure" in taskstats.txt and a few other places, but I do not see
>   that field defined in struct taskstats in taskstats.h?

Missed out on that. Need to add it back in.

>2) In taskstats.txt "Extending taskstats" section, you mentioned two
>   ways to extend the interface. The second method looks like a method
>   to encourage other package developers to create their own interface
>   (ie, not taskstats) based on generic netlink to avoid reading large
>   number of fields not interested to other particular applications?
>   I will be fine with this as long as it is understood and agreed.

Yes, the second method is for other packages, which have very little in
common with struct taskstats, to extend the stats returned (using
netlink attribs to extend rather than extending the structure).

>   Alternatively, you may have considered the pros and cons of #ifdef
>   fields specific to only one accounting package in the struct taskstats.
>   If you do, care to share your thoughts?

I'd rather avoid doing an #ifdef'ed definition of the fields based on
configuration of one or the other accounting package... it'll add
complexity for userspace parsing of the structure.

It's quite acceptable to have the fields have zero as content if the
corresponding package isn't configured.

>   Specific payload information can be carried in the version field.
>   I am sure the version number of struct taskstats does not need
>   64 bits. With the version number and payload info, application can
>   surely interpret the taskstats data correctly.

By "payload info" you mean some sort of bitmask (or encoding) which
specifies which fields are present or absent? I suppose that could be
done but it adds unnecessary complexity? E.g. once delay accounting is
there, all six to eight fields corresponding to it will be present... I
don't see much value in further being able to configure cpu delays, mem
delays etc. separately. Is that different for CSA?

>3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
>   Usage section below...", but that section does not exist.

Thanks for pointing it out. Should replace it with "per-tgid stats
section".

>4) In do_exit() routine, you do:
>+ taskstats_exit_alloc(&tidstats, &tgidstats);
>
>   The tidstats and tgidstats are checked in taskstats_exit_send() in
>   taskstats.c for allocation failure, but a lot has been processed before
>   the check. The allocation failure happens when system is stressed in
>   memory. I think we want to do the check earlier?

Since accounting is non-critical, I didn't see the need for doing the
check earlier if we're not going to do anything about it. The first use
of the allocated structure is in taskstats_exit_send(), where filling
of the stats is not done if allocation failed. What would you suggest
we do, on allocation failure, if the check is performed immediately
after the alloc?

--Shailabh

>Regards,
> - jay
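The exchange about the version field and about leaving unconfigured fields at zero can be made concrete with a small userspace sketch. Everything here is hypothetical: the two `demo_stats` layouts, the field names `cpu_delay` and `blkio_delay`, and the 16-bit width of the version field are assumptions for illustration only, not the layout of struct taskstats.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical version 1 layout: version field plus one statistic. */
struct demo_stats_v1 {
	uint16_t version;
	uint16_t pad[3];	/* keeps the 64-bit fields aligned */
	uint64_t cpu_delay;
};

/* Hypothetical version 2 layout: new fields are appended at the end only. */
struct demo_stats_v2 {
	uint16_t version;
	uint16_t pad[3];
	uint64_t cpu_delay;
	uint64_t blkio_delay;	/* added in version 2 */
};

/*
 * Copy a received payload into the reader's (version 2) view of the
 * struct. Fields the sender did not supply stay zero, and anything the
 * sender appended beyond the reader's struct is ignored. Returns the
 * sender's version, or 0 if the payload is too short to carry one.
 */
uint16_t demo_read_stats(const void *payload, size_t len,
			 struct demo_stats_v2 *out)
{
	memset(out, 0, sizeof(*out));
	if (len < sizeof(out->version))
		return 0;
	memcpy(out, payload, len < sizeof(*out) ? len : sizeof(*out));
	return out->version;
}

/* A version 1 sender talking to a version 2 reader; returns 1 if sane. */
int demo_old_sender_new_reader(void)
{
	struct demo_stats_v1 sent = { .version = 1, .cpu_delay = 42 };
	struct demo_stats_v2 got;

	if (demo_read_stats(&sent, sizeof(sent), &got) != 1)
		return 0;
	/* The reader should trust blkio_delay only when version >= 2. */
	return got.cpu_delay == 42 && got.blkio_delay == 0;
}
```

With append-only fields, the version number tells the reader which trailing fields are meaningful, so no #ifdef'ed layout is needed and absent statistics simply read as zero.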
From: Balbir S. <ba...@in...> - 2006-04-27 06:45:34
On Thu, Apr 27, 2006 at 12:00:43AM -0400, Shailabh Nagar wrote:
> Jay Lan wrote:
>
> >Hi Shailabh,
> >
> >Thanks for your effort in taskstats interface! Really appreciated!
> >I think this interface can offer a good foundation for other packages
> >to build on.
> >
> >Here are a few more comments:
> >
> >1) You mentioned the "version number within the (taskstats)
> >   structure" in taskstats.txt and a few other places, but I do not see
> >   that field defined in struct taskstats in taskstats.h?
>
> Missed out on that. Need to add it back in.

There is a version field in genl_family as well. That can also be used
for versioning. When a user space tool queries for the family id, it can
obtain and interpret the version information.

> >2) In taskstats.txt "Extending taskstats" section, you mentioned two
> >   ways to extend the interface. The second method looks like a method
> >   to encourage other package developers to create their own interface
> >   (ie, not taskstats) based on generic netlink to avoid reading large
> >   number of fields not interested to other particular applications?
> >   I will be fine with this as long as it is understood and agreed.
>
> Yes, the second method is for other packages, which have very little in
> common with the struct taskstats to extend the stats returned (using
> netlink attribs to extend rather than extending the structure).

The second method will require the following

1. An API to return the length of data it wants to fill in
2. Another API to fill in the statistics along with the type
   - Like Shailabh mentioned, this will require creating a new
     TASKSTATS_TYPE_XXXX

> >   Alternatively, you may have considered the pros and cons of #ifdef
> >   fields specific to only one accounting package in the struct taskstats.
> >   If you do, care to share your thoughts?
>
> I'd rather avoid doing an #ifdef'ed definition of the fields based on
> configuration of one or the other accounting package...it'll add
> complexity for userspace parsing of the structure.
>
> Its quite acceptable to have the fields have zero as content if the
> corresponding package isn't configured.

I agree with Shailabh, building in knowledge of other subsystems into
taskstats.h might not be the best choice.

> >   Specific payload information can be carried in the version field.
> >   I am sure the version number of struct taskstats does not need
> >   64 bits. With the version number and payload info, application can
> >   surely interpret the taskstats data correctly.
>
> By "payload info" you mean some sort of bitmask (or encoding) which
> specifies which fields are present or absent ? I suppose that could be
> done but it adds unnecessary complexity ? e.g once delay accounting is
> there, all six to eight fields corresponding to it will be present...I
> don't see much value in further being able to configure cpu delays, mem
> delays etc. separately. Is that different for CSA ?

Netlink attributes can be used to determine which attribute types are
present in the payload. libnl does a great job of providing a good set
of APIs to determine all attribute types present. This is one of the
biggest advantages I see of genetlink (attributes are optional and can
co-exist simultaneously).

> >3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
> >   Usage section below...", but that section does not exist.
>
> Thanks for pointing it out. Should replace it with "per-tgid stats
> section".
>
> >4) In do_exit() routine, you do:
> >+ taskstats_exit_alloc(&tidstats, &tgidstats);
> >
> >   The tidstats and tgidstats are checked in taskstats_exit_send() in
> >   taskstats.c for allocation failure, but a lot has been processed before
> >   the check. The allocation failure happens when system is stressed in
> >   memory. I think we want to do the check earlier?
>
> Since accounting is non-critical, I didn't see the need for doing the
> check earlier if we're not going to do anything about it. The first use
> of the allocated structure is in the taskstats_exit_send() where filling
> of the stats is not done if allocation failed. What would you suggest we
> do, on allocation failure, if the check is performed immediately after
> the alloc ?
>
> --Shailabh
>
> >Regards,
> >- jay

<snip>

<--- Balbir
From: Jay L. <jl...@en...> - 2006-04-27 17:52:36
|
Balbir Singh wrote: > On Thu, Apr 27, 2006 at 12:00:43AM -0400, Shailabh Nagar wrote: > >>Jay Lan wrote: >> >> >>>Hi Shailabh, >>> >>>Thanks for your effort in taskstats interface! Really appreciated! >>>I think this interface can offer a good foundation for other packages >>>to build on. >>> >>>Here are a few more comments: >>> >>>1) You mentioned the "version number within the (taskstats) >>> structure" in taskstats.txt and a few other places, but i do not see >>> that field defined in struct taskstats in taskstats.h? >>> >>> >> >>Missed out on that. Need to add it back in. > > > There is a version field in genl_family as well. That can be used > for versioning as well. When we user space tool queries for the family > id, it can obtain and interpret the version information. Hi Shailabh and Balbir, Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing? If they are meant to serve different purposes, we still need it. > > >>>2) In taskstats.txt "Extending taskstats" section, you mentioned two >>> ways to extend the interface. The second method looks like a method >>> to encoureage other package developers to create their own interface >>> (ie, not taskstats) based on generic netlink to avoid reading large >>>number >>> of fields not interested to other particular applications? I will be >>>fine >>> with this as long as it is understood and agreed. >>> >>> >> >>Yes, the second method is for other packages, which have very little in >>common with the struct >>taskstats to extend the stats returned (using netlink attribs to extend >>rather than extending the structure). > > > The second method will require the following > > 1. An API to return the length of data it wants to fill in > 2. 
Another API to fill in the statistics along with the type - > Like Shailabh mentioned, this will require creating a new TASKSTATS_TYPE_XXXX > > >>> Alternatively, you may have considered the pros and cons of #ifdef >>> fields specific to only one accounting package in the struct taskstats. >>> If you do, care to share your thoughts? >>> >> >>I'd rather avoid doing an #ifdef'ed definition of the fields based on >>configuration of one or the other >>accounting package...it'll add complexity for userspace parsing of the >>structure. >> >>Its quite acceptable to have the fields have zero as content if the >>corresponding package isn't configured. >> > > > I agree with Shailabh, building in knowledge of other subsystems into > taskstats.h might not be the best choice. > > >>>Specific payload information >>> can be carried in the version field. I am sure the version number of >>>struct >>> taskstats does not need 64 bits. With the version number and payload >>> info, application can surely interpret the taskstats data correctly. >>> >>> >> >>By "payload info" you mean some sort of bitmask (or encoding) which >>specifies which fields are present >>or absent ? I suppose that could be done but it adds unnecessary >>complexity ? e.g once delay accounting is there, >>all six to eight fields corresponding to it will be present...I don't >>see much value in further being able to configure >>cpu delays, mem delays etc. separately. Is that different for CSA ? I was thinking of a bitmask thing. But instead of keying specific fields, one bit may be used to key delay accounting, and another bit for CSA, el at. This way you do not need to have CSA-specifc fields in the payload and applications know how to correctly interpret the payload. Taskstats and application do not need to have knowledge of accounting packages, only need to set the bitmasks correctly. When we start sending sys stats of each tasks to userland, that is s lot of data. 
Note that BSD accounting even uses encode_comp_t() routine to compress data into a 13-bit fraction with 3-bit exponent field to shrink its size. Even though you do not need to care about those zero's in taskstats, they still need to be delievered through netlink socket. I must admit that this may create a point of failure due to the payload info not set correctly according to the CONFIG flags. The idea was to eliminate the need of #2 methods, but maybe #2 method is better... I am a little confused after reading Balbir's reply. It seems to me that Shailabh suggested to create a different struct to contain stats data. Is that also what Balbir talked about? If a different package builds a different taskstat-like interface as suggested in #2, would the data travel on the same socket as delay accounting? > > > Netlink attributes can be used to determine which attribute types are > present in the payload. libnl does a great job of providing a good set of > APIs to determine all attribute types present. This is one of the biggest > advantages I see of genetlink (attributes are optional and can co-exist > simultaneously) > > >> >>>3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced >>> Usage section below...", but that section does not exist. >>> >>> >> >>Thanks for pointing it out. Should replace it with "per-tgid stats section". >> >> >>>4) In do_exit() routine, you do: >>>+ taskstats_exit_alloc(&tidstats, &tgidstats); >>> >>> The tidstats and tgidstats are checked in taskstats_exit_send() in >>> taskstats.c for allocation failure, but a lot has been processed before >>> the check. The allocation failure happens when system is stressed in >>> memory. I think we want to do the check earlier? >>> >>> >> >>Since accounting is non-critical, I didn't see the need for doing the >>check earlier if we're not going to do >>anything about it. 
The first use of the allocated structure is in the >>taskstats_exit_send() where filling of the >>stats is not done if allocation failed. What would you suggest we do, on >>allocation failure, if the check is >>performed immediately after the alloc ? I would suggest to do the check at the beginning of taskstats_exit_send() before mutex_lock(&taskstats_exit_mutex). Regards, - jay |
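The compression mentioned above (a 13-bit fraction with a 3-bit exponent) can be illustrated with a short C sketch. This is a simplified illustration, not the kernel's encode_comp_t(): the names comp_encode and comp_decode are hypothetical, the exponent is assumed to be base 8 (each step drops 3 low bits), and the sketch truncates instead of rounding.

```c
#include <assert.h>
#include <stdint.h>

#define MANT_BITS 13u                        /* width of the fraction */
#define EXP_BITS  3u                         /* width of the exponent */
#define MANT_MAX  ((1u << MANT_BITS) - 1u)   /* 8191 */
#define EXP_MAX   ((1u << EXP_BITS) - 1u)    /* 7 */

/* Hypothetical helper: pack a counter into 16 bits, exponent above fraction. */
static uint16_t comp_encode(uint64_t value)
{
	unsigned int exp = 0;

	/* Drop 3 low bits per step until the value fits in 13 bits. */
	while (value > MANT_MAX) {
		value >>= EXP_BITS;
		exp++;
	}
	if (exp > EXP_MAX)
		return UINT16_MAX;	/* too large: saturate at the biggest code */
	return (uint16_t)((exp << MANT_BITS) | (unsigned int)value);
}

/* Hypothetical helper: expand a 16-bit code back to an approximate counter. */
static uint64_t comp_decode(uint16_t code)
{
	unsigned int exp = code >> MANT_BITS;
	uint64_t mant = code & MANT_MAX;

	return mant << (EXP_BITS * exp);
}
```

The trade-off is precision for size: values up to 8191 survive exactly, larger ones lose their low bits, and every counter costs 2 bytes on the socket instead of 8.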
From: Balbir S. <ba...@in...> - 2006-04-27 18:30:30
|
Hi Jay, > Hi Shailabh and Balbir, > > Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing? > If they are meant to serve different purposes, we still need it. > Yes, thats true. But for now from what I can see, one version should be sufficient. <snip> > I was thinking of a bitmask thing. But instead of keying specific > fields, one bit may be used to key delay accounting, and another bit > for CSA, el at. This way you do not need to have CSA-specifc fields > in the payload and applications know how to correctly interpret the > payload. Taskstats and application do not need to have knowledge of > accounting packages, only need to set the bitmasks correctly. > Yes, but scanning the entire payload for various types is also feasible. It is a bit slow, but feasible and generally the recommended approach for dealing with genetlink types. What you are saying is still possible, the application can ignore types it does not understand. > When we start sending sys stats of each tasks to userland, that is > s lot of data. Note that BSD accounting even uses encode_comp_t() > routine to compress data into a 13-bit fraction with 3-bit exponent > field to shrink its size. Even though you do not need to care > about those zero's in taskstats, they still need to be delievered > through netlink socket. Yes, thats true. We can leave the decision of compressing, etc to the specific subsystem. It can encode it and the user level application can decode the data. > > I must admit that this may create a point of failure due to the > payload info not set correctly according to the CONFIG flags. > > The idea was to eliminate the need of #2 methods, but maybe > #2 method is better... > > I am a little confused after reading Balbir's reply. It seems to > me that Shailabh suggested to create a different struct to contain > stats data. Is that also what Balbir talked about? 
If a different > package builds a different taskstat-like interface as suggested > in #2, would the data travel on the same socket as delay > accounting? Sorry for the confusion. Yes, even I would recommend creating a different struct for the stats data. The data will pass over the same socket as delay accounting (separate sockets can be used, but that would become inefficient). > > I would suggest to do the check at the beginning of > taskstats_exit_send() before mutex_lock(&taskstats_exit_mutex). Good suggestion, we can move the check to that point. > > Regards, > - jay -- Warm Regards, <--- Balbir |
From: Jay L. <jl...@en...> - 2006-04-27 19:34:51
|
Hi Balbir, Balbir Singh wrote: >>Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing? >>If they are meant to serve different purposes, we still need it. >> > > > Yes, thats true. But for now from what I can see, one version should > be sufficient. If we envision a need of it in the future, we'd better put it in today. It would be nice to have the revision number at beginning of the struct. Shailabh's instruction says to add new field after existing fields. > > <snip> > > >>I was thinking of a bitmask thing. But instead of keying specific >>fields, one bit may be used to key delay accounting, and another bit >>for CSA, el at. This way you do not need to have CSA-specifc fields >>in the payload and applications know how to correctly interpret the >>payload. Taskstats and application do not need to have knowledge of >>accounting packages, only need to set the bitmasks correctly. >> > > > Yes, but scanning the entire payload for various types is also feasible. It is > a bit slow, but feasible and generally the recommended approach for > dealing with genetlink types. What you are saying is still possible, the > application can ignore types it does not understand. > > >>When we start sending sys stats of each tasks to userland, that is >>s lot of data. Note that BSD accounting even uses encode_comp_t() >>routine to compress data into a 13-bit fraction with 3-bit exponent >>field to shrink its size. Even though you do not need to care >>about those zero's in taskstats, they still need to be delievered >>through netlink socket. > > > Yes, thats true. We can leave the decision of compressing, etc to the > specific subsystem. It can encode it and the user level application > can decode the data. I am sorry that i did not make myself clear. My suggestion of using the bitmask payload info is to be combined with #ifdef CONFIG_* to eliminate unnecessary fields from the traffic. I am concerned about losing data due to application not reading data fast enough. 
Well, we can revisit this suggestion when we start losing data though. ;-) Regards, - jay |
From: Balbir S. <ba...@in...> - 2006-04-28 03:02:23
|
> If we envision a need of it in the future, we'd better put it in > today. It would be nice to have the revision number at beginning of > the struct. Shailabh's instruction says to add new field after existing > fields. > Yes, true. It does not hurt to have a version number for taskstats. I will add it in. <snip> > > I am sorry that i did not make myself clear. My suggestion of using > the bitmask payload info is to be combined with #ifdef CONFIG_* to > eliminate unnecessary fields from the traffic. I am concerned about > losing data due to application not reading data fast enough. > > Well, we can revisit this suggestion when we start losing data > though. ;-) Like Shailabh said #ifdef CONFIG_* adds complexity for userspace parsing of the structure, but if it helps avoid sending unnecessary data we can consider using that approach. Would something like the structure below be useful? struct csastats { #if defined(CONFIG_CSA) || defined(CONFIG_CSA_MODULE) char acctent[sizeof(struct acctcsa) + sizeof(struct acctmem) + sizeof(struct acctio)]; int filled; #endif }; The filled member can be a bool or an int to indicate that the structure contains meaningful data and the CONFIG_* is used to control the inclusion of meaningful fields. Instead of using a bitmap we use the filled member. Is this what you had in mind? -- <--- Balbir |
From: Jay L. <jl...@en...> - 2006-04-28 18:21:21
|
Balbir Singh wrote: >>If we envision a need of it in the future, we'd better put it in >>today. It would be nice to have the revision number at beginning of >>the struct. Shailabh's instruction says to add new field after existing >>fields. >> >> > >Yes, true. It does not hurt to have a version number for taskstats. >I will add it in. > ><snip> > > >>I am sorry that i did not make myself clear. My suggestion of using >>the bitmask payload info is to be combined with #ifdef CONFIG_* to >>eliminate unnecessary fields from the traffic. I am concerned about >>losing data due to application not reading data fast enough. >> >>Well, we can revisit this suggestion when we start losing data >>though. ;-) >> > >Like Shailabh said #ifdef CONFIG_* adds complexity for userspace parsing >of the structure, but if it helps avoid sending unnecessary data we >can consider using that approach. > >Would something like the structure below be useful? > >struct csastats { >#if defined(CONFIG_CSA) || defined(CONFIG_CSA_MODULE) > char acctent[sizeof(struct acctcsa) + > sizeof(struct acctmem) + > sizeof(struct acctio)]; > int filled; >#endif >}; > >The filled member can be a bool or an int to indicate that the structure >contains meaningful data and the CONFIG_* is used to control the >inclusion of meaningful fields. Instead of using a bitmap we use >the filled member. > >Is this what you had in mind? > Not exactly. The payload information must always be available to the application. On second thought, the idea of one big taskstats struct with many #ifdef CONFIG_* is not really a good idea. My goal is to cut down unnecessary data being transferred through the socket. Here is my Take 2. We can have a taskstats header containing taskstats version and other general fields useful to more than one taskstats application including payload information. Then, we define accounting subsystem specific structs for delayacct, csa, etc.
The kernel/{delayacct.c,csa.c,etc.c} set the payload information and fill the buffer with desired subsystem structs. The header thus contains enough information to tell applications how to map the data following the header. Would IBM propose more accounting subsystems besides delayacct? If we only see delayacct and csa on the horizon, this scheme is really not necessary since delayacct does not have as much data (as csa :)) and csa can use part of the delayacct data. You gain more than csa can benefit from this. ;-) I guess I am just speaking from a design point of view. :) But, if one day somebody who does not need a paycheck decides to convert BSD accounting to use the taskstats interface, this can be helpful. Thanks, - jay |
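The "Take 2" layout described above (a common header carrying a version and payload information, followed only by the subsystem structs that are present) might look roughly like the sketch below. Every name here is hypothetical and none of it comes from the posted patches: stats_hdr, delay_part, csa_part, the PAYLOAD_* bits and both helpers are made up for illustration, and the field choices are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; none come from the posted patches. */
#define PAYLOAD_DELAYACCT (1u << 0)
#define PAYLOAD_CSA       (1u << 1)

struct stats_hdr  { uint16_t version; uint16_t payload_mask; };
struct delay_part { uint64_t cpu_count; uint64_t cpu_delay_total; };
struct csa_part   { uint64_t io_chars; };

/*
 * Write the header, then each part whose bit is set in mask.
 * A part pointer may be NULL only if its bit is clear.
 * Returns the number of bytes used, or 0 if buf is too small.
 */
static size_t pack_stats(unsigned char *buf, size_t cap, uint16_t mask,
			 const struct delay_part *d, const struct csa_part *c)
{
	struct stats_hdr hdr = { 1, mask };
	size_t need = sizeof(hdr), off = 0;

	if (mask & PAYLOAD_DELAYACCT)
		need += sizeof(*d);
	if (mask & PAYLOAD_CSA)
		need += sizeof(*c);
	if (need > cap)
		return 0;

	memcpy(buf + off, &hdr, sizeof(hdr));
	off += sizeof(hdr);
	if (mask & PAYLOAD_DELAYACCT) {
		memcpy(buf + off, d, sizeof(*d));
		off += sizeof(*d);
	}
	if (mask & PAYLOAD_CSA) {
		memcpy(buf + off, c, sizeof(*c));
		off += sizeof(*c);
	}
	return off;
}

/*
 * Read the header and copy out only the parts it announces.
 * Returns the payload mask, or -1 if the buffer is truncated.
 */
static int unpack_stats(const unsigned char *buf, size_t len,
			struct delay_part *d, struct csa_part *c)
{
	struct stats_hdr hdr;
	size_t off = sizeof(hdr);

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.payload_mask & PAYLOAD_DELAYACCT) {
		if (len - off < sizeof(*d))
			return -1;
		memcpy(d, buf + off, sizeof(*d));
		off += sizeof(*d);
	}
	if (hdr.payload_mask & PAYLOAD_CSA) {
		if (len - off < sizeof(*c))
			return -1;
		memcpy(c, buf + off, sizeof(*c));
	}
	return hdr.payload_mask;
}
```

With only the CSA bit set, the message is the header plus one small struct, so a reader that cares only about CSA never receives the zero-filled delay accounting fields. The cost is that sender and reader must agree on the bit assignments and on the order of the parts.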
From: Balbir S. <ba...@in...> - 2006-04-28 18:40:04
|
> >Is this what you had in mind? > > > Not exactly. The payload information must always be available to the > application. > > On second thought, the idea of one big taskstats struct with many > #ifdef CONFIG_* is not really a good idea. My goal is to cut down unnecessary > data being transferred through the socket. Yes, so we agree that #ifdef CONFIG_* is not good. > > Here is my Take 2. We can have a taskstats header containing taskstats > version and other general fields useful to more than one taskstats > application including payload information. Then, we define > accounting subsystem specific structs for delayacct, csa, etc. > The kernel/{delayacct.c,csa.c,etc.c} set the payload information and > fill the buffer with desired subsystem structs. The header thus contains > enough information to tell applications how to map the data following > the header. I agree with this suggestion. Each netlink attribute contains the following fields (also referred to as TLV)

+----+--------+------+
|Type| length | value|
+----+--------+------+

The type is meant to serve the purpose of the header you describe. The type value can be used by the application to map the data. getdelays.c, a sample application posted in the previous patches, interprets data based on type. > > Would IBM propose more accounting subsystems besides delayacct? > If we only see delayacct and csa on the horizon, this scheme is really > not necessary since delayacct does not have as much data (as csa :)) > and csa can use part of the delayacct data. You gain more than > csa can benefit from this. ;-) I guess I am just speaking from a design point > of view. :) > > But, if one day somebody who does not need a paycheck decides > to convert BSD accounting to use the taskstats interface, this can > be helpful. > Yes, I think in the long term it would be more useful to use the scheme of adding subsystem structs. taskstats.txt explains the process of extending taskstats. Point #2 is the same as what we have just discussed.
Could you please see if the text needs any changes based on our discussions so far (taskstats.txt was posted in the delayacct-doc.patch). > Thanks, > - jay > > -- <--- Balbir |
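The type/length/value scheme described above, where an application maps data by type and skips types it does not understand, can be sketched as a small attribute walker. This is a userspace illustration under stated assumptions, not code from getdelays.c or libnl: put_attr and find_attr are hypothetical helpers, and the layout mirrors struct nlattr (a 16-bit length that includes the 4-byte header, then a 16-bit type, with each attribute padded to a 4-byte boundary).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN   4u                          /* 16-bit length + 16-bit type */
#define ATTR_ALIGN(n) (((n) + 3u) & ~3u)          /* pad to a 4-byte boundary */

/*
 * Hypothetical helper: append one attribute at offset off.
 * The caller must make sure buf has room. Returns the next offset.
 */
static size_t put_attr(unsigned char *buf, size_t off, uint16_t type,
		       const void *data, uint16_t dlen)
{
	uint16_t len = (uint16_t)(ATTR_HDRLEN + dlen);

	memset(buf + off, 0, ATTR_ALIGN(len));	/* zero the padding too */
	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + 2, &type, sizeof(type));
	memcpy(buf + off + ATTR_HDRLEN, data, dlen);
	return off + ATTR_ALIGN(len);
}

/*
 * Hypothetical helper: scan the payload for the first attribute of type
 * want, stepping over every other type. Returns a pointer to the value
 * and stores its length in *dlen, or NULL if absent or malformed.
 */
static const unsigned char *find_attr(const unsigned char *buf, size_t total,
				      uint16_t want, uint16_t *dlen)
{
	size_t off = 0;

	while (off + ATTR_HDRLEN <= total) {
		uint16_t len, type;

		memcpy(&len, buf + off, sizeof(len));
		memcpy(&type, buf + off + 2, sizeof(type));
		if (len < ATTR_HDRLEN || off + len > total)
			return NULL;		/* malformed attribute */
		if (type == want) {
			*dlen = (uint16_t)(len - ATTR_HDRLEN);
			return buf + off + ATTR_HDRLEN;
		}
		off += ATTR_ALIGN(len);		/* unknown type: skip it */
	}
	return NULL;
}
```

Because each attribute announces its own length, a reader can step over types added by another accounting package without knowing their layout, which is what lets several subsystems share one socket.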
From: Shailabh N. <na...@wa...> - 2006-04-22 02:39:34
|
Changelog Fixes comments by akpm (on earlier patch now incorporated here) - detailed comments on atomicity rules of accounting fields - replace use of nsec_t delayacct-taskstats.patch Usage of taskstats interface by delay accounting. Signed-off-by: Shailabh Nagar <na...@us...> Signed-off-by: Balbir Singh <ba...@in...> include/linux/delayacct.h | 11 ++++++++++ include/linux/taskstats.h | 48 +++++++++++++++++++++++++++++++++++++++++++++- init/Kconfig | 1 kernel/delayacct.c | 42 ++++++++++++++++++++++++++++++++++++++++ kernel/taskstats.c | 9 +++++++- 5 files changed, 109 insertions(+), 2 deletions(-) Index: linux-2.6.17-rc1/include/linux/delayacct.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/delayacct.h 2006-04-21 20:29:13.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 20:42:41.000000000 -0400 @@ -18,6 +18,7 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/taskstats_kern.h> /* * Per-task flags relevant to delay accounting @@ -35,6 +36,7 @@ extern void __delayacct_tsk_init(struct extern void __delayacct_tsk_exit(struct task_struct *); extern void __delayacct_blkio_start(void); extern void __delayacct_blkio_end(void); +extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *); static inline void delayacct_set_flag(int flag) { @@ -74,6 +76,13 @@ static inline void delayacct_blkio_end(v __delayacct_blkio_end(); } +static inline int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ + if (!tsk->delays) + return -EINVAL; + return __delayacct_add_tsk(d, tsk); +} + #else static inline void delayacct_set_flag(int flag) {} @@ -89,6 +98,8 @@ static inline void delayacct_blkio_start {} static inline void delayacct_blkio_end(void) {} +static inline int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ return 0; } #endif /* CONFIG_TASK_DELAY_ACCT */ #endif Index: linux-2.6.17-rc1/kernel/delayacct.c 
=================================================================== --- linux-2.6.17-rc1.orig/kernel/delayacct.c 2006-04-21 20:29:13.000000000 -0400 +++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 20:40:03.000000000 -0400 @@ -104,3 +104,45 @@ void __delayacct_blkio_end(void) &current->delays->blkio_delay, &current->delays->blkio_count); } + +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ + s64 tmp; + struct timespec ts; + unsigned long t1,t2,t3; + + + tmp = (s64)d->cpu_run_real_total; + tmp += (u64)(tsk->utime + tsk->stime) * TICK_NSEC; + d->cpu_run_real_total = (tmp < (s64)d->cpu_run_real_total) ? 0 : tmp; + + /* No locking available for sched_info (and too expensive to add one) + * Mitigate by taking snapshot of values + */ + t1 = tsk->sched_info.pcnt; + t2 = tsk->sched_info.run_delay; + t3 = tsk->sched_info.cpu_time; + + d->cpu_count += t1; + + jiffies_to_timespec(t2, &ts); + tmp = (s64)d->cpu_delay_total + timespec_to_ns(&ts); + d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 0 : tmp; + + tmp = (s64)d->cpu_run_virtual_total + (s64)jiffies_to_usecs(t3) * 1000; + d->cpu_run_virtual_total = + (tmp < (s64)d->cpu_run_virtual_total) ? 0 : tmp; + + /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */ + + spin_lock(&tsk->delays->lock); + tmp = d->blkio_delay_total + tsk->delays->blkio_delay; + d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp; + tmp = d->swapin_delay_total + tsk->delays->swapin_delay; + d->swapin_delay_total = (tmp < d->swapin_delay_total) ?
0 : tmp; + d->blkio_count += tsk->delays->blkio_count; + d->swapin_count += tsk->delays->swapin_count; + spin_unlock(&tsk->delays->lock); + + return 0; +} Index: linux-2.6.17-rc1/kernel/taskstats.c =================================================================== --- linux-2.6.17-rc1.orig/kernel/taskstats.c 2006-04-21 20:29:22.000000000 -0400 +++ linux-2.6.17-rc1/kernel/taskstats.c 2006-04-21 20:40:03.000000000 -0400 @@ -18,6 +18,7 @@ #include <linux/kernel.h> #include <linux/taskstats_kern.h> +#include <linux/delayacct.h> #include <net/genetlink.h> #include <asm/atomic.h> @@ -119,7 +120,9 @@ static int fill_pid(pid_t pid, struct ta * goto err; */ -err: + rc = delayacct_add_tsk(stats, tsk); + + /* Define err: label here if needed */ put_task_struct(tsk); return rc; @@ -151,6 +154,10 @@ static int fill_tgid(pid_t tgid, struct * break; */ + rc = delayacct_add_tsk(stats, tsk); + if (rc) + break; + } while_each_thread(first, tsk); read_unlock(&tasklist_lock); Index: linux-2.6.17-rc1/init/Kconfig =================================================================== --- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-21 20:29:22.000000000 -0400 +++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 20:40:03.000000000 -0400 @@ -164,6 +164,7 @@ config TASKSTATS config TASK_DELAY_ACCT bool "Enable per-task delay accounting (EXPERIMENTAL)" + depends on TASKSTATS help Collect information on time spent by a task waiting for system resources like cpu, synchronous block I/O completion and swapping Index: linux-2.6.17-rc1/include/linux/taskstats.h =================================================================== --- linux-2.6.17-rc1.orig/include/linux/taskstats.h 2006-04-21 20:31:11.000000000 -0400 +++ linux-2.6.17-rc1/include/linux/taskstats.h 2006-04-21 20:45:17.000000000 -0400 @@ -35,7 +35,53 @@ struct taskstats { /* Version 1 */ - int filler_avoids_empty_struct_warnings; + /* Delay accounting fields start + * + * All values, until comment "Delay accounting fields end" are + * available 
only if delay accounting is enabled, even though the last + * few fields are not delays + * + * xxx_count is the number of delay values recorded + * xxx_delay_total is the corresponding cumulative delay in nanoseconds + * + * xxx_delay_total wraps around to zero on overflow + * xxx_count incremented regardless of overflow + */ + + /* Delay waiting for cpu, while runnable + * count, delay_total NOT updated atomically + */ + __u64 cpu_count; + __u64 cpu_delay_total; + + /* Following four fields atomically updated using task->delays->lock */ + + /* Delay waiting for synchronous block I/O to complete + * does not account for delays in I/O submission + */ + __u64 blkio_count; + __u64 blkio_delay_total; + + /* Delay waiting for page fault I/O (swap in only) */ + __u64 swapin_count; + __u64 swapin_delay_total; + + /* cpu "wall-clock" running time + * On some architectures, value will adjust for cpu time stolen + * from the kernel in involuntary waits due to virtualization. + * Value is cumulative, in nanoseconds, without a corresponding count + * and wraps around to zero silently on overflow + */ + __u64 cpu_run_real_total; + + /* cpu "virtual" running time + * Uses time intervals seen by the kernel i.e. no adjustment + * for kernel's involuntary waits due to virtualization. + * Value is cumulative, in nanoseconds, without a corresponding count + * and wraps around to zero silently on overflow + */ + __u64 cpu_run_virtual_total; + /* Delay accounting fields end */ }; |
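The overflow convention in the taskstats.h comments above (xxx_delay_total drops to zero on overflow while xxx_count keeps counting) can be sketched in isolation. This is a userspace illustration under stated assumptions: add_or_zero and stat_overflowed are hypothetical names, and the sketch uses plain unsigned 64-bit arithmetic rather than the s64 casts the patch applies to the cpu fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: accumulate delta into total; report 0 if the sum wrapped. */
static uint64_t add_or_zero(uint64_t total, uint64_t delta)
{
	uint64_t sum = total + delta;	/* unsigned addition wraps modulo 2^64 */

	return (sum < total) ? 0 : sum;
}

/* Hypothetical reader-side check: zero total with non-zero count hints at overflow. */
static int stat_overflowed(uint64_t count, uint64_t total)
{
	return count != 0 && total == 0;
}
```

The reader-side check is only a heuristic: a task that recorded events but no measurable delay also shows a non-zero count with a zero total, so a consumer would probably want to compare against an earlier reading before treating the value as an overflow.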
From: Shailabh N. <na...@wa...> - 2006-04-22 02:41:00
|
delayacct-doc.patch Some documentation for delay accounting. Signed-off-by: Shailabh Nagar <na...@wa...> Signed-off-by: Balbir Singh <ba...@in...> Documentation/accounting/delay-accounting.txt | 115 +++++++ Documentation/accounting/getdelays.c | 376 ++++++++++++++++++++++++++ Documentation/accounting/taskstats.txt | 2 3 files changed, 493 insertions(+) Index: linux-2.6.17-rc1/Documentation/accounting/delay-accounting.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/Documentation/accounting/delay-accounting.txt 2006-04-21 20:50:22.000000000 -0400 @@ -0,0 +1,115 @@ +Delay accounting +---------------- + +Tasks encounter delays in execution when they wait +for some kernel resource to become available e.g. a +runnable task may wait for a free CPU to run on. + +The per-task delay accounting functionality measures +the delays experienced by a task while + +a) waiting for a CPU (while being runnable) +b) completion of synchronous block I/O initiated by the task +c) swapping in pages + +and makes these statistics available to userspace through +the taskstats interface. + +Such delays provide feedback for setting a task's cpu priority, +io priority and rss limit values appropriately. Long delays for +important tasks could be a trigger for raising its corresponding priority. + +The functionality, through its use of the taskstats interface, also provides +delay statistics aggregated for all tasks (or threads) belonging to a +thread group (corresponding to a traditional Unix process). This is a commonly +needed aggregation that is more efficiently done by the kernel. + +Userspace utilities, particularly resource management applications, can also +aggregate delay statistics into arbitrary groups. To enable this, delay +statistics of a task are available both during its lifetime as well as on its +exit, ensuring continuous and complete monitoring can be done. 
+ + +Interface +--------- + +Delay accounting uses the taskstats interface which is described +in detail in a separate document in this directory. Taskstats returns a +generic data structure to userspace corresponding to per-pid and per-tgid +statistics. The delay accounting functionality populates specific fields of +this structure. See + include/linux/taskstats.h +for a description of the fields pertaining to delay accounting. +It will generally be in the form of counters returning the cumulative +delay seen for cpu, sync block I/O, swapin etc. + +Taking the difference of two successive readings of a given +counter (say cpu_delay_total) for a task will give the delay +experienced by the task waiting for the corresponding resource +in that interval. + +When a task exits, records containing the per-task and per-process statistics +are sent to userspace without requiring a command. More details are given in +the taskstats interface description. + +The getdelays.c userspace utility in this directory allows simple commands to +be run and the corresponding delay statistics to be displayed. It also serves +as an example of using the taskstats interface. + +Usage +----- + +Compile the kernel with + CONFIG_TASK_DELAY_ACCT=y + CONFIG_TASKSTATS=y + +Enable the accounting at boot time by adding +the following to the kernel boot options + delayacct + +and after the system has booted up, use a utility +similar to getdelays.c to access the delays +seen by a given task or a task group (tgid). +The utility also allows a given command to be +executed and the corresponding delays to be +seen. + +General format of the getdelays command + +getdelays [-t tgid] [-p pid] [-c cmd...] 
+ + +Get delays, since system boot, for pid 10 +# ./getdelays -p 10 +(output similar to next case) + +Get sum of delays, since system boot, for all pids with tgid 5 +# ./getdelays -t 5 + + +CPU count real total virtual total delay total + 7876 92005750 100000000 24001500 +IO count delay total + 0 0 +MEM count delay total + 0 0 + +Get delays seen in executing a given simple command +# ./getdelays -c ls / + +bin data1 data3 data5 dev home media opt root srv sys usr +boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var + + +CPU count real total virtual total delay total + 6 4000250 4000000 0 +IO count delay total + 0 0 +MEM count delay total + 0 0 + + + + + + Index: linux-2.6.17-rc1/Documentation/accounting/getdelays.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.17-rc1/Documentation/accounting/getdelays.c 2006-04-21 20:53:54.000000000 -0400 @@ -0,0 +1,376 @@ +/* getdelays.c + * + * Utility to get per-pid and per-tgid delay accounting statistics + * Also illustrates usage of the taskstats interface + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2005 + * Copyright (C) Balbir Singh, IBM Corp. 2006 + * + */ + +#include <stdio.h> +#include <stdlib.h> +#include <errno.h> +#include <unistd.h> +#include <poll.h> +#include <string.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/socket.h> +#include <sys/types.h> +#include <signal.h> + +#include <linux/genetlink.h> +#include <linux/taskstats.h> + +/* + * Generic macros for dealing with netlink sockets. Might be duplicated + * elsewhere. 
It is recommended that commercial grade applications use + * libnl or libnetlink and use the interfaces provided by the library + */ +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN)) +#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN) +#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN)) +#define NLA_PAYLOAD(len) (len - NLA_HDRLEN) + +#define err(code, fmt, arg...) do { printf(fmt, ##arg); exit(code); } while (0) +int done = 0; + +/* + * Create a raw netlink socket and bind + */ +static int create_nl_socket(int protocol, int groups) +{ + socklen_t addr_len; + int fd; + struct sockaddr_nl local; + + fd = socket(AF_NETLINK, SOCK_RAW, protocol); + if (fd < 0) + return -1; + + memset(&local, 0, sizeof(local)); + local.nl_family = AF_NETLINK; + local.nl_groups = groups; + + if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) + goto error; + + return fd; + error: + close(fd); + return -1; +} + +int sendto_fd(int s, const char *buf, int bufLen) +{ + struct sockaddr_nl nladdr; + int r; + + memset(&nladdr, 0, sizeof(nladdr)); + nladdr.nl_family = AF_NETLINK; + + while ((r = sendto(s, buf, bufLen, 0, (struct sockaddr *) &nladdr, + sizeof(nladdr))) < bufLen) { + if (r > 0) { + buf += r; + bufLen -= r; + } else if (errno != EAGAIN) + return -1; + } + return 0; +} + +/* + * Probe the controller in genetlink to find the family id + * for the TASKSTATS family + */ +int get_family_id(int sd) +{ + struct { + struct nlmsghdr n; + struct genlmsghdr g; + char buf[256]; + } family_req; + struct { + struct nlmsghdr n; + struct genlmsghdr g; + char buf[256]; + } ans; + + int id; + struct nlattr *na; + int rep_len; + + /* Get family name */ + family_req.n.nlmsg_type = GENL_ID_CTRL; + family_req.n.nlmsg_flags = NLM_F_REQUEST; + family_req.n.nlmsg_seq = 0; + family_req.n.nlmsg_pid = getpid(); + family_req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); + family_req.g.cmd = CTRL_CMD_GETFAMILY; + family_req.g.version = 0x1; + na = (struct nlattr *) 
GENLMSG_DATA(&family_req);
+	na->nla_type = CTRL_ATTR_FAMILY_NAME;
+	na->nla_len = strlen(TASKSTATS_GENL_NAME) + 1 + NLA_HDRLEN;
+	strcpy(NLA_DATA(na), TASKSTATS_GENL_NAME);
+	family_req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+	if (sendto_fd(sd, (char *) &family_req, family_req.n.nlmsg_len) < 0)
+		err(1, "error sending message via Netlink\n");
+
+	rep_len = recv(sd, &ans, sizeof(ans), 0);
+
+	if (rep_len < 0)
+		err(1, "error receiving reply message via Netlink\n");
+
+
+	/* Validate response message */
+	if (!NLMSG_OK((&ans.n), rep_len))
+		err(1, "invalid reply message received via Netlink\n");
+
+	if (ans.n.nlmsg_type == NLMSG_ERROR) {	/* error */
+		printf("error received NACK - leaving\n");
+		exit(1);
+	}
+
+
+	na = (struct nlattr *) GENLMSG_DATA(&ans);
+	na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+	if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
+		id = *(__u16 *) NLA_DATA(na);
+	}
+	return id;
+}
+
+void print_taskstats(struct taskstats *t)
+{
+	printf("\n\nCPU %15s%15s%15s%15s\n"
+	       " %15llu%15llu%15llu%15llu\n"
+	       "IO %15s%15s\n"
+	       " %15llu%15llu\n"
+	       "MEM %15s%15s\n"
+	       " %15llu%15llu\n\n",
+	       "count", "real total", "virtual total", "delay total",
+	       t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total,
+	       t->cpu_delay_total,
+	       "count", "delay total",
+	       t->blkio_count, t->blkio_delay_total,
+	       "count", "delay total", t->swapin_count, t->swapin_delay_total);
+}
+
+void sigchld(int sig)
+{
+	done = 1;
+}
+
+int main(int argc, char *argv[])
+{
+	int rc;
+	int sk_nl;
+	struct nlmsghdr *nlh;
+	struct genlmsghdr *genlhdr;
+	char *buf;
+	struct taskstats_cmd_param *param;
+	__u16 id;
+	struct nlattr *na;
+
+	/* For receiving */
+	struct sockaddr_nl kern_nla, from_nla;
+	socklen_t from_nla_len;
+	int recv_len;
+	struct taskstats_reply *reply;
+
+	struct {
+		struct nlmsghdr n;
+		struct genlmsghdr g;
+		char buf[256];
+	} req;
+
+	struct {
+		struct nlmsghdr n;
+		struct genlmsghdr g;
+		char buf[256];
+	} ans;
+
+	int nl_sd = -1;
+	int rep_len;
+	int len = 0;
+	int aggr_len, len2;
+	struct sockaddr_nl nladdr;
+	pid_t tid = 0;
+	pid_t rtid = 0;
+	int cmd_type = TASKSTATS_CMD_ATTR_TGID;
+	int c, status;
+	int forking = 0;
+	struct sigaction act = {
+		.sa_handler = SIG_IGN,
+		.sa_mask = SA_NOMASK,
+	};
+	struct sigaction tact ;
+
+	if (argc < 3) {
+		printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+		exit(-1);
+	}
+
+	tact.sa_handler = sigchld;
+	sigemptyset(&tact.sa_mask);
+	if (sigaction(SIGCHLD, &tact, NULL) < 0)
+		err(1, "sigaction failed for SIGCHLD\n");
+
+	while (1) {
+
+		c = getopt(argc, argv, "t:p:c:");
+		if (c < 0)
+			break;
+
+		switch (c) {
+		case 't':
+			tid = atoi(optarg);
+			if (!tid)
+				err(1, "Invalid tgid\n");
+			cmd_type = TASKSTATS_CMD_ATTR_TGID;
+			break;
+		case 'p':
+			tid = atoi(optarg);
+			if (!tid)
+				err(1, "Invalid pid\n");
+			cmd_type = TASKSTATS_CMD_ATTR_PID;
+			break;
+		case 'c':
+			opterr = 0;
+			tid = fork();
+			if (tid < 0)
+				err(1, "fork failed\n");
+
+			if (tid == 0) {	/* child process */
+				if (execvp(argv[optind - 1], &argv[optind - 1]) < 0) {
+					exit(-1);
+				}
+			}
+			forking = 1;
+			break;
+		default:
+			printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+			exit(-1);
+			break;
+		}
+		if (c == 'c')
+			break;
+	}
+
+	/* Construct Netlink request message */
+
+	/* Send Netlink request message & get reply */
+
+	if ((nl_sd =
+	     create_nl_socket(NETLINK_GENERIC, TASKSTATS_LISTEN_GROUP)) < 0)
+		err(1, "error creating Netlink socket\n");
+
+
+	id = get_family_id(nl_sd);
+
+	/* Send command needed */
+	req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+	req.n.nlmsg_type = id;
+	req.n.nlmsg_flags = NLM_F_REQUEST;
+	req.n.nlmsg_seq = 0;
+	req.n.nlmsg_pid = tid;
+	req.g.cmd = TASKSTATS_CMD_GET;
+	na = (struct nlattr *) GENLMSG_DATA(&req);
+	na->nla_type = cmd_type;
+	na->nla_len = sizeof(unsigned int) + NLA_HDRLEN;
+	*(__u32 *) NLA_DATA(na) = tid;
+	req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+
+	if (!forking && sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0)
+		err(1, "error sending message via Netlink\n");
+
+	act.sa_handler = SIG_IGN;
+	sigemptyset(&act.sa_mask);
+	if (sigaction(SIGINT, &act, NULL) < 0)
+		err(1, "sigaction failed for SIGINT\n");
+
+	do {
+		int i;
+		struct pollfd pfd;
+		int pollres;
+
+		pfd.events = 0xffff & ~POLLOUT;
+		pfd.fd = nl_sd;
+		pollres = poll(&pfd, 1, 5000);
+		if (pollres < 0 || done) {
+			break;
+		}
+
+		rep_len = recv(nl_sd, &ans, sizeof(ans), 0);
+		nladdr.nl_family = AF_NETLINK;
+		nladdr.nl_groups = TASKSTATS_LISTEN_GROUP;
+
+		/* Check the recv() result before looking at the reply */
+		if (rep_len < 0) {
+			err(1, "error receiving reply message via Netlink\n");
+			break;
+		}
+
+		if (ans.n.nlmsg_type == NLMSG_ERROR) {	/* error */
+			printf("error received NACK - leaving\n");
+			exit(1);
+		}
+
+		/* Validate response message */
+		if (!NLMSG_OK((&ans.n), rep_len))
+			err(1, "invalid reply message received via Netlink\n");
+
+		rep_len = GENLMSG_PAYLOAD(&ans.n);
+
+		na = (struct nlattr *) GENLMSG_DATA(&ans);
+		len = 0;
+		i = 0;
+		while (len < rep_len) {
+			len += NLA_ALIGN(na->nla_len);
+			switch (na->nla_type) {
+			case TASKSTATS_TYPE_AGGR_PID:
+				/* Fall through */
+			case TASKSTATS_TYPE_AGGR_TGID:
+				aggr_len = NLA_PAYLOAD(na->nla_len);
+				len2 = 0;
+				/* For nested attributes, na follows */
+				na = (struct nlattr *) NLA_DATA(na);
+				done = 0;
+				while (len2 < aggr_len) {
+					switch (na->nla_type) {
+					case TASKSTATS_TYPE_PID:
+						rtid = *(int *) NLA_DATA(na);
+						break;
+					case TASKSTATS_TYPE_TGID:
+						rtid = *(int *) NLA_DATA(na);
+						break;
+					case TASKSTATS_TYPE_STATS:
+						if (rtid == tid) {
+							print_taskstats((struct taskstats *)
+									NLA_DATA(na));
+							done = 1;
+						}
+						break;
+					}
+					/* Step over this attribute only */
+					len2 += NLA_ALIGN(na->nla_len);
+					na = (struct nlattr *) ((char *) na +
+								NLA_ALIGN(na->nla_len));
+					if (done)
+						break;
+				}
+			}
+			na = (struct nlattr *) (GENLMSG_DATA(&ans) + len);
+			if (done)
+				break;
+		}
+		if (done)
+			break;
+	}
+	while (1);
+
+	close(nl_sd);
+	return 0;
+}
Index: linux-2.6.17-rc1/Documentation/accounting/taskstats.txt
===================================================================
--- linux-2.6.17-rc1.orig/Documentation/accounting/taskstats.txt	2006-04-21 20:29:22.000000000 -0400
+++ linux-2.6.17-rc1/Documentation/accounting/taskstats.txt	2006-04-21 20:50:22.000000000 -0400
@@ -39,6 +39,8 @@ belongs (the task does not need to be th
 per-tgid stats to be sent for each exiting task is explained in the Advanced
 Usage section below.
 
+getdelays.c is a simple utility demonstrating usage of the taskstats interface
+for reporting delay accounting statistics.
 
 Interface
 ---------
|
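The reply-parsing loop in getdelays.c above has to walk attributes at two levels: the pid/tgid and stats attributes sit inside an aggregate attribute, and each step must advance by that attribute's own aligned length. The following is a minimal, self-contained sketch of such a walk. It is not part of the patch: struct attr_hdr, attr_align() and attr_find() are hypothetical names that only mirror the length/type header layout of a netlink attribute, so the example builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a netlink attribute header: a 16-bit total
 * length (header plus payload) followed by a 16-bit type.
 */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

#define ATTR_ALIGNTO	4
#define ATTR_HDRLEN	sizeof(struct attr_hdr)	/* 4 bytes, already aligned */

/* Round n up to the attribute alignment boundary. */
static size_t attr_align(size_t n)
{
	return (n + ATTR_ALIGNTO - 1) & ~(size_t)(ATTR_ALIGNTO - 1);
}

/*
 * Find the first attribute of the given type in buf[0..buflen).
 * Returns a pointer to its payload and stores the payload length in
 * *payload_len. Returns NULL if the type is absent or the buffer is
 * truncated or corrupt.
 */
static const void *attr_find(const void *buf, size_t buflen,
			     uint16_t type, size_t *payload_len)
{
	const unsigned char *p = buf;
	size_t off = 0;

	while (off + ATTR_HDRLEN <= buflen) {
		struct attr_hdr h;

		memcpy(&h, p + off, sizeof(h));	/* avoid unaligned reads */
		if (h.len < ATTR_HDRLEN || h.len > buflen - off)
			return NULL;
		if (h.type == type) {
			*payload_len = h.len - ATTR_HDRLEN;
			return p + off + ATTR_HDRLEN;
		}
		/* Advance by this attribute's own aligned length only. */
		off += attr_align(h.len);
	}
	return NULL;
}
```

A nested attribute is handled by calling attr_find() a second time on the payload returned for the aggregate, which keeps the offset arithmetic for the two levels separate.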
From: Shailabh N. <na...@wa...> - 2006-04-22 02:42:26
|
Changelog
Fixed comments by akpm
- use __u64 for delayacct_blkio_ticks() return type
- redundant check for tsk->delays in __delayacct_blkio_ticks()

delayacct-procfs.patch

Export I/O delays seen by a task through /proc/<tgid>/stat
for use in top etc.

Note that delays for I/O done for swapping in pages (swapin I/O)
are clubbed together with all other I/O here (this is not the
case in the netlink interface, where the swapin I/O is kept distinct).

Signed-off-by: Shailabh Nagar <na...@wa...>

 fs/proc/array.c           |    6 ++++--
 include/linux/delayacct.h |   10 ++++++++++
 kernel/delayacct.c        |   12 ++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)

Index: linux-2.6.17-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.17-rc1.orig/fs/proc/array.c	2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/fs/proc/array.c	2006-04-21 20:55:09.000000000 -0400
@@ -75,6 +75,7 @@
 #include <linux/times.h>
 #include <linux/cpuset.h>
 #include <linux/rcupdate.h>
+#include <linux/delayacct.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -412,7 +413,7 @@ static int do_task_stat(struct task_stru
 
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
 		task->pid,
 		tcomm,
 		state,
@@ -456,7 +457,8 @@ static int do_task_stat(struct task_stru
 		task->exit_signal,
 		task_cpu(task),
 		task->rt_priority,
-		task->policy);
+		task->policy,
+		delayacct_blkio_ticks(task));
 	if(mm)
 		mmput(mm);
 	return res;
Index: linux-2.6.17-rc1/include/linux/delayacct.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/delayacct.h	2006-04-21 20:42:41.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/delayacct.h	2006-04-21 20:55:58.000000000 -0400
@@ -37,6 +37,7 @@ extern void __delayacct_tsk_exit(struct
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
+extern __u64 __delayacct_blkio_ticks(struct task_struct *);
 
 static inline void delayacct_set_flag(int flag)
 {
@@ -83,6 +84,13 @@ static inline int delayacct_add_tsk(stru
 	return __delayacct_add_tsk(d, tsk);
 }
 
+static inline __u64 delayacct_blkio_ticks(struct task_struct *tsk)
+{
+	if (tsk->delays)
+		return __delayacct_blkio_ticks(tsk);
+	return 0;
+}
+
 #else
 static inline void delayacct_set_flag(int flag)
 {}
@@ -100,6 +108,8 @@ static inline void delayacct_blkio_end(v
 {}
 static inline int delayacct_add_tsk(struct taskstats *d,
 					struct task_struct *tsk)
 { return 0; }
+static inline __u64 delayacct_blkio_ticks(struct task_struct *tsk)
+{ return 0; }
 #endif /* CONFIG_TASK_DELAY_ACCT */
 #endif
Index: linux-2.6.17-rc1/kernel/delayacct.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/delayacct.c	2006-04-21 20:40:03.000000000 -0400
+++ linux-2.6.17-rc1/kernel/delayacct.c	2006-04-21 20:55:09.000000000 -0400
@@ -146,3 +146,15 @@ int __delayacct_add_tsk(struct taskstats
 
 	return 0;
 }
+
+__u64 __delayacct_blkio_ticks(struct task_struct *tsk)
+{
+	__u64 ret;
+
+	spin_lock(&tsk->delays->lock);
+	ret = nsec_to_clock_t(tsk->delays->blkio_delay +
+				tsk->delays->swapin_delay);
+	spin_unlock(&tsk->delays->lock);
+	return ret;
+}
+
|
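For tools such as top that want to consume the new value, here is a rough userspace sketch. The patch above appends one %llu conversion to the format string in do_task_stat(); counting the conversions in that string puts the new value at field 42 of the stat line. That position is an inference from the format string shown here, and stat_field_u64() is a hypothetical helper, not an existing API. The helper searches for the last ')' because the command name in field 2 may itself contain spaces or parentheses.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Read field number n (1-based, n >= 3) of a /proc/<pid>/stat line as an
 * unsigned 64-bit value. Returns 0 on success, or -1 if the line is
 * malformed, has fewer than n fields, or field n is not a plain
 * non-negative number.
 */
static int stat_field_u64(const char *line, int n, unsigned long long *out)
{
	const char *p = strrchr(line, ')');	/* end of field 2 (comm) */
	int field;

	if (p == NULL || n < 3)
		return -1;
	p++;

	for (field = 3; ; field++) {
		char *end;

		while (*p == ' ')
			p++;
		if (*p == '\0' || *p == '\n')
			return -1;		/* ran out of fields */
		if (field == n) {
			if (*p < '0' || *p > '9')
				return -1;	/* e.g. the state letter */
			*out = strtoull(p, &end, 10);
			return end == p ? -1 : 0;
		}
		while (*p != '\0' && *p != ' ' && *p != '\n')
			p++;			/* skip this field */
	}
}
```

A caller would read /proc/<pid>/stat into a buffer and call stat_field_u64(buf, 42, &ticks). Since the kernel side converts with nsec_to_clock_t(), the result is in clock ticks; dividing by sysconf(_SC_CLK_TCK) gives seconds.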
From: Andi K. <ak...@su...> - 2006-04-22 07:46:33
|
On Saturday 22 April 2006 04:42, Shailabh Nagar wrote:
> Changelog
> Fixed comments by akpm
> - use __u64 for delayacct_blkio_ticks() return type
> - redundant check for tsk->delays in __delayacct_blkio_ticks()

I think these basic statistics in /proc are quite useful. Hopefully top etc. would learn quickly about them too so that normal people can actually make use of it.

-Andi
|
From: Shailabh N. <na...@wa...> - 2006-04-22 02:51:50
|
Here's a repost of my overview of the other stakeholders.

Following Andrew's suggestion, here's my quick overview
of the various other accounting packages that have been
proposed on lse-tech with a focus on whether they can
utilize the netlink-based taskstats interface being proposed
by the delay accounting patches.

Please note that unification of statistics *collection* is not
being discussed since that kind of merger can be done as these
patches get accepted, if at all, into the kernel. To try and unify right
away would hold every patch (esp. delay accounting!)
hostage to the problems in every other patch unnecessarily. As
long as the interface can be unified, the merger of the collection bits
can always happen without affecting user space.

Stakeholders of each of these patches, on cc, are requested to
please correct any misunderstandings of what their patches do
so we can make forward progress.

--Shailabh


Summary

The following can use the taskstats netlink-based
interface by extending the returned data structure:

- Comprehensive System Accounting
- per-process I/O stats
- Microstate accounting
- per-cpu time stats

The following patches' interface needs are independent
of taskstats or subsumed by one of the above:

- Enhanced Linux System Accounting
- pnotify
- scalable statistics counters


Details
(please correct if these are misunderstood)

1. Comprehensive System Accounting (Jay Lan)
--------------------------------------------

- Collect various per-task statistics and write an accounting record
  containing these stats at task exit. Interface similar to BSD process
  accounting but the accounting record structure is quite different.

- CSA could utilize some stats collected/exported by delay accounting:
    blkio wait time
    cpu run time for task
- CSA only needs data to be available at task exit, not during the
  task's lifetime. Moreover, at task exit, it needs the accounting record
  to be written to a file.
- CSA could utilize delay accounting's taskstats netlink interface to
  gather task data at exit through a userspace utility that then writes
  it out to its expected file.

To do so, CSA would need the taskstats struct to be extended with
whatever additional stats it needs. The additional stats could be
selectively exported only on task exit to avoid imposing a space burden
on users of delay accounting who query a process's statistics during its
lifetime.

Collection of the additional stats needed by CSA may be tied to pnotify
and job patches which are still being reviewed/considered for
acceptance. As such, unification in the collection of stats can be
deferred until the status of the pnotify/job/CSA patches becomes clearer.

2. per-process I/O statistics (Levent Serinol)
----------------------------------------------

- Exports task->{rchar,wchar} through /proc/tgid/iostat
  (earlier version proposed export through /proc/tgid/stats)

- No new stats collection. Just export of existing task fields.
- Problem with accepting the patch stems from the accuracy of the
  statistics in these fields. The fields are updated only in three cases
  today (sys_read/write, sys_readv/writev, do_sendfile), so they aren't
  accurate (async I/O and memory-mapped I/O are not counted, at the very
  least).

CSA patches also export these fields through their accounting record
but don't appear to be doing anything to improve accuracy of collection
(or maybe it doesn't matter to them).
BSD accounting, which ought to be using the sum of these fields for its
ac_io field, doesn't (it hardcodes the output to zero).

When the fate of task->rchar/wchar is decided, based on CSA's needs,
those fields can be easily added to taskstats.

3. per-cpu time statistics (Erich Focht)
----------------------------------------

- Collects time spent by a task on each cpu of a system
  and exports it through a new interface, /proc/tgid/cpu

- Statistic is needed for performance analysis/debugging (like
  schedstats) and not for production systems.
- Unsure why push for acceptance was abandoned. Possibly due to one or
  more of:
    space overhead of allocating NR_CPUS variables in task_struct
    time overhead of collecting the data?

- Can use taskstats interface to export the data by adding needed fields
  to struct taskstats and bumping up the version.

4. Microstate accounting (Peter Chubb)
--------------------------------------

- Measure time spent by a thread in various interesting states, while
  accounting for interrupts, and export through /proc/tid/msa and
  through a syscall interface

- Interesting states have some overlap with delay accounting
- Exporting of per-task stats can be done through taskstats netlink
  interface

5. Enhanced Linux System Accounting (Guillaume Thouvenin)
---------------------------------------------------------

- Group tasks at a user level into "jobs" and aggregate, at user level,
  per-task statistics collected by CSA and/or BSD process accounting.

- ELSA does not introduce any new requirement for either collection or
  export of statistics from the kernel. It can use either BSD and/or CSA's
  method of using an accounting file.

- ELSA needs notification of forks and exits which it can already get
  through the process events connector in the kernel.

Hence ELSA's needs are either met by the kernel today or are a strict
subset of CSA (since BSD accounting is already there).

6. pnotify (Erik Jacobson)
--------------------------

- Infrastructure for kernel modules to be notified when an event (like
  fork/exit/exec) happens to a task. Also provides some per-task data for
  the modules' convenience

- pnotify isn't concerned with exporting data to userspace or collecting
  any stats. That's left to the kernel module that uses pnotify to get
  notifications. CSA is one expected user of pnotify.

7. Scalable statistics counters (Ravikiran Thirumalai, Dipankar Sarma)
----------------------------------------------------------------------

- Infrastructure for setting up per-cpu counters (not per-task
  necessarily)
- No specific stats collection proposed as part of patch
- May have need for interface for fast export to userspace but
  requirements not clear
- Not per-task and unlikely to have unification prospects at interface
  level
|
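Several items above suggest exporting more data by "adding needed fields to struct taskstats and bumping up the version". One way a userspace reader could cope with such growth is sketched below. This is only an illustration under stated assumptions: struct demo_stats is a hypothetical record, not the real struct taskstats, and the field names, version numbers and DEMO_V1_SIZE are invented for the example. The idea shown is that new fields are only ever appended, so a reader can accept a shorter, older record and zero-fill what the sender did not provide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical versioned record; NOT the real struct taskstats. */
struct demo_stats {
	uint16_t version;		/* bumped whenever fields are appended */
	uint16_t pad[3];		/* keeps the 64-bit fields aligned */
	uint64_t cpu_delay_total;	/* present since version 1 */
	uint64_t blkio_delay_total;	/* present since version 1 */
	uint64_t rchar;			/* appended in version 2 */
	uint64_t wchar;			/* appended in version 2 */
};

/* Size of the record as a version 1 sender would emit it. */
#define DEMO_V1_SIZE	offsetof(struct demo_stats, rchar)

/*
 * Copy a received payload of len bytes into *out, zero-filling fields a
 * shorter (older) record lacks. Returns 0 on success, -1 if the payload
 * is too short for version 1 or shorter than its own version implies.
 */
static int demo_stats_load(struct demo_stats *out,
			   const void *payload, size_t len)
{
	if (len < DEMO_V1_SIZE)
		return -1;
	memset(out, 0, sizeof(*out));
	memcpy(out, payload, len < sizeof(*out) ? len : sizeof(*out));
	if (out->version >= 2 && len < sizeof(*out))
		return -1;
	return 0;
}

/* The I/O character counts are only meaningful from version 2 on. */
static int demo_has_io_chars(const struct demo_stats *s)
{
	return s->version >= 2;
}
```

A reader built this way keeps working against an older sender, and an older reader can still use the leading fields of a newer record, provided existing fields are never reordered or removed.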
From: Shailabh N. <na...@wa...> - 2006-04-25 15:07:56
|
Here's a repost of my overview of the other stakeholders. For some reason, lkml keeps rejecting this and its earlier post wasn't archived either. Retrying.
|
From: Shailabh N. <na...@wa...> - 2006-04-25 15:14:13
|
Shailabh Nagar wrote:
> Here's a repost of my overview of the other stakeholders.
> For some reason, lkml keeps rejecting this and its earlier post
> wasn't archived either. Retrying.
>
> Following Andrew's suggestion, here's my quick overview
> of the various other accounting packages that have been
> proposed on lse-tech with a focus on whether they can
> utilize the netlink-based taskstats interface being proposed
> by the delay accounting patches.
>
> Please note that unification of statistics *collection* is not
> being discussed since that kind of merger can be done as these
> patches get accepted, if at all, into the kernel. To try and unify right
> away would hold every patch (esp. delay accounting !)
> hostage to the problems in every other patch unnecessarily. As
> long as the interface can be unified, the merger of the collection bits
> can always happen without affecting user space.
>
> Stakeholders of each of these patches, on cc, are requested to
> please correct any misunderstandings of what their patches do
> so we can make forward progress.
>
>
> --Shailabh
>
>
> Summary
>
> The following can use the taskstats netlink-based
> interface by extending the returned data structure
>
> - Comprehensive System Accounting
> - per-process I/O stats
> - Microstate accounting
> - per cpu time stats
>
>
> The following patches' interface needs are independent
> of taskstats or subsumed by one of above:
> - Enhanced Linux System Accounting
> - pnotify
> - scalable statistics counters
>
>
>
> Details
> (please correct if these are misunderstood)
>
> 1. Comprehensive System Accounting (Jay Lan)
> --------------------------------------------
>
> - Collect various per-task statistics and write an accounting record
> containing
> these stats at task exit. Interface similar to BSD process accounting
> but the accounting record structure is quite different.
>
> - CSA could utilize some stats collected/exported by delay accounting
> blkio wait time
> cpu run time for task
> - CSA only needs data to be available at task exit, not during the
> task's lifetime. Moreover, at task exit, it needs the accounting record
> to be written to a file.
> - CSA could utilize delay accounting's taskstats netlink interface to
> gather task data at exit through
> a userspace utility that then writes it out to its expected file.
>
> To do so, CSA would need the taskstats struct to be extended with
> whatever additional stats it needs. The additional stats could be
> selectively exported only on task exit to avoid imposing a space burden
> on users of delay accounting who query a process's statistics during its
> lifetime.
>
> Collection of the additional stats needed by CSA may be tied to pnotify
> and job patches which are still being reviewed/considered for
> acceptance. As such, unification in the collection of stats can be
> deferred until status of pnotify/job/CSA patches becomes more clear.
>
>
>
> 2. per-process I/O statistics (Levent Serinol)
> ----------------------------------------------
>
> - Exports task->{rchar,wchar} through /proc/tgid/iostat
> (earlier version proposed export through /proc/tgid/stats)
>
> - No new stats collection. Just export of existing task fields
> - Problem with accepting the patch stems from the accuracy of the
> statistics
> in these fields. The fields are updated only in three cases today
> (sys_read/write, sys_readv/writev, do_sendfile)
> so they aren't accurate (async I/O, memory-mapped I/O is not counted
> at the very least).
>
> CSA patches also export these fields through their accounting record
> but don't appear to be doing anything to improve accuracy of collection
> (or maybe it doesn't matter to them).
> BSD accounting, which ought to be using the sum of these fields for its
> ac_io field, doesn't (it hardcodes the output to zero).
>
> When the fate of task->rchar/wchar is decided, based on CSA's needs,
> those fields can be easily added to taskstats.
>
> 3. per-cpu time statistics (Erich Focht)
> ----------------------------------------
>
> - Collects time spent by a task on each cpu of a system
> and exports it through new interface /proc/tgid/cpu
>
> - Statistic is needed for performance analysis/debugging (like
> schedstats) and not for production systems.
> - Unsure why push for acceptance was abandoned. Possibly due to one or
> more of:
> space overhead of allocating NR_CPUS variables in task_struct
> time overhead of collecting the data ?
>
> - Can use taskstats interface to export the data by adding needed fields
> to struct taskstats and bumping up the version.
>
>
>
> 4. Microstate accounting (Peter Chubb)
> --------------------------------------
>
> - Measure time spent by a thread in various interesting states, while
> accounting for interrupts, and export through /proc/tid/msa and
> through a syscall interface
>
> - Interesting states have some overlap with delay accounting
> - Exporting of per-task stats can be done through taskstats netlink
> interface
>
>
> 5. Enhanced Linux System Accounting (Guillaume Thouvenin)
> ---------------------------------------------------------
>
> - Group tasks at a user level into "jobs" and aggregate, at user level,
> per-task statistics collected by CSA and/or BSD process accounting.
>
> - ELSA does not introduce any new requirement for either collection or
> export of statistics from the kernel. It can use either BSD and/or CSA's
> method of using an accounting file.
>
> - ELSA needs notification of forks and exits which it can already get
> through the process events connector in the kernel.
>
> Hence ELSA's needs are either met by the kernel today or are a strict
> subset of CSA (since BSD accounting is already there).
>
>
> 6. pnotify (Erik Jacobson)
> --------------------------
>
> - Infrastructure for kernel modules to be notified when an event (like
> fork/exit/exec)
> happens to a task. Also provides some per-task data for the modules'
> convenience

(last part of the post got chopped off by mistake). Here it is:

- pnotify isn't concerned with exporting data to userspace or collecting
  any stats. That's left to the kernel module that uses pnotify to get
  notifications. CSA is one expected user of pnotify.

7. Scalable statistics counters (Ravikiran Thirumalai, Dipankar Sarma)
----------------------------------------------------------------------

- Infrastructure for setting up per-cpu counters (not per-task
  necessarily)
- No specific stats collection proposed as part of patch
- May have need for interface for fast export to userspace but
  requirements not clear
- Not per-task and unlikely to have unification prospects at interface
  level
|