From: Arjan v. de V. <ar...@in...> - 2006-02-27 09:18:55
|
On Mon, 2006-02-27 at 04:13 -0500, Shailabh Nagar wrote:
> Arjan van de Ven wrote:
>
> >>+static inline void delayacct_blkio(void)
> >>+{
> >>+    if (unlikely(current->delays && delayacct_on))
> >>+        __delayacct_blkio();
> >>+}
> >
> >why is this unlikely?
>
> delayacct_on is expected to be off most of the time,

that's not really enough I think to warrant a compiler hint

> hence the compound is
> unlikely too.

you then should move that as first in the test instead ;-)
|
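Editorial sketch of the reordering suggested above, written as userspace C rather than kernel code. The structs, `current_task`, and `do_delayacct_blkio()` are hypothetical stand-ins for the kernel objects in the patch, and `unlikely()` is redefined here via the GCC/Clang builtin. It only illustrates that testing the global flag first lets `&&` short-circuit before the per-task pointer is loaded.

```c
#include <stddef.h>

/* Branch hint in the style of the kernel's unlikely() (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-ins for the kernel objects named in the patch. */
struct task_delay_info { unsigned long blkio_count; };
struct task { struct task_delay_info *delays; };

static int delayacct_on;           /* global on/off switch, off by default */
static struct task *current_task;  /* stand-in for the kernel's `current` */

/* Slow path: only reached when accounting is on and the task has stats. */
static inline void do_delayacct_blkio(void)
{
    current_task->delays->blkio_count++;
}

/*
 * Cheap global flag first: while accounting is off, the right-hand side of
 * && is never evaluated, so current_task->delays is not even loaded.
 */
static inline void delayacct_blkio(void)
{
    if (unlikely(delayacct_on) && current_task->delays)
        do_delayacct_blkio();
}
```

Whether the hint itself pays off is a separate question from the ordering; the ordering alone already keeps the common "off" case to a single load and branch.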
From: Arjan v. de V. <ar...@in...> - 2006-02-27 09:24:55
|
On Mon, 2006-02-27 at 04:13 -0500, Shailabh Nagar wrote:
> Arjan van de Ven wrote:
>
> >>+static inline void delayacct_blkio(void)
> >>+{
> >>+    if (unlikely(current->delays && delayacct_on))
> >>+        __delayacct_blkio();
> >>+}
> >
> >why is this unlikely?
>
> delayacct_on is expected to be off most of the time, hence the compound is
> unlikely too.

ok that opens the question: why is this a runtime tunable? Is it really
worth all this complexity?
|
From: Andi K. <ak...@su...> - 2006-02-27 14:18:59
|
Shailabh Nagar <na...@wa...> writes: > delayacct-blkio.patch > > Record time spent by a task waiting for completion of > userspace initiated synchronous block I/O. This can help > determine the right I/O priority for the task. I think it's a good idea to have such a statistic by default. Can you add a counter that is summed up in task_struct and reports in /proc/*/stat so that it could be displayed by top? This way it would be useful even with "normal" user space. -Andi |
From: Shailabh N. <na...@wa...> - 2006-02-27 20:55:54
|
Bryan O'Sullivan wrote:
>On Mon, 2006-02-27 at 03:12 -0500, Shailabh Nagar wrote:
>
>>Add sysctl option for controlling schedstats collection
>>dynamically. Delay accounting leverages schedstats for
>>cpu delay statistics.
>
>Is there some reason you're using the sysctl interface, and not say
>sysfs instead?
>
>    <b

No. Is /proc/sys/kernel deprecated in favor of /sys/kernel now ?

--Shailabh
|
From: Bryan O'S. <bo...@se...> - 2006-02-27 22:26:29
|
On Mon, 2006-02-27 at 15:55 -0500, Shailabh Nagar wrote: > No. Is /proc/sys/kernel deprecated in favor of /sys/kernel now ? Yes. <b |
From: Shailabh N. <na...@wa...> - 2006-02-27 21:31:55
|
Andi Kleen wrote: >Shailabh Nagar <na...@wa...> writes: > > > >>delayacct-blkio.patch >> >>Record time spent by a task waiting for completion of >>userspace initiated synchronous block I/O. This can help >>determine the right I/O priority for the task. >> >> > >I think it's a good idea to have such a statistic by default. > >Can you add a counter that is summed up in task_struct and reports >in /proc/*/stat so that it could be displayed by top? > > Sure. That would make delayacct code simpler too since it could just read off from task_struct. >This way it would be useful even with "normal" user space. > > Yes. Need to resolve the multiple entry point semantic...more in separate mail. --Shailabh >-Andi > > |
From: Shailabh N. <na...@wa...> - 2006-02-27 22:09:49
|
Andi Kleen wrote:
>Shailabh Nagar <na...@wa...> writes:
>
>>delayacct-blkio.patch
>>
>>Record time spent by a task waiting for completion of
>>userspace initiated synchronous block I/O. This can help
>>determine the right I/O priority for the task.
>
>I think it's a good idea to have such a statistic by default.

Besides the paths we're counting and the ones Arjan listed below, are
there others you had in mind ?

>Can you add a counter that is summed up in task_struct and reports
>in /proc/*/stat so that it could be displayed by top?
>
>This way it would be useful even with "normal" user space.
>
>-Andi

Arjan van de Ven wrote:
>this misses O_SYNC, msync(), and general throttling.
>I get the feeling this is being measured at the wrong level
>currently.... since the number of entry points that needs measuring at
>the current level is hardly finite...

Our intent was to get an idea of user-initiated sync block I/O because
there is some expectation from user space that a higher I/O priority will
result in lower delays for such I/O. General throttling writes wouldn't
fit in this category though msync and O_SYNC would.

Are there a lot of other paths you see ? I'll root around more but if you
could just list a few more, it'll help.

As for the level at which the counting is being done, the reason for
choosing this one was to avoid counting time spent waiting for async I/O
completion and also to keep the accounting simple (diff of two timestamps
without modifying block I/O structures).

To our usage model, async I/O is also not as useful to be counted since
userspace has already taken steps to tolerate the latency and can do
useful work (and not be "delayed"). However, I would have liked to capture
the time spent within sys_io_getevents when a timeout is specified, since
there the user is again going to be delayed, but the mingling of block and
network I/O events makes that more complex.

Going further down the I/O processing stack than the current level would
probably require more elaborate mechanisms to keep track of the submitter ?
Or is there a better merging point for sync I/O that I'm missing ?

Your comments would be welcome to improve this code...

--Shailabh

P.S. Sorry if merging the two responses violates any netiquette :-)
|
From: Andi K. <ak...@su...> - 2006-02-27 22:38:22
|
On Monday 27 February 2006 23:09, Shailabh Nagar wrote: > Besides the paths we're counting and the one's Arjan listed below, are > there others > you had in mind ? I would like to see all reads including metadata reads in file systems. -Andi |
From: Arjan v. de V. <ar...@in...> - 2006-02-28 08:11:07
|
> Our intent was to get an idea of user-initiated sync block I/O because
> there is some expectation from user space that a higher I/O priority will
> result in lower delays for such I/O. General throttling writes wouldn't
> fit in
> this category though msync and O_SYNC would.
>
> Are there a lot of other paths you see ? I'll root around more but if you
> could just list a few more, it'll help.

unmount
-o sync mounts
last-close kind of syncs on block devices (yes databases do this and care)
fdatasync()/fsync()
open() (does a read seek, and may even do cluster locking stuff)
flock()
|
From: Shailabh N. <na...@wa...> - 2006-02-27 22:17:44
|
Arjan van de Ven wrote:
>On Mon, 2006-02-27 at 03:22 -0500, Shailabh Nagar wrote:
>
>>delayacct-swapin.patch
>>
>>Record time spent by a task waiting for its pages to be swapped in.
>>This statistic can help in adjusting the rss limits of
>>tasks (process), especially relative to each other, when the system is
>>under memory pressure.
>
>ok this poses a question: how do you deal with nested timings?

I don't :-(
An earlier version used local variables instead of one within the
task_delay_info struct but we moved to using a var within to save on
stack space in critical paths.

>Say an
>O_SYNC write which internally causes a pagefault?

And here we hit the problem of nesting being needed....so....

>delayacct_timestamp_start() at minimum has to get event-type specific,
>or even implement a stack of some sorts.

Would keeping the timespec vars on the stacks of the functions being
accounted be too expensive vs. keeping bunches of vars within
task_delay_info to deal with the nesting ?

Unfortunately, the need for accuracy also means the variables needed
are timespecs and not something smaller.

--Shailabh
|
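Editorial sketch of the nesting problem discussed above, in userspace C with a fake clock so the numbers are deterministic. Everything here (`delay_stats`, `shared_begin()`, `shared_end()`, the 20/30/50 ns steps) is hypothetical and not taken from the patch. It compares one shared per-task start slot against start values kept in local variables, for a block-I/O wait that contains a nested swap-in wait.

```c
#include <stdint.h>

/* Fake nanosecond clock; real code would read a monotonic clock instead. */
static uint64_t fake_now_ns;

struct delay_stats {
    uint64_t blkio_delay;   /* ns spent waiting for block I/O */
    uint64_t swapin_delay;  /* ns spent waiting for swap-in */
    uint64_t start_slot;    /* single per-task start timestamp (fragile) */
};

/* Fragile variant: every event type records its start in the same slot. */
static inline void shared_begin(struct delay_stats *d)
{
    d->start_slot = fake_now_ns;
}

static inline uint64_t shared_end(const struct delay_stats *d)
{
    return fake_now_ns - d->start_slot;
}

/* Outer blkio wait of 100 ns with a 30 ns swap-in nested inside it. */
static inline void nested_with_shared_slot(struct delay_stats *d)
{
    shared_begin(d);                   /* outer blkio wait starts */
    fake_now_ns += 20;
    shared_begin(d);                   /* nested swap-in overwrites the slot */
    fake_now_ns += 30;
    d->swapin_delay += shared_end(d);  /* 30 ns, as expected */
    fake_now_ns += 50;
    d->blkio_delay += shared_end(d);   /* 80 ns: the first 20 ns are lost */
}

/* Same scenario with each start timestamp held in a local variable. */
static inline void nested_with_stack_slots(struct delay_stats *d)
{
    uint64_t blkio_start = fake_now_ns;
    fake_now_ns += 20;
    uint64_t swapin_start = fake_now_ns;
    fake_now_ns += 30;
    d->swapin_delay += fake_now_ns - swapin_start;  /* 30 ns */
    fake_now_ns += 50;
    d->blkio_delay += fake_now_ns - blkio_start;    /* full 100 ns */
}
```

Whether the outer interval should include the nested one (100 ns) or exclude it (70 ns) is a policy choice the thread leaves open; the sketch only shows that the shared slot produces neither.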
From: Nick P. <nic...@ya...> - 2006-02-28 00:25:21
|
Chandra Seetharaman wrote:
>On Mon, 2006-02-27 at 20:17 +1100, Nick Piggin wrote:
>
>>> #ifdef CONFIG_SCHEDSTATS
>>>+
>>>+int schedstats_sysctl = 0; /* schedstats turned off by default */
>>
>>Should be read mostly.
>>
>>>+static DEFINE_PER_CPU(int, schedstats) = 0;
>>>+
>>
>>When the above is in the read mostly section, you won't need this at all.
>>
>>You don't intend to switch the sysctl with great frequency, do you?
>
>No, it is not expected to switch often.
>
>We originally coded it as __read_mostly, but thought the variable
>bouncing between CPUs would be costly. Is it cheaper with
>__read_mostly ? or it doesn't matter ?

Well it will only "bounce" when the cacheline it is in is written to by
a different CPU. Considering this happens with your per-cpu implementation
_anyway_, they don't buy you anything much.

Putting it in __read_mostly means that you won't happen to share a cacheline
with a variable that is being written to frequently.

Nick
|
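Editorial sketch, as a rough userspace analogue of the cacheline point above. The kernel's `__read_mostly` groups such variables into a dedicated data section; this sketch instead uses a GCC/Clang alignment attribute to keep a rarely written flag off the cache line of a frequently written counter. The struct, its field names, and the 64-byte line size are all assumptions for illustration.

```c
#include <stddef.h>

/* Assumed cache line size; the real value depends on the CPU. */
#define CACHE_LINE_SKETCH 64

struct sched_globals_sketch {
    /* Written frequently, possibly from many CPUs. */
    unsigned long hot_counter;

    /*
     * Read on every fast path, written only when the sysctl is toggled.
     * Aligning it to its own cache line means writes to hot_counter do
     * not invalidate the line that holds the flag.
     */
    int schedstats_enabled __attribute__((aligned(CACHE_LINE_SKETCH)));
};

/* Fast-path check: a plain read of the rarely written flag. */
static inline int schedstats_on(const struct sched_globals_sketch *g)
{
    return g->schedstats_enabled;
}
```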
From: chandra s. <sek...@us...> - 2006-02-28 01:40:53
|
On Tue, 2006-02-28 at 11:25 +1100, Nick Piggin wrote:
> Chandra Seetharaman wrote:
>
> >On Mon, 2006-02-27 at 20:17 +1100, Nick Piggin wrote:
> >
> >>> #ifdef CONFIG_SCHEDSTATS
> >>>+
> >>>+int schedstats_sysctl = 0; /* schedstats turned off by default */
> >>
> >>Should be read mostly.
> >>
> >>>+static DEFINE_PER_CPU(int, schedstats) = 0;
> >>>+
> >>
> >>When the above is in the read mostly section, you won't need this at all.
> >>
> >>You don't intend to switch the sysctl with great frequency, do you?
> >
> >No, it is not expected to switch often.
> >
> >We originally coded it as __read_mostly, but thought the variable
> >bouncing between CPUs would be costly. Is it cheaper with
> >__read_mostly ? or it doesn't matter ?
>
> Well it will only "bounce" when the cacheline it is in is written to by
> a different CPU. Considering this happens with your per-cpu implementation
> _anyway_, they don't buy you anything much.
>
> Putting it in __read_mostly means that you won't happen to share a cacheline
> with a variable that is being written to frequently.

Thanks for the clarification Nick.

> Nick
|
From: Shailabh N. <na...@wa...> - 2005-12-07 22:13:22
|
Add kernel utility functions for - nanosecond resolution timestamps, adjusted for lost ticks - interval (diff) between two such timestamps, in nanoseconds, adjusting for overflow The timestamp part of this patch is identical to the one proposed by Matt Helsley (as part of adding timestamps to process event connectors) http://www.uwsg.indiana.edu/hypermail/linux/kernel/0512.0/1373.html Signed-off-by: Shailabh Nagar <na...@wa...> include/linux/time.h | 16 ++++++++++++++++ kernel/time.c | 22 ++++++++++++++++++++++ 2 files changed, 38 insertions(+) Index: linux-2.6.15-rc5/include/linux/time.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/time.h +++ linux-2.6.15-rc5/include/linux/time.h @@ -95,6 +95,7 @@ struct itimerval; extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue); extern int do_getitimer(int which, struct itimerval *value); extern void getnstimeofday (struct timespec *tv); +extern void getnstimestamp(struct timespec *ts); extern struct timespec timespec_trunc(struct timespec t, unsigned gran); @@ -113,6 +114,21 @@ set_normalized_timespec (struct timespec ts->tv_nsec = nsec; } +/* + * timespec_nsdiff - Return difference of two timestamps in nanoseconds + * In the rare case of @end being earlier than @start, return zero + */ +static inline unsigned long long +timespec_nsdiff(struct timespec *start, struct timespec *end) +{ + long long ret; + + ret = end->tv_sec*(1000000000) + end->tv_nsec; + ret -= start->tv_sec*(1000000000) + start->tv_nsec; + if (ret < 0) + return 0; + return ret; +} #endif /* __KERNEL__ */ #define NFDBITS __NFDBITS Index: linux-2.6.15-rc5/kernel/time.c =================================================================== --- linux-2.6.15-rc5.orig/kernel/time.c +++ linux-2.6.15-rc5/kernel/time.c @@ -561,6 +561,28 @@ void getnstimeofday(struct timespec *tv) EXPORT_SYMBOL_GPL(getnstimeofday); #endif +void getnstimestamp(struct timespec *ts) +{ + 
unsigned int seq; + struct timespec wall2mono; + + /* synchronize with settimeofday() changes */ + do { + seq = read_seqbegin(&xtime_lock); + getnstimeofday(ts); + wall2mono = wall_to_monotonic; + } while(unlikely(read_seqretry(&xtime_lock, seq))); + + /* adjust to monotonicaly-increasing values */ + ts->tv_sec += wall2mono.tv_sec; + ts->tv_nsec += wall2mono.tv_nsec; + while (unlikely(ts->tv_nsec >= NSEC_PER_SEC)) { + ts->tv_nsec -= NSEC_PER_SEC; + ts->tv_sec++; + } +} +EXPORT_SYMBOL_GPL(getnstimestamp); + #if (BITS_PER_LONG < 64) u64 get_jiffies_64(void) { |
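Editorial sketch related to timespec_nsdiff() in the patch above. As far as can be told from the posted code, `end->tv_sec*(1000000000)` multiplies a `long` by an `int` constant, so on a build where `long` is 32 bits the product would be computed in 32 bits before being widened to `long long`. The userspace version below widens first. `struct ts_sketch`, `ts_to_ns()`, and `ts_nsdiff()` are hypothetical names, not part of the patch.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000LL

/* Local stand-in for struct timespec, so no platform headers are needed. */
struct ts_sketch {
    long tv_sec;
    long tv_nsec;
};

/* Convert to nanoseconds, widening to 64 bits before the multiply. */
static inline int64_t ts_to_ns(const struct ts_sketch *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC_SKETCH + ts->tv_nsec;
}

/* Interval in ns from start to end; 0 if end is earlier than start. */
static inline uint64_t ts_nsdiff(const struct ts_sketch *start,
                                 const struct ts_sketch *end)
{
    int64_t diff = ts_to_ns(end) - ts_to_ns(start);

    return diff < 0 ? 0 : (uint64_t)diff;
}
```

The clamp to zero keeps the behaviour described in the patch comment for the rare case of @end being earlier than @start.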
From: Christoph L. <cla...@en...> - 2005-12-12 18:50:50
|
On Wed, 7 Dec 2005, Shailabh Nagar wrote: > +void getnstimestamp(struct timespec *ts) There is already getnstimeofday in the kernel. |
From: Shailabh N. <na...@wa...> - 2005-12-12 19:32:07
|
Christoph Lameter wrote: > On Wed, 7 Dec 2005, Shailabh Nagar wrote: > > >>+void getnstimestamp(struct timespec *ts) > > > There is already getnstimeofday in the kernel. > > Yes, and that function is being used within the getnstimestamp() being proposed. However, John Stultz had advised that getnstimeofday could get affected by calls to settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic. John, could you elaborate ? Thanks, Shailabh |
From: Shailabh N. <na...@wa...> - 2005-12-12 20:00:57
|
john stultz wrote: > On Mon, 2005-12-12 at 19:31 +0000, Shailabh Nagar wrote: > >>Christoph Lameter wrote: >> >>>On Wed, 7 Dec 2005, Shailabh Nagar wrote: >>> >>> >>> >>>>+void getnstimestamp(struct timespec *ts) >>> >>> >>>There is already getnstimeofday in the kernel. >>> >> >>Yes, and that function is being used within the getnstimestamp() being proposed. >>However, John Stultz had advised that getnstimeofday could get affected by calls to >>settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic. >> >>John, could you elaborate ? > > > I think you pretty well have it covered. > > getnstimeofday + wall_to_monotonic should be higher-res and more > reliable (then TSC based sched_clock(), for example) for getting a > timestamp. > > There may be performance concerns as you have to access the clock > hardware in getnstimeofday(), but there really is no other way for > reliable finely grained monotonically increasing timestamps. > > thanks > -john Thanks, that clarifies. I guess the other underlying concern here would be whether these improvements (in resolution and reliability) should be going into getnstimeofday() itself (rather than creating a new func for the same) ? Or is it better to leave getnstimeofday as it is ? Thanks, Shailabh |
From: john s. <joh...@us...> - 2005-12-12 20:07:38
|
On Mon, 2005-12-12 at 20:00 +0000, Shailabh Nagar wrote: > john stultz wrote: > > On Mon, 2005-12-12 at 19:31 +0000, Shailabh Nagar wrote: > > > >>Christoph Lameter wrote: > >> > >>>On Wed, 7 Dec 2005, Shailabh Nagar wrote: > >>>>+void getnstimestamp(struct timespec *ts) > >>> > >>>There is already getnstimeofday in the kernel. > >> > >>Yes, and that function is being used within the getnstimestamp() being proposed. > >>However, John Stultz had advised that getnstimeofday could get affected by calls to > >>settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic. > >> > >>John, could you elaborate ? > > > > I think you pretty well have it covered. > > > > getnstimeofday + wall_to_monotonic should be higher-res and more > > reliable (then TSC based sched_clock(), for example) for getting a > > timestamp. > > > > There may be performance concerns as you have to access the clock > > hardware in getnstimeofday(), but there really is no other way for > > reliable finely grained monotonically increasing timestamps. > > > Thanks, that clarifies. I guess the other underlying concern here would be whether these > improvements (in resolution and reliability) should be going into getnstimeofday() > itself (rather than creating a new func for the same) ? Or is it better to leave > getnstimeofday as it is ? No, getnstimeofday() is very much needed to get a nanosecond grained wall-time clock, so a new function is needed for the monotonic clock. In my timeofday re-work I have used the name "get_monotonic_clock()" and "get_monotonic_clock_ts()" for basically the same functionality (providing a ktime and a timespec respectively). You might consider naming it as such, but resolving these naming collisions shouldn't be too difficult either way. thanks -john |
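Editorial sketch of the arithmetic behind the monotonic timestamp discussed above: a wall-clock reading plus a wall-to-monotonic offset, renormalized so the nanosecond field stays below one second. This is userspace C with hypothetical names (`struct mono_ts`, `wall_to_mono()`); it does not read any clock and leaves out the seqlock retry loop that the proposed getnstimestamp() uses to get a consistent pair of values.

```c
#include <stdint.h>

#define MONO_NSEC_PER_SEC 1000000000L

/* Local stand-in for struct timespec. */
struct mono_ts {
    int64_t tv_sec;
    long tv_nsec;
};

/*
 * Add the offset to the wall-clock reading and renormalize. Both inputs are
 * assumed to be normalized already (0 <= tv_nsec < 1 s), so the summed
 * nanosecond field can exceed one second at most once.
 */
static inline struct mono_ts wall_to_mono(struct mono_ts wall,
                                          struct mono_ts offset)
{
    struct mono_ts mono;

    mono.tv_sec = wall.tv_sec + offset.tv_sec;
    mono.tv_nsec = wall.tv_nsec + offset.tv_nsec;
    while (mono.tv_nsec >= MONO_NSEC_PER_SEC) {
        mono.tv_nsec -= MONO_NSEC_PER_SEC;
        mono.tv_sec++;
    }
    return mono;
}
```

The idea, as described in the thread, is that a settimeofday() call moves the wall clock and the offset in opposite directions, so the sum is not affected by the jump.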
From: George A. <ge...@mv...> - 2005-12-13 00:55:11
|
john stultz wrote: > On Mon, 2005-12-12 at 20:00 +0000, Shailabh Nagar wrote: > >>john stultz wrote: >> >>>On Mon, 2005-12-12 at 19:31 +0000, Shailabh Nagar wrote: >>> >>> >>>>Christoph Lameter wrote: >>>> >>>> >>>>>On Wed, 7 Dec 2005, Shailabh Nagar wrote: >>>>> >>>>>>+void getnstimestamp(struct timespec *ts) >>>>> >>>>>There is already getnstimeofday in the kernel. >>>> >>>>Yes, and that function is being used within the getnstimestamp() being proposed. >>>>However, John Stultz had advised that getnstimeofday could get affected by calls to >>>>settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic. >>>> >>>>John, could you elaborate ? >>> >>>I think you pretty well have it covered. >>> >>>getnstimeofday + wall_to_monotonic should be higher-res and more >>>reliable (then TSC based sched_clock(), for example) for getting a >>>timestamp. >>> >>>There may be performance concerns as you have to access the clock >>>hardware in getnstimeofday(), but there really is no other way for >>>reliable finely grained monotonically increasing timestamps. >>> > > >>Thanks, that clarifies. I guess the other underlying concern here would be whether these >>improvements (in resolution and reliability) should be going into getnstimeofday() >>itself (rather than creating a new func for the same) ? Or is it better to leave >>getnstimeofday as it is ? > > > No, getnstimeofday() is very much needed to get a nanosecond grained > wall-time clock, so a new function is needed for the monotonic clock. > > In my timeofday re-work I have used the name "get_monotonic_clock()" and > "get_monotonic_clock_ts()" for basically the same functionality > (providing a ktime and a timespec respectively). You might consider > naming it as such, but resolving these naming collisions shouldn't be > too difficult either way. Indeed. Lets use a name with "monotonic" in it, please. And, possibly not "clock". How about get_nsmonotonic_time() or some such? 
-- George Anzinger ge...@mv... HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/ |
From: Nish A. <nis...@gm...> - 2005-12-13 03:48:44
|
On 12/12/05, George Anzinger <ge...@mv...> wrote:
> john stultz wrote:
> > On Mon, 2005-12-12 at 20:00 +0000, Shailabh Nagar wrote:
> >
> >>john stultz wrote:
> >>
> >>>On Mon, 2005-12-12 at 19:31 +0000, Shailabh Nagar wrote:
> >>>
> >>>>Christoph Lameter wrote:
> >>>>
> >>>>>On Wed, 7 Dec 2005, Shailabh Nagar wrote:
> >>>>>
> >>>>>>+void getnstimestamp(struct timespec *ts)
> >>>>>
> >>>>>There is already getnstimeofday in the kernel.
> >>>>
> >>>>Yes, and that function is being used within the getnstimestamp() being proposed.
> >>>>However, John Stultz had advised that getnstimeofday could get affected by calls to
> >>>>settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic.
> >>>>
> >>>>John, could you elaborate ?
> >>>
> >>>I think you pretty well have it covered.
> >>>
> >>>getnstimeofday + wall_to_monotonic should be higher-res and more
> >>>reliable (then TSC based sched_clock(), for example) for getting a
> >>>timestamp.
> >>>
> >>>There may be performance concerns as you have to access the clock
> >>>hardware in getnstimeofday(), but there really is no other way for
> >>>reliable finely grained monotonically increasing timestamps.
> >>>
> >
> >>Thanks, that clarifies. I guess the other underlying concern here would be whether these
> >>improvements (in resolution and reliability) should be going into getnstimeofday()
> >>itself (rather than creating a new func for the same) ? Or is it better to leave
> >>getnstimeofday as it is ?
> >
> > No, getnstimeofday() is very much needed to get a nanosecond grained
> > wall-time clock, so a new function is needed for the monotonic clock.
> >
> > In my timeofday re-work I have used the name "get_monotonic_clock()" and
> > "get_monotonic_clock_ts()" for basically the same functionality
> > (providing a ktime and a timespec respectively). You might consider
> > naming it as such, but resolving these naming collisions shouldn't be
> > too difficult either way.
>
> Indeed. Lets use a name with "monotonic" in it, please. And,
> possibly not "clock". How about get_nsmonotonic_time() or some such?

I agree -- personal preference, though, I prefer units at the end, i.e.
get_monotonic_time_ns() or get_monotonic_time_nsecs().

Thanks,
Nish
|
From: Shailabh N. <na...@wa...> - 2005-12-07 22:16:16
|
Changes since 11/14/05 - use nanosecond resolution, adjusted wall clock time for timestamps instead of sched_clock (akpm, andi, marcelo) - kernel param, sysctl option to control delay stats collection (parag) - better CONFIG parameter name (parag) 11/14/05: First post delayacct-init.patch Initialization code related to collection of per-task "delay" statistics which measure how long it had to wait for cpu, sync block io, swapping etc.. The collection of statistics and the interface are in other patches. This patch sets up the data structures and enables the statistics collection to be dynamically enabled (through a kernel boot paramater and through /proc/sys/kernel/delayacct). Signed-off-by: Shailabh Nagar <na...@wa...> Documentation/kernel-parameters.txt | 2 ++ include/linux/delayacct.h | 26 ++++++++++++++++++++++++++ include/linux/sched.h | 11 +++++++++++ include/linux/sysctl.h | 1 + init/Kconfig | 13 +++++++++++++ kernel/Makefile | 1 + kernel/delayacct.c | 36 ++++++++++++++++++++++++++++++++++++ kernel/fork.c | 2 ++ kernel/sysctl.c | 14 ++++++++++++++ 9 files changed, 106 insertions(+) Index: linux-2.6.15-rc5/init/Kconfig =================================================================== --- linux-2.6.15-rc5.orig/init/Kconfig +++ linux-2.6.15-rc5/init/Kconfig @@ -162,6 +162,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASK_DELAY_ACCT + bool "Enable per-task delay accounting (EXPERIMENTAL)" + help + Collect information on time spent by a task waiting for system + resources like cpu, synchronous block I/O completion and swapping + in pages. Such statistics can help in setting a task's priorities + relative to other tasks for cpu, io, rss limits etc. + + Unlike BSD process accounting, this information is available + continuously during the lifetime of a task. + + Say N if unsure. 
+ config SYSCTL bool "Sysctl support" ---help--- Index: linux-2.6.15-rc5/include/linux/sched.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/sched.h +++ linux-2.6.15-rc5/include/linux/sched.h @@ -541,6 +541,14 @@ struct sched_info { extern struct file_operations proc_schedstat_operations; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +struct task_delay_info { + spinlock_t lock; + + /* Add stats in pairs: uint64_t delay, uint32_t count */ +}; +#endif + enum idle_type { SCHED_IDLE, @@ -857,6 +865,9 @@ struct task_struct { int cpuset_mems_generation; #endif atomic_t fs_excl; /* holding fs exclusive resources */ +#ifdef CONFIG_TASK_DELAY_ACCT + struct task_delay_info delays; +#endif }; static inline pid_t process_group(struct task_struct *tsk) Index: linux-2.6.15-rc5/kernel/fork.c =================================================================== --- linux-2.6.15-rc5.orig/kernel/fork.c +++ linux-2.6.15-rc5/kernel/fork.c @@ -43,6 +43,7 @@ #include <linux/rmap.h> #include <linux/acct.h> #include <linux/cn_proc.h> +#include <linux/delayacct.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -923,6 +924,7 @@ static task_t *copy_process(unsigned lon if (p->binfmt && !try_module_get(p->binfmt->module)) goto bad_fork_cleanup_put_domain; + delayacct_tsk_init(p); p->did_exec = 0; copy_flags(clone_flags, p); p->pid = pid; Index: linux-2.6.15-rc5/include/linux/delayacct.h =================================================================== --- /dev/null +++ linux-2.6.15-rc5/include/linux/delayacct.h @@ -0,0 +1,26 @@ +/* delayacct.h - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2005 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKDELAYS_H +#define _LINUX_TASKDELAYS_H + +#include <linux/sched.h> + +#ifdef CONFIG_TASK_DELAY_ACCT +extern int delayacct_on; /* Delay accounting turned on/off */ +extern void delayacct_tsk_init(struct task_struct *tsk); +#else +static inline void delayacct_tsk_init(struct task_struct *tsk) +{} +#endif /* CONFIG_TASK_DELAY_ACCT */ +#endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.15-rc5/kernel/sysctl.c =================================================================== --- linux-2.6.15-rc5.orig/kernel/sysctl.c +++ linux-2.6.15-rc5/kernel/sysctl.c @@ -124,6 +124,10 @@ extern int sysctl_hz_timer; extern int acct_parm[]; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +extern int delayacct_on; +#endif + int randomize_va_space = 1; static int parse_table(int __user *, int, void __user *, size_t __user *, void __user *, size_t, @@ -656,6 +660,16 @@ static ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#if defined(CONFIG_TASK_DELAY_ACCT) + { + .ctl_name = KERN_TASK_DELAY_ACCT, + .procname = "delayacct", + .data = &delayacct_on, + .maxlen = sizeof (int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif { .ctl_name = 0 } }; Index: linux-2.6.15-rc5/include/linux/sysctl.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/sysctl.h +++ linux-2.6.15-rc5/include/linux/sysctl.h @@ -146,6 +146,7 @@ enum KERN_RANDOMIZE=68, /* int: randomize virtual address space */ KERN_SETUID_DUMPABLE=69, /* int: behaviour of dumps for setuid core */ KERN_SPIN_RETRY=70, /* int: number of spinlock retries */ + KERN_TASK_DELAY_ACCT=71, /* turn task delay accounting on/off */ }; Index: linux-2.6.15-rc5/Documentation/kernel-parameters.txt 
=================================================================== --- linux-2.6.15-rc5.orig/Documentation/kernel-parameters.txt +++ linux-2.6.15-rc5/Documentation/kernel-parameters.txt @@ -410,6 +410,8 @@ running once the system is up. Format: <area>[,<node>] See also Documentation/networking/decnet.txt. + delayacct [KNL] Enable per-task delay accounting + devfs= [DEVFS] See Documentation/filesystems/devfs/boot-options. Index: linux-2.6.15-rc5/kernel/Makefile =================================================================== --- linux-2.6.15-rc5.orig/kernel/Makefile +++ linux-2.6.15-rc5/kernel/Makefile @@ -32,6 +32,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: linux-2.6.15-rc5/kernel/delayacct.c =================================================================== --- /dev/null +++ linux-2.6.15-rc5/kernel/delayacct.c @@ -0,0 +1,36 @@ +/* delayacct.c - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2005 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
+ */ + +#include <linux/sched.h> + +int delayacct_on; /* Delay accounting turned on/off */ + +int __init delayacct_setup_enable(char *str) +{ + delayacct_on = 1; + return 1; +} +__setup("delayacct", delayacct_setup_enable); + +inline void delayacct_tsk_init(struct task_struct *tsk) +{ + memset(&tsk->delays, 0, sizeof(tsk->delays)); + spin_lock_init(&tsk->delays.lock); +} + +static int __init delayacct_init(void) +{ + delayacct_tsk_init(&init_task); + return 0; +} +core_initcall(delayacct_init); |
From: Shailabh N. <na...@wa...> - 2005-12-07 22:23:53
|
This patch attempts to record all the time spent by a task waiting for completion of (user-initiated) block I/O. Ideally, it would have been nice to be able to record the time spent by a task waiting for I/O events that are related to async block I/O. While that can be done now (by measuring time spent in wait_for_async_kiocb) once (if ?) network aio is implemented, AFAIK, it won't be possible to distinguish async block and network aio events (and I suspect async I/O to pipes too...) so async block I/O gets ignored for now. Suggestions on how async block I/O wait can be accounted accurately would be welcome. Changes since 11/14/05 - use nanosecond resolution, adjusted wall clock time for timestamps instead of sched_clock (akpm, andi, marcelo) - collect stats only if delay accounting enabled (parag) - stats collected for delays in all userspace-initiated block I/O including fsync/fdatasync but not counting waits for async block io events. 11/14/05: First post delayacct-blkio.patch Record time spent by a task waiting for completion of userspace initiated synchronous block I/O. This can help determine the right I/O priority for the task. 
Signed-off-by: Shailabh Nagar <na...@wa...> fs/buffer.c | 6 ++++++ fs/read_write.c | 10 +++++++++- include/linux/delayacct.h | 4 ++++ include/linux/sched.h | 2 ++ kernel/delayacct.c | 31 +++++++++++++++++++++++++++++++ mm/filemap.c | 10 +++++++++- mm/memory.c | 17 +++++++++++++++-- 7 files changed, 76 insertions(+), 4 deletions(-) Index: linux-2.6.15-rc5/include/linux/sched.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/sched.h +++ linux-2.6.15-rc5/include/linux/sched.h @@ -546,6 +546,8 @@ struct task_delay_info { spinlock_t lock; /* Add stats in pairs: uint64_t delay, uint32_t count */ + uint64_t blkio_delay; /* wait for sync block io completion */ + uint32_t blkio_count; }; #endif Index: linux-2.6.15-rc5/fs/read_write.c =================================================================== --- linux-2.6.15-rc5.orig/fs/read_write.c +++ linux-2.6.15-rc5/fs/read_write.c @@ -14,6 +14,8 @@ #include <linux/security.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/time.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -224,8 +226,14 @@ ssize_t do_sync_read(struct file *filp, (ret = filp->f_op->aio_read(&kiocb, buf, len, kiocb.ki_pos))) wait_on_retry_sync_kiocb(&kiocb); - if (-EIOCBQUEUED == ret) + if (-EIOCBQUEUED == ret) { + __attribute__((unused)) struct timespec start, end; + + getnstimestamp(&start); ret = wait_on_sync_kiocb(&kiocb); + getnstimestamp(&end); + delayacct_blkio(&start, &end); + } *ppos = kiocb.ki_pos; return ret; } Index: linux-2.6.15-rc5/mm/filemap.c =================================================================== --- linux-2.6.15-rc5.orig/mm/filemap.c +++ linux-2.6.15-rc5/mm/filemap.c @@ -28,6 +28,8 @@ #include <linux/blkdev.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/time.h> +#include <linux/delayacct.h> #include "filemap.h" /* * FIXME: remove all knowledge of the buffer layer from the core VM @@ 
-1062,8 +1064,14 @@ generic_file_read(struct file *filp, cha init_sync_kiocb(&kiocb, filp); ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos); - if (-EIOCBQUEUED == ret) + if (-EIOCBQUEUED == ret) { + __attribute__((unused)) struct timespec start, end; + + getnstimestamp(&start); ret = wait_on_sync_kiocb(&kiocb); + getnstimestamp(&end); + delayacct_blkio(&start, &end); + } return ret; } Index: linux-2.6.15-rc5/mm/memory.c =================================================================== --- linux-2.6.15-rc5.orig/mm/memory.c +++ linux-2.6.15-rc5/mm/memory.c @@ -48,6 +48,8 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/init.h> +#include <linux/time.h> +#include <linux/delayacct.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -2200,11 +2202,22 @@ static inline int handle_pte_fault(struc old_entry = entry = *pte; if (!pte_present(entry)) { if (pte_none(entry)) { + int ret; + __attribute__((unused)) struct timespec start, end; + if (!vma->vm_ops || !vma->vm_ops->nopage) return do_anonymous_page(mm, vma, address, pte, pmd, write_access); - return do_no_page(mm, vma, address, - pte, pmd, write_access); + + if (vma->vm_file) + getnstimestamp(&start); + ret = do_no_page(mm, vma, address, + pte, pmd, write_access); + if (vma->vm_file) { + getnstimestamp(&end); + delayacct_blkio(&start, &end); + } + return ret; } if (pte_file(entry)) return do_file_page(mm, vma, address, Index: linux-2.6.15-rc5/fs/buffer.c =================================================================== --- linux-2.6.15-rc5.orig/fs/buffer.c +++ linux-2.6.15-rc5/fs/buffer.c @@ -41,6 +41,8 @@ #include <linux/bitops.h> #include <linux/mpage.h> #include <linux/bit_spinlock.h> +#include <linux/time.h> +#include <linux/delayacct.h> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); static void invalidate_bh_lrus(void); @@ -337,6 +339,7 @@ static long do_fsync(unsigned int fd, in struct file * file; struct address_space *mapping; int ret, err; + 
__attribute__((unused)) struct timespec start, end; ret = -EBADF; file = fget(fd); @@ -349,6 +352,7 @@ static long do_fsync(unsigned int fd, in goto out_putf; } + getnstimestamp(&start); mapping = file->f_mapping; current->flags |= PF_SYNCWRITE; @@ -371,6 +375,8 @@ static long do_fsync(unsigned int fd, in out_putf: fput(file); out: + getnstimestamp(&end); + delayacct_blkio(&start, &end); return ret; } Index: linux-2.6.15-rc5/include/linux/delayacct.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/delayacct.h +++ linux-2.6.15-rc5/include/linux/delayacct.h @@ -19,8 +19,12 @@ #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ extern void delayacct_tsk_init(struct task_struct *tsk); +extern void delayacct_blkio(struct timespec *start, struct timespec *end); #else static inline void delayacct_tsk_init(struct task_struct *tsk) {} +static inline void delayacct_blkio(struct timespec *start, struct timespec *end) +{} + #endif /* CONFIG_TASK_DELAY_ACCT */ #endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.15-rc5/kernel/delayacct.c =================================================================== --- linux-2.6.15-rc5.orig/kernel/delayacct.c +++ linux-2.6.15-rc5/kernel/delayacct.c @@ -12,6 +12,7 @@ */ #include <linux/sched.h> +#include <linux/time.h> int delayacct_on; /* Delay accounting turned on/off */ @@ -34,3 +35,33 @@ static int __init delayacct_init(void) return 0; } core_initcall(delayacct_init); + +inline void delayacct_blkio(struct timespec *start, struct timespec *end) +{ + unsigned long long delay; + + if (!delayacct_on) + return; + + delay = timespec_nsdiff(start, end); + + spin_lock(&current->delays.lock); + current->delays.blkio_delay += delay; + current->delays.blkio_count++; + spin_unlock(&current->delays.lock); +} + +inline void delayacct_swapin(struct timespec *start, struct timespec *end) +{ + unsigned long long delay; + + if (!delayacct_on) + return; + + delay = 
timespec_nsdiff(start, end); + + spin_lock(&current->delays.lock); + current->delays.swapin_delay += delay; + current->delays.swapin_count++; + spin_unlock(&current->delays.lock); +} |
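The accounting pattern in the patch above (take two timestamps, convert the difference to nanoseconds, then bump a delay total and a count together under a lock) can be sketched in userspace. This is only an illustration under stated assumptions: `struct delay_info`, `ts_nsdiff()` and `record_blkio()` are hypothetical names, a pthread mutex stands in for the kernel spinlock, and `ts_nsdiff()` is an assumed equivalent of the `timespec_nsdiff()` helper, whose definition is not part of this message.

```c
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical userspace stand-in for the patch's struct task_delay_info. */
struct delay_info {
	pthread_mutex_t lock;	/* stands in for the kernel spinlock */
	uint64_t blkio_delay;	/* total nanoseconds spent waiting */
	uint32_t blkio_count;	/* number of waits recorded */
};

/* Nanoseconds from *start to *end; assumes end is not earlier than start. */
static int64_t ts_nsdiff(const struct timespec *start,
			 const struct timespec *end)
{
	return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
		+ (end->tv_nsec - start->tv_nsec);
}

/* Record one wait: the delay and the count are updated as a pair. */
static void record_blkio(struct delay_info *d,
			 const struct timespec *start,
			 const struct timespec *end)
{
	int64_t delay = ts_nsdiff(start, end);

	pthread_mutex_lock(&d->lock);
	d->blkio_delay += (uint64_t)delay;
	d->blkio_count++;
	pthread_mutex_unlock(&d->lock);
}
```

Keeping the total and the count in step lets a reader derive an average wait per event, which is presumably why the struct comment asks for stats to be added in pairs.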
From: Dave H. <hav...@us...> - 2005-12-07 22:34:35
|
On Wed, 2005-12-07 at 22:23 +0000, Shailabh Nagar wrote: > > + if (-EIOCBQUEUED == ret) { > + __attribute__((unused)) struct timespec start, end; > + Those "unused" things suck. They're really ugly. Doesn't making your delay functions into static inlines make the unused warnings go away? -- Dave |
From: Shailabh N. <na...@wa...> - 2005-12-07 23:06:55
|
Dave Hansen wrote: > On Wed, 2005-12-07 at 22:23 +0000, Shailabh Nagar wrote: > >>+ if (-EIOCBQUEUED == ret) { >>+ __attribute__((unused)) struct timespec start, end; >>+ > > > Those "unused" things suck. They're really ugly. > > Doesn't making your delay functions into static inlines make the unused > warnings go away? They do indeed. Thanks ! It was a holdover from when the delay funcs were macros. Will fix everywhere. --Shailabh > > -- Dave |
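The point about static inlines can be shown with a small standalone sketch. It is not taken from the patch: `DEMO_DELAY_ACCT`, `demo_delayacct_blkio()` and `demo_wait()` are hypothetical names, with `DEMO_DELAY_ACCT` playing the role of CONFIG_TASK_DELAY_ACCT and left undefined here. Because the empty stub is a real function, the caller's locals are still referenced as arguments, so the caller should need neither an #ifdef nor `__attribute__((unused))`.

```c
#include <time.h>

#ifdef DEMO_DELAY_ACCT
/* Feature enabled: the real implementation lives elsewhere. */
void demo_delayacct_blkio(struct timespec *start, struct timespec *end);
#else
/* Feature disabled: an empty inline stub that still takes the arguments. */
static inline void demo_delayacct_blkio(struct timespec *start,
					struct timespec *end)
{
	(void)start;
	(void)end;
}
#endif

/* The caller is identical in both configurations. */
static int demo_wait(void)
{
	struct timespec start = { 0, 0 }, end = { 0, 0 };

	demo_delayacct_blkio(&start, &end);
	return 0;
}
```

An empty macro would drop its arguments during preprocessing, which is the situation where locals like `start` and `end` can end up looking unused to the compiler.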
From: Shailabh N. <na...@wa...> - 2005-12-12 15:57:17
|
Creates /proc/<pid>/delay interface for getting per-task delay statistics (time spent by a task waiting for cpu, sync block I/O completion, swapping in pages etc.) The cpu stats are available only if CONFIG_SCHEDSTATS is enabled. The interface allows a task's delay stats (excluding cpu) to be reset to zero. This is particularly useful if delay accounting is being turned on/off dynamically. Signed-off-by: Shailabh Nagar <na...@wa...> fs/proc/base.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/delayacct.h | 6 ++++ kernel/delayacct.c | 33 +++++++++++++++++++++++ 3 files changed, 104 insertions(+) Index: linux-2.6.15-rc5/fs/proc/base.c =================================================================== --- linux-2.6.15-rc5.orig/fs/proc/base.c +++ linux-2.6.15-rc5/fs/proc/base.c @@ -71,6 +71,8 @@ #include <linux/cpuset.h> #include <linux/audit.h> #include <linux/poll.h> +#include <linux/delayacct.h> +#include <linux/kernel.h> #include "internal.h" /* @@ -166,6 +168,10 @@ enum pid_directory_inos { PROC_TID_OOM_SCORE, PROC_TID_OOM_ADJUST, +#ifdef CONFIG_TASK_DELAY_ACCT + PROC_TID_DELAY_ACCT, + PROC_TGID_DELAY_ACCT, +#endif /* Add new entries before this */ PROC_TID_FD_DIR = 0x8000, /* 0x8000-0xffff */ }; @@ -220,6 +226,9 @@ static struct pid_entry tgid_base_stuff[ #ifdef CONFIG_AUDITSYSCALL E(PROC_TGID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO), #endif +#ifdef CONFIG_TASK_DELAY_ACCT + E(PROC_TGID_DELAY_ACCT,"delay", S_IFREG|S_IRUGO), +#endif {0,0,NULL,0} }; static struct pid_entry tid_base_stuff[] = { @@ -262,6 +271,9 @@ static struct pid_entry tid_base_stuff[] #ifdef CONFIG_AUDITSYSCALL E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO), #endif +#ifdef CONFIG_TASK_DELAY_ACCT + E(PROC_TID_DELAY_ACCT,"delay", S_IFREG|S_IRUGO), +#endif {0,0,NULL,0} }; @@ -1066,6 +1078,53 @@ static struct file_operations proc_secco }; #endif /* CONFIG_SECCOMP */ +#ifdef CONFIG_TASK_DELAY_ACCT +ssize_t proc_delayacct_write(struct file *file, const char __user 
*buffer, + size_t count, loff_t *ppos) +{ + struct task_struct *tsk = proc_task(file->f_dentry->d_inode); + char kbuf[DELAYACCT_PROC_MAX_WRITE + 1]; + int cmd, ret; + + if (count > DELAYACCT_PROC_MAX_WRITE) + return -EINVAL; + if (copy_from_user(&kbuf, buffer, count)) + return -EFAULT; + + cmd = simple_strtoul(kbuf, NULL, 10); + ret = delayacct_task_write(tsk, cmd); + + if (ret) + return ret; + return count; +} + +ssize_t proc_delayacct_read(struct file *file, char __user *buffer, + size_t count, loff_t *ppos) +{ + struct task_struct *tsk = proc_task(file->f_dentry->d_inode); + char kbuf[DELAYACCT_PROC_MAX_READ + 1]; + size_t len; + loff_t __ppos = *ppos; + + len = delayacct_task_read(tsk, kbuf); + + if (__ppos >= len) + return 0; + if (count > len-__ppos) + count = len-__ppos; + if (copy_to_user(buffer, kbuf + __ppos, count)) + return -EFAULT; + *ppos = __ppos + count; + return count; +} + +static struct file_operations proc_delayacct_operations = { + .read = proc_delayacct_read, + .write = proc_delayacct_write, +}; +#endif + static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd) { struct inode *inode = dentry->d_inode; @@ -1786,6 +1845,12 @@ static struct dentry *proc_pident_lookup inode->i_fop = &proc_loginuid_operations; break; #endif +#ifdef CONFIG_TASK_DELAY_ACCT + case PROC_TID_DELAY_ACCT: + case PROC_TGID_DELAY_ACCT: + inode->i_fop = &proc_delayacct_operations; + break; +#endif default: printk("procfs: impossible type (%d)",p->type); iput(inode); Index: linux-2.6.15-rc5/include/linux/delayacct.h =================================================================== --- linux-2.6.15-rc5.orig/include/linux/delayacct.h +++ linux-2.6.15-rc5/include/linux/delayacct.h @@ -16,11 +16,17 @@ #include <linux/sched.h> +/* Maximum data that a user can read/write from/to /proc/<tgid>/delay */ +#define DELAYACCT_PROC_MAX_READ 256 +#define DELAYACCT_PROC_MAX_WRITE 8 + #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned 
on/off */ extern void delayacct_tsk_init(struct task_struct *tsk); extern void delayacct_blkio(struct timespec *start, struct timespec *end); extern void delayacct_swapin(struct timespec *start, struct timespec *end); +extern int delayacct_task_write(struct task_struct *tsk, int cmd); +extern size_t delayacct_task_read(struct task_struct *tsk, char *buf); #else static inline void delayacct_tsk_init(struct task_struct *tsk) {} Index: linux-2.6.15-rc5/kernel/delayacct.c =================================================================== --- linux-2.6.15-rc5.orig/kernel/delayacct.c +++ linux-2.6.15-rc5/kernel/delayacct.c @@ -13,6 +13,7 @@ #include <linux/sched.h> #include <linux/time.h> +#include <linux/delayacct.h> int delayacct_on; /* Delay accounting turned on/off */ @@ -65,3 +66,35 @@ inline void delayacct_swapin(struct time current->delays.swapin_count++; spin_unlock(&current->delays.lock); } + +/* User writes @cmd to /proc/<tgid>/delay */ +inline int delayacct_task_write(struct task_struct *tsk, int cmd) +{ + if (cmd == 0) { + spin_lock(&tsk->delays.lock); + memset(&tsk->delays, 0, sizeof(tsk->delays)); + spin_unlock(&tsk->delays.lock); + } + return 0; +} + +/* User reads from /proc/<tgid>/delay */ +inline size_t delayacct_task_read(struct task_struct *tsk, char *buf) +{ + unsigned long long run_delay = 0; + unsigned long run_count = 0; + +#ifdef CONFIG_SCHEDSTATS + run_delay = jiffies_to_usecs(tsk->sched_info.run_delay) * 1000; + run_count = tsk->sched_info.pcnt ; +#endif + return snprintf(buf, DELAYACCT_PROC_MAX_READ, + "%lu %llu %llu %u %llu %u %llu\n", + run_count, + (uint64_t) current_sched_time(tsk), + (uint64_t) run_delay, + (unsigned int) tsk->delays.blkio_count, + (uint64_t) tsk->delays.blkio_delay, + (unsigned int) tsk->delays.swapin_count, + (uint64_t) tsk->delays.swapin_delay); +} - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to maj...@vg... 
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
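A userspace reader of the proposed /proc/<pid>/delay file would have to split the single line produced by the snprintf() in delayacct_task_read(). The sketch below assumes the seven-field format shown in this patch and nothing more; `struct delay_stats` and `parse_delay_line()` are hypothetical names, and the field names are only a reading of the snprintf() argument order (run count, run time, run delay, then the block I/O and swap-in count/delay pairs).

```c
#include <stdio.h>

/* Illustrative field names; the order follows the patch's snprintf(). */
struct delay_stats {
	unsigned long run_count;
	unsigned long long run_time;
	unsigned long long run_delay;
	unsigned int blkio_count;
	unsigned long long blkio_delay;
	unsigned int swapin_count;
	unsigned long long swapin_delay;
};

/* Parse one line; returns 0 on success, -1 if any field is missing. */
static int parse_delay_line(const char *line, struct delay_stats *s)
{
	int n = sscanf(line, "%lu %llu %llu %u %llu %u %llu",
		       &s->run_count, &s->run_time, &s->run_delay,
		       &s->blkio_count, &s->blkio_delay,
		       &s->swapin_count, &s->swapin_delay);

	return n == 7 ? 0 : -1;
}
```

Checking the sscanf() return value against the expected field count guards against a short or reformatted line, which matters for a positional format with no field labels.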
From: Jay L. <jl...@en...> - 2005-12-13 18:35:11
|
john stultz wrote: > On Mon, 2005-12-12 at 19:31 +0000, Shailabh Nagar wrote: > >>Christoph Lameter wrote: >> >>>On Wed, 7 Dec 2005, Shailabh Nagar wrote: >>> >>> >>> >>>>+void getnstimestamp(struct timespec *ts) >>> >>> >>>There is already getnstimeofday in the kernel. >>> >> >>Yes, and that function is being used within the getnstimestamp() being proposed. >>However, John Stultz had advised that getnstimeofday could get affected by calls to >>settimeofday and had recommended adjusting the getnstimeofday value with wall_to_monotonic. >> >>John, could you elaborate ? > > > I think you pretty well have it covered. > > getnstimeofday + wall_to_monotonic should be higher-res and more > reliable (than TSC based sched_clock(), for example) for getting a > timestamp. How is this proposed function different from do_posix_clock_monotonic_gettime()? It calls getnstimeofday(), it also adjusts with wall_to_monotonic. It seems to me we just need to EXPORT_SYMBOL_GPL the do_posix_clock_monotonic_gettime()? Thanks, - jay > > There may be performance concerns as you have to access the clock > hardware in getnstimeofday(), but there really is no other way for > reliable finely grained monotonically increasing timestamps. > > thanks > -john > |
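The distinction being discussed, a wall-clock value that settimeofday can step versus a monotonic one that it cannot, has a userspace analogue in POSIX clock_gettime() with CLOCK_MONOTONIC. The sketch below is that analogue only, not the in-kernel getnstimestamp() or do_posix_clock_monotonic_gettime(); `monotonic_ns()` is a hypothetical helper name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds, or -1 on failure. */
static int64_t monotonic_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Two successive readings can be subtracted to get an interval that is not affected by someone setting the time of day in between, which is the property the delay accounting timestamps need.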