From: Jes S. <je...@sg...> - 2006-04-27 14:55:04
|
Hi,

Attached is the patch for the simple task notifier support that has
been discussed on lse-tech on multiple occasions (it supersedes
pnotify/pagg/task_notify). It provides a way to hook into the
fork/exit etc. places and can be used for things like task grouping
and accounting, but also to clean up all the conditional callouts we
have in the fork/exit paths. I will be posting one example in the next
email and I will be happy to do a bunch more if we can come to
agreement on this approach.

I am using the raw notifier chain to give the clients the same
sleep/blocking rules as the parents currently have. IMHO this makes
the most sense to keep the overhead at a minimum.

Comments/opinions/flames?

Cheers,
Jes

Simple task notifier support. This uses the raw notifier chain,
allowing the users the same calls to sleeping functions as are in
fork() etc.

Signed-off-by: Jes Sorensen <je...@sg...>

----
 fs/exec.c                 |    3 +++
 include/linux/init_task.h |    2 ++
 include/linux/notifier.h  |    5 +++++
 include/linux/sched.h     |    3 +++
 kernel/exit.c             |    5 +++++
 kernel/fork.c             |   16 ++++++++++++++++
 6 files changed, 34 insertions(+)

Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c
+++ linux-2.6/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/notifier.h>

 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1001,6 +1002,8 @@
 	suid_keys(current);
 	exec_keys(current);

+	raw_notifier_call_chain(&current->task_notifier, TN_EXEC, NULL);
+
 	task_lock(current);
 	unsafe = unsafe_exec(current);
 	security_bprm_apply_creds(bprm, unsafe);
Index: linux-2.6/include/linux/init_task.h
===================================================================
--- linux-2.6.orig/include/linux/init_task.h
+++ linux-2.6/include/linux/init_task.h
@@ -3,6 +3,7 @@

 #include <linux/file.h>
 #include <linux/rcupdate.h>
+#include <linux/notifier.h>

 #define INIT_FDTABLE \
 {									\
@@ -123,6 +124,7 @@
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
+	.task_notifier	= RAW_NOTIFIER_INIT(tsk.task_notifier),		\
 }
Index: linux-2.6/include/linux/notifier.h
===================================================================
--- linux-2.6.orig/include/linux/notifier.h
+++ linux-2.6/include/linux/notifier.h
@@ -154,5 +154,10 @@
 #define CPU_DOWN_FAILED	0x0006 /* CPU (unsigned)v NOT going down */
 #define CPU_DEAD	0x0007 /* CPU (unsigned)v dead */

+#define TN_FORK		0x0001 /* Task notifier - fork */
+#define TN_EXIT		0x0002 /* Task notifier - exit */
+#define TN_FREE		0x0003 /* Task notifier - free */
+#define TN_EXEC		0x0004 /* Task notifier - exec */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_NOTIFIER_H */
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -36,6 +36,7 @@
 #include <linux/seccomp.h>
 #include <linux/rcupdate.h>
 #include <linux/futex.h>
+#include <linux/notifier.h>

 #include <linux/auxvec.h>	/* For AT_VECTOR_SIZE */

@@ -888,6 +889,8 @@
 	 * cache last used pipe for splice */
 	struct pipe_inode_info *splice_pipe;
+
+	struct raw_notifier_head task_notifier;
 };

 static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -34,6 +34,7 @@
 #include <linux/mutex.h>
 #include <linux/futex.h>
 #include <linux/compat.h>
+#include <linux/notifier.h>
 #include <linux/pipe_fs_i.h>

 #include <asm/uaccess.h>
@@ -910,6 +911,10 @@
 	if (unlikely(tsk->compat_robust_list))
 		compat_exit_robust_list(tsk);
 #endif
+	/*
+	 * Must call the exit notifier before losing the mm struct
+	 */
+	raw_notifier_call_chain(&tsk->task_notifier, TN_EXIT, tsk);
 	exit_mm(tsk);
 	exit_sem(tsk);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -44,6 +44,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/notifier.h>

 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -114,6 +115,7 @@
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);

+	raw_notifier_call_chain(&tsk->task_notifier, TN_FREE, tsk);
 	if (unlikely(tsk->audit_context))
 		audit_free(tsk);
 	security_task_free(tsk);
@@ -1043,6 +1045,11 @@
 	if (clone_flags & CLONE_THREAD)
 		p->tgid = current->tgid;

+	/*
+	 * Any of the below could be trying to use this
+	 */
+	RAW_INIT_NOTIFIER_HEAD(&p->task_notifier);
+
 	if ((retval = security_task_alloc(p)))
 		goto bad_fork_cleanup_policy;
 	if ((retval = audit_alloc(p)))
@@ -1103,6 +1110,14 @@
 	p->exit_state = 0;

 	/*
+	 * Call the task notifiers for the current thread. It is their
+	 * job to determine whether or not to preserve the parent's
+	 * task_notifiers.
+	 */
+	if (raw_notifier_call_chain(&current->task_notifier, TN_FORK, p))
+		goto bad_fork_cleanup_namespace;
+
+	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.
 	 */
@@ -1239,6 +1254,7 @@
 	audit_free(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+	raw_notifier_call_chain(&p->task_notifier, TN_EXIT, p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
 	mpol_free(p->mempolicy);
|
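[Editor's note] The per-task raw chain in the patch above can be illustrated outside the kernel. Below is a minimal userspace C sketch: the type and function names mirror the kernel's notifier API, but everything here is a simplified stand-in (no priorities, no locking, and `NOTIFY_*` values are assumed), not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's notifier types. */
#define NOTIFY_DONE 0x0000
#define NOTIFY_STOP 0x8000

#define TN_FORK 0x0001

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
};

struct raw_notifier_head {
	struct notifier_block *head;
};

/* Push a callback onto the (per-task) chain. */
static void raw_notifier_chain_register(struct raw_notifier_head *nh,
					struct notifier_block *nb)
{
	nb->next = nh->head;
	nh->head = nb;
}

/* Walk the chain in order; a callback returning NOTIFY_STOP ends the walk.
 * Raw chains do no locking, so callers may sleep -- the point Jes makes. */
static int raw_notifier_call_chain(struct raw_notifier_head *nh,
				   unsigned long event, void *data)
{
	int ret = NOTIFY_DONE;
	struct notifier_block *nb;

	for (nb = nh->head; nb; nb = nb->next) {
		ret = nb->notifier_call(nb, event, data);
		if (ret & NOTIFY_STOP)
			break;
	}
	return ret;
}

/* Example client callback: counts TN_FORK events it sees. */
static int forks_seen;

static int count_forks(struct notifier_block *nb, unsigned long event,
		       void *data)
{
	(void)nb; (void)data;
	if (event == TN_FORK)
		forks_seen++;
	return NOTIFY_DONE;
}
```

With a per-task head, only tasks whose chain actually holds this block ever pay for the callback.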
From: Chandra S. <sek...@us...> - 2006-04-27 20:14:33
|
On Thu, 2006-04-27 at 10:54 -0400, Jes Sorensen wrote:
> Hi,
>
> Attached is the patch for the simple task notifier support that has
> been discussed on lse-tech on multiple occasions (it supersedes
> pnotify/pagg/task_notify). It provides a way to hook into the
> fork/exit etc. places and can be used for things like task grouping
> and accounting, but also to clean up all the conditional callouts we
> have in the fork/exit paths. I will be posting one example in the
> next email and I will be happy to do a bunch more if we can come to
> agreement on this approach.
>
> I am using the raw notifier chain to give the clients the same
> sleep/blocking rules as the parents currently have. IMHO this makes
> the most sense to keep the overhead at a minimum.
>
> Comments/opinions/flames?

I like the idea of using the basic notifier mechanism for doing task
notification.

I totally agree that this approach will help us move some code out
from fork(), exec(), exit(), etc., and help make some compiled-in
functionality become available as modules (the process event
connector, for example).

But I have concerns with this implementation:

Issue 1: As I said before in this context, IMHO a global notifier head
for these events is better than per-task notifier chains, for the
following reasons:

1. How does this model make sure that all references to a module's
   notifier block are removed when the module unregisters the
   notifier? All references (one per task) to the notifier block (and
   hence the notifier_call) have to be removed. The attached patch
   doesn't do it. If we want to do it, then we have to hold a
   system-wide lock/semaphore and walk through _all_ the tasks in the
   system and remove the notifier block from them. Basically we have
   to make sure addition/deletion of a notifier block to _all_ tasks
   is done atomically. Note that changing the notifier to an atomic
   notifier alone will not help, as we have to remove the reference to
   the notifier_call in _all_ the tasks atomically.

2. The same issue has to be solved during register also, and the
   module has to do the registration for each task it is interested in
   (under the same lock as in (1)).

3. If a module wants a new task to inherit the module's notification
   callback from the task's parent, then the module has to call the
   register_notifier for the new task during the TN_FORK event. This
   is an additional overhead for _each_ module at fork(). Also, this
   has to be done with the lock (described in (1) above) held.

4. Consider the situation: at fork(), a module tries to allocate a
   notifier block to attach to the new task, and the alloc fails. What
   does the module do? Fail the callback and hence fail the fork()?

5. Where would we free the notifier blocks that are attached to
   _every_ task in the system? (The semundo patch doesn't free the
   notifier block it alloc'd in the getundo_list() function.) Note
   that it is _illegal_ to do it in the corresponding notifier_call,
   which means we need a garbage collector of some sort to free up the
   notifier blocks in _every_ task.

6. Wastage of memory (a notifier head and multiple notifier blocks for
   every task in the system).

I understand there is "one function call overhead" at fork() for you
and the MVFS folks, but having a global notifier block will be a lot
simpler.

BTW, there is no locking involved during call_chain of atomic
notifiers, so the overhead is minimal.

Issue 2: I think the task notify mechanism has to provide the
necessary protection for the notifier chain to make sure that the
notifier_blocks are not unregistered while the notifier_call_chain is
walked through.

Individual module owners cannot protect the notifier chain while it is
walked through during call_chain.

We could simply use an atomic notifier. Why did you choose not to use
it? If any module wants to block, then we may have to provide blocking
task notify call chains also.

IMO, we need a simple wrapper that uses a global atomic notifier head
and provides
 - register_task_notifier
 - unregister_task_notifier
 - task_call_chain

with task_call_chain called from appropriate points with appropriate
flags.

regards,

chandra

--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sek...@us... | .......you may get it.
----------------------------------------------------------------------
|
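[Editor's note] Chandra's point (1) -- that per-task chains force a module unregister to visit every task -- can be sketched in userspace C. All names here (`struct task`, `unregister_everywhere`) are illustrative, not kernel API; the kernel version would additionally need the system-wide lock he describes (e.g. something like the tasklist lock).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-task chain: each task carries its own list of blocks. */
struct notifier_block {
	int (*call)(void);
	struct notifier_block *next;
};

struct task {
	struct notifier_block *chain;
	struct task *next;		/* global task list */
};

/* Unlink the block whose callback is fn from one task's chain.
 * Returns 1 if a block was removed. */
static int chain_remove(struct notifier_block **pp, int (*fn)(void))
{
	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->call == fn) {
			*pp = (*pp)->next;
			return 1;
		}
	}
	return 0;
}

/* The unregister cost Chandra objects to: O(number of tasks), and in the
 * kernel the whole walk would have to be done under one global lock so
 * that no task keeps a stale reference to the module's callback. */
static int unregister_everywhere(struct task *all_tasks, int (*fn)(void))
{
	int removed = 0;
	struct task *t;

	for (t = all_tasks; t; t = t->next)
		removed += chain_remove(&t->chain, fn);
	return removed;
}

/* Dummy module callback used below. */
static int dummy_cb(void) { return 0; }
```

With a single global head, the same unregister is one list operation regardless of how many tasks exist.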
From: Chandra S. <sek...@us...> - 2006-04-27 22:04:59
|
On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote:

Hi Jes,

Here is the patch that describes what I was talking about. I used most
of your code, except that I replaced the per-task raw notifier head
with the global atomic notifier head.

 fs/exec.c                |    2 ++
 include/linux/notifier.h |    5 +++++
 include/linux/sched.h    |    1 +
 kernel/exit.c            |    2 ++
 kernel/fork.c            |    5 +++++
 kernel/sched.c           |   21 +++++++++++++++++++++
 6 files changed, 36 insertions(+)

Index: linux-2617-rc2-tnotify/kernel/sched.c
===================================================================
--- linux-2617-rc2-tnotify.orig/kernel/sched.c
+++ linux-2617-rc2-tnotify/kernel/sched.c
@@ -50,6 +50,7 @@
 #include <linux/times.h>
 #include <linux/acct.h>
 #include <linux/kprobes.h>
+#include <linux/notifier.h>

 #include <asm/tlb.h>
 #include <asm/unistd.h>
@@ -350,6 +351,26 @@ static inline void finish_lock_switch(ru
 }
 #endif /* __ARCH_WANT_UNLOCKED_CTXSW */

+/* Task notifier Mechanism */
+static ATOMIC_NOTIFIER_HEAD(task_notifier);
+
+int register_task_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&task_notifier, nb);
+}
+EXPORT_SYMBOL(register_task_notifier);
+
+int unregister_task_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&task_notifier, nb);
+}
+EXPORT_SYMBOL(unregister_task_notifier);
+
+int task_notifier_call_chain(unsigned long val, void *v)
+{
+	return atomic_notifier_call_chain(&task_notifier, val, v);
+}
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts. Note the ordering: we can safely lookup the task_rq without
Index: linux-2617-rc2-tnotify/kernel/exit.c
===================================================================
--- linux-2617-rc2-tnotify.orig/kernel/exit.c
+++ linux-2617-rc2-tnotify/kernel/exit.c
@@ -35,6 +35,7 @@
 #include <linux/futex.h>
 #include <linux/compat.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/notifier.h>

 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -910,6 +911,7 @@ fastcall NORET_TYPE void do_exit(long co
 	if (unlikely(tsk->compat_robust_list))
 		compat_exit_robust_list(tsk);
 #endif
+	task_notifier_call_chain(TN_EXIT, tsk);
 	exit_mm(tsk);
 	exit_sem(tsk);
Index: linux-2617-rc2-tnotify/kernel/fork.c
===================================================================
--- linux-2617-rc2-tnotify.orig/kernel/fork.c
+++ linux-2617-rc2-tnotify/kernel/fork.c
@@ -44,6 +44,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/notifier.h>

 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -114,6 +115,7 @@ void __put_task_struct(struct task_struc
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);

+	task_notifier_call_chain(TN_FREE, tsk);
 	if (unlikely(tsk->audit_context))
 		audit_free(tsk);
 	security_task_free(tsk);
@@ -1101,6 +1103,8 @@ static task_t *copy_process(unsigned lon
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
+	if (task_notifier_call_chain(TN_FORK, p))
+		goto bad_fork_cleanup_namespace;
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.
@@ -1238,6 +1242,7 @@ bad_fork_cleanup_audit:
 	audit_free(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+	task_notifier_call_chain(TN_EXIT, p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
 	mpol_free(p->mempolicy);
Index: linux-2617-rc2-tnotify/fs/exec.c
===================================================================
--- linux-2617-rc2-tnotify.orig/fs/exec.c
+++ linux-2617-rc2-tnotify/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/notifier.h>

 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1001,6 +1002,7 @@ void compute_creds(struct linux_binprm *
 	suid_keys(current);
 	exec_keys(current);

+	task_notifier_call_chain(TN_EXEC, NULL);
 	task_lock(current);
 	unsafe = unsafe_exec(current);
 	security_bprm_apply_creds(bprm, unsafe);
Index: linux-2617-rc2-tnotify/include/linux/notifier.h
===================================================================
--- linux-2617-rc2-tnotify.orig/include/linux/notifier.h
+++ linux-2617-rc2-tnotify/include/linux/notifier.h
@@ -154,5 +154,10 @@ extern int raw_notifier_call_chain(struc
 #define CPU_DOWN_FAILED	0x0006 /* CPU (unsigned)v NOT going down */
 #define CPU_DEAD	0x0007 /* CPU (unsigned)v dead */

+#define TN_FORK		0x0001 /* Task notifier - fork */
+#define TN_EXIT		0x0002 /* Task notifier - exit */
+#define TN_FREE		0x0003 /* Task notifier - free */
+#define TN_EXEC		0x0004 /* Task notifier - exec */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_NOTIFIER_H */
Index: linux-2617-rc2-tnotify/include/linux/sched.h
===================================================================
--- linux-2617-rc2-tnotify.orig/include/linux/sched.h
+++ linux-2617-rc2-tnotify/include/linux/sched.h
@@ -206,6 +206,7 @@ extern void cpu_init (void);
 extern void trap_init(void);
 extern void update_process_times(int user);
 extern void scheduler_tick(void);
+extern int task_notifier_call_chain(unsigned long, void *);
 #ifdef CONFIG_DETECT_SOFTLOCKUP
 extern void
softlockup_tick(void);

--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sek...@us... | .......you may get it.
----------------------------------------------------------------------
|
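[Editor's note] A client of the global-head variant registers once for the whole system. Below is a userspace sketch reusing the wrapper names from Chandra's patch, but with a plain linked list standing in for the atomic chain (no RCU or atomics here, and the `TN_*`/`NOTIFY_DONE` values are taken from the patches).

```c
#include <assert.h>
#include <stddef.h>

#define TN_FORK 0x0001
#define TN_EXIT 0x0002

#define NOTIFY_DONE 0

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long val, void *v);
	struct notifier_block *next;
};

/* One global chain for all tasks, as in Chandra's proposal. */
static struct notifier_block *task_notifier;

int register_task_notifier(struct notifier_block *nb)
{
	nb->next = task_notifier;
	task_notifier = nb;
	return 0;
}

/* Called from fork()/exec()/exit() sites; every registered module's
 * callback runs for every task event. */
int task_notifier_call_chain(unsigned long val, void *v)
{
	int ret = NOTIFY_DONE;
	struct notifier_block *nb;

	for (nb = task_notifier; nb; nb = nb->next)
		ret = nb->notifier_call(nb, val, v);
	return ret;
}

/* Example client: counts exits and ignores everything else -- the
 * "check and return immediately" pattern Chandra describes. */
static int exits;

static int my_cb(struct notifier_block *nb, unsigned long val, void *v)
{
	(void)nb; (void)v;
	if (val == TN_EXIT)
		exits++;
	return NOTIFY_DONE;
}
```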
From: Jes S. <je...@sg...> - 2006-04-28 13:46:54
|
Chandra Seetharaman wrote:
> On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote:
>
> Hi Jes,
>
> Here is the patch that describes what I was talking about. I used
> most of your code, except that I replaced the per-task raw notifier
> head with the global atomic notifier head.

Hi Chandra,

This patch implies that all tasks will want to run all the same
callbacks, which is not the case. There will be cases where some tasks
will have a certain callback and others won't. If one was to filter
tasks in the actual callback handler it would be very expensive.

Cheers,
Jes
|
From: Chandra S. <sek...@us...> - 2006-04-28 22:32:09
|
On Fri, 2006-04-28 at 15:46 +0200, Jes Sorensen wrote:
> Chandra Seetharaman wrote:
>> Hi Jes,
>>
>> Here is the patch that describes what I was talking about. I used
>> most of your code, except that I replaced the per-task raw notifier
>> head with the global atomic notifier head.
>
> Hi Chandra,
>
> This patch implies that all tasks will want to run all the same
> callbacks, which is not the case. There will be cases where some
> tasks will have a certain callback and others won't. If one was to
> filter tasks in the actual callback handler it would be very
> expensive.

How would it be? In one's callback, given a task, won't they readily
know if they are interested in that task? They could just check that
and return immediately if they are not interested in that task. That
is what I meant by "one function call overhead".

BTW, I am assuming there will be _more than one_ notifier block in the
list. Do you assume otherwise?

> Cheers,
> Jes

--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sek...@us... | .......you may get it.
----------------------------------------------------------------------
|
From: Jes S. <je...@sg...> - 2006-05-01 09:33:38
|
Chandra Seetharaman wrote:
> On Fri, 2006-04-28 at 15:46 +0200, Jes Sorensen wrote:
>> This patch implies that all tasks will want to run all the same
>> callbacks, which is not the case. There will be cases where some
>> tasks will have a certain callback and others won't. If one was to
>> filter tasks in the actual callback handler it would be very
>> expensive.
>
> How would it be? In one's callback, given a task, won't they readily
> know if they are interested in that task? They could just check that
> and return immediately if they are not interested in that task. That
> is what I meant by "one function call overhead".

That's the problem: how does a module know whether it is interested in
a given task? The only way to do that is to keep a list or hash table
and traverse it to find out whether or not the task is registered.

On larger systems it is of interest to group tasks together and apply
accounting, resource limitations etc. based on the task group. In such
a system you are likely to have many groups, which means the list for
the module can become quite large. The performance of keeping track of
it the way you suggest is going to bite big time.

SGI has a GPL'ed module called Job which does this, but there could be
other ways of doing it. Erik has posted older versions of it in the
past, and I have a somewhat cleaned-up version sitting around
somewhere as well if there's interest.

> BTW, I am assuming there will be _more than one_ notifier block in
> the list. Do you assume otherwise?

There can be, there can also be none. The idea is that you will only
have modules of interest registered, so in the current fork() case we
could possibly get rid of some of the conditional calls which aren't
always in use (keys, audit, security task etc).

Cheers,
Jes
|
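[Editor's note] Jes's filtering concern, sketched in userspace C: with one global chain, a grouping module's callback fires for *every* task event and must first decide whether the task is one it tracks. The names and the linear lookup below are illustrative only -- a real module like Job would use a hash table, but the lookup cost per event remains.

```c
#include <assert.h>

/* Illustrative per-module bookkeeping for a global-chain client. */
#define MAX_TRACKED 64

static int tracked[MAX_TRACKED];	/* pids this module cares about */
static int ntracked;

static unsigned long lookups;		/* filtering work actually done */
static int events_handled;		/* events for tasks we track */

/* Membership test the module must run on every event. A linear scan
 * here; a hash table shrinks but does not eliminate this cost. */
static int is_tracked(int pid)
{
	int i;

	lookups++;
	for (i = 0; i < ntracked; i++)
		if (tracked[i] == pid)
			return 1;
	return 0;
}

/* Global-chain callback: invoked for every task event in the system,
 * including the (typically vast majority of) tasks we don't track. */
static int job_cb(int pid)
{
	if (!is_tracked(pid))
		return 0;	/* not ours -- but we still paid for the lookup */
	events_handled++;
	return 0;
}
```

With a per-task chain the five untracked tasks below would never reach `job_cb` at all, which is the overhead difference being argued.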
From: Matt H. <mat...@us...> - 2006-04-28 00:31:48
|
On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote:

<snip>

> I totally agree that this approach will help us move some code out
> from fork(), exec(), exit(), etc., and help make some compiled-in
> functionality become available as modules (the process event
> connector, for example).

The implementations I've seen so far don't plug into fork() and exit()
at quite the same places as process events. Those spots were chosen
for a reason -- either to avoid subtle bugs or to have access to
information before it's destroyed.

1) The fork notification takes place before the new task's real_parent
   is set. Process events needs the real parent's pid. Seems like the
   easiest fix is to move the call down a bit.

2) exec notification takes place before the kernel has confirmed that
   the binary has been successfully loaded and hence it's not suitable
   as an exec event notification point.

   The call graph used in the patches:
       do_execve -> search_binary_handler -> load_elf_binary
                 -> compute_creds
   The call graph used for proc_exec_connector:
       do_execve -> search_binary_handler
                 (just after current->did_exec = 1)

   A minor point: current is not passed into the chain during exec.
   Seems like it should be. I'd rather not use current in
   proc_exec_connector since the parameter is going to be passed
   anyway.

   Seems like the easiest fix is to move the call to the same spot as
   proc_exec_connector in fs/exec.c (and pass current).

3) There are no id notifications. This is minor. I can add them.

4) The exit notification happens before tsk->exit_code is set. I've
   been looking through the intervening functions and I believe we can
   move the assignment to the exit code above the notification.

Cheers,
-Matt Helsley
|
From: Chandra S. <sek...@us...> - 2006-04-28 00:35:48
|
On Thu, 2006-04-27 at 17:24 -0700, Matt Helsley wrote: > On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote: > > <snip> > > > I totally agree that this approach will help us move some code out from > > fork(), exec(), exit(), etc., and help make some compiled-in > > functionality to become available as modules(process event connector, > > for example). > > The implementations I've seen so far don't plug into fork() and exit() > at quite the same places as process events. Those spots were chosen for > a reason -- either to avoid subtle bugs or to have access to information > before it's destroyed. > > 1) The fork notification takes place before the new task's real_parent > is set. Process events needs the real parent's pid. Seems like the > easiest fix is to move the call down a bit. > > 2) exec notification takes place before the kernel has confirmed that > the binary has been successfully loaded and hence it's not suitable as > an exec event notification point. > The call graph used in the patches: > do_execve -> search_binary_handler -> load_elf_binary > -> compute_creds > The call graph used for proc_exec_connector: > do_execve -> search_binary_handler > (just after current->did_exec = 1) > > A minor point: current is not passed into the chain during exec. Seems > like it should be. I'd rather not use current in proc_exec_connector > since the parameter is going to be passed anyway. > > Seems like the easiest fix is to move the call to the same spot as > proc_exec_connector in fs/exec.c (and pass current). > > 3) There are no id notifications. This is minor. I can add them. > > 4) The exit notification happens before tsk->exit_code is set. I've been > looking through the intervening functions and I believe we can move the > assignment to the exit code above the notification. > I do not see any problems in making these changes. 
> Cheers, > -Matt Helsley > -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sek...@us... | .......you may get it. ---------------------------------------------------------------------- |
From: Matt H. <mat...@us...> - 2006-04-28 01:32:33
|
As I was reviewing the second patch this occurred to me:

5) The TN_EXIT notification from the fork path would subtly change the
   process events interface by generating more exit events, and at
   different times, than existing users would expect. If the TN_FORK
   notification is moved as I've recommended in 1), this TN_EXIT
   notification should not exist.

Unfortunately, I think this would mean that exit_sem() could not be
implemented with the task notifier chain.

Cheers,
-Matt Helsley

Nit: This should be [patch 01/02] since the semundo patch depends on
this patch.
|
From: Jes S. <je...@sg...> - 2006-04-28 13:44:29
|
Matt Helsley wrote:
> On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote:
>
> <snip>
>
>> I totally agree that this approach will help us move some code out
>> from fork(), exec(), exit(), etc., and help make some compiled-in
>> functionality become available as modules (the process event
>> connector, for example).
>
> The implementations I've seen so far don't plug into fork() and
> exit() at quite the same places as process events. Those spots were
> chosen for a reason -- either to avoid subtle bugs or to have access
> to information before it's destroyed.

We can discuss the specific places we need to plug in. The points used
in my patch were based on contexts that I have seen a need for, but it
is likely that they could move to accommodate all.

> 1) The fork notification takes place before the new task's
> real_parent is set. Process events needs the real parent's pid.
> Seems like the easiest fix is to move the call down a bit.

What are process events?

> 2) exec notification takes place before the kernel has confirmed
> that the binary has been successfully loaded and hence it's not
> suitable as an exec event notification point.

What do you mean by an exec event notification point? It may be that
there should be a TN_EXEC and a TN_EXEC_COMPLETED event.

> A minor point: current is not passed into the chain during exec.
> Seems like it should be. I'd rather not use current in
> proc_exec_connector since the parameter is going to be passed anyway.

Since a task notifier is always run from current context, passing in
current would just add unnecessary overhead.

> Seems like the easiest fix is to move the call to the same spot as
> proc_exec_connector in fs/exec.c (and pass current).

The reason it is currently where it is is that I had a patch to handle
the key stuff similar to how the semundo stuff is done.

> 3) There are no id notifications. This is minor. I can add them.

What ids?

> 4) The exit notification happens before tsk->exit_code is set. I've
> been looking through the intervening functions and I believe we can
> move the assignment to the exit code above the notification.

I think you are right. I don't see any obvious reason why not.

Cheers,
Jes
|
From: Matt H. <mat...@us...> - 2006-04-29 01:10:52
|
On Fri, 2006-04-28 at 15:44 +0200, Jes Sorensen wrote: > Matt Helsley wrote: > > On Thu, 2006-04-27 at 13:14 -0700, Chandra Seetharaman wrote: > > > > <snip> > > > >> I totally agree that this approach will help us move some code out from > >> fork(), exec(), exit(), etc., and help make some compiled-in > >> functionality to become available as modules(process event connector, > >> for example). > > > > The implementations I've seen so far don't plug into fork() and exit() > > at quite the same places as process events. Those spots were chosen for > > a reason -- either to avoid subtle bugs or to have access to information > > before it's destroyed. > > We can discuss the specific places we need to plug in. The points used > in my patch were based on context that I have seen a need for, but it > is likely that it could move to accommodate all. Yup. That's why I sent my response :). > > 1) The fork notification takes place before the new task's real_parent > > is set. Process events needs the real parent's pid. Seems like the > > easiest fix is to move the call down a bit. > > What are process events? They are event notifications passed to userspace during a process' lifetime. They generally include pid and timestamp information. See include/linux/cn_proc.h and drivers/connector/cn_proc.c for what is sent and how. You can search for proc_(fork|exec|exit)_connector() calls in fork.c, exec.c, and exit.c to see where they are called from. There are also events for when the [re][ug]id change, which is what I was referring to in 3) below. > > 2) exec notification takes place before the kernel has confirmed that > > the binary has been successfully loaded and hence it's not suitable as > > an exec event notification point. > > What do you mean by an exec event notification point? It may be that > there should be a TN_EXEC and TN_EXEC_COMPLETED event. 
Or use two bits in the event enum to indicate one of three variations of the notification: STARTED (typically used to allocate/init) RETURN_OK (typically used to allocate/init) RETURN_ERR (used to free/cleanup from CALLED) The problem I see is TN_FOO is terribly ambiguous. Different notification handlers are going to want these notifications at slightly different times (during task struct initialization for instance). So there could be an endless series of patches moving the TN_FOO notification, adding new ones with slight variations in timing, or ensuring that additions to the notifier chains happen in a specific order. (the "STARTED" idea has this problem too) The best way to avoid this is to uniformly use a clearly-defined point of notification -- which is why process events only happen as early as possible in the "RETURN_OK" path. I think some things like semundo may not be a good candidate for *any* task notifier chain because they are so path-sensitive -- better to highlight that sensitivity by calling it explicitly in the code. > > A minor point: current is not passed into the chain during exec. Seems > > like it should be. I'd rather not use current in proc_exec_connector > > since the parameter is going to be passed anyway. > > Since a task notifier is always run from current context, passing in > current would just add unnecessary overhead. I don't believe your reason stated above is valid. You aren't avoiding overhead because the notifier functions pass a pointer parameter already. Passing current instead of relying on it in the handlers could simplify code analysis and future changes by not requiring handlers to assume that current is the task they must deal with. Perhaps it's a difference of personal style. > > Seems like the easiest fix is to move the call to the same spot as > > proc_exec_connector in fs/exec.c (and pass current). 
> > The reason it currently is where it is is because I had a patch to > handle the key stuff similar to how the semundo stuff is done. Ok. I don't see that patch so I can't suggest alternatives at the moment. > > 3) There are no id notifications. This is minor. I can add them. > > What ids? [re][ug]ids. I mentioned these ids in my description of process events above. Cheers, -Matt Helsley |
From: Jes S. <je...@sg...> - 2006-05-02 09:56:59
|
Matt Helsley wrote:
> On Fri, 2006-04-28 at 15:44 +0200, Jes Sorensen wrote:
> Or use two bits in the event enum to indicate one of three variations
> of the notification:
> STARTED (typically used to allocate/init)
> RETURN_OK (typically used to allocate/init)
> RETURN_ERR (used to free/cleanup from CALLED)
>
> The problem I see is TN_FOO is terribly ambiguous. Different
> notification handlers are going to want these notifications at slightly
> different times (during task struct initialization for instance). So
> there could be an endless series of patches moving the TN_FOO
> notification, adding new ones with slight variations in timing, or
> ensuring that additions to the notifier chains happen in a specific
> order. (the "STARTED" idea has this problem too)
>
> The best way to avoid this is to uniformly use a clearly-defined point
> of notification -- which is why process events only happen as early as
> possible in the "RETURN_OK" path. I think some things like semundo may
> not be a good candidate for *any* task notifier chain because they are
> so path-sensitive -- better to highlight that sensitivity by calling it
> explicitly in the code.

I think semundo is quite flexible in its needs and should be using a
notifier. Same goes for some of the other conditional stuff that's
currently being called, such as keys and audit. They are pretty much all
not very dependent on the order; they just need to clean up before the
stuff they are hanging from disappears.

But I agree, the points where this is happening should be well defined
and documented so we don't end up with 17 callouts. If at all, the most
there should ever be is an early and a late callout in a function, but
I'd like to avoid that.

>>> A minor point: current is not passed into the chain during exec. Seems
>>> like it should be. I'd rather not use current in proc_exec_connector
>>> since the parameter is going to be passed anyway.
>> Since a task notifier is always run from current context, passing in >> current would just add unnecessary overhead. > > I don't believe your reason stated above is valid. You aren't avoiding > overhead because the notifier functions pass a pointer parameter > already. Passing current instead of relying on it in the handlers could > simplify code analysis and future changes by not requiring handlers to > assume that current is the task they must deal with. Perhaps it's a > difference of personal style. I'd like to reduce the code overhead of sticking the value in a struct when it's already available in a global register (on most architectures). >>> Seems like the easiest fix is to move the call to the same spot as >>> proc_exec_connector in fs/exec.c (and pass current). >> The reason it currently is where it is is because I had a patch to >> handle the key stuff similar to how the semundo stuff is done. > > Ok. I don't see that patch so I can't suggest alternatives at the > moment. It was way too complicated. I'll look into doing the other ones in a simpler way once we get the unregister problem resolved. Cheers, Jes |
From: Jes S. <je...@sg...> - 2006-04-28 13:30:18
|
Chandra Seetharaman wrote:
> On Thu, 2006-04-27 at 10:54 -0400, Jes Sorensen wrote:
>> I am using the raw notifier chain to give the clients the same
>> sleep/blocking rules as the parents currently have. IMHO this makes
>> the most sense to keep the overhead at a minimum.
>>
>> Comments/opinions/flames?
>
> But I have concerns with this implementation:
>
> Issue 1: As i told before in this context, IMHO, having a global
> notifier head for these events is better than having per-task notifier
> chains due to the following reasons:
> 1 how does this model make sure that references to all the notifier
>   blocks for that module are removed when a module unregisters the
>   notifier?
>   All references (one per task) of the notifier block (and hence the
>   notifier_call) have to be removed. Attached patch doesn't do it.
>   If we want to do it, then we have to hold a system-wide lock/semaphore
>   and walk thru _all_ the tasks in the system and remove the notifier
>   block from them.
>   Basically we have to make sure addition/deletion of notifier block
>   to _all_ tasks is done atomically.
>   Note that changing the notifier to atomic_notifier alone will not
>   help as we have to remove the reference to the notifier_call in _all_
>   the tasks atomically.

Sorry, I simply don't follow you there. If you are talking about module
unload then it's the responsibility of the module to keep a reference
count and not unload until all users have exited, which should be easy
to do with the TN_EXIT handler.

I don't see what lock you need here.

> 2 same issue has to be solved during register also, and the module
>   has to do the registration for each task it is interested in (under
>   the same lock as (1)).

Again, why do you need a lock? Registration is inherited, so something
from userland creates it first and all children inherit it. Look at how
this is done with cpuset for instance.
> 3 If a module wants a new task to inherit the module's notification
>   callback from the task's parent, then the module has to call the
>   register_notifier for the new task during the TN_FORK event.
>   This is an additional overhead for _each_ module at fork().
>   Also this has to be done with the lock (described in (1) above) held.

Which makes perfect sense to me. The overhead here is minuscule compared
to the cost of having a global lock for this.

> 4 Consider the situation: At fork(), a module tries to allocate a
>   notifier block to attach to the new task, and the alloc fails. What
>   does the module do? fail the callback and hence fail the fork()?

Yes. Same problem if you are in the fork and there is not enough memory
to allocate the task_struct or anything else that is allocated during
the fork.

> 5 Where would we free the notifier blocks that are attached with
>   _every_ task in the system? (the semundo patch doesn't free the
>   notifier block it alloc'd in getundo_list() function).
>   Note that it is _illegal_ to do it in the corresponding
>   notifier_call. Which means we need to have a garbage collector of
>   some sort to free up the notifier blocks in _every_ task.

Huh? Beware that semundo is special and requires a semaphore to be
installed if clone() is called with a specific flag. If the semundo
patch doesn't free the notifier block in the exit path then I introduced
a bug. It's the responsibility of the TN_EXIT handler to clean up.

> 6 wastage of memory (notifier head and multiple notifier blocks for
>   every task in the system).

Sorry, this argument is bogus. You're wasting a notifier head, but
compared to what else is in the task struct this is nothing; in fact,
one could probably eliminate some elements in the task_struct and tag
them onto the notifier block so they are only allocated when really
needed. A notifier block is only allocated if something is using it.
> I understand there is "one function call overhead" at fork() for you
> and the MVFS folks, but having a global notifier block will be a lot
> simpler.

Having a global notifier block will be a lot costlier because it will
require a global lock.

> BTW, there is no locking involved during call_chain of atomic
> notifiers, so the overhead is minimal.

Atomic notifiers make no sense in the fork path. An atomic notifier
implies that the service function cannot sleep; however, that is not the
case during fork/exit etc. Either we go the full way with a global lock
or we do it lock-free using raw notifiers as I suggest.

> Issue 2: I think the task notify mechanism has to provide the necessary
> protection to the notifier chain to make sure that the notifier_blocks
> are not unregistered when the notifier_call_chain is walked through.

Since a task notifier should only be unregistered in the TN_EXIT case
this is not a problem. Besides, how many cases do we have where a module
is unloaded at runtime? A classical case for this is for things like
accounting, and it's not a module that will be loaded and unloaded at
random.

> Individual module owners cannot protect the notifier chain when it is
> walked through during call_chain.

Nobody should need to walk the notifier chain outside the specified
context.

> We could simply use atomic notifier. Why did you choose not to use it?

As explained above, atomic notifiers make no sense there.

> If any module wants to block, then we may have to provide a blocking
> task notify call chain also.

The task notification chain is allowed to sleep. We already use
GFP_KERNEL all over the place there.

> IMO, we need a simple wrapper that uses a global atomic notifier head
> and provides
> - register_task_notifier
> - unregister_task_notifier
> - task_call_chain
>
> with task_call_chain called from appropriate points with appropriate
> flags.

Sorry, but I haven't seen any argument for where this would be required.
If you could provide some examples please. Regards, Jes |
From: Chandra S. <sek...@us...> - 2006-04-28 20:40:20
|
On Fri, 2006-04-28 at 15:29 +0200, Jes Sorensen wrote:
> Chandra Seetharaman wrote:
> > On Thu, 2006-04-27 at 10:54 -0400, Jes Sorensen wrote:
> >> I am using the raw notifier chain to give the clients the same
> >> sleep/blocking rules as the parents currently have. IMHO this makes
> >> the most sense to keep the overhead at a minimum.
> >>
> >> Comments/opinions/flames?
> >
> > But I have concerns with this implementation:
> >
> > Issue 1: As i told before in this context, IMHO, having a global
> > notifier head for these events is better than having per-task notifier
> > chains due to the following reasons:
> > 1 how does this model make sure that references to all the notifier
> >   blocks for that module are removed when a module unregisters the
> >   notifier?
> >   All references (one per task) of the notifier block (and hence the
> >   notifier_call) have to be removed. Attached patch doesn't do it.
> >   If we want to do it, then we have to hold a system-wide
> >   lock/semaphore and walk thru _all_ the tasks in the system and
> >   remove the notifier block from them.
> >   Basically we have to make sure addition/deletion of notifier block
> >   to _all_ tasks is done atomically.
> >   Note that changing the notifier to atomic_notifier alone will not
> >   help as we have to remove the reference to the notifier_call in
> >   _all_ the tasks atomically.
>
> Sorry, I simply don't follow you there. If you are talking about module
> unload then it's the responsibility of the module to keep a reference
> count and not unload until all users have exited, which should be easy
> to do with the TN_EXIT handler.

Do you mean to say that the module will not be allowed to unload _until_
there is _a single_ task that the module has attached a notifier block
with?
That doesn't sound appealing. Consider a filesystem example:

- Administrator insmods a filesystem
- administrator mounts the filesystem
- random user chandra gets to the filesystem
- chandra creates a task p1
- p1 does something which makes the filesystem module interested in p1
- filesystem module attaches a notifier block to p1
- user chandra moves off the filesystem, but p1 remains alive.
- administrator unmounts the filesystem, succeeds as no user is in the fs
- administrator tries to rmmod the filesystem, and it fails due to the
  above logic.

How can the administrator fix the problem? Also, i do not think it is
advisable to call unregister from the notifier call.

> I don't see what lock you need here.

Even though the notifier block is per task, the notifier_head in the
task is global; in other words, there can be many users (of the task
notify mechanism) that can be doing one of the following at the _same
time_ with a single notifier head:

- register a new notifier block
- unregister a notifier block
- walk through the call chain in notifier_call_chain

Unless we have some kind of protection, how can we guarantee that they
will not be running over each other, walking off the list, calling the
notifier call of a removed module etc.?

> > 2 same issue has to be solved during register also, and the module
> >   has to do the registration for each task it is interested in (under
> >   the same lock as (1)).
>
> Again, why do you need a lock? Registration is inherited, so something
> from userland creates it first and all children inherit it. Look at how
> this is done with cpuset for instance.

cpuset is not a valid comparison here. cpuset is the _only_ user of the
data structures it plays with, and it knows how to protect it. Whereas
the task notify mechanism will be used by many users, and hence a common
protection mechanism is needed.
> > > 3 If a module wants a new task to inherit the module's notification > > callback from the task's parent, then the module has to call the > > register_notifier for the new task during the TN_FORK event. > > This is an additional overhead for _each_ module at fork(). > > Also this has to done with the lock(described in (1) above) held. > > Which makes perfect sense to me. The overhead here is miniscule compared > to the cost of having a global lock for this. I am _not_ saying that we need a global lock here, i am just pointing another problem w.r.t using per task notifier. > > > 4 Consider the situation: At fork(), a module tries to allocate a > > notifier block to attach to the new task, and the alloc fails. What > > does the module do ? fail the callback and hence fail the fork() ? > > Yes. Same problem if you are in the fork and there is not enough > memory to allocate the task_struct or anything else that is allocated > during the fork. > > > 5 Where would we free the notifier blocks that are attached with > > _every_ task in the system ? (the semundo patch doesn't free the > > notifier block it alloc'd in getundo_list() function). > > Note that it is _illegal_ to do it in the corresponding > > notifier_call. Which means we need to have a garbage collector of > > some sort to free up the notifier blocks in _every_ task. > > Huh? Beware that semundo is special and requires a semaphore to be > installed if clone() is called with a specific flag. If the semundo > patch doesn't free the notifier block in the exit path then I introduced > a bug. It's the responsibility of the TN_EXIT handler to clean up. As i pointed above it is not advisable to unregister from the callback, and that is the reason i am pointing this as a problem, not specifically on how the semundo patch is written. > > > 6 wastage of memory (notifier head and multiple notifier blocks for > > every task in the system). > > Sorry this argument is bogus. 
You're wasting a notifier head, but not just notifier head, you need notifier blocks associated with each user of the task notifier mechanism. It is certainly not bogus, if the user of the task notify mechanism is interested in _all_ tasks, then they have to allocate a notifier block for _every_ task in the system. Process event connector is an example that would be interested in _all_ tasks. > compared to what else is in the task struct this is nothing, in fact > one could probably eliminate some elements in the task_struct and tag > them onto notifier block so they are only allocated when really needed. > A notifier block is only allocated if something is using it. According to your description above, you expect more people to start using task notifier, which in turn mean a notifier block for _each_ task for every such usage of task notifier mechanism. > > > I understand there is "one function call overhead" at fork() for you and > > the MVFS folks, but having a global notifier block will be a lot > > simpler. > > Having a global notifier block will be a lot costlier because it will > require a global lock. My point is that we are not getting away from using a global lock in the per task notifier_head case also. > > > BTW, there is no locking involved during call_chain of atomic notifiers, > > so the overhead is minimal. > > atomic notifiers make no sense in the fork path. An atomic notifier > implies that the service function cannot sleep, however that is not the > case during fork/exit etc. Either we go the full way with a global lock > or we do it lock free using raw notifiers as I suggest. By global lock you mean a semaphore, i presume. > > > Issue 2: I think the task notify mechanism has to provide the necessary > > protection to the notifier chain to make sure that the notifier_blocks > > are not unregistered when the notifier_call_chain is walked through. 
> > Since a task notifier should only be unregistered in the TN_EXIT case as i said above this is not advisable. > this is not a problem. Besides, how many cases do we have where a module may not be many, but we don't want to be oopsing when they do, right ? As long as the kernel supports unloading a module, we should consider that as a valid usage. > is unloaded at runtime? A classical case for this is for things like > accounting and it's not a module that will be loaded and unloaded at > random. I will not conclude on that. We have heard in this channel that MVFS will use this feature. I am sure that many more will follow. They may not be doing at random, but they may do it, and IMO, we should work correctly in those cases also. > > > Individual module owners cannot protect the notifier chain when it is > > walked through during call_chain. > > Nobody should need to walk the notifier chain outside the specified > context. They won't be walking the list, the notifier_call_chain would be walking, and it needs protection. > > > We could simply use atomic notifier. Why did you choose not to use it ? > > As explained above, atomic notifiers makes no sense there. > > > If any module wants to block, then we may have to provide a blocking > > task notify call chains also. > > The task notification chain is allowed to sleep. We already use > GFP_KERNEL all over the place there. We can just go with a blocking semaphore. > > > IMO, we need a simple wrapper that uses a global atomic notifier head > > and provides > > - register_task_notifier > > - unregister_task_notifier > > - task_call_chain > > > > with task_call_chain called from appropriate points with appropriate > > flags. > > Sorry, but I haven't seen any argument for where this would be required. > If you could provide some examples please. > > Regards, > Jes -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sek...@us... 
| .......you may get it. ---------------------------------------------------------------------- |
From: Jes S. <je...@sg...> - 2006-05-01 10:07:14
|
Chandra Seetharaman wrote:
> On Fri, 2006-04-28 at 15:29 +0200, Jes Sorensen wrote:
> Do you mean to say that the module will not be allowed to unload
> _until_ there is _a single_ task that the module has attached a
> notifier block with? That doesn't sound appealing.
>
> Consider a filesystem example:
> - Administrator insmods a filesystem
> - administrator mounts the filesystem
> - random user chandra gets to the filesystem
> - chandra creates a task p1
> - p1 does something which makes the filesystem module interested in p1
> - filesystem module attaches a notifier block to p1
> - user chandra moves off the filesystem, but p1 remains alive.
> - administrator unmounts the filesystem, succeeds as no user is in the fs
> - administrator tries to rmmod the filesystem, and it fails due to the
>   above logic. How can the administrator fix the problem?

Unregistering doesn't fail, it just becomes delayed. How likely is it
that someone will unregister a file system like that while the task that
accessed it is still live? The cost of adding locking is far worse than
the above, IMHO fairly artificial, case.

> Also, i do not think it is advisable to call unregister from the
> notifier call.

There is nothing wrong with that. You're in the task context at the
time; nothing else will conflict.

> Even though the notifier block is per task, the notifier_head in the
> task is global; in other words, there can be many users (of the task
> notify mechanism) that can be doing one of the following at the _same
> time_ with a single notifier head:
> - register a new notifier block
> - unregister a notifier block
> - walk through the call chain in notifier_call_chain

Nobody will walk the notifier chain outside of task context; anything
doing that is broken. In addition, since the registration is done by the
task itself this isn't a problem, and unregistration is done at exit
time. We could use the rcu approach if it was really deemed necessary to
be able to add tasks from elsewhere.
> Unless we have some kind of protection, how can we guarantee that they > will not be running over each other, walking off the list, calling > notifier call of a removed module etc., ? The walking the list argument doesn't make sense. The only way to do that is by adding heavy locking which we don't need by applying the simple constraints. >> Again why do you need a lock? Registration is inherited so something >> from userland creates it first and all children inherits it. Look at how >> this is done with cpuset for instance. > > cpuset is not a valid comparison here. cpuset is the _only_ user of the > data structures it plays with, and it knows how to protect it. Whereas > task notify mechanism will be used by many users and hence a common > protection mechanism is needed. It's totally valid for the mechanism. If you wish to apply accounting or resource control to the system, then it's expectable that this is enabled by the init task and not at random later. It doesn't mean the module has to be in use at all time, but once you insmod your accounting then it's on the system. Hence the cpusets comparison makes sense IMHO. >> Huh? Beware that semundo is special and requires a semaphore to be >> installed if clone() is called with a specific flag. If the semundo >> patch doesn't free the notifier block in the exit path then I introduced >> a bug. It's the responsibility of the TN_EXIT handler to clean up. > > As i pointed above it is not advisable to unregister from the callback, > and that is the reason i am pointing this as a problem, not specifically > on how the semundo patch is written. Please elaborate, you're in task context so you have control of it. What is wrong calling the unregister function like this? >>> 6 wastage of memory (notifier head and multiple notifier blocks for >>> every task in the system). >> Sorry this argument is bogus. 
>> You're wasting a notifier head, but
>
> not just notifier head, you need notifier blocks associated with each
> user of the task notifier mechanism.
>
> It is certainly not bogus, if the user of the task notify mechanism is
> interested in _all_ tasks, then they have to allocate a notifier block
> for _every_ task in the system. Process event connector is an example
> that would be interested in _all_ tasks.

For things that want to know about every task in the system it doesn't
make sense to register it with a task notifier. It should rather be put
in with a fixed callout in the path, or there should be two levels of
notifier chains, global notifiers and task notifiers.

>> compared to what else is in the task struct this is nothing, in fact
>> one could probably eliminate some elements in the task_struct and tag
>> them onto the notifier block so they are only allocated when really
>> needed. A notifier block is only allocated if something is using it.
>
> According to your description above, you expect more people to start
> using task notifier, which in turn means a notifier block for _each_
> task for every such usage of the task notifier mechanism.

Yes, however that's adding 24 bytes per task on a 64-bit platform or 12
bytes per task on a 32-bit platform. Compared to the overhead of running
many fixed callouts and the number of other data structures in the task
struct which are rarely used, this really isn't worth counting.

>>> I understand there is "one function call overhead" at fork() for you
>>> and the MVFS folks, but having a global notifier block will be a lot
>>> simpler.
>> Having a global notifier block will be a lot costlier because it will
>> require a global lock.
>
> My point is that we are not getting away from using a global lock in
> the per task notifier_head case also.

I still haven't seen any arguments for why we need it. Sure, a module
may not unload if it's still in use, but that's what it is for all
modules anyway.
>>> BTW, there is no locking involved during call_chain of atomic
>>> notifiers, so the overhead is minimal.
>> atomic notifiers make no sense in the fork path. An atomic notifier
>> implies that the service function cannot sleep, however that is not
>> the case during fork/exit etc. Either we go the full way with a global
>> lock or we do it lock free using raw notifiers as I suggest.
>
> By global lock you mean a semaphore, i presume.

Any lock that would be taken for each task at the global level.

>> this is not a problem. Besides, how many cases do we have where a module
>
> may not be many, but we don't want to be oopsing when they do, right?
> As long as the kernel supports unloading a module, we should consider
> that as a valid usage.

It's not unsupported, just with constraints. Task notifiers are not
meant to be a kitchen sink; they should be used carefully like many
other APIs within the kernel.

>> is unloaded at runtime? A classical case for this is for things like
>> accounting and it's not a module that will be loaded and unloaded at
>> random.
>
> I will not conclude on that. We have heard in this channel that MVFS
> will use this feature. I am sure that many more will follow.

Pardon my ignorance, but what is MVFS? What does it do and how does it
rely on this feature?

> They may not be doing at random, but they may do it, and IMO, we should
> work correctly in those cases also.

The philosophy behind the task notifiers is that they are to be simple
and efficient. If we start catering for random corner cases in the hot
path it's going to be very expensive. Remember, this is Linux, not AIX
or IRIX :)))

>>> Individual module owners cannot protect the notifier chain when it is
>>> walked through during call_chain.
>> Nobody should need to walk the notifier chain outside the specified
>> context.
>
> They won't be walking the list, the notifier_call_chain would be
> walking, and it needs protection.
Since the notifier_call_chain will only be called from within task
context, it doesn't need any locking as long as we only allow walking
the chain from within task context.

>>> If any module wants to block, then we may have to provide a blocking
>>> task notify call chain also.
>> The task notification chain is allowed to sleep. We already use
>> GFP_KERNEL all over the place there.
>
> We can just go with a blocking semaphore.

And add another blocking semaphore to the hot path? I don't think that's
going to do anything good for scalability.

Regards,
Jes |
From: Chandra S. <sek...@us...> - 2006-05-01 21:33:55
|
On Mon, 2006-05-01 at 12:07 +0200, Jes Sorensen wrote: > Chandra Seetharaman wrote: > > On Fri, 2006-04-28 at 15:29 +0200, Jes Sorensen wrote: > > Do you mean to say that the module will not be allowed to unload _until_ > > there is _a single_ task that the module has attached a notifier block > > with ? That doesn't sound appealing > > > > Consider a filesystem example: > > - Administrator insmods a filesystem > > - administrator mounts a filesystem > > - random user chandra gets to the filsystem > > - chandra creates a task p1 > > - p1 does something which makes filesystem module interested in p1 > > - filesystem module attaches a notifier block to p1 > > - user chandra moves off the filsystem, but p1 remains alive. > > - administrator unmounts the filesystem, succeeds as no user in the fs > > - administrator tries to rmmod the filesystem, and it fails due to the > > above logic. How can the administrator fix the problem ? > > Unregistering doesn't fail, it just becomes delayed. How likely is it yes, delayed _until_ the task dies, which is an unknown. > that someone will unregister a file system like that while the task that > accessed it is still live? The cost of adding locking is far worse than > the above IMHO fairly artificial case. even if you consider this as an artificial case, don't you think the kernel code should work properly ? > > > Also, i do not think it is advisable to call unregister from the > > notifier call. > > There is nothing wrong with that. You're in the task context at the > time, nothing else will conflict. 
Here is the unregister code (as is in 2.6.17-rc3):

1: static int __kprobes notifier_call_chain(struct notifier_block **nl,
2:                 unsigned long val, void *v)
3: {
4:         int ret = NOTIFY_DONE;
5:         struct notifier_block *nb;
6:
7:         nb = rcu_dereference(*nl);
8:         while (nb) {
9:                 ret = nb->notifier_call(nb, val, v);
a:                 if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
b:                         break;
c:                 nb = rcu_dereference(nb->next);
d:         }
e:         return ret;
f: }

Line 7 gets nb, and line c resets nb with nb->next; if nb is freed in
notifier_call on line 9, how can we be sure that nb->next in line c is
sane? Hence, IMO you cannot free the notifier block from the callback
itself.

> > even though the notifier block is per task, the notifier_head in the
> > task is global; in other words, there can be many users (of the task
> > notify mechanism) that can be doing one of the following at the _same
> > time_ with a single notifier head:
> > - register a new notifier block
> > - unregister a notifier block
> > - walk through the call chain in notifier_call_chain
>
> Nobody will walk the notifier chain outside of task context, anything
> doing that is broken. In addition, since the registration is done by
> the task itself this isn't a problem and unregistration is done at
> exit time.

So your model is based on all of the above operations (register,
unregister and call_chain) happening in the 'said' task's context,
right? I have a module that is interested in a few tasks in the system.
When my module is insmoded, how do i get into each process's context to
register my notifier_block?

> We could use the rcu approach if it was really deemed necessary to be
> able to add tasks from elsewhere.
>
> > Unless we have some kind of protection, how can we guarantee that
> > they will not be running over each other, walking off the list,
> > calling the notifier call of a removed module etc.?
>
> The walking the list argument doesn't make sense.
The only way to do > that is by adding heavy locking which we don't need by applying the > simple constraints. IMO, the constraints are not simple. > > >> Again why do you need a lock? Registration is inherited so something > >> from userland creates it first and all children inherits it. Look at how > >> this is done with cpuset for instance. > > > > cpuset is not a valid comparison here. cpuset is the _only_ user of the > > data structures it plays with, and it knows how to protect it. Whereas > > task notify mechanism will be used by many users and hence a common > > protection mechanism is needed. > > It's totally valid for the mechanism. If you wish to apply accounting or > resource control to the system, then it's expectable that this is > enabled by the init task and not at random later. It doesn't mean the > module has to be in use at all time, but once you insmod your accounting > then it's on the system. Hence the cpusets comparison makes sense IMHO. kernel does not impose "once you insmod your accounting then it's on the system". So, we _should_ not have that as an assumption. Task notify should be able to handle insmod and rmmod of a module cleanly. > > >> Huh? Beware that semundo is special and requires a semaphore to be > >> installed if clone() is called with a specific flag. If the semundo > >> patch doesn't free the notifier block in the exit path then I introduced > >> a bug. It's the responsibility of the TN_EXIT handler to clean up. > > > > As i pointed above it is not advisable to unregister from the callback, > > and that is the reason i am pointing this as a problem, not specifically > > on how the semundo patch is written. > > Please elaborate, you're in task context so you have control of it. What > is wrong calling the unregister function like this? > > >>> 6 wastage of memory (notifier head and multiple notifier blocks for > >>> every task in the system). > >> Sorry this argument is bogus. 
You're wasting a notifier head, but > > > > not just notifier head, you need notifier blocks associated with each > > user of the task notifier mechanism. > > > > It is certainly not bogus, if the user of the task notify mechanism is > > interested in _all_ tasks, then they have to allocate a notifier block > > for _every_ task in the system. Process event connector is an example > > that would be interested in _all_ tasks. > > For things that wants to know about every task in the system it doesn't > make sense to register it with a task notifier. It should rather be put > in with a fixed callout in the path or there should be two levels of > notifier chains, global notifiers and task notifiers. So, you are suggesting that there should be two kind of notifiers, one for global notification and one for per task ? Still need to resolve the register/unregister/locking above for task notify. > > >> compared to what else is in the task struct this is nothing, in fact > >> one could probably eliminate some elements in the task_struct and tag > >> them onto notifier block so they are only allocated when really needed. > >> A notifier block is only allocated if something is using it. > > > > According to your description above, you expect more people to start > > using task notifier, which in turn mean a notifier block for _each_ task > > for every such usage of task notifier mechanism. > > Yes, however thats adding 24 bytes per task on a 64 bit platform or 12 > bytes per task on a 32 bit platform. Compared to the overhead of running > many fixed callouts and the number of other data structures in the task > struct which are rarely used, this really isn't worth counting. > > >>> I understand there is "one function call overhead" at fork() for you and > >>> the MVFS folks, but having a global notifier block will be a lot > >>> simpler. > >> Having a global notifier block will be a lot costlier because it will > >> require a global lock. 
> > > > My point is that we are not getting away from using a global lock in the > > per task notifier_head case also. > > I still haven't seen any arguments for why we need it. Sure a module > may not unload if it's still in use, but thats what it is for all > modules anyway. here the module is _not_ being used, the user stopped using it but he still can't unload the module. > > >>> BTW, there is no locking involved during call_chain of atomic notifiers, > >>> so the overhead is minimal. > >> atomic notifiers make no sense in the fork path. An atomic notifier > >> implies that the service function cannot sleep, however that is not the > >> case during fork/exit etc. Either we go the full way with a global lock > >> or we do it lock free using raw notifiers as I suggest. > > > > By global lock you mean a semaphore, i presume. > > Any lock that would be taken for each task at the global level. > > >> this is not a problem. Besides, how many cases do we have where a module > > > > may not be many, but we don't want to be oopsing when they do, right ? > > As long as the kernel supports unloading a module, we should consider > > that as a valid usage. > > It's not unsupported, just with constraints. task notifiers are not > meant to be a kitchen sink, they should be used carefully like many > other apis within the kernel. > > >> is unloaded at runtime? A classical case for this is for things like > >> accounting and it's not a module that will be loaded and unloaded at > >> random. > > > > I will not conclude on that. We have heard in this channel that MVFS > > will use this feature. I am sure that many more will follow. > > Pardon my ignorance, but what is MVFS? What does it do and how does it > rely on this feature. In an earlier thread on this same topic, John Kohl mentioned about it. Here is the link http://sourceforge.net/mailarchive/message.php? msg_id=13827470 John can provide more info. 
> > > They may not be doing at random, but they may do it, and IMO, we should > > work correctly in those cases also. > > The philosophy behind the task notifiers is that they are to be simple > and efficient. If we start catering for random corner cases in the hot > path it's going to be very expensive. Remember, this is Linux, not AIX > or IRIX :))) If it is AIX or IRIX you and i won't be talking about it here, will we :) > > >>> Individual module owners cannot protect the notifier chain when it is > >>> walked through during call_chain. > >> Nobody should need to walk the notifier chain outside the specified > >> context. > > > > They won't be walking the list, the notifier_call_chain would be > > walking, and it needs protection. > > Since the notifier_call_chain will only be called from within task > context then it doesn't need any locking as long as we only allow > walking the chain from within task context. > > >>> If any module wants to block, then we may have to provide a blocking > >>> task notify call chains also. > >> The task notification chain is allowed to sleep. We already use > >> GFP_KERNEL all over the place there. > > > > We can just go with a blocking semaphore. > > And add another blocking semaphore to the hot path, don't think thats > going to do anything good for scalability. > Regards, > Jes > > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > Lse-tech mailing list > Lse...@li... > https://lists.sourceforge.net/lists/listinfo/lse-tech -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sek...@us... 
| .......you may get it. ---------------------------------------------------------------------- |
From: Matt H. <mat...@us...> - 2006-05-01 23:06:33
|
On Mon, 2006-05-01 at 14:33 -0700, Chandra Seetharaman wrote: > On Mon, 2006-05-01 at 12:07 +0200, Jes Sorensen wrote: > > Chandra Seetharaman wrote: > > > On Fri, 2006-04-28 at 15:29 +0200, Jes Sorensen wrote: > > > Do you mean to say that the module will not be allowed to unload _until_ > > > there is _a single_ task that the module has attached a notifier block > > > with ? That doesn't sound appealing > > > > > > Consider a filesystem example: > > > - Administrator insmods a filesystem > > > - administrator mounts a filesystem > > > - random user chandra gets to the filsystem > > > - chandra creates a task p1 > > > - p1 does something which makes filesystem module interested in p1 > > > - filesystem module attaches a notifier block to p1 > > > - user chandra moves off the filsystem, but p1 remains alive. > > > - administrator unmounts the filesystem, succeeds as no user in the fs > > > - administrator tries to rmmod the filesystem, and it fails due to the > > > above logic. How can the administrator fix the problem ? > > > > Unregistering doesn't fail, it just becomes delayed. How likely is it > > yes, delayed _until_ the task dies, which is an unknown. > > that someone will unregister a file system like that while the task that > > accessed it is still live? The cost of adding locking is far worse than > > the above IMHO fairly artificial case. > > even if you consider this as an artificial case, don't you think the > kernel code should work properly ? > > > > > > Also, i do not think it is advisable to call unregister from the > > > notifier call. > > > > There is nothing wrong with that. You're in the task context at the > > time, nothing else will conflict. 
> > Here is the unregister code (as is in 2.6.17-rc3): > > 1: static int __kprobes notifier_call_chain(struct notifier_block **nl, > 2: unsigned long val, void *v) > 3: { > 4: int ret = NOTIFY_DONE; > 5: struct notifier_block *nb; > 6: > 7: nb = rcu_dereference(*nl); > 8: while (nb) { > 9: ret = nb->notifier_call(nb, val, v); > a: if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK) > b: break; > c: nb = rcu_dereference(nb->next); > d: } > e: return ret; > f: } > > line 7 gets nb, line c resets nb with nb->next, if nb is freed in > notifier_call on line 9, how can we be sure that nb->next in line c is > sane ? > > Hence, IMO you cannot free the notifier block from the callback itself. > > > > > > even though the notifier block is per task, notifier_head in the task is > > > global, in other words, there can be many users (of the task notify > > > mechanism) that can be doing one of the following at the _same time_ > > > with a single notifier head: > > > - register a new notifier block > > > - unregister a notifier block > > > - walk through the call chain in notifier_call_chain > > > > Nobody will walk the notifier chain outside of task context, anything > > doing that is broken. In addition, since the registration is done by > > the task itself this isn't a problem and unregistration is done at > > exit time. > > So your model is based on all of the above operations (register, > unregister and call_chain) happening in the 'said' task's context. > right ? > > I have a module that is interested in few tasks in the system. When my > module is insmoded, how do i get into each process's context to register > my notifier_block ? I think is an excellent point. As far as process events, I'd like to make it into a module. There's little point in making it a module if it can't be inserted at any time but instead must be inserted while only init is active. > > We could use the rcu approach if it was really deemed necessary to be > > able to add tasks from elsewhere. 
> > > > > Unless we have some kind of protection, how can we guarantee that they > > > will not be running over each other, walking off the list, calling > > > notifier call of a removed module etc., ? > > > > The walking the list argument doesn't make sense. The only way to do > > that is by adding heavy locking which we don't need by applying the > > simple constraints. > > IMO, the constraints are not simple. Well, I think they are simple but too restrictive, as I think your point re: module insertion shows. <snip> > > >> compared to what else is in the task struct this is nothing, in fact > > >> one could probably eliminate some elements in the task_struct and tag > > >> them onto notifier block so they are only allocated when really needed. > > >> A notifier block is only allocated if something is using it. > > > > > > According to your description above, you expect more people to start > > > using task notifier, which in turn mean a notifier block for _each_ task > > > for every such usage of task notifier mechanism. > > > > Yes, however thats adding 24 bytes per task on a 64 bit platform or 12 > > bytes per task on a 32 bit platform. Compared to the overhead of running > > many fixed callouts and the number of other data structures in the task > > struct which are rarely used, this really isn't worth counting. I would disagree with that. 24 bytes per task per module interested in the task. If you have 2000 tasks and 10 modules interested in every task that's 469K of memory for the notifier blocks system compared to 0.25K of memory for the global notifier blocks. Whereas the latter can fit in cache memory at many levels, the former cannot. Also, I think doing the 0.25K on a per-cpu basis is also quite reasonable if scalability turns out to be a problem with a global notifier chain. Cheers, -Matt Helsley |
From: Matt H. <mat...@us...> - 2006-05-02 04:54:28
|
On Mon, 2006-05-01 at 12:07 +0200, Jes Sorensen wrote: > Chandra Seetharaman wrote: <snip> > > It is certainly not bogus, if the user of the task notify mechanism is > > interested in _all_ tasks, then they have to allocate a notifier block > > for _every_ task in the system. Process event connector is an example > > that would be interested in _all_ tasks. > > For things that wants to know about every task in the system it doesn't > make sense to register it with a task notifier. It should rather be put > in with a fixed callout in the path or there should be two levels of > notifier chains, global notifiers and task notifiers.

Yes, I agree. The question is only which of those approaches you've outlined above is "best". Here's a summary of the alternatives discussed in threads so far, as I see it:

0) Exclusive:
   i.  global notifiers and an undefined per-task association mechanism
   ii. per-task notifiers and duplication of global notifiers for each task

The three of us have all been arguing about 0 and you've just proposed 1 and 2 below. I've elaborated on 1 and 2 below:

1) Fixed, separate callouts
   i.  with the same event enums
       a. in the same places (little code sharing)
       b. in different places (less code sharing)
   ii. with different event enums
       a. in the same places (no code sharing)
       b. in different places (no code sharing)

(Seems to me that 1.i.a is the place where we'd have some common ground yet be able to satisfy our own needs semi-independently. 1.ii is not worth considering right now since we haven't discussed 2.*. ...)

2) 2-level dispatch ("overhead" is relative to code without task notify):

Given T tasks, M_a modules interested in all tasks and an average of M_p modules interested per-task (0 < M_p <= M_a):

   i. global notifier block calls per-task chain
      space overhead: O(M_a + 1 + T*M_p)
      space overhead due to 2nd level mechanism:
          if only per-task users (M_a = 0): O(1)
          if only global users (M_p = 0): O(1)
      avg time overhead per notification: O(M_a + 1 + M_p)
      avg time overhead per notification due to 2nd level:
          if only per-task users (M_a = 0): O(1)
          if only global users (M_p = 0): O(1)
      total notification time overhead: O(T*M_a + T + T*M_p)
      NOTES: One notification block to trigger 2nd level.

   ii. per-task chain notifier block calls global notifier chain
      space overhead: O(M_a + T*(M_p + 1))
      space overhead due to 2nd level mechanism:
          if only per-task users (M_a = 0): O(T)
          if only global users (M_p = 0): O(T)
      avg time overhead per notification: O(M_a + 1 + M_p)
      avg time overhead per notification due to 2nd level:
          if only per-task users (M_a = 0): O(1)
          if only global users (M_p = 0): O(1)
      total notification time overhead: O(T*M_a + T + T*M_p)
      NOTES: T notifier blocks to trigger 2nd level.

Since both 2.i and 2.ii have both chains, the following NOTES seem to apply to both:
   - Notification happens entirely within given task's execution context (for both chains).
   - Proposed rwsem to protect global chain (necessary?).
   - Global chain [un]registration can happen any time (no rwsem) or any time sleep is allowed (rwsem needed).
   - Concerns over rwsem contention impact on CPU scalability (not backed by measurements).
   - Per-cpu replication of global chain could eliminate contention in notification path. Lock only for [un]register. Is this feasible?
   - Per-task [un]registration takes place entirely within given task's execution context.
   - Unregistration within a handler may be problematic: how does unregistration within a handler func interfere with simultaneous chain traversal? Since per-task block [un]registration times are more restricted than global block [un]registration times, being unable to [un]register from within a handler would make [un]registration for per-task blocks more complicated than the current patches indicate.

2.i and 2.ii share the problem that they have some overhead for the other users of task notification. For instance, 2.i has some overhead due to per-task chains: exactly 1 notifier block + T chain heads, and, depending on implementation, concerns over contention on a semaphore. Similarly, 2.ii has some overhead due to the need to keep T notifier blocks around that trigger the global chain, 1 chain head, and, depending on implementation, concerns over contention on the same semaphore.

I think 2.i is the most reasonable. Scalability concerns should be measured to ensure we're not prematurely optimizing/complicating and, if measurements indicate the necessity, addressed as outlined above. Lastly, 2.i has better overhead characteristics. There are no differences in notification time overhead, while 2.ii has a significant difference in space overhead. If only one or the other kind of chain is in use 2.i uses less space, whereas 2.ii always uses more space. As I pointed out earlier, this difference in space could be the difference between fitting in cache (2.i) or not (2.ii) -- a definite performance impact, and one that's likely more important to those running small to medium numbers of processors.

<snip>

> >>> If any module wants to block, then we may have to provide a blocking > >>> task notify call chains also. > >> The task notification chain is allowed to sleep. We already use > >> GFP_KERNEL all over the place there. > > > > We can just go with a blocking semaphore. > > And add another blocking semaphore to the hot path, don't think thats > going to do anything good for scalability.
> > Regards, > Jes If it's an rwsem and you don't allow unregistration within the handler it would rarely, if ever, block on that semaphore because notification could be protected by the "read" side and register/unregister could be protected by the "write" side. Cheers, -Matt Helsley |
From: Jes S. <je...@sg...> - 2006-05-02 09:45:20
|
Matt Helsley wrote: > On Mon, 2006-05-01 at 12:07 +0200, Jes Sorensen wrote: > I think 2.i is the most reasonable. Scalability concerns should be > measured to ensure we're not prematurely optimizing/complicating and, if > measurements indicate the necessity, addressed as outlined above. > Lastly, 2.i has better overhead characteristics. There are no > differences in notification time overhead, while 2.ii has a significant > difference in space overhead. If only one or the other kind of chain is > in use 2.i uses less space, whereas 2.ii always uses more space. As I > pointed out earlier, this difference in space could be the difference > between fitting in cache (2.i) or not (2.ii) -- a definite performance > impact, and one that's likely more important to those running small to > medium numbers of processors.

Obviously 2.i has better overhead for your specific case because you are so focused on memory consumption. However the real case is that you are likely to have maybe 2-3 notifiers per task, since what the per-task notifier chain solves is also getting rid of a ton of rarely used per-task callouts that are currently hard coded.

If we are looking at performance, the cost of bouncing additional semaphores around is much more interesting than using an extra 50KB of memory on a real system. In addition the extra memory one needs to spend will be lost on hash tables tracking which notification group a task belongs to when called from the global callout mechanism. So when it comes to that, the memory consumption is going to be the same or possibly worse using the global mechanism.

So using that argument, 2.i is clearly not the most reasonable solution. Or in other words, I think the real solution here is to provide both, as it will lead to the lowest overhead and minimum memory consumption, since we won't need to apply shoehorn measures to get from one method to the other.

Anyway, smiley intended.
It's just proof that we are trying to solve two different problems.

> If it's an rwsem and you don't allow unregistration within the handler > it would rarely, if ever, block on that semaphore because notification > could be protected by the "read" side and register/unregister could be > protected by the "write" side.

An rwsem is still going to result in nasty atomic bus operations. For per-task notification it isn't going to be too heavy, but for a global notification chain this means you are going to have a global semaphore taken on each schedule call. That may not be too noticeable on a 2-way box, but try it on a 16-way, or worse a 512-way, and you'll see some very unpleasant effects.

In short, if we add a global callout chain, then I think it's even more important than for the per-task chain that the chain is RCU'd rather than uses a semaphore. The latter will just nuke the system.

Cheers, Jes |
From: Matt H. <mat...@us...> - 2006-05-02 22:56:26
|
On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote: > Matt Helsley wrote: > > On Mon, 2006-05-01 at 12:07 +0200, Jes Sorensen wrote: > > I think 2.i is the most reasonable. Scalability concerns should be > > measured to ensure we're not prematurely optimizing/complicating and, if > > measurements indicate the necessity, addressed as outlined above. > > Lastly, 2.i has better overhead characteristics. There are no > > differences in notification time overhead, while 2.ii has a significant > > difference in space overhead. If only one or the other kind of chain is > > in use 2.i uses less space, whereas 2.ii always uses more space. As I > > pointed out earlier, this difference in space could be the difference > > between fitting in cache (2.i) or not (2.ii) -- a definite performance > > impact, and one that's likely more important to those running small to > > medium numbers of processors. > > Obviously 2.i has better overhead for your specific case because you are

Not just my specific case -- yours too. See the cases where my email talks about the overhead when there are no global chain users?

> so focused on memory consumption. However the real case is that you are > likely to have maybe 2-3 notifiers per task, since what the per-task > notifier chain solves is also getting rid of a ton of rarely used per-task > callouts that are currently hard coded.

I wonder if you're over-exaggerating what it can get rid of. They seem useful, but "a ton"..

> If we are looking at performance the cost of bouncing additional > semaphores around is much more interesting than using an extra 50KB of > memory on a real system. In addition the extra memory one needs to spend > will be lost on hash tables tracking which notification group a task > belongs to when called from the global callout mechanism. So when it > comes to that, the memory consumption is going to be the same or > possibly worse using the global mechanism.
The 2-level mechanisms do not get rid of per-task chains or global chains -- so this argument does not apply since you won't need a hash table or other mechanism. Each of the 2-level mechanisms I referred to use one chain to trigger notification on the other. Which chain triggers the second chain is the difference between 2.i and 2.ii and what I was trying to analyze in my previous email. > So using that argument, 2.i is clearly not the most reasonable solution. You seem to assume that I've dismissed per-task chains when I have not. Perhaps my email (http://marc.theaimsgroup.com/?l=lse-tech&m=114654571406352&w=2) was unclear -- I think both chains are needed. Points 1 and 2 which I outlined in that email discuss a range of ways in which the two chains can work together. > Or in other words, I think the real solution here is to provide both Yeah, this is what I was thinking about. I was trying to determine if there was a difference (in memory and/or time) between the global triggering notification on the per-task chain and vice-versa. That's what this 2.i and 2.ii stuff was all about. > as it will lead to the lowest overhead and minimum memory consumption > for real since won't need to apply shoehorn meassures to get from one > method to the other. Yup. > Anyway, simley intended. It's just proof that we are trying to solve two > different problems. :) Sure, but they aren't completely different problems. We should find a way to share some of the common bits. > > If it's an rwsem and you don't allow unregistration within the handler > > it would rarely, if ever, block on that semaphore because notification > > could be protected by the "read" side and register/unregister could be > > protected by the "write" side. > > An rwsem is still going to result in nasty atomic bus operations. For > per-task notification it isn't going to be too heavy, but for a global > notification chain this means you are going to have a global semaphore > taken on each schedule call. 
That may not be too noticable on a 2-way > box, but try it on a 16-way, or worse 512-way and you'll see some very > unpleasant effects. I'm going to hopefully save us some time and refer to the same argument we had in January: http://lkml.org/lkml/2006/1/14/31 I distinctly recall that you used a boot-time anecdote -- which is a distinctly different set of paths than we're talking about here -- to justify your argument. I'm saying we should go with the simpler solution until you can empirically show that a more complex solution is needed. Perhaps you could test this patch or something like it on your 512-way to show what kind of impact it has: http://lkml.org/lkml/2006/1/12/379 Then if you find there are problems we can choose a more complex solution for the global notifier chain. > In short, if we add a global callout chain, then I think it's even more > important than for the per-task chain that the chain is RCU'd rather > than uses a semaphore. The latter will just nuke the system. OK, but I believe RCU will still require a lock to modify the list on [un]register -- just like per-cpu chains. Cheers, -Matt Helsley |
From: Paul E. M. <pa...@us...> - 2006-05-02 23:38:17
|
On Tue, May 02, 2006 at 03:49:54PM -0700, Matt Helsley wrote: > On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote: > > In short, if we add a global callout chain, then I think it's even more > > important than for the per-task chain that the chain is RCU'd rather > > than uses a semaphore. The latter will just nuke the system. > > OK, but I believe RCU will still require a lock to modify the list on > [un]register -- just like per-cpu chains. RCU does indeed save only the read-side locks. Update-side mutual exclusion of some sort or another, be it locking, single designated updater task, non-blocking synchronization, or whatever, is still required. If I remember correctly, the issue with RCU is that some notifier invocations want to sleep. Is that still the major issue? Thanx, Paul |
From: Chandra S. <sek...@us...> - 2006-05-03 00:00:22
|
On Tue, 2006-05-02 at 16:38 -0700, Paul E. McKenney wrote: > On Tue, May 02, 2006 at 03:49:54PM -0700, Matt Helsley wrote: > > On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote: > > > In short, if we add a global callout chain, then I think it's even more > > > important than for the per-task chain that the chain is RCU'd rather > > > than uses a semaphore. The latter will just nuke the system. > > > > OK, but I believe RCU will still require a lock to modify the list on > > [un]register -- just like per-cpu chains. > > RCU does indeed save only the read-side locks. Update-side mutual > exclusion of some sort or another, be it locking, single designated > updater task, non-blocking synchronization, or whatever, is still > required. > > If I remember correctly, the issue with RCU is that some notifier > invocations want to sleep. Is that still the major issue? Yes Paul, It is still an issue :(. Some of them do kmalloc with GFP_KERNEL. > > Thanx, Paul -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sek...@us... | .......you may get it. ---------------------------------------------------------------------- |
From: Matt H. <mat...@us...> - 2006-05-03 00:13:19
|
On Tue, 2006-05-02 at 16:38 -0700, Paul E. McKenney wrote: > On Tue, May 02, 2006 at 03:49:54PM -0700, Matt Helsley wrote: > > On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote: > > > In short, if we add a global callout chain, then I think it's even more > > > important than for the per-task chain that the chain is RCU'd rather > > > than uses a semaphore. The latter will just nuke the system. > > > > OK, but I believe RCU will still require a lock to modify the list on > > [un]register -- just like per-cpu chains. > > RCU does indeed save only the read-side locks. Update-side mutual > exclusion of some sort or another, be it locking, single designated > updater task, non-blocking synchronization, or whatever, is still > required. > > If I remember correctly, the issue with RCU is that some notifier > invocations want to sleep. Is that still the major issue? > > Thanx, Paul Hi Paul, Can we use a list_for_each_continue_rcu() for this? Roughly: rcu_read_lock(); pos = head; list_for_each_continue_rcu(pos, head) { rcu_read_unlock(); <call handler> rcu_read_lock(); } rcu_read_unlock(); Or was this not the way it was intended to be used? Is there a reason there are no "safe" and "entry" rcu mutations of the above? Thanks, -Matt Helsley |
From: Paul E. M. <pa...@us...> - 2006-05-03 14:48:28
|
On Tue, May 02, 2006 at 05:05:57PM -0700, Matt Helsley wrote: > On Tue, 2006-05-02 at 16:38 -0700, Paul E. McKenney wrote: > > On Tue, May 02, 2006 at 03:49:54PM -0700, Matt Helsley wrote: > > > On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote: > > > > In short, if we add a global callout chain, then I think it's even more > > > > important than for the per-task chain that the chain is RCU'd rather > > > > than uses a semaphore. The latter will just nuke the system. > > > > > > OK, but I believe RCU will still require a lock to modify the list on > > > [un]register -- just like per-cpu chains. > > > > RCU does indeed save only the read-side locks. Update-side mutual > > exclusion of some sort or another, be it locking, single designated > > updater task, non-blocking synchronization, or whatever, is still > > required. > > > > If I remember correctly, the issue with RCU is that some notifier > > invocations want to sleep. Is that still the major issue? > > > > Thanx, Paul > > Hi Paul, > > Can we use a list_for_each_continue_rcu() for this? Roughly: > > Or was this not the way it was intended to be used?

Almost. The code would need to look as follows:

	rcu_read_lock();
	pos = head;
	list_for_each_continue_rcu(pos, head) {
		<do something to nail down pos>
		rcu_read_unlock();
		<call handler>
		rcu_read_lock();
		<do something to unnail pos>
	}
	rcu_read_unlock();

The "nailing" and "unnailing" could be reference counts, which would be almost as heavyweight as locks. However, there -are- lightweight reference counts, for example, splitting the reference count into per-CPU counters. Of course, taking this approach on a per-element basis consumes more memory. So yet another twist is to maintain a "bulk" reference count on all the elements in the list as a group. This can work well, but for a couple possible issues:

1. You might still need memory barriers. You should be able to get around this by RCU-freeing any elements after removal -- which you would be doing anyway.

2. If the lists are being scanned extremely intensively, there will be no time at which any element may be removed. One way to fix this is to maintain a pair of reference counts, one of which is incremented for new traversals, the other of which is only being decremented. Once the "other" counter decreases to zero, the roles of the two counters may be switched. Freeing then happens as follows:

   a. Remove an element from the list.
   b. Wait for an RCU grace period. This ensures that any CPUs that were in the process of gaining a reference to the just-removed element have now done so -- and thus that any future list traversals will fail to gain such a reference.
   c. Wait for the roles of the counters to be switched (but if the "other" counter is zero, you can simply switch them immediately -- under the protection of some appropriate lock/sema, of course!).
   d. Wait for the "other" counter to reach zero.
   e. Wait for another grace period, then free the element.

This second wait for a grace period removes the need for memory barriers near the reference count increments and decrements.

> Is there a reason there are no "safe" and "entry" rcu mutations of the > above?

Well, I am actually (slowly) getting rid of the non-"entry" RCU list-traversal primitives -- there does not seem to be much justification for the non-"entry" forms other than history, and they are harder to use. The reason the full set is not present is that we are adding them only as they are actually needed.

Thanx, Paul |
From: Jes S. <je...@sg...> - 2006-05-04 13:21:25
|
Matt Helsley wrote:
> On Tue, 2006-05-02 at 11:45 +0200, Jes Sorensen wrote:
>> so focused on memory consumption. However the real case is that you are
>> likely to have maybe 2-3 notifiers per task, since what the per task
>> notifier chain solves is also it gets rid of a ton of rarely used per
>> task callouts that are currently hard coded.
>
> I wonder if you're over-exaggerating what it can get rid of. They seem
> useful, but "a ton"..

Why write it with a pen if you can have it on a neon sign?

Right now we have a lot of things in the fork path which are there to
support what I'd qualify as uninteresting for the majority of users,
such as security_task_alloc, audit_alloc, copy_keys ... unfortunately
the distros have become obsessed with the selinux/apparmor stuff so
everybody is being force-fed it now :(

Anyway, probably better to discuss over beer :)

> The 2-level mechanisms do not get rid of per-task chains or global
> chains -- so this argument does not apply since you won't need a hash
> table or other mechanism. Each of the 2-level mechanisms I referred to
> uses one chain to trigger notification on the other.
>
> Which chain triggers the second chain is the difference between 2.i
> and 2.ii and what I was trying to analyze in my previous email.

Ok, but it seems that if we go that way we end up with more things to
evaluate at runtime ;(

>> Or in other words, I think the real solution here is to provide both
>
> Yeah, this is what I was thinking about. I was trying to determine if
> there was a difference (in memory and/or time) between the global
> triggering notification on the per-task chain and vice-versa. That's
> what this 2.i and 2.ii stuff was all about.

Sounds like I got lost in the numbers there :)

>> Anyway, smiley intended. It's just proof that we are trying to solve
>> two different problems.
>
> :) Sure, but they aren't completely different problems. We should find
> a way to share some of the common bits.

Agreed

> I distinctly recall that you used a boot-time anecdote -- which is a
> distinctly different set of paths than we're talking about here -- to
> justify your argument.

The boot time case was just an example of what happens when a lock is
too contested. The problem in that case wasn't specific to booting;
even if one managed to boot the machine, the cost of bouncing cache
lines around can be quite high if one does it too much.

> I'm saying we should go with the simpler solution until you can
> empirically show that a more complex solution is needed. Perhaps you
> could test this patch or something like it on your 512-way to show
> what kind of impact it has:
>
> http://lkml.org/lkml/2006/1/12/379
>
> Then if you find there are problems we can choose a more complex
> solution for the global notifier chain.

Note that what was requested recently was a blocking semaphore, not a
spin lock. In addition it is going to depend on the actual load; for
the case where one launches NR_CPUS jobs and they all run for hours,
it's not going to cost a lot.

The point I am after is that we know locks are expensive, so we should
stop adding more unless we absolutely have to.

Another example I can mention, which wasn't an actual lock, just a hot
cacheline, was back when I wrote the acenic GigE driver. Going from a
global SLAB pointer to per-CPU gave a 10% net performance increase on a
2 CPU 400MHz P2 box. You don't need a 512 CPU machine to see the
effects of this; it's all about the conditions.

>> In short, if we add a global callout chain, then I think it's even
>> more important than for the per-task chain that the chain is RCU'd
>> rather than uses a semaphore. The latter will just nuke the system.
>
> OK, but I believe RCU will still require a lock to modify the list on
> [un]register -- just like per-cpu chains.

Sure, I make the assumption that register/unregister is going to be far
less common than traversing the list.

Cheers,
Jes
|